
Instruction-finetuning Llama 2 with PEFT's QLoRA method




Introduction

The process of fine-tuning (or instruction tuning) Large Language Models (LLMs) is becoming the standard in the LLMOps workflow. This trend is driven by various factors: the potential for cost savings, the ability to process confidential data, and even the possibility of exceeding the performance of prominent models like ChatGPT and GPT-4 on certain tasks.


In this article, we will see why instruction tuning works and how to implement it in a Google Colab notebook to create your own Llama 2 model.




Project Overview

  • Finetuned Llama 2 (7B parameters) in 4-bit precision on a T4 GPU using Google Colab

  • Employed Supervised Fine-Tuning (via TRL's SFTTrainer) to finetune the model on the miniguanaco dataset of 1,000 instruction-response pairs

  • Utilized PEFT (parameter-efficient fine-tuning) techniques to save resources



Resources and Tools

Language: Python

Libraries: pytorch, transformers, peft, trl, accelerate, bitsandbytes

Platform: Google Colab Notebook (T4 GPU)


 

Table of Contents:


I. Background on fine-tuning LLMs

II. Fine-tune Llama 2

  a. Define parameters

  b. Load dataset and configurations

III. Upload the model to Hugging Face Hub

 

I. Background on fine-tuning LLMs



LLMs are pre-trained on an extensive corpus of text. For example, BERT was trained on the BookCorpus (800M words) and English Wikipedia (2,500M words). Pre-training is a very costly and lengthy process that frequently runs into hardware issues.


When the pre-training is complete, auto-regressive models like Llama 2 can predict the next token in a sequence of tokens. However, this does not make them particularly useful assistants since they don't reply to instructions. This is why we employ instruction tuning to align their answers with what humans expect.


Two main fine-tuning techniques are:

  • Supervised Fine-Tuning (SFT): Models are trained on a dataset of instructions and responses. The weights of the LLM are adjusted to minimize the difference between its generated answers and the ground-truth responses, which act as labels.

  • Reinforcement Learning from Human Feedback (RLHF): Models learn by interacting with their environment and receiving feedback. They are trained to maximize a reward signal, which is often derived from human evaluations of model outputs.

In this project, we will perform SFT, which requires a high-quality dataset. The prompt template is also crucial: in our case, we will reformat our instruction dataset to follow Llama 2's template below.

<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>

(Note that you don't need to follow a specific prompt template if you're using the base Llama 2 model instead of the chat version.)
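As an illustration, a single instruction-response pair could be wrapped in this template with a small helper like the one below. This is only a sketch: the function and the "instruction"/"response" field names are hypothetical, and the dataset used later in this article has already been formatted this way.

# Minimal sketch: wrap one instruction-response pair in Llama 2's chat template.
# The function name and the field names are hypothetical examples.
def format_sample(sample):
    system_prompt = "You are a helpful assistant."
    return {
        "text": (
            f"<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{sample['instruction']} [/INST] {sample['response']} </s>"
        )
    }

example = {
    "instruction": "What is a large language model?",
    "response": "A large language model is a neural network trained on large amounts of text...",
}
print(format_sample(example)["text"])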




II. Fine-tune Llama 2


We will fine-tune the Llama 2 model with 7B parameters on a T4 GPU using Google Colab. However, due to memory constraints, full fine-tuning is not possible; we need parameter-efficient fine-tuning (PEFT) techniques like LoRA or QLoRA.


To drastically reduce VRAM usage, we will fine-tune the model in 4-bit precision, which is why we'll use QLoRA here. We will also leverage the Hugging Face ecosystem with the transformers, accelerate, peft, trl, and bitsandbytes libraries. The following code is based on Younes Belkada's GitHub Gist.


First, let's install and import the libraries.
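The exact versions used in the notebook are not pinned here; the following is a typical set of installs and imports for this stack in a Colab cell:

# Install the required libraries (run in a Colab cell)
!pip install -q accelerate peft bitsandbytes transformers trl

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer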


Now we will load a llama-2-7b-chat-hf model (the chat version). As mentioned previously, the quality of the dataset is very important, so I will be using the mlabonne/guanaco-llama2-1k dataset, which contains 1,000 samples processed to match Llama 2's prompt template. It was sampled from the timdettmers/openassistant-guanaco dataset, which was used to train the Guanaco model with QLoRA. You can learn more about the process of formatting the dataset in this notebook. We will also give our fine-tuned model a name.
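In code, that boils down to naming the base model, the dataset, and the fine-tuned model. The repository ids below are examples: the NousResearch mirror of Llama 2 is one common way to skip the gated-access step, and the new model name is just a placeholder.

# Base model to fine-tune (Hugging Face Hub id; example mirror of the chat model)
model_name = "NousResearch/Llama-2-7b-chat-hf"

# Instruction dataset, already formatted for Llama 2's prompt template
dataset_name = "mlabonne/guanaco-llama2-1k"

# Name for the fine-tuned model (example name)
new_model = "llama-2-7b-miniguanaco"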



Define Parameters

Let's talk about some parameters we can tune here.

First, QLoRA will use a rank of 64 with a scaling parameter of 16 (more details on LoRA parameters here). We will load the Llama 2 model directly in 4-bit precision using the NF4 quantization type and train it for one epoch. More information about the other parameters can be found in their documentation: TrainingArguments, PeftModel, and SFTTrainer.
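As a sketch, those parameters could be defined as follows. Apart from the rank, scaling factor, quantization type, and epoch count mentioned above, the remaining values are reasonable defaults for a T4 rather than the notebook's exact settings.

# QLoRA parameters
lora_r = 64            # LoRA attention dimension (rank)
lora_alpha = 16        # scaling parameter
lora_dropout = 0.1     # dropout probability for LoRA layers

# bitsandbytes 4-bit quantization parameters
use_4bit = True
bnb_4bit_compute_dtype = "float16"   # compute dtype for the 4-bit base weights
bnb_4bit_quant_type = "nf4"          # NF4 quantization type
use_nested_quant = False             # no double quantization

# Training parameters (reasonable defaults, not the only valid choices)
output_dir = "./results"
num_train_epochs = 1
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
learning_rate = 2e-4
max_seq_length = None
packing = False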






Load Dataset and Configurations

We can now load everything and start the fine-tuning process. We will be taking the following steps:

  • Load the dataset. At this step we could reformat it to match the prompt template, filter out bad text, combine multiple datasets, etc., but our dataset has already been preprocessed.

  • Configure bitsandbytes for 4-bit quantization

  • Load the Llama 2 model in 4-bit quantization on the GPU, along with the corresponding tokenizer (we just need AutoTokenizer; Hugging Face takes care of selecting the right one for us)

  • Load the QLoRA configuration and the regular training parameters, and pass everything to the SFTTrainer (a sketch of these steps is shown below)
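Putting those steps together, a minimal sketch of the fine-tuning code looks like the following. It assumes the variables from the previous snippets and a trl version contemporary with this article; newer trl releases move several of these arguments into an SFTConfig object.

# 1. Load the preprocessed dataset
dataset = load_dataset(dataset_name, split="train")

# 2. Configure bitsandbytes for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=getattr(torch, bnb_4bit_compute_dtype),
    bnb_4bit_use_double_quant=use_nested_quant,
)

# 3. Load the base model in 4-bit precision on the GPU, with its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map={"": 0}
)
model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # avoids issues with fp16 training

# 4. QLoRA configuration
peft_config = LoraConfig(
    r=lora_r,
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    bias="none",
    task_type="CAUSAL_LM",
)

# Regular training arguments
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    learning_rate=learning_rate,
    optim="paged_adamw_32bit",  # paged optimizer, commonly used with QLoRA
    fp16=True,
    logging_steps=25,
    save_steps=0,
)

# Pass everything to the SFTTrainer and start training
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
trainer.train()
trainer.model.save_pretrained(new_model)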




The training can be very long, depending on the size of the dataset.

After the training is done, we can use the text generation pipeline to ask questions like "What is a large language model?". I have formatted the input to match Llama 2's prompt template.
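A minimal sketch of that check, assuming the model and tokenizer objects from the training step are still in memory:

# Run text generation with the fine-tuned model, using Llama 2's [INST] tags
prompt = "What is a large language model?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]["generated_text"])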



The model's output:

This is a very decent response for a model with only 7B parameters. In the future, we can train a Llama 2 model on the entire dataset used to train Guanaco.



III. Upload model to Hugging Face Hub


How can we store our new model now?

First, we need to merge the LoRA weights with the base model: load the base model in FP16 and merge everything using the peft library. Once the weights are merged, we also reload the tokenizer.
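A sketch of the merge, reusing the model_name and new_model variables defined earlier (on a T4 you may need to restart the Colab runtime and re-run the imports first to free enough VRAM):

# Reload the base model in FP16 and merge the LoRA weights into it
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map={"": 0},
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload the tokenizer so it can be pushed alongside the merged model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"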


We can now push everything to the Hugging Face Hub to save our model.
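Assuming you are logged in with a Hugging Face token that has write access, the push itself is two calls:

# Log in to the Hub (alternatively, use notebook_login() or an HF token env variable)
!huggingface-cli login

# Push the merged model and its tokenizer to your account
model.push_to_hub(new_model, use_temp_dir=False)
tokenizer.push_to_hub(new_model, use_temp_dir=False)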




You can now use this model for inference by loading it like any other Llama 2 model from the Hub. Here is the link to the model on the Hub.
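For example, loading it back for inference could look like this (replace the repository id with your own):

# Load the fine-tuned model from the Hub like any other Llama 2 model
# (the repository id below is a placeholder)
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

repo_id = "your-username/llama-2-7b-miniguanaco"
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_length=200)
print(pipe("<s>[INST] What is a large language model? [/INST]")[0]["generated_text"])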


The code shown throughout this article is available in this notebook!

 
