Understanding 4bit Quantization: QLoRA explained (w/ Colab)

39,505 views

1 year ago

QLoRA 4-bit quantization for memory-efficient fine-tuning of LLMs, explained in detail. 4-bit quantization and QLoRA for beginners, theory and code. PEFT: parameter-efficient fine-tuning methods.
This is my third video in the series, on 4-bit quantization and QLoRA. It builds on my first video on the theory of LoRA and other PEFT methods ( • PEFT LoRA Explained in... ) and on the detailed code implementation of LoRA in my second video ( • Boost Fine-Tuning Perf... ).
It also includes a Colab notebook with code to fine-tune Falcon 7B with QLoRA 4-bit quantization and Transformer Reinforcement Learning (TRL).
Huggingface Accelerate now supports 4-bit QLoRA LLM models.
github.com/huggingface/accele...
QLoRA 4-bit Colab NB:
(all rights with Author Artidoro Pagnoni)
colab.research.google.com/dri...
#4bit
#4bits
#quantization
#languagemodel
#largelanguagemodels

Comments: 77

@TylerLali 1 year ago

This was exceptional, truly! Phenomenal explanation

@jmvillwock 1 year ago

These are some of the best details I have ever seen on YouTube! It is clear you are starting to master the subject: from even 3 months ago to today, your ability to explain the what and why has increased at least 3X. Not bad for a large natural language model :)

@chenkiwi9285 1 year ago

This is the best explanation of QLoRA on YouTube. Looking forward to seeing GPTQ and a comparison of GPTQ vs. QLoRA.

@Graverman 11 months ago

thank you for doing the explanation in detail and after that jumping into the code! Really helps to actually understand this

@code4AI 11 months ago

Glad to hear that!

@user-pe4xm7cq5z 7 days ago

Literally the best video on this! Thank you so much!!

@wuhaipeng 3 months ago

Exceptional explanation! Thank you so much!

@BlueDopamine 1 year ago

Nice introduction to QLoRA. Learning a lot from your channel.

@johnbrisbin3626 1 year ago

A very nice explanation. I feel like I understand what is going on much better. Confidence is high.👍

@code4AI 1 year ago

Thanks for your feedback. Great to hear.

@DurandalLM 11 months ago

Your videos and energy are great and informative. Thanks for making this stuff!

@code4AI 11 months ago

I appreciate that!

@desmur36 1 year ago

Perfectly explained! Many get this wrong. Beautiful!

@code4AI 1 year ago

Thank you!

@VighneshSablok 7 months ago

I could not have asked for anything better. There is something here for other teachers as well. They could learn how to effectively teach students.

@Allen-TAN 10 months ago

Excellent explanation! Thanks for the great work for the community.

@hugoibt 11 months ago

Finally some mathematical explanations ! Thanks 🙌

@alvinj.w9540 4 months ago

Thank you for sharing. I am currently investing some time in adapting the quantization method to another memory-bound system. Hope it works out!

@user-xs1wd5yd9m 5 months ago

Very well explained!

@snehotoshbanerjee1938 1 year ago

Awesome!! Best explanation of QLoRA.

@code4AI 1 year ago

Great feedback! Thanks!

@nguyenanhnguyen7658 1 year ago

Easy enough, Sir :) Thank you.

@AndyLee-xq8wq 7 months ago

Thank you for making this great video!

@rramjee1 1 year ago

Excellent videos. Very insightful. Kudos. I have a couple of questions. Can you please help clarify?
1. My understanding is that although memory usage will be lower, QLoRA training can take longer for a single epoch because of quantization and de-quantization during the forward and backward passes. Please confirm if this is true. I do understand that it compresses large models onto a single GPU, thereby giving the opportunity to fine-tune LLMs with fewer resources, so a marginally longer training time per epoch should not be a concern.
2. What happens during inference? Do the weights of all the layers and the injected adapter layers stay in NF4, or are they in 32-bit precision? If NF4, do they need to be dequantized for every inference, so that inference takes longer?

@shaz7163 1 year ago

Nice video! Do we keep the newly added layer in normal precision, or do we need to use quantization and de-quantization?

@nazihfattal974 9 months ago

Thank you for a great explanation. Quick question: in Google's literature, when they talk about freezing a layer, they say that the frozen layers do not participate in the calculation during the forward and backward passes, and that this is what makes fine-tuning a model using LoRA (with frozen model weights) faster than fine-tuning without freezing the weights and without LoRA. They also show in code the number of parameters that are going to be updated using LoRA with model weights frozen vs. fine-tuning without LoRA and without freezing model weights. Would you please share your thoughts? Thanks

@publicsectordirect982 1 year ago

Thanks very much excellent video

@code4AI 1 year ago

Glad you enjoyed it

@TylerLali 1 year ago

Is there any information about how to fine-tune domain-specific small-parameter models with LoRA? Let's say we have created a high-quality dataset of 1000 examples of domain-specific prompt/response pairs. How do I form the dataset to also provide some broader conversational data? What size of data, and how should the batches be split between domain-specific and general data? Do I select randomly, or is there a benefit to placing the domain-specific parts at the beginning or end of training? Would love to hear your thoughts.

@echofloripa 11 months ago

You could add at the end of the notebook the code to load and test the resulting model+qlora adapter :)

@komasoftware1 7 months ago

Running the notebook, I do not see the loss go down, but rather a sawtooth wave around 1.5

@akeshagarwal794 1 year ago

Oh my god! Who are you, brother? I really loved this explanation ❤ Thank you so much

@code4AI 1 year ago

So nice of you

@akeshagarwal794 1 year ago

@@code4AI One request: please make a video on QLoRA vs. GPTQ.

@JaishreeramCoder 3 months ago

Doing this would mean all different values present in our original floating point 32 representation would be mapped to 16 distinct values and stored accordingly, and we can use these 16 values only for later computations?
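
Editor's sketch for the question above: a minimal illustration of mapping a block of floating-point weights onto 16 codes and back. It assumes plain absmax scaling onto an evenly spaced grid on [-1, 1], which is a simplification and not the NF4 codebook discussed in the video; the function names `quantize_block` and `dequantize_block` are hypothetical. It shows that only the 4-bit codes plus one scale constant per block are kept, and that computation uses the approximate values recovered from them.

```python
def quantize_block(weights, levels=16):
    """Map floats to integer codes in [0, levels - 1] using absmax scaling."""
    # One scale constant per block, kept in higher precision.
    absmax = max(abs(w) for w in weights) or 1.0
    # Spacing of an evenly spaced grid on [-1, 1] (assumption, not NF4).
    step = 2.0 / (levels - 1)
    codes = [round((w / absmax + 1.0) / step) for w in weights]
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the block's scale."""
    step = 2.0 / (levels - 1)
    return [(c * step - 1.0) * absmax for c in codes]


weights = [0.42, -1.30, 0.07, 0.91, -0.55, 1.30, 0.0, -0.02]
codes, absmax = quantize_block(weights)
restored = dequantize_block(codes, absmax)

# Worst-case rounding error is half a grid step, scaled by absmax.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(max_error, 4))
```

Every weight in the block ends up as one of at most 16 codes, so distinct original values can collapse onto the same code; the difference between `weights` and `restored` is the quantization error the thread further down keeps returning to.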

@yashtomar9418 4 months ago

Can anyone explain: if LoRA uses computation weights of type bfloat16 and the injected LoRA weights are 32-bit, does that make any difference?

@echofloripa 11 months ago

I just noticed I got an error at step 250:
RuntimeError: [enforce fail at inline_container.cc:471] . PytorchStreamWriter failed writing file data/65: file write failed
It seems I ran out of disk:
System RAM 5.8 / 12.7 GB
GPU RAM 12.8 / 15.0 GB
Disk 78.2 / 78.2 GB
I guess wandb took the whole disk :) Is there any way to reduce the space it takes, or to free some space during training?

@brandomiranda6703 1 year ago

How is it that when it dequantizes back to bf16 or fp32 the GPU doesn't run out of memory? (Maybe it's just the paged optimizer, and if it is I'd be surprised tbh.)

@msamwelmollel 1 year ago

Is there any LoRA method for unsupervised training? Or, if I want the model to understand pieces of information from my own text, how can I achieve that?

@edupignatelli 2 months ago

18:57 How can it be computed if not in memory? Where does the output of the conversion go, if not into memory? I guess it's more that the weights are cast from NF4 to bf16 one layer at a time? As in: Dequantize(L1) -> fwd(L1) -> Quantize(L1) -> Dequantize(L2) -> fwd(L2) -> ... ?
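
Editor's sketch for the question above: rough memory arithmetic for the "one layer at a time" picture. It assumes that only a single layer's dequantized copy is alive at any moment and is discarded after use; because the base weights are frozen, the 4-bit copy never changes and no re-quantization step should be needed. The layer sizes, the function name `peak_weight_bytes`, and the per-layer granularity are all hypothetical, and the estimate ignores activations, quantization constants, LoRA weights, and optimizer state.

```python
def peak_weight_bytes(layer_param_counts, stored_bits=4, compute_bits=16):
    """Estimate peak weight memory when layers are dequantized one at a time."""
    # All frozen layers stay resident in their compact 4-bit form.
    stored = sum(n * stored_bits / 8 for n in layer_param_counts)
    # Assumption: only the largest single layer needs a temporary 16-bit copy.
    largest_temp = max(n * compute_bits / 8 for n in layer_param_counts)
    return stored + largest_temp


# Hypothetical model: 32 equal layers of 200 million parameters each.
layers = [200_000_000] * 32

peak = peak_weight_bytes(layers)
full_16bit = sum(n * 16 / 8 for n in layers)

print(f"4-bit storage plus one 16-bit layer: {peak / 1e9:.1f} GB")
print(f"All layers held in 16-bit:           {full_16bit / 1e9:.1f} GB")
```

Under these assumptions the temporary higher-precision copy does occupy memory, but only for a small slice of the model at a time, which is why the total stays far below holding every layer in 16-bit.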

@pdeubel2 10 months ago

At 34:40 why do you use "torch.float16" for the "bnb_4bit_compute_dtype"? Shouldn't that be "torch.bfloat16" (brain float16) since you load a falcon-7b variant that has bfloat16 weights?

@theshrubberer 1 year ago

Great presentation and explanation!!!! Technically, these quantization and PEFT optimizations are amazing achievements. You mention that "something always has to give". I agree. So we agree that, all things being equal, bigger is better, as in more trainable parameters. But in order to fine-tune "large" models (7B and above) on a single 40 GB GPU, the LoRA approach restricts its adjustments to the 1% of parameters that are in the adapter layer. And you point out that the performance may depend on how similar your fine-tuning dataset is to the pre-training data for a given model and task.
So that leads me to ask (and maybe you can address this in a future video?) about the practical considerations or best practices for when it is worth going up in model size and using the quantization and PEFT techniques to fine-tune an adapter layer vs. doing conventional fine-tuning on a smaller model. For example, imagine a business-specific text classification dataset with text and categories that are unlikely to be similar to the pre-training data. In this case, is it possible that fine-tuning a small (< 1 billion parameter) model might perform as well as or better than quantization plus LoRA on a large model? Or are there other things I am not considering that would make the adapter on a large model always the better choice?

@code4AI 1 year ago

An oversimplified answer for current systems:
a) The more trainable parameters the model has, the more intelligently it can respond to unseen tasks (as in in-context learning).
b) Quantization exists for only one reason: there is not enough memory available. To compensate for the reduced memory, the "new 4-bit QLoRA model" has to be fine-tuned extensively (costly) with a minimum of three different methods, to compensate for re-quantization calculation errors. The costs for this are significant!
c) The quality of the fine-tuning process and of the fine-tuning data set itself has a significant impact on models! A "bad" 4-bit QLoRA model can become "good" when fine-tuned on an exceptionally "intelligent" fine-tuning data set.
d) A "good" 32-bit model can become "bad" when fine-tuned on an exceptionally bad-quality fine-tuning data set.
DATA! The quality and "adaptability" of pre-training data sets, and exceptionally good fine-tuning data sets, are most important for the performance of any model.

@jaskiratsinghsodhi 11 months ago

Your statement "And you point out that the performance may depend on how similar your fine tuning dataset is to the pre-training data for a given model and task." should be "And you point out that the performance may depend on how similar your fine tuning dataset is to the evaluation data for a given model and task."

@jaskiratsinghsodhi 11 months ago

What's the difference between the 3rd and 4th ideas? Fine-tuning is required once PEFT comes into the picture, so what did the fourth idea add beyond the third?

@UNTITLED-ex1wd 9 months ago

I don't understand: if only the 32-bit LoRA weights are stored on the GPU, how can the model do a forward pass if the other 4-bit frozen weights are not on the GPU?

@echofloripa 11 months ago

By the way, excellent video, I have watched many of them! Where is your accent from? British I guess but I got curious whereabouts. I lived in Reading for 6 years :)

@shichenyuan8430 11 months ago

Very nice video! Thanks for the clear demonstration! I just copied the notebook into a Google Colab and executed it without modification yesterday (8/7/23). For the final training, the loss kept shooting up to 1.7x and then falling back to 1.2x, repeatedly, over the 500 training steps (each pulse takes ~50 steps). This looks quite different from the downward trend shown in the video as the final result... Am I supposed to run with a different set of parameters?

@echofloripa 11 months ago

I got the same result. I didn't change anything and the loss went up and down; it doesn't seem to improve much:
Step  Training Loss
10    1.321400
20    1.254200
30    1.342300
40    1.533300
50    1.773400
60    1.208600
70    1.255300
80    1.339700
90    1.556800
100   1.761800
110   1.202700
120   1.278700
130   1.333700
140   1.584600
150   1.733400
160   1.263200
170   1.255600

@toddnedd2138 1 year ago

Thanks for explaining the topic. Do you (or does someone else) know whether I can run a bigger model on a dual-GPU setup locally (inference and fine-tuning) out of the box, or do I have to split the model myself? E.g. a 30B model on a dual RTX 3090 setup where each GPU has 24 GB VRAM.

@code4AI 1 year ago

I would use HuggingFace Accelerate for multi-GPU configs.

@toddnedd2138 1 year ago

@@code4AI Thanks for the reply. Accelerate seems to incorporate the Megatron-LM library. This seems to be the way to go. Unfortunately, it only works with multiple A100s and above (using NVLink and NVBridge). Far over my budget, but thank you anyway.

@cryptojointer 11 months ago

Well this was epic. Unbelievably well made. Thank you. I do have a question, where did the SFT come from? Is that needed for QLoRA, or was that a personal choice to use that instead of a normal Trainer? Thanks again.

@code4AI 11 months ago

The Supervised Fine-tuning Trainer module? It is a HuggingFace standard and most simple to use. See HuggingFace: "Supervised fine-tuning (or SFT for short) is a crucial step in RLHF. In TRL we provide an easy-to-use API to create your SFT models and train them with few lines of code on your dataset." If you are interested, here is the link: huggingface.co/docs/trl/main/en/sft_trainer

@cryptojointer 11 months ago

@@code4AI That's brilliant, thank you very much for replying so quickly and going into detail! Thanks again!

@jayanthbontha876 8 months ago

why does 1.5x sound normal

@wobblynl1742 11 months ago

Is 4-bit also divided into a sign bit, exponent and mantissa? Say 2 bits for the exponent and 1 for the mantissa would leave 2*2^2 = 8 levels? Excellent explanation video btw, it gets me excited to experiment with the techniques.

@jaskiratsinghsodhi 11 months ago

I guess there are quantization groups that are mapped to, say, the range 0 to 15. So, as he mentioned, if my weight is, say, 13, assuming unsigned 4-bit (0 to 15), then 13 corresponds to 0.6 as per the mapping for that particular case.
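
Editor's sketch for the exchange above: one way to picture a 4-bit code as an index into a lookup table of 16 values rather than as a sign/exponent/mantissa layout. The table here is built from equal-probability quantiles of a standard normal distribution and rescaled to [-1, 1]. This only illustrates the idea and is not the exact NF4 codebook (which, among other differences, contains an exact zero); `normal_quantile_levels` and `nearest_code` are hypothetical names.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Build n_levels values from equal-probability bins of N(0, 1)."""
    nd = NormalDist()
    # Use the midpoint probability of each bin to avoid infinite quantiles.
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    scale = max(abs(v) for v in raw)
    # Rescale so the outermost levels sit at -1 and +1.
    return [v / scale for v in raw]


def nearest_code(x, levels):
    """Return the index (the 4-bit code) of the level closest to x."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))


levels = normal_quantile_levels()
code = nearest_code(0.6, levels)

print([round(v, 3) for v in levels])
print(code, round(levels[code], 3))
```

The levels are packed more densely near zero, where normally distributed weights concentrate, so the stored integer is just a table index and the table decides which real value it stands for.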

@keithkam6749 1 year ago

Could someone explain why the fine-tuning task is necessary? I would have thought that during the 4-bit quantization of an original 32-bit tensor, we have the error tensor available. Couldn't we simply initialise the LoRA tensor with the SVD of the quantization error tensor to approximate the "lost" information?

@code4AI 1 year ago

Preventing significant degradation of the model accuracy is a complex endeavor that demands meticulous attention to detail. During fine-tuning, each layer has to learn the trainable PEFT-LoRA bfloat16 weight tensors, which are then stored in memory, in order to fine-tune the complete model on the new fine-tuning training data set (!) for the new task. This fine-tuning means that the system (now only the LoRA weights, since the other weight tensors are frozen) learns (!) the new (fine-tuning) task with the (slightly wrong) re-quantized weight tensors. The fine-tuning process optimizes these LoRA tensors for the fine-tuning task at hand and (!) takes into consideration the "not perfectly re-quantized" tensors that now contribute to the optimization functions of the LoRA tensors.

@keithkam6749 1 year ago

@@code4AI I think we're talking about slightly different things. My understanding is there are two applications for this technique: 1) supervised fine-tuning on smaller hardware and 2) lowered model size for inference in hardware limited settings. I agree and understand why SFT is needed if we wanted to apply a model to a new task. The question I have is more for use case 2 - Say if we already had a model that does what we want but it doesn't fit on whatever device, could we compress it without the fine tuning step or is that also still needed - (or is this a moot question because there are other more efficient model compression methods?) p.s. great videos as always!

@code4AI 1 year ago

Smile. You can compress any data stream to almost zero. Technically doable. Information loss close to 100%. In our case: the model accuracy (performance) falls off the cliff, if you convert 32-bit to 4-bit, without counter measures (like fine-tuning).

@THEMATT222 1 year ago

Noice 👍

@user-wr4yl7tx3w 1 year ago

How do we normalize the weights with zero mean? Isn’t there only one value per weight parameter?

@shaz7163 1 year ago

I guess it's just ranging them to -1 and +1. Zero means normalization. As far as I understood, they say this method is an easy method, and they also assume all the weights are normally distributed.

@SofieSimp 1 year ago

I don't understand why the de-quantization process doesn't increase the memory usage. Can you explain this? Isn't the weight now represented in 32 bits, so the memory should go up?

@code4AI 1 year ago

Two different processes: a) computing and b) storing in memory. The frozen 4-bit weight tensors are de-quantized and therefore only computed, not stored. Only the 32-bit (or bfloat16) QLoRA injected weights are computed AND stored.

@SofieSimp 1 year ago

@@code4AI I thought that in order to perform computation with the frozen 4-bit weight tensor, it must be loaded into memory? Then if we de-quantize it, doesn't the memory allocated to that 4-bit weight tensor increase? Sorry, I do not have much knowledge about this low-level stuff, so can you explain this further?

@mattnas5367 1 year ago

I tried to fine-tune the model on my own Q&A dataset and I got this error: Column to remove ['train'] not in the dataset. Current columns in the dataset: ['input', 'output']. Any idea how to resolve it?

@Kartratte 1 year ago

Not sure if this is correct, but I got the same problem. If you are using a dataset from Hugging Face, look for the "split" pull-down and take that name. In the Colab notebook it is train; in another one I had a different value. I would load the given dataset and compare it to yours: look in the given set for the word or marker "train", which I think is for splitting each pair of input and output. But unfortunately none of them are working; training always crashes randomly after 110 to 180. I tried a lot and my free instances are used up. So I will try another time... perhaps I will try with a smaller model on a local system (only 8 GB VRAM).

@mattnas5367 1 year ago

@@Kartratte I am using my own dataset

@Kartratte 1 year ago

@@mattnas5367 Venelin Valkov also made a Falcon QLoRA video. He used custom training data, I believe.

@frederictost6659 9 months ago

I think normalization is not about unit variance but min and max -1 and 1

@cutebabyseal621 1 year ago

Interesting. I immediately have two follow-up research questions based on this:
- Can this technique be used to compress models *without* additional fine-tuning? For example, what if you apply QLoRA, but you actually run distillation using the original network? Essentially just taking advantage of the quantization error minimization to compress the model without changing anything about the network's task performance.
- I wonder if the quantization errors actually make fine-tuning easier? Think of the quantization errors as a type of slight random perturbation to the model weights. I wonder if it might have an effect where it helps the model not to overfit?

@cutebabyseal621 1 year ago

Also: I wonder what this implies about the quantization errors? Does it mean that if I was to quantize a model and make matrices for each layer that consist of just the quantization error, those matrices would be low rank? And if THAT is true, maybe you don't even need distillation. Maybe you could just quantize, compute the quantization errors, and then use SVD to factorize the quantization error matrix and find a low rank representation of it.

@code4AI 1 year ago

a) No. b) No.

@cutebabyseal621 1 year ago

@@code4AI I would like to understand why you think the answers to these questions are no.