Quantization explained with PyTorch - Post-Training Quantization, Quantization-Aware Training

15,392 views

Umar Jamil

In this video I will introduce and explain quantization: we will start with a short introduction to the numerical representation of integers and floating-point numbers in computers, then see what quantization is and how it works. I will explore topics like Asymmetric and Symmetric Quantization, Quantization Range, Quantization Granularity, Dynamic and Static Quantization, Post-Training Quantization and Quantization-Aware Training.
Code: github.com/hkproj/quantizatio...
PDF slides: github.com/hkproj/quantizatio...
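As a taste of what the video covers, here is a minimal NumPy sketch of asymmetric and symmetric 8-bit quantization (the variable names and clipping ranges are my own illustration, not taken from the linked repository):

```python
import numpy as np

def asymmetric_quantize(x, bits=8):
    # Map [min(x), max(x)] onto the unsigned range [0, 2^bits - 1] using a zero point.
    qmin, qmax = 0, 2**bits - 1
    beta, alpha = float(x.min()), float(x.max())
    scale = (alpha - beta) / (qmax - qmin)
    zero_point = int(round(-beta / scale))
    x_q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return x_q, scale, zero_point

def symmetric_quantize(x, bits=8):
    # Map [-max|x|, +max|x|] onto a symmetric signed range; the zero point is 0.
    qmax = 2**(bits - 1) - 1
    scale = float(np.abs(x).max()) / qmax
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def dequantize(x_q, scale, zero_point=0):
    return scale * (x_q.astype(np.float32) - zero_point)

x = np.random.randn(8).astype(np.float32) * 3
x_q, s, z = asymmetric_quantize(x)
print(x)
print(dequantize(x_q, s, z))  # the difference is the quantization error
```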
Chapters
00:00 - Introduction
01:10 - What is quantization?
03:42 - Integer representation
07:25 - Floating-point representation
09:16 - Quantization (details)
13:50 - Asymmetric vs Symmetric Quantization
15:38 - Asymmetric Quantization
18:34 - Symmetric Quantization
20:57 - Asymmetric vs Symmetric Quantization (Python Code)
24:16 - Dynamic Quantization & Calibration
27:57 - Multiply-Accumulate Block
30:05 - Range selection strategies
34:40 - Quantization granularity
35:49 - Post-Training Quantization
43:05 - Quantization-Aware Training

Comments: 72
@zendr0
@zendr0 6 months ago
If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you, Umar. Thank you for all your hard work ❤
@savvysuraj
@savvysuraj 4 months ago
The content made by Umar is helping me a lot. Kudos to Umar.
@vik2189
@vik2189 2 months ago
Fantastic video! Probably the best 50 minutes I've spent on AI-related concepts in the past year or so.
@dariovicenzo8139
@dariovicenzo8139 2 months ago
Great job, in particular the examples of the conversion to/from integers, not only with formulas but with actual numbers too!
@ankush4617
@ankush4617 6 months ago
I keep hearing about quantization so much; this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!! I'm hoping that you will have a video on Mixtral MoE soon 😊
@umarjamilai
@umarjamilai 6 months ago
You read my mind about Mistral. Stay tuned! 😺
@ankush4617
@ankush4617 6 months ago
@@umarjamilai❤
@krystofjakubek9376
@krystofjakubek9376 6 months ago
Great video! Just a clarification: on modern processors, floating-point operations are NOT slower than integer operations. It very much depends on the exact processor, and even then the difference is usually extremely small compared to the other overheads of executing the code. HOWEVER, the reduction in size from a 32-bit float to an 8-bit integer does itself make the operations a lot faster. The cause is twofold: 1) Modern CPUs and GPUs are typically memory bound, so simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next set of data to shrink by 4x as well. 2) Pretty much all machine learning code is vectorized. This means that instead of executing each instruction on a single number, the processor grabs N numbers and executes the instruction on all of them at once (SIMD instructions). However, most processors don't fix N; instead they fix the total number of bits the N numbers occupy (for example, AVX2 can operate on 256 bits at a time), so if we go from 32 bits to 8 bits we can do 4x more operations at once! This is likely what you mean by operations being faster. Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits).
@umarjamilai
@umarjamilai 6 months ago
Thanks for the clarification! I was even going to talk about the internal hardware of adders (carry-lookahead adders) to show how a simple operation like addition works and compare it with the many steps required for floating-point numbers (which also involve normalization). Your explanation nailed it! Thanks again!
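To make the memory-bandwidth argument above concrete, here is a small illustrative PyTorch snippet (the matrix size and scale are arbitrary choices of mine, not values from the video):

```python
import torch

w = torch.randn(4096, 4096)  # float32 weights: 4 bytes per element
w_q = torch.quantize_per_tensor(w, scale=0.02, zero_point=0, dtype=torch.qint8)

fp32_bytes = w.numel() * w.element_size()  # 4096 * 4096 * 4
int8_bytes = w_q.numel() * 1               # 1 byte per element, plus tiny metadata
print(fp32_bytes / int8_bytes)             # ~4x less data to move through memory
```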
@user-rk5mk7jm7r
@user-rk5mk7jm7r 5 months ago
Thanks a lot for the fantastic tutorial. Looking forward to more of the series on LLM quantization!👏
@mandarinboy
@mandarinboy 5 months ago
Great introductory video! Looking forward to GPTQ and AWQ
@myaseena
@myaseena 6 months ago
Really high quality exposition. Also thanks for providing the slides.
@jiahaosu
@jiahaosu 5 months ago
The best video about quantization, thank you very much!!!! It really helps!
@AbdennacerAyeb
@AbdennacerAyeb 6 months ago
Keep going. This is perfect. Thank you for the effort you are making.
@Aaron-hs4gj
@Aaron-hs4gj 3 months ago
Excellent explanation, very intuitive. Thanks so much! ❤
@user-td8vz8cn1h
@user-td8vz8cn1h 3 months ago
This is one of the few channels I subscribed to after watching a single video. Your content is very easy to follow and you cover topics holistically with additional clarifications. What a man)
@jaymn5318
@jaymn5318 4 months ago
Great lecture. A clean explanation of the field that gives an excellent perspective on these technical topics. Love your lectures. Thanks!
@asra1kumar
@asra1kumar 4 months ago
This channel features exceptional lectures, and the quality of explanation is truly outstanding. 👌
@user-lg3jo6ih1t
@user-lg3jo6ih1t 3 months ago
I was searching for quantization basics and could not find relevant videos... this is a life-saver!! Thanks, and please keep up the amazing work!
@user-qo7vr3ml4c
@user-qo7vr3ml4c A month ago
Thank you for the great content, especially the explanation of how QAT aims for a wider region of the loss function and how that makes the model robust to errors due to quantization. Thank you.
@koushikkumardey882
@koushikkumardey882 6 months ago
Becoming a big fan of your work!!
@ojay666
@ojay666 3 months ago
Fantastic tutorial!!! 👍👍👍 I'm hoping that you will post a tutorial on model pruning soon 🤩
@HeyFaheem
@HeyFaheem 6 months ago
You are a hidden gem, my brother
@RaviPrakash-dz9fm
@RaviPrakash-dz9fm A month ago
Legendary content!!
@Youngzeez1
@Youngzeez1 6 months ago
Wow, what an eye-opener! I read lots of research papers but they are mostly confusing, and your explanation just opened my eyes! Thank you. Could you please do a video on the quantization of vision transformers for object detection?
@TheEldadcohen
@TheEldadcohen 5 months ago
Umar, I've seen many of your videos and you are a great teacher! Thank you for your effort in explaining all of these complicated topics in plain (Italian-accented) English. Regarding the content of the video: you showed quantization-aware training and were surprised by its worse result compared to post-training quantization in your concrete example. I think it is because you calibrated the post-training quantization on the same data that you tested it on, so the learned parameters (alpha, beta) are overfitted to the test data, which is why the accuracy was better. I think that if you had tested on true held-out data, you would probably have seen the result you anticipated.
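For readers following this point, the calibration step amounts to choosing the clipping range (alpha, beta) from data the model sees before deployment; a rough sketch of min/max calibration on a separate calibration split (my own simplification, not the notebook from the video) could look like this:

```python
import torch

@torch.no_grad()
def calibrate_minmax(model, calibration_loader):
    # Observe the range of the model's outputs on calibration data only; the
    # resulting (alpha, beta) are then frozen and reused on the real test set.
    alpha, beta = float("-inf"), float("inf")
    for x, _ in calibration_loader:
        y = model(x)
        alpha = max(alpha, y.max().item())
        beta = min(beta, y.min().item())
    return alpha, beta
```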
@sebastientetaud7485
@sebastientetaud7485 4 months ago
Excellent video! Thank you!
@NJCLM
@NJCLM 5 months ago
Great video! Thank you!!
@andrewchen7710
@andrewchen7710 5 months ago
Umar, I've watched your videos on Llama, Mistral, and now quantization. They're absolutely brilliant and I've shared your channel with my colleagues. If you're in Shanghai, allow me to buy you a meal haha! I'm curious about your research process. During the preparation of your next video, I think it would be neat if you documented the timeline of your research/learning and shared it with us in a separate video!
@umarjamilai
@umarjamilai 5 months ago
Hi Andrew! Connect with me on LinkedIn and we can share our WeChat. Have a nice day!
@Patrick-wn6uj
@Patrick-wn6uj 3 months ago
Glad to see fellow Shanghai people here hhhhhhh
@bluecup25
@bluecup25 6 months ago
Thank you, super clear
@manishsharma2211
@manishsharma2211 6 months ago
Beautiful again, thanks for sharing these
@ngmson
@ngmson 6 months ago
Thank you for sharing.
@aminamoudjar4561
@aminamoudjar4561 6 months ago
Very helpful, thank you so much
@user-pe3mt1td6y
@user-pe3mt1td6y 4 months ago
We need more videos about advanced quantization!
@ziyadmuhammad3734
@ziyadmuhammad3734 A month ago
Thanks!
@asra1kumar
@asra1kumar 4 months ago
Thanks
@tetnojj2483
@tetnojj2483 5 months ago
Nice video :) A video on the .gguf file format for models would be very interesting :)
@user-kg9zs1xh3u
@user-kg9zs1xh3u 6 months ago
Very good
@amitshukla1495
@amitshukla1495 6 months ago
wohooo ❤
@lukeskywalker7029
@lukeskywalker7029 4 months ago
@Umar Jamil you said most embedded devices don't support floating-point operations at all? Is that right? What would be an example, and what is that chip architecture called? Does a Raspberry Pi or an Arduino operate on only integer operations internally?
@DiegoSilva-dv9uf
@DiegoSilva-dv9uf 6 months ago
Thanks!
@tubercn
@tubercn 6 months ago
Thanks, great video 🐱‍🏍🐱‍🏍 But I have a question: since we'll dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we have two dequantizations in a row.
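For context on this question, below is roughly where the stubs sit in PyTorch's eager-mode static quantization workflow (a generic sketch, not the exact model from the video): QuantStub converts the float input to int8 and DeQuantStub converts the final output back to float.

```python
import torch
import torch.nn as nn

class QuantizedMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # float -> int8 at the input
        self.fc1 = nn.Linear(784, 100)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(100, 10)
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> float at the output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = QuantizedMLP()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")
prepared = torch.quantization.prepare(model)      # inserts observers for calibration
# ... run calibration batches through `prepared` here ...
quantized = torch.quantization.convert(prepared)  # swaps in int8 modules
```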
@user-hd7xp1qg3j
@user-hd7xp1qg3j 6 months ago
One request: could you explain mixture of experts? I bet you can break down the explanation well.
@pravingaikwad1337
@pravingaikwad1337 2 months ago
For one layer Y = XW + b, if X, W and b are quantized so that we get Y in quantized form, what is the need to dequantize this Y before feeding it to the next layer?
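A small numeric sketch of what is going on in this question (symmetric quantization, made-up scales): the int32 accumulator produced by the int8 matrix multiplication lives on the scale s_x * s_w, which is not the scale the next layer expects for its int8 inputs, so it has to be rescaled, i.e. dequantized or requantized, before being fed forward.

```python
import numpy as np

s_x, s_w = 0.05, 0.01                       # per-tensor scales chosen at calibration
x_q = np.array([[10, -3]], dtype=np.int8)   # quantized activations
w_q = np.array([[4, 7],
                [-2, 5]], dtype=np.int8)    # quantized weights

acc = x_q.astype(np.int32) @ w_q.astype(np.int32)  # int32 accumulator, scale = s_x * s_w
y = s_x * s_w * acc                                # dequantize back to float
# The next layer has its own scale s_y, so y would be re-quantized with s_y rather
# than consuming the raw int32 accumulator directly.
print(acc)
print(y)
```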
@Erosis
@Erosis 6 months ago
You're making all of my lecture materials pointless! (But keep up the great work!)
@AleksandarCvetkovic-db7lm
@AleksandarCvetkovic-db7lm 3 months ago
Could the difference in accuracy between static/dynamic quantization and quantization-aware training be because the model was trained for 5 epochs for static/dynamic quantization and only one epoch for quantization-aware training? I tend to think that 4 more epochs make more of a difference than the quantization method.
@swiftmindai
@swiftmindai 6 months ago
I noticed a small correction needs to be done at timestamp 28:53 [slide: Low-precision matrix multiplication]. In the first line, the dot products are between each row of X and each column of Y [instead of Y, it should be W, the weight matrix].
@umarjamilai
@umarjamilai 6 months ago
You're right, thanks! Thankfully the diagram of the multiply block is correct. I'll fix the slides.
@dzvsow2643
@dzvsow2643 6 months ago
Assalamu alaykum, brother. Thanks for your videos! I have been working on game development with pygame for a while and I now want to start deep learning in Python, so could you make a roadmap video?! Thank you again
@umarjamilai
@umarjamilai 6 months ago
Hi! I will do my best! Stay tuned
@venkateshr6127
@venkateshr6127 6 months ago
Could you please make a video on how to build tokenizers for languages other than English?
@bamless95
@bamless95 4 months ago
Be careful, CPython does not do JIT compilation; it is a pretty straightforward stack-based bytecode interpreter.
@umarjamilai
@umarjamilai 4 months ago
Bytecode has to be converted into machine code somehow. That's also how .NET works: first C# gets compiled into MSIL (an intermediate representation), and then it just-in-time compiles the MSIL into the machine code for the underlying architecture.
@bamless95
@bamless95 4 months ago
Not necessarily; bytecode can just be interpreted in place. In a loose sense it is being "converted" to machine code, meaning that we are executing different snippets of machine code through branching, but JIT compilation has a very different meaning in the compiler and interpreter field. What Python is really doing is executing a loop with a switch branching on every possible opcode. By looking at the interpreter implementation in the CPython GitHub repo, in Python/ceval.c and Python/generated_cases.c.h (alas, YouTube is not letting me post links), you can clearly see there is no JIT compilation involved.
@bamless95
@bamless95 4 months ago
What you are saying about C# (and, for that matter, Java and some other runtimes like LuaJIT or V8 JavaScript) is indeed true: they typically JIT the code either before or during interpretation. But CPython is a much simpler (and thus slower) bytecode interpreter that implements neither JIT compilation nor any serious code optimization (aside from a fairly rudimentary peephole optimization step).
@bamless95
@bamless95 4 months ago
Don't get me wrong, I think the video is phenomenal. I just wanted to correct a little imperfection that, as a programming-language nerd, I feel is important to get right. Also, greetings from Italy! It is good for once to see a fellow Italian making content that is worth watching on YT 😄
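For anyone curious about the CPython detail discussed above, the standard library's dis module shows the stack-based bytecode that the interpreter loop executes opcode by opcode (just an illustration of that point, unrelated to quantization):

```python
import dis

def scale(x, s):
    return round(x / s)

dis.dis(scale)  # prints the bytecode CPython interprets, one opcode per line
```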
@elieelezra2734
@elieelezra2734 6 months ago
Umar, thanks for all your content. I've stepped up a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model. The prediction is not the same anymore! Does it mean you need to dequantize the prediction? If yes, you do not talk about it, right? Can I have your email to get more details, please?
@umarjamilai
@umarjamilai 6 months ago
Hi! Since the output of the last layer (the matrix Y) is dequantized, the prediction will be "the same" as (very similar to) that of the dequantized model. The Y matrix of each layer is always dequantized, so the output of each layer is more or less equal to that of the dequantized model.
@alainrieger6905
@alainrieger6905 6 months ago
Hi, thanks for your answer @@umarjamilai. Does it mean, for post-training quantization, that the more layers a model has, the greater the difference between the quantized and dequantized model, since the error accumulates at each new layer? Thanks in advance
@umarjamilai
@umarjamilai 6 months ago
@@alainrieger6905 That's not necessarily true, because the error in one layer may be "positive" and in another "negative", and they may compensate for each other. For sure, the number of bits used for quantization is a good indicator of the quality of quantization: if you use fewer bits, you will have more error. It's like having an image that is originally 10 MB and trying to compress it to 1 MB or to 1 KB: of course, in the latter case you'd lose much more quality than in the first.
@alainrieger6905
@alainrieger6905 6 months ago
@@umarjamilai Thank you, sir! Last question: when you talk about dequantizing a layer's activations, does it mean that the values go back to 32-bit format?
@umarjamilai
@umarjamilai 6 months ago
@@alainrieger6905 Yes, it means going back to floating-point format.
@theguyinthevideo4183
@theguyinthevideo4183 5 months ago
This may be a stupid question, but what's stopping us from just setting the weights and biases to be in integer form? Is it due to the nature of backprop?
@umarjamilai
@umarjamilai 5 months ago
Forcing the weights and biases to be integers means adding more constraints to the gradient-descent algorithm, which is not easy and is computationally expensive. It's as if I asked you to solve the equation x^2 - 5x + 4 = 0 but only for integer x. This means you can't just use the formula you learnt in high school for quadratic equations, because that returns real numbers. Hope it helps.
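To add a bit of code to this reply: quantization-aware training usually sidesteps the integer constraint with "fake" quantization and a straight-through estimator, roughly as in the sketch below (my own simplification with an arbitrary scale, not the exact code from the video).

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, scale):
        # Forward pass: quantize then dequantize, so the loss feels the rounding error.
        return torch.clamp(torch.round(w / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pretend round() has derivative 1,
        # so gradients still reach the underlying float weights.
        return grad_output, None

w = torch.randn(4, 4, requires_grad=True)
w_fq = FakeQuantSTE.apply(w, torch.tensor(0.05))
w_fq.sum().backward()
print(w.grad)  # gradients flow despite the non-differentiable rounding
```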
@sabainaharoon7050
@sabainaharoon7050 5 months ago
Thanks!
@umarjamilai
@umarjamilai 5 months ago
Thanks for your support!
@007Paulius
@007Paulius 6 months ago
Thanks