Quantization vs Pruning vs Distillation: Optimizing NNs for Inference

14,711 views

Efficient NLP

1 day ago

Four techniques to optimize the speed of your model's inference process:
0:38 - Quantization
5:59 - Pruning
9:48 - Knowledge Distillation
13:00 - Engineering Optimizations
References:
LLM Inference Optimization blog post: lilianweng.github.io/posts/20...
How to deploy your deep learning project on a budget: luckytoilet.wordpress.com/202...
Efficient deep learning survey paper: arxiv.org/abs/2106.08962
SparseDNN: arxiv.org/abs/2101.07948
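Of the techniques indexed above, magnitude pruning is the simplest to show in code. As a minimal illustrative sketch (the function name and `sparsity` parameter are mine, not the video's code): keep the largest-magnitude weights and zero out the rest.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Zero out the `sparsity` fraction of weights with the smallest magnitudes."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)
```

In practice the zeroed weights only speed things up if the runtime exploits sparsity (as in SparseDNN, linked above); dense hardware ignores the zeros.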

Comments: 24
@thomasschmitt9669 3 months ago
This was one of the best explanation videos I have ever seen! Well structured and at the right level of complexity to follow without getting a headache. 👌
@bonob0123 21 days ago
That was really nicely done. As a non-expert, I feel like I now have a good general idea of what a quantized model is. Thank you!
@kevon217 10 months ago
Thanks for this!
@lucaskeller656 2 months ago
Great format, succinctness, and diagrams. Thank you!
@unclecode 4 months ago
Great content, well done. Please make a video on ONNX, and another on Flash Attention. Appreciated.
@muhannadobeidat 3 months ago
Excellent video. Well spoken. Nice visualizations.
@DurgaNagababuMolleti 10 days ago
Superb
@heteromodal 5 months ago
What a great video! Thank you!
@jokmenen_ 4 months ago
Awesome video!
@user-qo7vr3ml4c 1 month ago
Great summary, thank you.
@huiwencheng4585 5 months ago
Fantastic introduction and explanation !
@vineetkumarmishra2989 3 months ago
Wonderfully explained!! Thanks for the video.
@jeremyuzan1169 2 months ago
Great video
@420_gunna 5 months ago
This felt very nicely taught -- I loved that you pulled back for a summary/review at the end of the video - great practice. Please continue, thank you!
@user-bd7eq6vx1t 1 year ago
Your teaching is excellent. We'd welcome many more videos from you to understand the fundamentals of NLP.
@kevon217 10 months ago
^
@MuhammadAli-dw7mv 2 months ago
nicely done
@hrsight 2 months ago
nice video
@yunlu4657 5 months ago
Excellent video, learnt a lot! However, the definition of zero-point quantization is off. What you're showing in the video is abs-max quantization instead.
@EfficientNLP 5 months ago
The example I showed is zero-point quantization because 0 in the original domain is mapped to 0 in the quantized domain (before transforming to unsigned). In abs-max (not covered in this video), the maximum in the original domain would be mapped to 127, and the minimum would be mapped to -128.
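For anyone following this thread, here is a minimal NumPy sketch of both schemes under their commonly used definitions (an illustration, not the video's code): abs-max is symmetric, scaling by the largest magnitude; zero-point is asymmetric, mapping [min, max] onto [-128, 127] with an offset chosen so that real 0 is represented exactly.

```python
import numpy as np

def absmax_quantize(x):
    # Symmetric: ±absmax(x) maps to ±127; real 0 always maps to integer 0.
    scale = 127.0 / np.max(np.abs(x))
    q = np.round(x * scale).astype(np.int8)
    return q, scale  # dequantize with q / scale

def zeropoint_quantize(x):
    # Asymmetric: stretch [min, max] over the full int8 range [-128, 127].
    scale = 255.0 / (np.max(x) - np.min(x))
    # The zero-point is the integer image of 0.0, so 0 is exactly representable.
    zero_point = np.round(-np.min(x) * scale) - 128
    q = np.clip(np.round(x * scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point  # dequantize with (q - zero_point) / scale
```

For example, `zeropoint_quantize(np.array([-1.0, 0.0, 1.0, 2.0]))` sends the endpoints -1.0 and 2.0 to -128 and 127, and 0.0 to the zero-point itself, which dequantizes back to exactly 0.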
@ricardokullock2535 1 month ago
And if one were to quantize a distilled model? Is the outcome any good?
@EfficientNLP 1 month ago
Yes, these two techniques are often used together to improve efficiency.
@andrea-mj9ce 3 months ago
The explanation of distillation stays at the surface; it's not enough to fully understand it.
@EfficientNLP 3 months ago
If you have any specific questions I’ll try to answer them!
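In case it helps other readers of this thread: the core of Hinton-style knowledge distillation is training the student on a mix of the teacher's temperature-softened output distribution and the true labels. A minimal NumPy sketch of the loss (the function names and the `T`, `alpha` hyperparameters are illustrative, not from the video):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature T > 1 flattens the distribution, exposing the teacher's
    # relative confidence across wrong classes ("dark knowledge").
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft loss: cross-entropy against the teacher's softened distribution,
    # rescaled by T^2 to keep gradient magnitudes comparable across T.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    soft_loss = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    # Hard loss: ordinary cross-entropy against the true labels.
    log_p = np.log(softmax(student_logits))
    hard_loss = -log_p[np.arange(len(labels)), labels].mean()
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

The student minimizes this instead of plain cross-entropy; everything else in training stays the same.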