AWQ for LLM Quantization

5,883 views

MIT HAN Lab

9 months ago

Large language models (LLMs) have shown excellent performance on various tasks, but their astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability across different domains and modalities without overfitting to the calibration set. AWQ outperforms existing work on various language-modeling and domain-specific benchmarks. Thanks to this better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement an efficient and flexible inference framework tailored for LLMs on the edge, offering more than a 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs. Code: github.com/mit-han-lab/llm-awq
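To make the activation-aware per-channel scaling idea concrete, below is a minimal sketch for a single linear layer in Python/NumPy. This is illustrative only and not the API of the mit-han-lab/llm-awq repository: the function names (pseudo_quantize, awq_search_scale), the mean-absolute-activation saliency statistic, the group size, and the alpha grid are all assumptions chosen for readability.

```python
import numpy as np

def pseudo_quantize(w, n_bits=4, group_size=64):
    """Simulate round-to-nearest weight-only quantization, grouped along the
    input dimension (symmetric scales, no zero point, for simplicity)."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    wg = w.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(wg).max(axis=-1, keepdims=True) / (2 ** (n_bits - 1) - 1)
    scale = np.maximum(scale, 1e-8)
    q = np.clip(np.round(wg / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return (q * scale).reshape(out_f, in_f)

def awq_search_scale(w, x_calib, n_bits=4, group_size=64, n_grid=20):
    """Grid-search per-input-channel scales s = s_x**alpha that minimize the
    output error of the quantized layer; no backpropagation is involved."""
    s_x = np.abs(x_calib).mean(axis=0)      # per-channel activation magnitude
    s_x = s_x / (s_x.mean() + 1e-8)         # normalize for numerical stability
    y_ref = x_calib @ w.T                   # full-precision reference output
    best = (np.inf, np.ones_like(s_x), 0.0)
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = np.clip(s_x ** alpha, 1e-4, None)
        # Scale weights up and activations down: the layer stays mathematically
        # equivalent, but salient (high-activation) input channels suffer less
        # relative quantization error.
        w_q = pseudo_quantize(w * s, n_bits, group_size)
        y_q = (x_calib / s) @ w_q.T
        err = float(np.mean((y_ref - y_q) ** 2))
        if err < best[0]:
            best = (err, s, alpha)
    return best  # (output MSE, per-channel scales, chosen alpha)

# Tiny usage example with random data standing in for calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)
X = rng.normal(size=(32, 512)).astype(np.float32)
mse, scales, alpha = awq_search_scale(W, X)
print(f"best alpha={alpha:.2f}, output MSE={mse:.6f}")
```

In a real deployment the activation-side division would be folded into the preceding operator (e.g., a normalization or earlier linear layer) so that inference incurs no extra work; the sketch keeps it explicit so the weight-up/activation-down trade is visible.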

Related videos

TinyChatEngine Coding Demo on Apple MacBook Pro (M1, 2021) · 0:43
LoRA & QLoRA Fine-tuning Explained In-Depth · 14:39 · Entry Point AI · 32K views
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference · 19:46
Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ) · 15:51 · Maarten Grootendorst · 16K views
Pruning Deep Learning Models for Success in Production · 24:35 · Neural Magic · 13K views
Deep Dive: Quantizing Large Language Models, part 1 · 40:28 · Julien Simon · 9K views
Fast LLM Serving with vLLM and PagedAttention · 32:07 · Anyscale · 20K views
Deep Dive: Optimizing LLM inference · 36:12 · Julien Simon · 19K views
EfficientML.ai Lecture 1 - Introduction (MIT 6.5940, Fall 2023) · 1:17:05