AWQ for LLM Quantization

5,883 views

MIT HAN Lab

9 months ago

Large language models (LLMs) have shown excellent performance on various tasks, but their astronomical model size raises the hardware barrier for serving (memory size) and slows down token generation (memory bandwidth). In this paper, we propose Activation-aware Weight Quantization (AWQ), a hardware-friendly approach for LLM low-bit weight-only quantization. Our method is based on the observation that weights are not equally important: protecting only 1% of salient weights can greatly reduce quantization error. We then propose to search for the optimal per-channel scaling that protects the salient weights by observing the activations, not the weights. AWQ does not rely on any backpropagation or reconstruction, so it preserves LLMs' generalization ability across different domains and modalities without overfitting to the calibration set. AWQ outperforms existing work on various language-modeling and domain-specific benchmarks. Thanks to this better generalization, it achieves excellent quantization performance for instruction-tuned LMs and, for the first time, multi-modal LMs. Alongside AWQ, we implement an efficient and flexible inference framework tailored for LLMs on the edge, offering more than a 3x speedup over the Hugging Face FP16 implementation on both desktop and mobile GPUs. It also democratizes the deployment of the 70B Llama-2 model on mobile GPUs. Code: github.com/mit-han-lab/llm-awq
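To make the activation-aware per-channel scaling idea concrete, below is a minimal sketch for a single linear layer in Python/NumPy. This is illustrative only and not the API of the mit-han-lab/llm-awq repository: the function names (pseudo_quantize, awq_search_scale), the mean-absolute-activation saliency statistic, the group size, and the alpha grid are all assumptions chosen for readability.

```python
import numpy as np

def pseudo_quantize(w, n_bits=4, group_size=64):
    """Simulate round-to-nearest weight-only quantization, grouped along the
    input dimension (symmetric scales, no zero point, for simplicity)."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    wg = w.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(wg).max(axis=-1, keepdims=True) / (2 ** (n_bits - 1) - 1)
    scale = np.maximum(scale, 1e-8)
    q = np.clip(np.round(wg / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return (q * scale).reshape(out_f, in_f)

def awq_search_scale(w, x_calib, n_bits=4, group_size=64, n_grid=20):
    """Grid-search per-input-channel scales s = s_x**alpha that minimize the
    output error of the quantized layer; no backpropagation is involved."""
    s_x = np.abs(x_calib).mean(axis=0)      # per-channel activation magnitude
    s_x = s_x / (s_x.mean() + 1e-8)         # normalize for numerical stability
    y_ref = x_calib @ w.T                   # full-precision reference output
    best = (np.inf, np.ones_like(s_x), 0.0)
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = np.clip(s_x ** alpha, 1e-4, None)
        # Scale weights up and activations down: the layer stays mathematically
        # equivalent, but salient (high-activation) input channels suffer less
        # relative quantization error.
        w_q = pseudo_quantize(w * s, n_bits, group_size)
        y_q = (x_calib / s) @ w_q.T
        err = float(np.mean((y_ref - y_q) ** 2))
        if err < best[0]:
            best = (err, s, alpha)
    return best  # (output MSE, per-channel scales, chosen alpha)

# Tiny usage example with random data standing in for calibration activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(256, 512)).astype(np.float32)
X = rng.normal(size=(32, 512)).astype(np.float32)
mse, scales, alpha = awq_search_scale(W, X)
print(f"best alpha={alpha:.2f}, output MSE={mse:.6f}")
```

In a real deployment the activation-side division would be folded into the preceding operator (e.g., a normalization or earlier linear layer) so that inference incurs no extra work; the sketch keeps it explicit so the weight-up/activation-down trade is visible.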

Related videos

TinyChatEngine Coding Demo on Apple MacBook Pro (M1, 2021) · 0:43
LoRA & QLoRA Fine-tuning Explained In-Depth · 14:39 · Entry Point AI · 32K views
Quantization vs Pruning vs Distillation: Optimizing NNs for Inference · 19:46
Which Quantization Method is Right for You? (GPTQ vs. GGUF vs. AWQ) · 15:51 · Maarten Grootendorst · 16K views
Pruning Deep Learning Models for Success in Production · 24:35 · Neural Magic · 13K views
Deep Dive: Quantizing Large Language Models, part 1 · 40:28 · Julien Simon · 9K views
Fast LLM Serving with vLLM and PagedAttention · 32:07 · Anyscale · 20K views
Deep Dive: Optimizing LLM inference · 36:12 · Julien Simon · 19K views
EfficientML.ai Lecture 1 - Introduction (MIT 6.5940, Fall 2023) · 1:17:05