In this Jupyter Notebook walkthrough we explore NVIDIA's newly released TensorRT-LLM framework.
The notebook contains a complete setup and installation guide and serves as an easy point of entry for developers to get started.
As an example, we optimize the publicly available BLOOM-560M model for GPU inference and discuss related topics such as quantization and in-flight batching.
Notebook from this tutorial:
github.com/CactusQ/TensorRT-L...
GPU used in this tutorial: RTX 4080 (Laptop)
TensorRT-LLM repo:
github.com/NVIDIA/TensorRT-LLM
Nvidia's BLOOM example:
github.com/NVIDIA/TensorRT-LL...
SDK Release Blog Post:
developer.nvidia.com/blog/opt...
Flash Attention:
/ eli5-flash-attention
Inflight Batching:
developer.nvidia.com/blog/nvi...
EDIT:
In the video I explain the concurrency aspect of in-flight batching, i.e. that multiple sequences can be processed simultaneously.
Another aspect is the dynamic scheduling of requests, i.e. organizing the pipeline and task order so that hardware utilization is maximized.
Also, traditional batching requires the entire batch to finish before the next batch can be inserted.
With in-flight batching, new sequences are inserted as soon as slots free up, before the current batch has finished.
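To make the scheduling difference concrete, here is a minimal, hypothetical simulation of in-flight batching (not TensorRT-LLM's actual implementation). Each request is given an illustrative number of decode steps; the function names and step counts are assumptions for the sketch only.

```python
from collections import deque

def inflight_batching(requests, max_batch=4):
    """Simulate in-flight (continuous) batching.

    requests: list of (name, num_steps) pairs, where num_steps is an
    illustrative stand-in for the number of decode steps a sequence needs.
    Returns (completion order, total decode steps taken).
    """
    queue = deque(requests)
    active = []    # sequences currently occupying batch slots
    finished = []
    steps = 0
    while queue or active:
        # Key idea: fill any free batch slot immediately,
        # without waiting for the whole current batch to finish.
        while queue and len(active) < max_batch:
            name, remaining = queue.popleft()
            active.append([name, remaining])
        steps += 1  # one decode step advances every active sequence
        for seq in active:
            seq[1] -= 1
        still_running = []
        for name, remaining in active:
            if remaining == 0:
                finished.append(name)
            else:
                still_running.append([name, remaining])
        active = still_running
    return finished, steps
```

For example, with requests [("a", 1), ("b", 3), ("c", 2), ("d", 1), ("e", 2)] and a batch size of 2, this schedule finishes in 5 decode steps, whereas traditional static batching (waiting for each full batch to drain) would take 7 for the same workload: short sequences no longer hold a slot hostage until the longest sequence in their batch completes.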
Multi Head Attention (MHA):
• Multi-Head Attention (...
Masked MHA:
towardsdatascience.com/transf...
stackoverflow.com/questions/5...
Timestamps:
00:00 Introduction
00:55 Starting Jupyter
01:36 Setting Up TensorRT-LLM
02:09 BLOOM
02:41 Converting the model from HuggingFace
04:25 Building the default-optimized model
05:59 Building the quantized model (INT8)
06:49 Quantization
07:21 How does TensorRT-LLM work (high-level)?
08:54 Benchmarking
11:02 Visualizing the Results
13:23 Conclusion & Where To Go
#tensorRT-llm #tensorRT #nvidia #deeplearning #llm