Getting Started with TensorRT-LLM

2,205 views

Long's Short-Term Memory

1 day ago

In this Jupyter Notebook walkthrough, we explore NVIDIA's newly released TensorRT-LLM framework.
The notebook contains a complete setup and installation guide and serves as an easy point of entry for developers to get started.
As an example, we optimize the publicly available LLM BLOOM-560M for GPU inference and discuss related topics such as quantization and in-flight batching.
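For readers who want a feel for what quantization means numerically, below is a toy NumPy sketch of symmetric INT8 weight quantization. It only illustrates the basic idea; TensorRT-LLM's INT8/weight-only schemes (per-channel scales, calibration, etc.) are more involved and are not reproduced here:

```python
import numpy as np

# Toy example: symmetric per-tensor INT8 quantization of a weight matrix.
# TensorRT-LLM's actual schemes are more sophisticated (per-channel scales,
# calibration, SmoothQuant, ...); this only illustrates the core idea.
w = np.random.randn(4, 4).astype(np.float32)

scale = np.abs(w).max() / 127.0                          # map the largest magnitude to 127
w_int8 = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale            # what the kernel effectively "sees"

print("max abs error:", np.abs(w - w_dequant).max())
```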
Notebook from this tutorial:
github.com/CactusQ/TensorRT-L...
GPU used in this tutorial: RTX 4080 (Laptop)
TensorRT-LLM repo:
github.com/NVIDIA/TensorRT-LLM
Nvidia's BLOOM example:
github.com/NVIDIA/TensorRT-LL...
SDK Release Blog Post:
developer.nvidia.com/blog/opt...
Flash Attention:
/ eli5-flash-attention
Inflight Batching:
developer.nvidia.com/blog/nvi...
EDIT:
In the video I explained the concurrency aspect of in-flight batching, i.e. that multiple sequences can be processed simultaneously.
Another aspect is the dynamic management of requests, i.e. organizing the pipeline and task scheduling so that hardware utilization is maximized.
Also, traditional batching required the entire batch to finish before the next batch could be inserted.
With in-flight batching, we insert new sequences before the current batch has finished.
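To make the scheduling idea concrete, here is a toy Python simulation of continuous (in-flight) batching: as soon as a sequence finishes, a queued request takes its slot instead of waiting for the whole batch to drain. This is only a conceptual illustration, not TensorRT-LLM's actual batch manager:

```python
import random
from collections import deque

# Each request needs a random number of decode steps.
queue = deque(random.randint(3, 10) for _ in range(16))   # remaining steps per queued request
max_batch_size = 4
active = [queue.popleft() for _ in range(max_batch_size)] # initial batch

step = 0
while active or queue:
    step += 1
    active = [remaining - 1 for remaining in active]       # one decode step for the whole batch
    finished = sum(1 for r in active if r == 0)
    active = [r for r in active if r > 0]
    while queue and len(active) < max_batch_size:          # refill freed slots immediately
        active.append(queue.popleft())
    if finished:
        print(f"step {step}: {finished} request(s) finished, batch refilled to {len(active)}")
```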
Multi Head Attention (MHA):
• Multi-Head Attention (...
Masked MHA:
towardsdatascience.com/transf...
stackoverflow.com/questions/5...
Timestamps:
00:00 Introduction
00:55 Starting Jupyter
01:36 Setting Up TensorRT-LLM
02:09 BLOOM
02:41 Converting the model from HuggingFace
04:25 Building the default-optimized model
05:59 Building the quantized model (INT8)
06:49 Quantization
07:21 How does TensorRT-LLM work (high-level)?
08:54 Benchmarking
11:02 Visualizing the Results
13:23 Conclusion & Where To Go
#tensorRT-llm #tensorRT #nvidia #deeplearning #llm

Comments: 22
@Gerald-iz7mv · 2 months ago
Is there a benchmark you can run for different models and different frameworks (TGI, vLLM, TensorRT-LLM, SGLang, etc.) to measure latency, throughput, etc.?
@LSTMania · 2 months ago
Hey! I don't think there is a unified way to do that yet, but I'd encourage you to create your own benchmarks depending on what you are trying to optimize for. For example, if you strictly care about performance / GPU usage, you can use NVIDIA's profiler for an in-depth analysis: docs.nvidia.com/cuda/profiler-users-guide/index.html
If you want to do a benchmark similar to the one in this video / Jupyter notebook, feel free to reuse the Python code in the notebook (e.g. ROUGE, timing). There are also plenty of example benchmarks on GitHub that you can repurpose for your specific model/framework; usually changing a couple of lines of code is enough to swap the model.
Latency, time-to-first-token, and tokens-per-second are all valid metrics, but LLMs are such a nascent field that there are no hard rules for benchmarking. It all comes down to what you are trying to optimize or analyze for, and what the specific use case is. So definitely experiment a lot and use as many open-source scripts as you can :) No need to do everything from scratch!
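For example, a minimal framework-agnostic timing harness could look like the sketch below; the helper names are made up for illustration, and you would plug in whichever client code each framework (TGI, vLLM, TensorRT-LLM, SGLang) exposes:

```python
import time

def benchmark(generate_fn, prompts, runs=3):
    """Measure average latency and rough throughput for any generate function.

    `generate_fn` is assumed to take a prompt string and return generated text;
    wrap the client of whichever framework you are testing.
    """
    latencies, total_tokens = [], 0
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            output = generate_fn(prompt)
            latencies.append(time.perf_counter() - start)
            total_tokens += len(output.split())  # crude token proxy; use a real tokenizer for exact numbers
    return {
        "avg_latency_s": sum(latencies) / len(latencies),
        "tokens_per_s": total_tokens / sum(latencies),
    }

# Example usage with a dummy backend:
# print(benchmark(lambda p: p + " ... generated text", ["Hello", "Summarize this article:"]))
```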
@Gerald-xg3rq · 2 months ago
Can you also get the metrics exposed by each framework? Most frameworks like TGI have a /metrics endpoint for model inference (latency, QPS, etc.). How do you get the GPU utilization? And can you also run this benchmark with other frameworks like Text Generation Inference (TGI), vLLM, and Aphrodite?
@LSTMania · 2 months ago
Hi, I cannot help you with those specific requests since I am no expert in these domains, but the official TensorRT-LLM docs (nvidia.github.io/TensorRT-LLM/memory.html) have a section on GPU memory usage and debugging. For GPU utilization, you should look into NVIDIA's profiling tools, which are compatible with TensorRT / TensorRT-LLM: docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#profiling
If you want the metrics from the benchmark in this Jupyter notebook, I'd encourage you to look into the notebook and the Python scripts we run to benchmark all 3 models; you can probably repurpose them for your use case. Good luck!
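If you just need a rough GPU utilization number alongside a benchmark, one option (my own suggestion, not something the notebook uses) is to poll NVML from Python via the pynvml package while the benchmark runs:

```python
import time
import pynvml  # pip install pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

for _ in range(5):  # sample utilization a few times while your benchmark is running
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU util: {util.gpu}% | memory: {mem.used / 1024**2:.0f} / {mem.total / 1024**2:.0f} MiB")
    time.sleep(1)

pynvml.nvmlShutdown()
```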
@Gerald-xg3rq · 2 months ago
Can you make another tutorial showing how to run it with tensorrtllm_backend and the Triton Inference Server?
@LSTMania · 2 months ago
Great idea, I'll put it on my list, but I have no experience with Triton yet. I have a video about quantization and benchmarking coming up. For Triton I need more time to research; please refer to the official docs in the meantime: github.com/triton-inference-server/tensorrtllm_backend
@Gerald-iz7mv · 2 months ago
How do you usually run TensorRT-LLM - which server do people use? Doesn't TensorRT-LLM have a server?
@LSTMania · 2 months ago
@Gerald-iz7mv This article might help you: medium.com/trendyol-tech/deploying-a-large-language-model-llm-with-tensorrt-llm-on-triton-inference-server-a-step-by-step-d53fccc856fa
No, TensorRT-LLM itself does not have a server; it is an LLM optimization tool. For deployment, you need to use other resources. You seem to have a lot of specific questions, so I'd encourage you to do some research on your own, since I unfortunately cannot answer all of them with confidence. I may create a Triton video tutorial in the future and appreciate your engagement! Best of luck!
@navroopbath2089 · 20 days ago
+1 for this tutorial, would love to see the end-to-end implementation with Triton Server. Thanks for your tutorials!
@user-gp6ix8iz9r · 3 months ago
Hi, great video! Do you know if it's possible to split the VRAM over 2 computers that are networked together in the same room? I have 2 computers I wanted to test with. Each computer would have six 6 GB GTX 1060 NVIDIA graphics cards, so 72 GB of VRAM in total to use for LLMs.
@LSTMania · 3 months ago
Yes, this should be possible, but it might take additional work. In the TensorRT-LLM repo's main README, the keyword you are looking for is tensor/pipeline parallelism, which makes use of VRAM across different GPU devices and even machines. A straightforward approach to using multiple GPUs would be the "--worker" flag, but that only supports a single node, so all GPUs need to be on one computer.
I would suggest you try that out first on a single computer and its 6 x 6 GB = 36 GB of VRAM. Then, if that works and you want to connect GPUs across multiple nodes (12 GPUs across 2 computers), you need to look into the Python scripts and see if (and how) they support that. If not, you can always write your own TRT-LLM compilation code based on the official docs (Multi-GPU and Multi-Node Support): nvidia.github.io/TensorRT-LLM/architecture.html#compilation
More resources: you will want to look into NVLink, which is a direct serial connection between GPUs; otherwise you might experience significant latency (github.com/NVIDIA/TensorRT-LLM/issues/389). Another GitHub issue with a similar problem: github.com/NVIDIA/TensorRT-LLM/issues/147. Good luck!
@user-gp6ix8iz9r · 3 months ago
OK, I will give it a try and let you know how I get on. Thank you for your helpful reply 🙂👍
@Gerald-iz7mv · 2 months ago
Do you require any special version of Ubuntu (does 22.04 work)? Any special drivers or a specific version of CUDA? Does it need TensorFlow or PyTorch installed?
@LSTMania · 2 months ago
I think 22.04 works! If you are set on a specific CUDA version you will have to dig into the docs, but I'd recommend just installing the latest available CUDA version. If you follow the TensorRT-LLM install instructions using pip, it will automatically determine compatible PyTorch/TensorFlow versions and resolve potential conflicts or install missing packages. If still in doubt, please use the Docker image used in the Jupyter notebook or in the official TensorRT-LLM docs (nvidia.github.io/TensorRT-LLM). Good luck!
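As a quick sanity check of your environment, you can verify that the driver, CUDA runtime, and PyTorch see the GPU (a small sketch using standard PyTorch calls; the exact versions you need still depend on your TensorRT-LLM release):

```python
import torch

# Quick environment sanity check before installing/running TensorRT-LLM.
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version PyTorch was built with:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
```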
@Gerald-iz7mv · 2 months ago
@LSTMania TensorRT-LLM seems to require the NVIDIA Container Toolkit ... does that also install CUDA?
@LSTMania · 2 months ago
@Gerald-iz7mv In the Jupyter notebook (see description), the first cell actually installs the NVIDIA Container Toolkit. I am not sure if it also installs CUDA, but you can try it out. You can then check your installed CUDA version with the bash command 'nvcc --version'. On my laptop it's CUDA 11.5, and the same goes for Google Colab, so every CUDA version beyond that should work. The Docker container in the GitHub README runs Ubuntu 22.04 and CUDA 12.1.0. Hope that helps! (github.com/NVIDIA/TensorRT-LLM)
@Gerald-iz7mv · 2 months ago
Can you also use it with the Llama 2 model?
@LSTMania · 2 months ago
Absolutely! Here is the official example doc. It says "Llama" only, but it also supports Llama 2: github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama
@Gerald-iz7mv · 2 months ago
@LSTMania Can it run both single and batch inference?
@LSTMania · 2 months ago
@Gerald-iz7mv I think the inference configuration is not restricted by model type, so you should be able to run batch inference as well as single inference (which is just batch size 1) for any model built with TensorRT-LLM. I wouldn't see a reason to prefer single inference over batched inference (in a production environment), even on a single GPU; batched is almost always more performant. But it sounds like you have a very specific use case in mind, so I'd suggest checking out the README / TensorRT-LLM docs. Good luck with your endeavors! nvidia.github.io/TensorRT-LLM/batch_manager.html
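As a rough illustration of why batching usually wins, here is a sketch using plain Hugging Face Transformers with the BLOOM-560M checkpoint from the video (not a TensorRT-LLM engine), so treat it only as a conceptual example:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = "left"  # left-pad so generation continues right after each prompt
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

prompts = ["The weather today is", "TensorRT-LLM is", "BLOOM is a language model that", "GPUs are fast because"]

def timed_generate(batch):
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to(device)
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=32)
    return time.perf_counter() - start

# Single inference: one prompt at a time (batch size 1).
single_total = sum(timed_generate([p]) for p in prompts)
# Batched inference: all prompts in one call.
batched_total = timed_generate(prompts)

print(f"sequential (bs=1): {single_total:.2f}s | batched (bs={len(prompts)}): {batched_total:.2f}s")
```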
@Gerald-iz7mv · 2 months ago
Can you run a web server with it?
@LSTMania · 2 months ago
You can definitely deploy your model behind any web server, but that is outside the scope of this tutorial and of TensorRT-LLM. TensorRT-LLM is for optimizing the inference speed of a model; for serving it over the internet you will need other tools such as the Triton Inference Server (which I am no expert in).
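For a very rough idea of what the serving layer looks like, you could wrap whatever generate function you end up with (TensorRT-LLM runtime, a Triton client, or anything else) behind a small FastAPI app. This is only a hypothetical sketch, not a production setup and not something covered in the video:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

def generate_text(prompt: str, max_new_tokens: int) -> str:
    # Placeholder: call your optimized model here (TensorRT-LLM runtime,
    # a Triton client, or any other inference backend).
    return prompt + " ..."

@app.post("/generate")
def generate(req: GenerateRequest):
    return {"completion": generate_text(req.prompt, req.max_new_tokens)}

# Run with: uvicorn server:app --port 8000   (assuming this file is saved as server.py)
```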