Self-Host and Deploy Local LLAMA-3 with NIMs

  3,767 views

Prompt Engineering

1 day ago

In this video, I walk you through deploying Llama models using NVIDIA NIM. NVIDIA NIM uses microservices to streamline the deployment of various AI models, offering up to a 3x improvement in performance. I demonstrate how to set up an NVIDIA Launchpad, deploy the Llama 3 8B Instruct model, and stress test it to measure throughput. I also show you how to use OpenAI-compatible API servers with NVIDIA NIM.
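The deployment walked through in the video boils down to pulling and running a NIM container. Here is a minimal sketch, assuming the Llama 3 8B Instruct NIM image from NVIDIA's registry; the exact image tag, cache path, and port may differ from what the video uses:

```shell
# Log in to NVIDIA's container registry using the NGC API key
# generated in the NGC portal (shown in the video).
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"

# Run the Llama 3 8B Instruct NIM. It serves an OpenAI-compatible
# API on port 8000 and caches downloaded weights under ~/.cache/nim.
docker run -it --rm --gpus all \
  -e NGC_API_KEY="$NGC_API_KEY" \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

Once the container reports it is ready, the endpoint can be queried at http://localhost:8000/v1 like any OpenAI-compatible server.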
LINKS:
NIM: nvda.ws/44u5KYH
org.ngc.nvidia.com/setup/pers...
NIM Previous Video: • Deploy AI Models to Pr...
💻 RAG Beyond Basics Course:
prompt-s-site.thinkific.com/c...
Let's Connect:
🦾 Discord: / discord
☕ Buy me a Coffee: ko-fi.com/promptengineering
🔴 Patreon: / promptengineering
💼Consulting: calendly.com/engineerprompt/c...
📧 Business Contact: engineerprompt@gmail.com
Become a Member: tinyurl.com/y5h28s6h
💻 Pre-configured localGPT VM: bit.ly/localGPT (use Code: PromptEngineering for 50% off).
Sign up for the newsletter, localGPT:
tally.so/r/3y9bb0
TIMESTAMPS
00:00 Introduction to Deploying Large Language Models
00:13 Overview of NVIDIA NIM
01:02 Setting Up and Deploying a NIM
01:51 Accessing and Monitoring the GPU
03:39 Generating API Keys and Running Docker
05:36 Interacting with the Deployed Model
07:16 Stress Testing the API Endpoint
09:53 Using OpenAI Compatible API with NVIDIA NIM
12:32 Conclusion and Next Steps
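The OpenAI-compatible API segment (09:53) can be sketched with nothing but the standard library. This is a minimal example, assuming a NIM container serving on localhost:8000 and the model name meta/llama3-8b-instruct; both are assumptions, not taken from the video verbatim:

```python
import json
import urllib.request

def build_payload(model, prompt, max_tokens=64):
    """Build an OpenAI-style chat-completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat_completion(base_url, model, prompt):
    """POST to a NIM server's OpenAI-compatible /v1/chat/completions route."""
    data = json.dumps(build_payload(model, prompt)).encode()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    # The response shape mirrors OpenAI's: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

Usage, with the container from earlier in the video running: `chat_completion("http://localhost:8000", "meta/llama3-8b-instruct", "Hello")`. The official `openai` Python client also works by pointing its `base_url` at the same endpoint.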
All Interesting Videos:
Everything LangChain: • LangChain
Everything LLM: • Large Language Models
Everything Midjourney: • MidJourney Tutorials
AI Image Generation: • AI Image Generation Tu...

Comments: 16
@DearGeorge3
@DearGeorge3 2 days ago
It's not clear whether I can run NIM locally and get the 5x performance or not.
@petergasparik924
@petergasparik924 2 days ago
I'm curious too
@engineerprompt
@engineerprompt 1 day ago
Here is the configuration they used for the H100 tests: Llama 3 70B Instruct, input length 7,000 tokens, output length 1,000 tokens, 100 concurrent client requests, on 4x H100 SXM with NVLink. NIM Off (FP16): TTFT ~120 s, ITL ~180 ms. NIM On (FP8): TTFT ~4.5 s, ITL ~70 ms. You can run NIM locally on a Tensor Core GPU, but the performance you get depends on your configuration and hardware, so your mileage may vary.
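The TTFT (time to first token) and ITL (inter-token latency) figures quoted above can be computed from per-token arrival timestamps. A minimal sketch; the function name and argument layout are my own, not from NVIDIA's benchmarking tooling:

```python
def latency_metrics(request_start, token_times):
    """Compute (TTFT, mean ITL) from a request start time and
    the wall-clock timestamp at which each token arrived.

    TTFT = first token's arrival minus request start.
    ITL  = average gap between consecutive token arrivals.
    """
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl
```

For example, a request started at t=0.5 s whose tokens arrive at 1.0, 1.1, and 1.2 s has a TTFT of 0.5 s and an ITL of 0.1 s.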
@user-nl7ur5mc2p
@user-nl7ur5mc2p 2 days ago
Thank you. Amazing channel
@engineerprompt
@engineerprompt 2 days ago
Thanks
@petergasparik924
@petergasparik924 2 days ago
Hi, are you sure the inference speed on H100 is correct? On my RTX 4090 with Llama 3 Instruct 8B Q8_0, inference speed is about 72 t/s, so you're getting lower speed than me.
@orlingueorguiev
@orlingueorguiev 1 day ago
Can you provide a benchmark comparison for when using the ollama server? I really want to see if the claimed performance improvement is actually there.
@engineerprompt
@engineerprompt 1 day ago
Let me see if I can do a comparison between different options (ollama, llama.cpp, vLLM, and NIM). Here is a blog post from NVIDIA that might be helpful (note: the numbers there are for 8B; the results I showed in the video are for a different configuration, 70B): tinyurl.com/as7uvbv8
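A throughput comparison like the one discussed here, and the stress test at 07:16 in the video, amounts to firing concurrent requests and counting completions per second. A runnable sketch; the request function is a simulated placeholder, since a real comparison would call each server's completion endpoint:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt):
    # Placeholder: a real benchmark would POST `prompt` to the
    # server under test (NIM, ollama, vLLM, ...). Here we simulate
    # a fixed 10 ms latency so the sketch runs stand-alone.
    time.sleep(0.01)
    return len(prompt)

def measure_throughput(prompts, concurrency=8):
    """Send all prompts with `concurrency` workers; return requests/second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(send_request, prompts))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed
```

Running `measure_throughput(["hello"] * 32, concurrency=8)` against each backend with identical prompts, token limits, and concurrency gives a like-for-like requests-per-second figure.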
@Nihilvs
@Nihilvs 2 days ago
Thanks! What do you actually pay for when buying NIM?
@engineerprompt
@engineerprompt 2 days ago
You are paying the license fee. My understanding is that you can run this on your own hardware, but you pay a licensing fee for using the software stack.
@Nihilvs
@Nihilvs 1 day ago
@@engineerprompt Good to know! Thanks
@rousabout7578
@rousabout7578 2 days ago
Is this correct? For production use, NIM is part of NVIDIA AI Enterprise, which has different pricing models: - On Microsoft Azure, there's a promotional price of $1 per GPU per hour, though this is subject to change. - For on-premises or other cloud deployments, NVIDIA AI Enterprise is priced at $4,500 per year per GPU.
@engineerprompt
@engineerprompt 2 hours ago
Here is the info: resources.nvidia.com/en-us-ai-enterprise/en-us-nvidia-ai-enterprise/nvidia-ai-enterprise-licensing-guide
@zikwin
@zikwin 2 days ago
I don't have a friend kind enough to give me access to an H100
@eod9910
@eod9910 2 days ago
So I'll say this because evidently other people are too polite, but this is absolute garbage. Who has an H100 hanging around to do this? Don't post stuff that 99% of the people can't do. If you want to post stuff that only people with tens of thousands of dollars and access to this type of hardware can use, go work for one of those companies. Otherwise, you're wasting everybody's time.
@christosmelissourgos2757
@christosmelissourgos2757 2 days ago
Actually, I don't agree. We are building a product, and this is something we are really interested in.
@vitalis
@vitalis 2 days ago
Dude, why are you so bitter? Go out and touch grass for a bit. Have you learnt nothing from the last two decades of tech history? All industrial tech seeps through to prosumer and then mainstream. Your local GPU's performance would have been considered alien tech not too long ago. Sheesh