Run 70B Llama 3 Inference on a Single 4GB GPU

10,863 views

Rohan-Paul-AI

1 month ago

Code : github.com/rohan-paul/LLM-Fin...
🐦 Connect with me on Twitter: / rohanpaul_ai
Airllm Github - github.com/lyogavin/Anima/tre...
Check out the MASSIVELY UPGRADED 2nd Edition of my Book (with 1300+ pages of Dense Python Knowledge) 🐍🔥
Covering 350+ Python 🐍 Core concepts ( 1300+ pages ) 🚀
🟠 Book Link - rohanpaul.gumroad.com/l/pytho...
-----------------
Hi, I am a Machine Learning Engineer | Kaggle Master. Connect with me on 🐦 TWITTER: / rohanpaul_ai - for daily in-depth coverage of Large Language Model bits
----------------
You can find me here:
**********************************************
🐦 TWITTER: / rohanpaul_ai
👨🏻‍💼 LINKEDIN: / rohan-paul-ai
👨‍🔧 Kaggle: www.kaggle.com/paulrohan2020
👨‍💻 GITHUB: github.com/rohan-paul
🧑‍🦰 Facebook : / rohan.paul.562
📸 Instagram: / rohan_paul_2020
**********************************************
Other Playlist you might like 👇
🟠 MachineLearning & DeepLearning Concepts & interview Question Playlist - bit.ly/380eYDj
🟠 ComputerVision / DeepLearning Algorithms Implementation Playlist - bit.ly/36jEvpI
🟠 DataScience | MachineLearning Projects Implementation Playlist - bit.ly/39MEigt
🟠 Natural Language Processing Playlist : bit.ly/3P6r2CL
----------------------
#LLM #Largelanguagemodels #Llama2 #LLMfinetuning #opensource #NLP #ArtificialIntelligence #datascience #textprocessing #deeplearning #deeplearningai #100daysofmlcode #neuralnetworks #datascience #generativeai #generativemodels #OpenAI #GPT #GPT3 #GPT4 #chatgpt #genai

Comments: 49
@scottmiller2591 • 24 days ago
Good writeup - covered when it's applicable, and pros and cons. I would recommend using it on a machine with a lot of RAM, setting up a RAM disk, and using that for your cache - that would knock the latency down somewhat.
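For anyone who wants to try the RAM-disk suggestion, here is a minimal sketch. It assumes a tmpfs mount already exists (e.g. /mnt/ramdisk) and that AirLLM's `layer_shards_saving_path` argument can be used to relocate the per-layer shard cache; check the AirLLM repo for the exact parameter name in the release you install.

```python
# Sketch only: point AirLLM's layer-shard cache at a RAM-backed filesystem.
# Assumes a tmpfs is already mounted, e.g.:
#   sudo mkdir -p /mnt/ramdisk && sudo mount -t tmpfs -o size=48G tmpfs /mnt/ramdisk
# and assumes the `layer_shards_saving_path` argument described in the AirLLM README.
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",
    layer_shards_saving_path="/mnt/ramdisk/airllm_shards",  # shards cached in RAM
)
```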
@tshawtshi3040 • 26 days ago
I was thinking about this for a while. I'm glad someone did it. I think if done properly you can get similar performance to having all weights in VRAM.
@RohanPaul-AI • 26 days ago
Indeed.
@javiergimenezmoya86 • 28 days ago
Is it possible to configure that library to use RAM instead of SSD? It would be useful if you have a computer with a lot of RAM (e.g. 64GB), because all the layers would fit in memory with 4-bit quantization.
@RohanPaul-AI • 28 days ago
I was thinking the same about offloading to RAM, as it has become so cheap. However, on a quick search I could not find that option in the lib yet. Will need to investigate more. If you find it, please let me know as well.
@i6od • 28 days ago
... isn't this question ironic? Doesn't an LLM naturally load into RAM / VRAM? The whole point of this project is to switch it to an actual storage drive, so you can run the 70B from the drive instead of having issues with overloading VRAM / RAM.
@RohanPaul-AI • 28 days ago
@i6od Indeed, this project brings a completely new way to deal with LLMs beyond RAM/VRAM.
@brianlink391 • 27 days ago
Really simple to do: just create a RAM drive with a simple application you can download, put your model into the RAM drive, and load it from there. You're all set.
@poldiderbus3330 • 27 days ago
I would then just try to use a RAM disk.
@honestgoat • 26 days ago
Using this method then, is it possible to run, say, a 350B model on a GPU with 20/24GB VRAM? Say Grok-1, which is 314B, could run on a 3090/4090 using this method? I know it would be slow af, but it could work, right?
@RohanPaul-AI • 26 days ago
Theoretically possible. The layered inference approach will just do the sequential loading and unloading of model layers. Of course, the latency will accumulate and result in super, super slow inference.
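A rough back-of-the-envelope check of that (the numbers below are illustrative assumptions, not measurements): at 4-bit quantization a 314B-parameter model needs on the order of 160 GB of shards on disk, while each individual transformer block is only a few GB, so one layer at a time fits comfortably in 20-24 GB of VRAM.

```python
# Ballpark sizing for layer-by-layer inference of a ~314B model (illustrative only).
params = 314e9           # total parameter count (Grok-1 class model)
bytes_per_param = 0.5    # ~4-bit quantization
n_layers = 64            # assumed number of transformer blocks, for illustration

disk_gb = params * bytes_per_param / 1e9   # total shard size on SSD
layer_gb = disk_gb / n_layers              # weights resident on the GPU at once

print(f"on-disk shards : ~{disk_gb:.0f} GB")
print(f"per-layer VRAM : ~{layer_gb:.1f} GB  (well under 20-24 GB)")
```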
@gaborcsurke6937 • 27 days ago
The question is, if we have more VRAM, like 16 or 24GB, can that be used to mitigate the SSD bottleneck more? Maybe that way it can read not just one layer but multiple, and be even faster.
@RohanPaul-AI • 26 days ago
Yes, I think it's possible, i.e. you can manage the number of layers allocated to the GPU, reducing the frequency of SSD reads. Here's the long answer.

In the current implementation of their code (check the GitHub repo), the `AirLLMBaseModel` class in `airllm_base.py` loads and processes one layer at a time during the forward pass. However, you can modify the `forward` method to load and cache a certain number of layers based on the available GPU memory.

For example, you can introduce a configuration parameter to specify the number of layers to cache in GPU memory. Then, in the `forward` method, you can load and store the layers in a cache until the specified number of layers is reached. When processing the next layer, you can check if it is already in the cache before loading it from SSD.

Here's a simplified example of how you could modify the `forward` method to cache multiple layers:

```python
def forward(self, ...):
    ...
    max_cached_layers = 4       # maximum number of layers to keep in GPU memory
    cached_order = []           # layer names in the order they were cached
    self.cached_layers = {}     # layer_name -> layer already moved to the GPU

    for i, (layer_name, layer) in enumerate(zip(self.layer_names, self.layers)):
        if layer_name in self.cached_layers:
            # Layer is already cached, use it directly
            layer = self.cached_layers[layer_name]
        else:
            # Load the layer from SSD and add it to the cache
            state_dict = self.load_layer_to_cpu(layer_name)
            self.move_layer_to_device(state_dict)
            cached_order.append(layer_name)
            self.cached_layers[layer_name] = layer

            # Evict the oldest cached layer if the cache size exceeds the maximum
            if len(cached_order) > max_cached_layers:
                oldest_layer = cached_order.pop(0)
                del self.cached_layers[oldest_layer]

        # Process the layer
        ...
```

In this example, `max_cached_layers` determines the maximum number of layers to cache in GPU memory, and `cached_order` keeps track of the order in which layers were cached. When processing a layer, the code first checks whether it is already cached; if not, it loads the layer from SSD, adds it to the cache, and evicts the oldest cached layer if the cache size exceeds the maximum.

By caching multiple layers in GPU memory, you can reduce the number of SSD reads required during inference. Additionally, you may need to handle the case where a single layer itself exceeds the available GPU memory. In such scenarios, you might need to explore other techniques like tensor parallelism or model sharding to distribute the layer across multiple GPUs or devices.
@Linuslkm • 25 days ago
Have you tried it on a RAM disk? If so, could you make another video comparing performance?
@RohanPaul-AI • 24 days ago
No, haven't tried that yet, but I will.
@nexusphreez • 26 days ago
So my only question is: can this be integrated with Ollama?
@RohanPaul-AI • 26 days ago
Don't think Ollama supports this.
@perelmanych • 25 days ago
There are many comments about loading layers from RAM instead of SSD. Basically, it doesn't make sense. You will get better performance doing all the computations on the CPU. Why? Very simple: when you run an LLM on the CPU, the main bottleneck is not CPU speed but RAM bandwidth, which is why it is much faster to run an LLM on a GPU - it has much higher bandwidth. With this lib you would have to copy each layer from RAM to VRAM every time and then compute the layer's output on the GPU. That doesn't make sense, since your CPU does the computation faster than it gets the data from RAM. So no magic here: if you want to run a very big model and it fits in RAM, just run it on the CPU.
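To put rough numbers on that argument (the bandwidth figures below are typical ballpark assumptions, not benchmarks): streaming all the weights once per token from system RAM already dominates the per-token cost, and pushing the same bytes over PCIe to the GPU takes at least as long, so the copy alone erases any GPU speedup.

```python
# Rough comparison: CPU inference from RAM vs. copying layers RAM -> VRAM per token.
# All bandwidth numbers are ballpark assumptions for typical desktop hardware.
model_gb = 35.0      # ~70B parameters at 4-bit; every weight is touched once per token
ram_bw_gbps = 50.0   # dual-channel DDR4/DDR5 system memory bandwidth
pcie_bw_gbps = 25.0  # realistic PCIe 4.0 x16 host-to-device throughput

cpu_s_per_token = model_gb / ram_bw_gbps    # CPU is bound by streaming weights from RAM
copy_s_per_token = model_gb / pcie_bw_gbps  # GPU path must first move weights over PCIe

print(f"CPU-only, RAM-bandwidth bound : ~{cpu_s_per_token:.2f} s/token")
print(f"RAM -> VRAM copy alone        : ~{copy_s_per_token:.2f} s/token")
# The transfer alone already costs more than the CPU's own read time, so streaming
# layers from RAM to the GPU cannot beat simply running the model on the CPU.
```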
@krisKrag • 26 days ago
There is a paper from Apple in 2023 doing this; the difference is that Apple targets efficiency in reading chunks specifically on its own hardware. KZfaq censored my previous comment where I pasted the link and title of the paper :/
@RohanPaul-AI • 26 days ago
Yes, I think you are talking about "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory" twitter.com/rohanpaul_ai/status/1737425137451073573
@jnchacon • 28 days ago
Why SSD? Why not RAM? If I have enough RAM to hold the entire LLM, can the layers be read from RAM? (RAM to VRAM)
@brianmi40 • 27 days ago
Google "RAM disk" - still a thing in Win 11...
@RohanPaul-AI • 26 days ago
Yes, offloading to RAM will always be better given how cheap it is. But this library wanted a new way to deal with LLMs, bypassing RAM/VRAM as much as possible.
@RobertMcGovernTarasis • 25 days ago
How much disk space would this all need?
@RohanPaul-AI • 24 days ago
You just need to be able to fit the entire model on your SSD.
@BrokenOpalVideos • 26 days ago
How many tokens per second would you get though?
@RohanPaul-AI • 26 days ago
Depends on SSD read speed. It may vary, but on Mac hardware I was getting about 1 token per 2 seconds.
@Gatrehs • 23 days ago
@RohanPaul-AI Is this a regular SSD or an NVMe?
@RohanPaul-AI • 23 days ago
@Gatrehs It's an NVMe.
@lou.later269 • 28 days ago
Damn, imagine the same optimization for an 8B model - the speeds would rival Groq.
@RohanPaul-AI • 28 days ago
Yes, but actual speed may not improve much as you still have to do disk I/O. So you will always be bottlenecked by your SSD read speed.
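A crude way to estimate that ceiling (the numbers below are assumptions, and the OS page cache or a partially resident model can make real runs faster): if every weight has to be re-read from disk for each token, the per-token time is roughly the model's on-disk size divided by the drive's sequential read speed.

```python
# Crude SSD-bound estimate: each token re-reads the full set of layer shards.
# Both numbers are assumptions; OS page caching can make real throughput better.
model_gb = 35.0   # ~70B model at 4-bit on disk
ssd_gbps = 5.0    # assumed NVMe sequential read speed

s_per_token = model_gb / ssd_gbps
print(f"~{s_per_token:.0f} s/token  (~{1/s_per_token:.2f} tokens/s)")
```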
@dinoscheidt • 27 days ago
Exactly. The smaller the model, the higher the proportional IO overhead compared to compute... at 8B, paging memory in and out like this makes it far slower than it is right now. That is because the compute time needed per additional parameter in an XB model grows exponentially, so large models are so slow in compute that IO overheads like these can become negligible. But there are interesting developments like vLLM that use something like virtual memory management to pack a very large model into small GPU memory, skipping the need for IO speed (since there is no IO to disk) - everything stays in memory on the graphics card.
@RohanPaul-AI • 27 days ago
@dinoscheidt Very well explained. Thanks.
@damien2198 • 26 days ago
To my understanding, Groq uses a similar trick, as their LPU has only 250MB (yes, MB) of memory.
@lostpianist • 26 days ago
@dinoscheidt Can't wait for vLLM Llama 3 400B. For a few years I've been hoping for something like that; then really top-level AI can be run locally by anyone with a reasonable computer and an OK graphics card... Will be amazing for productivity, gaming, etc.
@MuhammadAdnan-tq3fx • 29 days ago
Is it possible offline?
@RohanPaul-AI • 28 days ago
Yes, you can pass the locally downloaded model's local path, like below:
model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
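For completeness, a hedged end-to-end sketch of offline use, following the general pattern shown in the AirLLM README; the exact tokenizer/generate interface may differ between AirLLM versions, so treat the calls below as assumptions to verify against the repo.

```python
# Sketch of fully offline inference with AirLLM from a local snapshot directory.
# The tokenizer/generate usage follows the AirLLM README pattern; verify the exact
# interface against the version of the library you have installed.
from airllm import AutoModel

local_path = ("/home/ubuntu/.cache/huggingface/hub/"
              "models--garage-bAInd--Platypus2-70B-instruct/snapshots/"
              "b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
model = AutoModel.from_pretrained(local_path)

input_tokens = model.tokenizer(
    ["What is layer-by-layer inference?"],
    return_tensors="pt",
    truncation=True,
    max_length=128,
)

output = model.generate(
    input_tokens["input_ids"].cuda(),
    max_new_tokens=20,
    use_cache=True,
)
print(model.tokenizer.decode(output[0]))
```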
@caseyhoward8261 • 27 days ago
@RohanPaul-AI Thank you! ❤