Coding LLaMA 2 from scratch in PyTorch - KV Cache, Grouped Query Attention, Rotary PE, RMSNorm

  33,725 views

Umar Jamil


1 day ago

Full coding of LLaMA 2 from scratch, with full explanations, including Rotary Positional Embedding, RMS Normalization, Multi-Query Attention, the KV Cache, Grouped Query Attention (GQA), the SwiGLU activation function and more!
I explain the most commonly used inference strategies: greedy, beam search, temperature scaling, random sampling, top-k and top-p.
I also explain the math behind the Rotary Positional Embedding, with step-by-step proofs.
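Of the inference strategies listed above, top-p (nucleus) sampling is perhaps the least obvious. A minimal PyTorch sketch of the idea (my own illustration, not the repository's code; the function name and tensor shapes are assumptions):

```python
import torch

def sample_top_p(probs: torch.Tensor, p: float) -> torch.Tensor:
    """Sample one token id from a (batch, vocab) tensor of probabilities,
    restricted to the smallest set of tokens whose cumulative mass covers p."""
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Zero out tokens beyond the nucleus; subtracting sorted_probs shifts the
    # threshold so the first token that crosses p is still kept.
    sorted_probs[cumulative - sorted_probs > p] = 0.0
    # Renormalize the surviving mass and sample from it.
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    # Map the sampled position back to the original vocabulary index.
    return torch.gather(sorted_idx, -1, next_sorted)
```

With a typical p around 0.9, this keeps only the high-probability head of the distribution, which is why top-p avoids the low-quality tail that pure random sampling can occasionally hit.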
Repository with PDF slides: github.com/hkproj/pytorch-llama
Download the weights from: github.com/facebookresearch/l...
Prerequisites:
1) Transformer explained: • Attention is all you n...
2) LLaMA explained: • LLaMA explained: KV-Ca...
Chapters
00:00:00 - Introduction
00:01:20 - LLaMA Architecture
00:03:14 - Embeddings
00:05:22 - Coding the Transformer
00:19:55 - Rotary Positional Embedding
01:03:50 - RMS Normalization
01:11:13 - Encoder Layer
01:16:50 - Self Attention with KV Cache
01:29:12 - Grouped Query Attention
01:34:14 - Coding the Self Attention
02:01:40 - Feed Forward Layer with SwiGLU
02:08:50 - Model weights loading
02:21:26 - Inference strategies
02:25:15 - Greedy Strategy
02:27:28 - Beam Search
02:31:13 - Temperature
02:32:52 - Random Sampling
02:34:27 - Top K
02:37:03 - Top P
02:38:59 - Coding the Inference
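As a taste of one of the chapters above, RMS normalization (used in LLaMA in place of LayerNorm) fits in a few lines. This is a hedged, minimal version of the standard formula, not the exact code written in the video:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        # Learnable per-feature scale (gamma), initialized to all ones.
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the root mean square of the features: unlike LayerNorm,
        # there is no mean subtraction and no learnable bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```

Dropping the re-centering step makes RMSNorm slightly cheaper than LayerNorm while working just as well in practice, which is why LLaMA adopts it.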

Comments: 84
@imbingle · 28 days ago
Would love to see lighter-weight LLMs trained on custom datasets. Thanks for the video! This channel is a gold mine.
@RaghavendraK458 · 6 months ago
Very good video. You have a knack for conveying complex content in an understandable format. Thank you and keep up the great work!
@TheMzbac · 7 months ago
Highly recommended for anyone who wants to understand open-source LLMs inside and out.
@gabchen · 11 months ago
Haven't watched the full video yet, but thanks for the promising content. Please keep it going. I would like to see more of the environment setup and the debugging process.
@sounishnath513 · 11 months ago
No comments... Need to learn many things... Thank you very much for creating such interesting and helpful content. I am fortunate to have found your channel.
@ravimandliya1881 · 11 months ago
Very excited for this!!! The weekend is going to be fun!
@user-jf6li8mn3l · 6 months ago
Thank you for such a detailed analysis of the architecture and implementation features of the model! You are very good at presenting information!
@dongdongqiaqia · 11 months ago
Marked for my next watch. Thanks for producing high-quality videos for the series. Hope you have fun in China.
@marshallmcluhan33 · 10 months ago
Thanks for explaining all of these concepts. Keep up the good work 😎
@mazenyasser8299 · 6 months ago
You are a hidden gem; great explanations of both the theoretical and the technical concepts.
@yonistoller1 · 9 months ago
Thank you so much for sharing this, it was really well done!
@tljstewart · 11 months ago
Great content as usual! Thanks.
@pi5549 · 11 months ago
Might you consider creating a Discord guild? I'd love to hang out with the people who are watching these videos!
@umarjamilai · 11 months ago
Hi! I am considering it; I will let you know with a public post when it's online 🤖🦾
@FireFly969 · 3 months ago
Yep, such great people.
@Umar-Ateeq · 1 month ago
Great idea, man!!
@GrifinsBrother · 7 months ago
Incredible explanation!
@wilfredomartel7781 · 8 months ago
Amazing work, Umar.
@renanangelodossantos4726 · 4 months ago
Excellent! I would like to see the series continue with LLaVA.
@jiaxingyu8300 · 10 months ago
Thank you so much for sharing!
@n.8642 · 1 month ago
Thanks! I learned a lot from your excellent video.
@justcars2454 · 3 months ago
It's an honor to be among the 23,500 viewers who have watched this video. Thank you so much, Umar Jamil, for your content.
@jasonzhai2584 · 3 months ago
Thanks for the amazing tutorial! As a student, I found it very clear to follow. Just a minor issue to point out: in the illustration of RoPE, the efficient-computation equation for the complex frequencies (Eq. 34 in the original paper) should have $-x_{d}$ and then $x_{d-1}$ as the last two terms of the third matrix. The video shows $-x_{d-1}$ and then $x_{d}$; probably just a typo reversing the order of the subscripts.
@saima6759 · 2 months ago
This is hardcore machine learning engineering!
@Patrick-wn6uj · 4 months ago
55:44 "I could have also written the code and not told you anything, but I like to give proof to what I do." Wow, thank you for going the extra mile; we really appreciate it.
@ehsanzain5999 · 10 months ago
Thank you very much, Umar, for the efforts here. One question: will there be PPO and fine-tuning on top of this in the next videos?
@modaya3382 · 9 months ago
Thank you very much for your efforts.
@oiooio7879 · 11 months ago
Great video, very educational.
@hussainshaik4390 · 11 months ago
Great content!
@马国鑫 · 8 days ago
Thanks! I learned a lot from your excellent video.
@atanuchowdhury6582 · 8 months ago
Awesome work, boss.
@tarequeovi4051 · 11 months ago
Great content.
@RayGuo-bo6nr · 8 months ago
Thanks! 谢谢你! (Thank you!)
@stsouko · 11 months ago
Wow. Now I get this trick.
@user-yf7qv8zj6y · 11 months ago
This is the way!
@SumanGameDev · 4 months ago
Oh boy, this is an amazing video!
@user-vh5ni1gs3w · 5 months ago
Great video ❤
@mohammadyahya78 · 1 month ago
Amazing.
@PaoloTshiyole · 6 months ago
Great video! What about the dataset used in this video?
@zz79ya · 2 months ago
Thanks for your lecture; I have a question. What happens if start_pos is longer than the query cache size? If this code does not handle that situation, what kind of additional modification would we need?
@shamimibneshahid706 · 6 months ago
Hi, I want to fine-tune the model. In that case, will it be necessary to get rid of the KV caching?
@zhenfutaofang2534 · 8 months ago
Does anyone know how to run the code on a 4090 GPU with CUDA? I'm facing an out-of-memory error.
@hautran-uc8gz · 4 months ago
Thank you.
@DiegoSilva-dv9uf · 7 months ago
Thanks!
@umarjamilai · 7 months ago
Thank you, Diego, for your support!
@jensenlwt · 1 month ago
Can somebody help explain why, when calculating theta, we are not including the -2, e.g. theta = theta ** (-2 * theta_numerator / head_dim)?
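On this sign question: the RoPE paper defines θ_i = 10000^(-2(i-1)/d). Implementations that precompute a numerator with `arange(0, head_dim, 2)` already have the factor 2i baked in, and taking the reciprocal (or, equivalently, using a negative exponent) supplies the minus sign. A small sketch of the equivalence (my own, assuming base 10000; not the repository's exact code):

```python
import torch

head_dim = 8
base = 10000.0

# 2i for i = 0, 1, ..., head_dim/2 - 1
theta_numerator = torch.arange(0, head_dim, 2).float()

# Form 1: reciprocal of a positive exponent, 1 / base^(2i/d)
freqs_a = 1.0 / (base ** (theta_numerator / head_dim))

# Form 2: explicit negative exponent, base^(-2i/d)
freqs_b = base ** (-theta_numerator / head_dim)

# Both match theta_i = base^(-2i/d); no separate -2 factor is needed.
assert torch.allclose(freqs_a, freqs_b)
```

So the -2 is not missing; half of it lives in the step-2 `arange` and the sign lives in the reciprocal.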
@umarjamilai · 11 months ago
As always, the PDF slides and the source code are available on GitHub: github.com/hkproj/pytorch-llama/ Prerequisites: 1) Transformer explained: kzfaq.info/get/bejne/mKmqZ7J-ytOnk3U.html 2) LLaMA explained: kzfaq.info/get/bejne/g9SPbLpi06mqfKM.html
@hussainshaik4390 · 10 months ago
Thanks.
@umarjamilai · 10 months ago
Thank you for your support!
@adatalearner8683 · 3 months ago
Why is the context window size limited? Is it because these models are based on transformers, and for a given transformer architecture, long-distance semantic relationship detection is bounded by the context length?
@skanderbegvictor6487 · 6 months ago
I tried loading the model on an M1 Mac with 8 GB of RAM, but it seems to require more memory (I am guessing 28 GB).
@tharunbhaskar6795 · 5 months ago
What are the system requirements for running inference with this model? By the way, it's a great video.
@adatalearner8683 · 3 months ago
Let's say an LLM application has a context window of 4,000 words and also supports chat history. A user can then effectively send more than the allowed number of words over the course of a conversation and still get answers related to the earlier history. How does this work?
@edoziemenyinnaya7637 · 9 months ago
Please, can we get the training code too?
@IRFANSAMS · 5 months ago
Can I use the open-source LLaMA 2 model indefinitely, or can I code along with you and use the model?
@wilfredomartel7781 · 8 months ago
🎉🎉
@sharjeel_mazhar · 1 month ago
Umar bhai, your tutorials on transformer architectures and open-source LLMs are truly remarkable. As a Pakistani, I find your deep-learning expertise incredibly inspiring. Have you ever considered creating Urdu versions of your content? It could make your valuable knowledge accessible to a wider audience. Your contributions are invaluable to the global tech community. Keep up the fantastic work! Huge fan of your work. May ALLAH bless you with health and success!
@azain47 · 3 days ago
He's Italian; I doubt he knows Urdu.
@sharjeel_mazhar · 3 days ago
@@azain47 Oh, I thought he was Pakistani, but never mind. It's really good to see a Muslim working and sharing his knowledge and expertise with the rest of the world; we generally don't see many Muslims making great CS content.
@azain47 · 3 days ago
@@sharjeel_mazhar Be the change you wish to see, brother.
@feixyzliu5432 · 6 months ago
Wouldn't it be 'cur_pos - 1' for the start_pos argument (line 81 in inference.py, 2:45:58)?
@mikeliu8533 · 5 months ago
Agreed.
@user-xt7bu8sz7j · 5 months ago
Watch again.
@mathlife5495 · 10 months ago
A suggestion for all your videos: increase the font size or the zoom level. They are kind of unreadable.
@umarjamilai · 10 months ago
Thanks for your feedback! I'll keep that in mind 🤗
@edoziemenyinnaya7637 · 9 months ago
Do you have a Discord channel?
@Rookie_AI · 7 months ago
Where do you apply the causal mask?
@Rookie_AI · 7 months ago
And the sliding-window attention? Thank you.
@feixyzliu5432 · 6 months ago
A causal mask is not needed, since the KV cache is used.
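To illustrate the point in this reply: with a KV cache, each decoding step feeds a single new token whose query attends over all cached positions, all of which are past or current, so there are no future positions to mask out. A minimal single-head sketch (my own simplification; real implementations batch this and use multiple heads):

```python
import math
import torch

def attend_with_cache(q, k_cache, v_cache, k_new, v_new):
    """One incremental decoding step: q is the single new token's query (1, d),
    k_cache/v_cache hold all previous positions (seq, d)."""
    # Append the new token's key/value to the cache.
    k_cache = torch.cat([k_cache, k_new], dim=0)  # (seq + 1, d)
    v_cache = torch.cat([v_cache, v_new], dim=0)
    # The single query attends over everything in the cache: all past
    # positions plus itself. No future entries exist, hence no mask.
    scores = (q @ k_cache.T) / math.sqrt(q.shape[-1])  # (1, seq + 1)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v_cache, k_cache, v_cache
```

A mask is still needed when several query positions are processed at once (e.g. the prompt prefill), since later queries must not see earlier queries' successors; it is only the one-token-at-a-time decoding loop that makes the mask redundant.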
@spencerfunk6697 · 1 month ago
Please do Mistral.
@coolguy69235 · 8 months ago
Is LLaMA 2 an encoder-only or a decoder-only model?
@umarjamilai · 8 months ago
People call it "decoder-only" because it resembles the decoder of the Transformer, but it lacks the cross-attention. Technically it's the encoder of the Transformer plus a linear layer and a softmax. But commonly, people call LLaMA a "decoder-only" model and BERT an "encoder-only" model.
@coolguy69235 · 8 months ago
@@umarjamilai Thanks a lot for your prompt reply, and amazing video!
@user-yf5wy7qk9r · 8 months ago
We need one more video explaining how to download the weights and run inference, because it is not clear.
@umarjamilai · 8 months ago
Hi! To download the LLaMA weights, you need to request access using the following link: ai.meta.com/resources/models-and-libraries/llama-downloads/ Meta will send you an email with the details on how to download the model.
@wd25548 · 4 months ago
Great video! One question though: in kzfaq.info/get/bejne/pbNkidCgxsiocX0.htmlsi=TBFoV5Kj0lnbNaee&t=4272, why does the "gamma" have to be all ones? I compared the code with and without self.weight, and the outputs are the same.
@wd25548 · 4 months ago
Oh, forgive the dumb question. For anyone else wondering about it: self.weight is learnable.
@feixyzliu5432 · 6 months ago
Thank you for the wonderful lecture. I'm wondering why you use torch.matmul / transpose in the video but torch.einsum in the slides. They are mathematically equivalent, but what about their efficiency? Which one runs faster?
@forresthu6204 · 6 months ago
Thanks!
@umarjamilai · 6 months ago
谢谢你! (Thank you!)