LLaMA explained: KV-Cache, Rotary Positional Embedding, RMS Norm, Grouped Query Attention, SwiGLU

52,881 views

Umar Jamil

A day ago

Full explanation of the LLaMA 1 and LLaMA 2 models from Meta, including Rotary Positional Embeddings, RMS Normalization, Multi-Query Attention, KV-Cache, Grouped Multi-Query Attention (GQA), the SwiGLU activation function and more!
I also review the Transformer concepts that are needed to understand LLaMA and everything is visually explained!
As always, the PDF slides are freely available on GitHub: github.com/hkproj/pytorch-lla...
Chapters
00:00:00 - Introduction
00:02:20 - Transformer vs LLaMA
00:05:20 - LLaMA 1
00:06:22 - LLaMA 2
00:06:59 - Input Embeddings
00:08:52 - Normalization & RMSNorm
00:24:31 - Rotary Positional Embeddings
00:37:19 - Review of Self-Attention
00:40:22 - KV Cache
00:54:00 - Grouped Multi-Query Attention
01:04:07 - SwiGLU Activation function

Comments: 147
@umarjamilai 7 months ago
As many of you have asked: LLaMA 2's architecture is made up of the ENCODER side of the Transformer plus a Linear Layer and a Softmax. It can also be thought of as the DECODER of the Transformer, minus the Cross-Attention. Generally speaking, people call a model like LLaMA a Decoder-only model, while a model like BERT an Encoder-only model. From now on I will also stick to this terminology for my future videos.
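To make the pinned comment concrete, here is a minimal sketch of that "decoder-only" stack: self-attention blocks without cross-attention, followed by a final linear layer and a softmax over the vocabulary. This is not the repository's code and it uses stock PyTorch layers (LayerNorm instead of LLaMA's RMSNorm, no rotary embeddings); names and hyperparameters are illustrative only.

```python
import torch
import torch.nn as nn

class DecoderOnlyLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_heads=8, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Each block is self-attention + feed-forward only: no cross-attention.
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)   # the extra Linear layer

    def forward(self, tokens):                          # tokens: (batch, seq_len)
        x = self.embed(tokens)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(x, mask=causal_mask)            # causal mask gives the "decoder" behaviour
        return self.lm_head(x)                          # logits; softmax is applied when sampling
```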
@aiyogiravi 5 months ago
Yeah, it makes sense now. Since we are not encoding anything to be used for cross-attention later, we call this model a Decoder-only model. Edit: really appreciate the effort you are putting in. Great channel :)
@haiphan980 2 months ago
Great video about LLaMA! I have one question regarding the inference steps. How does the input [SOS] predict "Love" at the beginning, when the model has no information about the input sentence? In the Transformer we have an encoder which encodes the whole input sentence before it goes to the decoder, providing the conditioning, but in LLaMA we don't have that. [SOS] could predict any next word, so how does it know that it is "Love"?
@kanakraj3198 A month ago
@haiphan980 That was just for illustration; he didn't show the prompt part. The model will first take your prompt and perform self-attention on it, and only then start predicting, so it has information from your prompt on how to start.
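A minimal greedy-decoding sketch of what this reply describes (the `model` here is any causal LM mapping token IDs to logits; the helper name is hypothetical): the whole prompt is fed in first, so the position that predicts the next token already attends to every prompt token.

```python
import torch

@torch.no_grad()
def greedy_generate(model, prompt_ids: torch.Tensor, max_new_tokens: int = 20) -> torch.Tensor:
    tokens = prompt_ids                                   # (1, prompt_len), starting with the BOS/[SOS] token
    for _ in range(max_new_tokens):
        logits = model(tokens)                            # (1, seq_len, vocab_size); the prompt is attended to here
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # the prediction comes from the LAST position
        tokens = torch.cat([tokens, next_id], dim=1)
    return tokens
```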
@kqb540 29 days ago
Umar, Andrew Ng, 3Blue1Brown and Andrej are all you need. You are one of the best educators of deep learning. Thank you.
@mandarinboy 10 months ago
The best 1 hour I spent! I had so many questions exactly on all these topics and this video does an outstanding job at explaining enough details in an easy way!
@umarjamilai 10 months ago
Glad you liked it! I just posted another video on how to code LLaMA 2 from scratch! Check it out
@librakevin1983 8 days ago
The best machine learning videos I've ever watched. Thanks Umar!
@umarjamilai 10 months ago
As always, the PDF slides are freely available on GitHub: github.com/hkproj/pytorch-llama-notes/
@UnknownHuman11110 5 days ago
Amazing video! Thanks for taking the time to explain core new concepts in language models.
@muthukumarannm398 10 months ago
I became your fan at 55:00, when you explain how GPU capability drives the development. 🙂
@dgl3283 2 months ago
This video could be the official textbook of the LLaMA architecture. Amazing.
@Jc-jv3wj 2 months ago
Fantastic explanation of the LLaMA model. Please keep making videos like this.
@mojtabanourani9988 10 months ago
The network came out in Feb 2023! This is a YouTube channel worth subscribing to. Thanks man
@jordanconnolly1046 10 months ago
Really glad I found your channel. You create some of the most in-depth and easy-to-follow explanations I've been able to find.
@ravimandliya1881 10 months ago
Such an amazing step by step breakdown of concepts involved! Thank you so much.
@TheMzbac 6 months ago
Very underrated video. Thanks for providing such a good lecture to the community
@Paluth 10 months ago
Thank you very much for your work. The community is blessed with such high quality presentations about difficult topics.
@user-ue7en6dw9p 6 months ago
This video, along with the previous ones about coding up transformers from scratch, are really outstanding. Thank you so much for taking such a tremendous amount of your free time to put all of this together!
@danish5326 4 months ago
AMAZING! AMAZING AMAZING! Great work Umar .. Thanks a ton
@tubercn 8 months ago
Thanks for giving up your free time to offer this valuable tutorial 👏👏👏 Hope you keep doing this, thanks again.
@cobaltl8557 4 months ago
This intro to llama is awesome ❤, thank you for making such a great video.
@Best9in 9 months ago
Thank you very much! I once read the paper, but I think watching your video provided me with more insights about this paper than reading it many more times would have.
@abhishekshinde-jb5pn 6 months ago
Your videos are the best man !! Please keep releasing as much content as possible, on the famous papers
@user-nf7oy2zw8n 10 months ago
Fantastic video! Your explanations are very clear, thank you!
@Tensordroid 3 months ago
One of the best explanations on youtube right now !!
@hieungo770 9 months ago
Please keep making content like this. Thank you very much. I learnt a lot.
@Angadsingh95 10 months ago
Thank you for creating such quality content!
@NJCLM 5 months ago
I didn't even notice the time pass! Great work, you are a future rock star at teaching complex things in ML.
@parmanandchauhan6182 A month ago
Great content, deep understanding after watching the video.
@siqb 4 months ago
My TLDR for the video (please point out the mistakes):
- LLaMA uses RMS normalization instead of LayerNorm because it provides the same benefits with less computation.
- LLaMA uses rotary embeddings. These act as a distance-based scaling on the dot product coming out of the queries and keys: two tokens that are close together produce a larger score than the same two tokens placed far apart. This makes sense because closer tokens should have a bigger say in the final representation of a given token than the ones far away, which is not the case for the vanilla transformer.
- LLaMA uses Grouped Query Attention as an alternative to vanilla attention, mostly to optimize GPU FLOPs (and the much slower memory access). Key slide at 1:03:00. In vanilla attention, each head has its own query, key and value vector for every token. In multi-query attention (MQA), there is only one key/value head for all the query heads. In between lies GQA, where a small group of query heads (say 2-4) is mapped to one key/value head.
- LLaMA uses the SwiGLU activation function since it works better.
- LLaMA uses 3 weight matrices instead of 2 in the FFN part of the block, but keeps the number of parameters roughly the same.
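For the first bullet, here is a minimal RMSNorm sketch (a paraphrase of the formula shown in the video, not Meta's code): there is no mean subtraction and no bias, only a root-mean-square rescaling followed by a learned gain, which is what makes it cheaper than LayerNorm.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))        # learned per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x / sqrt(mean(x_i^2) + eps): no mean or variance pair to compute, unlike LayerNorm
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```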
@GrifinsBrother 6 months ago
Incredible explanation! You have a real predisposition for explaining material. Keep going!
@Engrbilal143 10 months ago
Amazing. Just wow. I cannot find this stuff anywhere else on the whole internet.
@meili-ai 10 months ago
Very good explanation! Keep up the good work!
@saratbhargavachinni5544 9 months ago
Thanks a lot, Great explanation of KV cache and Multi Query Attention.
@TheAero 10 months ago
You actually explained attention here better than in the previous presentation!
@umarjamilai 10 months ago
Keep up with the journey! Watch my other video on how to code LLaMA 2 from scratch and you'll put into practice what you've learnt here.
@mickelliu5559 10 months ago
It's surprising that content of this quality is free.
@ahmetfirat23 10 months ago
Very informative video, the details are clearly explained. Thanks a lot.
@goelnikhils 9 months ago
Exceptional Video on LLaMA
@MarcosVinicius-bd6bi 3 months ago
Fantastic video, thanks Umar!
@TianyiZhang-ns8pg A month ago
Thanks a lot. Looking forward to your next video!!!
@user-iq4hf7wf5q 7 months ago
Great video! Worth spending the time to go over it again and again. I actually saw your video on a Chinese site (it has probably been re-posted by many other people already), then I came here to find the author. Excellently explained, thanks for sharing!
@umarjamilai 7 months ago
Which Chinese site? I'm a bit curious to take a look 😁
@feixyzliu5432 5 months ago
@umarjamilai Bilibili
@Tomcat342 5 months ago
You are doing god's work. Keep it up, and thank you.
@ethanhe42 9 months ago
great illustration!
@georgealexandruvlad7837 9 months ago
Great explanation! 👌🏻
@user-fg9gc7dk2n 8 months ago
Great video. Thanks, 小乌!
@umarjamilai 8 months ago
Thank you 😸
@goelnikhils 9 months ago
Amazing explanation
@saranyav2581 A month ago
Thank you for the amazing explanation
@siddharthasubramaniyam 9 months ago
Great work man🙌
@satviknaren9681 2 months ago
Thank you for posting. Thank you for existing. Thank you.
@Vignesh-ho2dn 3 months ago
Thanks for the very resourceful video
@haocongzhan1806 6 months ago
You are doing a great job, thank you for tutoring me part by part! It helps a lot.
@umarjamilai 6 months ago
Are you in China? Let's connect on LinkedIn!
@haocongzhan1806 6 months ago
@umarjamilai Yes, I've already added you!
@BadmintonTV2008 6 months ago
really awesome work!
@charlesriggins7385 7 months ago
It's really useful. Thank you.
@kjkszpjab1510 3 months ago
Brilliant, thank you.
@berkk1993 9 months ago
Your videos are great. Please keep going.
@TrelisResearch 9 months ago
Great channel and content Umar
@EugenioDeHoyos 9 months ago
Thank you!
@user-ot4zz4pl9s 10 months ago
This is the way!
@mprone 3 months ago
Although your name deceived me at first, I had no doubt you were a fellow Italian!
@umarjamilai 3 months ago
You'd have to try the cadrega trick on me to see if I'm really Milanese 😂😇
@mprone 3 months ago
@umarjamilai Have a seat there, Brambilla Jamil, make yourself comfortable, help yourself, take a cadrega. A nice little cadrega is never refused!
@umarjamilai 3 months ago
@mprone 😋 mmmhhh... this cadrega is tasty 🍎 Feel free to write to me on LinkedIn if you have any doubts or questions. Have a nice day!
@localscope6454 9 months ago
Beautiful, thx.
@Sisco404 3 months ago
Love your videos, and I also love the meowing cat in the background 😂
@user-lo1pk4sg5n 5 months ago
great explanation bro!!!
@xujiacao6776 2 months ago
This video is great!
@baiyouheng5365 4 months ago
Good content, nice explanation, THANKSssss.
@vassilisworld 10 months ago
Another amazing video, Umar! You do know how to teach, for sure. It would be nice if you put the most influential papers to read into a repo! Did I hear a baby in the background? Also, given you are from Italy, there is a lovely video worth watching by Asianometry on 'Olivetti & the Italian Computer: What Could Have Been'. Thank you again for the hard work you put into this video.
@umarjamilai 10 months ago
Hello Vassilis! Thanks for the kind words and the suggestion! The voice in the background is from 奥利奥 (Oreo), our black and white cat 😺. Unfortunately I'm aware of Olivetti's history and what could have been. If you're curious, you should also check out Enrico Mattei, and what ENI could have been. Have a wonderful day! Hopefully in a few days I'll upload the video on how to code LLaMA from scratch
@satpalsinghrathore2665 6 months ago
Amazing video
@AntonyWilson 9 months ago
Thanks!
@cfalguiere 6 months ago
Thanks for sharing
@weihuahu8179 6 months ago
love it
@TIENTI0000 5 months ago
thank you a lot
@subhamkundu5043 10 months ago
This is great. Thank you. It would be very helpful if you could also create a video on hands-on coding of a LLaMA model, the way you did for the vanilla transformer. Thanks in advance.
@umarjamilai 10 months ago
It's coming soon. Stay tuned!
@weicheng4608 10 months ago
Same here. Eagerly waiting for a coding session on the LLaMA model.
@mohammadyahya78 19 days ago
amazing
@taltlusty6804 6 months ago
Great video!!! Thank you very much for enriching the community with such great explanations! Can you please share your slides?
@umarjamilai 6 months ago
Check the video description, there's a link.
@user-hm8gu2ze6w 5 months ago
Genius!
@user-td2sz3yh9b 9 months ago
Nice, man
@saurabh7337 2 months ago
Hello, many thanks for such great content. Really enjoyed your work. Sorry for being greedy, but may I request a video showing how a LLaMA model can be run effectively on local machines (with or without a GPU) for inference (say, with a custom Flask API)?
@hosseinhajipour7817 8 months ago
Thanks for the videos. There are a few errors in the video, which I mention below: 1 - LLaMA is a decoder-only model. 2 - The sizes of Q and K are the same; however, they are not the "same" tensor.
@umarjamilai 8 months ago
Hi! 1 - To be the decoder, it should have cross-attention, which it doesn't. The closest architecture is the encoder (the left side of the transformer model). People commonly call it "decoder-only" because we do not "encode" text into a latent representation, but rather just "generate text" from pre-trained embeddings. Technically, from an architecture point of view, it's more similar to an encoder, hence the name. 2 - Q, K and V have the same size and, in the vanilla transformer's self-attention, also the same content, since they come from the same input. In LLaMA, because of the KV cache and the positional encodings, which are applied only to Q and K, the content is different. Have a nice day!
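To make the second point concrete, here is a rough rotary-embedding sketch (assumed shapes, not the exact repo code): the rotation is applied to the query and key heads only, while V is left untouched (and is what gets cached as-is).

```python
import torch

def precompute_freqs_cis(head_dim: int, seq_len: int, theta: float = 10000.0) -> torch.Tensor:
    # One complex rotation e^{i * m * theta_j} per (position m, feature pair j).
    freqs = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), freqs)    # (seq_len, head_dim/2)
    return torch.polar(torch.ones_like(angles), angles)

def apply_rope(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim); rotate consecutive feature pairs as complex numbers.
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    rotated = x_c * freqs_cis[None, :, None, :]
    return torch.view_as_real(rotated).flatten(-2).type_as(x)

# Q and K get the rotation; V does not.
freqs = precompute_freqs_cis(head_dim=64, seq_len=10)
q_rot = apply_rope(torch.randn(1, 10, 8, 64), freqs)
k_rot = apply_rope(torch.randn(1, 10, 8, 64), freqs)
```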
@MENGRUWANG-qk1ip 2 months ago
Hi Umar! Your explanation is really excellent! By the way, LLaMA 3 has been released; will you cover LLaMA 3 in a video as well?
@vincentabraham7690 4 months ago
Hi Umar, I recently started studying LLMs and I loved your explanations of transformers and the LLaMA architecture. I wanted to know: is there any way to look at the attention weights of a model and gain insight into which specific portions of the text influenced the output prediction? Is there a beginner-friendly way to do this?
@ml.9106 3 months ago
Thanks! Super helpful. It seems your channel hasn't covered the GPT model architecture; would you be open to introducing that?
@samc6368 6 months ago
Great explanation for all levels. Question: did you say around 2:55 that it's an encoder-only architecture? I read that it's decoder-only.
@wilfredomartel7781 7 months ago
@moacirponti 3 months ago
Many thanks for the great video. From your explanation, LLaMA uses Rotary Positional Embedding (RoPE) as its positional encoding, applied to *each* Q and K vector just after the transformation by their respective W. In this case, I don't get what Relative Positional Encoding has to do with it (and why it was explained before Rotary PE). Is it because Rotary PE is connected to the relative positional method, or are both applied in the case of LLaMA?
@eitancohen8717 2 months ago
Hi, great explanations. By the way, is there any chance you could separately explain the torch.einsum operations shown in the code at 58:41?
@bipulbikramthapa8256 6 months ago
Great video +1. I have a few queries about the LLaMA model. 1. In the architecture diagram, does one Nx represent a single layer of LLaMA? 2. Could you also please clarify how many Nx are used for the LLaMA-2 13B model and its variants? 3. Finally, what is the potential for distributed computation in LLaMA model inference? What are the possible break points in the model from an architectural standpoint?
@DiegoSilva-dv9uf 6 months ago
Thanks!
@umarjamilai 6 months ago
Thank you very very very very much
@just4visit 7 months ago
Hey! Is there any way you can avoid the loud bang sound effect when the slide changes, please?
@yunhuaji3038 8 months ago
Thanks for the great video. At 48:56, you mentioned that we don't care about the previous attentions. Does that mean we will trim the attention tensor from (SeqLen x d_model) to (1 x d_model)? If so, does GPT do that trimming as well? I thought GPT uses the whole attention tensor (including the previous attention "vectors") to predict the next token. That seems redundant, but I wonder what they do exactly here. If they did use the whole attention tensor, does it mean that GPT inference needs to cache Q, K and V instead of just K and V? Thank you.
@umarjamilai 8 months ago
Hi! I think you misunderstood my words: I said "we don't need to recompute the dot products again", because they have already been computed in the previous steps. The dot products for which we "don't care" are the ones above the principal diagonal of the attention-scores matrix, because they're the ones masked out when we apply the causal mask (that is, we force each token to attend only to the tokens to its left). I suggest you watch my other video, in which we code LLaMA from scratch, to understand how this works in practice.
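A bare-bones, single-head KV-cache sketch of the idea discussed above (illustrative only, not the repo's code): at each decoding step only the newest token's query is formed, its key and value are appended to the cache, and attention runs against everything cached so far, so no earlier dot product is recomputed.

```python
import torch

def decode_step(x_new, wq, wk, wv, cache_k, cache_v):
    # x_new: (batch, 1, d_model) -- only the latest token enters the layer.
    q = x_new @ wq                                        # (B, 1, d)
    cache_k = torch.cat([cache_k, x_new @ wk], dim=1)     # (B, t, d): grows by one row per step
    cache_v = torch.cat([cache_v, x_new @ wv], dim=1)
    scores = (q @ cache_k.transpose(1, 2)) / cache_k.shape[-1] ** 0.5   # (B, 1, t): only the new row of dot products
    out = torch.softmax(scores, dim=-1) @ cache_v         # (B, 1, d)
    return out, cache_k, cache_v
```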
@yunhuaji3038 8 months ago
@umarjamilai Thanks for the fast reply. I actually should have pointed the timestamp at 50:47 instead, but I believe you already got my question, which is about the output attention tensor. Do you mean that we will cache the resulting attention tensor as well as the K-V pairs? (I will meanwhile go watch your LLaMA video.) Thanks again.
@yunhuaji3038 8 months ago
@umarjamilai Oh... I figured it out... thanks a lot
@subhamkundu5043 8 months ago
What is the size of the key and value matrices in grouped-query multi-head attention? In vanilla attention we were dividing them equally across the heads, right?
@tubercn 8 months ago
Yes, in vanilla attention we divide them equally. In Grouped Query Attention, the size of each key or value head is the same as that of a query head; the difference is the number of key/value heads.
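A small sketch of that grouping (assumed shapes, not the actual implementation): with n_q_heads query heads and n_kv_heads key/value heads, each K/V head is shared by n_q_heads // n_kv_heads query heads, for example by repeating it before the usual attention computation.

```python
import torch

def repeat_kv(kv: torch.Tensor, n_rep: int) -> torch.Tensor:
    # kv: (batch, seq_len, n_kv_heads, head_dim) -> (batch, seq_len, n_kv_heads * n_rep, head_dim)
    b, t, n_kv, hd = kv.shape
    return kv[:, :, :, None, :].expand(b, t, n_kv, n_rep, hd).reshape(b, t, n_kv * n_rep, hd)

k = torch.randn(2, 16, 2, 64)        # 2 KV heads, head_dim 64 (same head_dim as the queries)
k_shared = repeat_kv(k, 4)           # now lines up with 8 query heads
```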
@feixyzliu5432 5 months ago
I'm wondering why multi-head attention with a KV cache uses O(bnd^2) operations at 1:00:01. Isn't it O(bnd + bd^2)? Could you please explain this or give some references?
@kanakraj3198 A month ago
What I still don't understand is how the grouping happens in Grouped Multi-Query Attention. I also didn't understand the Rotary Positional Encoding concept, but I will re-watch or read more.
@dataflex4440 10 months ago
Wowwwwwwwwwwwwww
@serinevcim5390 23 days ago
I could not understand something: is the token 2 that we append to Q after the first inference step equal to attention 1?
@yw5uu4jo2c 9 months ago
Can you explain where the time complexity of O(bnd^2) comes from? I personally think it's wrong, since performing just one scaled dot-product attention should already make the quadratic term n^2 appear in the complexity formula, which is not the case here. To give you more detail on my calculation: considering we have a batch size of 1, one scaled dot-product attention already takes O(n^2 d + n d^2) computations with d_k = d_q = d_v, and since it's performed h times we can say it's O(h (n^2 d + n d^2)). Am I wrong somewhere?
@umarjamilai 9 months ago
Hi! The complexity calculation comes from the paper, in which they also highlight the assumptions made. For brevity, I didn't attach all the details of how it was computed, so I recommend having a look at the paper directly.
@npip99 4 months ago
49:10 "Since the model is causal, we don't care about the attention of a token with its successors" ~ I mean, a simpler explanation is also that the matrix is symmetric anyway, right? Like regardless of whether or not we care, it would be duplicated values.
@grownupgaming 9 months ago
1:01:42 Great video! What are these "Beam 1" and "Beam 4"?
@umarjamilai 9 months ago
Beam 1 indicates the greedy strategy for inference, while Beam 4 indicates "Beam search" with K = 4.
@ManelPiera 2 months ago
Is the Rotary Positional Embedding applied only in the first decoder layer or in every decoder layer?
@umarjamilai 2 months ago
Every Decoder layer.
@codevacaphe3763 26 days ago
Hi Umar, can I ask how you got the LLaMA architecture diagram? Which paper is it from?
@umarjamilai 25 days ago
I've built it myself by studying the code.
@codevacaphe3763 22 days ago
@umarjamilai Wow, that's amazing, thank you for sharing.
@Tubernameu123 5 days ago
How do I give you most of my money to make sure you keep making videos like this?
@umarjamilai 4 days ago
Share it on all social media you have. Best way to pay me ;-)
@manohar_marri 9 months ago
Isn't the embedding size 768, and 512 the max sequence length?
@abdulahmed5610 8 months ago
LLaMA is decoder-only, please check 2:50.
@umarjamilai 8 months ago
Hi! You can call it "Decoder-only" or "Encoder-only" interchangeably, because it's neither. To be a decoder, it should also have a cross-attention (which it lacks), to be an encoder it should not have a linear layer (which it does). So technically it can be an Encoder with a final linear layer or a Decoder without cross-attention. As a matter of fact, the "E" in BERT, which is also based on the Transformer model, stands for "Encoder". Have a nice day!
@amitshukla1495 7 months ago
LLaMA is a decoder-only model, right?
@umarjamilai 7 months ago
Please check my pinned comment in which I explain the difference.
@InquilineKea 4 months ago
What does a d_model of 512 mean?
@AbelShields 2 months ago
I think it's the dimension of the embedding vector