RoPE (Rotary positional embeddings) explained: The positional workhorse of modern LLMs

  20,093 views

DeepLearning Hero

1 day ago

Unlike sinusoidal embeddings, RoPE is well behaved and more resilient to predictions that exceed the training sequence length. Modern LLMs have already steered away from sinusoidal embeddings toward better alternatives like RoPE. Stay with me in the video to learn what's wrong with sinusoidal embeddings, the intuition behind RoPE, and how RoPE works.
Original Transformer paper: arxiv.org/pdf/1706.03762.pdf
RoPE paper: arxiv.org/pdf/2104.09864.pdf
Using interpolation for RoPE: arxiv.org/pdf/2306.15595.pdf
0:00 - Introduction
1:06 - Attention computation
1:51 - Token and positional similarity
2:52 - Vector view of query and key
4:52 - Sinusoidal embeddings
5:53 - Problem with sinusoidal embeddings
6:34 - Conversational view
8:50 - RoPE embeddings
10:20 - RoPE beyond 2D
12:36 - Changes to the equations
13:00 - Conclusion
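
To make the idea in the blurb above concrete, here is a minimal NumPy sketch of the rotation RoPE applies, following the formulation in the RoPE paper linked above. The `rope` helper and its arguments are illustrative names, not code from the video.

```python
# Minimal sketch of RoPE (illustrative helper following the RoPE paper's
# formulation; not code from the video).
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even dimension d) for sequence position `pos`.

    Dimensions are grouped into d/2 consecutive pairs; pair i is rotated by
    the angle pos * base^(-2i/d), so each pair turns at its own frequency.
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE expects an even embedding dimension"
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)      # per-pair frequencies
    angles = pos * theta                # rotation angle for each 2D pair
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin   # standard 2D rotation
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# Example: a 4D query rotated at positions 0 and 3
q = np.array([1.0, 0.0, 1.0, 0.0])
print(rope(q, pos=0))   # position 0 leaves the vector unchanged
print(rope(q, pos=3))   # each 2D pair is rotated by its own angle
```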

Comments: 30
@elliotstein165 3 months ago
Breaking up the visualisation of the 4D vector into 2x2D vectors is lovely - looks like a clock. A very intuitive notion for encoding position (in time)
@egeres14 10 months ago
This was incredibly well explained. Thank you for the effort you put into editing and publishing this video; it's been incredibly helpful.
@rajanghimire4022 11 months ago
Wow, this is by far the best video on this topic that I have come across. The information presented was clear, concise, and very informative. I learned a lot.
@anujlahoty8022 6 months ago
I loved the analogies and the concept is explained very beautifully!
@user-gk3ue1he4d 6 months ago
Great work! The best explanation I have ever seen for RoPE.
@kristophersmith4067 3 months ago
Keep it up. I really enjoy your teaching style and visualizations!
@1littlecoder 10 months ago
This is a great explanation and I'd quote it in my upcoming updates video!
@adityagulati1540 6 months ago
This video is highly under-rated! :D
@tejag8149 11 months ago
Great explanation. Looking forward to more such videos. Would appreciate some videos around Computer Vision and Diffusion too!
@sujantkumarkv5498 2 months ago
incredibly explained sensei.
@felipemello1151 2 months ago
Amazing video. Thank you
@sherinmuckatira8333 11 months ago
Nice explanation!
@1PercentPure 10 months ago
amazing, thank you so much!
@octour 11 months ago
Great video, thank you! It is really the only source, apart from the paper, that explains this in a very approachable manner. And Manim also helps a lot ;)
@octour 11 months ago
@deeplearninghero And you mention that the positional embedding is applied to the k and q vectors. Is that new with RoPE? Because I thought that in the Transformer architecture the positional embedding is added to the token embedding (which we get from the tokenizer), this summed vector goes into the encoder/decoder, where it is split into k, q and v, and inside the encoder/decoder no positional encoding is applied.
@deeplearninghero 11 months ago
Yes, it's a major change from sinusoidal embeddings to RoPE. As per RoPE's motivation, you need the positional distinction between q and k, so adding it there is ideal. :)
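As a rough illustration of that difference, here is a hedged sketch of where the position signal enters. The shapes and variable names are assumed for the example (not the video's code): the original Transformer adds a positional vector to the token embedding once before projection, whereas RoPE projects q and k first and then rotates each according to its position.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each consecutive 2D pair of x by an angle proportional to `pos`.
    d = x.shape[-1]
    i = np.arange(d // 2)
    ang = pos * base ** (-2.0 * i / d)
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
d, seq_len = 8, 5
tokens = rng.normal(size=(seq_len, d))   # token embeddings (no position added)
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))

# In the original Transformer, a sinusoidal positional vector would be added to
# `tokens` once, before projecting q/k/v (that path is omitted here).
# With RoPE, q and k are projected first and then each row is rotated
# according to its own position.
q = np.stack([rope(tokens[p] @ Wq, p) for p in range(seq_len)])
k = np.stack([rope(tokens[p] @ Wk, p) for p in range(seq_len)])

scores = q @ k.T / np.sqrt(d)   # attention logits; the rotations make them
                                # depend on the tokens' relative positions
```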
@mkamp 8 months ago
Great video. There is still a question left for me though. With traditional PEs, the positional vectors are very small and are added to the original input embeddings, so it is easy to see why the embeddings are still recognizable. But with RoPE, in your nice animations, the input embeddings are changed dramatically. How does the network learn that a dog embedding rotated by 180 degrees is still a dog?
@yourmomsboyfriend3337 7 months ago
Hi, I'm speaking as a bit of a newbie to this concept, but I was also curious about your question. From what I found, in the transformer's architecture the meaning of a word like "dog" within a given context is influenced by both its semantic embedding and its position in the sequence. The model is forced to learn these two pieces of information in conjunction, meaning it will see "dog" in many different positions and many different contexts. The other big thing is that transformers are not isolated word processors; the model processes the entire sequence when generating text, so even though the vector for "dog" is rotated, it's interpreted in the context of the surrounding words and their respective positional encodings. This is combined with the benefits of high dimensionality: as you add more and more dimensions, it becomes increasingly unlikely that the word "dog" could get rotated at any position to match any other word. Since the model processes sequences in parallel, it almost always has contextual information such as "walk" or "leash", etc. that teaches the model the original semantic meaning during training, regardless of how it is rotated.
@qinranqu 7 months ago
Very intuitive, @yourmomsboyfriend3337
@mkamp 6 months ago
@yourmomsboyfriend3337 Hey, took me a while to answer as I had to mull it over; still ongoing. Thanks for your answer. I suggest we change our PoV a little: instead of seeing the embedding as a whole, we look at the individual dimensions. Each dimension is rotated differently, so it would only be a few dimensions at a time, and even fewer important ones, that would be totally distorted by a 180 degree rotation. So most of the dimensions would still be recognizable? I am still not really sold though.
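A quick numeric check of that per-dimension intuition (assuming the usual base of 10000 and a small d for readability): only the low-index pairs rotate quickly, while the higher-index pairs turn much more slowly even at large positions, so much of the embedding stays close to its original direction.

```python
# Per-pair rotation angles at one position (assumed base 10000, d = 8).
import numpy as np

d, base, pos = 8, 10000.0, 100          # look at sequence position 100
i = np.arange(d // 2)
angles_deg = np.degrees(pos * base ** (-2.0 * i / d)) % 360
for pair, a in zip(i, angles_deg):
    print(f"pair {pair}: rotated by {a:7.2f} degrees at position {pos}")
```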
@yannickpezeu3419 10 months ago
Thanks!
@jordanfarr3157 11 months ago
Spectacular! Thank you so much for making this! Can I ask a very naive question as someone who is fairly new in this field? Are embeddings from LLMs, like the ones obtained through the OpenAI API, rotational? Do they take on the shape that you describe in the video or are they more positional? I currently use a vector database to compare language embeddings from Steam reviews, and that database utilizes a simple L2 Euclidean distance metric when making its comparisons. Are these concepts related?
@deeplearninghero 11 months ago
Hi, thanks for the question. It's a good question, not a naive one :) It's a bit difficult to know exactly what OpenAI is doing due to their secretive way of operating, but I can make an educated guess. There are three popular models: GPT-3, GPT-3.5 and GPT-4.
GPT-3 (arxiv.org/abs/2005.14165) - This came before the rotary embeddings paper, so I'm assuming it uses standard sinusoidal embeddings.
GPT-3.5 - I don't think there's a paper for this, so again I'm not sure if they use RoPE, but there's a good chance they do, as the timing is right.
GPT-4 (arxiv.org/pdf/2303.08774.pdf) - They do reference RoPE in their paper, so quite possibly GPT-4 is using RoPE.
But keep in mind these are speculative guesses, and unfortunately there's no way to tell which type of embedding was used by looking at the embeddings themselves.
@jordanfarr3157 11 months ago
@deeplearninghero That is absolutely fascinating! Thank you for such an insightful response. The two types of embeddings cannot be distinguished from one another if given a set of similar text inputs? I'm not saying I'm savvy enough to figure out how to accomplish something like that, but I suppose that surprises me. If the visual analogy of the clock holds, would similar inputs not read as somewhat similar "times"? I know that's underselling the complexity of working with high-dimensional embeddings. I guess the rotational nature of capturing information in this way has sparked my imagination as someone with a background in genetics.
@deeplearninghero 11 months ago
> If the visual analogy of the clock holds, would similar inputs not read as somewhat similar "times"?
That's an interesting analogy, thanks for sharing. But IMO, numerically it wouldn't hold. I guess you're suggesting that we can perhaps "count" how many times a vector passes through a certain point? The problem would be two-fold. 1) You wouldn't get exact overlaps, because these vectors are high dimensional and they may not be rotating in properly divisible pieces of the 360 degree circle. 2) Sinusoidal embeddings do something similar, so if you do this counting on an approximate basis, you'd still see sinusoidal embeddings passing through this point. So it would be difficult. And finally, being high dimensional, they are very hard to reason about (unfortunately).
@mattoh1468 6 months ago
A question about 9:28: when computing theta, if d=2, why does theta=1? Or did you mean that there is only one value for theta?
@kevinxu9562 10 months ago
Coming from Yacine
@ledescendantdeuler6927 10 months ago
plugged by kache
@HeyFaheem 7 months ago
Kind of dingboard?
Rotary Positional Embeddings: Combining Absolute and Relative
11:17
Efficient NLP
27K views
RoPE Rotary Position Embedding to 100K context length
39:56
code_your_own_AI
2.5K views
The math behind Attention: Keys, Queries, and Values matrices
36:16
Serrano.Academy
222K views
Gail Weiss: Thinking Like Transformers
1:07:12
Formal Languages and Neural Networks Seminar
13K views
Speculative Decoding: When Two LLMs are Faster than One
12:46
Efficient NLP
10K views
The animated Transformer: the Transformer model explained the fun way!
25:00
Mamba - a replacement for Transformers?
16:01
Samuel Albanie
246K views
Why Does Diffusion Work Better than Auto-Regression?
20:18
Algorithmic Simplicity
238K views
Positional encodings in transformers (NLP817 11.5)
19:29
Herman Kamper
1.9K views