Positional Encoding in Transformer Neural Networks Explained

38,736 views

CodeEmporium

Positional Encoding! Let's dig into it
ABOUT ME
⭕ Subscribe: kzfaq.info...
📚 Medium Blog: / dataemporium
💻 Github: github.com/ajhalthor
👔 LinkedIn: / ajay-halthor-477974bb
RESOURCES
[ 1🔎] Code for video: github.com/ajhalthor/Transfor...
[ 2🔎] My video on multi-head attention: • Multi Head Attention i...
[3 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
PLAYLISTS FROM MY CHANNEL
⭕ ChatGPT Playlist of all other videos: • ChatGPT
⭕ Transformer Neural Networks: • Natural Language Proce...
⭕ Convolutional Neural Networks: • Convolution Neural Net...
⭕ The Math You Should Know : • The Math You Should Know
⭕ Probability Theory for Machine Learning: • Probability Theory for...
⭕ Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: imp.i384100.net/MathML
📕 Calculus: imp.i384100.net/Calculus
📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
📕 Linear Algebra: imp.i384100.net/LinearAlgebra
📕 Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
📕 Python for Everybody: imp.i384100.net/python
📕 MLOps Course: imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): imp.i384100.net/NLP
📕 Machine Learning in Production: imp.i384100.net/MLProduction
📕 Data Science Specialization: imp.i384100.net/DataScience
📕 Tensorflow: imp.i384100.net/Tensorflow
TIMESTAMPS
0:00 Transformer Overview
2:23 Transformer Architecture Deep Dive
5:11 Positional Encoding
7:25 Code Breakdown
11:11 Final Coded Class
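For readers skimming the description: below is a minimal sketch of the kind of sinusoidal positional encoding class the video builds up to, following the formulation in the "Attention is all you need" paper. The exact class in the linked repo may differ; names like PositionalEncoding, d_model and max_sequence_length are used here only for illustration.

import torch

class PositionalEncoding(torch.nn.Module):
    # Sinusoidal positional encoding: sin on even embedding dimensions, cos on odd ones.
    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.d_model = d_model
        self.max_sequence_length = max_sequence_length

    def forward(self):
        even_i = torch.arange(0, self.d_model, 2).float()            # dimension indices 0, 2, 4, ...
        denominator = torch.pow(10000, even_i / self.d_model)        # 10000^(2i / d_model)
        position = torch.arange(self.max_sequence_length).reshape(-1, 1).float()
        even_PE = torch.sin(position / denominator)                  # values for even dimensions
        odd_PE = torch.cos(position / denominator)                   # values for odd dimensions
        stacked = torch.stack([even_PE, odd_PE], dim=2)              # interleave sin/cos per dimension pair
        return torch.flatten(stacked, start_dim=1, end_dim=2)        # shape (max_sequence_length, d_model)

pe = PositionalEncoding(d_model=6, max_sequence_length=10)
print(pe.forward().shape)  # torch.Size([10, 6])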

Comments: 87
@CodeEmporium · 1 year ago
If you think I deserve it, please do consider a like and subscribe to support the channel. Thanks so much for watching! :)
@mello1016 · 1 year ago
Amazing videos, but dude please remove your face from the thumbnails. It adds zero value and is distracting from choosing the content. Don't follow a herd, better represent something unique in there.
@LuizHenrique-qr3lt · 1 year ago
tks!!
@aryansoriginals · 1 year ago
I think you deserve it, big thank you to you
@wishIKnewHowToLove · 1 year ago
:)
@becayebalde3820 · 9 months ago
Shut up @mello1016, adding his face makes it less abstract. You can see there is a human behind it, and it makes it easier to focus.
@becayebalde3820 · 9 months ago
Man, you are awesome! I thought transformers were too hard and needed too much effort to understand. While I was willing to put in that much effort, your playlist has been extraordinarily useful to me. Thank you! I subscribed.
@sabzimatic · 8 months ago
Hands down!! You have put in sincere effort explaining crucial concepts in Transformers. Kudos to you! Wishing you the best!!
@CodeEmporium · 8 months ago
Thanks for the super kind words! Definitely more to come. In the middle of making a series on Reinforcement Learning now :)
@BenderMetallo · 1 year ago
"Attention is all you need!" Your tutorials are gold, thank you.
@CodeEmporium · 1 year ago
You are so welcome !
@user-wr4yl7tx3w · 1 year ago
Really enjoying this Transformer series.
@CodeEmporium · 1 year ago
Thanks so much for watching and commenting on them :)
@mihirchauhan6346 · 8 months ago
At 6:24, reason 1 (periodicity) for positional encoding was under-specified and could use more clarity: it was mentioned that a word pays attention to other words farther apart in the sentence using the periodicity property of the sine and cosine functions, in order to make the solution tractable. Is this mentioned in some paper, or can you cite a source? Thanks.
@judedavis92 · 1 year ago
Thanks for the great video! Loving this series!
@CodeEmporium · 1 year ago
Thanks so much for watching ! Hope you enjoy the rest :)
@DeepakKandel-go3ff · 11 months ago
Yes, Totally worth a like.
@shauryai · 1 year ago
Thanks for the detailed videos on Transformer concepts!
@CodeEmporium · 1 year ago
My pleasure :) Thank you for the support
@PravsAI · 8 months ago
One of the great explanations, Ajay, and happy to see Kannada words here! Looking forward to more videos like this :-) Kudos! Great work.
@XNexezX · 7 months ago
Dude, these videos are so nice. Starting my master's thesis on a transformer-based topic soon and this is really helping me learn the basics.
@CodeEmporium · 7 months ago
Perfect! Super glad you’re on this journey. The field is very fun :)
@SanjithKumar-xf4sg · 9 months ago
One of the best series for transformers😄
@pizzaeater9509 · 1 year ago
Most brilliant and simple to understand video
@CodeEmporium · 1 year ago
Haha thanks a lot :) I try
@kimrichies · 1 year ago
Your efforts are much appreciated
@CodeEmporium · 1 year ago
Thanks so much for watching :)
@paull923 · 1 year ago
Thx! Clear and concise!
@CodeEmporium · 1 year ago
Thanks so much
@superghettoindian01 · 1 year ago
As before, great work on this Transformer Series! Am trying to go through all your code / videos slowly so I make sure I'm fully absorbing it. Where I'm struggling / slowest right now is in my intuition behind some of these tensor operations with stack / concatenate. Do you have any recommendations for study material apart from the torch documentation?
@CodeEmporium · 1 year ago
Thanks so much! Hmm. Maybe hugging face has some good resources too. Aside from this, I’ll be making a playlist on the evolution of language models so some design choices become more intuitive. Hope you’ll stick around for that
@AbhishekS-cv3cr · 1 year ago
approved!
@caiyu538 · 9 months ago
Clear explanation. If I want to use a transformer for time series where the time is not evenly spaced, so there are irregular time points, how could I positionally encode these times for the transformer?
@sangabahati3545 · 1 year ago
Your videos are useful for me, congratulations on the excellent work. But I suggest you make a real demonstration video on multivariate time series forecasting or classification.
@user-pm9nt6xk3c · 11 months ago
Hey Ajay, first of all, this video is so well built that I will be recommending it to our data science, AI, and robotics clubs; your content is great and I can see the next Andrew Ng before me. Regardless, I do have a question: why must there be a max number of words in a transformer architecture? I don't fully understand the reason behind it, considering most of the operations in the first half don't require a fixed-length input, since this isn't your usual neural network layer. Do you mind explaining? Because I do feel like this is flying above my head.
@CodeEmporium · 11 months ago
Your words are too kind. And good question. What is fixed in this specific architecture is the maximum number of words in a sentence, not the number of words in a sentence. The remaining unused positions are filled with "padding tokens". This will become clearer when you watch the videos coding out the complete transformer in the playlist "Transformers from Scratch". We essentially do this so we can pass fixed-size vector inputs through every part of the transformer. That said, I have seen more recent implementations where the size is dynamic.
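To make the padding idea above concrete, here is a tiny illustrative sketch (not taken from the repo); token id 0 standing in for the padding token and max_sequence_length = 8 are assumptions for the example.

import torch

max_sequence_length = 8
sentence_ids = [101, 7592, 2088, 102]                        # a 4-token sentence (made-up ids)
padded = sentence_ids + [0] * (max_sequence_length - len(sentence_ids))
print(padded)                                                # [101, 7592, 2088, 102, 0, 0, 0, 0]

# A boolean mask lets attention ignore the padded positions.
mask = torch.tensor([tok != 0 for tok in padded])
print(mask)                                                  # True for real tokens, False for padding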
@srivatsa1193 · 1 year ago
Bro.. You are Awesome!
@CodeEmporium · 1 year ago
Nah you are awesome
@ChrisHalden007 · 1 year ago
Great video. Thanks
@CodeEmporium · 1 year ago
My pleasure
@ThinAirElon · 9 months ago
Theoretically, what does it mean to add the embedding vector and the positional vector?
@aliwaleed9173 · 1 year ago
Thanks for the information in this video; however, I think I have a misunderstanding. You said before that the vocabulary words go into the embedding vector, which is like a box of related words grouped together, but at the start of this video you said the words are first one-hot encoded and then passed to the positional encoding. So what I want to know is which scenario is right: 1) we take the word, look it up in the embedding space, then pass it to the positional encoder; or 2) we take the word, one-hot encode it, then send it to the positional encoder?
@ziki5993 · 1 year ago
wonderful video
@CodeEmporium · 1 year ago
Thank you so much! :D
@lexingtonjackson3657 · 4 months ago
I liked you already; now that I know you are a Kannadiga, I like you even more.
@Philippe_Rougier · 1 year ago
what a voice !!!
@ilyas8523 · 1 year ago
Great videos, especially the one where you explained what a transformer is. Besides YouTube, do you have a full-time job, or is this it? Just curious.
@CodeEmporium · 1 year ago
Thank you! And yep, I have a full-time job as a Machine Learning Engineer outside of YouTube :)
@user-kd7xd2gb5s · 3 months ago
I love your shit man, this was so useful. I actually understood this ML shit and now can be Elon Musk up in this LLM shit.
@FirstNameLastName-fv4eu · 17 days ago
Ajay you are starting a cult man!!! May God bless you.
@ajaytaneja111 · 1 year ago
Hi Ajay, isn't the purpose of positional encoding to figure out where the word is located in the sequence, which the attention mechanism then derives benefit from? Thanks... and again, great content, grateful.
@CodeEmporium · 1 year ago
Yes! The idea overall is to create meaningful embeddings for words that capture context. This is opposed to the traditional CBoW or Skip-gram word embeddings that don't quite get this context.
@andreytolkushkin3611 · 11 months ago
Hi, you probably won't see this since it's been 6 months since you posted the video, however: I'm trying to write code for handwritten mathematical expression recognition and am trying to recreate the BTTR model. In it they use a DenseNet as the transformer encoder and use "image positional encoding", which is supposed to be a 2D generalization of the sinusoidal positional encoding. What would be the logic behind the 2D image positional encoding? They do have code on GitHub but I have no idea how to interpret it; could you please help?
@neetpride5919 · 1 year ago
Is there an advantage to using one-hot encoding instead of an integer index encoding for the words? If we're gonna download a pre-existing word2vec dictionary and map each word to its word vector during the data preparation anyway, the one-hot encoding seems like it'd just create an unnecessarily large sparse matrix.
@CodeEmporium · 1 year ago
The idea here is we are not going to use a pre-existing word2vec for the transformer. Everything, including the embedding for every word, will be learned during training. An issue with word2vec is that the embeddings are fixed and don't necessarily capture word context very well. This concept was introduced in the paper that introduced ELMo, "Deep Contextualized Word Representations" (Peters et al., 2018). Would recommend giving it a read if you're interested.
@Slayer-dan · 1 year ago
Thanks a lot 💚
@CodeEmporium · 1 year ago
Super welcome
@LuizHenrique-qr3lt · 1 year ago
Hey Ajay, great video!! Congratulations, I'm learning a lot from you, thank you! Ajay, I have some doubts. The first is that I didn't quite understand the difference between max sequence length and d_model. For example, if I have texts of up to 50 tokens in size, that is, my largest text has up to 50 tokens, this would be my max sequence length. However, if my d_model were 10, my largest sequence would have to be divided into 5 pieces to be able to pass through the model because it only accepts 10 tokens at a time. Is my thinking correct?
@CodeEmporium · 1 year ago
The way you described sequence length = 50 is correct. It is the maximum number of tokens you can pass into your network at a time (the max number of words/subwords/characters). d_model is the embedding dimension. Models don't understand words, but they understand numbers. And so, you transform every token into some set of numbers (called a vector), and the number of numbers in this vector is d_model. Let's say d_model is 512 and also say we have the sentence "my name is ajay". The word "my" would be converted into a 512-dimensional vector, as would "name", "is" and "ajay". The idea of these vectors/embeddings is to get some dense numeric representation of the context of a word (so similar words are represented with vectors that are close to each other and dissimilar words are represented with vectors that are farther from each other).
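As a rough sketch of the shapes being described in that reply (the vocabulary size and token ids below are made up purely for illustration):

import torch

vocab_size, d_model = 10_000, 512
embedding = torch.nn.Embedding(vocab_size, d_model)

# "my name is ajay" -> 4 hypothetical token ids (out of a max sequence length of, say, 50)
token_ids = torch.tensor([[12, 845, 9, 4301]])   # shape (batch=1, sequence_length=4)
vectors = embedding(token_ids)
print(vectors.shape)                             # torch.Size([1, 4, 512]): one 512-dimensional vector per token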
@LuizHenrique-qr3lt · 1 year ago
@CodeEmporium Hm, OK, good answer. Now a doubt: if d_model is the dimension I put my tokens into, why don't some transformer models accept very long texts? For example, if I have max sequence length = 10 and d_model = 3, the phrase "my name is Ajay" would turn into 4 vectors with d_model dimensions each: my: [0, 0.2, 0.6], name: [0, 0.1, 0.11], is: [0.5, 0.2, 0.0], Ajay: [0, 0, 1]. Why can't I put very large sequences into my model? Why does d_model interfere with this?
@DevelopersHutt · 1 year ago
@LuizHenrique-qr3lt The max sequence length refers to the maximum number of tokens in a sequence, while d_model represents the dimensionality of the token embeddings. They serve different purposes in the Transformer model. The max sequence length determines the size of the input that can be processed at once, whereas d_model influences the complexity and expressive power of the model. In your example, with a max sequence length of 10, a longer sequence would need to be divided into smaller segments or chunks of 10 tokens to fit within the model's input limit; d_model does not limit how many tokens can be passed.
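A small sketch of that chunking idea, assuming a made-up max sequence length of 10 and a pretend list of 27 token ids:

max_sequence_length = 10
long_token_ids = list(range(27))  # stand-in for the token ids of a long document

chunks = [
    long_token_ids[i:i + max_sequence_length]
    for i in range(0, len(long_token_ids), max_sequence_length)
]
print([len(c) for c in chunks])   # [10, 10, 7]; the last chunk would then be padded up to 10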
@lorenzobianconi7724 · 1 year ago
Hi Ajay, thanks for your videos. Why are there 512 dimensions? Who established this number? And how can we count the 175B parameters in GPT-3? Can you make a video where you break down the whole process of a transformer in one clear shot, possibly not using translation but, for example, an answering task? Thanks, love your videos and your determination to spread knowledge.
@giacomomunda3359 · 9 months ago
512 is a hyperparameter. You can actually decide which dimension to use, but it has been shown that higher dimensions usually work better, since they are able to capture more linguistic information, e.g. semantics, syntax, etc. BERT, for instance, uses 768 dimensions and the OpenAI ada embeddings have 1536 dimensions.
@convolutionalnn2582 · 1 year ago
In the code for the final class, position goes from 1 to max sequence length, which includes both even and odd values. I thought we use cos for odd and sin for even, so why are all the positions from 1 to max sequence length passed in, meaning even ones are passed into cos and odd ones into sin?
@CodeEmporium · 1 year ago
I think I responded to this in another video where you asked this question. Hope that helped though :)
@convolutionalnn2582 · 1 year ago
@CodeEmporium Yeah, but you didn't answer it fully.
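One likely source of confusion in this thread is that the even/odd split in the paper refers to the embedding dimension index i, not to the position: every position is fed into both the sine and the cosine terms. A small sketch with made-up sizes, written in a masked-assignment style (the video's own class may compute it differently):

import torch

d_model, max_sequence_length = 6, 4
position = torch.arange(max_sequence_length).reshape(-1, 1).float()  # every position 0..3
i = torch.arange(0, d_model, 2).float()                              # dimension pairs 0, 2, 4
angle = position / torch.pow(10000, i / d_model)                     # shape (4, 3)

pe = torch.zeros(max_sequence_length, d_model)
pe[:, 0::2] = torch.sin(angle)   # sin fills the even embedding dimensions (0, 2, 4)
pe[:, 1::2] = torch.cos(angle)   # cos fills the odd embedding dimensions (1, 3, 5)
print(pe.shape)                  # torch.Size([4, 6]); every position appears in both the sin and cos terms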
@hermannangstl1904 · 1 year ago
From what I understood, each word/token is represented by a 512-dimensional vector. The values of this vector are modified by means of (self-)attention and positional encoding. What is a bit counter-intuitive for me is that the position a word/token appears in can be different in different sentences. For example, let's take the word "Ajay". (1) In this sentence it's in 4th position: "My name is Ajay". (2) In a different sentence it is in 1st position: "Ajay explains very well". So the positional encodings for the word "Ajay" vary; they might be different in each sentence. How can the network be trained, how can it learn, with such contradictory input data?
@CodeEmporium · 1 year ago
This is a good question. But it intuitively does make sense that the same word In different sentences can have different meanings. Take the word “grounded”. You can represent this as a 512 dimensional vector. But let’s say “grounded” occurs in 2 sentences: (1) The truth is grounded in reality (2) You’re grounded! Go to your room. In these examples, “grounded” has differing meanings and should hence have different vector representations. This is why we need surrounding context to understand word vectors individually. This is probably a lil hard to see with your example since “Ajay” is a proper noun. However, for non-proper nouns, context matters. I think you should take a look at the paper “Deep Contextualized word Representations” by Matthew Peters (2018). They more formally answer the question you are asking. This is the paper that introduced ELMo embedding. According to this paper, Turns out that using different vectors based on context really improved models on Part of Speech Tagging and Language modeling
@DevelopersHutt · 1 year ago
You've raised an important point. While it is true that the positional encoding for a word like "Ajay" can vary depending on its position in different sentences. Let's consider the word "Ajay" in two different sentences and see how the Transformer model handles it: (1) Sentence 1: "My name is Ajay." (2) Sentence 2: "Ajay explains very well." In both sentences, the word "Ajay" has different positions, but the Transformer model can still learn and make sense of it. Here's a simplified example of how it works: Input Encoding: Each word, including "Ajay," is initially represented by a 512-dimensional vector. Sentence 1: "Ajay" is represented as [0.1, 0.2, 0.3, ..., 0.4]. Sentence 2: "Ajay" is represented as [0.5, 0.6, 0.7, ..., 0.8]. Positional Encoding: The model incorporates positional encodings to differentiate the positions of words. Sentence 1: The positional encoding for the 4th position is [0.4, 0.3, 0.2, ..., 0.1]. Sentence 2: The positional encoding for the 1st position is [1.0, 0.9, 0.8, ..., 0.5]. Attention and Context: The Transformer's attention mechanism considers the positional encodings along with the input representations to compute contextualized representations. Sentence 1: The attention mechanism incorporates the positional encoding and input embedding of "Ajay" at the 4th position to capture its contextual information within the sentence. Sentence 2: Similarly, the attention mechanism considers the positional encoding and input embedding of "Ajay" at the 1st position in the context of the second sentence. By attending to different positions and incorporating positional encodings, the model can learn to associate the word "Ajay" with its specific context and meaning in each sentence. Through training on various examples, the model adjusts its weights and learns to generate appropriate representations for words based on their positions, allowing it to make meaningful predictions and capture the contextual relationships between words effectively.
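A tiny numeric sketch of the point made above: the same (made-up) token embedding combined with different positional encodings produces different inputs, which is what lets the model treat "Ajay" at position 3 differently from "Ajay" at position 0. Everything here is illustrative, not taken from the video's code.

import torch

torch.manual_seed(0)
d_model, max_sequence_length = 8, 16

ajay_embedding = torch.randn(d_model)            # one fixed, made-up embedding for "Ajay"

position = torch.arange(max_sequence_length).reshape(-1, 1).float()
i = torch.arange(0, d_model, 2).float()
angle = position / torch.pow(10000, i / d_model)
pe = torch.zeros(max_sequence_length, d_model)
pe[:, 0::2] = torch.sin(angle)
pe[:, 1::2] = torch.cos(angle)

input_at_pos_3 = ajay_embedding + pe[3]          # "My name is Ajay"
input_at_pos_0 = ajay_embedding + pe[0]          # "Ajay explains very well"
print(torch.allclose(input_at_pos_3, input_at_pos_0))  # False: same word, different positions, different inputs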
@balakrishnaprasad8928 · 1 year ago
Please make a detailed video series on the math for data science
@CodeEmporium · 1 year ago
I have made some math-in-machine-learning videos. Maybe check the playlist "The Math You Should Know" on the channel.
@LuizHenrique-qr3lt · 1 year ago
My second doubt is that when I use BertTokenizer, for example, it transforms the text "my name is ajay" into a list of integers, for example [101, 11590, 11324, 10124, 138, 78761, 102]. Where does that part go? I couldn't understand that part.
@CodeEmporium · 1 year ago
So I haven't shown the text encoding details just yet :) Since 4 words were encoded into 7 numbers, I assume the BertTokenizer is encoding each subword / word piece into some number. Essentially, the tokenizer is taking the sentence, breaking it down into word pieces (7 in this case), and each is being mapped to a unique integer number. Later on, you will see each number being mapped to a larger vector (I explained more details about why these vectors exist in the other comment).
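For anyone curious, that tokenization step can be reproduced roughly as follows, assuming the HuggingFace transformers package is installed; the checkpoint name is illustrative and the exact ids you get depend on which tokenizer you load:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my name is ajay")

# A list of integer ids, one per word piece, bracketed by the special [CLS] and [SEP] tokens.
print(encoded["input_ids"])
# The human-readable word pieces those ids correspond to.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))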
@wishIKnewHowToLove · 1 year ago
I like that your English is clean :) no disgusting non-Californian accent :)
@CodeEmporium · 1 year ago
Thank you for the compliments
@7_bairapraveen928 · 1 year ago
I am kind of a newbie here; if you think this is valid, please answer. Why are you introducing parameters of dimension 512 for the vocab size and making a neural network? I mean, what happens if we don't do that?
@CodeEmporium · 1 year ago
Why are we using 512 dimensions instead of the one-hot vector of size equal to the vocabulary size? This is because of the curse of dimensionality. Vocabulary sizes are huge (often in the tens of thousands). This is a lot for any model, neural network or not, to process. There was a 2001 paper by Yoshua Bengio, "A Neural Probabilistic Language Model", that describes exactly this issue and why dense embeddings were introduced. I would recommend giving it a read. Also, my next series will delve into the history of language models, so I hope you'll stay tuned for that. Maybe some of the design choices will become clearer.
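To put rough numbers on that comparison, here is an illustrative sketch (the vocabulary size and sequence length below are made up):

import torch

vocab_size, d_model, seq_len = 30_000, 512, 50

# One-hot: every token becomes a vocab_size-dimensional vector that is almost all zeros.
token_ids = torch.randint(vocab_size, (seq_len,))
one_hot = torch.nn.functional.one_hot(token_ids, vocab_size)
print(one_hot.shape)    # torch.Size([50, 30000])

# Learned embedding: the same tokens become dense 512-dimensional vectors.
embedding = torch.nn.Embedding(vocab_size, d_model)
dense = embedding(token_ids)
print(dense.shape)      # torch.Size([50, 512])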
@SAIDULISLAM-kc8ps · 1 year ago
Can you please tell me the difference between sequence length and the dimension of the embedding?
@CodeEmporium · 1 year ago
Sequence length = maximum number of characters/words we can pass into the transformer at a time. Dimension of embedding = size of vector representing each character / word.
@SAIDULISLAM-kc8ps · 1 year ago
@CodeEmporium Thanks a lot.
@joaogoncalves1149 · 11 months ago
I think that queen/king example is somewhat cherry picked, as the principle behind the analogy fails for many examples.
@aar953 · 7 months ago
There is one mistake that you are making. We are not taking a single output as input to the decoder, but all the previous outputs up to the current time step as input to the decoder.
@CodeEmporium · 7 months ago
Yea that’s correct from a practical standpoint. I dive into this when coding this out in the rest of this playlist “Transformers from scratch “. Hope those videos clear things up!
@aar953 · 7 months ago
@CodeEmporium Thanks for answering. I understand that you have to make a trade-off between simplicity and accuracy. Here, I just wanted to note that a little more complexity would have added quite a lot more accuracy. Your content is excellent!
@ShivarajKarki · 1 year ago
A Kannadiga's greetings to a fellow admirer of Kannada. Salutations to your treasure trove of knowledge.
@__hannibaalbarca__ · 1 year ago
I don't know anything about Python, but it looks extremely slow.