Positional Encoding in Transformer Neural Networks Explained

38,736 views

CodeEmporium

Positional Encoding! Let's dig into it
ABOUT ME
⭕ Subscribe: kzfaq.info...
📚 Medium Blog: / dataemporium
💻 Github: github.com/ajhalthor
👔 LinkedIn: / ajay-halthor-477974bb
RESOURCES
[ 1🔎] Code for video: github.com/ajhalthor/Transfor...
[ 2🔎] My video on multi-head attention: • Multi Head Attention i...
[3 🔎] Transformer Main Paper: arxiv.org/abs/1706.03762
PLAYLISTS FROM MY CHANNEL
⭕ ChatGPT Playlist of all other videos: • ChatGPT
⭕ Transformer Neural Networks: • Natural Language Proce...
⭕ Convolutional Neural Networks: • Convolution Neural Net...
⭕ The Math You Should Know : • The Math You Should Know
⭕ Probability Theory for Machine Learning: • Probability Theory for...
⭕ Coding Machine Learning: • Code Machine Learning
MATH COURSES (7 day free trial)
📕 Mathematics for Machine Learning: imp.i384100.net/MathML
📕 Calculus: imp.i384100.net/Calculus
📕 Statistics for Data Science: imp.i384100.net/AdvancedStati...
📕 Bayesian Statistics: imp.i384100.net/BayesianStati...
📕 Linear Algebra: imp.i384100.net/LinearAlgebra
📕 Probability: imp.i384100.net/Probability
OTHER RELATED COURSES (7 day free trial)
📕 ⭐ Deep Learning Specialization: imp.i384100.net/Deep-Learning
📕 Python for Everybody: imp.i384100.net/python
📕 MLOps Course: imp.i384100.net/MLOps
📕 Natural Language Processing (NLP): imp.i384100.net/NLP
📕 Machine Learning in Production: imp.i384100.net/MLProduction
📕 Data Science Specialization: imp.i384100.net/DataScience
📕 Tensorflow: imp.i384100.net/Tensorflow
TIMESTAMPS
0:00 Transformer Overview
2:23 Transformer Architecture Deep Dive
5:11 Positional Encoding
7:25 Code Breakdown
11:11 Final Coded Class
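For readers skimming the description: below is a minimal sketch of the kind of sinusoidal positional encoding class the video builds up to, following the formulation in the "Attention is all you need" paper. The exact class in the linked repo may differ; names like PositionalEncoding, d_model and max_sequence_length are used here only for illustration.

import torch

class PositionalEncoding(torch.nn.Module):
    # Sinusoidal positional encoding: sin on even embedding dimensions, cos on odd ones.
    def __init__(self, d_model, max_sequence_length):
        super().__init__()
        self.d_model = d_model
        self.max_sequence_length = max_sequence_length

    def forward(self):
        even_i = torch.arange(0, self.d_model, 2).float()            # dimension indices 0, 2, 4, ...
        denominator = torch.pow(10000, even_i / self.d_model)        # 10000^(2i / d_model)
        position = torch.arange(self.max_sequence_length).reshape(-1, 1).float()
        even_PE = torch.sin(position / denominator)                  # values for even dimensions
        odd_PE = torch.cos(position / denominator)                   # values for odd dimensions
        stacked = torch.stack([even_PE, odd_PE], dim=2)              # interleave sin/cos per dimension pair
        return torch.flatten(stacked, start_dim=1, end_dim=2)        # shape (max_sequence_length, d_model)

pe = PositionalEncoding(d_model=6, max_sequence_length=10)
print(pe.forward().shape)  # torch.Size([10, 6])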

Comments: 87
@CodeEmporium · 1 year ago
If you think I deserve it, please do consider a like and subscribe to support the channel. Thanks so much for watching! :)
@mello1016 · 1 year ago
Amazing videos, but dude please remove your face from the thumbnails. It adds zero value and is distracting from choosing the content. Don't follow a herd, better represent something unique in there.
@LuizHenrique-qr3lt · 1 year ago
tks!!
@aryansoriginals · 1 year ago
I think you deserve it, big thank you to you
@wishIKnewHowToLove · 1 year ago
:)
@becayebalde3820 · 9 months ago
Shut up @mello1016, adding his face makes it less abstract. You can see there is a human behind it, and it makes it easier to focus.
@becayebalde3820 · 9 months ago
Man, you are awesome! I thought transformers were too hard and needed too much effort to understand. While I was willing to put in that much effort, your playlist has been extraordinarily useful to me. Thank you! I subscribed.
@sabzimatic · 8 months ago
Hands down!! You have put in sincere effort explaining crucial concepts in Transformers. Kudos to you! Wishing you the best!!
@CodeEmporium · 8 months ago
Thanks for the super kind words! Definitely more to come. In the middle of making a series on Reinforcement Learning now :)
@BenderMetallo · 1 year ago
"Attention is all you need!" Your tutorials are gold, thank you.
@CodeEmporium · 1 year ago
You are so welcome !
@user-wr4yl7tx3w · 1 year ago
Really enjoying this Transformer series.
@CodeEmporium · 1 year ago
Thanks so much for watching and commenting on them :)
@mihirchauhan6346 · 8 months ago
At 6:24, reason 1 (periodicity) for positional encoding was under-specified and could use more clarity: it was mentioned that a word pays attention to other words farther apart in the sentence using the periodicity property of the sine and cosine functions, in order to make the solution tractable. Is this mentioned in some paper, or can you cite a source? Thanks.
@judedavis92 · 1 year ago
Thanks for the great video! Loving this series!
@CodeEmporium · 1 year ago
Thanks so much for watching ! Hope you enjoy the rest :)
@DeepakKandel-go3ff · 11 months ago
Yes, Totally worth a like.
@shauryai · 1 year ago
Thanks for the detailed videos on Transformer concepts!
@CodeEmporium · 1 year ago
My pleasure :) Thank you for the support
@PravsAI · 8 months ago
One of the great explanations, Ajay, and happy to see Kannada words here! Looking forward to more videos like this :-) Kudos! Great work.
@XNexezX · 7 months ago
Dude, these videos are so nice. Starting my master's thesis on a transformer-based topic soon and this is really helping me learn the basics.
@CodeEmporium · 7 months ago
Perfect! Super glad you’re on this journey. The field is very fun :)
@SanjithKumar-xf4sg · 9 months ago
One of the best series for transformers😄
@pizzaeater9509 · 1 year ago
Most brilliant and simple to understand video
@CodeEmporium · 1 year ago
Haha thanks a lot :) I try
@kimrichies · 1 year ago
Your efforts are much appreciated
@CodeEmporium · 1 year ago
Thanks so much for watching :)
@paull923 · 1 year ago
Thx! Clear and concise!
@CodeEmporium · 1 year ago
Thanks so much
@superghettoindian01 · 1 year ago
As before, great work on this Transformer Series! Am trying to go through all your code / videos slowly so I make sure I'm fully absorbing it. Where I'm struggling / slowest right now is in my intuition behind some of these tensor operations with stack / concatenate. Do you have any recommendations for study material apart from the torch documentation?
@CodeEmporium · 1 year ago
Thanks so much! Hmm. Maybe hugging face has some good resources too. Aside from this, I’ll be making a playlist on the evolution of language models so some design choices become more intuitive. Hope you’ll stick around for that
@AbhishekS-cv3cr · 1 year ago
approved!
@caiyu538 · 9 months ago
Clear explanation. If I want to use a transformer for time series where the time is not evenly spaced, so there are irregular time points, how could I positionally encode these times for the transformer?
@sangabahati3545 · 1 year ago
Your videos are useful for me, congratulations on the excellent work. But I suggest you make a real demonstration video on multivariate time series forecasting or classification.
@user-pm9nt6xk3c · 11 months ago
Hey Ajay, first of all, this video is so well built that I will be recommending it to our data science, AI, and robotics clubs; your content is great and I can see the next Andrew Ng before me. Regardless, I do have a question: why must there be a max number of words in a transformer architecture? I don't fully understand the reason behind it, considering most of the operations in the first half don't require a fixed-length input, since this isn't your usual neural network layer. Do you mind explaining? Because I do feel like this is flying above my head.
@CodeEmporium · 11 months ago
Your words are too kind. And good question. What is fixed in this specific architecture is the maximum number of words in a sentence, not the number of words in a sentence. The remaining unused positions are filled with "padding tokens". This will become clearer when you watch the videos coding out the complete transformer in the playlist "Transformers from Scratch". We essentially do this so we can pass fixed-size vector inputs through every part of the transformer. That said, I have seen more recent implementations where the size is dynamic.
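To make the padding idea above concrete, here is a tiny illustrative sketch (not taken from the repo); token id 0 standing in for the padding token and max_sequence_length = 8 are assumptions for the example.

import torch

max_sequence_length = 8
sentence_ids = [101, 7592, 2088, 102]                        # a 4-token sentence (made-up ids)
padded = sentence_ids + [0] * (max_sequence_length - len(sentence_ids))
print(padded)                                                # [101, 7592, 2088, 102, 0, 0, 0, 0]

# A boolean mask lets attention ignore the padded positions.
mask = torch.tensor([tok != 0 for tok in padded])
print(mask)                                                  # True for real tokens, False for padding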
@srivatsa1193 · 1 year ago
Bro.. You are Awesome!
@CodeEmporium · 1 year ago
Nah you are awesome
@ChrisHalden007 · 1 year ago
Great video. Thanks
@CodeEmporium · 1 year ago
My pleasure
@ThinAirElon · 9 months ago
Theoretically, what does it mean to add the embedding vector and the positional vector?
@aliwaleed9173 · 1 year ago
Thanks for the information in this video; however, I think I have a misunderstanding. You said before that the vocabulary words go into the embedding vector, which is like a box of related words grouped together, but at the start of this video you said the words are first one-hot encoded and then passed to the positional encoding. So what I want to know is which scenario is right: 1) we take the word, look it up in the embedding space, then pass it to the positional encoder; or 2) we take the word, one-hot encode it, then send it to the positional encoder?
@ziki5993 · 1 year ago
wonderful video
@CodeEmporium · 1 year ago
Thank you so much! :D
@lexingtonjackson3657 · 4 months ago
I liked you already; now that I know you are a Kannadiga, I like you even more.
@Philippe_Rougier · 1 year ago
what a voice !!!
@ilyas8523 · 1 year ago
Great videos, especially the one where you explained what a transformer is. Besides YouTube, do you have a full-time job, or is this it? Just curious.
@CodeEmporium · 1 year ago
Thank you! And yep, I have a full-time job as a Machine Learning Engineer outside of YouTube :)
@user-kd7xd2gb5s · 3 months ago
I love your shit man, this was so useful. I actually understood this ML shit and now can be Elon Musk up in this LLM shit.
@FirstNameLastName-fv4eu · 17 days ago
Ajay you are starting a cult man!!! May God bless you.
@ajaytaneja111 · 1 year ago
Hi Ajay, isn't the purpose of positional encoding to figure out where the word is located in the sequence, which the attention mechanism then derives benefit from? Thanks... and again, great content, grateful.
@CodeEmporium · 1 year ago
Yes! The idea overall is to create meaningful embeddings for words that capture context. This is opposed to the traditional CBoW or Skip-gram word embeddings that don't quite get this context.
@andreytolkushkin3611 · 11 months ago
Hi, you probably won't see this since it's been 6 months since you posted the video, however: I'm trying to write code for handwritten mathematical expression recognition and am trying to recreate the BTTR model. In it they use a DenseNet as the transformer encoder and use "image positional encoding", which is supposed to be a 2D generalization of the sinusoidal positional encoding. What would be the logic behind the 2D image positional encoding? They do have code on GitHub but I have no idea how to interpret it; could you please help?
@neetpride5919 · 1 year ago
Is there an advantage to using one-hot encoding instead of an integer index encoding for the words? If we're gonna download a pre-existing word2vec dictionary and map each word to its word vector during the data preparation anyway, the one-hot encoding seems like it'd just create an unnecessarily large sparse matrix.
@CodeEmporium · 1 year ago
The idea here is we are not going to use a pre-existing word2vec for the transformer. Everything, including the embedding for every word, will be learned during training. An issue with word2vec is that the embeddings are fixed and don't necessarily capture word context very well. This concept was introduced in the paper that introduced ELMo, "Deep Contextualized Word Representations" (Peters et al., 2018). Would recommend giving it a read if you're interested.
@Slayer-dan · 1 year ago
Thanks a lot 💚
@CodeEmporium · 1 year ago
Super welcome
@LuizHenrique-qr3lt · 1 year ago
Hey Ajay, great video!! Congratulations, I'm learning a lot from you, thank you! Ajay, I have some doubts. The first is that I didn't quite understand the difference between max sequence length and d_model. For example, if I have texts of up to 50 tokens in size, that is, my largest text has up to 50 tokens, this would be my max sequence length. However, if my d_model were 10, my largest sequence would have to be divided into 5 pieces to be able to pass through the model because it only accepts 10 tokens at a time. Is my thinking correct?
@CodeEmporium · 1 year ago
The way you described sequence length = 50 is correct. It is the maximum number of tokens you can pass into your network at a time (the max number of words/subwords/characters). d_model is the embedding dimension. Models don't understand words, but they understand numbers. And so, you transform every token into some set of numbers (called a vector), and the number of numbers in this vector is d_model. Let's say d_model is 512 and also say we have the sentence "my name is ajay". The word "my" would be converted into a 512-dimensional vector, as would "name", "is" and "ajay". The idea of these vectors/embeddings is to get some dense numeric representation of the context of a word (so similar words are represented with vectors that are close to each other and dissimilar words are represented with vectors that are farther from each other).
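As a rough sketch of the shapes being described in that reply (the vocabulary size and token ids below are made up purely for illustration):

import torch

vocab_size, d_model = 10_000, 512
embedding = torch.nn.Embedding(vocab_size, d_model)

# "my name is ajay" -> 4 hypothetical token ids (out of a max sequence length of, say, 50)
token_ids = torch.tensor([[12, 845, 9, 4301]])   # shape (batch=1, sequence_length=4)
vectors = embedding(token_ids)
print(vectors.shape)                             # torch.Size([1, 4, 512]): one 512-dimensional vector per token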
@LuizHenrique-qr3lt · 1 year ago
@CodeEmporium Hm, OK, good answer. Now a doubt: if d_model is the dimension I put my tokens into, why don't some transformer models accept very long texts? For example, if I have max sequence length = 10 and d_model = 3, the phrase "my name is Ajay" would turn into 4 vectors with d_model dimensions each: my: [0, 0.2, 0.6], name: [0, 0.1, 0.11], is: [0.5, 0.2, 0.0], Ajay: [0, 0, 1]. Why can't I put very large sequences into my model? Why does d_model interfere with this?
@DevelopersHutt · 1 year ago
@LuizHenrique-qr3lt The max sequence length refers to the maximum number of tokens in a sequence, while d_model represents the dimensionality of the token embeddings. They serve different purposes in the Transformer model. The max sequence length determines the size of the input that can be processed at once, whereas d_model influences the complexity and expressive power of the model. In your example, with a max sequence length of 10, a longer sequence would need to be divided into smaller segments or chunks of 10 tokens to fit within the model's input limit; d_model does not limit how many tokens can be passed.
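A small sketch of that chunking idea, assuming a made-up max sequence length of 10 and a pretend list of 27 token ids:

max_sequence_length = 10
long_token_ids = list(range(27))  # stand-in for the token ids of a long document

chunks = [
    long_token_ids[i:i + max_sequence_length]
    for i in range(0, len(long_token_ids), max_sequence_length)
]
print([len(c) for c in chunks])   # [10, 10, 7]; the last chunk would then be padded up to 10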
@lorenzobianconi7724 · 1 year ago
Hi Ajay, thanks for your videos. Why are there 512 dimensions? Who established this number? And how can we count the 175B parameters in GPT-3? Can you make a video where you break down the whole process of a transformer in one clear shot, possibly not using translation but, for example, an answering task? Thanks, love your videos and your determination to spread knowledge.
@giacomomunda3359 · 9 months ago
512 is a hyperparameter. You can actually decide which dimension to use, but it has been shown that higher dimensions usually work better, since they are able to capture more linguistic information, e.g. semantics, syntax, etc. BERT, for instance, uses 768 dimensions and the OpenAI ada embeddings have 1536 dimensions.
@convolutionalnn2582 · 1 year ago
In the code for the final class, position goes from 1 to max sequence length, which includes both even and odd values. I thought we use cos for odd and sin for even, so why are all the positions from 1 to max sequence length passed in, meaning even ones are passed into cos and odd ones into sin?
@CodeEmporium · 1 year ago
I think I responded to this in another video where you asked this question. Hope that helped though :)
@convolutionalnn2582 · 1 year ago
@CodeEmporium Yeah, but you didn't answer it fully.
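One likely source of confusion in this thread is that the even/odd split in the paper refers to the embedding dimension index i, not to the position: every position is fed into both the sine and the cosine terms. A small sketch with made-up sizes, written in a masked-assignment style (the video's own class may compute it differently):

import torch

d_model, max_sequence_length = 6, 4
position = torch.arange(max_sequence_length).reshape(-1, 1).float()  # every position 0..3
i = torch.arange(0, d_model, 2).float()                              # dimension pairs 0, 2, 4
angle = position / torch.pow(10000, i / d_model)                     # shape (4, 3)

pe = torch.zeros(max_sequence_length, d_model)
pe[:, 0::2] = torch.sin(angle)   # sin fills the even embedding dimensions (0, 2, 4)
pe[:, 1::2] = torch.cos(angle)   # cos fills the odd embedding dimensions (1, 3, 5)
print(pe.shape)                  # torch.Size([4, 6]); every position appears in both the sin and cos terms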
@hermannangstl1904 · 1 year ago
From what I understood, each word/token is represented by a 512-dimensional vector. The values of this vector are modified by means of (self-)attention and positional encoding. What is a bit counter-intuitive for me is that the position a word/token appears in can be different in different sentences. For example, let's take the word "Ajay". (1) In this sentence it's in 4th position: "My name is Ajay". (2) In a different sentence it is in 1st position: "Ajay explains very well". So the positional encodings for the word "Ajay" vary; they might be different in each sentence. How can the network be trained, how can it learn, with such contradictory input data?
@CodeEmporium · 1 year ago
This is a good question. But it intuitively does make sense that the same word In different sentences can have different meanings. Take the word “grounded”. You can represent this as a 512 dimensional vector. But let’s say “grounded” occurs in 2 sentences: (1) The truth is grounded in reality (2) You’re grounded! Go to your room. In these examples, “grounded” has differing meanings and should hence have different vector representations. This is why we need surrounding context to understand word vectors individually. This is probably a lil hard to see with your example since “Ajay” is a proper noun. However, for non-proper nouns, context matters. I think you should take a look at the paper “Deep Contextualized word Representations” by Matthew Peters (2018). They more formally answer the question you are asking. This is the paper that introduced ELMo embedding. According to this paper, Turns out that using different vectors based on context really improved models on Part of Speech Tagging and Language modeling
@DevelopersHutt · 1 year ago
You've raised an important point. While it is true that the positional encoding for a word like "Ajay" can vary depending on its position in different sentences. Let's consider the word "Ajay" in two different sentences and see how the Transformer model handles it: (1) Sentence 1: "My name is Ajay." (2) Sentence 2: "Ajay explains very well." In both sentences, the word "Ajay" has different positions, but the Transformer model can still learn and make sense of it. Here's a simplified example of how it works: Input Encoding: Each word, including "Ajay," is initially represented by a 512-dimensional vector. Sentence 1: "Ajay" is represented as [0.1, 0.2, 0.3, ..., 0.4]. Sentence 2: "Ajay" is represented as [0.5, 0.6, 0.7, ..., 0.8]. Positional Encoding: The model incorporates positional encodings to differentiate the positions of words. Sentence 1: The positional encoding for the 4th position is [0.4, 0.3, 0.2, ..., 0.1]. Sentence 2: The positional encoding for the 1st position is [1.0, 0.9, 0.8, ..., 0.5]. Attention and Context: The Transformer's attention mechanism considers the positional encodings along with the input representations to compute contextualized representations. Sentence 1: The attention mechanism incorporates the positional encoding and input embedding of "Ajay" at the 4th position to capture its contextual information within the sentence. Sentence 2: Similarly, the attention mechanism considers the positional encoding and input embedding of "Ajay" at the 1st position in the context of the second sentence. By attending to different positions and incorporating positional encodings, the model can learn to associate the word "Ajay" with its specific context and meaning in each sentence. Through training on various examples, the model adjusts its weights and learns to generate appropriate representations for words based on their positions, allowing it to make meaningful predictions and capture the contextual relationships between words effectively.
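A tiny numeric sketch of the point made above: the same (made-up) token embedding combined with different positional encodings produces different inputs, which is what lets the model treat "Ajay" at position 3 differently from "Ajay" at position 0. Everything here is illustrative, not taken from the video's code.

import torch

torch.manual_seed(0)
d_model, max_sequence_length = 8, 16

ajay_embedding = torch.randn(d_model)            # one fixed, made-up embedding for "Ajay"

position = torch.arange(max_sequence_length).reshape(-1, 1).float()
i = torch.arange(0, d_model, 2).float()
angle = position / torch.pow(10000, i / d_model)
pe = torch.zeros(max_sequence_length, d_model)
pe[:, 0::2] = torch.sin(angle)
pe[:, 1::2] = torch.cos(angle)

input_at_pos_3 = ajay_embedding + pe[3]          # "My name is Ajay"
input_at_pos_0 = ajay_embedding + pe[0]          # "Ajay explains very well"
print(torch.allclose(input_at_pos_3, input_at_pos_0))  # False: same word, different positions, different inputs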
@balakrishnaprasad8928 · 1 year ago
Please make a detailed video series on the math for data science
@CodeEmporium · 1 year ago
I have made some math-in-machine-learning videos. Maybe check the playlist "The Math You Should Know" on the channel.
@LuizHenrique-qr3lt · 1 year ago
My second doubt is that when I use BertTokenizer, for example, it transforms the text "my name is ajay" into a list of integers, for example [101, 11590, 11324, 10124, 138, 78761, 102]. Where does that part go? I couldn't understand that part.
@CodeEmporium · 1 year ago
So I haven't shown the text encoding details just yet :) Since 4 words were encoded into 7 numbers, I assume the BertTokenizer is encoding each subword / word piece into some number. Essentially, the tokenizer is taking the sentence, breaking it down into word pieces (7 in this case), and each is being mapped to a unique integer number. Later on, you will see each number being mapped to a larger vector (I explained more details about why these vectors exist in the other comment).
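For anyone curious, that tokenization step can be reproduced roughly as follows, assuming the HuggingFace transformers package is installed; the checkpoint name is illustrative and the exact ids you get depend on which tokenizer you load:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("my name is ajay")

# A list of integer ids, one per word piece, bracketed by the special [CLS] and [SEP] tokens.
print(encoded["input_ids"])
# The human-readable word pieces those ids correspond to.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))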
@wishIKnewHowToLove · 1 year ago
I like that your English is clean :) no disgusting non-Californian accent :)
@CodeEmporium · 1 year ago
Thank you for the compliments
@7_bairapraveen928 · 1 year ago
I am kind of a newbie here; if you think this is valid, please answer. Why are you introducing parameters of dimension 512 for the vocab size and making a neural network? I mean, what happens if we don't do that?
@CodeEmporium · 1 year ago
Why are we using 512 dimensions instead of the one-hot vector of size equal to the vocabulary size? This is because of the curse of dimensionality. Vocabulary sizes are huge (often in the tens of thousands). This is a lot for any model, neural network or not, to process. There was a 2001 paper by Yoshua Bengio, "A Neural Probabilistic Language Model", that describes exactly this issue and why dense embeddings were introduced. I would recommend giving it a read. Also, my next series will delve into the history of language models, so I hope you'll stay tuned for that. Maybe some of the design choices will become clearer.
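To put rough numbers on that comparison, here is an illustrative sketch (the vocabulary size and sequence length below are made up):

import torch

vocab_size, d_model, seq_len = 30_000, 512, 50

# One-hot: every token becomes a vocab_size-dimensional vector that is almost all zeros.
token_ids = torch.randint(vocab_size, (seq_len,))
one_hot = torch.nn.functional.one_hot(token_ids, vocab_size)
print(one_hot.shape)    # torch.Size([50, 30000])

# Learned embedding: the same tokens become dense 512-dimensional vectors.
embedding = torch.nn.Embedding(vocab_size, d_model)
dense = embedding(token_ids)
print(dense.shape)      # torch.Size([50, 512])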
@SAIDULISLAM-kc8ps · 1 year ago
Can you please tell me the difference between sequence length and the dimension of the embedding?
@CodeEmporium · 1 year ago
Sequence length = maximum number of characters/words we can pass into the transformer at a time. Dimension of embedding = size of vector representing each character / word.
@SAIDULISLAM-kc8ps · 1 year ago
@CodeEmporium Thanks a lot.
@joaogoncalves1149 · 11 months ago
I think that queen/king example is somewhat cherry picked, as the principle behind the analogy fails for many examples.
@aar953 · 7 months ago
There is one mistake that you are making. We are not taking a single output as input to the decoder, but all the previous outputs up to the current time step as input to the decoder.
@CodeEmporium · 7 months ago
Yea that’s correct from a practical standpoint. I dive into this when coding this out in the rest of this playlist “Transformers from scratch “. Hope those videos clear things up!
@aar953 · 7 months ago
@CodeEmporium Thanks for answering. I understand that you have to make a trade-off between simplicity and accuracy. Here, I just wanted to note that a little more complexity would have added quite a lot more accuracy. Your content is excellent!
@ShivarajKarki · 1 year ago
A Kannadiga's greetings to a fellow admirer of Kannada. Salutations to your treasure trove of knowledge.
@__hannibaalbarca__ · 1 year ago
I don't know anything about Python, but it looks extremely slow.