Live - Transformers In-Depth Architecture Understanding - Attention Is All You Need

210,217 views

Krish Naik

3 years ago

All Credits To Jay Alammar
Reference Link: jalammar.github.io/illustrated...
Research Paper: papers.nips.cc/paper/7181-att...
YouTube channel: • Jay's Visual Intro to AI
Please donate if you want to support the channel through the GPay UPI ID below.
GPay: krishnaik06@okicici
Discord Server Link: / discord
Telegram link: t.me/joinchat/N77M7xRvYUd403D...
Please join as a member of my channel to get additional benefits like Data Science materials, live streaming for members, and more.
/ @krishnaik06
Please also subscribe to my other channel:
/ @krishnaikhindi
Connect with me here:
Twitter: / krishnaik06
Facebook: / krishnaik06
Instagram: / krishnaik06

Comments: 234
@dandyyu0220 2 years ago
I cannot express enough appreciation for your videos, especially the NLP and deep learning related topics! They are extremely helpful and so easy to understand from scratch. Thank you very much!
@mohammadmasum4483 1 year ago
@40:00 Why do we consider 64? It is based on how many attention heads you want to apply. We use an embedding size of 512 for each word and want 8 self-attention heads; therefore each head uses (512/8 =) 64-dimensional Q, K, and V vectors. That way, when we concatenate all the attention heads afterward, we get back the same 512-dimensional word embedding, which is the input to the feed-forward layer. If, for instance, you wanted 16 attention heads, you would use 32-dimensional Q, K, and V vectors. In my opinion, the initial word-embedding size and the number of attention heads are hyperparameters.
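A minimal NumPy sketch of the dimension bookkeeping described above (the sizes 512 and 8 come from the comment; the projection matrices are random placeholders, not trained parameters):

```python
import numpy as np

seq_len, d_model, num_heads = 10, 512, 8
d_head = d_model // num_heads             # 512 / 8 = 64

X = np.random.randn(seq_len, d_model)     # embeddings for a 10-word sentence

# One random placeholder projection per head, each mapping 512 -> 64.
# (Q, K and the attention computation itself are omitted; the point here
# is only that 8 heads of size 64 concatenate back to 512.)
head_outputs = [X @ np.random.randn(d_model, d_head) for _ in range(num_heads)]

Z = np.concatenate(head_outputs, axis=-1)
print(Z.shape)                            # (10, 512): back to the embedding size
```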
@naveenkumarjadi2915 29 days ago
great bro
@Musicalphabeats 20 days ago
thanks bro
@shrikanyaghatak 1 year ago
I am very new to the world of AI and was looking for easy videos to teach me about the different models. I did not imagine I would stay totally enthralled by this video for as long as you taught. You are a very good teacher. Thank you for publishing this video for free. Thanks to Jay as well for simplifying such a complex topic.
@nim-cast 11 months ago
Thanks for your fantastic LLM/Transformer series content, and I admire your positive attitude and support for the authors of these wonderful articles! 👏
@sivakrishna5557 2 months ago
Could you please help me get started on the LLM series? Could you please share the playlist link?
@apppurchaser2268 1 year ago
You are a really good teacher who always checks whether your audience gets the concept or not. I also appreciate your patience and the way you rephrase things to give a better explanation.
@ss-dy1tw 3 years ago
Krish, I really see the honesty in you, man: a lot of humility, a very humble person. At the beginning of this video you gave credit several times to Jay, who created an amazing blog on Transformers. I really liked that. Stay like that.
@suddhasatwaAtGoogle 2 years ago
For anyone having a doubt at 40:00 about why we take the square root of 64: as per the research, this scaling was found to be the best way to keep the gradients stable. Also note that the value 64, the size of the Query, Key and Value vectors, is itself a hyperparameter that was found to work best. Hope this helps.
@latikayadav3751 1 year ago
The embedding vector dimension is 512. We divide this across 8 heads: 512/8 = 64. Therefore the size of the query, key and value vectors is 64, so that size is not an independent hyperparameter.
@afsalmuhammed4239 1 year ago
Normalizing the data.
@sg042 9 months ago
Another reason is that we generally want weights, inputs, etc. to follow a normal distribution N(0, 1). When we compute the dot product, it is a sum of 64 such values, which increases the standard deviation to sqrt(64), i.e. the distribution becomes N(0, sqrt(64)); dividing by sqrt(64) normalizes it back.
@sartajbhuvaji 9 months ago
The paper states: "While for small values of dk the two mechanisms (additive attention and dot-product attention; note the paper uses dot-product attention, q·k) perform similarly, additive attention outperforms dot-product attention without scaling for larger values of dk. We suspect that for large values of dk, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by 1/sqrt(dk)."
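A minimal NumPy sketch of the scaled dot-product attention this thread is discussing; the sqrt(dk) factor follows the paper, while the inputs are random placeholders:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in 'Attention Is All You Need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq, seq) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of value vectors

seq_len, d_k = 5, 64
Q, K, V = (np.random.randn(seq_len, d_k) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (5, 64)
```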
@story_teller_1987 3 years ago
Krish is a hard-working person, not just for himself but for our country, in the best way he can. We need more people like him in our country.
@lohithklpteja 4 months ago
Alu kavale ya lu kavale ahhh ahhh ahhh ahhh dhing chiki chiki chiki dhingi chiki chiki chiki
@akhilgangavarapu9728 3 years ago
A million tons of appreciation for making this video. Thank you so much for your amazing work.
@anusikhpanda9816 3 years ago
You can skim through all the YouTube videos explaining transformers, but nobody comes close to this video. Thank you, sir 🙏🙏🙏
@kiran5918 5 months ago
Difficult to understand foreign accents. Desi away zindabad
@prasad5164 3 years ago
I really admire you now, just because you give credit to the deserving at the beginning of the video. That attitude will make you a great leader. All the best!
@Adil-qf1xe 1 year ago
How did I miss the subscription to your channel? Thank you so much for this thorough explanation, and hats off to Jay Alammar.
@user-or7ji5hv8y 3 years ago
Thank you, I appreciate your time going through this material.
@TusharKale9 3 years ago
The GPT-3 topic is very well covered, and very important from an NLP point of view. Thank you for your efforts.
@jeeveshkataria6439 3 years ago
Sir, please release the video on BERT. Eagerly waiting for it.
@roshankumargupta46 3 years ago
This might help the guy who asked why we take the square root, and other aspirants too: the scores get scaled down by dividing by the square root of the dimension of the query and key vectors. This allows for more stable gradients, as multiplying values can have exploding effects.
@tarunbhatia8652 3 years ago
Nice, I was wondering about the same thing. It all comes back to exploding or vanishing gradients; how could I forget that :D
@apicasharma2499 3 years ago
Can this attention encoder-decoder be used for financial time series as well, i.e. multivariate time series?
@matejkvassay7993 2 years ago
Hello, I think the square root of the dimension is not chosen just empirically; it actually normalizes the length of the vector, or something similar. It holds that vector length scales with the square root of the dimension as the dimension increases (when some conditions I forget are met), so this way you scale it back down and thus prevent exploding dot-product scores.
@kunalkumar2717 2 years ago
@apicasharma2499 Yes; although I have not used it that way, it can be used.
@generationgap416 1 year ago
The normalizing should come from the softmax, or from using a triangular (tril) mask to zero out the bottom of the concatenated Q, K and V matrix, to have good initialization weights, I think.
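A quick NumPy check of the variance argument raised in this thread (purely illustrative, using random unit-variance vectors):

```python
import numpy as np

d_k = 64
rng = np.random.default_rng(0)
q = rng.standard_normal((100_000, d_k))   # entries ~ N(0, 1)
k = rng.standard_normal((100_000, d_k))

dots = (q * k).sum(axis=1)                # raw dot products
print(dots.std())                         # ~ sqrt(64) = 8
print((dots / np.sqrt(d_k)).std())        # ~ 1 after scaling by sqrt(d_k)
```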
@tshepisosoetsane4857 1 year ago
Yes, this is the best video explaining these models so far; even non-computer-science people can understand what is happening. Great work.
@sarrae100 3 years ago
Excellent blog from Jay. Thanks, Krish, for introducing this blog on your channel!
@harshitjain4923 3 years ago
Thanks for explaining Jay's blog. To add to the explanation at 39:30: the reason for using sqrt(dk) is to prevent the vanishing-gradient problem mentioned in the paper. Since we apply softmax to Q*K^T, high-dimensional matrices produce large values, which get pushed close to 1 by the softmax and hence lead to very small gradient updates.
@neelambujchaturvedi6886 3 years ago
Thanks for this Harshit
@shaktirajput4711 2 years ago
Thanks for the explanation, but I guess it would be called an exploding gradient, not a vanishing gradient. Hope I am not wrong.
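A small NumPy illustration of the saturation the two replies above are debating: the raw scores grow large (which looks like explosion), but the softmax then saturates and its gradients vanish. The numbers are arbitrary:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))     # moderate scores: spread out
print(softmax(np.array([10.0, 20.0, 30.0])))  # large unscaled scores: ~[0, 0, 1],
                                              # saturated, near-zero gradients
```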
@harshavardhanachyuta2055 3 years ago
Thank you. The combination of your teaching and Jay's blog really pulls this topic together. I like the way you teach. Keep going.
@hiteshyerekar9810 3 years ago
Great session, Krish. Because of the research paper I understood things very easily and clearly.
@ashishjindal2677 2 years ago
Wonderful explanation of the blog; thanks for introducing us to Jay. Your teaching style is awesome.
@mequanentargaw 10 months ago
Very helpful! Thank you to all contributors!
@madhu1987ful 2 years ago
Jay Alammar's blog is of course awesome, but you made it even simpler while explaining. Thanks a lot.
@underlecht 3 years ago
I love your patience and how many times you go back over things until they become clear, even for slow guys like me. BTW, residual connections are not there because some layers are unimportant and need to be skipped; they are there to solve the vanishing-gradient problem.
@MuhammadShahzad-dx5je 3 years ago
Really nice, sir; looking forward to the BERT implementation 😊
@lshagh6045 1 year ago
A huge, tremendous effort; a million thanks for your dedication.
@gurdeepsinghbhatia2875 3 years ago
Sir, thanks a lot, I really enjoyed it. Your way of teaching is so humble and honest and, most importantly, patient. Awesome video, sir; too good.
@tarunbhatia8652 3 years ago
Thanks, Krish. Awesome session; keep doing the great work!
@kiran5918 6 months ago
Wow, what an explanation of transformers. Perfect for us; it aligns with the way we were taught at school.
@smilebig3884 2 years ago
Very underrated video; this is a super awesome explanation. I am watching and commenting for the second time, a month later.
@faezakamran3793 1 year ago
For those getting confused by the 8 heads: all the words go to all the heads; it's not one word per head. The X matrix remains the same; only the W matrices change across heads in multi-head attention.
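A compact sketch of that point, complementing the earlier dimension sketch (sizes as in the video, random placeholder weights): the same X is fed to every head, and only the projection matrices differ.

```python
import numpy as np

seq_len, d_model, num_heads = 10, 512, 8
d_head = d_model // num_heads

X = np.random.randn(seq_len, d_model)        # one X, shared by all heads

for head in range(num_heads):
    # A separate (random, untrained) W_q, W_k, W_v triple per head
    W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v      # every head sees every word
    print(head, Q.shape, K.shape, V.shape)   # (10, 64) each
```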
@shanthan9. 4 months ago
Every time I get confused or distracted while listening to the Transformers, I have to watch the video again; this is my third time watching it, and now I understand it better.
@wentaowang8622 1 year ago
Very clear explanation. And Jay's blog is also amazing!!
@louerleseigneur4532 3 years ago
After watching your lecture it's much clearer to me. Thanks, Krish.
@af121x 3 years ago
Thank you Krish. I learned so many things from your video.
@jimharrington2087 3 years ago
Great effort Krish, Thanks
@michaelpadilla141 2 years ago
Superb. Well done and thank you for this.
@MrChristian331 2 years ago
Great presentation! I think I understand it fully now.
@aqibfayyaz1619 3 years ago
Great Effort. Very well explained
@thepresistence5935 2 years ago
It took me more than 5 hours to understand this. Thanks, Krish, for the wonderful explanation.
@RanjitSingh-rq1qx 7 months ago
The video was so good; I understood each and every thing except the decoder side.
@avijitbalabantaray5883 1 year ago
Thank you Krish and Jay for this work.
@junaidiqbal5018 2 years ago
@31:45 If my understanding is correct, the reason we have 64 is that we divide 512 into 8 equal heads. Since we compute dot products to get the attention values, doing the dot product over the full 512-dimensional embedding would not only be computationally expensive but would also give only one relation between the words. Taking advantage of parallel computation, we divide 512 into 8 equal parts; this is why we call it multi-head attention. This way it is computationally fast and we also get 8 different relations between the words. (FYI, attention is basically a relation between the words.) Anyway, good work on explaining the architecture, Krish.
@markr9640 7 months ago
Very well explained Sir! Thank you.
@pavantripathi1890 9 months ago
Thanks to Jay Alammar sir and to you for the great explanation.
@Schneeirbisify 3 years ago
Hey Krish, thanks for the session. Great explanation! Could you please tell me whether you have already uploaded a session on BERT? And if not, is it still in your plans? It would be very interesting to deep-dive into practical applications of Transformers.
@ruchisaboo29 3 years ago
Awesome explanation. When will you post the BERT video? Waiting for it, and if possible please cover GPT-2 as well. Thanks a lot for this amazing playlist.
@Deepakkumar-sn6tr 3 years ago
Great session! Looking forward to a Transformer-based recommender system.
@zohaibramzan6381 3 years ago
Great for overcoming confusion. I hope to get hands-on with BERT next.
@kameshyuvraj5693 3 years ago
Sir, the way you explained the topics is the ultimate, sir.
@sujithsaikalakonda4863 9 months ago
Very well explained. Thank you sir.
@tapabratacse 1 year ago
Superb, you made the whole thing look so easy.
@Ram-oj4gn 7 months ago
Great explanation. I understand Transformers now.
@prekshagampa5889 1 year ago
Thanks a lot for the detailed explanation. I really appreciate your effort in creating these videos.
@pranthadebonath7719 10 months ago
Thank you, sir, that's a nice explanation. Thanks also to Jay Alammar sir.
@ganeshkshirsagar5806 8 months ago
Thank you so much sir for this superb session.
@parmeetsingh4580 3 years ago
Hi Krish, great session. I have a question: is the Z we get after the encoder's self-attention block interpretable? That is, could we figure out, just by looking at Z, what the multi-head self-attention block produced? Kindly help me out with this.
@raghavsharma6430 3 years ago
Krish sir, it's amazing!
@elirhm5926 2 years ago
I don't know how to thank you and Jay enough!
@armingh9283 3 years ago
Thank you, sir. It was awesome.
@mayurpatilprince2936 9 months ago
Why do they multiply each value vector by the softmax score? Because they want to keep intact the values of the word(s) they want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example). They wanted to submerge whatever irrelevant words the sentence has.
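A tiny NumPy illustration of that weighting, with made-up softmax scores and toy value vectors for a three-word sentence (not numbers from the video):

```python
import numpy as np

# Toy 4-dimensional value vectors for three words
V = np.array([[1.0, 2.0, 3.0, 4.0],    # "animal"
              [5.0, 6.0, 7.0, 8.0],    # "street"
              [9.0, 8.0, 7.0, 6.0]])   # "it"

# Softmax scores for the word currently attending (already sum to 1)
weights = np.array([0.9, 0.099, 0.001])

# Focused words keep most of their value; irrelevant ones are drowned out
z = weights @ V
print(z)
```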
@utkarshsingh2675 1 year ago
Thanks for such free content! You are awesome, sir!
@jaytube277 3 months ago
Thank you, Krish, for making such a great video. I really appreciate your hard work. One thing I have not understood is where the loss gets calculated. Does it happen on the multiple heads or at the encoder-decoder attention layer? What I assume is that while we are training the model the translations will not be accurate, so we should get some loss that we try to minimize, but I don't understand where that comparison happens.
@sagaradoshi 2 years ago
Thanks for the wonderful explanation. For the decoder: at the 2nd time step we passed the word/letter 'I'; then at the 3rd time step do we pass both the words 'I' and 'am', or only the word 'am'? Similarly, at the next time step do we pass the words 'I', 'am' and 'a', or just the word/letter 'a'?
@manikaggarwal9781 7 months ago
Superbly explained.
@happilytech1006 2 years ago
Always helpful Sir!
@joydattaraj5625 3 years ago
Good job Krish.
@kiran082 2 years ago
Great Explanation
@mdmamunurrashid4112 1 year ago
You are amazing as always!
@hudaalfigi2742 2 years ago
I really want to thank you for your nice explanation; I was not able to understand this before watching the video.
@sweela1 1 year ago
In my opinion, at 40:00 the square root is taken for scaling: larger values are normalized down to smaller ones so that the softmax of these values can be computed easily. dk is the dimension whose square root is taken to scale the values.
@sg042 9 months ago
Another probable reason is that we generally want weights, inputs, etc. to follow a normal distribution N(0, 1); when we compute the dot product, it is a sum of 64 such values, which increases the standard deviation to sqrt(64), so dividing by sqrt(64) normalizes it back.
@toulikdas3915 3 years ago
More videos of this kind, sir, on research-paper explanations and advanced concepts in deep learning and reinforcement learning.
@AshishBamania95 2 years ago
Thanks a lot!
@121MrVital 3 years ago
Hi Krish, when are you going to make a video on BERT with a practical implementation?
@sreevanthat3224 1 year ago
Thank you.
@mdzeeshan1148 1 month ago
Wow! At last I have clarity. Thanks so much for the wonderful explanation.
@digitalmbk 3 years ago
My MS SE thesis completion totally depends on your videos. Just AWESOME!!!
@pratheeeeeesh4839 3 years ago
Bro, are you pursuing your MS?
@digitalmbk 3 years ago
@pratheeeeeesh4839 Yes.
@pratheeeeeesh4839 3 years ago
@digitalmbk Where, brother?
@digitalmbk 3 years ago
@pratheeeeeesh4839 GCUF, Pakistan.
@captiandaasAI 1 year ago
Great, Krish!
@bruceWayne19993 8 months ago
thank you🙏
@Rider12374 2 months ago
Thanks krish don!!!
@generationgap416 1 year ago
The reason to divide by sqrt(dk) is to prevent a constant output: for values near x = 0, approached from the left or right, f(x) approaches y = 1/2. Look at the shape of the sigmoid function.
@BINARYACE2419 3 years ago
Well Explained Sir
@dataflex4440 1 year ago
Pretty good explanation, mate.
@User-nq9ee 2 years ago
Thank you so much.
@dhirendra2.073 2 years ago
Superb explanation
@apoorvneema7717 11 months ago
awesome bro
@muraki99 10 months ago
Thanks!
@shahveziqbal5206 2 years ago
Thank you ❤️
@learnvik 9 months ago
Thanks. Question: in step 1 (30:52), what if the randomly initialized weights all have the same value at the start? Then all the resulting vectors would have the same values.
@mohammedbarkaoui5218 1 year ago
You are the best 😇
@BalaguruGupta 3 years ago
The layer normalization is applied to (X + Z), where X is the input and Z is the result of the self-attention calculation. You mentioned that when the self-attention doesn't perform well, the self-attention calculation is skipped and we jump to layer normalization, hence the Z value will be 'empty' (please correct me here if I'm wrong). In that case does the layer normalization happen only on X (the input)? Am I correct?
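For reference, a minimal sketch of the Add & Norm step being asked about (learnable layer-norm parameters are omitted; Z here is just a random stand-in for the self-attention output): the residual always carries X through, whether or not Z contributes much.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

seq_len, d_model = 10, 512
X = np.random.randn(seq_len, d_model)   # sub-layer input
Z = np.random.randn(seq_len, d_model)   # stand-in for the self-attention output

out = layer_norm(X + Z)                 # "Add & Norm": residual, then normalize
print(out.shape)                        # (10, 512)
```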
@bofloa 1 year ago
Watching this video, I can only conclude that the whole process is more of an art than a science.
@rajns8643 1 year ago
Definitely!
@lakshmigandla8781 6 months ago
Clear explanation.
@sayantikachatterjee5032 8 months ago
At 58:49 it is said that if we increase the number of heads, more importance will be given to different words, so 'it' can also give more importance to 'street'. So between 'the animal' and 'street', which word will be prioritized more?
@ayushrathore8916 3 years ago
After the encoder, is there any repository-like store that holds all the encoder outputs and then passes them one by one to the decoder to get the decoded output one at a time?
@ranjanarch4890 2 years ago
This video describes Transformer inference. Can you do a video on the training architecture? I suppose we would need to provide datasets in both languages for training.
@desrucca 1 year ago
AFAIK a residual (ResNet-style) connection is not like dropout; instead it carries information from the previous layer forward to the n-th layer, and by doing this, vanishing gradients are less likely to occur.
@vishwasreddy6626 3 years ago
How do we get the K and V vectors from the encoder output? It would be helpful if you could explain it with dimensions.
@neelambujchaturvedi6886 3 years ago
Hey Krish, I had a quick question related to the explanation at 1:01:07 about positional encodings. How exactly do we create those embeddings? In the paper the authors used sine and cosine waves to produce them, and I could not understand the intuition behind this. Could you please help me understand this part? Thanks in advance.
@1111Shahad 2 months ago
The use of sine and cosine functions ensures that the positional encodings have unique values for each position. Different frequencies allow the model to capture both short-range and long-range dependencies. These functions ensure that similar positions have similar encodings, providing a smooth gradient of positional information, which helps the model learn relationships between neighboring positions.
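A minimal NumPy sketch of those sinusoidal encodings, following the paper's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the sequence length here is arbitrary:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)   # one frequency per dim pair
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)    # (50, 512); this gets added to the word embeddings
```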