How do transformers work? (Attention is all you need)

18,929 views

Aleksa Gordić - The AI Epiphany

A day ago

❤️ Become The AI Epiphany Patreon ❤️ ► / theaiepiphany
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
In this video, I give you a semi-quick tour through the "Attention is all you need" paper, the paper that introduced the first-ever transformer model!
I also show you some cool blogs along the way and my half-baked implementation of the original transformer model.
You'll learn about:
✔️ How the original transformer model works
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
✅ The Annotated Transformer blog: nlp.seas.harvard.edu/2018/04/0...
✅ Jay Alammar's blog: jalammar.github.io/illustrate...
✅ Original paper: arxiv.org/abs/1706.03762
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
⌚️ Timetable:
0:00 Prerequisite theory and my semi-done transformer implementation
1:40 High-level overview of the paper
2:55 Visualization of positional encodings (my code; see the sketch below)
5:07 Attention mask (no looking forward!)
7:35 Optimizer
10:20 Multi-head attention in depth
15:15 A glimpse at the code implementation
17:49 Training procedure - machine translation
18:09 Na ja, wie geht's? (German for "Well, how's it going?")
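For reference, a minimal NumPy sketch of the sinusoidal positional encodings visualized at 2:55, following the formulation in the paper (the function name and defaults here are my own):

```python
import numpy as np

# Sinusoidal positional encodings from "Attention is all you need":
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
def positional_encodings(max_len=100, d_model=512):
    pos = np.arange(max_len)[:, None]       # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]   # (1, d_model/2), even dims
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)            # odd dimensions get cosine
    return pe
```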
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💰 BECOME A PATREON OF THE AI EPIPHANY ❤️
If these videos, GitHub projects, and blogs help you,
consider helping me out by supporting me on Patreon!
The AI Epiphany ► / theaiepiphany
One-time donation:
www.paypal.com/paypalme/theai...
Much love! ❤️
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
💡 The AI Epiphany is a channel dedicated to simplifying the field of AI using creative visualizations and, in general, a stronger focus on geometric and visual intuition rather than algebraic and numerical "intuition".
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
👋 CONNECT WITH ME ON SOCIAL
LinkedIn ► / aleksagordic
Twitter ► / gordic_aleksa
Instagram ► / aiepiphany
Facebook ► / aiepiphany
👨‍👩‍👧‍👦 JOIN OUR DISCORD COMMUNITY:
Discord ► / discord
📢 SUBSCRIBE TO MY MONTHLY AI NEWSLETTER:
Substack ► aiepiphany.substack.com/
💻 FOLLOW ME ON GITHUB FOR COOL PROJECTS:
GitHub ► github.com/gordicaleksa
📚 FOLLOW ME ON MEDIUM:
Medium ► / gordicaleksa
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
#transformer #attention #deeplearning

Comments: 47
@godelian 3 years ago
This channel is a true gem.
@TheAIEpiphany 3 years ago
Appreciate it! Thank you! I still have a lot of room for improvement when it comes to my presentation skills, but I'll learn along the way; that's my motto. 😅😂
@mehular0ra 2 years ago
Best AI channel
@sachinprabhusachinprabhu007 3 years ago
Thank you so much for the brief summary! That was so helpful :)
@TheAIEpiphany 3 years ago
You're welcome, Sachin!
@tahmidhossain007 3 years ago
Keep up the good work, mate. In the era of Ravals, it's you guys keeping things honest.
@TheAIEpiphany 3 years ago
Thank you, Tahmid!! Hahaha 😅
@TheAIEpiphany 3 years ago
I don't have a strong attitude towards Siraj. He did some bad stuff (plagiarism), but I also guess he got many people interested in AI; otherwise he wouldn't have that many followers. Who am I to evaluate the net effect he had on society.
@tahmidhossain007 3 years ago
@@TheAIEpiphany I believe people would rather not come into ML with the presumption that things just magically occur in 5 minutes. As a society, we are becoming impatient in every aspect and want overnight success; no one wants to play the long game...
@TheAIEpiphany 3 years ago
@@tahmidhossain007 That's true. 100%. But then again, it's a basic human trait: if it were possible, we'd all want it. The difference is that many people still don't know that, especially the less experienced/knowledgeable ones. They still believe you can achieve things overnight!
@user-co6pu8zv3v 2 years ago
Thank you!
@TheAIEpiphany 3 years ago
Hallo zusammen! (German for "Hello everyone!"; throwing machine translation hints because I can.) Do you find these kinds of paper overviews useful? What's your favourite paper?
@salimmiloudi4472 3 years ago
Really useful, thanks for sharing. If you could review some papers on unsupervised deep learning, that would be great. Keep up the good work!
@TheAIEpiphany 3 years ago
@@salimmiloudi4472 Thanks! Are you more interested in unsupervised deep learning in the context of NLP (GPT family, BERT-like models, etc.) or computer vision (e.g. GANs)? Thanks again, I find your feedback extremely important!
@practicallinguistics4030 3 years ago
@@TheAIEpiphany Not sure about @milas; I'd love to hear about unsupervised DL for NLP
@TheAIEpiphany 3 years ago
@@practicallinguistics4030 Hahaha, that was not too hard to infer given your channel name lol! Thanks for chiming in!
@practicallinguistics4030 3 years ago
@@TheAIEpiphany Yep :) Loved the German btw :))
@umutalihandikel5490 2 years ago
Hi, many thanks for creating such blissful content!! May I ask what knowledge organizer you are using, the one shown in the first seconds of the video?
@TheAIEpiphany 2 years ago
Thank you!! Microsoft's OneNote. 😁
@umutalihandikel5490 2 years ago
@@TheAIEpiphany Wow, the dark theme makes it look like a super cool *nix-based FOSS tool :)) I use it for the same purpose too, so your reply validates my choice to keep using OneNote. Aside from the knowledge organizer, your content is really great for researchers and practitioners. Infinitely many thanks for your effort; please do continue creating such educative content.
@moeinhasani8718 3 years ago
Subscribed and turned notifications on! Great channel
@TheAIEpiphany 3 years ago
Much love! Thank you, glad you like it!
@shandi1241 3 years ago
Ok, fine! You got me subscribed 😄
@TheAIEpiphany 3 years ago
Hahahaha welcome to the dark side
@cocoarecords 3 years ago
What a gem channel
@TheAIEpiphany 3 years ago
Thanks! Hahah nice channel name dude!
@akhileshbisht6469 1 year ago
Hi, I have a doubt. When we train a transformer model for machine translation, the network takes a batch of sentences as input (according to the batch size). But after training, when we use the model to translate test sentences, we translate sentence by sentence, not as a batch. So how is it that during training a batch is given as input, but during testing only one sentence? The configuration becomes different, right?
@abhishekmann 1 year ago
At training time, transformers use something called "teacher forcing", where the ground truth itself is fed to the decoder side and the prediction at a given time step is only used to compute the loss. The transformer decoder is "auto-regressive": it generates output one token at a time. That is exactly how the transformer is able to generate a sequence of a different length than the input sentence on the decoder side.
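A minimal PyTorch sketch of the two regimes described above; the `model(src, tgt)` interface, token ids, and function names here are hypothetical, not the video's actual code:

```python
import torch

# Training: teacher forcing. The decoder input is the ground-truth target
# shifted right by one; predictions are scored against the unshifted target.
def training_step(model, src_batch, tgt_batch, loss_fn):
    decoder_input = tgt_batch[:, :-1]          # <bos> w1 ... w_{n-1}
    expected = tgt_batch[:, 1:]                # w1 ... w_n <eos>
    logits = model(src_batch, decoder_input)   # (batch, seq, vocab)
    return loss_fn(logits.reshape(-1, logits.size(-1)), expected.reshape(-1))

# Inference: auto-regressive greedy decoding, one token at a time.
def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    generated = torch.tensor([[bos_id]])       # shape (1, 1): a "batch" of one
    for _ in range(max_len):
        logits = model(src, generated)         # re-run on the growing prefix
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == eos_id:
            break
    return generated
```

Note that the batch size is just the leading tensor dimension, so the same trained weights happily accept a batch of one sentence at test time; nothing in the model's configuration changes.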
@akhileshbisht6469 1 year ago
@@abhishekmann How can we hard-code for it?
@abhishekmann 1 year ago
@@akhileshbisht6469 You can look up masked attention to understand exactly what is going on at training time. The self-attention layers on the decoder side use a mask before the softmax step; this essentially makes it illegal for the network to peek at outputs in succeeding time steps (i.e. cheating during training). There is a cross-attention layer in the decoder as well; it obtains keys and values from the output of the encoder and queries from the output of the preceding decoder layer.
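A minimal sketch of that look-ahead mask, assuming the raw attention logits `scores` (a hypothetical name) have already been computed as Q·Kᵀ/√d_k:

```python
import torch
import torch.nn.functional as F

# Causal (look-ahead) mask applied to attention logits before softmax.
# scores: (batch, heads, seq, seq), assumed precomputed.
def masked_softmax(scores):
    seq_len = scores.size(-1)
    # True above the diagonal marks "future" positions the decoder may not see.
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float('-inf'))  # peeking becomes -inf
    return F.softmax(scores, dim=-1)                    # -inf -> zero weight
```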
@akhileshbisht6469 1 year ago
@@abhishekmann I think in cross-attention, keys and values are formed from the output of the last encoder in the encoder stack.
@abhishekmann 1 year ago
@@akhileshbisht6469 Yes, from the encoder stack output. Corrected it.
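A minimal sketch of that cross-attention wiring using PyTorch's `nn.MultiheadAttention`; the dimensions and function name are assumptions for illustration:

```python
import torch.nn as nn

# In decoder cross-attention, queries come from the preceding decoder layer,
# while keys and values come from the encoder stack's final output ("memory").
cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

def decoder_cross_attention(decoder_hidden, encoder_memory):
    out, _ = cross_attn(query=decoder_hidden,   # from the decoder
                        key=encoder_memory,     # from the encoder stack output
                        value=encoder_memory)
    return out
```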
@SuperLOLABC 3 years ago
Hey Aleksa, I am very interested in deep learning, but I have to get a job within 3 months. Due to the time constraint, I am applying for Data Analyst positions, which are mostly product-facing roles, at companies like Microsoft. Is it possible to transition into a more ML/DL-heavy role internally after a couple of years?
@TheAIEpiphany 3 years ago
Hey! Good luck with your job pursuit! It is, although it's a slow process. After a couple of years, yeah, for sure, but if you expect to get a new role in 6 months, that's next to impossible.
@emiliomorales2843 3 years ago
Linformer, Longformer, Sparse Transformer, and Self-Attention GAN in coming videos, please.
@TheAIEpiphany 3 years ago
Awesome, thanks, Emilio! This comment encouraged me, haha. I'll consider doing Sparse Transformer first, as that's also related to GPT-3. Btw, can you follow along if I only speak, or did visualizations such as Jay Alammar's blog, that simple cosine similarity pic, etc., help?
@emiliomorales2843 3 years ago
@@TheAIEpiphany Actually, I prefer Ali Ghodsi's style, where you just go through the math. No visualization or code required.
@TheAIEpiphany 3 years ago
@@emiliomorales2843 Interesting! In my experience, that teaching style works only for a really small percentage of people, myself included. The reason is that oftentimes you just don't speak "the same math language": different notation, etc. Anyway, thanks for letting me know!
@emiliomorales2843 3 years ago
@@TheAIEpiphany Yeah, you are right. I just suggest that it would be super helpful if you dove deep into the math. Not the whole paper, maybe just the 2 or 3 main equations, just to understand what's really going on within the model.
@TheAIEpiphany 3 years ago
@@emiliomorales2843 100% agree. Will do. I'll add a "most important math" section haha
@jirrking3461 2 years ago
Your English is much clearer than your German; then again, the German language is quite dirty hehe
@TheAIEpiphany 2 years ago
Yeah, I did a poor job there 😂 I think my German accent can actually be even better than my English.