Adding vs. concatenating positional embeddings & Learned positional encodings

  21,147 views

AI Coffee Break with Letitia


When to add and when to concatenate positional embeddings? What are the arguments for learning positional encodings? When should we hand-craft them? Ms. Coffee Bean answers these questions in this video.
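For readers who want to see the two options side by side in code, here is a minimal PyTorch sketch (not from the video; the helper name `sinusoidal_table` and the sizes are illustrative assumptions):

```python
import math
import torch

def sinusoidal_table(seq_len: int, dim: int) -> torch.Tensor:
    # Classic "Attention is all you need" sin/cos table, shape (seq_len, dim).
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    idx = torch.arange(0, dim, 2, dtype=torch.float32)              # even feature indices
    freq = torch.exp(-math.log(10000.0) * idx / dim)                # (dim // 2,)
    table = torch.zeros(seq_len, dim)
    table[:, 0::2] = torch.sin(pos * freq)
    table[:, 1::2] = torch.cos(pos * freq)
    return table

tokens = torch.randn(8, 512)                        # 8 tokens, 512-dim word embeddings

# Adding: position shares the 512 dimensions with the word meaning.
added = tokens + sinusoidal_table(8, 512)           # shape stays (8, 512)

# Concatenating: position gets its own dimensions (64 here), so every
# downstream layer has to be wider.
concatenated = torch.cat([tokens, sinusoidal_table(8, 64)], dim=-1)   # (8, 576)
```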
➡️ AI Coffee Break Merch! 🛍️ aicoffeebreak.creator-spring....
Outline:
00:00 Concatenated vs. added positional embeddings
04:49 Learned positional embeddings
06:48 Ms. Coffee Bean's deepest insight ever
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
🔥 Optionally, help us boost our Coffee Bean production! ☕
Patreon: / aicoffeebreak
Ko-fi: ko-fi.com/aicoffeebreak
▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀
📺 Positional embeddings explained: • Positional embeddings ...
📺 Fourier Transform instead of attention: • FNet: Mixing Tokens wi...
📺 Transformer explained: • The Transformer neural...
Papers 📄:
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." In Advances in neural information processing systems, pp. 5998-6008. 2017. proceedings.neurips.cc/paper/...
Wang, Yu-An, and Yun-Nung Chen. "What do position embeddings learn? An empirical study of pre-trained language model positional encoding." arXiv preprint arXiv:2010.04903 (2020). arxiv.org/pdf/2010.04903.pdf
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020). arxiv.org/abs/2010.11929
✍️ Arabic subtitles by Ali Haidar Ahmad / ali-ahmad-0706a51bb.
🔗 Links:
AICoffeeBreakQuiz: / aicoffeebreak
Twitter: / aicoffeebreak
Reddit: / aicoffeebreak
YouTube: / aicoffeebreak
#AICoffeeBreak #MsCoffeeBean #MachineLearning #AI #research

Comments: 68
@gabriellara9954 2 years ago
I have yet to encounter videos with content that is better curated, researched, summarized and communicated. These videos on position encoding must have saved me a day or more of paper reading and made it easier to dive into those same papers. There are already way too many videos on the basics!! Thank you so much :) PS: time-series-related content would be appreciated for us unfortunates in time series research 😅
@DerPylz 3 years ago
More positional embeddings! Yeah!
@MegaNightdude 1 year ago
Thanks for the video! I'd never gotten around to concatenated positional embeddings, and now I know about them.
@rinogrego9262 1 year ago
I just want to say thank you very much for this very concise yet CLEAR explanation. I've only watched 2 of your videos (this one and the "demystifying positional encoding" one) and both of them gave me a much better explanation and understanding than the other resources I used in the last few days.
@preadaptation 1 year ago
Thanks
@huonglarne 2 years ago
Thank you so much
@Omedalus 1 year ago
These videos are excellent! Thank you, Miss Coffee Bean!
@murphp151 2 years ago
you are sooo good at this!
@AICoffeeBreak 2 years ago
Wow, this is a nice thing to say, thanks!
@toompi1 2 years ago
Thanks a lot for this video and the last one! I think I'm missing something. How come the layer learns "position"? Is there something enforcing the desired properties? It seems wishful to expect a random layer to learn specifically about positional information.
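A minimal sketch (my own illustration, with BERT-style names assumed) hints at why such a layer can only learn something about position: the embedding table is looked up by the position index alone, so whatever the task gradients write into each row can only encode position.

```python
import torch
import torch.nn as nn

class EmbeddingWithLearnedPositions(nn.Module):
    # Nothing enforces "position-ness" explicitly: pos_emb is indexed by the
    # token's position only, so position is the only signal it can carry.
    def __init__(self, vocab_size: int, max_len: int, dim: int):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)   # indexed by token id
        self.pos_emb = nn.Embedding(max_len, dim)       # indexed by position

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.word_emb(token_ids) + self.pos_emb(positions)   # broadcasts over batch

emb = EmbeddingWithLearnedPositions(vocab_size=30000, max_len=512, dim=256)
out = emb(torch.randint(0, 30000, (2, 16)))   # (2, 16, 256)
```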
@nicohambauer 3 years ago
Thaanks! This is just helping me to enhance my bachelor thesis even further!
@AICoffeeBreak 3 years ago
It's great to hear this! Good luck with your thesis! 💪
@nicohambauer 2 years ago
@@AICoffeeBreak Thanks! I figure it is going to work out quite well! The implementation and small details with major impacts like concatenating positional encodings really went well. Now running some experiments! Seems like concatenating and also replacing attention with Fourier transform are interesting hacks! Really appreciate your videos! Best regards!
@AICoffeeBreak 2 years ago
@@nicohambauer Happy to hear that the work on your thesis is going well. From what you write, it looks like you are crushing it! 👏 And wow, thanks so much for writing this! It means so much to know we made a difference!
@piotr780 2 years ago
I feel this field needs a lot more mathematicians than CS people, because some fundamental theory needs to be created to explain how DL networks process information, so we could stop iterating back and forth over all possible combinations xd. Btw, we could choose something in the middle, like learning the frequencies of the sin/cos positional embedding instead of a fully hand-crafted or fully learned embedding.
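That "middle ground" could look roughly like the sketch below (an illustration of the idea, not something from the video): keep the sin/cos form but make the frequencies trainable, initialised to the standard schedule.

```python
import math
import torch
import torch.nn as nn

class LearnableFrequencyPE(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        idx = torch.arange(0, dim, 2, dtype=torch.float32)
        # Start at the standard 1 / 10000^(2i/d) frequencies, then let training adjust them.
        self.freq = nn.Parameter(torch.exp(-math.log(10000.0) * idx / dim))   # (dim // 2,)

    def forward(self, seq_len: int) -> torch.Tensor:
        pos = torch.arange(seq_len, dtype=torch.float32, device=self.freq.device).unsqueeze(1)
        angles = pos * self.freq                                    # (seq_len, dim // 2)
        # sin and cos blocks concatenated (interleaving them, as in the original paper, also works)
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (seq_len, dim)

pe = LearnableFrequencyPE(dim=512)
print(pe(10).shape)   # torch.Size([10, 512])
```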
@gabrielhishida 8 months ago
I'm a bit late considering this was posted 2 years ago, but thank you. That was a really good explanation!
@AICoffeeBreak 8 months ago
Thanks a lot. Passing the test of time. 😅
@hedgehog1962 1 year ago
Also amazing video!
@sultan2000 3 years ago
Very nice 👌 and good luck 👍👍
@alexandermichaelson5652 3 years ago
Hi Letiii! I shared it! ❤️
@AICoffeeBreak 3 years ago
Hey, thanks!
@techtam3505 3 years ago
The intro bgm is cool 😎
@AICoffeeBreak 3 years ago
Yay, I think so too! It is by far my favorite of all the tracks I've used so far!
@techtam3505 3 years ago
@@AICoffeeBreak what music is it? I mean name of the music.
@AICoffeeBreak 3 years ago
Holi Day Riddim - Konrad OldMoney. I got it from the YouTube Audio Library, like everything else I use.
@techtam3505 3 years ago
@@AICoffeeBreak ooh I asked that as it sounds similar to Indian music 🎶 😄
@AICoffeeBreak 3 years ago
It sounds African to me (I'm not an expert in either Indian or African music). 😅
@oleschmitter55 9 months ago
I love your videos!
@user-zw5jj2uf1p 1 year ago
The opinions of the academics fluctuate as much as my Fourier series 🙃
@michaelpadilla141 10 months ago
Superb stuff. Thank you!
@AICoffeeBreak 9 months ago
Glad you enjoyed it! Thanks for being here! :)
@justinwhite2725 3 years ago
I also saw the parallel between positional embedding and Fourier transforms (and commented such on that video). I think for my (image generation) projects I'm going to go with a Fourier transform and CNN. Self attention is complex to implement and I'm just running on my home computer.
@AICoffeeBreak 3 years ago
I noticed you were making the parallel already in the FNet video. It's interesting to follow your comments. 😊
@justinwhite2725 3 years ago
@@AICoffeeBreak I try to leave a comment on most videos, because it feeds the algorithm. YouTube's AI is easily manipulated by typing in stuff, because it makes contextual deductions and bases recommendations on them.
@AICoffeeBreak 3 years ago
@@justinwhite2725 It was at this moment that she realized: Justin must have a YT channel himself, being so knowledgeable about the platform. [clicks on his channel] Nice, you play Factorio! Love that game!
@IVIorrill 3 years ago
Why does a positional embedding need to be the same dimension as the input? Surely in theory you can give the positional information with a single additional concatenated dimension by just using the integer position in the sequence for example.
@andybrice2711 7 months ago
This is what I thought. Apparently you can't use an integer position. Because you need some system which works for arbitrarily long text. And you might not even know the length of the text when you first start encoding or decoding it. But yes. I don't see why it needs so many dimensions.
@techtam3505 3 years ago
Can you make a video about top-k sampling, top-p (nucleus) sampling and temperature? These influence an LM a lot in real-world applications. I mean as hyperparameters.
@harumambaru 3 years ago
Could you please tell us a little bit more about adding vs. concatenating in CNNs? All skip connections starting from ResNet have this, mostly by adding, as far as I remember, but in CNNs concatenation doesn't bring too much memory and computation load because there are many other conv layers.
@maxvell77 26 days ago
Thanks!
@AICoffeeBreak 26 days ago
Oh, thank you!
@sucim 1 year ago
I side with concatenated learned positional embeddings, keeping it simple
@thegigasurgeon 1 year ago
How about we concat the index of the word directly as an extra dimension? Let's say we have each word represented by a 512-dim vector. How about we add a 513th dimension which would just be the index of the word in the sentence? Would this pose any problem? We are just adding one dimension.
@andybrice2711 7 months ago
As I understand it: You need to be able to reference words outside the sentence. And you can't use an integer index because the text could be arbitrarily long. And all the later words having large numbers causes mathematical problems. So you want some kind of encoding which goes in a long infinite loop, is continuously variable, and stays between ±1. But yeah, adding it to the semantic vector still seems weird to me. And I don't see why it needs so many dimensions.
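A sketch of what "just concatenate the index" would look like, plus the obvious patch for the unboundedness mentioned in the reply. The shapes and the tanh squashing are my own illustrative choices, not a method from the video.

```python
import torch

tokens = torch.randn(8, 512)                              # 8 tokens, 512-dim embeddings
idx = torch.arange(8, dtype=torch.float32).unsqueeze(1)   # positions 0..7, shape (8, 1)

# Raw index as a 513th dimension: works mechanically, but the feature grows
# without bound with sequence length, which neural nets tend to handle poorly.
raw = torch.cat([tokens, idx], dim=-1)                    # (8, 513)

# Squashing keeps the extra feature in a fixed range, but a single dimension
# still gives the model very little room to express relative offsets.
squashed = torch.cat([tokens, torch.tanh(idx / 100.0)], dim=-1)   # values in (-1, 1)
```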
@killers31337 3 years ago
I wonder if somebody has tried combining the element vector with the positional encoding using a dense layer. E.g. suppose you have a 256-dimensional vector representing an input element and a 256-dimensional positional encoding, then use one dense layer combining the two, yielding a 256-dimensional vector. So it is like adding, but with learned weights.
@alvinphantomhive3794 3 years ago
That is a great idea though. In my view, there are 2 ways to define the input tensor for the dense layer. 1) in: embed_size, which can be done by summing word_embed + positional_embed. 2) in: embed_size * 2, an input shape obtained by concatenating word_embed and positional_embed. Theoretically, the first way is a function f that maps R^n to an arbitrary point in the same space R^n. The second way takes R^2n and projects it into the lower-dimensional space R^n. Which one is best? I would hypothesize the second way (experiments required :) ). But on the other side, I would also argue that if we are using pretrained word embeddings (such as GloVe or word2vec), the dot product (forward pass) with the dense layer weights will mess up the learned semantic meaning, which means pretrained word embeddings will probably not be very helpful with this type of representation. However, we could build a simple experimental setting, if you are interested. ;)
@RajdeepBorgohainRajdeep 2 years ago
Yes, that's one technique used in general self-attention when the dim of the encoder output is not equal to that of the decoder state, and to calculate the similarity score we need to use a dense layer, i.e. the number of attention units. I guess this is somewhat similar, but here you are talking about the same number of dims.
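A sketch of the idea discussed in this thread: concatenate the word and positional embeddings, then let a learned dense layer mix them back down to the model dimension. The names and sizes are placeholders, not something from the video.

```python
import torch
import torch.nn as nn

dim = 256
combine = nn.Linear(2 * dim, dim)        # learned mixing of word + position

word_emb = torch.randn(8, dim)           # 8 tokens
pos_emb = torch.randn(8, dim)            # e.g. a learned or sinusoidal table

mixed = combine(torch.cat([word_emb, pos_emb], dim=-1))   # (8, 256)
# Plain addition is a special case of this layer (weights [I | I]), so the
# learned combination is at least as expressive, just with extra parameters.
```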
@nicolasdufour315 3 years ago
What about Rotary Positional Embeddings, which were recently trending?
@AICoffeeBreak 3 years ago
This is a great suggestion (was looking at this while preparing the video). But these two videos so far have been more about the basics than about the latest and greatest trends. 😁
@flamboyanta4993 1 year ago
"Bert-eyeview" 😂😂😂😂😂😂
@pisoiorfan 2 years ago
Hi, does the concatenated position embedding vector need to be large? I mean, if it is (or can be) much smaller than the word embedding itself, the increase in model size shouldn't be too dramatic.
@AICoffeeBreak 2 years ago
Great observation. It depends on the application: on the sequence length, the dimensionality of the ordering (1D for a sequence like text or sound, but 2D for images, etc.) and the resolution the positional embeddings need to achieve. So the increase is also in the time you need to spend optimizing this hyperparameter.
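A back-of-the-envelope sketch for this thread: how much a small concatenated positional block widens one attention block. The numbers are illustrative assumptions, not measurements.

```python
d_word, d_pos = 768, 64             # hypothetical word and positional widths
d_add = d_word                      # added positions: model width unchanged
d_cat = d_word + d_pos              # concatenated positions: every layer gets wider

def attn_proj_params(d: int) -> int:
    # Q, K, V and output projections of one attention block, ignoring biases.
    return 4 * d * d

growth = attn_proj_params(d_cat) / attn_proj_params(d_add)
print(f"{growth:.2f}x parameters per attention block")   # ~1.17x for these sizes
```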
@minma02262 3 years ago
First! (Yeah, I'm childish)
@anonymousiguana168 1 year ago
Couldn't we just concatenate a single scalar to our original word embedding vector? Just concatenate 1 for the first word, 2 for the second word, 3 for the third, etc. That way we don't increase the number of parameters too much and we keep the semantic and positional information separate. What's wrong with that?
@chrismawata8755 4 months ago
👋👋👋
@AICoffeeBreak 4 ай бұрын
@amirrezamehrabi7240 7 months ago
Could you please explain positional embeddings for dichotomous responses?
@KevinTurner-aka-keturn 7 months ago
Shouldn't the added memory requirements of concatenated positional embeddings be quite modest? Many of these models deal with sequences on the order of hundreds to thousands of tokens. It seems like 16 bits should be plenty to describe positional information within sequences of that length -- only a fraction of the memory allocated to the semantic embedding.
@Mittens101 1 year ago
I have a huge crush on you ❤
@TheFuckel 2 years ago
I like that coffee breaks with mr.poo
@m.shakibhosseinzadeh8415 5 months ago
Thanks!
@AICoffeeBreak 5 months ago
Thanks to you for the visit!