Linformer: Self-Attention with Linear Complexity (Paper Explained)

31,516 views

Yannic Kilcher

Transformers are notoriously resource-intensive because their self-attention mechanism requires memory and compute that grow quadratically with the length of the input sequence. The Linformer model gets around this by exploiting the fact that the information in the attention matrix is often of low rank and can therefore be approximated.
OUTLINE:
0:00 - Intro & Overview
1:40 - The Complexity of Self-Attention
4:50 - Embedding Dimension & Multiple Heads
8:45 - Formal Attention
10:30 - Empirical Investigation into RoBERTa
20:00 - Theorem: Self-Attention is Low Rank
28:10 - Linear Self-Attention Method
36:15 - Theorem: Linear Self-Attention
44:10 - Language Modeling
46:40 - NLP Benchmarks
47:50 - Compute Time & Memory Gains
48:20 - Broader Impact Statement
49:55 - Conclusion
Paper: arxiv.org/abs/2006.04768
Abstract:
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses O(n^2) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n^2) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
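To make the shapes concrete, here is a minimal single-head numpy sketch of the idea (not the authors' implementation): keys and values are compressed along the sequence dimension by fixed projection matrices E and F of shape (k, n), so the attention map is n x k instead of n x n. Initialization, scaling, and the multi-head / parameter-sharing details are illustrative assumptions here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d); E, F: (k, n). Returns an (n, d) context matrix."""
    K_proj = E @ K                                  # (k, d): n keys compressed to k
    V_proj = F @ V                                  # (k, d): n values compressed to k
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])    # (n, k) instead of (n, n)
    P = softmax(scores, axis=-1)                    # row-wise attention weights
    return P @ V_proj                               # (n, d)

n, d, k = 4096, 64, 128
rng = np.random.default_rng(0)
Q, K, V = [rng.standard_normal((n, d)) for _ in range(3)]
E = rng.standard_normal((k, n)) / np.sqrt(n)        # fixed random projections (illustrative scaling)
F = rng.standard_normal((k, n)) / np.sqrt(n)
out = linformer_attention(Q, K, V, E, F)            # time/memory scale with n*k, not n^2
print(out.shape)                                    # (4096, 64)
```

Making E and F learnable (as discussed in the comments below) only changes whether they are treated as parameters; the O(nk) shape argument stays the same.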
Authors: Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma
Links:
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher

Comments: 78
@mafaldahopfkirch299 3 years ago
It is just amazing how you manage to explain and still fit in a bit of humor. Thank you!
@MrVaunorage 4 years ago
4:30 best explanation of transformer ever! I finally understood the intuition behind it :)
@florianhonicke5448 4 years ago
I like that you explain it in your own words and give examples
@RaivoKoot 4 years ago
This paper is really impressive. Your explanation was incredibly good and much needed, as this was of course a more mathematical paper. Thank you. I literally watch all of your paper explanations. They're great.
@gilshapira3498 3 years ago
Thanks Yannic for the great value you add to the ML community by clearly distilling these interesting papers.
@jg9193 3 years ago
I love the combination of code plus paper explanation. Great work
@DavenH 4 years ago
Masterful presentation. Thank you.
@samjoel4152 4 years ago
Wow you're fassssssstttttt..... Yannick: I am SPEED....
@ProfessionalTycoons 3 years ago
thank you for covering this paper, much appreciated!
@beepbuupbuupbeep 3 years ago
That review was fantastic!
@siyn007 4 years ago
Great explanation, easy to follow
@ChuanChihChou 4 years ago
Now I really wonder what would happen if we make those E & F projection matrices trainable...
@adamtran5747 2 years ago
this is a very good video. Please keep up the good work. I love this content.
@SpicyMelonYT 3 years ago
I'm watching this video with such interest and I'm constantly thinking about the information here, but then I see the word cumsum and can't stop laughing. I feel like I'm two people sometimes haha
@jasdeepsinghgrover2470 4 years ago
Really amazing video!!
@neetikapanwar8866 4 years ago
Nice explanation, very helpful. Thanks :-)
@Anirudh-cf3oc 2 years ago
This explanation saved me a lot of time. Amazing work!!
@bluel1ng 4 years ago
This time it felt a bit more like a lecture. Especially cool were the low-rank numpy demos! The dimension reduction from 64k to 128 seems quite drastic to me; I am not immediately convinced that the "granularity" of the attention won't be significantly affected. But this is my initial (naive) thought without looking into the paper. Anyway, this is definitely an important and very practical addition to the attention toolbox. I am curious whether they did the math first or just tried the random dim-reduction and then went out to explain it. ;-)
@YannicKilcher 4 years ago
Yea good point. I guess that's stuff you'd ask at the poster presentation.
@jinlaizhang312 4 years ago
awesome!
@mookee 4 years ago
Yannic, nice investigation into why the rank is low :D I wonder if this will replace the transformer, since the Reformer had a similar improvement but apparently nobody really cared enough to switch; GPT-3 still has the O(n^2) dot products.
@isaackod 4 years ago
They had an argument in the paper that the reformer was "only more efficient than the vanilla transformer when sequence length is extremely long.", so perhaps this new method will gain more momentum? Certainly excited as this has a lot going for it in terms of both improving training compute and inference speed.
@berin4427 3 years ago
Thanks a lot for the great video(s)! I do think the projection matrices are trained (not pre-fixed) though, given they can be shared among different layers
@srinathtankasala a year ago
Great video and I love your channel. I have 3 questions on this: How does padding and masking work in this case? Since E and F depend on the sequence length (n), wouldn't this make the Linformer a fixed-sequence-length encoder-decoder? And how would this model work at inference time when you are doing greedy decoding of the output? For example, say you generate the input embeddings from the encoder and then start with a token at the decoder. In this case the linear attention in the decoder cannot handle a single output, as it can't project a single output embedding into the k-dimensional space: the decoder's E, F map from n to k dimensions, so they can't be multiplied with an n=1 output embedding. Any help is appreciated.
@convolvr 4 years ago
11:30 if you order the eigenvalues by magnitude, high rank matrices have a slower decaying curve? Couldn't you use a diagonal matrix (with eigenvalues sorted down the diagonal) to produce a full rank matrix with whatever curve you want?
@YannicKilcher 4 years ago
Very true. It's more like the intrinsic dimensionality of the data rather than the rank.
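A small numpy experiment in the spirit of the low-rank demos from the video (not the paper's exact setup): an attention matrix built from low-dimensional Q and K concentrates far more of its spectral energy in its top singular values than a generic random matrix of the same size.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

n, d = 512, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

P = softmax(Q @ K.T / np.sqrt(d))                 # n x n attention matrix; the scores have rank <= d
R = rng.standard_normal((n, n)) / np.sqrt(n)      # generic (effectively full-rank) comparison matrix

s_P = np.linalg.svd(P, compute_uv=False)
s_R = np.linalg.svd(R, compute_uv=False)

top = 128                                         # spectral energy captured by the top 128 singular values
print("attention:", (s_P[:top] ** 2).sum() / (s_P ** 2).sum())
print("random:   ", (s_R[:top] ** 2).sum() / (s_R ** 2).sum())
```

The exact numbers depend on how Q and K are distributed, which is exactly the "intrinsic dimensionality" point above.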
@akimtsvigun8783 4 years ago
Yannic, thanks for your amazing video. I would like to ask about the matrix ranks: you say many things in the authors' proof come from the fact that n (sequence length) is much larger than d (hidden dimensionality). However, d usually equals 512 in BERT-based models, while it is difficult to imagine a sequence length of 512 (and one that is "much larger than 512" is even harder to imagine). Could you please clarify this?
@YannicKilcher 4 years ago
512 is actually pretty limited, if you think of entire paragraphs / chapters / websites, etc. especially since the text is tokenized to word pieces, not entire words. Plus the d parameter usually gets sub-divided into multiple heads and ends up being more like 64 or 32.
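A rough back-of-the-envelope version of this reply, with numbers chosen to match the thread (d = 512 split into 8 heads) rather than any particular model:

```python
d_model, num_heads = 512, 8
d_head = d_model // num_heads                  # 64 dimensions per head
for n in (512, 4096, 65536):                   # paragraph-, chapter-, and book-scale sequence lengths
    print(f"n = {n:6d}   n / d_head = {n / d_head:8.1f}")
```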
@shaz7163 2 years ago
Hi, amazing presentation and lecture :). Btw, I think the projection matrices are made of learnable parameters and they do not use SVD.
@charlesfoster6326 4 years ago
Another great video! Thank you for this service to the community. :) One point I wasn't clear on: should we use a different set of random projection matrices for each input sequence (as in Reformer), or are they proposing they can be fixed?
@YannicKilcher 4 years ago
I think they can just be fixed once. They even experiment with having the same projection throughout the entire network.
@beepbuupbuupbeep 3 years ago
I wonder what would happen if one learned the projection matrices. The backprop should automatically adjust them to retain maximal information, right? So we might learn better projection matrices than the precomputed ones.
@YannicKilcher 3 years ago
True, might be worth a try
@TheThirdLieberkind 4 years ago
could this be thought of as a kind of static dropout system, but with essentially no loss of data between the layers?
@YannicKilcher 4 years ago
*approximately* no loss of data ;)
@bosepukur 4 years ago
Thanks for this... Are you saying they are just projecting the self-attention matrix to a lower dimension and claiming that this projection is possible because it's self-attention? While in reality any matrix can be properly projected while keeping the original distance constraints?
@YannicKilcher 4 years ago
yes, but if you do it with any matrix then the guarantees depend on the original dimension, whereas here they depend on the rank, which is the hidden dimension.
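A hedged illustration of the Johnson-Lindenstrauss-style argument this exchange refers to (a toy demo, not the paper's proof): random Gaussian projections approximately preserve pairwise distances, with the distortion controlled by the target dimension k.

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_points, k = 4096, 50, 256                  # original dimension, number of points, target dimension

X = rng.standard_normal((num_points, n))          # points living in R^n
R = rng.standard_normal((n, k)) / np.sqrt(k)      # JL-style random projection
Y = X @ R                                         # projected points in R^k

def pairwise_dists(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

mask = ~np.eye(num_points, dtype=bool)            # ignore zero self-distances
ratios = pairwise_dists(Y)[mask] / pairwise_dists(X)[mask]
print(f"distance ratios: min {ratios.min():.3f}, max {ratios.max():.3f}")  # near 1; shrinking k widens the spread
```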
@dwrety 4 years ago
I love you Yannic
@hannesstark5024 4 years ago
Is that superior in every way to the binning part of the Reformer?
@YannicKilcher 4 years ago
Probably each has its pros and cons.
@andreasv9472 4 years ago
Thanks Yannic. Seems interesting. If possible, please help out an AI math-illiterate: what do big and small theta represent in these theorems? In 5Θ(log(n)/epsilon^2) in equation (8), for example.
@YannicKilcher 4 years ago
They are various bounds on complexity; look up "big O notation" on Wikipedia.
@andreasv9472 4 years ago
@@YannicKilcher Thank you for the quick answer! I'm aware of big O from calculus, I just didn't know it was also written as Theta. Awesome!
@siyn007 4 years ago
@@andreasv9472 Theta means that the function inside the theta, times suitable constants, serves as both a lower bound and an upper bound on the function in question.
@Xaelum 4 years ago
I was going to ask for a video about the Linformer, but you are way too fast. Great job man!
@MatsErikPistol 4 years ago
Thanks for the video. Would it be possible to learn the projection? You allude to it in the video.
@YannicKilcher 4 years ago
I guess, but you'd lose all the good properties.
@howardkong8927 2 years ago
I'm not very convinced... As I understand it, what this paper is saying is essentially this: if I say a very long sentence of n words, where each word carries an information vector of constant length d, you can compress the information into a matrix of size O(d log(d)) regardless of n. In other words, if I say a long enough sentence, most of the words are going to be completely gibberish and can be safely ignored?
@souradipchakraborty7071 4 years ago
One question I have: is there any logical/theoretical reason why the softmax delays that stability point? One thing that comes to my mind is the reduction in variance that it causes, and possibly that is the reason, but I'm not sure.
@YannicKilcher 4 years ago
Good question, I have no idea.
@souradipchakraborty7071 4 years ago
@@YannicKilcher Hahaha, true. Actually, I am doing some research on this. The interesting thing is that the JL lemma doesn't say anything about low-rank matrices; its claim holds for any general matrix. I don't know how the JL lemma is relevant in this case either.
@shivamraisharma1474 3 years ago
@@souradipchakraborty7071 They probably just saw empirically that the data doesn't need to be that high-dimensional and used the JL lemma to project it into a lower dimension. I naively think this paper was just extensive work to make a simple intuition work.
@mathematicalninja2756 4 years ago
This is more like using SVD on the routing matrix. Sure, you can use this in production to reduce the model size, but you will have to train it in an O(n^2) way, I think.
@charlesfoster6326 4 years ago
You won't. Training is also O(nk). So long as you use a small k, it's effectively linear in time and space complexity with respect to the sequence length.
@Kerrosene 4 years ago
Hi Yannic, one quick question: how long did it take you to read and make sense of this paper? And how long, in general, would it take you to read a paper like this (fairly mathematical)?
@YannicKilcher 4 years ago
This one I was quite confused by, it took me multiple hours and a night's sleep.
@scottmiller2591 4 years ago
That future Yannic is pretty smart.
@YannicKilcher 4 years ago
I gave him my number. xoxo
@stefans.8027 4 years ago
I like your videos going into detail and helping to understand the papers. But: you might know 2minutepapers... those are really nice for getting a short look at the key point, and I can then read further into the papers myself. What do you think about some short episodes that just glance at the papers, and other episodes that go more into detail? View counts and comments will tell you how to balance it. Keep it up!
@YannicKilcher 4 years ago
yea I feel there are already many channels that do that and I'm not good at keeping it short :)
@sucim 3 years ago
@@YannicKilcher The kids want long form content! ;) (approximate credit goes to Lex Fridman)
@taylorchu 4 years ago
what do they do with the padding mask?
@YannicKilcher 4 years ago
Good question. Don't know.
@herp_derpingson 4 years ago
There is no such thing as a free lunch. The big O notation ignores the constants. In practice, we might find that for some problems the rank is not just ln(n) but c * ln(n) where c is very close to n. Also, this getting rid of higher dimensions reminds me of SqueezeNet.

Regardless, I think this will be used a lot on mobile devices in the future.

Very honest Broader Impact Statement. If this gets accepted, I think all papers in the future will simply write "We see no immediate negative effects beyond what applies to core machine learning" and be done with it.
@YannicKilcher 4 years ago
That's true, but in the ideal case where the matrix is truly low-rank, there is actual speed-up to be gained for free. It's like there are different ways to implement the same algorithm and some are just faster.
@herp_derpingson 4 years ago
I am excited to see what Broader Impact Statements other authors are able to conjure.
@safoorayousefi3814 3 years ago
Keyeries? ;)
@ghostlv4030 4 years ago
So, it only supports fixed-length input, right?
@YannicKilcher 4 years ago
Good question. I'll guess yes, but you can always pad probably
@ghostlv4030 4 years ago
@@YannicKilcher Thanks! Yannic.
@oguzhanercan4701 2 years ago
I think PVT uses this methodology. The paper is not really clear about the spatial reduction.
@etopowertwon 11 months ago
And they haven't used it in either version of Llama. Sad.
@boss91ssod 3 years ago
still only 40K subscribers (?)
@woshikakadong a year ago
this paper is so janky