Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention (Paper Explained)

Рет қаралды 26,157

Күн бұрын

Пікірлер: 82

@dingleberriesify 4 жыл бұрын

The one thing that irks me is kernels keep being referred to as "mapping to infinite dimensional space" in these vids. While true, it's kind of unhelpful. It's probably more helpful to think of it as mapping to a (potentially infinite) space where the (potentially infinite) bases aren't restricted to be vectors. The whole trick about the Hilbert space is it's just a space with a well-defined inner product, and that inner product can be between functions etc. You could have a function mapping to a space where the bases are the sin and cosine functions, for example, and you use that space to take the inner product between different sine/cosine wave combinations. Typically the kind of space you want to map to will express something about the problem you're trying to solve. I bring this up because I've seen kernels mentioned twice now, and both times the explanations are kind of lacking. Not Yannic's fault, I was lucky enough to spend a bunch of time around kernel gurus that cleared up a bunch of misconceptions for me...before that, I was making similar mistakes. Content otherwise very solid, as usual!

@YannicKilcher 4 жыл бұрын

Thank you :)

@zhonghuahe3556 3 жыл бұрын

Hi, can you recommend some learning materials about "kernels"

@brandomiranda6703 2 жыл бұрын

Mit’s 9.520 with Lorenzo Rosasco ;)

@CosmiaNebula 2 жыл бұрын

This is a note for the mathematicians: When people say "vector" in computer science, they mean "an array with finitely many entries". In other words, they mean a function from {1, 2, ..., n} to some set X. This is a severe generalization/weakening of the typical vectors that we mean, where a vector lives in a vector space, where there are all kins of symmetries (the group GL(n, K)). So when Alex Stenlake says "they aren't restricted to be vectors", he meant "they aren't restricted to be a finite-entry array". They are in fact, *exactly* what we mean when we say "vectors in a countable, separable vector space", in other words, any vector space we use in functional analysis. So as we see, this is a severe strengthening of a severe weakening (first, the CS scientist weakens the mathematician's idea of finite dimensional vector space, then they strengthen it to return to the mathematician's idea), which is why Alex had to point it out.

@andrewgjkgjk Ай бұрын

This probably comes from the idea that, practically, these bases are computed via an infinite polynomial series that can only be approximated in compute using techniques like random fourier sampling, no?

@NicheAsQuiche 4 жыл бұрын

I love that you go over and explain the topics each time you introduce them (e.g. transformers) even though you go over them in other videos it's so useful to me to hear different explanations of it and realy clrafiy what's going on inside. Really helps me get a better understanding.

@florianhonicke5448 4 жыл бұрын

Haha I was writing almost the same

@williamtepe4167 4 жыл бұрын

Me too, I third this ^

@florianhonicke5448 4 жыл бұрын

I like that you explain some stuff again in different videos. It makes sure everyone is on track and if someone would be annoyed by the redundance the person could just fast forward. To me it was very helpful to get multiple explanations about transformers

@utku_yucel 4 жыл бұрын

the idea of Explaining the papers is awesome! Thanks!

@snippletrap 4 жыл бұрын

It's a big part of grad school.

@andyblue953 4 жыл бұрын

Great explanation. I really like the way you break down publications to the most essential parts. Never was "reading" articles that comfortable.

@fotisj321 4 жыл бұрын

Thanks for making a difficult paper accessible. I especially like it that there a red threads in you selection of papers, which create almost something like a story-line, for example the one starting with the paper on the transformer architecture and the follow-up papers trying to remedy some of its shortcomings but keep its achievements.

@OwenCampbellMoore 4 жыл бұрын

These are amazing, love the depth and really appreciate the amount of content! I know you’ve said you’re not planning on it, but I’d be super happy to support on Patreon if that helps support you in making so much content!

@good_user35 4 жыл бұрын

At first, I thought your videos are a bit lengthy, but after watching this now I feel that I woudn't be able to understand the contents without sacrficing such an amount of time. And what you point out on the paper similarly tells me that the cost of softmax function is the needed price considering that the function can do in an unknown task. Thanks, it was good watching.

@YannicKilcher 4 жыл бұрын

also, I suggest just changing the playback speed to 2x if it's too long :)

@sitioprueba2855 2 жыл бұрын

Thank you so much for all this great explinations!

@darkmythos4457 4 жыл бұрын

Thanks Yannic, soon enough you will be one of the requirements for ML jobs; Published at Neurips, ICML, and Yannic YT Channel.

@YannicKilcher 4 жыл бұрын

Haha :D problem is I'm very corrupt so idk how that's going to work out ;)

@davidsabatini9100 2 жыл бұрын

A really good video! I found it super informative, thank you!

@veedrac 4 жыл бұрын

The natural question I have after this paper: why not both? Test a model with X% compute allocated towards ‘real’ attention and 1-X% towards linear attention, and see what's best.

@GeekProdigyGuy 4 жыл бұрын

You can always ensemble any two models, but unless they provide independent contributions, there isn't really any point. Plus in this case their proposed model's main advantage is reduced compute/memory, so it would seem entirely counterproductive to ensemble with plain attention.

@veedrac 4 жыл бұрын

@@GeekProdigyGuy The idea is that traditional attention is high quality, but struggles at reaching a large context. Clearly that high quality attention is necessary on short scales, but only a small reduction in the traditional attention would be needed to allow for a linear attention that reached very large context sizes. Assuming the far context needs less character-level detail (same thesis as Compressive Transformers), the lossier linear attention should suffice there.

@lucidraisin 4 жыл бұрын

Veedrac - I actually explore that here github.com/lucidrains/linear-attention-transformer

@that_guy4690 4 жыл бұрын

Your videos are pretty long, but they are really not long enough :). Each time the video ends, I get a feeling that I need an even deeper explanation e.g. their code review

@raminrasoulinezhad 4 жыл бұрын

Cool description. Thanks

@daryoushmehrtash7601 3 жыл бұрын

I appreciate your comment about different neural network architecture as different routing of information.

@tomw4688 3 жыл бұрын

Good shit! This video was very helpful and many ELI5 moments useful for students. Thanks!

@snippletrap 4 жыл бұрын

Re: causal masking implementation @38:00, input[i] != target[i] because the target is shifted by one before being passed into the model.

@noreddine 3 жыл бұрын

Yannic: just lett me think what you think. Me: I think you're awesome

@DanielHesslow 4 жыл бұрын

This is super cool and I think we will see the performance improve down the line as people explore other feature mappings, elu + 1 seems a biiit arbitrary

@samm9840 4 жыл бұрын

How about applying the kernel (O(n)) trick at inferencing alone after training? I am especially interested whether the DETR model (Yannic already did a video on that), that uses the heavy-weight transformer be modified to do object detection simply with this linear version! May be it's an idea for a new paper. Looking forward to it.

@muneebth5508 4 жыл бұрын

Sir, Thank You

@myungchulkang5716 4 жыл бұрын

thank you :)

@andyblue953 4 жыл бұрын

I was wondering to which extend their idea is related to A² double-net attention? In the latter, the outer product is likewise separated into a gathering step and a distribution step, and it's likewise shown to be on par with the original transformer. Anyway, the general mathematical idea can already be found in the SVD.

@welcomeaioverlords 5 ай бұрын

I'm very late, but well done. Good call out on the RNN caveat! I missed that. And one correction (or I misunderstand something): it seems that we're not actually going to a higher dimensional space through phi, but rather applying a point-wise non-linearity and maintaining the same dimensionality. Any comments on that choice? That's seemingly contrary to the point of the kernel trick as I understood it. Why not do the typical DL thing and simply make phi a learnable function?

@rayaengazou2140 3 ай бұрын

Hey! that's greate !! i would like to ask about the model he used to produce results in the paper ?in github there was only the library not the model ! can you share with us the model ?

@MrMIB983 4 жыл бұрын

Love it! Universal transformer please!

@mookee 4 жыл бұрын

Cool charts at 26:20. Interesting that Reformer is slower than regular transformer at shorter sequences. You would't know that just from asymptotic time complexity. Looks like the paper conveniently omits that :) Linear attention seems super fast tho. Also it's probably obvious, but two times higher slope on a log chart corresponds to a square of a function on a linear chart.

@YannicKilcher 4 жыл бұрын

Yep, I'm just a moron when it comes to non-linear charts :D

@Bencurlis 3 жыл бұрын

The generalized formulation of equation 3 looks like Bayes formula, has anyone noticed? Is it accidental or is there a deeper meaning here?

@sajjadayobi688 3 жыл бұрын

thank Yannic, we saw a few papers for transformers in long sequences like ReFormer, LinFormer, LongFormer but which one of them is really well in real-world if you make a video about comparing these papers for faster or longer Transformer I would be very thankful

@ShokhanBirlikov 4 жыл бұрын

Your intuition at 31:40 is great! Maybe a silly question, but is it a possible architecture where the transformation function phi is learned as Nx(#of intermediate units) matrix of trainable parameters?

@YannicKilcher 4 жыл бұрын

sure, I don't see why not

@Erotemic 4 жыл бұрын

That is quite an impressive speed increase, and the ability to perform RNN like inference is incredibly useful. I wish they investigated other kernel functions (or maybe they did and I missed it), but I'd be interested in seeing how the choice of kernel impacts the results. Am I understanding the RNN autoregressive-transformer equivalence correctly? Is it the case that translation transformer models are not RNNs?

@YannicKilcher 4 жыл бұрын

only transformers that are using the specific causal masking are RNNs

@Kerrosene 4 жыл бұрын

I seems like the closer we are to approximating the soft-max operator with a finite dimensional feature map, the better the performance of this linear transformer would be...which is what they suggested in the conclusion..Also, they don't use positional encodings..

@mattiasfagerlund 4 жыл бұрын

Hi - is the reason that the softmax version is O(N^2) that softmax (in the general case) has N^2 gradients before they're collapsed back? Or is there something else going on?

@YannicKilcher 4 жыл бұрын

It's because you need to do N^2 inner products. Maybe if the softmax wasn't there, you could somehow optimize that away, so I think you do have a point

@kiyoonyoo4761 4 жыл бұрын

Thanks Yannic for the explanation. I was actually having some difficulty understanding how they used the associative property to transition from Eq. 4 -> Eq.5. Glad to see you've already made a video! This wasn't explained in detail on the video also, maybe because it is something trivial, but am I the only one who cannot get around why the two formulations (Eq.4 and Eq.5) are equivalent? The dimensions of the numerators definitely are not the same as the former is a linear combinations of column vectors, while the latter is a row vector multiplied by square matrix. Also, associative property does not seem to be true in general for vector multiplication involving vector dot products. Any help or explanation from anyone would be very appreciated :0

@YannicKilcher 4 жыл бұрын

Sometimes people are very loose with row/column vectors and don't properly apply transposes. Can you find a counter-example where the equation doesn't hold (even when glossing over transpose issues)?

@norik1616 3 жыл бұрын

@@YannicKilcher I had hard times with the indexes as well, so I wrote it in numpy: It does not help with transpositions (as numpy matches them *mostly correctly* automatically), but it helps with where to put which product - the equation holds true. import numpy as np shape = 5, 7 K = np.random.rand(*shape) Q = np.random.rand(*shape) V = np.random.rand(*shape) classic_attention = np.matmul(np.matmul(Q, K.T), V) assert classic_attention.shape == shape # stripped down classic attention res = [] for i in range(shape[0]): res.append(sum([np.matmul(Q[i], K[j]) * V[j] for j in range(shape[0])])) matmul_byhand_attention = np.array(res) print(np.allclose(classic_attention, matmul_byhand_attention)) res = [] for i in range(shape[0]): s = sum([np.outer(K[j], V[j]) for j in range(shape[0])]) # associativity and distributivity res.append(np.matmul(Q[i], s)) matmul_byhand_attention2 = np.array(res) print(np.allclose(classic_attention, matmul_byhand_attention2))

@RuminRoman 4 жыл бұрын

I wonder why you do not have a video about Universal Transformers

@herp_derpingson 4 жыл бұрын

12:33 If Adding all the K[i] terms means that there is no way for the transformer to know the position of words in the sentence. Maybe the positional embeddings are helping it somehow. It would be interesting to see the performance of this transformer without the positional embeddings. Then again, RNNs dont have any explicit mechanism to store positions of words.

@YannicKilcher 4 жыл бұрын

True, the positional embeddings probably do a lot of work here. RNNs don't have explicit position, but they can theoretically learn to count the number of input tokens - or what's probably more relevant - learn to estimate distances between tokens, which is another thing that position embeddings in transformers are good for.

@bhuvanb8156 4 жыл бұрын

at 14:17 numerator and denominator are same except multiplying with Vj, how's is that possible

@YannicKilcher 4 жыл бұрын

it characterizes a normalized distribution over V

@couragefox 4 жыл бұрын

commenting for algo. love your channel

@lucasnestler6933 4 жыл бұрын

As the paper states that transformers attending to one side only are equal to RNNs, doesn't it imply that full transformers are equal to residual bidirectional RNNs?

@YannicKilcher 4 жыл бұрын

Only in an abstract sense. The computations are really different from a bidirectional RNN

@zhangcx93 4 жыл бұрын

According to there code, the main computation is like this: KV = torch.einsum("nshd,nshm->nhmd", K, values) Z = 1/(torch.einsum("nlhd,nhd->nlh", Q, K.sum(dim=1))+self.eps) V = torch.einsum("nlhd,nhmd,nlh->nlhm", Q, KV, Z) can someone translate this einsum into a normal pytorch way?

@eelcohoogendoorn8044 4 жыл бұрын

Better off just reading up on einsum and getting familiar with it. it is by far the most readable and intuitive way of expressing these statements; and I say that as a numpy veteran of 15 years.

@YannicKilcher 4 жыл бұрын

yea the problem is that would translate to multiple matmuls, elementwise products, transposes and summations :D I also suggest you dive into einsum notation

@markdaoust4598 3 жыл бұрын

I came to ask “what’s the einsum for 21:03.”. Thanks! So.. dropping the “batch” and “head” indices it’s: Vld = sum_sd Qld*Ksd*Vsm / sum_sd Qld Ksd

@andres_pq 4 жыл бұрын

So in practice you could stack way more tranformer layers and have better performance?

@YannicKilcher 4 жыл бұрын

that's always the goal :D

@RobotProctor 3 жыл бұрын

If Q.K is usually the highest for the same token, I wonder why Q and K need to be different at all for a transformed to work properly. Why not just output a K and a V?

@YannicKilcher 3 жыл бұрын

interesting idea. some papers fuse k and v, but I've never heard of fusing k and q

@minhoryu2786 4 жыл бұрын

Can anyone summarize difference between linformer and this paper?

@meerkatj9363 4 жыл бұрын

Linformer has a linear complexity thanks to a constant dimension of the keys and values because they are projected in a fixed lower dimension. Whereas this does not fix the dimension of the sequence, it only reorders the computation. The reordering can only be made with conditions that need the kernel thing.

@baqirhusain5652 2 жыл бұрын

The complexity works well when N > D^2. So the paper is saying if dimensionality is 512 my sequence length should be greater than 512^2 = 262144. Is this really practical !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

@seraphim9723 4 жыл бұрын

Wait a second: Figure 5c is showing PER over wall-clock time, and despite the significant speed-up the original softmax transformer is better AND faster (at reaching any given error rate).

@YannicKilcher 4 жыл бұрын

Yes, that's because on this task it seems to be very important to have the quadratic attention

@siyn007 4 жыл бұрын

speed over accuracy for me tbh. Not everyone has Google's resources to run it on 112312412312423 TPUs

@YannicKilcher 4 жыл бұрын

That's a lot of damage. I mean TPUs

@first-thoughtgiver-of-will2456 3 жыл бұрын

If I was rich I'd give you money.

@charudattamanwatkar8340 4 жыл бұрын

Who else saw balls in the thumbnail?

@herp_derpingson 4 жыл бұрын

Now, that I see it. I cannot unsee it.

@dmitrysamoylenko6775 4 жыл бұрын

Now we are all see the message

@joery8290 4 жыл бұрын

I love your videos, but could you try to make them more compact? 50 minutes is a big time investment. I feel like most papers can be explained in high detail in half that time.