FNet: Mixing Tokens with Fourier Transforms (Machine Learning Research Paper Explained)

  28,584 views

Yannic Kilcher

#fnet #attention #fourier
Do we even need Attention? FNets completely drop the Attention mechanism in favor of a simple Fourier transform. They perform almost as well as Transformers, while drastically reducing parameter count, as well as compute and memory requirements. This highlights that a good token mixing heuristic could be as valuable as a learned attention matrix.
OUTLINE:
0:00 - Intro & Overview
0:45 - Giving up on Attention
5:00 - FNet Architecture
9:00 - Going deeper into the Fourier Transform
11:20 - The Importance of Mixing
22:20 - Experimental Results
33:00 - Conclusions & Comments
Paper: arxiv.org/abs/2105.03824
ADDENDUM:
Of course, I completely forgot to discuss the connection between Fourier transforms and Convolutions, and that this might be interpreted as convolutions with very large kernels.
Abstract:
We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with simple nonlinearities in feed-forward layers, are sufficient to model semantic relationships in several text classification tasks. Perhaps most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, but pre-trains and runs up to seven times faster on GPUs and twice as fast on TPUs. The resulting model, which we name FNet, scales very efficiently to long inputs, matching the accuracy of the most accurate "efficient" Transformers on the Long Range Arena benchmark, but training and running faster across all sequence lengths on GPUs and relatively shorter sequence lengths on TPUs. Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
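
For readers who want to see the shape of the idea, here is a minimal NumPy sketch (not the authors' code) of the mixing sublayer the abstract describes: a 2D discrete Fourier transform over the sequence and hidden dimensions, keeping only the real part. The full FNet block additionally wraps this in residual connections, layer norm, and a feed-forward layer, which are omitted here.

```python
# Minimal sketch of the FNet mixing sublayer: a parameter-free 2D DFT
# (hidden dimension, then sequence dimension), keeping only the real part.
import numpy as np

def fnet_mixing(x: np.ndarray) -> np.ndarray:
    """x has shape (seq_len, hidden_dim); no learned parameters are involved."""
    # DFT along the hidden dimension, then along the sequence dimension,
    # then discard the imaginary component.
    return np.fft.fft(np.fft.fft(x, axis=-1), axis=0).real

x = np.random.randn(128, 768)   # a toy block of token embeddings
mixed = fnet_mixing(x)          # same shape, every token now "sees" every other token
print(mixed.shape)              # (128, 768)
```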
Authors: James Lee-Thorp, Joshua Ainslie, Ilya Eckstein, Santiago Ontanon
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
KZfaq: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 100
@aspergale9836 · 3 years ago
4:30 - Which papers did you have in mind here?
@emilzakirov5173 · 3 years ago
@aspergale9836 He mentions some of them later. AFAIK it's probably the "MLP-Mixer", "Pay Attention to MLPs", and "ResMLP" papers.
@RobertWeikel · 3 years ago
KISS always wins.
@bediosoro7786 · 3 years ago
Are they saying that position encoding was useless? I'm not seeing that part in the paper. Can someone explain what I missed?
@ibrahimaba8966 · 2 years ago
Hello Yannic, what do you think about mixing strategies like a Fourier transform for one layer and attention for the next layer, and so on?
@samernoureddine · 3 years ago
"I know I'm a bit late with this one" - paper hasn't even been up for two weeks
@ChaiTimeDataScience · 3 years ago
Thank you Yannic. I know what I'm watching for Friday evening now, no more scrolling! :D
@julianke455 · 3 years ago
Your videos are so valuable. Seriously, I've learned way more watching your explanations than listening to my profs at college. Thank you Yannic!
@antraxuran9 · 3 years ago
@17:20 You need the imaginary part to do the reverse, it contains the phase information!
@Sam-pd1be · 3 years ago
Yeah, I don't know why the authors talk about the reverse when they drop the imaginary part. That fact makes me wonder just how relevant the Fourier transform's properties could possibly be to these results. I feel like the main reason to use it might be that we have readily available fast implementations.
@carlossegura403 · 3 years ago
I was getting ready to read the paper, but then I said, "oh, I'll just wait for Yannic to explain it." Thank you!
@galchinsky · 3 years ago
I'm a guy with DSP roots, and this paper causes a lot of cringe. There was an elephant in the room (the convolution theorem) and it was totally missed. The FFT gives you circular convolution, which makes little sense in an NLP context, so a natural way to improve results would be to try padding the values. Also, cuDNN already performs FFTs for large kernels, breaking news. But that's not the interesting part. The only interesting part is taking the real part of the FFT output. That is quite an exotic type of non-linearity: resetting the phase. I wonder if it's only additional noise, or really adds something new. Instead of this, there are soulless words about "mixing tokens". I hope LeCun will kick their ass not only on Twitter, because these "convolutions in a mustache" things are starting to be frightening.
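For reference, the convolution theorem this comment invokes can be checked in a few lines (an illustrative sketch, not code from the paper): pointwise multiplication in the DFT domain equals circular convolution in the original domain.

```python
# Numerical check of the convolution theorem: multiplying spectra pointwise
# is the same as circularly convolving the signals. Names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)   # a toy 1D "token" signal
k = rng.standard_normal(8)   # a toy kernel spanning the whole sequence

# Circular convolution computed directly from its definition.
direct = np.array([sum(x[m] * k[(n - m) % 8] for m in range(8)) for n in range(8)])

# Same thing via the FFT: multiply spectra, transform back.
via_fft = np.fft.ifft(np.fft.fft(x) * np.fft.fft(k)).real

print(np.allclose(direct, via_fft))  # True
```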
@narsimhar.chilkuri7290 · 3 years ago
so true lol
@narsimhar.chilkuri7290 · 3 years ago
Although I must say that having the kernel weights analytically defined allows us to "see" the whole sequence in just one layer. This is, I believe, different from how CNN architectures do it, i.e., they use small kernels but very deep architectures to get a large receptive field.
@aspergale9836 · 3 years ago
Doesn't feel like LeCun would be too opposed. This seems to be the style of ML papers that's generally accepted nowadays. More rigorous works are harder to read and apply, unless they come with a ready-to-use, open source framework. So the easier incremental works end up taking the spotlight, more often than not. I didn't have that many reservations about it when reading it, but one especially stood out when you mentioned it: "there are soulless words like 'mixing tokens'". That ties back to my "unsubstantiated" claim on lack of rigor in most well-publicized papers recently.
@aniruddhadatta925 · 3 years ago
I agree with you my comrade
@alexeifigueroa · 3 years ago
Same cringe here, I wonder if they tried first ABS() then RE() and settled for the best number, plus this intuition of multilayered FTs inverting each other in the sense of "duality" sounds plain wrong to me
@akimtsvigun8783 · 3 years ago
Thank you, Yannic! Amazing video.
@andrewgrebenisan6141 · 3 years ago
I read this last weekend! Really interesting work
@kirillnovik8661 · 3 years ago
Wow, you make reading papers entertaining 👏
@machinelearningdojowithtim2898 · 3 years ago
I was intrigued by Yannic's comments about the existence of a universal attention (or information-routing) matrix. Try visualising the DFT matrix; it has a cool repeating circular pattern. The most interesting thing here is that results which are not (that much) worse can be obtained this way, and that such regularities of information transfer exist in language. We can almost certainly learn a better universal matrix, but then we lose the FFT efficiency boost. Why don't we do something with neural program synthesis in this attention layer? We could learn a replacement for the FFT! Great video as always
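For anyone who wants to look at that matrix, here is a small sketch (illustrative, not from the paper) that builds the DFT matrix explicitly, i.e., the fixed, input-independent "mixing matrix" the Fourier layer applies along the sequence dimension.

```python
# Build the DFT matrix W and confirm that applying it equals taking the DFT.
import numpy as np

N = 64
n = np.arange(N)
W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # W[j, k] = exp(-2*pi*i*j*k/N)

x = np.random.randn(N)
print(np.allclose(W @ x, np.fft.fft(x)))       # True

# W.real shows the repeating circular pattern mentioned in the comment,
# e.g. visualize it with matplotlib: plt.imshow(W.real)
```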
@sangramgaikwad7663 · 3 years ago
Thanks Yannic. Great explanation of the tradeoff between accuracy and speedup. Linear transforms instead of attention mechanisms seem more practical to deploy for small- to mid-scale datasets.
@user-wg6ch7wx2x · 4 months ago
Fantastic work.
@rabago85 · 3 years ago
Was waiting for this.
@CristianGarcia · 3 years ago
There are 2 properties I think are still missing from these attention-less alternatives: variable-length inputs and outputs, and being order-agnostic.
@ce6535 · 3 years ago
In response to the comment "I'm even open to the idea that the Fourier transform might even be the optimal mixing technique." I think the actual optimum is something that could be trained, or at least investigated empirically. Other functions, such as Legendre polynomials or Bessel functions, are 'a' Fourier basis for different differential operators/measures. It's easy to find the functional bases using Sturm-Liouville theory. It is possible that you could optimize the basis by allowing the functions that define the ODE to become parameters to the model.
@ce6535 · 3 years ago
To be clear, the advantage of this method is that once you have your basis, you can bake it in to the architecture. The method is trained, but the parameters you are training don't ship with the final model.
@alpers.2123 · 3 years ago
We need more weird math:)
@st0a · 4 months ago
Wow, I'm dumb.
@ashoknp · 3 years ago
Thank you Yannic!!!
@SDKIM0211 · 3 years ago
Hi Yannic, thanks for your nice explanation. May I ask which tool and device you use to record these videos? (e.g., copy and paste the PDF into OneNote and scribble on it)
@LukaszWiklendt · 3 years ago
Why do we even need a position embedding if the DFT over the tokens already provides this? Is it because they drop the imaginary component? If so, why not keep both imaginary and real and drop the position embedding?
@YannicKilcher · 3 years ago
Good question! Maybe worth a try.
@machinelearningdojowithtim2898 · 3 years ago
I think we would lose the information after the inverse is computed
@EyedMoon · 3 years ago
That was my thought too, since the Imaginary component tells you where to shift the sines
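A quick numerical check of the shift property these replies rely on (a sketch, not from the paper): circularly shifting a sequence only multiplies its DFT by a phase ramp, so positional shifts live in the complex phase of the spectrum.

```python
# DFT shift property: rolling the signal by m multiplies its spectrum
# by exp(-2*pi*i*k*m/N), a pure phase term.
import numpy as np

N = 16
x = np.random.randn(N)
m = 3                                   # shift by 3 positions
k = np.arange(N)

shifted_spectrum = np.fft.fft(np.roll(x, m))
phase_ramp = np.exp(-2j * np.pi * k * m / N)

print(np.allclose(shifted_spectrum, np.fft.fft(x) * phase_ramp))  # True
```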
@sharp7j · 1 year ago
As soon as I learned about attention I thought, "Pretty sure the main benefit is the weight sharing per token, just like in convolution, combined with looking at the entire sentence per token." Turns out I was fucking right, since the linear model is about as good as BERT; I'm sure if you added a few more layers / made it fancier it would perform better. This paper is awesome and your summary is amazingly intuitive.
@MrMIB983 · 3 years ago
Great video
@freasy123 · 3 years ago
How about a multi-scale approach by using wavelet transform?
@lesumsi · 3 years ago
It has come to the point where my first reaction to a hyped paper is checking whether Yannic has already published a video.
@sarvagyagupta1744 · 3 years ago
What is the "Type" embedding? Can I get more information on that?
@sarvagyagupta1744 · 3 years ago
The Nx value in the architecture diagram: does it work in a serial fashion or in parallel? Is it replacing multi-head attention or just increasing the number of encoder layers? Because in the implementations it works in a serial fashion. So do let me know.
@sampruden6684 · 3 years ago
Do we expect this to work equally well with ViTs, or is this Fourier magic likely to be limited to the NLP domain? That might be an obvious next paper.
@TheShadyStudios · 3 years ago
wassup Yannic, you're awesome
@user-qu2oz2ut2h · 3 years ago
I think we should investigate applying a fractional Fourier transform instead of the regular one. The fractional FT is a generalization of the FT and is parametrized by an angle alpha. If alpha=0, the function doesn't change (skip connection). If alpha=pi/2, it performs the regular FT. If alpha=-pi/2, it performs the inverse FT. If alpha=pi, it gives a mirror reflection of the function. In consecutive transforms the angles of the individual transforms add: pi/2 + (-pi/2) = 0, which corresponds to an FT followed by an inverse FT, i.e., the identity transform. So we could use linear operations in different domains, parametrized by alpha. It could be organized as, say, a 12-channel tensor with alpha ranging from 2*pi*0/12 to 2*pi*11/12; then we normalize and apply a fully connected layer to all 12 channels, and we get skip connections, convolution, order reversal, and 9 other promising operations. Or we could just use a linear operation and then collapse these 12 channels into one by element-wise addition.
@user-qu2oz2ut2h · 3 years ago
If I get it right, element-wise multiplication in the fractional Fourier domain attenuates various chirps in the signal, and in the case of 2D images (tensors) you could use different transform angles for the x and y directions: for example, transform only the rows and don't transform the columns (pi/2 along x and 0 along y).
@abhilashnandy · 3 years ago
Thanks, very cool! Is there any Hugging Face model for this?
@MsFearco · 3 years ago
I love attention, but I've been saying Fourier is the key to everything since, like, forever. Do we need attention? Idk, probably yes. But Fourier is also needed. In what context can we combine these? I have no idea.
@tinski · 3 years ago
Sorry, if this has been asked a million times already, but does anyone know what pdf annotation software Yannic is using? Looks so clean!
@alpers.2123 · 3 years ago
OneNote
@SimonJackson13 · 3 years ago
The transform is a 2D version, as the two axes are different. Yes, the FFT is a self-inverse if the scaling is done right and the same data axis is used. The fact that convolution becomes a multiplicative filter also likely helps to extract data.
@JTMoustache · 3 years ago
Whole sequence + fully connected + conditioned meta-parameters is all you need.
@sieyk · 3 years ago
Glad to see attention go. I never could make sense of it! 😂
@ericadar · 3 years ago
Why throw away half the Fourier output by only considering the real part? What would be the effect of doubling the number of tokens/nodes of the Fourier output layer by splitting it into real and imaginary parts?
@aspergale9836 · 3 years ago
4:30 - Which papers?
@AIology2022 · 3 years ago
So maybe a wavelet transform could be used instead of the Fourier transform, with the difference that the wavelet parameters can be tuned by the network.
@NeoShameMan · 3 years ago
Isn't the attention mechanism, with query, key, and value, an evolution of the Neural Turing Machine architecture? Can someone clue me in on this?
@chihochan7345 · 3 years ago
Eq. 3 is basically a 2D FFT and it only keeps the real part. I guess it simplifies the computation, and the real part of the FFT is related to the power spectrum. In fact, the power spectrum of the FFT is (the Fourier transform of) the autocorrelation. Self-attention is the softmax of the cross-correlation of signal pairs. Therefore, I think they are equivalent in some sense.
@alpers.2123 · 3 years ago
Instead of learning to route information in quadratic complexity, can't we train a layer to output an index into a permutation table?
@alpers.2123 · 3 years ago
Partially structured data like natural language would probably have a small subset of possible routes, so it could be discretised.
@dandan-gf4jk · 3 years ago
@@alpers.2123 No because it's not differentiable
@herp_derpingson · 3 years ago
I haven't found a good implementation of an efficient differentiable permutation algorithm.
@alpers.2123 · 3 years ago
Permutation is a matrix multiplication with a binary matrix
@dandan-gf4jk · 3 years ago
@@alpers.2123 Oh and where will you get the binary matrix from?
@christianc8265 · 1 year ago
Did they completely drop the whole decoder part?
@Mrbits01 · 2 years ago
A very nitpicky error in the video: at around 10:44, when talking about Fourier and inverse Fourier transforms, you say "the inverse transform is simply if you don't do the negative sign right here (in the complex exponential)". That's not entirely correct: in the inverse DFT, x_n and X_k are interchanged, and the summation goes over k=0,...,N-1 instead. (Although this summation detail doesn't matter much, especially when taking the DFT of a real-valued signal, it's necessary to keep the notation consistent.)
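For reference, the standard DFT / inverse-DFT pair this comment is describing looks like this (a notational aside, not taken from the paper):

```latex
% Forward transform
X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, \dots, N-1
% Inverse transform: the roles of x_n and X_k swap, the sign in the exponent
% flips, and a 1/N normalization factor appears
x_n = \frac{1}{N} \sum_{k=0}^{N-1} X_k \, e^{+2\pi i k n / N}, \qquad n = 0, \dots, N-1
```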
@mbunooo · 3 years ago
Excellent video!! We should note here that they are using the discrete Fourier transform… It only remains for someone to give an interpretation of how exactly that FT mixes the tokens vs. linear mixing… How does time vs. frequency apply to tokens? What does frequency mean for discrete tokens? But it looks like not even the authors have figured that out 😂😂😂
@CristianGarcia · 3 years ago
gMLP is also interesting!
@Guesstahw · 3 years ago
The way I see it, the success of attention in transformers and similar models lies in the explicitly "bilinear" nature of the information flow between tokens, and hence the O(n^2) issue. I don't see how replacing such a nonlinear interaction with a weighted sum (they might as well have used an MLP) could bring the same expressive power. On a different note, the frequencies for sequences of different lengths would mean different things, and hence one would probably have to resort to STFT-like transformations, which would not resolve the variable sequence length.
@aniruddhadatta925 · 3 years ago
Now it's my time to shine with my ECE degree
@G12GilbertProduction · 3 years ago
I realize yourself what it could be this O(n² × log14²) attention matrix work in superposition difference equation by VGA.
@Skythedragon · 3 years ago
Next paper: Fourier transforms are all you need. I actually wonder how long it will take to go full circle and have a "neural network training thing" that literally generates code that does a specific task?
@herp_derpingson · 3 years ago
26:30 Why is FF-only worse than random? Also, I wonder if someone tried a similar idea before. It sounds like a very obvious thing to do.
@YannicKilcher · 3 years ago
I guess any mixing is still preferred. It's random, but always the same random afaik.
@Coolguydudeness1234 · 3 years ago
instead of doing 2 1D fourier transforms like total virgins they should just stack the vectors and do a 2D fourier transform
@Coolguydudeness1234 · 3 years ago
which is equivalent to just using a CONV layer with a large kernel 🤡
@bertchristiaens6355 · 3 years ago
Isn't it equivalent but easier to compute? "A 2D Fourier transform is achieved by first transforming each row, i.e. replacing each row with its 1D Fourier transform. This first step yields an intermediary 'picture' in which the horizontal axis is frequency f and the vertical axis is space y. The second step is to apply a 1D Fourier transform individually to each vertical line of the intermediate image. This new image will be the 2D Fourier transform of the initial image."
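That row-then-column decomposition is easy to verify numerically (a small illustrative sketch, using NumPy's conventions):

```python
# Two passes of 1D FFTs (one per axis) equal a full 2D FFT.
import numpy as np

x = np.random.randn(32, 64)                          # toy (seq_len, hidden) block
two_passes = np.fft.fft(np.fft.fft(x, axis=1), axis=0)
print(np.allclose(two_passes, np.fft.fft2(x)))       # True
```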
@kushasareen3259 · 3 years ago
Someone can correct me if I'm wrong, but applying the DFT twice and taking the real part will absolutely not give you back the real signal; it does, however, do something close. Taking the real part immediately throws out all of the phase information. Taking the FT twice actually returns f(-x), essentially reversing the signal (this can be seen pretty clearly from the definition of the FT and its inverse). Taking the FT 4 times, however, will give you back the signal, but I don't think this reversal really plays a role in the learning, since the signal is identically reversed each time. I think once you take only the real part, the transformation will roughly approximate this with the phase information lost.
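A small numerical check of that claim (illustrative sketch): with the unnormalized DFT, two applications return N times the index-reversed signal, and four applications return N² times the original.

```python
# fft(fft(x)) = N * x[-n mod N]; applying fft four times recovers x (times N^2).
import numpy as np

N = 8
x = np.random.randn(N)

twice = np.fft.fft(np.fft.fft(x))
reversed_x = np.concatenate(([x[0]], x[1:][::-1]))   # x[-n mod N]

print(np.allclose(twice, N * reversed_x))                     # True
print(np.allclose(np.fft.fft(np.fft.fft(twice)), N**2 * x))   # True: back to x
```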
@pratik245 · 2 years ago
Although this is game theory, I can see that for each token the complexity is O(1), and for each attention layer above it is O(1), but real-world concepts are not so neatly segregated. But that is the mysterious part.
@pratik245 · 2 years ago
That's infinite-dimensional space.
@pratik245 · 2 years ago
If we could somehow do an inverse Fourier transform of the real world, it would tell us why we are what we are: the cure for all diseases, the solution to all mysteries.
@leecarraher · 3 months ago
Feels like we want locality and sparsity; why not trade the floating-point complex FFT for the Walsh-Hadamard transform?
@mathematicalninja2756 · 5 months ago
The wavelet transform is the most effective non-parametric mixing, I bet. Wavelets are mathematical microscopes.
@natthaphonhongcharoen · 3 years ago
3:47 Anything but n squared? Let's go n to the power n.
@ankitaharwal5886 · 3 years ago
I think the attention mechanism is much more than just information sharing between data points. The weights in attention are computed at run time and are variable, which makes attention much more general. So how about we apply the Fourier transform first and then an attention layer on top of that?
@Ronnypetson · 3 years ago
30:45 LOL
@ecitslos · 3 years ago
People investigated predefined basis functions in the early days of deep learning (the paper even points that out). They performed badly (even against regular dense layers). This paper does the same thing, and again the predefined basis loses to dense layers. This paper is just getting attention (pun intended) that it doesn't deserve because it's from Google. I think the various works on local attention should be much better than the idea in this paper.
@G12GilbertProduction · 3 years ago
Did you implain for this a Turing machine algorithm?
@justinwhite2725 · 3 years ago
"For the longest time". Wasn't "Attention Is All You Need" published in 2017? And wasn't FNet published in 2021? We're talking 4 years.
@user-vx1fj9uf6w · 1 year ago
17:20 If you don't take the transform over an infinite domain, it cannot be perfectly restored.
@adamweishaupt7074 · 3 years ago
"The verb at the end attends to the second word." Human language was a huge mistake.
@XOPOIIIO · 3 years ago
So attention is not what you need after all.
@pascalhuber157 · 3 years ago
I don't really get the hype about this paper. Their results are apparently much worse than with attention. In my opinion this is an interesting approach, but ultimately nothing but hot air.