Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)

55,070 views

Yannic Kilcher

#perceiver #deepmind #transformer
Inspired by the fact that biological creatures attend to multiple modalities at the same time, DeepMind releases its new Perceiver model. Based on the Transformer architecture, the Perceiver makes no assumptions about the modality of the input data and also tackles the long-standing quadratic bottleneck problem. It achieves this with a low-dimensional latent Transformer, into which the input data is fed multiple times via cross-attention. The Perceiver's weights can also be shared across layers, making it very similar to an RNN. Perceivers achieve competitive performance on ImageNet and state-of-the-art results on other modalities, all while making no architectural adjustments to the input data.
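To make the cross-attention bottleneck concrete, here is a minimal sketch (not the authors' code; the module name, the sizes, and the omission of multi-head attention, layer norm and residuals are all illustrative): a small learned latent array queries the M input elements, so each attention step costs O(N*M) instead of O(M^2), and the same block can be applied again with the updated latents as queries.

import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    # One Perceiver-style cross-attention step: the latents ask the questions, the input answers.
    def __init__(self, num_latents=256, latent_dim=512, input_dim=3):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))  # learned, input-independent
        self.to_q = nn.Linear(latent_dim, latent_dim)
        self.to_k = nn.Linear(input_dim, latent_dim)
        self.to_v = nn.Linear(input_dim, latent_dim)

    def forward(self, x, latents=None):                        # x: (batch, M, input_dim)
        if latents is None:                                     # first pass: start from the learned latents
            latents = self.latents.expand(x.shape[0], -1, -1)   # (batch, N, D)
        q = self.to_q(latents)                                  # (batch, N, D)
        k, v = self.to_k(x), self.to_v(x)                       # (batch, M, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (batch, N, M)
        return attn @ v                                         # (batch, N, D): the input distilled into N latents

block = LatentCrossAttention()
x = torch.randn(2, 224 * 224, 3)   # a flattened 224x224 RGB "byte array"
z = block(x)                       # first cross-attend
z = block(x, latents=z)            # the same raw input is fed in again, queried by the updated latents

In the full model a stack of latent self-attention layers runs between the cross-attends; it is left out here to keep the shape bookkeeping visible.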
OUTLINE:
0:00 - Intro & Overview
2:20 - Built-In assumptions of Computer Vision Models
5:10 - The Quadratic Bottleneck of Transformers
8:00 - Cross-Attention in Transformers
10:45 - The Perceiver Model Architecture & Learned Queries
20:05 - Positional Encodings via Fourier Features
23:25 - Experimental Results & Attention Maps
29:05 - Comments & Conclusion
Paper: arxiv.org/abs/2103.03206
My Video on Transformers (Attention is All You Need): • Attention Is All You Need
Abstract:
Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira
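Since the architecture itself carries no 2D prior, position has to be handed to the model as extra input features; this is what the Fourier-feature encodings discussed at 20:05 do. A rough sketch of that flavor of encoding (the band count, frequency range and schedule here are illustrative, not the paper's exact settings):

import math
import torch

def fourier_encode(pos, num_bands=8, max_freq=10.0):
    # pos: (..., d) coordinates scaled to [-1, 1]; returns sin/cos features plus the raw coordinate
    freqs = torch.linspace(1.0, max_freq / 2, num_bands)            # frequency bands
    scaled = pos.unsqueeze(-1) * freqs * math.pi                    # (..., d, num_bands)
    enc = torch.cat([scaled.sin(), scaled.cos()], dim=-1)           # (..., d, 2 * num_bands)
    return torch.cat([pos.unsqueeze(-1), enc], dim=-1).flatten(-2)  # (..., d * (2 * num_bands + 1))

# encode the (row, col) position of every pixel in a 224x224 image
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 224), torch.linspace(-1, 1, 224), indexing="ij")
pos = torch.stack([ys, xs], dim=-1).reshape(-1, 2)  # (50176, 2)
pe = fourier_encode(pos)                            # (50176, 34), concatenated channel-wise with each pixel's RGB

Because position is just another input feature rather than part of the architecture, the same trick carries over to audio samples (1D coordinates) or video (3D coordinates).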
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 127
@YannicKilcher 3 years ago
OUTLINE: 0:00 - Intro & Overview 2:20 - Built-In assumptions of Computer Vision Models 5:10 - The Quadratic Bottleneck of Transformers 8:00 - Cross-Attention in Transformers 10:45 - The Perceiver Model Architecture & Learned Queries 20:05 - Positional Encodings via Fourier Features 23:25 - Experimental Results & Attention Maps 29:05 - Comments & Conclusion
@mgostIH 3 years ago
This approach is so elegant! Unironically Schmidhuber was right that the more something looks like an LSTM the better 😆
@reesejammie8821 3 years ago
I always thought the human brain is a recurrent neural network with a big hidden state that is constantly fed data from the environment.
@6lack5ushi 3 years ago
Powerful!!!
@srikanthpolisetty7476 3 years ago
Congratulations. I'm so glad this channel is growing so well, great to see a channel get the recognition they deserve. Can't wait to see where this channel goes from here.
@RS-cz8kt 3 years ago
Stumbled upon your channel a couple of days ago, watched a dozen videos since then, amazing work, thanks!
@Gorulabro 3 years ago
Your videos are a joy to watch. Nothing I do in my spare time is so useful!
@sanzharbakhtiyarov4044 3 years ago
Thanks a lot for the review Yannic! Great work
@bardfamebuy 3 years ago
I love how you did the cutting in front of a green screen and didn't even bother editing it out.
@jamiekawabata7101 3 years ago
The scissors scene is wonderful!
@maxdoner4528 2 years ago
Good job, it's pretty great to have these topics explained by someone other than the authors. Keep it up!
@HuyNguyen-rb4py 2 years ago
so touching for an excellent video
@CristianGarcia 3 years ago
This is VERY nice! I'd love to give it a spin on a toy dataset. 😍 BTW: Many transformer patterns can be found in the Set Transformers paper, the learned query reduction strategy is termed Pooling by Attention.
@JTedam 2 years ago
this helps a lot to make research accessible
@timdernedde993 3 years ago
Hey Yannic, great Video as usual :) If you want some feedback I feel like you could have covered the results a bit more. I do think the methodology of course is much more important but it helps to have a bit of an overview of how good it performs at what tasks. Maybe give it a few minutes more in the results section next time. But anyways still enjoyed the video greatly. Keep up the great work!
@emilianpostolache545 3 years ago
27:30 - Kant is all you need
@silvercat4 3 years ago
underrated comment
@simonstrandgaard5503 3 years ago
Excellent walkthrough
@robboswell3943 1 year ago
Excellent video! A critical question: How exactly are the learned latent arrays being learned? Is there some kind of algorithm used to create the learned latent array by reducing the dimensions of the input "byte array"? They never really go into detail about the exact process they used to do this in the paper. Surprisingly, no online sources on this paper that I have found speak about the exact process either. On pg. 3, it does state, "The model can also be seen as performing a fully end-to-end clustering of the inputs with latent positions as cluster centres..." But this is a pretty generic explanation. Could you please provide a short explanation of the process they used?
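For reference, the paper treats the latent array as a free, learned parameter: it is randomly initialized, shared across all inputs, and trained end to end by backpropagation along with the rest of the network, rather than being computed by reducing the byte array. A minimal sketch of that setup (the sizes and init standard deviation are illustrative):

import torch
import torch.nn as nn

class PerceiverLatents(nn.Module):
    def __init__(self, num_latents=512, latent_dim=1024):
        super().__init__()
        # A plain parameter matrix, much like a learned positional-embedding table;
        # gradients reach it through the cross-attention queries it produces.
        self.latents = nn.Parameter(torch.empty(num_latents, latent_dim))
        nn.init.trunc_normal_(self.latents, std=0.02)

The "cluster centres" phrasing is one way of reading these learned vectors: after training, each latent tends to attend to (softly group) its own subset of the inputs.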
@Coolguydudeness1234 3 years ago
I lost it when you cut the piece of paper 😂
@justindaniels863 1 year ago
unexpected combination of humour and intelligence!
@TheCreativeautomaton 3 years ago
Hey, thanks for doing this. I very much like the direction of transformers in ML. I'm newer to NLP and looking at where ML might go next. Once again, thanks.
@Ronschk 3 years ago
Really nice idea. I wonder how much improvement it would bring if the incoming data were converted through a "sense". Our brain also doesn't receive images directly, but instead receives signals from our eyes which transform the input image (and use something akin to convolutions?). So you would have this as a generic compute structure, but depending on the modality you would have a converter. I think they had something like this in the "one model to rule them all" paper or so...
@MsFearco 3 years ago
I just finished this, it's an extremely interesting paper. Please review the Swin Transformer next. It's even more interesting :)
@cptechno 3 years ago
Yes, I like this type of content. Keep up the good work. Bringing this material to our attention is a prime service. You might consider creating an AI.tv commercial channel. I'll join.
@ruroruro 3 years ago
Yeah, the attention maps look really, really suspicious. Almost like the network only attends to the Fourier features after the first layer. Also, the whole idea that they are feeding the same unprocessed image into the network multiple times seems really weird. The keys should basically be a linear combination of R, G, B and the same Fourier features each time. How much information can you realistically extract from an image just by attending to the low-level color and positional information? I would have expected them to at least use a simple ResNet or FPN alongside the "thin" attention branch thingy.
@reesejammie8821 3 years ago
Couldn't agree more. It's like the attention maps are far from being content-based. Also agree on the features being too low level, what does it even mean to attend to raw pixels?
@hugovaillaud5102 3 years ago
Is this architecture slower than a ResNet with a comparable number of parameters, due to the fact that it is somehow recurrent? Great video, you explain things so clearly!
@AbgezocktXD 3 years ago
One day you will stop explaining how transformers work and I will be completely lost
@amirfru 3 years ago
This is incredibly similar to TabNet! But with the attentive blocks changed to attention layers.
@Daniel-ih4zh 3 years ago
Things are going so fast in the last year or two.
@ssssssstssssssss 3 years ago
I disagree... There haven't really been many major innovations in machine learning in the past two years.
@L9X 2 years ago
Could this perhaps be used to model incredibly long-distance relationships, i.e. incredibly long-term memory? As in, the latent query vector (I'll just call it Q from here) becomes the memory. Perhaps we start off with a randomly initialised latent Q_0 and input KV_0 - let's say the first message sent by a user - to the perceiver, which produces latent output Q_1; we then feed Q_1 back into the perceiver with the next message sent by the user, KV_1, as input and get output Q_2 from the perceiver, and so on. Then at every step we take Q_n and feed it to some small typical generative transformer decoder to produce a response to the user's message. This differs from typical conversational models, such as those using GPT-whatever, because they feed the entire conversation back into the model as input, and since the model has a constant-size input, the older messages get truncated as enough new messages arrive, which means the older memories get totally lost. Could this be a viable idea? We could have M >> N, which means we have more memory than input length, but if we keep M on the order of a thousand, that gives us 1000 'units' of memory that retain only the most important information.
@notsure7132 3 years ago
Thank you.
@neworldemancer 3 years ago
Thanks for the video, Yannic! I would imagine that the attention "lines" @27:00 could indeed be static, but the alternative is that they are input-dependent yet overfitted to the Fourier features, as these lines are a clear artefact.
@axeldroid2453 3 years ago
Does it have something to do with sparse sensing? It basically attends to the most relevant data points.
@maks029 3 years ago
Thanks for an amazing video. I didn't really catch what the "latent array" represents, though. Is it an array of zeros at first?
@48956l 2 years ago
thank you for that wonderful demonstration with the piece of paper lol
@bender2752 3 years ago
Great video! Consider making a video about DCTransformer maybe? 😊
@Anujkumar-my1wi 3 years ago
Can you tell me why neural nets with many hidden layers require fewer neurons than a neural net with a single hidden layer to approximate a function?
@henridehaybe525 3 years ago
It would be nice to see how the Perceiver would perform when the KV of the cross-attentions are not the raw image at each "attend" but the feature maps of a pretrained ResNet. E.g. the first "attend" KV is the raw image, the second KV is the feature maps of the second ResNet output, and so on. A pretrained ResNet would do the trick, but it could technically be feasible to train it concurrently. It would be a Parallel-Piped Convolutional-Perceiver model.
@pvlr1788 2 years ago
Thanks for the video! But I can't understand where the first latent array comes from...
@emmanuellagarde2212 3 years ago
If the attention maps for layers >2 are not image specific, then this echoes the results of the paper "Pretrained Transformers as Universal Computation Engines" which suggests that there is a universal mode of operation for processing "natural" data
@LaNeona 3 years ago
If I have a gamification model is there anyone you know that does meta analysis on system mechanisms?
@aday7475 2 years ago
Any chance we can get a compare and contrast between Perceiver, Perceiver IO, and Perceiver AR?
@Shan224 3 years ago
Thank you yannic
@piratepartyftw 3 years ago
Very cool. I wonder if it works when you feed in multimodal data (e.g. both image and text in the same byte array).
@galchinsky 3 years ago
Proper positional encodings should somehow work
@synthetiksoftware5631 3 years ago
Isn't the "Fourier"-style positional encoding just a different way to build a scale-space representation of the input data? So you are still "baking" that kind of scale-space prior into the system.
@gz6963 1 year ago
4:10 Is this related to the puzzles we have to solve with Google Captcha? Things like "select all the squares containing a boat"
@herp_derpingson 3 years ago
17:30 Since you already bought a green screen, maybe next time put Mars or the Apollo landing in the background. Or a large cheesecake. That's good too. All in all: one architecture to rule them all.
@YannicKilcher 3 years ago
Great suggestion :D
@dr.mikeybee 3 years ago
Even with my limited understanding, this looks like a big game changer.
@TheGreatBlackBird 3 years ago
I was very confused until the visual demonstration.
@xealen2166 2 years ago
I'm curious: how are the queries generated from the latent matrix, and how is the latent matrix initially generated?
@Deez-Master 3 years ago
Nice video
@thegistofcalculus 3 years ago
Just a silly question: instead of a big data input vector and a small latent vector, could they have a big latent vector that they use as a summary vector and spoon-feed slices of data into it, in order to achieve some downstream task such as maybe predicting the next data slice? Would this allow for even bigger inputs to be summarized (like HD video)?
@thegistofcalculus 2 years ago
Looking back it seems that my comment was unclear. It would involve a second cross attention module to determine what gets written into the big vector.
@patf9770 3 years ago
Something I just noticed about the attention maps: they seem to reflect something about the positional encodings? It looks like the model processes images hierarchically, globally at first and with a progressively finer tooth comb. My understanding is that CNNs tend to have a bias towards local textural information so it'd be really cool if an attention model learned to process images more intuitively
@petrroll 3 years ago
There's one thing I don't quite understand. How does this model do low-level feature capture / how does it retain that information? I.e. how does it do the processing that happens in the first few layers of a CNN? I can clearly see how this mechanism works well for higher-level processing, but how does it capture (and keep) low-level features? The reason I don't quite understand it is that the amount of information that flows between the first and second layer of this model and, e.g., the first and second module of a ResNet is quite drastically different. In this case it's essentially N*D, which I suppose is way smaller than the ResNet's M*D (not quite M, because there's some pooling even in the first section of a ResNet, but still close), simply on account of N being much smaller than M.
@yassineabbahaddou4369 2 years ago
Why did they use a GPT-2 architecture for the latent transformer instead of a BERT architecture?
@marat61 3 years ago
Also, you did not say anything about the dimension size in the ablation part.
@ibrahimaba8966 2 years ago
17:28 best way to solve the quadratic bottleneck 😄!
@hanstaeubler 3 years ago
It would also be interesting to 'interpret' this model or algorithm on the music level as well (I compose music myself for my pleasure)? Thanks in any case for the good interpretation of this AI work!
@jonathandoucette3158 3 years ago
Fantastic video, as always! Around 20:05 you describe transformers as invariant to permutations, but I believe they're more accurately equivariant, no? I.e. permuting the input permutes the output in exactly the same way, as opposed to permuting the input leading to the exact same output. Similar to convolutions being equivariant w.r.t. position
@mgostIH 3 years ago
You could say those terms are just equivariant to mistakes!
@ruroruro 3 years ago
Transformers are invariant to key+value permutations and equivariant to query permutations. The reason, why they are invariant to k+v permutations is that for each query all the values get summed together and the weights depend only on the keys. So if you permute the keys and the values in the same way, you still get the same weights and the sum is still the same.
@jonathandoucette3158 3 years ago
@@ruroruro Ahh, thanks for the clarification! In my head I was thinking only of self attention layers, which based on your explanation would indeed be permutation equivariant. But cross-attention layers are more subtle; queries equivariant, keys/values invariant (if they are permuted in the same way).
@anonymouse2884 2 years ago
I believe that it is permutation invariant: since you are doing a weighted sum of the inputs/context, you should "roughly" (the positional encoder might encode different time indices slightly differently, but this should not matter a lot) get the same results even if you permute the inputs.
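A quick numerical check of the distinction discussed in this thread (a throwaway sketch; the shapes are arbitrary): permuting the keys and values together leaves the cross-attention output unchanged (invariance), while permuting the queries permutes the output rows the same way (equivariance).

import torch

def cross_attend(q, k, v):
    # single-head scaled dot-product attention, no projections
    return torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1) @ v

q, k, v = torch.randn(4, 8), torch.randn(10, 8), torch.randn(10, 8)
perm_kv, perm_q = torch.randperm(10), torch.randperm(4)

out = cross_attend(q, k, v)
print(torch.allclose(out, cross_attend(q, k[perm_kv], v[perm_kv]), atol=1e-6))  # True: invariant to k/v permutation
print(torch.allclose(out[perm_q], cross_attend(q[perm_q], k, v), atol=1e-6))    # True: equivariant to query permutation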
@GuillermoValleCosmos 3 years ago
this is clever and cool
@NilabhraRoyChowdhury 3 years ago
What's interesting is that the model performs better with weight sharing.
@peterszilvasi752 2 years ago
17:07 - The visual demonstration of how the quadratic bottleneck is solved was a true "Explain Like I'm Five" moment. 😀
@marat61 3 years ago
I believe there is an error in the paper at 23:07: Q must be MxC, not MxD, otherwise QK.transpose() will be impossible.
@Kram1032 3 years ago
Did the house sit on the mat though
@azimgivron1823 3 years ago
Are the query dimension and the latent array in figure 1 of the same dimensions? It is written that Q belongs to the space of real matrices of dimensions MxD, which does not make sense to me. I believe they meant NxD where D=C, since you need to do a dot product to compute the cross-attention between the query Q and the keys K ==> Q.Kt, with Kt being the transpose of K, so it implies that the dimensions D and C are equal, isn't that right? I am kinda disappointed by the paper, because this is the core of what they want to show and they do not make the effort to dive into the math and explain it clearly.
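For what it's worth, the shapes work out once the linear projections are made explicit: the queries come from the N x D latent array, the keys and values from the M x C byte array, and all three are projected to a common attention width before the dot product, so C and D never need to be equal. A quick shape check with made-up sizes:

import torch
import torch.nn as nn

N, D = 512, 1024      # latent array: N x D
M, C = 50176, 3       # byte array: M x C (e.g. RGB pixels)
d_attn = 256          # shared projection width

latents, inputs = torch.randn(N, D), torch.randn(M, C)
q = nn.Linear(D, d_attn)(latents)   # (N, d_attn)
k = nn.Linear(C, d_attn)(inputs)    # (M, d_attn)
v = nn.Linear(C, d_attn)(inputs)    # (M, d_attn)
scores = q @ k.T                    # (N, M): the O(N*M) term
out = torch.softmax(scores / d_attn ** 0.5, dim=-1) @ v   # (N, d_attn), back at the latent's size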
@kirtipandya4618 3 years ago
Where can we find source code?
@TheJohnestOfJohns 3 years ago
Isn't this really similar to facebook's DETR with their object queries, but with shared weights?
@antoninhejny8156 3 years ago
No, since DETR is just for localising objects from features extracted via some backbone like a ResNet, while this is the feature extractor. Furthermore, DETR just puts the features into a transformer, whereas this is like forming an idea about what is in the image while consulting the raw information in the form of RGB. This is, however, very suspicious, because a linear combination of RGB is just three numbers.
@evilby 1 year ago
WAHHH... Problem Solved!😆
@jonatan01i 3 years ago
2:44 "And the image is of not a cat!, a house! What did you think??!.." I thought nothing; my mind was empty :(
@NextFuckingLevel 3 years ago
:( I feel you
@cocoarecords 3 years ago
Yannic, can you tell us your approach to understanding papers quickly?
@YannicKilcher 3 years ago
Look at the pictures
@TheZork1995 3 years ago
@@YannicKilcher xD So easy yet so far. Thank you for the good work though. Literally the best YouTube channel I've ever found!
@swoletech5958 3 years ago
PointNet++ from 2017 outperformed the Perceiver on point clouds: 91.9 accuracy versus 85.7. See @27:19.
@happycookiecamper8101 3 years ago
nice
@teatea5528 1 year ago
It may be a stupid question, but how can the authors claim their method is better than ViT on ImageNet in Appendix A, Table 7, when their accuracy is not higher?
@brll5733 3 years ago
Performers already grow entirely linearly, right?
@martinschulze5399 3 years ago
Do you have any open PhD positions? ^^
@vadimschashecnikovs3082 3 years ago
Hmm, I think it is possible to add some GLOM-like hierarchy of "words". This could improve the model...
@enriquesolarte1164 3 years ago
haha, I love the scissors...!!!
@DistortedV12 2 years ago
“General architecture”, but can it understand tabular inputs??
@kenyang687 1 year ago
The "hmm by hmm" is just too confusing lol
@conduit242 3 years ago
Embeddings are still all you need 🤷
@hiramcoriarodriguez1252 3 years ago
This is huge. I'm not going to be surprised if the "Perceiver" becomes the gold standard for CV tasks.
@galchinsky 3 years ago
The way it is, it seems to be classification only.
@nathanpestes9497 3 years ago
@@galchinsky You should be able to run it backwards for generation. Just say my output (image/point-cloud/text I want to generate) is my latent(as labeled in the diagram), and my input (byte array in the diagram) is some latent representation that feeds into my outputs over several steps. I think this could be super cool for 3D GANs since you don't wind up having to fill 3d grids with a bunch of empty space.
@galchinsky 3 years ago
@@nathanpestes9497 Won't you get O(huge^2) this way?
@nathanpestes9497 3 years ago
​@@galchinsky I think it would be cross attention o(user defined * huge) same as the paper (different order). Generally we have o(M*N), M - the size of input/byte-array, N - the size of the latent. The paper goes after performance by forcing the latent to be non-huge so M=huge, N=small O(huge * small). Running it backwards you would have small input (which is now actually our latent so a low dimensional random sample if we want to do a gan, perhaps the (actual) latent from another perceiver in a VAE or similar). So backwards you have M=small N=huge so O(small*huge).
@galchinsky 3 years ago
@@nathanpestes9497 Thanks for pointing this out. I thought we would get a Huge x Huge attention matrix, but you are right: if we set the Q length to be Huge and K/V to be Small, the resulting complexity will be O(Huge*Small). So we want to get a new K/V pair each time, and this approach seems quite natural: (here was an imgur link but YouTube seems to hide it). So there are 2 parallel stacks of layers. The first set is like in the article: latent weights, then cross-attention, then a stack of transformers, and so on. The second stack consists of your cross-attention layers, so it operates in the byte-array dimension. Its first Q is the byte-array input and its K,V are taken from the stack of "latent transformers". Then its output is fed as K,V back to the "latent" cross-attention, making new K,V. So there is an informational ping-pong between the "huge" and "latent" cross-attention layers.
@bensums 3 years ago
So the main point is that you can have fewer queries than values? This is obvious even just by looking at the definition of scaled dot-product attention in Attention Is All You Need (Equation 1). From the definition there, the number of outputs equals the number of queries and is independent of the number of keys or values. The only constraints are: 1. the number of keys must match the number of values, 2. the dimension of each query must equal the dimension of the corresponding key.
@bensums 3 years ago
(in the paper all queries and keys are the same dimension (d_k), but that's not necessary)
@moctardiallo2608 3 years ago
Yeah, 30 min is much better!
@timstevens3361 3 years ago
attention looped is consciousness
@AvastarBin 3 years ago
+1 For the visual representation of M*N hahah
@errrust 3 years ago
Clearly you are more of a fan of row vectors than column vectors, Yannic (referring to your visual demo :))
@TechyBen 3 years ago
Oh no, they are making it try to be alive. XD
@freemind.d2714 3 years ago
Good job Yannic, but I'm starting to feel like a lot of the papers you cover these days are all about transformers, and frankly they're kind of similar, and most are engineering research rather than scientific research. I hope you don't mind talking about more interesting papers on different subjects.
@muhammadaliyu3076 3 years ago
Yannick follows the hype
@NeoShameMan 3 years ago
So basically it's conceptually close to rapid eye movement, where we refine over time the data we need to resolve recognition...
@seraphim9723 3 years ago
The ablation study consists of three points without any error bars and could just be coincidence? One cannot call that "science".
@oreganorx7 2 years ago
Very similar to MemFormer
@Stefan-bs3gm 3 years ago
with O(M*M) attention you quickly get to OOM :-P
@allengrimm3039 3 years ago
I see what you did there
@Vikram-wx4hg 1 year ago
17:15
@omegapointil5741 3 years ago
I guess curing Cancer is even more complicated than this.
@Mordenor 3 years ago
second
@jianjianh_ 3 years ago
Problem solved! Lmao
@insighttoinciteworksllc1005 2 years ago
Humans can do the iterative process too. The Inquiry Method is the only thing that requires it. If you add the trial and error element with self-correction, young minds can develop a learning process. Learn How to learn? Once they get in touch with their inner teacher, they connect to the Information Dimension (theory). Humans can go to where the Perceiver can't go. The Inner teacher uses intuition to bring forth unknown knowledge to mankind's consciousness. The system Mr. Tesla used to create original thought. Unless you think he had a computer? The Perceiver will be able to replace all the scientists that helped develop it and the masses hooked on the internet. It will never replace the humans that develop the highest level of consciousness. Thank you, Yeshua for this revelation.
@ivangruber7895 3 years ago
CAT > HOUSE
@allurbase 2 years ago
It's kind of dumb to input the same video frame over and over; just go frame by frame. It will take a bit for it to catch up, but so would you.
@guidoansem 2 years ago
algo
@pratik245 2 years ago
😂😂
@mikesl6895 3 years ago
Third