GLOM: How to represent part-whole hierarchies in a neural network (Geoff Hinton's Paper Explained)

  44,650 views

Yannic Kilcher

1 day ago

#glom #hinton #capsules
Geoffrey Hinton describes GLOM, a Computer Vision model that combines transformers, neural fields, contrastive learning, capsule networks, denoising autoencoders and RNNs. GLOM decomposes an image into a parse tree of objects and their parts. However, unlike previous systems, the parse tree is constructed dynamically and differently for each input, without changing the underlying neural network. This is done by a multi-step consensus algorithm that runs over different levels of abstraction at each location of an image simultaneously. GLOM is just an idea for now but suggests a radically new approach to AI visual scene understanding.
OUTLINE:
0:00 - Intro & Overview
3:10 - Object Recognition as Parse Trees
5:40 - Capsule Networks
8:00 - GLOM Architecture Overview
13:10 - Top-Down and Bottom-Up communication
18:30 - Emergence of Islands
22:00 - Cross-Column Attention Mechanism
27:10 - My Improvements for the Attention Mechanism
35:25 - Some Design Decisions
43:25 - Training GLOM as a Denoising Autoencoder & Contrastive Learning
52:20 - Coordinate Transformations & Representing Uncertainty
57:05 - How GLOM handles Video
1:01:10 - Conclusion & Comments
Paper: arxiv.org/abs/2102.12627
Abstract:
This paper does not describe a working system. Instead, it presents a single idea about representation which allows advances made by several different groups to be combined into an imaginary system called GLOM. The advances include transformers, neural fields, contrastive representation learning, distillation and capsules. GLOM answers the question: How can a neural network with a fixed architecture parse an image into a part-whole hierarchy which has a different structure for each image? The idea is simply to use islands of identical vectors to represent the nodes in the parse tree. If GLOM can be made to work, it should significantly improve the interpretability of the representations produced by transformer-like systems when applied to vision or language.
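To make "islands of identical vectors" concrete: after the per-location embeddings settle, locations whose vectors at a given level (nearly) agree can be grouped into one island, and each island acts as a node of the parse tree. A rough sketch of that grouping (the cosine threshold and the union-find grouping are my own illustrative choices, not details from the paper):

```python
import numpy as np

def find_islands(level_vectors, threshold=0.95):
    """Group locations whose vectors at one level nearly agree.

    level_vectors: [n_locations, dim] embeddings at a single level.
    Returns a list of islands (each a list of location indices)."""
    v = level_vectors / np.linalg.norm(level_vectors, axis=1, keepdims=True)
    sim = v @ v.T                      # pairwise cosine similarity
    n = len(v)
    parent = list(range(n))            # union-find over locations

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > threshold:  # "identical enough" -> same island
                parent[find(i)] = find(j)

    islands = {}
    for i in range(n):
        islands.setdefault(find(i), []).append(i)
    return list(islands.values())

# Example: two groups of locations that converged to two different vectors.
vecs = np.vstack([np.tile([1.0, 0.0], (3, 1)), np.tile([0.0, 1.0], (2, 1))])
print(find_islands(vecs))  # [[0, 1, 2], [3, 4]]
```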
Authors: Geoffrey Hinton
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
KZfaq: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / yannic-kilcher-488534136
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 114
@stalinsampras 3 years ago
Yannic "The Light Speed" Kilcher, never stop the good work!
@yimingqu2403 3 years ago
There are not many people in the whole world who can just publish a paper like this.
@444haluk 3 years ago
And this is not a compliment :D
@Daniel-ih4zh 3 years ago
@@444haluk How is this paper a bad idea?
@JamesAwokeKnowing 3 years ago
I can't shake the feeling that Hinton might feel he doesn't have much time left and so wanted to get the idea out there ASAP without waiting for a working system. That makes me sad. Then again, capsules and Boltzmann networks etc. were also always presented 'not yet working', so maybe this is Hinton in peak form. :)
@andres_pq 3 years ago
Thought the same. Also wanted to protect the idea so Schmidhuber can't say it's his.
@IdiotDeveloper 3 years ago
The GLOM architecture feels similar to the neocortex column structure. Thanks for the easy explanation.
@snippletrap 3 years ago
Directly inspired by it.
@hemanthkotagiri8865 3 years ago
I just saw his tweet and was wondering if you had uploaded yet, and here you are. Wow dude, you're crazy! I love it! Keep 'em coming! Helping me a lot!
@Freddiriksson 3 years ago
Mr Sampras already wrote it. Just don't stop!! I really appreciate your work.
@florianhonicke5448 3 years ago
Made my Sunday! Thanks!!!
@andrewmeowmeow 3 years ago
Cool! Your cross-layer attention mechanism is very clever. Thank you for sharing such a clever idea and this high quality video. From a deep learning newbie😀
@BaddhaBuddha 2 years ago
But what would be the biological correlate of this method of summing over layers?
@jamiekawabata7101 3 years ago
Love the video, thank you.
@hoangnhatpham8076 3 years ago
The feedback from upper layers and lateral connections remind me of Neural Abstraction Pyramid. Anyway, nice video as usual!
@avishkarsaha8506 3 years ago
luv these vids, the best lunchtime infotainment
@whale27 3 years ago
Love these videos
@lm-gn8xr 3 years ago
Does the paper discuss any relations with graph neural networks? The way features are updated by aggregating top/down/same-layer features looks a lot like what is done in graph networks to me. Thanks for the video btw, it's incredible to be able to create such content one day after the release of a 40-page paper 👏
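The GNN reading can indeed be made literal: treat each (location, level) pair as a node, with vertical edges to the levels above and below at the same location and lateral edges to the same level at other locations; one GLOM step is then one round of message passing. A toy sketch of that reading (the graph construction and plain mean aggregation are my assumptions; the paper uses separate networks per direction and attention weights for the lateral edges):

```python
import numpy as np

def build_glom_graph(n_locations, n_levels):
    """Edges of the implicit GLOM graph: nodes are (location, level) pairs."""
    node = lambda loc, lev: loc * n_levels + lev
    edges = []
    for loc in range(n_locations):
        for lev in range(n_levels):
            if lev + 1 < n_levels:                 # vertical: bottom-up / top-down
                edges.append((node(loc, lev), node(loc, lev + 1)))
                edges.append((node(loc, lev + 1), node(loc, lev)))
            for other in range(n_locations):       # lateral: same level, other locations
                if other != loc:
                    edges.append((node(other, lev), node(loc, lev)))
    return edges

def message_passing_step(h, edges):
    """Plain mean-aggregation message passing over those edges
    (GLOM would use per-edge-type networks and attention instead)."""
    agg = np.zeros_like(h)
    count = np.zeros(len(h))
    for src, dst in edges:
        agg[dst] += h[src]
        count[dst] += 1
    return (h + agg / np.maximum(count, 1)[:, None]) / 2.0

h = np.random.default_rng(0).normal(size=(4 * 5, 8))  # 4 locations x 5 levels, dim 8
h = message_passing_step(h, build_glom_graph(4, 5))
```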
@suleymanemirakin 10 months ago
Thank you so much, this video helped a lot.
@billykotsos4642 3 years ago
The man, the myth, the legend
@kevivmodi7019 3 years ago
the one and only
@marcelroed 3 years ago
Yannic or Geoffrey?
@dibyanayanbandyopadhyay3018 3 years ago
@@marcelroed both
@abhishekmaiti8332 3 years ago
coldzera vs yannic
@nikhilm4418 3 years ago
I’m wondering if a UNet++ architecture has an idea similar to this as far as the information sharing across levels of a column is concerned. GLOM is way more sophisticated of course w.r.t. attention-based inter-column representation sharing etc.
@RishitDagli 3 years ago
Really Cool video, loved the way you explain stuff. I also tried to implement GLOM in code after watching this video.
@Idiomatick 2 years ago
did it work at all?
@simonstrandgaard5503 3 years ago
This is a much better layout than the previous video. It works well with the semi-transparent text in the bottom right corner. It doesn't work with the channel icon wasting precious screen real estate throughout the video. Instead make a circle with YK inside; it doesn't have to be a fancy font nor colorful, it's important that it's readable/recognizable.
@simonstrandgaard5503 3 years ago
On second thought, I think readability of the youtube link can be improved by using a white font color with a black outline. Also no YK circle, just the youtube link.
@michaelwangCH 3 years ago
Intuitive and interesting paper - Hinton replicates how the human brain recognizes an object in an image. AI researchers should write the code and try it out.
@andreassyren329 3 years ago
It might be possible to bias the network into learning an appropriate attention modulation such as the one you proposed by introducing positional encoding in the columns. Then columns far apart are less similar and their influence is modulated. An interesting consequence of such a learned modulation would be that, over several iterations, an "island" could arrive at an island-global position encoding in addition to the "object" encoding. This could be useful for higher-level layers, which would benefit from using the location information of lower-level islands. PS: GLOM has a distinct smell of graph nets.
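One simple way to realize that bias, sketched under the commenter's assumption: add a distance-dependent penalty to the lateral attention logits (equivalently, make far-apart columns look less similar), so nearby columns reach consensus more easily. The Gaussian penalty and parameter names below are illustrative, not from the paper or the video.

```python
import numpy as np

def lateral_attention_with_position(x, coords, sigma=3.0):
    """Same-level attention where far-apart columns are down-weighted.

    x:      [n_locations, dim]  level embeddings (queries = keys = values)
    coords: [n_locations, 2]    pixel/patch coordinates of each column
    sigma:  spatial scale; larger means attention reaches further."""
    logits = x @ x.T                                 # content similarity
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    logits = logits - d2 / (2 * sigma ** 2)          # Gaussian distance penalty
    logits -= logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x                                     # attention-weighted consensus

# 8x8 grid of columns with 16-dim level embeddings
coords = np.stack(np.meshgrid(np.arange(8), np.arange(8)), -1).reshape(-1, 2)
x = np.random.default_rng(0).normal(size=(64, 16))
x_new = lateral_attention_with_position(x, coords.astype(float))
```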
@eladwarshawsky7587 3 years ago
I feel like this is basically message passing like in GNNs (plus some attention), but with patches of an image as nodes.
@jonathanr4242 1 year ago
Great explanation. It seems a bit similar in some senses to the hierarchical mixture of experts model.
@sfarmapietre 3 years ago
Nice!!
@vsiegel 3 years ago
Potatoes can be much more irregularly shaped than avocados. And you have prior knowledge about the avocado shape! That was not a good physics experiment ;) (It is not claimed to be one), which is only noticeable based on the contrast to the rest of the presentation - which is obviously great, to be clear. And the fact of including an experiment in itself is brilliant, thanks for including it!
@swazza9999 3 years ago
Thanks Yannic! I noped out of the paper within the first few pages but this video will help me gather the courage to tackle it again. By the way what software are you using for the pdf drawing? (and if anyone else knows, would love to know from you)
@jeshweedleon3960 3 years ago
OneNote, iirc
@frankjin7086 3 years ago
Your idea is so brilliant. Thanks for sharing. I am searching for methodologies for hierarchical structure in NNs. Your idea is the smartest.
@andrehoffmann2018 3 years ago
Great video! I'm thinking that maybe it makes sense to gather all information in the same layer for the island creation consensus, even if it may break the parse tree logic a little. Like, using the cat example, fur is fur, even though we may make arbitrary symbolic separations at higher levels, like "cat ear fur" or "cat neck fur".
@andrehoffmann2018 3 years ago
This may help when treating a video, for example. If in the video a cat moves, a lower-level channel may now receive the image of the cat, when in previous timesteps it received background. If it gathers all the information from decisions at the same level in the previous timestep, it can quickly decide "fur" (or "cat" at higher levels), because there were other columns at this level that already processed this and had agreed on "fur". But if it ignores some information because it was at a different parse tree node (higher-level vector island), it will be harder to make this decision, because there is no information about "cat" in the "background" parse tree node that this column was part of in the previous step. Maybe this doesn't make sense, but this is how I understand it.
@nikronic 3 years ago
Absolutely, I have the same idea too. We may call it "breaking the parse tree", but actually we can interpret it as updating the parse tree. At around 29:50, it is mentioned that lower levels of different columns under different higher nodes should not contribute to attention propagation, while including them makes it possible to create new trees or destroy previous ones depending on the state of the pixels. Even in a static image, depending on the patch size (let's say bigger than a pixel), multiple patches may refer to the same object, which could be represented by a node in the higher levels of a column when we let attention pass information between two distinct nodes (branches of trees).
@SudhirPratapYadav 2 years ago
Which software are you using for displaying the paper, with so much margin to draw in?
@JamesAwokeKnowing 3 years ago
@yannic can you speculate how well this architecture maps to spiking networks (e.g. on neuromorphic chips)? Because of the iterative and time-based nature, it would seem it could map nicely.
@priyamdey3298 3 years ago
The so-called columns feel like they were inspired by cortical columns, with their 4 layers being consistently present throughout the neocortex, although those are way more complex to understand.
@JamesAwokeKnowing 3 years ago
It seems that way, but not really. The layers would be across cortical columns, with cortical columns being closer to pixels/image patches.
@snippletrap 3 years ago
They are absolutely inspired by cortical columns. Hinton is frank about this in his talks.
@JamesAwokeKnowing 3 years ago
@@snippletrap If so, it's foolish, because studies show clearly that that's not how the brain has objects/information organized. It's 100% wrong to think that the layers in cortex correspond to higher-level features, so that all the higher concepts are on one layer and lower ones on another. Instead it's clearly organized along the other dimension, where e.g. you have an area for hands, and it has sub-areas for fingers and sub-areas for fingertips etc. Tell me, which of the layers of the cortex has all the high-level (e.g. cat, car) classes?
@charlesfoster6326 3 years ago
@@JamesAwokeKnowing I don't understand. That's also how GLOM works. There'd be a section of columns that map onto inputs from, say, the left eye, and build hierarchical representations (in each column) of visual inputs. The representations are modality specific, localized, but also distributed.
@snippletrap 3 years ago
@@JamesAwokeKnowing Hinton distinguishes between levels and layers. Watch his presentation on this paper. Charles Foster is right.
@herp_derpingson 3 years ago
25:00 X * X.T can just be interpreted as "how many things similar to myself are near me". So, if we attend to everything in the picture, it will not be very informative, I think. We need to have some spatial window. Speaking of which, I think we should also add position embeddings to this attention.
44:40 Can't we set the vector lengths of the embeddings to 1 by normalizing them after each step?
47:00 So, we have another loss term where the loss is the deviation of individual predictions from the final summation? Is this even needed? Just because I am a pixel of cat fur doesn't mean that the pixel next to me is going to be a pixel of cat fur. It can also be grass.
55:00 Humans can't do that either. Human eyes are not rotation invariant.
58:30 The video analogy is excellent!
I think an easier way to train this model would be to take ImageNet, make every class one orthogonal vector with the embedding length, and then calculate the MSE loss where all the vectors in the last layer should be equal to the orthogonal vector representing the class. Basically loss = sum(mse(Y, Y_[i]), 0, n) where Y is the orthogonal vector corresponding to that image's ground-truth class and Y_[i] is the activation of the last layer of the i-th column.
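The training suggestion at the end is easy to write down. A sketch of that proposal (the commenter's idea, not something from the paper): give each class a fixed orthogonal target vector and penalize every column's top-level embedding for deviating from the target of the image's ground-truth class.

```python
import numpy as np

def make_class_targets(n_classes, dim, seed=0):
    """One orthogonal target vector per class: rows of a random
    orthonormal matrix (requires dim >= n_classes)."""
    a = np.random.default_rng(seed).normal(size=(dim, dim))
    q, _ = np.linalg.qr(a)
    return q[:n_classes]                      # [n_classes, dim]

def top_level_class_loss(top_level, label, targets):
    """MSE between every column's top-level vector and the class target.

    top_level: [n_columns, dim]  last-level embeddings of all columns
    label:     int               ground-truth class of the image
    targets:   [n_classes, dim]  fixed orthogonal class vectors"""
    y = targets[label]
    return np.mean((top_level - y[None, :]) ** 2)

targets = make_class_targets(n_classes=10, dim=64)
top = np.random.default_rng(1).normal(size=(49, 64))   # e.g. 7x7 columns
print(top_level_class_loss(top, label=3, targets=targets))
```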
@bertchristiaens6355 2 years ago
Is this similar to the cortical columns of the thousand brains theory?
@morkovija 3 years ago
My prediction: somebody IS going to cite the channel in their papers this year =)
@patrickh7721 3 years ago
34:08 haha
@DanFrederiksen 3 years ago
Video might be a little long. I've only followed Hinton a little, but I get the impression that he might publish an idea that might not work, just in case something along those lines turns out to work, as a way to claim it for himself. Even if the idea isn't originally his either.
@mdmishfaqahmed5523 3 years ago
10:04 that's one adorable cat :p
@NM-jq3sv 3 years ago
Lambda should be multiplied by the "shortest distance between two nodes in a graph", but we don't know the graph. I couldn't understand your math in the attention modification :(
@zhicongxian1582 3 years ago
Thank you for the great video. I have one question in section 7, "learning islands". To avoid collapse in the latent variables, Hinton proposes one obvious solution is to regularize the bottom-up and top-down neural networks by encouraging each of them to predict the consensus opinion. Is the following interpretation correct? The bottom-up network learns a part-whole relationship, e.g., a cat's ear and a cat's neck suggest the presence of a cat's head. The top-down network learns a whole-part relationship, e.g., if there is a cat's head in the area, then there must be a cat's ear and a cat's neck. The presence probability of parts in the bottom-up network should be the same as the one inferred in the top-down network.
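For reference, the regularization being discussed can be read as an extra loss that pulls the bottom-up and top-down predictions toward the consensus they jointly produced, with the consensus treated as a fixed target. A minimal sketch of that reading; the names, the equal weighting, and the copy standing in for stop-gradient are all assumptions:

```python
import numpy as np

def consensus_regularizer(bu_pred, td_pred, lateral, prev):
    """Penalize bottom-up and top-down predictions for disagreeing with
    the consensus embedding they helped form (the consensus is treated as
    a fixed target, i.e. no gradient would flow through it in training)."""
    consensus = (bu_pred + td_pred + lateral + prev) / 4.0
    target = consensus.copy()     # stand-in for stop-gradient / detach
    reg = np.mean((bu_pred - target) ** 2) + np.mean((td_pred - target) ** 2)
    return consensus, reg

rng = np.random.default_rng(0)
bu, td, lat, prev = (rng.normal(size=(16, 32)) for _ in range(4))
consensus, reg = consensus_regularizer(bu, td, lat, prev)
print(reg)
```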
@zhicongxian1582 3 years ago
After reading that paragraph again, I assume it maybe means that the way GLOM updates the weights using consensus agreement can avoid collapsing latent variables. No regularization technique is mentioned, correct?
@veedrac 3 years ago
Unless I'm misunderstanding something major, transformers can already express most of these computations (including Yannic's proposed improvement) through attention, and in some cases do it better. It doesn't do the iteration method shown, but I think that's the only major missing part (and IMO it seems kinda sketchy). It seems to me you'd be better off trying to augment the training of a more traditional transformer to encourage these structures, rather than hard-coding the bias into the architecture.
@shengyaozhuang3748 3 years ago
Can you have a look at OpenAI's latest work "Multimodal Neurons in Artificial Neural Networks"?
@charliesteiner2334 3 years ago
It seems like this is missing scale invariance - if you have a cat and a zoomed in cat, you end up with the same parts of the cat being processed at different layers.
@ryanalvarez2926 3 years ago
I think this could be implemented with neural cellular automata. Every pixel gets a column of embeddings and is updated iteratively. It’s already so close.
@charlesfoster6326 3 years ago
I agree! Try it out :)
@patf9770 3 years ago
I'm probably misunderstanding something, but isn't the feedback transformer essentially implementing this in an efficient way?
@andres_pq 3 years ago
1:00:00 it is avocado-shaped
@dr.mikeybee 3 years ago
I think capsules are the wrong direction. What we've seen over and over is that end-to-end ANNs eventually outperform what humans engineer. I believe that when the models get large enough, and when we feed in the right training data in the right order, we will get truly general models. Are there systems that choose and order training data?
@jsmdnq 2 years ago
This seems like it would just be equivalent to a Fourier spectrogram. You will have "noise" at the highest levels of detail representing all the various info, and as you go up the abstraction you will filter out that noise. The result will simply be that of a 3D Fourier transform with a progressive low-pass filter which filters out more and more data. At the very top you have a constant which "represents" the scene in its most abstracted form. Without training the algorithm to know how to filter towards some class, you won't be able to interpret the results with any meaning beyond the inherent classification (which is just abstract bits, and so no classification).
@jrkirby93 3 years ago
I'm really curious why Hinton wrote this paper... instead of just building the thing? He has experience, access to data, access to compute, grad students to help, and time to focus on it. Was he afraid someone else was doing something similar and he wanted to publish first? Does he need help figuring out the design? Is there something else he's missing?
@wenxue8155 3 years ago
My guess is that he just wants to take credit for every advance in the AI field. It's like you read some papers and sense that this could be a breakthrough, so you write the idea into a paper. This is not fair to people who have been working on this idea. If these people succeed, i.e. they actually build something that works, in the future people would say, oh, Geoffrey Hinton had this idea before you, so it was he who invented this.
@arnavdas3139 3 years ago
Lockdown 2.0 run restarted....😭😭😭😭...how to keep up with your videos
@shaypatrickcormac2765 3 years ago
it looks like Hypercolumns for Object Segmentation
@noddu 3 years ago
59:40 interesting
@alexanderkyte4675 3 years ago
You should make merch related to NN jokes
@Mordenor 3 years ago
Neural Network November
@kartikeyshaurya1827 3 years ago
Bro is it ok to use the knowledge from your video in my own video??? Of course I will be giving credit to you.....
@zeyuyun6605 3 years ago
10:00 "a cat" lol
@abhishekaggarwal2712 3 years ago
I think you are confusing levels and layers. Levels are within embeddings and represent levels of abstraction. Each layer will have all (say 5) levels in an embedding. Layers are meant to provide progressive temporal resolution of these levels during the forward pass. The lateral computation between the same levels across different locations is a conventional self-attention-like computation with all the keys, values and queries being identical.
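In terms of shapes, one way to picture that distinction (a sketch of this reading; the numbers are arbitrary): levels are an axis of the state tensor, while layers are repeated applications of the same weight-shared update during the forward pass.

```python
import numpy as np

n_locations, n_levels, dim = 196, 5, 128   # e.g. 14x14 patches, 5 abstraction levels
n_layers = 12                              # "layers" = time steps of the forward pass

# LEVELS live inside the state: one embedding per abstraction level per location.
state = np.zeros((n_locations, n_levels, dim))

def update(state):
    # Placeholder for the bottom-up / top-down / lateral consensus update;
    # the same weight-shared function is applied at every layer.
    return state

# LAYERS are repetitions of that update, progressively refining all levels.
for _ in range(n_layers):
    state = update(state)
```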
@youngjin8300 3 years ago
Another unreasonable effectiveness of Yannic just hit home.
@eelcohoogendoorn8044 3 years ago
I guess I'm all for people publishing whatever is on their mind, without too much regard for conventions, so good on him. But I don't see much value in such 'idea' papers in ML. I don't think there is a shortage of ideas; there is a shortage of ideas you can get to work and do something useful. If the field were in a state where we had a deep understanding of why the things we do work in the first place, such theoretical leaps might pay off. As it stands, the question of 'but is this an idea with a loss function that will converge using stochastic gradient descent?' is one you ignore at your own peril.
@user-tm9fh5rb5y 3 years ago
The avocado is not a joke.
@pensiveintrovert4318 3 years ago
Large networks already do this implicitly. A neural net is an ensemble of neural nets, each specializing in different images.
@wahabfiles6260 3 years ago
what are you saying?
@arnebinder1406 3 years ago
Implementation how-to for the text domain (approx. < 1 h when adapting huggingface code):
* take the ALBERT model (aka weight sharing across layers), see paper (arxiv.org/abs/1909.11942) and code (github.com/huggingface/transformers/blob/master/src/transformers/models/albert/modeling_albert.py)
* use t layers (t = number of time steps you want to model)
* use L heads (L = number of GLOM layers you want to model)
* do these small modifications to the ALBERT model:
1) remove the linear projections for query, key, value (just pass through [(d/L)*i..(d/L)*(i+1)] to the i-th head; d is the embedding dimensionality)
2) modify/constrain the dense layer that follows the attention in a way that each partition [(d/L)*i..(d/L)*(i+1)] of its output is only constructed by the output of the (i-1)-th head, the i-th head, and the (i+1)-th head
3) remove the skip connection(s) and the MLP that sits on top of the attention layer
Maybe this needs some minor tweaks, but you should get the idea.
EDIT: Took a bit longer, but here you are: github.com/ArneBinder/GlomImpl
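For readers who don't want to open the repo, here is a tiny numpy sketch of the core of that recipe, ignoring the ALBERT plumbing (this is my paraphrase of the comment, not the linked code): split the hidden vector into L contiguous level slices, let each slice attend over the same slice at other positions with identity query/key/value projections, and rebuild each output slice only from the slices of the neighboring levels.

```python
import numpy as np

def glom_like_attention(h, n_levels):
    """h: [seq_len, dim] with dim divisible by n_levels."""
    seq, dim = h.shape
    d = dim // n_levels
    levels = h.reshape(seq, n_levels, d)            # split into level slices

    # Per-level attention with identity query/key/value projections.
    attended = np.empty_like(levels)
    for l in range(n_levels):
        x = levels[:, l]                            # [seq, d]
        logits = x @ x.T / np.sqrt(d)
        logits -= logits.max(-1, keepdims=True)
        w = np.exp(logits)
        w /= w.sum(-1, keepdims=True)
        attended[:, l] = w @ x

    # Each output level sees only levels l-1, l, l+1 (stand-in for the
    # constrained dense layer; a real version would learn these maps).
    out = np.zeros_like(levels)
    for l in range(n_levels):
        neigh = [attended[:, j] for j in (l - 1, l, l + 1) if 0 <= j < n_levels]
        out[:, l] = sum(neigh) / len(neigh)
    return out.reshape(seq, dim)

h = np.random.default_rng(0).normal(size=(10, 5 * 16))  # 10 tokens, 5 levels x 16 dims
h = glom_like_attention(h, n_levels=5)
```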
@boss91ssod 3 years ago
OneNote is not good for annotating PDFs; please use something with sharper quality (e.g. GoodNotes or LiquidText). It is uncomfortable to read the text. As an alternative, maybe import the PDF pages as high-res images...
@nicolagarau9763 3 years ago
I don't get all the hate towards submitting an arXiv paper targeting theoretical ideas. I mean, even if Hinton is the author, and even if some concepts of the paper are not that new, in my opinion the idea is expressed very well and could be revolutionary if implemented correctly. It's pretty refreshing to see new theoretical papers in ML which are not targeted towards pure benchmarking. To see it as an attempt at being the first to invent it is quite silly in my view, since research should not be a competition. The point is, we need less benchmarking and fewer 0.001% improvements in ML, and more unsupervised and interpretable models, possibly based on biologically plausible concepts.
@nicolagarau9763 3 years ago
Also, thank you very much Yannic for the inspiring video ❤
@eduarddronnik5155 3 years ago
So he proposed a transformer?
@bublylybub8743 3 years ago
I appreciate the video walkthrough, but I have to rant a bit... What bothers me the most is that the reference/related work is severely lacking/missing in this manuscript. The first sentence in the introduction, quote: "There is strong psychological evidence that people parse visual scenes into part-whole hierarchies and model the viewpoint-invariant spatial relationship between a part and a whole as the coordinate transformation between intrinsic coordinate frames that they assign to the part and the whole [Hinton, 1979]." Come on! Only one paper is cited and it's his own paper... I don't know how psychology or neuroscience people feel about this. I mean, at least pay some effort to cite some biology/psychology/neuroscience papers here. Back to the idea... I think the idea presented in this paper is... not new. Hierarchical structure in vision modeling, message passing, and attention are all well studied... I am sorry, but I don't see any REAL NEW stuff here in this paper. My impression is that the things described in this paper might just be some less expressive transformers... with some inductive bias (like hierarchy) baked in.
@nikronic 3 years ago
You are right, but the point is that it is an idea paper, just to share his views, and also in the abstract he says that it's a mechanism that combines all the well-known mechanisms in a specific way. Also, at the end of the paper, he acknowledges that he should have read many more papers and wants other people's views on this topic.
@SudhirPratapYadav 2 years ago
This looks very similar to *Jeff Hinton's* thousand brains theory - cortical columns in the neocortex with a voting system. edit1: Jeff Hawkins', not Jeff Hinton's
@bertchristiaens6355 2 years ago
Jeff Hawkins*, and I thought the same! I'm curious which architecture will implement it most accurately: transformers, graph NNs, ... When looking at the Tesla Day video, it seems that a combination could be the solution: encoding with CNNs, fusion of inputs with attention, feature pyramids, and temporal and spatial predictions.
@SudhirPratapYadav 2 years ago
@@bertchristiaens6355 I don't know how I could type Jeff Hinton after reading Jeff Hawkins' book and watching many videos. The mind is quirky in its own way.
@SudhirPratapYadav 2 years ago
@@bertchristiaens6355 Yes, Tesla took an engineering approach. I think the Tesla self-driving car is one of the first real-world systems with deep learning networks used as a 'module' the way we use other software. It truly is Software 2.0.
@jeffhow_alboran 3 years ago
Confirmed. Yannic is an alien creature and does not need sleep.
@silvercat4 3 years ago
Hey Yannic, please share a picture of your cat! I suspect she's quite a beauty
@aloshalaa1992 3 years ago
Do you want to collaborate and do your comments on that with a few additions of mine :)?
@wenxue8155 3 years ago
This is not fair to people who have been working on this idea. If these people succeed, i.e. they actually build something that works, in the future people would say, oh, Geoffrey Hinton had this idea before you, so it was he who invented this.
@walterwhite4234 3 years ago
Dude, Jürgen Schmidhuber invented it in the 90s, so shut the f**** up
@tensorstrings 3 years ago
"About 5" hahaha
@RoboticusMusic 3 years ago
Is this because many or most of our brain cells are actually themselves an "expert" (for example, in rats there is just one neuron signaling that an image is moving up, and it can be stimulated to make the rat press the button it was trained to press even if the image is moving down), so this is an efficient method to "find that expert neuron"?
@marc-andrepiche1809 3 years ago
This is delicious
@michaelnurse9089 3 years ago
Can we hear more from Future Yannick please - maybe a little bit from Yannick 2028 next time...