
Transformers Represent Belief State Geometry in their Residual Stream

  6,189 views

Tunadorable

A day ago

arxiv.org/abs/...
Support my learning journey either by clicking the Join button above, becoming a Patreon member, or a one-time Venmo!
/ tunadorable
account.venmo....
Discuss this stuff with other Tunadorks on Discord
/ discord
All my other links
linktr.ee/tuna...

Comments: 57
@dr.mikeybee · a month ago
Next token prediction is the objective function. It's not what's being learned in the model; the world is what's being learned. We shouldn't mistake the chisel for the model.
@covertassassin1885 · a month ago
Great way of wording this very clear distinction!
@vipinsou3170 · a month ago
This is just multiple next-word predictions interacting in a way that represents the world 🌎.
@BooleanDisorder · a month ago
Yes, this. People so often mistake that.
@hjups · a month ago
Very interesting. I can confirm that a similar behavior occurs in generative image models, especially with representations peaking near the center of the model (actually off-center). Measuring the image representations across multiple layers is much more challenging than what this paper did, but you can do it in aggregate (e.g. define a metric and compute the metric as a function of layer depth).
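A minimal sketch of that kind of per-layer aggregate measurement, using toy stand-in activations and the participation ratio as an example metric (both are illustrative assumptions of mine, not anything from the paper or the comment above):

```python
# Toy illustration: compute one scalar metric per layer and look at it as a
# function of depth. The activations here are random stand-ins; the metric is
# the participation ratio, an effective dimensionality of the representation.
import numpy as np

def participation_ratio(acts: np.ndarray) -> float:
    """(sum of covariance eigenvalues)^2 / (sum of squared eigenvalues)."""
    centered = acts - acts.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))
    return float(eigvals.sum() ** 2 / np.sum(eigvals ** 2))

rng = np.random.default_rng(0)
# pretend these are activations collected from 12 layers of some model
layer_acts = [rng.normal(size=(500, 64)) * np.linspace(1.0, 0.1, 64) ** (i / 4)
              for i in range(12)]
metric_by_depth = [round(participation_ratio(a), 1) for a in layer_acts]
print(metric_by_depth)  # one number per layer depth
```

Any scalar metric could stand in for the participation ratio here, e.g. the R² of a linear probe against some reference representation.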
@genegray9895 · a month ago
There seems to be a general phenomenon in these generative models where the earlier layers build and enrich an abstract representation of the task, then the later layers query this representation to select an appropriate output.
@hjups · a month ago
​@@genegray9895 I agree, and that's behavior which can be leveraged to improve convergence and computational efficiency. Although, doing so does restrict the distribution manifold (it's not clear how big of an issue that is yet).
@genegray9895 · a month ago
@@hjups Can you elaborate? I'm not sure if you mean the models implicitly learning an efficient way of organizing the forward pass or that we as researchers can leverage this observation to improve training efficiency. If the latter, how so?
@hjups · a month ago
​@@genegray9895 Both actually. I can't offer evidence for the former without going into a paper that's under review, but the latter is evident with U-Net architectures. Essentially my claim is that "The Bitter Lesson" only applies in the domain of infinite compute and data, but as researchers we don't typically live in that world (maybe Google and OpenAI approximate that regime). So in a more constrained environment, there are things we can do to reduce the cost of training a generative model without sacrificing performance (i.e. by leveraging the representation learning which the model will learn anyway).
@chrisavila1969 · a month ago
Thanks for reviewing papers! Love to listen to it while at work. Would be fun if you tried recreating some parts of the paper as you go thru them (:
@Tunadorable · a month ago
haha i do this but very very rarely. replication takes orders of magnitude more time so i’ve gotta pick my battles
@infuriatinglyopaque57 · a month ago
I think Hoffman’s position is that our representations are totally decoupled from external reality, not just simplifications or compressions of it. So I don’t think the Shai paper is necessarily congruent with Hoffman's work (doesn’t really contradict it either though, since they’re not doing evolutionary simulations with the transformers). There’s also some debate over whether Hoffman’s findings are robust across complex environments where agents must develop representations that are reusable across multiple tasks. Relevant paper citation below: Flexible goals require that inflexible perceptual systems produce veridical representations: Implications for realism as revealed by evolutionary simulations. Berke, M. D., Walter‐Terrill, R., Jara‐Ettinger, J., & Scholl, B. J. (2022). Cognitive Science, 46(10), e13195.
@Tunadorable · a month ago
I actually did an extension of Hoffman's work that has never seen the light of day beyond one of my first shitty YT videos kzfaq.info/get/bejne/jtCiiM2QssC7Y6s.html
I'm of the belief that if Hoffman is correct and the structure we do learn, while not veridical, is still useful, then AI agents should likely converge to learn roughly the same structure we have (although it may be in some ways quite different due to the nature of their substrate & architecture), and it doesn't matter whether either of our models are veridical as long as they're compatible.
Definitely going to check out the paper you sent; based on the title it reminds me of an idea I had for an extension of his work that I never got around to.
@weirdsciencetv4999 · a month ago
This should be even more apparent in heavily grokked models.
@dr.mikeybee · a month ago
You are absolutely right. These aren't common statistics. Logits are not log-odds. These are high-dimensional models of all sorts of functions. What happens at an induction head is not mathematical probability in the usual sense. Activation pathways are intractably complex to humans.
@onicarpeso · a month ago
🎯 Key points for quick navigation:
00:00 *🧠 Transformers learn belief state geometry during training*
- Transformers pre-trained for next token prediction develop internal structures related to belief updating.
- Models like Stable Diffusion demonstrate internal 3D world models inferred from 2D photos.
- Transformers are capable of computational understanding based on complex internal world models.
03:50 *🌌 Representation of belief states in Transformers*
- Hidden Markov models describe generative structures and belief states.
- The structure of belief states in Transformers emerges gradually during training.
- Linear regression analysis reveals the fractal geometry of belief states in the residual stream.
08:14 *🎨 Understanding fractal structures in data distributions*
- The complex, non-trivial data distribution exhibits a fractal structure.
- Linear regression analysis matches ground truth belief distributions in a 2D subspace.
- The geometric structure of belief states emerges over the course of training in the residual stream.
16:49 *⚙️ Relationship between Transformers' training and belief state representation*
- Transformers learn representations of belief states across all layers rather than just the final layer.
- The complexity of belief states is preserved despite the models being trained for next token prediction.
- The interplay between different layers of the residual stream impacts the model's internal understanding.
Made with HARPA AI
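For the "linear regression analysis" mentioned in those key points, here is a rough sketch of the idea under my own toy assumptions (synthetic belief states and activations, not the paper's process or released code): fit an affine map from residual-stream activations to ground-truth belief states and inspect the geometry of the projected points.

```python
# Sketch of a linear probe from residual-stream activations to ground-truth
# belief states. All data here is synthetic; in the paper the beliefs come from
# an optimal observer of the HMM and the activations from a trained transformer.
import numpy as np

def fit_belief_probe(resid: np.ndarray, beliefs: np.ndarray) -> np.ndarray:
    """Least-squares affine map from (n, d_model) activations to (n, k) beliefs."""
    X = np.hstack([resid, np.ones((resid.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(X, beliefs, rcond=None)
    return W

rng = np.random.default_rng(1)
beliefs = rng.dirichlet(np.ones(3), size=5000)            # points on the 2-simplex
resid = beliefs @ rng.normal(size=(3, 128)) + 0.05 * rng.normal(size=(5000, 128))
W = fit_belief_probe(resid, beliefs)
projected = np.hstack([resid, np.ones((5000, 1))]) @ W    # predicted belief states
print(np.abs(projected - beliefs).mean())                 # probe error; in the paper the
                                                          # projected points trace the fractal simplex
```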
@peterbabu936 · a month ago
LLMs, like fractal pattern correlators, identify complex patterns within large datasets. LLMs process text, finding similarities in word usage, sentence structure, and overall meaning, much like how correlators detect recurring patterns in fractal geometry. This enables LLMs to generate text, translate languages, and perform various text-based tasks.
@antaishizuku · a month ago
I still think transformers with memory like an LSTM will be amazing. Like imagine the transformer predicting 3-4 tokens ahead, then keeping those in memory to check that it accurately predicted the token. That way it's always a bit ahead in the thought process. I may be off, but overall I think it's an interesting idea.
@oasill · a month ago
Very good. First time I've heard a reference to Hoffman! Interesting paper indeed.
@sk8l8now · a month ago
Love the work and am following for the consistent updates!
@kevon217 · a month ago
love the thumbnail on this one
@AryFireZOfficiel · a month ago
Nice! I like the rating thing you made at the end of the video :-)
@letteracura · a month ago
Interesting, thank you for sharing 😊🌄 Best wishes
@zyzhang1130 · a month ago
Agree with your comment about stochastic parrots at the beginning. Why does learning the underlying statistics not lead to understanding in the first place? The fact that people treat 'learning the statistics' as a derogatory characterisation is just baffling.
@huytruonguic · a month ago
What an absolutely beautiful paper! This raises the implication that if an LLM perfectly minimizes loss on language generation, its d_model would tell us exactly how many hidden states are in the simulator of this reality! Or at least a human-centric one.
@GNARGNARHEAD · a month ago
oi, ergodic /əːˈɡɒdɪk/ (adjective, Mathematics): relating to or denoting systems or processes with the property that, given sufficient time, they include or impinge on all points in a given space and can be represented statistically by a reasonably large selection of points. 😎
@reinerzer5870 · a month ago
Thank you for the community service ❤
@oafhauohguoihgakds5151 · a month ago
Oi, the five-star level is missing a star. Keep up the good work, learning a lot from you. :)
@besiansherifaj9350 · a month ago
Thank you so much, the part that I needed you to say, you said, not biased, truly, and made a true point, and I understood a really good truth about trinity, and AI multimodal trinity!
@OzGoober · a month ago
Nice. Good call on the disclaimer! We are all works in progress.
@spitalhelles3380 · a month ago
Sometimes I have a hard time distinguishing actual science from esoteric pseudoscience at first glance.
@Tunadorable · a month ago
lmao i’m rigor blind fsfs
@MatthewKelley-mq4ce · a month ago
I mean, it's not really a hard and fast rule, to be fair. Just a sense about the work based on the work itself and what's been involved with it. It's easy to disregard, but also easy to give credence to nonsense. That said, I fall towards the end of I'd rather 'waste my time' than not.
@kimcosmos · a month ago
Qualia are just a limbic orientation to understanding.
@kimcosmos · a month ago
Canon is a religious term referring to ground truth. The papal bull of infallibility, for example, whereby canon is canon.
@Tunadorable · a month ago
it's been used for a while in the context of fiction stories to refer to facts about the fictional world confirmed by the author as opposed to those made up by fans in fan-fiction stories. it's now being used by gen z to refer to important and inevitable events or themes in a person's life, particularly common events frequently shared within a group of people (for example, cutting their own bangs and likely doing it too short when they're pre-teens is a canon event for alternative/goth/lgbtq girls)
@kimcosmos · a month ago
@@Tunadorable Another word for it is gospel. But yes, how do you give Spock a sister and stay canon? You develop a section 5 conspiracy that jumps her out of his timeline before Star Trek starts. Because the exemplary captain is Roddenberry canon and she was a progressive branch.
@eliyahenoch9131 · a month ago
🎉
@spencois2473 · a month ago
Dude you gotta get Brat up on your wall!!! It’s all about stable #bratdiffusion this summer Also I always remember the definition of ergodic like “well ergo- this dicc” which like ok guess the whole state of the system is affected at this point. Weird, but worked for me ig 🤷🏻‍♂️
@Tunadorable · a month ago
ngl it is growing on me but i can’t say i’m loving brat as much as ive loved most prior charli albums. also that wall like never gets updated but i do one day want to switch it to a green screen and have it rotate thru different & newer albums like fantano does
@NicheAsQuiche · a month ago
Strong disagree with this bit in the paper: "pretrained models should learn more than the hidden structure of the data generating process-they must also learn how to update their beliefs about the hidden state of the world as they synchronize to it in context." SGD does the updating - this line implies that the model does a sort of in-context learning across pretraining - how could a transformer update its own weights? We know they become more sample efficient later in training, but I don't see a reason to believe this is any more than Chollet's explanation that pretraining embeds a set of useful programs that fine-tuning can later call upon to quickly incentivise more specific behaviours that are already more or less present in pretraining.
@adams9020 · a month ago
The transformer updates its internal activations during inference. In context, this activation updating can be thought of as a type of dynamics. It's not dynamics of weight updating; it's dynamics of the activation states of the neurons in the neural network. One can call that dynamics a "set of useful programs" if one wants to, but what this paper shows is that the in-context activations have the geometry associated with belief state updating over the hidden states of the data generating process.
@DanielYang-mc6zn · a month ago
I don't see how this paper is interesting at all. I think it's stupidly complex and uninteresting.
[1] The data was literally the output of an HMM. It's conceivable that a model converged on this data will learn the rules of the HMM...
[2] It would be more surprising if a model that is able to predict next tokens so accurately on unseen data didn't have a belief model.
[3] This idea that "Belief state geometry represents information beyond next-token prediction" doesn't make sense at all. Information cannot be created, by the data processing inequality. So how are you claiming that intermediary layers contain information that is "beyond next-token prediction"? It is impossible to do next-token prediction without a belief model, unless you are just overfitting to the dataset.
In other words, I'm not sure what the contribution of this paper is other than "hey, here is a pretty image plotting activations on HMM data." Don't get me wrong, the plot is pretty. But the results don't really contribute anything that we didn't know.
@adams9020 · a month ago
[1] All sequential training data can be thought of as arising from an HMM. It doesn't place a restriction on the structure of the training data whatsoever. The training data could be generated from a Turing Machine, and the work presented in the paper would still hold.
[2] The belief states have information about the infinite future of the sequence, way beyond that of the next token. That's one way to see why the belief structure is surprising.
[3] At a particular context window position, the transformer internal states can be influenced by any information in the previous context window positions. Depending on the correlation structure of the data generating process, these histories will have particular types and amounts of information about the future sequences that can come after that context window position. The data processing inequality isn't broken by this fact. Unless you are referring to something else?
Just kind of zooming out a bit since maybe there are some confusions about what the paper is claiming and what it does - the HMM structure and the belief state structure and geometry are different things. The belief states have to do with the inference algorithm an optimal observer must do in the service of prediction, assuming that the observer knows the underlying data generating process. You very much do not need this structure to do _just_ next token prediction in a local sense. This video skipped over the last figure but it explicitly shows that the internal structure of the residual stream is explained by predictive structures for the entire future of the sequence, and not just the next token. Distinct belief states can have the same next token distribution, and yet the paper finds that these distinctions are represented in the residual stream, even in these cases.
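To make "belief state updating over the hidden states" concrete, here is a hedged toy sketch of the optimal-observer computation being referred to, using a made-up 3-state, 2-token HMM of my own (not the paper's process): each observed token updates the belief over hidden states via Bayes' rule, and the belief in turn determines a next-token distribution.

```python
# Toy Bayesian filtering over the hidden states of a small HMM (illustrative
# example only). T is the state transition matrix, E the emission matrix.
import numpy as np

T = np.array([[0.8, 0.1, 0.1],   # T[i, j] = P(next state j | current state i)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.1],        # E[i, t] = P(token t | state i)
              [0.5, 0.5],
              [0.1, 0.9]])

def update_belief(belief: np.ndarray, token: int) -> np.ndarray:
    """Push the belief through the dynamics, then condition on the observed token."""
    predicted = belief @ T
    unnorm = predicted * E[:, token]
    return unnorm / unnorm.sum()

belief = np.ones(3) / 3                       # uniform prior over hidden states
for tok in [0, 0, 1, 1, 1]:
    belief = update_belief(belief, tok)
    print(belief, "next-token dist:", belief @ T @ E)
```

The belief vector generally carries more information than the next-token distribution computed from it, which is the sense in which the comment above says the residual stream reflects structure beyond local next-token prediction.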
@danielyang2884 · a month ago
Thanks for the reply. I want to preface this by saying this area of work is not in my field. However, I still don’t understand how it’s possible that a model that predicts next token probabilities can do so on unseen data without learning some concept of belief states.
@heavenrvne888 · a month ago
nice reading!
@Frank-qg4ik · a month ago
Computational. Sounds exactly like the essence of the stochastic parrot argument...
@MDNQ-ud1ty · 25 days ago
You are understanding. The "algorithm" does not understand anything. Addition does not "understand". You have it completely backwards.
When someone thinks they "understand" something, it is a feeling. It is a feeling one gets when they can properly apply the algorithm/formula to get the right/expected answers. It is a feeling of accomplishment. That feeling of accomplishment lets one know they are "doing it right". Understanding is nothing but feeling combined with awareness of what was to be understood.
Algorithms do not understand anything. Algorithms are just processes that are created to get the specific result the person who creates them wants. They are simply functions. You are anthropomorphizing algorithms. They have no feeling. Only when AI becomes a very complex system on the order of a biological creature's complexity will it start to feel anything. Basically, only when they are punished for getting it wrong could they possibly start to feel. One might say that rewarding and punishing an algorithm creates such a feeling, but if it does, it is "10^10^10 orders smaller" than what happens at the human level. Only when there are real consequences to one's existence, and they comprehend what that means, can understanding mean anything.
AI is not even close to understanding anything. It is humans that understand and then create/discover the algorithms that get the results they want to do the things they do. It may all be tied up together, but an algorithm, formula, or process does not feel anything in and of itself, or if it does, then so does a rock, a meteor, a raindrop, etc. In the latter case, the "feeling" of such things is so minuscule compared to humans as to be essentially zero.
@Tunadorable · 20 days ago
I delved deeply into the definition of "understanding" and even dedicated a large portion of the video to responding to your comment kzfaq.info/get/bejne/qJmah8hnpq3Gj3k.html
@coralexbadea · a month ago
lol stop getting triggered by the paper 😂😂
@KALLAN8 · a month ago
Your five-star rating only has 4 stars in it... ⭐⭐⭐⭐
@fontenbleau · a month ago
Prof Simon Holland revealed a research paper from 2021 claiming the SETI project found alien activity far away; that's more interesting and the next hype topic for many years after AI.
@mrd6869 · a month ago
All good and fine, but only 7% of companies have adopted AI 😅... We need to translate the mathematical goony goo goo into practical use cases. How can this increase my company's bottom line?