
LLM - Reasoning SOLVED (new research)

16,903 views

code_your_own_AI

1 day ago

Grokking transformers: a technique for infusing transformers with near-perfect causal reasoning abilities. (Note: grokking has nothing to do with Musk's AI Grok or with Groq Inc.'s fast-inference hardware.)
Grokking achieves this by enabling transformers to identify hierarchical structures within human sentences. Through extended training, the internal structure of the transformer undergoes a fundamental shift, allowing the formation of specific neural pathways called "generalizing circuits." These circuits are instrumental in efficiently encoding and retrieving knowledge for reasoning tasks. To create grokked transformers, several key elements are needed.
First, extensive training is essential, particularly for complex reasoning tasks that require structured knowledge. Second, the transformer architecture must have an optimal depth, balancing computational efficiency with reasoning performance. Third, a perfectly designed training dataset is crucial. This dataset should incorporate atomic facts and inferred facts, mimicking a formal system of axioms and theorems. Testing grokked transformers involves using out-of-distribution examples, which significantly differ from the training data. This helps assess the transformer's generalization capabilities.
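To make the dataset recipe concrete, here is a minimal sketch of such a training set in Python. It is an illustration only: the entity/relation counts and the 95/5 split are made-up parameters, and a simple random hold-out like this tests in-distribution generalization (the paper's out-of-distribution split additionally holds out the underlying atomic facts).

```python
import random

random.seed(0)
entities = [f"e{i}" for i in range(100)]     # illustrative sizes, not the paper's
relations = [f"r{i}" for i in range(20)]

# Atomic facts ("axioms"): each (head entity, relation) maps to one tail entity.
atomic = {(h, r): random.choice(entities) for h in entities for r in relations}

# Inferred facts ("theorems"): two-hop compositions r2(r1(h)),
# each derivable from exactly two atomic facts.
inferred = [((h, r1, r2), atomic[(atomic[(h, r1)], r2)])
            for h in entities for r1 in relations for r2 in relations]

random.shuffle(inferred)
cut = int(0.95 * len(inferred))              # a high inferred:atomic ratio
train_inferred, held_out = inferred[:cut], inferred[cut:]
print(len(atomic), "atomic facts,", len(train_inferred),
      "inferred facts for training,", len(held_out), "held out for testing")
```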
Two tasks where grokked transformers excel are composition, where they outperform traditional methods that rely on external knowledge, and comparison, where they reason about similarities or differences between entities. The ratio of inferred to atomic data, the number of layers in the transformer, and the distribution of data within the training set all influence the grokking performance.
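The "extended training" part of the recipe is easiest to see on the classic modular-arithmetic task from the original grokking papers, which is small enough to run anywhere. The sketch below is that toy, not this paper's two-hop setup; the hyperparameters (strong weight decay, roughly 100k full-batch steps) are typical grokking settings, not values taken from the video.

```python
import torch

p = 97
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # all (a, b)
labels = (pairs[:, 0] + pairs[:, 1]) % p                        # target: a+b mod p
perm = torch.randperm(len(pairs))
train_idx, test_idx = perm[: len(pairs) // 2], perm[len(pairs) // 2:]

emb = torch.nn.Embedding(p, 128)
mlp = torch.nn.Sequential(torch.nn.Linear(256, 512), torch.nn.ReLU(),
                          torch.nn.Linear(512, p))
opt = torch.optim.AdamW(list(emb.parameters()) + list(mlp.parameters()),
                        lr=1e-3, weight_decay=1.0)   # strong regularization

def forward(idx):
    x = emb(pairs[idx])          # (n, 2, 128): embed both operands
    return mlp(x.flatten(1))     # concatenate and classify the sum mod p

for step in range(100_000):      # train far past the point of memorization
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(forward(train_idx), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            acc = (forward(test_idx).argmax(-1) == labels[test_idx]).float().mean()
        # Typical grokking curve: train loss hits ~0 early; test accuracy sits
        # near chance for a long plateau, then rises sharply much later.
        print(step, round(loss.item(), 4), round(acc.item(), 3))
```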
To understand how grokked transformers work, we can leverage techniques like the logit lens, which analyzes internal activations to pinpoint which parts are involved in specific reasoning tasks, and causal tracing, which maps causal pathways through the transformer's layers. In conclusion, grokked transformers represent a promising approach to achieving near-perfect causal reasoning in large language models.
By meticulously designing training data, optimizing the architecture, and employing techniques like the logit lens and causal tracing, we can unlock the potential of grokked transformers to tackle a wide range of reasoning challenges.
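For readers who want to try the logit lens themselves, here is a minimal sketch using an off-the-shelf GPT-2 via Hugging Face transformers. It shows the generic technique (projecting each layer's hidden state through the final layer norm and unembedding matrix); the paper applies the same idea to its own grokked transformers, not to GPT-2.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds one tensor per layer (plus the embedding layer),
# each of shape (batch, sequence, hidden). The "lens" decodes the last position.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d} -> {tok.decode(logits.argmax().item())!r}")
```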
All rights w/ authors:
Grokked Transformers are Implicit Reasoners:
A Mechanistic Journey to the Edge of Generalization
arxiv.org/pdf/...
#airesearch
#ainews

Comments: 48
@mulderbm 2 months ago
Indeed very good. It was under our nose all the time. Not sure why this research is only now being picked up; the first papers on grokking are from 2021, 2022 and partially earlier. Bringing this all together is very insightful. These series make me want to set it up and play with it 😂😂
@luke.perkin.inventor 2 months ago
The causal tracing highlights how similar NNs are to applying input-sensitive matrix multiplication. In the case of ReLUs they're zero or linear, so it's like a hierarchical bunch of switches that turn on just the right linear transform on the input to get the output. The fact that this works (effective, trainable, interpolates and generalises) still amazes me!
@alexjensen990 2 months ago
Completely blown away by the test with the "old model"...
@luke.perkin.inventor 2 months ago
The atomic facts on the graph at the 95%/5% split remind me of the approach in reinforcement learning for physics models where you start with, for example, low gravity and high friction to dampen the system, then slowly increase/reduce each to bring it closer to reality. It makes unlearned high-frequency chaotic (deterministic) systems learnable.
@mlytle0 2 months ago
Amazing stuff. We heard a few months ago about Q* and supposed advances in math ability at OpenAI on unreleased models, nothing of which has appeared in the public domain. This seems like a real advance and it is publicly accessible. Part of me thinks OpenAI puts a lot of hype out there to keep interest up, but their model still hallucinates like crazy; nothing as solid as this appears to be.
@notaras1985 2 months ago
How can we reduce hallucination?
@MultiNiktar 2 months ago
This is a crazy good video, keep it up! The algorithm will pick this channel up in no time.
@code4AI 2 months ago
Smile. I always decline when Google wants me to pay for advertising my own video to a broader audience, so I am not at all a good customer for Google: I do not support a business model where I pay to promote my own video. Therefore I'll remain a stealth YT channel for a dedicated audience only.
@user-tm5nm9dp7l 2 months ago
Great video. If possible, make a lesson with Python code. It would help us understand better how this works. This science is a deep ocean.
@xenophobe3691 2 months ago
Reminds me of the Ten Thousand Hours rule for mastery of a subject
@user-mr5qi5yf6j 2 months ago
For me too
@TiagoTiagoT 2 months ago
How about this: first train a model for grokking on a pure-logic dataset of randomly generated examples (which should be easy to verify as correct), not language, just the letter variables and the weird symbols for logic gates/operators and so on. Once it groks that, move on to the next barebones level of mathematics, then climb the math ladder at each grokking; at some point start including coding, physics, chemistry etc., and leave natural language for the end of the training ladder, ensuring the dataset at every step follows the ideal ratio. Will we get an ASI that runs on a RasPi with an approach like this?
@obsidianSt6761 2 months ago
You are talking about curriculum learning, which has been around for many decades. The limitation is that different architectures require different curricula (the one you've proposed seems to work for human learning, but does it work for an arbitrary neural architecture? It is expensive to test many architectures!)
@TiagoTiagoT 2 months ago
@obsidianSt6761 Combining the ratios idea with gradually building foundations for rational circuits, from the most basic concepts up to more and more complex thinking, sounds like a good recipe for achieving high-level rational thought processes and understanding in the type of neural networks discussed in this video, no?
@obsidianSt6761 2 months ago
@TiagoTiagoT But what is the architecture of the transformer? Does it have 8 layers, or 20? What are its hidden activation, hidden size, feed-forward size, dropout rate, etc.? This video shows that you need a whole research effort just to test whether different architectures grok; you are proposing not only testing different architectures but also running an extensive curriculum for each architecture.
@TiagoTiagoT 2 months ago
@@obsidianSt6761 Ah, I see. I got the impression there was already a good starting point to pick an architecture that would grok with just about anything it was trained with...
@815TypeSirius 2 months ago
The most ideal data is 49.99...% noise (49.9 repeating, which equals 50%) and 50% signal.
@LamontCranston-qh2rv 2 months ago
If these structures can be detected, surely they can be predicted? Can we build a model that looks at a dataset and outputs a good guess at what the weights of a grokked model would be? If so, maybe we can radically reduce the amount of computation required to achieve grokking? Perhaps even predict optimal cross-layer memory sharing? I wonder if this might require spatial reasoning. Specifically, a kind of self-reflective "imagining" of the model's black-box architecture, as well as of possible and desirable structures within it?
@obsidianSt6761 2 months ago
Detectability assumes specific instances of the dataset, architecture, algorithm, and a confirmed grokked subject model. To produce the kind of hypervisor prediction model you describe, you would have to train that model over many datasets, architectures, and algorithms, while also training each subject architecture until it grokked to obtain the ground-truth labels (this simply demands tremendously more computational resources than it may be worth)...
@LamontCranston-qh2rv 2 months ago
@obsidianSt6761 Fair enough. It's like trying to predict where the needle in the haystack might be. Why waste time and resources? Why not just go look for it? Still, I can't help but think that over time a kind of library might emerge which essentially says: these kinds of structures tend to form in these kinds of models when confronted with this type of data. It may be a worthwhile starting point as opposed to the brute-force, train-to-death approach. Or, as you say, it could be another blind alley. Maybe the answer lies in the middle: trust your guess... but verify and abandon as needed? It is certainly true that martial-arts masters, for example, don't typically take shortcuts to decades of training... but what if they could? It would amount to learning how best to learn (a dynamic approach). With this view into the black box, the professor has inspired an entirely new field of endeavor: artificial neuroscience. Necessary, perhaps, if we are to have any hope of knowing how or why this stuff runs off the rails, and how to (hopefully) fix it! Thank you very much for your exceptional reply, all the best to you!
@815TypeSirius 2 months ago
No, it's not reciprocal. But things don't get interesting until they start organizing using hypergeometry. How do you think a brain is so efficient while a CPU is comically inefficient?
@LamontCranston-qh2rv 2 months ago
One answer is that the brain uses analog circuitry while LLMs (currently) run on digital circuits. Additionally, DNA itself can exhibit quantum-tunneling effects in seemingly "intelligent" processes that are not yet well understood. If you are suggesting that human neurons process information in high-dimensional space... perhaps. How interesting!
@815TypeSirius 2 months ago
@LamontCranston-qh2rv Oh, it's a "the brain is quantum" loon.
@goodtothinkwith 2 months ago
Really incredible stuff
@alexjensen990 2 months ago
Can't wait for the comparison!!!
@code4AI 2 months ago
A prominent feature in Part III.
@lukeskywalker7029 2 months ago
This all sounds too good to be true. However, the atomic/inferred knowledge split is something I have had a gut feeling about for a long time. Can't wait to replicate this on some easy tasks with continued pre-training.
@manslaughterinc.9135 2 months ago
Why do we have to exclude RAG from grokked LLMs? There is literally no reason why we can't do RAG with a grokked LLM.
@frag_it 2 months ago
Yeah, I don't see RAG going away; a grokked LLM might even provide more reasoning over the retrieved context 😅
@code4AI 2 months ago
Great comment! Maybe I'll design an answer in an upcoming video!
@timgorn8927 2 months ago
Thank you very much! I loved this presentation.
@code4AI 2 months ago
Thank you for taking the time to send this feedback to me. Appreciate it.
@notaras1985 2 months ago
What should I do in order to build an AI helper model for my pharmacology lab?
@Daniel-Six 2 months ago
Anyone who has read the Law of One transmissions might recognize the principle of "intelligent infinity" operating here.
@HUEHUEUHEPony 2 months ago
Uhm seek a doctor?
@Daniel-Six 2 months ago
​@@HUEHUEUHEPony Are you familiar with the Law of One?
@acasualviewer5861 2 months ago
What do they mean by sharing the information between the upper and lower layers? It's not clear to me how that is implemented. And that's kind of the key here.
@code4AI 2 months ago
I am referring to the architecture of a transformer.
@acasualviewer5861 2 months ago
@@code4AI yes.. but what kind of "sharing" do you mean? Just the normal mechanism of passing info to the next layer?
@RalphDratman 2 months ago
I think you have referred to the wrong paper at the bottom of your YouTube summary. You mention a "metric", "structural grokking" and "tree structuredness". I cannot find the words "metric", "structural" or "tree" in the paper "Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization" (arXiv 2405.15071), but all three of those terms are easy to find in "Grokking of Hierarchical Structure in Vanilla Transformers" (arXiv 2305.18741).
@code4AI 2 months ago
No, you are wrong ... but your comment provides a beautiful example of the inner workings of a vector store. Regarding the terms you were looking for: at 2:16 to 2:32 in my video I introduce the new study by MIT and Stanford University; I present the title of the pre-print, its authors, and its https link, and one (!) second later (at 2:33) I introduce the term "tree structuredness" from that study. You (@RalphDratman) now comment that you can't find the words, having looked in another pre-print that I mention in the video. A perfect example of the semantic and causal relations encoded in "close-by" representations within a low-dimensional vector space. Whenever you don't find terms in a linear video sequence of mine, there is a high probability that literally one second before the term in question appears, the complete information on where to find it was given to you, including the title, the authors and the https link of the pre-print. Imagine a cosine-similarity function that returns the term and the identifier of the pre-print in question directly to you. Thank you for this comment.
@RalphDratman 2 months ago
@code4AI 1) I was trying to be helpful. 2) The reason I did not see the paper on the screen is that I was listening rather than watching.
@spkgyk 2 months ago
Sorry if you covered this in another video, but what's the difference between parametric and non-parametric memory?
@code4AI 2 months ago
I'll explain it in detail in my next video. Thanks for pointing it out.
@generichuman_ 2 months ago
He covered it in this video. Parametric memory is contained in the weights of the model, and non-parametric memory is contextual memory that you put into the prompt or retrieve with RAG (which still technically goes into the prompt).
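For illustration, a minimal sketch of that distinction (the retriever here is a stub standing in for a real vector store; all names and the example fact are hypothetical, not from the video or the paper):

```python
def search_index(question: str, k: int = 3) -> list[str]:
    # Stub retriever: a real RAG system would do embedding similarity search.
    corpus = ["The Eiffel Tower was completed in 1889."]
    return corpus[:k]

question = "When was the Eiffel Tower completed?"

# Parametric memory: the bare question; any answer must come from the weights.
parametric_prompt = question

# Non-parametric memory (RAG): retrieved text is placed into the prompt, and
# the model reads the answer out of the provided context.
context = "\n".join(search_index(question))
rag_prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(rag_prompt)
```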
@GerardSans 2 months ago
Forcing features onto the existing transformer architecture is a foolish idea when you can change its design to perfectly accommodate whatever features you need and fix all the known shortcomings.
@farrael004 2 months ago
Alright, Einstein. What does the architecture that solves all of the transformer's shortcomings look like?