TransformerFAM: Feedback attention is working memory

32,361 views

Yannic Kilcher

19 days ago

Paper: arxiv.org/abs/2404.09173
Abstract:
While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
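To make the abstract's "feedback loop" concrete, here is a minimal PyTorch sketch of block-wise attention with a carried feedback memory. It is an illustration of the idea only, not the authors' implementation: the block size, memory length, single-head attention, and omission of causal masking are all simplifying assumptions.

```python
import torch

class FeedbackBlockAttention(torch.nn.Module):
    """Toy block-wise self-attention with a carried feedback memory ("FAM") state.

    Each block of tokens attends to itself plus the current memory; the memory is
    then updated by attending to the block it just saw (the feedback loop), so
    information can persist across arbitrarily many blocks at per-block cost.
    Sketch only: single head, no causal mask, no layer norm or FFN.
    """

    def __init__(self, d_model: int, block_size: int, mem_len: int):
        super().__init__()
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.block_size = block_size
        # Learned initial memory; the paper adds no new weights, so this is a simplification.
        self.init_mem = torch.nn.Parameter(torch.zeros(mem_len, d_model))

    def _attend(self, queries_from, keys_values_from):
        q, _, _ = self.qkv(queries_from).chunk(3, dim=-1)
        _, k, v = self.qkv(keys_values_from).chunk(3, dim=-1)
        scores = torch.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)
        return scores @ v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, d_model); batch dimension omitted for brevity.
        mem = self.init_mem
        outputs = []
        for block in x.split(self.block_size, dim=0):
            context = torch.cat([mem, block], dim=0)
            outputs.append(self._attend(block, context))  # tokens see memory + own block
            mem = self._attend(mem, context)              # memory attends back -> next state
        return torch.cat(outputs, dim=0)

# Example: 128 tokens processed as 8 blocks of 16 with an 8-slot memory.
layer = FeedbackBlockAttention(d_model=64, block_size=16, mem_len=8)
out = layer(torch.randn(128, 64))  # shape (128, 64)
```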
Authors: Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 135
@dongseonghwang7870 16 days ago
I'm the author :) As a self-taught ML practitioner, when I started studying deep learning, Yannic was my favorite teacher. I've learned so much from you. I'm so honored that Yannic featured my paper today. After watching your review, I made some changes to the paper and re-uploaded it to arXiv. I'd like to add some supplementary explanations to Yannic's comments:

1. Feedback loop vs. recurrence. As Yannic mentioned, it's essentially incorporating recurrent features into the Transformer architecture. RNNs were the first to implement a feedback loop in a sequence-by-sequence manner, while the paper implements feedback using attention. The paper's emphasis on feedback in RNNs refers more to the Markov-process or autoregressive aspect. The paper mentions that LSTMs and GRUs successfully implemented feedback loops within the RNN framework.

2. Stop gradient and gradient checkpointing. Sorry for the bad English; I rewrote that part. The idea is that previous research suggested using stop gradient to improve computational performance. However, with gradient checkpointing, all computations need to be redone anyway. So, whether or not you use stop gradient, there's no noticeable impact on speed or memory usage. This is why I recommend eliminating the use of stop gradient.

3. 18:00 simpler algorithm. Thank you, Yannic, for the compliment on Appendix C, "Don't". In C.3, I addressed the simpler algorithm that Yannic suggested. Actually, this was the very first method I tried. Unfortunately, this algorithm failed to remember PassKey for long sequences, and the algorithm became a bit more convoluted, as Yannic mentioned.
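As a small illustration of point 2 above (a sketch assuming PyTorch; `block_fn` is a hypothetical stand-in for one TransformerFAM block): under gradient checkpointing the block's activations are discarded and recomputed during the backward pass regardless, which is why detaching the carried memory buys essentially nothing in compute or activation memory.

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_block_checkpointed(block_fn, x, mem, stop_gradient: bool):
    """Run one (hypothetical) block_fn(x, mem) -> (y, new_mem) under checkpointing.

    With checkpointing, intermediate activations inside block_fn are not stored;
    they are recomputed in the backward pass whether or not `mem` is detached,
    so the stop-gradient switch does not change the block's speed or activation
    memory - it only decides whether gradients flow into earlier blocks.
    """
    if stop_gradient:
        mem = mem.detach()  # cut backprop-through-time at the block boundary
    return checkpoint(block_fn, x, mem, use_reentrant=False)
```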
@frankyvincent366 16 days ago
Congrats on your work. Feel free to keep in touch if you need some kind of independent reviewing / advice on your strategy.
@arnebinder1406 15 days ago
Are you aware of ERNIE-DOC: A Retrospective Long-Document Modeling Transformer (Ding et al., 2021)? They also introduce recurrence by passing the state to the next step, but one layer below. Found it via lucidrains' x-transformers library :)
@omarnomad 15 days ago
This is science at its best
@roupen66 15 days ago
Having studied neuroscience in undergrad and now being an ML practitioner, I really appreciate how your team tried to relate the architecture of your transformer to neuro! There should be a huge push for this, given that we have seen such huge success with even simple adoption of neurons and neural networks.
@dongseonghwang7870 15 days ago
@@arnebinder1406 I didn't know about ERNIE-DOC, but the idea is essentially the same as "C.2. Feedback Memory Segment" in the "Appendix C. Don't" section. Thank you for letting me know; I'll cite the paper. BTW, this architecture failed to remember PassKey. In my opinion, an individual token gets confused trying to carry both local and global info together.
@Aziqfajar 17 days ago
At this point, Yannic is in desperate need of somebody, someone who doesn't reinvent but instead creates a new invention. Here, a pat on the back for your contribution of sharing academic papers with us.
@kacperkubicki1101 17 days ago
I think in the next two to three years, the moment will finally come, when someone reinvents attention and calls it something like "self-aware in-context persistent memory", sprinkling it with some neuroscience mumbo-jumbo gibberish.
@adama7752 17 days ago
Persistently Engaging Neurological Initiatives Stimulation
@syncrossus 17 days ago
I think your timeline is optimistic. I give it at least 5 years.
@marshallmcluhan33 16 days ago
Mamba #5
@Wobbothe3rd 16 days ago
No, you're wrong. No one moment will ever come where an AI will suddenly be discretely self-aware. Self-awareness is one of those vague concepts that either has no concrete meaning, or to the extent it can be well defined it will be a matter of degree (i.e. "model A is more self-aware than model B"). Your comment is ironic in that you attack an actual scientific paper's terminology as "mumbo jumbo" - I would argue the phrase "self-aware" ITSELF is mumbo jumbo!
16 days ago
@@marshallmcluhan33 selective scan structured state space sequence model, AKA linear RNN
@omarnomad 17 days ago
All paths lead to Schmidhuber
@BerntGranbacke 16 days ago
Thank you Yannic for making these papers understandable, and breaking them down with your insight and understanding. Very interesting. 🙏
@blocksofwater4758 16 days ago
RNN: "You cannot live with your own failure. Where did that lead you? Right back to me."
@michalchik 17 days ago
So I think you make a pretty convincing argument that this is a repackaged form of recurrent neural network. And yes, you're right that a long time ago people were using these as neurally inspired architectures that weren't anywhere near as successful as transformers. Now what I'm wondering is whether they failed because they didn't have the transformer architecture underneath them, which was more similar to organized long-term memory and learning. Maybe recurrent neural networks are almost useless by themselves, but built on top of transformers they provide the powerful equivalent of what we in neuroscience call working memory, and the combined architecture, combined in this way, can take things to the next level.

I know that performance metrics can be gamed and can be very misleading, but ultimately it doesn't matter if we're doing something similar to what we did in the past; if this particular arrangement leads to significant performance gains in certain kinds of tasks, it still might be valuable even if it's recycled. A lot of technological progress occurs with the repurposing of old inventions in a new context.

I apologize if this is a very off-base comment. I'm just really getting to learn this stuff and there are a lot of holes in my background because I'm coming at this from more of a neuroscience perspective. I can say that interrupting the corticothalamic loop produces humans that might know a lot of stuff and even give appropriate responses, but it leads to knowledgeable people that are just reactive entities, can't get anything done, and lose track of where they are all the time. Those kinds of problems seem present in current systems like ChatGPT 4.0 and Claude Opus, which is what I have the most experience with.
@edu833 16 days ago
@drdca8263 17 days ago
So if the paper framed itself as, “because of neuroscience reasons, the way to improve transformers should be to combine them with an RNN (specifically, in a way like this)”, specifically describing it as a form of, or variation on, RNNs, would that have alleviated most of your issues with it?
@zyxwvutsrqponmlkh 17 days ago
I want models that can modify or add weights on the fly. I want them to have better long-term memory without having to go back and read over the RAG results and hope it got what I want.
@chispun2 17 days ago
I want a unicorn 🦄
@zyxwvutsrqponmlkh 17 days ago
@@chispun2 But only if it has wings.
@drdca8263 17 days ago
@@zyxwvutsrqponmlkh I think that's called a Pegasus?
@jsalsman 17 days ago
LoRA is kind of like that, but interfaces to update it are in their infancy because researchers are still sorting out the right way to do it.
@zyxwvutsrqponmlkh 17 days ago
@@drdca8263 A Pegasus doesn't have a horn.
@DanielCardenas1 16 days ago
Your style is very entertaining. I laughed at the "feature, not a bug" section, ~34 minutes into the video.
@mriz 17 days ago
I feel heard by you! Thank you!
@joseponce9567 14 days ago
Wow, great paper review; you highlight the key mechanisms of the net so well. Great job in scientific dissemination.
@KevinHorecka 15 days ago
It's interesting how, for working memory, they describe layer-wise interaction this way (at least according to these authors). In flexible, hippocampally dependent memory we see multiple recurrences across many system layers, as well as layer-wise specialization in things like Pattern Separation (DG/CA3), Pattern Completion (CA3), and Relational Binding (CA1). We know the hippocampus is critical in humans for creative, flexible reasoning. I feel like it's the brain region and memory system we're missing, not PFC-mediated working memory...
@AndreAmorim-AA 16 days ago
The biological point about single neurons / small groups of neurons in the loop diagram at 2:17 reminded me of how we remember the past and imagine the future, as well as the sci-fi movie 'Inception' (2010).
@lone0017 14 days ago
Funny that I thought of the same idea right after watching your last review on Infini-attention lol
@seraphiusNoctis 16 days ago
Isn’t reprompting with a model’s own output just this, but at a higher level? (it’s effectively concatenating a state to a transformer, that came from a transformer…)
@oncedidactic 16 days ago
This is what I'm pickling on too… Next-token generation would seem to supersede and generalize hidden-state juggling. The working memory is the "let's think step by step".
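A rough sketch of that higher-level loop, assuming only a generic `generate(prompt) -> str` call (a hypothetical stand-in for any LLM API): the model's own summary is the carried state that gets concatenated back into every prompt, which is the "working memory by reprompting" idea discussed in this thread.

```python
def chat_with_rolling_memory(generate, user_turns):
    """Use the model's own output as an explicit, reprompted 'working memory'.

    `generate` is a placeholder for any text-completion function; `memory` is
    the state carried between turns, rewritten by the model itself each time.
    """
    memory = ""
    for turn in user_turns:
        reply = generate(
            f"Memory so far:\n{memory}\n\nUser: {turn}\nAssistant:"
        )
        # Feedback step: compress everything relevant into the next memory state.
        memory = generate(
            f"Memory so far:\n{memory}\n\nNew exchange:\n"
            f"User: {turn}\nAssistant: {reply}\n\n"
            "Rewrite the memory, keeping only what future turns will need:"
        )
        yield reply
```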
@ScottVanKirk 17 days ago
Hey Dr. Kilcher, you should put your reputation where your snark is! 😮😂 Show us how you would architect your recurrent neural network to act as short-term memory. Snark aside, I really would like to see how that might work. Would you just prepend an RNN to a transformer?
@JoeTaber 16 days ago
Transformers use masking during training and are trained to predict the next token given previous tokens. I wonder if the memory mechanism / RNN node should also be trained to do the opposite: mask off the prefix and, given the current tokens and the previous memory, predict the previous tokens.
@bernardoramos9409 16 days ago
The problem with the simple algorithm you proposed is that when the generation reaches the block (or context) length in the middle of a sentence, it is not so straightforward to generate the next token without the previous ones while maintaining the sentence structure.
@MonkeySimius 17 days ago
Thanks for the explanation. Total aside... I downloaded Llama 3 and set the context window to 8k and it worked fine. I boosted it up to 9k and got pure gibberish. That was the first time using too big of a context size had that happen. The closest I'd had before, when I had an error for using too big of a context size, was my program simply crashing as I ran out of memory. Happily, fine-tuned(?) models have come out with longer context lengths. But I found it interesting. I obviously don't understand what's going on under the hood well enough to fully grasp why, exactly, that happened... but videos like this give me a better foundation to make guesses.
@longinjanlatecki4025 16 days ago
This framework reminds me of a [CLS] token added at the end of each block. As shown in the following paper, adding a few [CLS]-like tokens, called registers, at the beginning improves performance: "Vision Transformers Need Registers", ICLR 2024.
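For context, a minimal sketch of the register idea from that paper (Darcet et al., ICLR 2024), assuming a standard PyTorch encoder layer; the number of registers and the model dimensions are illustrative choices, not the paper's settings.

```python
import torch

class PatchEncoderWithRegisters(torch.nn.Module):
    """Prepend a few learned 'register' tokens that participate in attention
    with the patch tokens but are dropped at the output. Sketch only."""

    def __init__(self, d_model: int = 256, n_registers: int = 4):
        super().__init__()
        self.n_registers = n_registers
        self.registers = torch.nn.Parameter(torch.zeros(1, n_registers, d_model))
        self.encoder = torch.nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True
        )

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, n_patches, d_model)
        batch = patch_tokens.shape[0]
        regs = self.registers.expand(batch, -1, -1)   # same registers for every image
        x = torch.cat([regs, patch_tokens], dim=1)    # registers + patches attend together
        x = self.encoder(x)
        return x[:, self.n_registers:]                # discard registers at the output
```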
@-mwolf 17 days ago
What you are drawing at 17:00, isn't that exactly what "Vision Transformers Need Registers" (DINOv2) proposed?
@Balorng 16 days ago
Well, I do think all those new papers have one thing in common: create "a hierarchy of memory" with increasing granularity and compute from less recent to most recent data. What's missing, I think, is not just going with a hierarchy from linear to quadratic, but from linear to quadratic to *cubic*.
@johanlarsson9805 16 days ago
Well, I can instantly say they are onto something, but I'm not sure it's the correct way. I've been experimenting with ANNs for 15 years. For the last 5 I've been working on a structure where the signal never dies... I want the "loop", the amalgamation of the last states, to pulse through the net, meaning different things to different areas, preserving the "state" and working memory until the next input comes, so that the input is affected by the current "mood" of the net. Still struggling with this approach, but it is what is needed for real AI. In us, the signal never dies; it needs to be there to propagate more signal. I was blown away when people solved my difficult task in a simple way with LLMs: "just have many agents interacting with each other". Yeah, together they then achieve what I want my single net to do, the signal never dying, but I want it in a single net.
@MarkDStrachan 14 days ago
I've been contemplating similar ideas--my intuition is that the big limitation of the transformer design is the linear request/response structure. What's needed is a physical loop of layers with an input structure and an output structure woven into the loop, creating a signal of updates continually propagating around the loop. You want to establish a standing-wave state around the loop that can be perturbed by the inputs and monitored by the outputs. The signal perturbation would be the communication--i.e. when humans talk we're modulating frequency, so send a signal in, let it integrate into the feedback loop, and monitor the perturbations at the exit. Toss out the single forward-propagation pass and replace it with these self-oscillating feedback loops on a large scale--the entire structure a loop of layers, which would need an inhibitory function to prevent runaway feedback. Imagine speaking into it, and hearing a voice coming out of it.
@johanlarsson9805 14 days ago
@@MarkDStrachan EXACTLY! That sounds like my thinking
@LunaProtege 15 days ago
I think I've actually thought about a similar idea of a handful of output tokens being paired with input tokens, and it sounds like that's what's happening here... And you say this is basically an RNN, but for this kind of transformer system? Alright, fair enough. I often also propose something to pair with it to give the most versatility: have a sort of data table akin to a notebook it can write to; some of its outputs are akin to coordinates on this table, some are "here's what to write to it", as well as one that simply determines whether or not to write, and another set of coordinates for what to read for the next loop of the neural network. Having this kind of "notepad", as well as a means to make a short-term memory by doing a direct loop from output straight to input, could allow it to better remember both long-term and short-term information at the same time. I imagine it's probably simple enough for all this to be implemented in a single system, especially in a system where this "RNN" functionality, as you've described it, is already implemented.
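A toy sketch of the notepad idea described in the comment above, assuming hard (non-differentiable) addressing for simplicity; a trainable version would need soft attention over cells, roughly as in Neural Turing Machines. All names and shapes here are illustrative.

```python
import torch

class NotepadMemory:
    """External 'notepad': a (rows, cols, d) grid the model can write to and read
    from via coordinates and a write gate emitted as part of its output."""

    def __init__(self, rows: int = 8, cols: int = 8, d: int = 64):
        self.table = torch.zeros(rows, cols, d)

    def step(self, write_rc, write_vec, write_gate, read_rc):
        # write_rc / read_rc: (row, col) integer coordinates chosen by the model
        # write_gate: scalar in [0, 1] deciding whether to write at all
        if write_gate > 0.5:                  # hard gate for the sketch
            self.table[write_rc] = write_vec  # overwrite the addressed cell
        return self.table[read_rc]            # fed back as input on the next loop

# Example: write a vector to cell (2, 3), then read it back on a later step.
pad = NotepadMemory()
_ = pad.step((2, 3), torch.randn(64), write_gate=1.0, read_rc=(0, 0))
readback = pad.step((0, 0), torch.zeros(64), write_gate=0.0, read_rc=(2, 3))
```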
@YasserAder 16 days ago
Can you do a video explaining how you analyze papers, how to find the limitations of papers, something like that? Critical analysis of computer science papers, if that's possible?
@d_b_ 16 days ago
The proposal at 17:00 seems natural; has it been done? It's just as parallelizable as a regular attention mechanism, isn't it?
@AM-yk5yd 16 days ago
It's closer to Block-Recurrent Transformers than Transformer-XL. Transformer-XL reused the output of the previous layer rather than its own, which makes sense for anything other than ALBERT, where each layer is the same. Also, the only reason XL stops the backprop is the publishing year; they simply had no resources to propagate that much. The BRT authors were like "oh yeah, we have VRAM now, so we let backprop through more of the context window" - they called it the "Slide" model. OK, after watching the whole video, it's closer to RMT, except they route memory to the same layer. I don't understand why they didn't go the RMT route and just compute the memory normally (appending it) without the second attention.
@fox_7765 16 days ago
This field bounces back and forth between pure engineering and the cognitive sciences: first they were inspired by parallel distributed processing and neuron-like units like the brain, then it was all about optimisation and infrastructure (cognitive science was irrelevant), and now they've realised they'll have to go back and get more inspiration from the cognitive/biological sciences to achieve AGI. IMO feedback loops were inevitable from the start. How much impact will this paper really have?
@dimitriognibene8945 10 days ago
Maybe the limited dimensions are a form of regularization?
@lexer_ 17 days ago
To some degree I kind of get why you harp on this point a lot, that this is just reinventing recurrent neural networks. And it's really quite strange that nobody actually talks about these basically just being RNN architectures. But on the other hand, these are at least novel in that they manage to combine the magic of transformer attention with the magic of RNNs in a way that supposedly actually works well. Does it really matter if the "new" component they are introducing has been invented before separately? The only real benefit of seeing this connection is that you might be able to transfer some of the RNN experience over to these new hybrid architectures. Or is it just the annoyance that it seems like they haven't properly studied the older ML literature and are kind of these transformer kiddies who claim to have invented stuff out of ignorance? I am not trying to start a fight here! I am just curious where this frustration comes from.
@btnt5209 17 days ago
Combining RNNs + attention is the precursor to the Attention Is All You Need paper (hence the "All You Need" part...). The commonality in all these papers seems to be "we claim good results on this particular dataset in this particular setting", but in reality it's very hard to reproduce the same good results in real-world settings, or even on other datasets or environments.
@axelmarora6743 16 days ago
@@btnt5209 I still don't understand why this is still an active area of research when state space models have solved the quadratic scaling problem (or so I thought). SSMs allow for optimal linear transfer of information between adjacent FFNs, which is what this paper tries to do.
@kayemni 16 days ago
Going back to attention + RNN is not bad in itself; if they improve upon things and show that attention isn't all you need after all (at least not for all use cases), then it's great. But not acknowledging the huge similarities between their approach and what already exists (and was the default for some time) is quite disingenuous and should be pointed out. Not only does it introduce redundancy into the research in the field, it also obfuscates the contribution to overall knowledge and the comparisons to existing approaches, and yes, it does also inflate their contribution. Just take a look at OpenReview and how harsh they are on papers that don't properly cite similar work and compare their contribution to it.
@esalexander5807 16 days ago
Research as a process is reliant on building from and comparing with the prior art - "the point" is to improve the existing understanding. By ignoring well-established concepts from the literature (intentionally or through ignorance) the novelty and merit of the presented work are unnecessarily hard to determine, and doing so requires effort from each reader that the authors could reasonably have been expected to do (once) themselves. If the paper had instead been presented as a neuro-focused take on combining RNNs with transformers, with examples illustrating how and when that is successful, there would be less redundant information and likely more valuable insights and/or comparisons.
@descai10 16 days ago
@@axelmarora6743 I'm wondering this myself as well. SSMs come out boasting 100x speed-ups for large models, and now it's just crickets, with everyone still using regular transformers. That being said, I did hear that they perform poorly at copying.
@john_blues 17 days ago
What's up FAM?
@TravellingTheWorldWideAndLarge 15 days ago
I think it is outrageous when journals don't require the submission of the code for the acceptance of the algorithm. What if, as part of the price of publication, the journal offered cloud computational resources for anyone who wants to test the algorithm?
@caimansaurus5564 17 days ago
Why does it bother you so much that these papers are basically reprising RNNs? I mean yes, that's what they're doing, but they're doing it in different ways (there are countless variations on the RNN itself, after all), so what's the problem? RNNs were always an intuitively good idea anyway, held back by vanishing gradients / info loss over thousands of tokens. All these papers, by "recurring" over big chunks rather than token by token, basically solve this. I think it's really exciting.
@nnnik3595 17 days ago
That is extremely slow though. RWKV is a better approach
@eliaweiss1 16 days ago
To answer your question: 1. They don't refer to RNNs; instead they blabber about neuroscience. 2. They don't compare to RNNs; instead they compare to block-wise attention, which clearly performs worse, and even then the improvement is minor. Like Yannic says, it is clear that they missed something crucial and they are blurring it with blabla.
@kayemni 16 days ago
The problem is that they don't acknowledge that their approach is basically an RNN. If they had presented it as an RNN + attention variant, it would have garnered less attention but would be more honest, and if they had properly compared to RNNs and shown even a slight increase in performance it would have been really good. The problem here is that they are obfuscating their contribution and how it relates to previous knowledge in the field, and don't even compare to RNNs... And don't get me started on the redundancy they are introducing, which is never good.
@andreaterlizzi 16 days ago
Also, they do all of this fancy talk about working memory in neuroscience, which is basically BS, since that isn't what working memory actually is, not even close. Real working memory is unbounded with respect to the input size, (theoretically) similar to a Turing machine or the RAM of a von Neumann machine; this kind of "working memory" in RNNs is linearly bounded with respect to the input size, which is much more similar to linear Turing machines and stack automata.
@AM-yk5yd 16 days ago
When Mamba, an RNN, comes with claims of outperforming transformers, I kinda like seeing benchmarks against a proper RNN.
@MasamuneX 16 days ago
The shaved head makes the LLMs work better; the power is building.
@clray123 14 days ago
But wait, wasn't shaving the head supposed to deprive one of power?
@user-uc2qy1ff2z 15 days ago
Okay, we get it. Transformers need some sort of latent representation to be able to think coherently about huge chunks of data. Okay, expected. But why are there four works which imply it but call it by different names?
@meselfobviouslyme6292 17 days ago
Thank you, Mr. Yannic, for your explanation of TransformerFAM: Feedback attention is working memory.
@syncrossus 17 days ago
I thought attention was all I needed lol
@egor.okhterov 15 days ago
Please start reviewing papers without backpropagation
@P1XeLIsNotALittleSquare 16 days ago
Just ask ChatGPT to write a summary of the conversation after each answer and call it a day lol
@nathan9771 17 days ago
woah
@juanjesusligero391 17 days ago
It's 23:24 where I live, please let me sleep XD
@jeremykothe2847 17 days ago
move!
@unvergebeneid 17 days ago
Same time zone here. How dare he! 😄
@xKreesherZ 17 days ago
lol another sleepless night here in Italy
@gregmattson2238 16 days ago
I get his frustration, but really, IMO the era of 'totally new' approaches is likely dead. We are now in the refinement stage, with low-level techniques being minor variants of each other and the innovations being tweaks on existing paradigms. This may change and there may be totally novel inventions down the line, but I'd be unsurprised if nothing new comes for years. What's important is the performance. If the performance is there, it is well worth publishing and reviewing.
@CharlesVanNoland 16 days ago
They're on the right track, except for the fact that it's all predicated on backprop / gradient descent / automatic differentiation and thus totally incapable of online learning! It can only work with what it has been trained on.
@SimonJackson13 17 days ago
Integral and differential terms?
@SimonJackson13 17 days ago
I said it kind of before. "The thing about an integral is the gradient is related."
@anishbhanushali 16 days ago
This is roast + tutorial... a Roastorial!!
@Neomadra 15 days ago
My suspicion for why so many people are reinventing RNNs is the lack of proper academic education and peer review. People are just learning about transformers and the basic math and believe they have seen it all. Just a hunch, but nowadays everyone can upload stuff to arXiv without any quality control.
@erongjoni3464 14 days ago
I'd be surprised if Google researchers weren't well aware of RNNs. I think it's more likely that a lot of people feel that SOME form of recurrence is going to be necessary for a model capable of system 2 thinking.
@johnnytshi 17 days ago
Let's put RNNs back
@erickmarin6147 17 days ago
I think there should be more work on identifying redundancy in the field, probably using AI itself
@erickmarin6147 17 days ago
Maybe an RLHF dataset filled in by academics only
@spencerfunk6697 1 day ago
We need to focus on KANs
@asnaeb2 17 days ago
No cap on a stack fr
@ryzikx 16 days ago
no hat on a tower
@scottmiller2591 16 days ago
Unrolling with backprop is a mistake - more local learning, more sums of geometric series.
@sharannagarajan4089 16 days ago
If it isn’t a big deal, why are you reviewing it?
@ArnaldurBjarnason 17 days ago
Another episode of Yannic explaining how a paper is just RNN 😆
@mennovanlavieren3885 17 days ago
I'm 7:48 minutes in and you've convinced me not to waste my time by continuing to watch. I get your point, but if you want to increase viewer count you need to sell the paper better. 🙃
@cajampa 17 days ago
Thanks for saving me the time. Seems like an "it's just another RNN paper" video, judging from the comments.
@kayemni 16 days ago
I'd rather he didn't and kept his content honest, especially considering the academic nature of the content; baiting people into watching useless content shouldn't be normalized but punished!
@ChaoticNeutralMatt 16 days ago
I don't think that's per se his goal to begin with? You don't have to watch.
@naninano8813 16 days ago
I am a strange cortical thalamic loop
@naninano8813 16 days ago
corthaloop
@dinogodor7210 14 days ago
Hello, I haven't finished the video yet, but I already want to add that it's unfair to complain about people using different terms for neural architectures when they use a known concept at a different place in the architecture. Compare it to chip design: if you design a minimal chip that is Turing complete, all other things you'd design around it just seem like distortions of it - yet a processor isn't just a minimal Turing machine, but has a lot of machines in it that are optimized in some way, like ALUs or FPUs etc., which you could use as fundamental building blocks to implement a Turing machine itself. My point is that what seems the same from a theoretical standpoint has very different consequences in effect, and that makes it necessary to give it another name from an engineering point of view.
@Sirmrmeowmeow 16 days ago
Could you please do a video on the Larimar paper by IBM? It also has a bit of brain lingo in it. I think it could be used to get important dense context across inferences in a way that is almost semi-stateful, instead of just 'knowledge updates'? 🧐 Because if the next inference could be informed by the important context in the memory unit, that might help with long-term planning and more stateful, coherent responses, better stitching the inferences together so important context survives across inference. Makes me wonder if they could have further RLHF'd it to use the stored context in the memory unit and refined its use appropriately, i.e. learning to maintain important information relative to tasks and current goals across inference as needed. :x arxiv pdf/2403.11901 arxiv abs/2403.11901
@transquantrademarkquantumf8894 16 days ago
Great show. Your truth and brutal honesty are incredibly refreshing. Your noting of origins justifies Ori. Once again you're giving credit for work, and participation is proper. One of your best shows of the year; even though the discoveries may not seem to be the greatest, I thank you very much for bringing illumination and putting it on the table. Sincerely, Michael Crosby, CRYPTEWORLD
@yakmage8085 15 days ago
He don’t care
@zerotwo7319 17 days ago
I like the attempt to suggest some biological inspiration, "working memory", but it could be that it's just not that. Would be nice to see something truly biologically inspired.
@mennovanlavieren3885 17 days ago
Look into the work of Jeff Hawkins (A Thousand Brains Theory). He's been interviewed by Lex Fridman a couple of times and has a research company, Numenta, which publishes a lot of their work.
@drdca8263 17 days ago
Has anyone gotten transformers to work with spiking neural networks?
@zerotwo7319 17 days ago
@@mennovanlavieren3885 There are a lot of pre-assumptions in 'the brain makes a model of the world' (it is just Platonism for the masses). I personally don't work with that theory, because our models could be wrong (for example, having magic or faith) and people would still be considered intelligent. It has to have something to do with motion, because the motor cortex is right beside the prefrontal cortex, and it has to do with more ancient parts of the brain... The cortex is overrated. If you look at other brains, they don't need such a large cortex and are still intelligent. (Also, most research in this area conveniently has something to do with old research or a math model they can improvise on or update, not something new.) Also, all that information is not only coded in the cortex; many other parts of the brain have super rigid functions, or double or triple functions. It is a mess of wires down there. This 'system of systems' is more like what intelligence would look like, not 'conveniently the cortex does everything' - it is not feasible that it has a model of an object for each column. Lex Fridman is just a podcast host; they shut up and let the interviewee talk whatever. I can't blame him, it is a peaceful life. TL;DR: it has to be something to do with the cerebellum - thalamus - motor cortex circuit first.
@bernardoramos9409 16 days ago
@@drdca8263 Yes, search for SpikeFormer. There is more than one implementation.
@CM-mo7mv 17 days ago
Wonder when they'll finally reinvent ART 🙄 (no, not artistic pictures)
@martinschulze5399 17 days ago
hehe. People have come up with DL architectures for that as well.