xLSTM: Extended Long Short-Term Memory

30,556 views

Yannic Kilcher

26 days ago

xLSTM is an architecture that combines the recurrency and constant memory requirement of LSTMs with the large-scale training of transformers and achieves impressive results.
Paper: arxiv.org/abs/2405.04517
Abstract:
In the 1990s, the constant error carousel and gating were introduced as the central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have stood the test of time and contributed to numerous deep learning success stories, in particular they constituted the first Large Language Models (LLMs). However, the advent of the Transformer technology with parallelizable self-attention at its core marked the dawn of a new era, outpacing LSTMs at scale. We now raise a simple question: How far do we get in language modeling when scaling LSTMs to billions of parameters, leveraging the latest techniques from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we introduce exponential gating with appropriate normalization and stabilization techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that is fully parallelizable with a matrix memory and a covariance update rule. Integrating these LSTM extensions into residual block backbones yields xLSTM blocks that are then residually stacked into xLSTM architectures. Exponential gating and modified memory structures boost xLSTM capabilities to perform favorably when compared to state-of-the-art Transformers and State Space Models, both in performance and scaling.
Authors: Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, Sepp Hochreiter
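As a rough illustration of the covariance update mentioned in point (ii) of the abstract, here is a minimal NumPy sketch of a matrix-memory step in the spirit of the mLSTM: the memory accumulates value-key outer products, a normalizer tracks accumulated keys, and retrieval is a matrix-vector product. Function and variable names are illustrative, the gates are plain scalars, and the paper's exponential gating and stabilization are omitted, so treat this as a sketch of the idea rather than the authors' implementation.

```python
import numpy as np

def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    """One step of a matrix-memory cell in the spirit of the mLSTM:
    C is the (d, d) matrix memory, n a normalizer state, q/k/v are
    query/key/value vectors, i_gate/f_gate are scalar gate values."""
    C = f_gate * C + i_gate * np.outer(v, k)    # covariance update rule
    n = f_gate * n + i_gate * k                 # track accumulated key mass
    h = C @ q / max(abs(n @ q), 1.0)            # normalized retrieval
    return C, n, h

# toy usage: store one key/value pair, then retrieve it with the same key
d = 8
rng = np.random.default_rng(0)
k, v = rng.normal(size=d), rng.normal(size=d)
C, n = np.zeros((d, d)), np.zeros(d)
C, n, h = mlstm_step(C, n, q=k, k=k, v=v, i_gate=1.0, f_gate=1.0)
print(np.allclose(h, v))   # True here, since k·k > 1 for this draw
```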
Links:
Homepage: ykilcher.com
Merch: ykilcher.com/merch
KZfaq: / yannickilcher
Twitter: / ykilcher
Discord: ykilcher.com/discord
LinkedIn: / ykilcher
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 96
@GraniLP 24 days ago
Funny to see my professors' names on the paper here. Feels odd, since I've known this channel from way before I started studying there.
@wurstelei1356 5 days ago
Thank god they had these techniques decades ago, so nothing is patented and hidden from the public.
@tantzer6113 23 days ago
Seems like the title of this paper could have been, perhaps provocatively, “LSTMs are all you need.”
@nicolasmichel5163 23 days ago
I feel that's not really the conclusion here. More like "Billions of parameters is all you need".
@JackSPk 23 days ago
"Matrices aren't circles" - Yannic Kilcher
@pawelkubik 5 days ago
I used to think of c and h as the memory capacitor and the hidden output. This was especially clear in word-tagging problems where we had to align our outputs with the input tokens: the h vector corresponded directly to one of the tag classes we were predicting, and c was used strictly as the memory (I thought the c simply stood for "capacitor" or "memory cell").
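For context on the c/h split this comment refers to, here is the textbook LSTM step as a minimal NumPy sketch (shapes and names are illustrative): c is the cell state that carries memory across time, while h is the gated output that downstream layers, e.g. a tagger, actually see.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One textbook LSTM step: c is the persistent memory cell,
    h is the gated output exposed to downstream layers."""
    z = W @ x + U @ h_prev + b                            # all four pre-activations stacked
    i, f, o, g = np.split(z, 4)                           # input, forget, output gates + candidate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)     # constant error carousel
    h = sigmoid(o) * np.tanh(c)                           # exposed hidden state
    return h, c

# toy shapes: 3-dim input, 2-dim hidden state
rng = np.random.default_rng(0)
W, U, b = rng.normal(size=(8, 3)), rng.normal(size=(8, 2)), np.zeros(8)
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, U, b)
print(h.shape, c.shape)   # (2,) (2,)
```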
@intrinsical 23 days ago
I mean, the term Language Model was coined in the 90s. Even N-Gram models were considered language models. We just didn't start prefixing Language Models with the word "Large" till the early 2000s. The claim that LSTMs were doing LLM in the 90s is an exaggeration, but also partially true.
@davidhauser7537 23 days ago
nice, thanks for covering this paper :)
@wolpumba4099 24 days ago
*Summary*

*What is xLSTM?* [0:00]
* xLSTM aims to push the boundaries of LSTM architectures by incorporating lessons learned from the world of LLMs and Transformers.
* It introduces two modified LSTM cells: sLSTM and mLSTM.
* xLSTM architectures are formed by residually stacking these modified LSTM blocks.

*Key Features:* [7:35]
* *Exponential Gating:* [31:02] Replaces the traditional sigmoid non-linearity in LSTM gates with an exponential function to address vanishing gradient issues.
* *Normalization and Stabilization Techniques:* [32:38] Introduces methods to handle the rapid growth of the exponential function and stabilize training.
* *Modified Memory Structures:*
  * *sLSTM:* [27:47] Utilizes a scalar memory, a scalar update, and "new" memory mixing (which leverages matrix properties for information routing between dimensions).
  * *mLSTM:* [36:24] Employs a matrix memory and a covariance update rule for associative memory. It's fully parallelizable in training, similar to Transformers.

*Advantages:*
* *Constant Memory Usage:* Unlike Transformers, xLSTM maintains a fixed memory footprint regardless of sequence length.
* *Competitive Performance:* Achieves results comparable to state-of-the-art Transformers and State Space Models on language modeling benchmarks.
* *Parallelizable Training (mLSTM):* The mLSTM variant removes the non-linear dependency on past time steps, enabling parallel training like Transformers.

*Limitations:* [54:30]
* *Large Constant Memory Requirement:* While memory usage is constant, the mLSTM's matrix memory can be large, leading to higher computational costs.
* *No Fast Parallel Training for sLSTM:* The sLSTM variant still involves recurrency, making fast parallel training challenging.
* *Further Optimization Needed:* The authors acknowledge the need for further architecture and hyperparameter optimization, especially for larger xLSTM models.

*Overall:* [55:54]
* xLSTM demonstrates the potential of enhanced LSTM architectures to compete with Transformers in language modeling.
* Further research and real-world applications will determine its long-term impact and adoption.

i summarized the transcript with gemini 1.5 pro
@XX-vu5jo 24 days ago
Gemini is a joke lol
@FunkyJeff22 23 days ago
Thanks!
@guillaumevermeillesanchezm2427 23 days ago
How much did it cost?
@wolpumba4099 23 days ago
@@guillaumevermeillesanchezm2427 Nothing. I'm in some kind of beta. It is also super fast (less than 10 seconds). Much better than GPT-4
@guillaumevermeillesanchezm2427 23 days ago
@@wolpumba4099 thank you for answering!
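Picking up the exponential-gating and stabilization bullets from the summary above: the idea is to keep gate pre-activations in log space and subtract a running maximum before exponentiating, so the exponential never overflows. Below is a simplified scalar sketch with illustrative names, not the reference implementation.

```python
import numpy as np

def stabilized_exp_gates(log_i, log_f, m_prev):
    """Exponential input/forget gates with a running-max stabilizer:
    keep pre-activations in log space and subtract the max before
    exponentiating, so exp() never overflows."""
    m = max(log_f + m_prev, log_i)        # new stabilizer state
    i_gate = np.exp(log_i - m)            # stabilized input gate
    f_gate = np.exp(log_f + m_prev - m)   # stabilized forget gate
    return i_gate, f_gate, m

# pre-activations this large would overflow a naive exp()
print(stabilized_exp_gates(log_i=800.0, log_f=5.0, m_prev=0.0))
# -> (1.0, ~0.0, 800.0): finite gate values, no overflow
```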
@CM-mo7mv 23 days ago
finally approaching ART
@RezaJavadzadeh 21 days ago
brilliant thanks Yannic
@Mordenor 24 days ago
Thank you Mr Yannic for explaining xLSTM, which extends the famous Long Short-Term Memory model. P.S. I like your videos, so please stay healthy.
@aintgonhappen 23 days ago
Pray for Mr Yannic 🙏🙏🙏
@EobardUchihaThawne 23 days ago
mLSTMs are similar to Google's Infini-attention in how they handle memory retrieval
@pietrorse 22 days ago
this reminds me of serialization and parallelization mixing in various layers, which I actually observe in nature.
@paxdriver 23 days ago
thank you Yan! I thought I was crazy, but you seem to have read a similar tone in the early sections lol. That's pretty funny: "our paper is all about this addition, and this multiplication... Novel ideas, eh?". That's the headline, but only after that does the real new part start with memory management (soft memory, not hardware... also confusing).
@yeetyeet7070 24 days ago
Extended Long Short-Term really sounds like upper lower middle class
@Hexanitrobenzene 23 days ago
Yeah, the adjacent words "long" and "short" do not clear matters up at all... In contrast, the authors of "Attention is all you need" could work for political campaigns writing slogans as a side hustle :)
@DamianReloaded 24 days ago
So, the answer is kind of yes. If you scale a high-dimensional token mixer using backpropagation to adjust weights towards the desired result, you will achieve functionality. The question lingering in my mind is: do biological neural networks employ backpropagation? How do we one-shot learn new token sequences, and how are we able to remember them long-term and bring them back when we need them if they are so low-probability (we only saw them once)?
@xxlvulkann6743 23 days ago
I imagine that when you have agentic models, you can implement more sophisticated memory encoding. For example, you might allow for particular memory samples to have a larger "significance" based upon your current level of arousal/reward. Also, exposure to a token doesn't have to come from the external environment, it may result from constantly "thinking" about the topic, essentially generating and training on synthetic data. We must remember that generative models are still not actual agentic models, they're basically just foundation models.
@ssssssstssssssss 23 days ago
Backpropagation is largely considered implausible for biological networks and BPTT is impossible because it is a non-causal system. Some do think the brain does employ some kind of "gradient" though.
@Hexanitrobenzene 23 days ago
@@ssssssstssssssss BPTT?
@ChlorieHCl 22 days ago
@@Hexanitrobenzene Back-propagation through time
@eltongoaustriaco8268 22 days ago
The brain might generate a training signal from a single example in short-term memory (think of repeating your hotel room number in your mind). Regarding BP, it is plausible that the brain uses a less optimised version of it.
@chickenp7038 24 days ago
we need a new Mamba explanation. The current one has errors and doesn't really explain much.
@longvo7088 24 days ago
You need to read previous papers like HiPPO and S4 to be able to understand Mamba, along with some prerequisite CUDA programming skills.
@AM-yk5yd 23 days ago
Sasha Rush has several as he seems to be a big fan of SSM. "Mamba: The Hard Way" is very detailed.
@andytroo 22 days ago
33:10 - is this sort of a built-in softmax? Exponentiate everything, then normalise?
@tamirtsogbayar3912 23 days ago
Hello Yannic, thanks for your videos! Are you going to make some videos related to KAN (Kolmogorov-Arnold Networks)? Thank you
@quickpert1382 23 days ago
KANs are fairly easy, and it's a nice lecture to venture into by yourself
@_XoR_ 21 days ago
Unfortunately they are quite flawed for most applications, since they don't scale, and depending on the distribution shape they can be worse than MLPs.
@quickpert1382 20 days ago
@@_XoR_ Yep, for now we are waiting for an optimized implementation.
@hanskraut2018 17 days ago
I could have told you that back when I was at the end of kindergarten. I hope there is more behind it than what it sounds like.
@edeneden97 23 days ago
in the mLSTM block, isn't it very similar to attention just without softmax?
@GGlessGo 3 days ago
And is it? Can't completely follow, actually.
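On the question in this thread: dropping the softmax is exactly what lets attention be computed recurrently as a running sum of value-key outer products, which has the same shape as a matrix-memory update. A small NumPy sketch with illustrative names, causal masking, and no scaling or normalization:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = rng.normal(size=(3, T, d))

# attention without softmax, parallel (masked) form
mask = np.tril(np.ones((T, T)))
out_parallel = (mask * (Q @ K.T)) @ V

# the same computation done recurrently with a d*d state
C = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    C += np.outer(V[t], K[t])          # covariance-style state update
    out_recurrent[t] = C @ Q[t]        # read out with the current query

print(np.allclose(out_parallel, out_recurrent))   # True
```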
@bensimonjoules4402 21 days ago
The last few papers Yannic covered all follow the same line of bringing back some sort of recurrence alongside transformers. In this case not explicitly, but I don't see a fundamental reason why each step of the sequence couldn't be processed by one. There seems to be a clear research direction of resurging recurrence; I wonder if it has a formal theory or even a name.
@matveyshishov 24 days ago
Wait, I've been watching your channel for maaany years, how come it only has 245k subscribers, and something like 2minpapers has 1.5M?
@ChlorieHCl 23 days ago
I've felt a significant decline in quality in Two Minute Papers videos. The two minutes are like 30s of unwanted background info, 30s of experimental results, and 1min of sponsor acknowledgment. And also “what a time to be alive” and “hold on to your papers”, apparently. No real info gained from those videos. To the point that I unsubscribed from that channel months ago just to get rid of the annoyance.
@yakmage8085 23 days ago
@@ChlorieHCl There's been a decline for sure, but also Yannic's videos have a significantly higher minimum education requirement. Two Minute Papers videos are just highlights, with no math, intuition, or criticism.
@AvastarBin 23 days ago
Because 2minpapers videos are 5 or 6 minutes long (ironically) and are understandable by anyone regardless of background, whereas Yannic's videos are an hour long, very in-depth, and require a lot of background knowledge in ML.
@GoldenBeholden 23 days ago
@@ChlorieHCl Yeah, seeing some guy get enthusiastic about research papers was nice enough when he had just begun and sat below 30k subscribers, but he really started playing into his "character" rather than the actual content of the papers. Not really worth your time anymore, to be honest. AI Explained is great if you're looking for another channel in the same vein as this one (albeit lighter on the academics).
@thirdeye4654 23 days ago
Why do influencers on TikTok have millions of followers just talking bullshit all day long? Because people love entertainment, and not many have a long attention span. Also, there is only so much time in your own life to watch and do stuff.
@florianjug 21 days ago
Isn’t that close to the mindset behind Mamba as well? What would be the key difference?!
@Fabio-zi4hw 23 days ago
Is this the ultimate bitter lesson?
@herp_derpingson 23 days ago
Money is all you need
@TiagoTiagoT 23 days ago
Is there enough information in the pdf that some of the current bigger LLMs that can read pdfs would be able to produce the equivalent code to what the researchers used to get their alleged results?
@Hexanitrobenzene 23 days ago
This task probably requires AGI...
@aakashsaini3327 24 days ago
3rd for AI :P
@dairin0d 23 days ago
Regarding the large memory requirements of the d*d matrix, perhaps they could take a page from the Vector Symbolic Architectures approach? In VSA, state, keys and values are all vectors of the same shared space (and so have the same dimension), so if all that's needed is to combine them in a way that would result in dot(new_state, key) ~= value, VSA's binding operation (e.g. component-wise / Hadamard product) sounds like a perfectly viable replacement 🤔 I suppose it would still benefit from large space dimensionality, but a vector size can be controlled on a more granular level than a square matrix size. If they use binary or ternary weights, the memory requirements would be even smaller (though that would probably require some changes in how the model is trained).
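To make the suggestion above concrete, here is a tiny NumPy sketch of VSA-style binding with bipolar keys: the "memory" is a single d-dimensional vector holding a superposition of key-times-value Hadamard products, and unbinding with a key recovers its value only approximately, with crosstalk from the other stored pair. The bipolar-key choice and all names are illustrative assumptions, not something from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096                                    # VSA-style high dimensionality

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# bipolar keys make the Hadamard binding self-inverse: k * k == 1
k1, k2 = rng.choice([-1.0, 1.0], size=(2, d))
v1, v2 = rng.normal(size=(2, d))

memory = k1 * v1 + k2 * v2                  # d floats instead of a d*d matrix
v1_hat = memory * k1                        # unbind: v1 plus a crosstalk term k1*k2*v2

print(cos(v1_hat, v1), cos(v1_hat, v2))     # ~0.7 vs ~0.0: noisy, but clearly recovers v1
```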
@JerryFederspiel 15 days ago
If I'm thinking about this right, the off-diagonal elements of the outer products of k and v can be thought of as "clues" that each vector element in the key gives about each other vector element in the value. The Hadamard product dispenses with these clues (each element is treated independently), but maybe each individual element only has to be kind-of right with a VSA because d is so high. It may also be possible to compromise between Hadamard and outer products by taking the key and value vectors and breaking them up into P parts of d/P elements each. Then you take the outer products of corresponding parts. This gives us a memory requirement of P * (d/P)^2 = d^2 / P. It means that each key element gives a clue about d/P value elements. Setting P to sqrt(d) feels good, so clearly that is the right choice 🙂
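A quick sketch of the blockwise compromise described above, purely illustrative and not from the paper: split k and v into P parts, keep only the outer products of corresponding parts, and retrieve block by block. With P = sqrt(d), the memory drops from d^2 to d^(3/2) entries.

```python
import numpy as np

def block_outer_memory(k, v, P):
    """Split k and v into P parts of d//P elements and take the outer
    product of corresponding parts only: P * (d/P)^2 = d^2 / P entries."""
    d = k.shape[0]
    assert d % P == 0
    kp = k.reshape(P, d // P)
    vp = v.reshape(P, d // P)
    return np.einsum('pi,pj->pij', vp, kp)    # (P, d/P, d/P) block memory

def block_retrieve(mem, q):
    """Apply each memory block to the matching slice of the query."""
    P, dp, _ = mem.shape
    qp = q.reshape(P, dp)
    return np.einsum('pij,pj->pi', mem, qp).reshape(-1)

d, P = 16, 4                                  # P = sqrt(d), as the comment suggests
rng = np.random.default_rng(0)
k, v = rng.normal(size=d), rng.normal(size=d)
mem = block_outer_memory(k, v, P)
print(mem.size, "entries vs", d * d)          # 64 vs 256
print(block_retrieve(mem, k)[:4])             # ≈ v[:4] scaled by the block's ||k_p||^2
```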
@qhansen123 24 days ago
2nd for AI
@AmirNajafgholi 11 days ago
Don't you want to review KAN?
@intrinsical 23 days ago
So the matrix memory is simply old school Kohonen Maps from the 70s?
@Hexanitrobenzene 23 days ago
It seems so, if that's the name. They list Kohonen, Anderson and Nakano as references, all from 1972.
@jonsmith6331 24 days ago
First for AI
@darshank8748 24 days ago
AI for fisting
@rumfordc 24 days ago
ngl that's gotta be among the top 20 stupidest names for anything i've ever heard
@GAISENSE 24 days ago
Feels more tLSTM than mLSTM, right?
@jonnylukejs 23 days ago
I invented this and it got jacked, low key. I called it "block matrix LSTM", and they changed the name to be dicks and get away with it, but the fact that it exactly follows my ipynb for it is like ehhh
@jonnylukejs 23 days ago
My app is called Hyper Chat and I'm still going to launch it, but yeah, I've had this since I wrote the code for it.
@wunder1385 23 days ago
Sure bro
@XX-vu5jo 24 days ago
I kinda find this a joke of a paper
@corgirun7892 12 days ago
the baseline is unfair