Is Mamba Destroying Transformers For Good? 😱 Language Models in AI

  6,160 views

Analytics Camp

1 day ago

#mamba #transformers #llm #mambaai
Select 1080p for the best quality. The Transformer language model has started to transform the AI industry, but it has one main problem that could make it go extinct even before it fully takes off in 2024!
Watch this easy-to-follow, graphics-rich video about the model architectures and performance differences of the Transformer and Mamba language models. I compare the functionality of the main AI and machine learning models, and show how the Mamba AI model improves on its recurrent neural network predecessors, such as Long Short-Term Memory (LSTM) and gated networks.
Stick around for more videos on LLMs, Natural Language Processing (NLP), Generative AI, and fun coding and machine learning projects, and follow Analytics Camp on Twitter (X): / analyticscamp
www.youtube.co...
Don’t forget to subscribe and watch these related videos:
Transformer Language Models Simplified in JUST 3 MINUTES!
• Transformer Language M...
Mamba Language Model Simplified In JUST 5 MINUTES!
• Mamba Language Model S...
This Is How EXACTLY Language Models Work In AI-- NO Background Needed:
• This is how EXACTLY La...
Backpropagation Simplified in JUST 2 MINUTES! --Neural Networks
• The Concept of Backpro...
www.youtube.co...
Video Timeline:
00:00 Intro
00:27 Transformer language model
00:52 Mamba language model
01:21 Main task of any large language model
01:49 Transformer LLM encoder and decoder units
02:19 Attention Is All You Need
02:45 Mamba model architecture
03:08 Challenges of sequence modelling
03:20 Recurrent-based models
03:48 The main problem of the Transformer model
04:09 Attention Mechanism's quadratic memory complexity
04:21 Sequence scaling
04:48 Differences between Mamba's Selective Mechanism and Transformers' Attention Mechanism
05:25 Selective State Space Models
05:47 Performance degradation
05:57 Linear scaling in sequence length
06:14 Linear activation function
06:33 Hardware-aware algorithm and GPU with High Bandwidth
06:36 Parallel Scan
06:48 Training data
07:09 Synthetic Tasks
07:21 Selective Copying
07:37 Induction Heads
07:58 Performance summary: Transformers VS Mamba
08:16 Longformer and Linformer
08:20 Reformer model
08:22 Transformer-XL
08:24 Sparse Transformers
08:25 Big Bird model
Tags:
#mamba
#mamba ai
#SSM
#SelectiveSSM
#TransformerLM
#llm
#languagemodel
#transformers
#ai
#generativeai
#neuralnetworks
#machinelearning
#chatgpt
#lstm
#rnn

Comments: 34
@viswa3059 6 months ago
I came here for giant snake vs giant robot fight
@analyticsCamp 6 months ago
Thanks for watching :)
@soccerdadsg 1 month ago
I appreciate the effort you put into making this video.
@analyticsCamp 1 month ago
My pleasure, thanks for watching :)
@Researcher100 6 months ago
Thanks for the effort you put into this detailed comparison; I learned a few more things. Btw, the editing and graphics in this video were really good 👍
@analyticsCamp 6 months ago
Glad you liked it!
@first-thoughtgiver-of-will2456 2 months ago
Can Mamba have its input RoPE scaled? It seems it doesn't require positional encoding, but this might make it extremely efficient for second-order optimization techniques.
@analyticsCamp 2 months ago
In Mamba, the sequence length can be scaled up to a million (i.e., million-length sequences). It also trains with standard gradients (I did not find any info on second-order optimization in their method): they train for 10k to 20k gradient steps.
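To make the scaling point concrete, here is a rough, back-of-the-envelope sketch (my own illustration, not from the video or the paper) of why attention's cost grows quadratically with sequence length while an SSM scan grows linearly; the `d_model` and `d_state` values are arbitrary assumptions and real kernel constants are ignored:

```python
# Illustrative only: self-attention materialises an L x L score matrix
# (quadratic in sequence length L), while an SSM/recurrent scan touches
# each position once (linear in L).

def attention_cost(seq_len: int, d_model: int) -> int:
    """Rough count of score-matrix entries times feature mixing: O(L^2 * d)."""
    return seq_len * seq_len * d_model

def ssm_scan_cost(seq_len: int, d_model: int, d_state: int = 16) -> int:
    """Rough count of per-step state updates: O(L * d * N)."""
    return seq_len * d_model * d_state

for L in (1_000, 10_000, 100_000, 1_000_000):
    a, s = attention_cost(L, 1024), ssm_scan_cost(L, 1024)
    print(f"L={L:>9,}  attention≈{a:.2e}  ssm≈{s:.2e}  ratio≈{a / s:,.0f}x")
```

The ratio grows with L itself, which is why million-length sequences are practical for a linear-time scan but not for full self-attention.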
@thatsfantastic313 1 month ago
beautifully explained!
@analyticsCamp 1 month ago
Glad you think so!
@mintakan003 6 months ago
I'd like to see this tested out for larger models, comparable to LLaMA 2. One question I have is whether there are diminishing returns for long-distance relationships compared to a context window of sufficient size. Is it enough for people to give up (tried and true?) Transformers, with explicit modeling of the context, for something that is more selective?
@analyticsCamp 6 months ago
A thoughtful observation! Yes, the authors of Mamba have already tested it against Transformer-based architectures such as PaLM and LLaMA, and a range of other models. Here's what they state in their article, page 2: "With scaling laws up to 1B parameters, we show that Mamba exceeds the performance of a large range of baselines, including very strong modern Transformer training recipes based on LLaMa (Touvron et al. 2023). Our Mamba language model has 5× generation throughput compared to Transformers of similar size, and Mamba-3B’s quality matches that of Transformers twice its size (e.g. 4 points higher avg. on common sense reasoning compared to Pythia-3B and even exceeding Pythia-7B)."
With regards to scaling the sequence length, I explained a bit in the video. Here's a bit more from their article, page 1: "The efficacy of self-attention is attributed to its ability to route information densely within a context window, allowing it to model complex data. However, this property brings fundamental drawbacks: an inability to model anything outside of a finite window, and quadratic scaling with respect to the window length."
There's also an interesting summary table of model evaluations (Zero-shot Evaluation, page 13) comparing different Mamba model sizes with GPT-2, the H3 Hybrid model, Pythia, and RWKV, where in each instance Mamba exceeds these models' performance (check out the accuracy values on each dataset, especially for the Mamba 2.8-billion-parameter model; it is truly impressive). And thanks for watching :)
@optiondrone5468 6 months ago
I'm enjoying these mamba videos you're sharing with us, thanks
@analyticsCamp 6 months ago
Glad you like them!
@yuvrajsingh-gm6zk 6 months ago
keep up the good work, btw you got a new sub!
@analyticsCamp 6 months ago
Thanks for the sub!
@richardnunziata3221 6 months ago
A 7B to 10B Mamba would be interesting to judge, but right now it seems it's really good with long content in the small-model space.
@analyticsCamp 6 months ago
You are right! Generally speaking, a larger parameter count gives better results. But Mamba's claim is that we don't necessarily need larger models, just a more efficient model design that can perform comparably to other models even when trained on less data and with fewer parameters. I suggest their article, section 4.3.1, where they talk about "Scaling: Model Size", which can give you a good perspective. Thanks for watching :)
@70152136 6 months ago
Just when I thought I had caught up with GPTs and Transformers, BOOM, MAMBA!!!
@analyticsCamp 6 months ago
I know, right?!
@MrJohnson00111 6 months ago
You clearly explain the difference between Transformer and Mamba, thank you. Could you also give the reference for the paper you mention in the video so I can dive in?
@analyticsCamp 6 months ago
Hi, glad the video was helpful. The reference is mentioned multiple times in the video, but here's the full citation for your convenience: Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces.
@ricardofurbino 6 months ago
I'm doing work that uses sequence data, but not specific to language. In a Transformer-like network, instead of an embedding layer for the source and target, I have linear layers; I also send both source and target to the forward pass. In an LSTM-like network, I don't even need this step, I just use the standard torch LSTM cell; in that case, only the source is needed for the forward pass. Does anyone have a code example of how I can do this with Mamba? I'm having difficulty working it out.
@analyticsCamp 6 months ago
Hey, I just found a PyTorch implementation of Mamba at this link. I haven't gone through it personally, but if it's helpful please do let me know: medium.com/ai-insights-cobet/building-mamba-from-scratch-a-comprehensive-code-walkthrough-5db040c28049
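Since a concrete example was asked for, here is a minimal, untested sketch of the setup described above (generic sequence data, a linear projection instead of an embedding, source only in the forward pass). It assumes the `mamba_ssm` package from the official Mamba repository; the `Mamba` block and its arguments follow that package's README, but the API may differ by version and it expects a CUDA GPU. The class name `MambaForSequences`, layer sizes, and the regression head are hypothetical choices for illustration only.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumption: official state-spaces/mamba package (pip install mamba-ssm)


class MambaForSequences(nn.Module):
    """Decoder-only Mamba stack for non-language sequence data."""

    def __init__(self, in_features: int, d_model: int = 128,
                 n_layers: int = 4, out_features: int = 1):
        super().__init__()
        self.in_proj = nn.Linear(in_features, d_model)   # replaces the token embedding
        self.blocks = nn.ModuleList(
            Mamba(d_model=d_model, d_state=16, d_conv=4, expand=2)
            for _ in range(n_layers)
        )
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, out_features)      # per-step prediction

    def forward(self, src: torch.Tensor) -> torch.Tensor:
        # src: (batch, seq_len, in_features). No target tensor is needed:
        # like an LSTM, the model is causal and only sees past positions.
        x = self.in_proj(src)
        for block in self.blocks:
            x = x + block(x)                               # residual connection
        return self.head(self.norm(x))


# Usage sketch (shapes only):
# model = MambaForSequences(in_features=8).cuda()
# y = model(torch.randn(2, 1024, 8, device="cuda"))        # -> (2, 1024, 1)
```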
@consig1iere294 6 months ago
I can't keep up. Is Mamba a model like Mistral, or is it an LLM technology?
@analyticsCamp 6 months ago
Mamba is an LLM, but it has a unique architecture: a blend of traditional SSM-based models with a multi-layer perceptron, which helps it add 'selectivity' to the flow of information in the system (unlike Transformer-based models, which typically attend to the whole context, i.e., all the information, to predict the next word). If you are still confused, I recommend you watch the video on this channel called "This is how exactly language models work", which gives you a perspective on the different types of LLMs :)
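For anyone who wants to see what 'selectivity' means in code, below is a toy NumPy sketch of a selective state-space recurrence. This is my own simplification for illustration, not the paper's hardware-aware implementation: the point is that the step size delta and the B/C projections are computed from each input, which is what lets the model decide, token by token, what to write into or read out of its hidden state. All matrix shapes and initialisations here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 4, 8, 6

A = -np.exp(rng.standard_normal((d_model, d_state)))  # fixed state matrix (negative for stability)
W_delta = rng.standard_normal((d_model, d_model)) * 0.1
W_B = rng.standard_normal((d_model, d_state)) * 0.1
W_C = rng.standard_normal((d_model, d_state)) * 0.1

def selective_ssm(x):
    """x: (seq_len, d_model) -> y: (seq_len, d_model), one sequential scan."""
    h = np.zeros((d_model, d_state))                # hidden state per channel
    ys = []
    for x_t in x:                                   # linear in sequence length
        delta = np.log1p(np.exp(x_t @ W_delta))     # softplus: input-dependent step size
        B_t = x_t @ W_B                             # input-dependent "write" vector
        C_t = x_t @ W_C                             # input-dependent "read" vector
        A_bar = np.exp(delta[:, None] * A)          # discretised state transition
        B_bar = delta[:, None] * B_t[None, :]
        h = A_bar * h + B_bar * x_t[:, None]        # keep or overwrite memory per token
        ys.append(h @ C_t)                          # read out the state
    return np.stack(ys)

y = selective_ssm(rng.standard_normal((seq_len, d_model)))
print(y.shape)  # (6, 4)
```

In a plain (non-selective) SSM, B, C, and delta would be fixed parameters shared across all time steps; making them functions of the input is the "selection mechanism" the video contrasts with attention.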
@Kutsushita_yukino 6 months ago
Let's goooo! Mamba basically has similar memory to humans. But brains do tend to forget when information is unnecessary, so there's that.
@analyticsCamp 6 months ago
That's right. Essentially, the main idea behind SSM architectures (i.e., having a hidden state) is to be able to manage the flow of information in the system.
@datascienceworld 6 months ago
Great video.
@analyticsCamp 6 months ago
Thanks for the visit!
@user-du8hf3he7r 6 months ago
Low audio volume.
@analyticsCamp 6 months ago
Thanks for your feedback!
@raymond_luxury_yacht 6 months ago
Why didn't they call it Godzilla?
@analyticsCamp 6 months ago
Funny :) It seems to be as powerful as one!