Mixture of Experts LLM - MoE explained in simple terms

13,575 views

code_your_own_AI

8 months ago

Mixture of Experts - MoE explained in simple terms with three easy examples.
You can test Mixtral 8x7B via this link (sign-in required; beta version only, so expect limitations):
app.fireworks.ai/models/firew...
GPT-4-generated summary:
The video transcript provides a comprehensive overview of the development and optimization of mixture of experts (MoE) systems in the context of Large Language Models (LLMs). The presenter begins by introducing the concept of MoE as a framework for decomposing LLMs into smaller, specialized systems that focus on distinct aspects of input data. This approach, particularly when sparsely activated, enhances computational efficiency and resource allocation, especially in parallel GPU computing environments. The video traces the evolution of MoE systems from their inception in 2017 by Google Brain, highlighting the integration of MoE layers within recurrent language models and the critical role of the gating network in directing input tokens to the appropriate expert systems.
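To make the core idea concrete, here is a minimal numpy sketch of a sparsely activated MoE layer: each token is routed only to its top-k experts, and the layer output is the gate-weighted sum of those experts' feed-forward outputs. The expert structure (a simple two-matrix ReLU FFN), names, and shapes are illustrative assumptions, not the video's or Mixtral's actual code.
import numpy as np

def moe_layer(x, experts, gates, topk_idx):
    # x: (tokens, dim); experts: list of (w1, w2) weight pairs, one per expert
    # gates: (tokens, num_experts) sparse routing weights; topk_idx: (tokens, k) chosen expert ids
    y = np.zeros_like(x)
    for t in range(x.shape[0]):              # loop over tokens for clarity, not speed
        for e in topk_idx[t]:                # only the selected experts run for this token
            w1, w2 = experts[e]
            h = np.maximum(x[t] @ w1, 0.0)   # expert FFN: linear -> ReLU -> linear
            y[t] += gates[t, e] * (h @ w2)   # gate-weighted combination of expert outputs
    return y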
The video then delves into the technical specifics of MoE systems, focusing on how the gating network intelligently assigns tokens to specific expert systems. Various gating functions, such as softmax gating and noisy top-k gating, are discussed, detailing how they introduce sparsity and noise into the gating process. The presenter emphasizes that the gating network is trained via backpropagation alongside the rest of the model, ensuring effective token assignment and a balanced computational load. The video also addresses the challenges of data parallelism and model parallelism in MoE systems, underlining the need for balanced network bandwidth and utilization.
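The noisy top-k gating from the 2017 paper can be sketched in a few lines: a learned gate projection scores every expert, input-dependent Gaussian noise is added (scaled by a softplus so it stays positive), and only the k highest-scoring experts keep non-zero weight after the softmax. The weight matrices w_gate and w_noise and all shapes below are assumptions for illustration.
import numpy as np

def noisy_top_k_gating(x, w_gate, w_noise, k=2):
    # x: (tokens, dim); w_gate, w_noise: (dim, num_experts)
    clean_logits = x @ w_gate
    noise_std = np.log1p(np.exp(x @ w_noise))             # softplus keeps the noise scale positive
    noisy_logits = clean_logits + np.random.randn(*clean_logits.shape) * noise_std
    topk_idx = np.argsort(noisy_logits, axis=-1)[:, -k:]  # indices of the k best experts per token
    masked = np.full_like(noisy_logits, -np.inf)          # every expert outside the top k is masked out
    np.put_along_axis(masked, topk_idx, np.take_along_axis(noisy_logits, topk_idx, axis=-1), axis=-1)
    gates = np.exp(masked - masked.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)            # softmax over the surviving experts only
    return gates, topk_idx

These gates and indices are what the layer sketch above consumes, and because they come from differentiable operations they can be trained with backpropagation together with the experts.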
Advancements in MoE systems are then discussed, with a particular focus on MegaBlocks (2022), which tackled limitations of classical MoE systems by reformulating MoE computation in terms of block-sparse matrix operations. This innovation led to more efficient GPU kernels for block-sparse matrix multiplication, significantly improving computational speed. The video concludes with the latest trends in MoE systems, including the integration of instruction tuning in 2023, which further improved MoE performance on downstream tasks. Overall, the presentation gives an in-depth view of the evolution, technical underpinnings, and future directions of MoE systems for LLMs and vision-language models.
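To give a feel for the reformulation behind MegaBlocks, the sketch below groups tokens by their assigned expert so each expert processes its whole group in a single dense matmul, with no fixed capacity, no padding, and no dropped tokens. This only illustrates the idea; the actual library implements it with custom block-sparse GPU kernels.
import numpy as np

def group_by_expert(x, expert_ids, num_experts):
    # x: (tokens, dim); expert_ids: (tokens,) kept to a top-1 assignment for simplicity
    order = np.argsort(expert_ids, kind="stable")            # permute so same-expert tokens are contiguous
    sizes = np.bincount(expert_ids, minlength=num_experts)   # how many tokens each expert received
    groups = np.split(x[order], np.cumsum(sizes)[:-1])       # one variable-sized batch per expert
    return groups, order                                     # keep the permutation to scatter outputs back

In a classical MoE implementation each expert has a fixed capacity, so overflow tokens are dropped and underfull experts are padded; the variable-sized, block-sparse formulation avoids both.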
Mixtral 8x7B config:
{
  "dim": 4096,
  "n_layers": 32,
  "head_dim": 128,
  "hidden_dim": 14336,
  "n_heads": 32,
  "n_kv_heads": 8,
  "norm_eps": 1e-05,
  "vocab_size": 32000,
  "moe": {
    "num_experts_per_tok": 2,
    "num_experts": 8
  }
}
Unverified claim: GPT-4 is rumored to use 8 experts with roughly 111 billion parameters each.
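A quick back-of-the-envelope calculation from the config above shows why MoE adds parameters without adding much compute per token: all 8 experts are stored, but only num_experts_per_tok = 2 run for each token. The estimate below covers only the expert feed-forward weights and assumes a SwiGLU-style expert with three weight matrices, so treat the numbers as a rough illustration rather than an official breakdown.
# Rough FFN-only parameter estimate, assuming three (dim x hidden_dim) matrices per expert
dim, hidden_dim, n_layers = 4096, 14336, 32
num_experts, experts_per_tok = 8, 2

ffn_params_per_expert = 3 * dim * hidden_dim
total_ffn = n_layers * num_experts * ffn_params_per_expert       # stored in memory
active_ffn = n_layers * experts_per_tok * ffn_params_per_expert  # used per token

print(f"FFN parameters stored: {total_ffn / 1e9:.1f}B")             # ~45.1B
print(f"FFN parameters active per token: {active_ffn / 1e9:.1f}B")  # ~11.3B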
recommended literature:
---------------------------------
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
arxiv.org/pdf/2211.15841.pdf
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
arxiv.org/pdf/1701.06538.pdf
GitHub: MegaBlocks is a lightweight library for mixture-of-experts (MoE) training
github.com/mistralai/megabloc...
#ai
#experts
#tutorialyoutube

Comments: 22
@javiergimenezmoya86 7 months ago
A video implementing MoE training with several switching LoRA layers would be great!
@HugoCatarino 7 months ago
What a great class! Very much appreciated 🙌👏👏🙏
@patxigonzalez4206 7 months ago
Woah... thanks a lot for this clean and powerful explanation of such a dense topic. As a representative of average people, I appreciate it very much.
@TylerLali 7 months ago
Hopefully this doesn't sound entitled, but rather expresses my gratitude for your excellent work: yesterday I searched for videos on MoE and saw several, but decided not to watch the others and instead wait for your analysis. Here I am today, and this video enters my feed automatically :) Thanks for all you do for your community!
@darknessbelowth1409 7 months ago
very nice, thank you for a great vid.
@ricardocosta9336 7 months ago
yaya!🎉🎉🎉🎉🎉 ty so much once again
@suleimanshehu5839 7 months ago
Please create a video on fine-tuning a MoE LLM using LoRA adapters. Can one train an individual expert LLM within a MoE such as Mixtral 8x7B?
@robertfontaine3650 6 months ago
Thank you.
@TheDoomerBlox 1 month ago
Is this where I raise the obvious question of "wouldn't a Grokked(tm) model be the perfect fit for an Expert-Picking mechanism?"
@yinghaohu8784 4 months ago
In an autoregressive model, tokens are generated step by step. So when does the router run? Is routing performed on each forward pass, or is it decided once at the very beginning?
@LNJP13579 4 months ago
Could you please share a link to your presentation? I need the content to make my own abridged notes.
@hoangvanhao7092 7 months ago
00:02 Mixture of Experts LLM enables efficient computation and resource allocation for AI models.
02:46 Mixture of Experts LLM uses different gating functions to assign tokens to specific expert systems.
05:24 MegaBlocks addressed limitations of the classical MoE system and optimized block-sparse computations.
08:12 Mixture of Experts selects the top-k expert systems based on scores.
10:59 Mixture of Experts LLM increases model parameters without extra computational expense.
13:33 Mixture of Experts LLM - MoE efficiently organizes student-teacher distribution.
16:07 Block-sparse formulation ensures no token is left behind.
18:35 Mixture of Experts system dynamically adjusts block sizes for more efficient matrix multiplication.
20:57 A mixture-of-experts layer consists of independent feed-forward experts with intelligent gating functionality.
@user-bf6bu3ex8c 6 months ago
Which PDF reader are you using to read the research papers?
@davidamberweatherspoon6131 7 months ago
Can you explain to me how to combine MoE with LoRA adapters?
@cecilsalas8721 7 months ago
🤩🤩🤩🥳🥳🥳👍
@Jason-ju7df 7 months ago
I wonder if I can get them to do RPA
@krishanSharma.69.69f 7 months ago
I made them do SEX. It was tough but I managed.
@densonsmith2 6 months ago
Do you have a patreon or other paid subscription?
@matten_zero 7 months ago
Hello!
@PaulSchwarzer-ou9sw 7 months ago
@omaribrahim5519 7 months ago
Cool, but MoE is so foolish.
@EssentiallyAI 7 months ago
You're not Indian! 😁