Transformers Explained From The Atom Up (Many Inaccuracies! Revised Version Is Out Now!)

8,306 views

Jacob Rintamaki

1 day ago

Have you ever wanted to learn how transformers work from the atom up? Well this is the place to learn! :)
Please follow me for more nanotech and AI videos.
Twitter: @jacobrintamaki
Time-Stamps:
0:00 Optimus Prime
0:12 Overview
0:33 Atoms
0:48 Semiconductors
3:51 Transistors
5:53 Logic Gates
6:47 Flip-Flops
7:39 Registers
8:35 ALUs/Tensor Cores
10:34 SMs
12:08 GPU Architecture
13:44 ISA
14:42 CUDA
16:25 PyTorch
17:27 Transformers
21:41 Transformers (Hardware)
22:29 Final Thought

Comments: 26
@sophiawisdom3429
@sophiawisdom3429 10 days ago
Some thoughts as I watched the video:

- Tensor cores don't do FMAs, they do MMAs (matrix multiply-add). FMA is a different thing they can also do that typically refers to a *single* fused multiply-add. Kudos for mentioning they do the add though, most people skip over this.
- At 12:58 you have a slide with Register/SRAM/L1$/SRAM/L2$/DRAM. All of these are made of SRAM.
- Under ISA you mention the ISA for a tensor core, which I don't think makes sense. The tensor core is within the SM and is called just like any other part of the chip, like the MUFU. All of the stuff you put on the slide at 14:24 is also not part of the ISA as most people would understand it. Outputs also can't be written to memory (though as of Hopper they can be read from shared memory!).
- You're correct that CUDA is compiled to PTX and then SASS, but SASS probably doesn't stand for Source And Assembly (it probably stands for Shader Assembly, but NVIDIA never specifies), and CUBIN is a format for storing compiled SASS. What you're saying is equivalent to "C gets optimized to LLVM IR, then to armv9-a aarch64, then to ELF" on CPU.
- Ignoring Inductor, Torch does not compile PyTorch into CUDA -- this is an important distinction that is meaningful for both Torch's strengths and weaknesses. It calls pre-existing CUDA kernels that correspond to the calls you make.
- For transformers, I find it somewhat confusing that you're teaching encoder-decoder instead of decoder-only, but whatever. The dot product of things that are close would not be close to 1 -- the *softmax* of the dot product of things that are close would be close to 1. MHA is also not based on comparing the embeddings, but on comparing "queries" for each token to "keys" for each other token. The network *learns* specific things to look for.
- The addition is *not* about adding little numbers to it but about adding *the previous value* to it. The intuition is that attention etc. compute some small *update* to the previous value as opposed to totally transforming it.
- I think your explanation of the MLP also leaves something to be desired -- there are already nonlinearities in the network you described (layer norm and softmax). It also doesn't do an FMA, but a matrix multiply.
- Your explanation of the linear embedding at the end is confusing. Typically the unembedding layer *increases* the number of values per token, because the number of tokens is larger than d_model.
- You say all the matrix multiplies and additions happen in the tensor cores, inside the SM, whereas the intermediate stuff happens in the registers. All of that "happens in the registers" in the sense that the data starts and ends there, but more correctly it happens in the ALUs or the tensor cores.
- When you say that DRAM hasn't kept up as much: DRAM is made of the same stuff as the SMs -- it's all transistors.
- You mention you would have to redesign your ISA -- the ISA is redesigned every year, see e.g. docs.nvidia.com/cuda/cuda-binary-utilities/index.html .
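To make the attention points in that comment concrete, here is a minimal PyTorch-style sketch (not code from the video; the shapes and random projections are made up for illustration): the block compares learned queries against keys, it is the softmax of the scaled dot products that approaches 1 for a dominant match, and the residual path adds the block's output back onto the previous value.

```python
import torch
import torch.nn.functional as F

d_model, n_tokens = 64, 8
x = torch.randn(n_tokens, d_model)                   # token representations entering the block

W_q = torch.randn(d_model, d_model) / d_model**0.5   # learned projections (random stand-ins here)
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

q, k, v = x @ W_q, x @ W_k, x @ W_v                  # queries, keys, values: learned, not raw embeddings
scores = (q @ k.T) / d_model**0.5                    # raw dot products: not bounded near 1
weights = F.softmax(scores, dim=-1)                  # each row sums to 1; a dominant match approaches 1
attn_out = weights @ v                               # mix values by attention weights

y = x + attn_out                                     # residual: add the *previous value*, a small update to x
```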
@d_polymorpha
@d_polymorpha 10 days ago
Hello do you know of any resources to dive deeper into this higher level intro video? Specially towards cuda/pytorch/ actual transformer?
@maximumwal
@maximumwal 9 days ago
Very good post, but Jacob's right about DRAM. DRAM also uses capacitors to store the bits, and then transistors for reading, writing, and refreshing each bit. In addition, the manufacturing process is quite different. Moore's law for DRAM has been consistently slower than logic scaling, which is why NVIDIA pays 5x as much for HBM as for the main die, and still the compute : bandwidth ratio keeps getting more skewed towards compute every generation. Even SRAM, which is purely made of transistors, can't keep up, because leakage gets worse and worse, and if you're refreshing it all the time it's unusable. Logic is scaling faster due to both 1. physics and 2. better/larger tensor cores.
@sophiawisdom3429
@sophiawisdom3429 9 days ago
@@maximumwal Ah true, though I thought DRAM uses a transistor and a capacitor (?). I feel like you should expect them to pay more for HBM than for the die, because the main die is 80B transistors but the HBM is 80 GB * 8 bits/byte = 640B transistors + 640B capacitors. HBM is also much more expensive than regular DRAM I believe, like $30 vs $5 per GB.
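A quick back-of-envelope check of the arithmetic in this reply (a sketch only; the per-GB prices are the commenter's rough estimates, not official figures):

```python
hbm_bytes = 80 * 10**9                     # 80 GB of HBM
hbm_bits = hbm_bytes * 8                   # 640e9 bits
# 1T1C DRAM: one access transistor and one capacitor per bit
print(f"~{hbm_bits/1e9:.0f}B transistors + {hbm_bits/1e9:.0f}B capacitors in the HBM")

logic_transistors = 80 * 10**9             # ~80B transistors on the main GPU die
print(f"memory cells outnumber logic transistors ~{hbm_bits/logic_transistors:.0f}x")

# rough cost comparison at the commenter's estimates of $30/GB (HBM) vs $5/GB (commodity DRAM)
print(f"80 GB: HBM ~${80*30}, commodity DRAM ~${80*5}")
```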
@maximumwal
@maximumwal 9 days ago
@@sophiawisdom3429 Yes, there's 1 transistor per capacitor, whose channel and gate connect to the bit and word lines. Branch Education has a great video on this. As for HBM being roughly the same transistors/$: true, but DRAM used to be much cheaper, because logic has tens of layers of wires/vias on top of the transistors at the bottom, vs just 2 simple layers of wires on DRAM. With B100 and beyond, HBM will be more expensive than logic on a per-transistor basis. There are many reasons for this, including the fact that smaller capacitors have to be refreshed more often, and the hard limits of memory frequency + bits per pulse (A100 -> H100 doubled bits per pulse but lowered frequency, probably since it's harder to parse the signal at low power, but possibly because of greater resistance with thinner bitlines), which were previously leaned on to improve GB/s/pin. On the die, by contrast, you can just build a larger systolic array/tensor core, get more flops/(transistors * clock cycles), and increase clock frequency more easily; you just have to manage power. Right now we're stacking HBM with even more layers (8 -> 12 -> 16) and using more stacks (5 -> 8). NVIDIA will eat the cost and lower their margins. The normalizations + activations are soon going to use more GPU-seconds than the matmuls. Everyone knows this, so tricks on the algorithm, scheduling, and hardware sides are being aggressively pursued to provide life support to Huang's law.
@En1Gm4A
@En1Gm4A 10 days ago
Highest signal-to-noise ratio ever observed
@jacobrintamaki
@jacobrintamaki 11 days ago
Time-Stamps: 0:00 Optimus Prime 0:12 Overview 0:33 Atoms 0:48 Semiconductors 3:51 Transistors 5:53 Logic Gates 6:47 Flip-Flops 7:39 Registers 8:35 ALUs/Tensor Cores 10:34 SMs 12:08 GPU Architecture 13:44 ISA 14:42 CUDA 16:25 PyTorch 17:27 Transformers 21:41 Transformers (Hardware) 22:29 Final Thought
@ramanShariati
@ramanShariati 6 days ago
LEGENDARY 🏆
@sudhamjayanthi
@sudhamjayanthi 10 days ago
damn super underrated channel - i'm the 299th sub! keep posting more vids like this :)
@logan4565
@logan4565 5 days ago
This is awesome. Keep it up
@Barc0d3
@Barc0d3 10 days ago
This was a great and comprehensive high-level intro. Oh wow 😮 can we hope to get a continuation of these lectures?
@tsugmakumo2064
@tsugmakumo2064 10 days ago
I was talking with GPT-4o about exactly these abstraction layers, from the atom up to a compiler, so this video will be a great refresher.
@codenocode
@codenocode 10 days ago
Great timing for me personally (I was just dipping my toes into A.I.)
@Nurof3n_
@Nurof3n_ 10 days ago
you just got 339 subscribers 👍 great video
@CheeYuYang
@CheeYuYang 10 days ago
Amazing
@pragyanur2657
@pragyanur2657 10 days ago
Nice
@boymiyagi
@boymiyagi 10 days ago
Thanks
@nicholasdominici
@nicholasdominici 10 days ago
This video is my comp sci degree
@baby-maegu28
@baby-maegu28 4 days ago
I appreciate it. AAAAA make me down here.
@rahultewari7016
@rahultewari7016 10 days ago
Dudee this is so fucking cool 🤩kudos!!
@milos_radovanovic
@milos_radovanovic 3 days ago
you skipped quarks and gluons
@isaac10231
@isaac10231 10 days ago
So in theory this is possible in Minecraft
@baby-maegu28
@baby-maegu28 4 days ago
14:50
@PRFKCT
@PRFKCT 10 days ago
wait Nvidia invented GPUs? wtf
@d_polymorpha
@d_polymorpha 10 days ago
GPUs have only existed for about 25 years!🙂
@enticey
@enticey 10 days ago
they weren't the first ones, no
Revised version: Transformers Explained From The Atom Up (REVISED!) · 28:01 · Jacob Rintamaki · 1.4K views