Transformers Explained From The Atom Up (Many Inaccuracies! Revised Version Is Out Now!)

8,306 views

Jacob Rintamaki

1 day ago

Have you ever wanted to learn how transformers work from the atom up? Well this is the place to learn! :)
Please follow me for more nanotech and AI videos.
Twitter: @jacobrintamaki
Time-Stamps:
0:00 Optimus Prime
0:12 Overview
0:33 Atoms
0:48 Semiconductors
3:51 Transistors
5:53 Logic Gates
6:47 Flip-Flops
7:39 Registers
8:35 ALUs/Tensor Cores
10:34 SMs
12:08 GPU Architecture
13:44 ISA
14:42 CUDA
16:25 PyTorch
17:27 Transformers
21:41 Transformers (Hardware)
22:29 Final Thought

Comments: 26
@sophiawisdom3429
@sophiawisdom3429 10 days ago
Some thoughts as I watched the video:

- Tensor cores don't do FMAs, they do MMAs (matrix multiply-add). FMA is a different thing they can also do that typically refers to a *single* fused multiply-add. Kudos for mentioning they do the add though, most people skip over this.
- At 12:58 you have a slide with Register/SRAM/L1$/SRAM/L2$/DRAM. All of these are made of SRAM.
- Under ISA you mention the ISA for a tensor core, which I don't think makes sense. The tensor core is within the SM and is called just like any other part of the chip, like the MUFU. All of the stuff you put on the slide at 14:24 is also not part of the ISA as most people would understand it. Outputs also can't be written to memory (though as of Hopper they can be read from shared memory!).
- You're correct that CUDA is compiled to PTX and then SASS, but SASS probably doesn't stand for Source And Assembly (it probably stands for Shader Assembly, but NVIDIA never specifies), and CUBIN is a format for storing compiled SASS. What you're saying is equivalent to "C gets optimized to LLVM IR, then to armv9-a aarch64, then to ELF" on CPU.
- Ignoring Inductor, Torch does not compile PyTorch into CUDA -- this is an important distinction that is meaningful for both Torch's strengths and weaknesses. It calls pre-existing CUDA kernels that correspond to the calls you make.
- For transformers, I find it somewhat confusing that you're teaching encoder-decoder instead of decoder-only, but whatever. The dot product of things that are close would not be close to 1 -- the *softmax* of the dot product of things that are close would be close to 1. MHA is also not based on comparing the embeddings, but on comparing "queries" for each token to "keys" for each other token. The network *learns* specific things to look for.
- The addition is *not* about adding little numbers to it but about adding *the previous value* to it. The intuition is that attention etc. compute some small *update* to the previous value as opposed to totally transforming it.
- I think your explanation of the MLP also leaves something to be desired -- there are already nonlinearities in the network you described (layer norm and softmax). It also doesn't do an FMA, but a matrix multiply.
- Your explanation of the linear embedding at the end is confusing. Typically the unembedding layer *increases* the number of values per token, because the number of tokens is larger than d_model.
- You say all the matrix multiplies and additions happen in the tensor cores, inside the SM, whereas the intermediate stuff happens in the registers. All of that "happens in the registers" in the sense that the data starts and ends there, but more correctly it happens in the ALUs or the tensor cores.
- When you say that DRAM hasn't kept up as much: DRAM is made of the same stuff as the SMs -- it's all transistors.
- You mention you would have to redesign your ISA -- the ISA is redesigned every year, see e.g. docs.nvidia.com/cuda/cuda-binary-utilities/index.html .
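To make the attention points in that comment concrete, here is a minimal PyTorch-style sketch (not code from the video; the shapes and random projections are made up for illustration): the block compares learned queries against keys, it is the softmax of the scaled dot products that approaches 1 for a dominant match, and the residual path adds the block's output back onto the previous value.

```python
import torch
import torch.nn.functional as F

d_model, n_tokens = 64, 8
x = torch.randn(n_tokens, d_model)                   # token representations entering the block

W_q = torch.randn(d_model, d_model) / d_model**0.5   # learned projections (random stand-ins here)
W_k = torch.randn(d_model, d_model) / d_model**0.5
W_v = torch.randn(d_model, d_model) / d_model**0.5

q, k, v = x @ W_q, x @ W_k, x @ W_v                  # queries, keys, values: learned, not raw embeddings
scores = (q @ k.T) / d_model**0.5                    # raw dot products: not bounded near 1
weights = F.softmax(scores, dim=-1)                  # each row sums to 1; a dominant match approaches 1
attn_out = weights @ v                               # mix values by attention weights

y = x + attn_out                                     # residual: add the *previous value*, a small update to x
```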
@d_polymorpha
@d_polymorpha 10 days ago
Hello do you know of any resources to dive deeper into this higher level intro video? Specially towards cuda/pytorch/ actual transformer?
@maximumwal
@maximumwal 9 days ago
Very good post, but Jacob's right about DRAM. DRAM also uses capacitors to store the bits, and then transistors for reading, writing, and refreshing each bit. In addition, the manufacturing process is quite different. Moore's law for DRAM has been consistently slower than logic scaling, which is why NVIDIA pays 5x as much for HBM as for the main die, and still the compute : bandwidth ratio keeps getting more skewed towards compute every generation. Even SRAM, which is purely made of transistors, can't keep up, because leakage gets worse and worse, and if you're refreshing it all the time it's unusable. Logic is scaling faster due to both 1. physics and 2. better/larger tensor cores.
@sophiawisdom3429
@sophiawisdom3429 9 days ago
@@maximumwal Ah true, though I thought DRAM uses a transistor and a capacitor (?). I feel like you should expect them to pay more for HBM than for the die, because the main die is 80B transistors but the HBM is 80 GB * 8 bits/byte = 640B transistors + 640B capacitors. HBM is also much more expensive than regular DRAM I believe, like $30 vs $5 per GB.
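A quick back-of-envelope check of the arithmetic in this reply (a sketch only; the per-GB prices are the commenter's rough estimates, not official figures):

```python
hbm_bytes = 80 * 10**9                     # 80 GB of HBM
hbm_bits = hbm_bytes * 8                   # 640e9 bits
# 1T1C DRAM: one access transistor and one capacitor per bit
print(f"~{hbm_bits/1e9:.0f}B transistors + {hbm_bits/1e9:.0f}B capacitors in the HBM")

logic_transistors = 80 * 10**9             # ~80B transistors on the main GPU die
print(f"memory cells outnumber logic transistors ~{hbm_bits/logic_transistors:.0f}x")

# rough cost comparison at the commenter's estimates of $30/GB (HBM) vs $5/GB (commodity DRAM)
print(f"80 GB: HBM ~${80*30}, commodity DRAM ~${80*5}")
```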
@maximumwal
@maximumwal 9 days ago
@@sophiawisdom3429 Yes, there's 1 transistor per capacitor, whose channel and gate connect to the bit and word lines. Branch Education has a great video on this. As for HBM being roughly the same transistors/$: true, but DRAM used to be much cheaper, because logic has tens of layers of wires/vias on top of the transistors at the bottom, vs just 2 simple layers of wires on DRAM. With B100 and beyond, HBM will be more expensive than logic on a per-transistor basis. There are many reasons for this, including the fact that smaller capacitors have to be refreshed more often, and the hard limits of memory frequency + bits per pulse (A100 -> H100 doubled bits per pulse but lowered frequency, probably since it's harder to parse the signal at low power, but possibly because of greater resistance with thinner bitlines), which were previously leaned on to improve GB/s/pin. On the die, by contrast, you can just build a larger systolic array/tensor core, get more flops/(transistors * clock cycles), and increase clock frequency more easily; you just have to manage power. Right now we're stacking HBM with even more layers (8 -> 12 -> 16) and using more stacks (5 -> 8). NVIDIA will eat the cost and lower their margins. The normalizations + activations are soon going to use more GPU-seconds than the matmuls. Everyone knows this, so tricks on the algorithm, scheduling, and hardware sides are being aggressively pursued to provide life support to Huang's law.
@En1Gm4A
@En1Gm4A 10 days ago
Highest signal-to-noise ratio ever observed
@jacobrintamaki
@jacobrintamaki 11 days ago
Time-Stamps: 0:00 Optimus Prime 0:12 Overview 0:33 Atoms 0:48 Semiconductors 3:51 Transistors 5:53 Logic Gates 6:47 Flip-Flops 7:39 Registers 8:35 ALUs/Tensor Cores 10:34 SMs 12:08 GPU Architecture 13:44 ISA 14:42 CUDA 16:25 PyTorch 17:27 Transformers 21:41 Transformers (Hardware) 22:29 Final Thought
@ramanShariati
@ramanShariati 6 days ago
LEGENDARY 🏆
@sudhamjayanthi
@sudhamjayanthi 10 days ago
damn super underrated channel - i'm the 299th sub! keep posting more vids like this :)
@logan4565
@logan4565 5 days ago
This is awesome. Keep it up
@Barc0d3
@Barc0d3 10 days ago
This was a great and comprehensive high-level intro. Oh wow 😮 can we hope to get a continuation of these lectures?
@tsugmakumo2064
@tsugmakumo2064 10 days ago
I was talking with GPT-4o about exactly these abstraction layers, from the atom up to a compiler, so this video will be a great refresher.
@codenocode
@codenocode 10 days ago
Great timing for me personally (I was just dipping my toes into A.I.)
@Nurof3n_
@Nurof3n_ 10 days ago
you just got 339 subscribers 👍 great video
@CheeYuYang
@CheeYuYang 10 days ago
Amazing
@pragyanur2657
@pragyanur2657 10 days ago
Nice
@boymiyagi
@boymiyagi 10 days ago
Thanks
@nicholasdominici
@nicholasdominici 10 days ago
This video is my comp sci degree
@baby-maegu28
@baby-maegu28 4 days ago
I appreciate it. AAAAA make me down here.
@rahultewari7016
@rahultewari7016 10 days ago
Dudee this is so fucking cool 🤩kudos!!
@milos_radovanovic
@milos_radovanovic 3 days ago
you skipped quarks and gluons
@isaac10231
@isaac10231 10 days ago
So in theory this is possible in Minecraft
@baby-maegu28
@baby-maegu28 4 days ago
14:50
@PRFKCT
@PRFKCT 10 days ago
wait Nvidia invented GPUs? wtf
@d_polymorpha
@d_polymorpha 10 days ago
GPUs have only existed for about 25 years!🙂
@enticey
@enticey 10 days ago
they weren't the first ones, no
Revised version: Transformers Explained From The Atom Up (REVISED!) · 28:01 · Jacob Rintamaki · 1.4K views