The AI Hardware Arms Race Is Getting Out of Hand

31,766 views

bycloud

1 day ago

Check out Gamma.app now using this link here!
gamma.1stcollab.com/bycloud
CS-3
[Blog] cerebras.net/press-release/ce...
Groq
[Groq online Inference] groq.com/
[Groq LPU Blog] groq.com/wp-content/uploads/2...
Truffle-1
[Store] preorder.itsalltruffles.com/
Intel Max1550 Paper
[paper] arxiv.org/abs/2403.17607
[Tweet] x.com/main_horse/status/17728...
Extropic
[Blog] www.extropic.ai/future
[Andrew's Explainer] x.com/Andercot/status/1767252...
[Documentary] x.com/jasonjoyride/status/178...
[Documentary Retweet] x.com/0xKyon/status/178459142...
[Tweet Drama Thread 1] x.com/BasedBeffJezos/status/1...
[Tweet Drama Thread 2] x.com/fluxtheorist/status/178...
This video is supported by the kind Patrons & YouTube Members:
🙏Andrew Lescelius, alex j, Chris LeDoux, Alex Maurice, Miguilim, Deagan, FiFaŁ, Daddy Wen, Tony Jimenez, Panther Modern, Jake Disco, Demilson Quintao, Shuhong Chen, Hongbo Men, happi nyuu nyaa, Carol Lo, Mose Sakashita, Miguel, Bandera, Gennaro Schiano, gunwoo, Ravid Freedman, Mert Seftali, Mrityunjay, Richárd Nagyfi, Timo Steiner, Henrik G Sundt, projectAnthony, Brigham Hall, Kyle Hudson, Kalila, Jef Come, Jvari Williams, Tien Tien, BIll Mangrum, owned, Janne Kytölä, SO, Richárd Nagyfi, Hector, Drexon, Claxvii 177th, Inferencer, Michael Brenner, Akkusativ, Oleg Wock
[Discord] / discord
[Twitter] / bycloudai
[Patreon] / bycloud
[Music] Massobeats - Glisten
[Music] Massobeats - Lush
[Profile & Banner Art] / pygm7
[Video Editor] Silas
0:00 Intro
0:35 CS-3
1:52 Groq
2:48 Truffle-1
4:09 Intel Max 1550 Research
5:18 Extropic Chip
8:08 Gamma.app (sponsor)

Comments: 86
@bycloudAI 2 months ago
Check out Gamma.app now using this link here! gamma.1stcollab.com/bycloud
@Superfastisfast 2 months ago
no, yes, no, yes, no, yes, no, yes, no, yes, no, yes, no, yes, no, yes, no, yes, no, yes, no, yes, no, yes, maybe?
@imerence6290 2 months ago
You missed the 900 T/s generation speed on Groq while running Llama 3 8B. It's insane.
@catcoder12 2 months ago
Finally I can start writing my 1000 word essay at 11:59PM and submit it on time.
@2CSST2 2 months ago
I don't think the number of likes on Twitter should even be mentioned when it comes to determining whether a new innovation path is likely to succeed or not.
@globurim 2 months ago
Yeah that's just the hype meter
@o_2731 2 months ago
I believe he was being sarcastic ...
@BooleanDisorder 2 months ago
True. Especially since Twitter doesn't even exist anymore!
@Words-. 2 months ago
Maybe I missed a part in the video, though I think you've misunderstood the point of that segment, which I'm assuming is at 5:55. He said they got ratio-ed, which may signify skepticism amongst those who know about the tech; nowhere did he say that this was a good predictor of the company's success. I feel like this criticism is out of place.
@20xd6 2 months ago
explaining particle physics with spongebob is next level
@WoolyCow 2 months ago
so kids, if you fold some paper in half and then stab it with a pencil you get a wormhole!
@TheDragonshunter 2 months ago
We also need that guy who found a way to use our current fiber optic infrastructure for way faster internet to start cooking.
@MrValgard 2 months ago
Forgot to mention photonic chips!
@kushalvora7682 2 months ago
There are no performance tests so far, unlike Groq and the rest. But once photonic chips are perfected for real-world applications, electronic chips will feel like century-old technology.
@Guedez1 2 months ago
The Extropic chip thing seems iffy at best. While it probably actually can run that fast, it will probably be filled with compounding inaccuracies that will super lobotomize the models
@mu11668B 2 months ago
I'd say nothing's iffy there. It's a blatant scam. Simply running an RNG and hoping that a trained AI model comes out of it is like trying to implement the infinite monkey theorem without the "infinite" part.
@caseymurray7722 1 month ago
@@mu11668B It's a lot more than just RNG. Take a look at RandNLA speedups for linear algebra. Implementing a hardware accelerator for randomized linear algebra is exactly what they are doing. The hardware problem is needing billions of cores due to the massive amount of matrix multiplications that neural networks do. Instead of a general CPU/GPU/etc., it's a chip designed specifically for "noisy" inputs, which is what an AI model uses as inputs.
@mu11668B 1 month ago
@@caseymurray7722 The thing is, Extropic chip developers claim that speeding up the RNG part makes an impact. That would be true if the computational bottleneck came from the RNG algorithms. It doesn't take a chip dedicated to generating random numbers to make RandNLA more efficient. Moreover, RandNLA achieves its efficiency gain by reducing the total operations required to reach the target accuracy. You still need accurate floating point calculations for it to work. That does not seem to align with what Extropic chips are claimed to provide.
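The RandNLA trade-off these two are debating can be seen in a tiny randomized range-finder demo (the classic randomized low-rank approximation from the RandNLA literature; a toy illustration only, not Extropic's or anyone's hardware method): a cheap random projection finds the column space, and accurate floating-point math still does the final projection.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build an exactly rank-20 matrix (500 x 400) so the structure is known.
U = rng.standard_normal((500, 20))
V = rng.standard_normal((20, 400))
A = U @ V

# Randomized range finder: sample the column space with a random projection.
k = 30                                  # sketch size, a bit above the true rank
Omega = rng.standard_normal((400, k))
Q, _ = np.linalg.qr(A @ Omega)          # orthonormal basis for the sampled range
A_approx = Q @ (Q.T @ A)                # project A onto that basis (exact float math)

rel_err = np.linalg.norm(A - A_approx) / np.linalg.norm(A)
print(rel_err < 1e-8)                   # near-exact recovery, far cheaper than a full SVD
```

Note that the randomness only decides *where* to look; the speedup comes from doing fewer exact operations, which is mu11668B's point about still needing accurate floating-point hardware.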
@BernhardVoogenberger-tl5ox 2 months ago
No light chips mentioned.
@XenoCrimson-uv8uz 2 months ago
sadly.
@truthwillout2371 2 months ago
I think they're a bit far out for now.
@user-io4sr7vg1v 2 months ago
Mach-Zehnder.
@oxey_ 2 months ago
The developments on them are cool but we're still very far off actually using them for any sort of AI training or inference
@SirusStarTV 2 months ago
I've only seen the Taichi chip do something useful
@BooleanDisorder 2 months ago
Imagine attosecond pulse based photonic chips! With multiple wavelengths in the same pathways for extra oomph
@jameshughes3014 2 months ago
This stuff is so cool. Personally I love the idea of some kind of analog digital hybrid as the most efficient path forward, but all these different solutions are interesting. The truffle looks cool. I wish they'd make one half as powerful and half as expensive. Imagine being able to run a 60B model at home without adding to your pc's load?
@wiiztec 2 months ago
really surprised this wasn't about quantum computers
@pn4960 2 months ago
The question isn't whether Extropic can succeed in the current market environment, but whether the current market environment (AI bubble) will sustain itself long enough for Extropic to succeed.
@Alexisz998 2 months ago
A bubble? Where do you see a bubble?
@MaakaSakuranbo 2 months ago
@@Alexisz998 Everyone hype-jumping in on AI when we have very few actually reliable use cases for it yet? Images are cool, but still full of flaws and other issues. Text is cool, but hallucinations and stuff. Music is cool, but far from perfect.
@Alexisz998 2 months ago
@@MaakaSakuranbo I share your point of view, but remember that AI only took off a year and a half ago... and everything is getting better every month. AI won't limit itself to video, song, and text generation; I think it'll go far beyond, into humanoid robotics. And big tech companies like Microsoft, Google, Amazon, and Meta are already making huge profits using Nvidia GPUs by lowering their operating costs
@MaakaSakuranbo 2 months ago
@@Alexisz998 Humanoid robots require fixing the inherent issues in the current generation first
@Alexisz998 2 months ago
@@MaakaSakuranbo What do you mean by the current generation?
@WeirdInfoTV 2 months ago
Your editing is very funny
@ncking5414 2 months ago
A lot of ifs in the tech industry lately; I'll just wait and see actual physical outcomes.
@pladselsker8340 2 months ago
Empirical observation gang!
@Napert 2 months ago
300 tokens per second of what model? I can get 1000+ tokens per second on my 3060 ti if I use some of the smaller models (gemma 2b q2_k or qwen 0.5b q2_k)
@setop123 2 months ago
very good video 👏
@user-tg2or8wy2n 2 months ago
Hi! Great video. I am a little bit lost on some of the numbers though. Regarding the huge Cerebras chip: how on Earth are they training a 4 trillion parameter model, when at bfloat16 (2 bytes/parameter) the model alone (without counting activations and optimizer weights) weighs 8 TB?! The same applies to the Truffle part, where they claim you can train a 100B parameter model in just 60 GB of RAM. Sure, if you use int8 or less as the model's precision then it makes sense, but the standard is 2 bytes, so they should at least be more specific when giving their values. Is there something related to the 'shared memory' component that gets my calculation wrong? What am I missing? Thanks again for the video!
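The commenter's arithmetic checks out; the weights-only footprint is just parameters times bytes per parameter (standard byte sizes per precision, with the parameter counts quoted above):

```python
# Back-of-envelope memory footprint for model weights alone
# (ignores activations, gradients, and optimizer state).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory needed to hold raw weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

print(weight_memory_gb(4e12, "bf16"))   # 4T params at bf16  -> 8000.0 GB = 8 TB
print(weight_memory_gb(100e9, "int8"))  # 100B params at int8 -> 100.0 GB
```

So even at int8, 100B parameters alone exceed 60 GB, which is why the replies below point to partial loading rather than precision tricks.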
@novantha1 2 months ago
Model size in GB != size of the data needed to perform operations on that model. As an extreme example, there was a paper where they managed to run a 70B model (if memory serves) on a 4GB GPU by only moving the matrices actively being operated on into VRAM. The whole model was something like 140GB of information, but realistically, if you divide that across ~70 or so layers, it was more like 2GB per layer, plus some change. On top of that, the attention matrices and feed-forward matrices could be further split up as needed. Granted, in that case it was pretty slow (1 hour for a forward pass, lmao), but if you were planning a compute chip around that, there are a few ways around it; significantly faster RAID arrays to load data quickly, for example, would not be an unreasonable solution for a company like Cerebras.

With that said, there are other approaches too. Things like Mixture of Experts are much more effective if you're evaluating inference accuracy per memory access, for instance, or efficiency of parallel compute per unit of intercommunication. If you look at something like Snowflake Arctic, it's pretty efficient in the sense that you only load 2 out of 256 (if memory serves) experts, meaning that while the whole model could be 900+ GB, you could get away with running it on a 32GB GPU, given a sufficiently fast method to load the experts onto the GPU per forward pass.

In reality, there are a lot of solutions for dealing with large models, and hardware companies in particular have access to a lot of talent that can sort out these optimizations for them, so even if it sounds crazy, usually there's a way to do what they're saying they can do.
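The streaming idea this reply describes — keep the full model on slow storage and hold only the layer currently being computed in fast memory — can be sketched in a few lines (a toy numpy stand-in with hypothetical names, not the actual paper's code):

```python
import numpy as np

def load_layer(layer_id: int, hidden: int) -> np.ndarray:
    # Stand-in for reading one layer's weight matrix from disk/storage.
    rng = np.random.default_rng(layer_id)
    return rng.standard_normal((hidden, hidden)).astype(np.float32)

def streamed_forward(x: np.ndarray, n_layers: int) -> np.ndarray:
    """Forward pass holding only ~one layer's weights in memory at a time."""
    hidden = x.shape[-1]
    for layer_id in range(n_layers):
        w = load_layer(layer_id, hidden)   # stream this layer in
        x = np.maximum(x @ w, 0.0)         # toy layer: matmul + ReLU
        del w                              # free "VRAM" before the next layer
    return x

out = streamed_forward(np.ones((1, 8), dtype=np.float32), n_layers=4)
print(out.shape)  # (1, 8)
```

Peak memory is one layer's weights plus activations instead of the whole model; the cost, as the comment notes, is that every forward pass pays the full storage-to-memory transfer time.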
@user-tg2or8wy2n 2 months ago
@@novantha1 Thanks for the response! I am aware of these variants where you only partially load certain parts of the model; there's also the option Apple is pushing of storing most of the model in flash and loading only what's necessary higher up in the hierarchy. But I don't think many of these options (if any) actually work in practice, right? Not that they aren't technically possible, but they're barely practical; the latency caused by the continuous up/down streams of data between memory layers is too much to handle, am I right?
@novantha1 2 months ago
@@user-tg2or8wy2n It depends heavily on the (no pun intended) context. In the case of the paper I mentioned where they ran a 70B model on a 4GB GPU, I'm pretty sure they had extremely slow storage, relatively speaking. A typical GPU for ML has between 500 and 2000 GB/s of memory bandwidth, but the storage they used had 0.3 GB/s (hard drive or SATA SSD, IIRC).

On the other hand, if you were planning a piece of enterprise hardware (Cerebras), my assumption is that for a chip that costs over $1 million, shelling out for a RAID array of 32 PCIe gen 5 SSDs (320 GB/s, 64 TB of capacity easily) is by no means insane, and I wouldn't be surprised if they had even faster information delivery systems. In a case where you're doing something in enterprise and plan it out, there's no reason that streaming "cold" parameters (i.e. the unchanging values that are multiplied against) couldn't be doable on modest GPU hardware, or a waferscale engine, or what have you.

My assumption is that the waferscale engine is fast enough to handle the active parameters in memory and has a well-thought-out delivery system for streaming new data in as it's needed. I could be wrong, but lacking one would seem like a large oversight given the scale of their operation.
@StefanReich 2 months ago
@@novantha1 1 hour for a forward pass... that's a very scalable solution 😄
@Laszer271 2 months ago
I thought you would talk about that light-based hardware. The kind that uses photons rather than electrons to basically 100x the performance of chips.
@kylemorris5338 2 months ago
I completely forgot, until you pointed it out with arrows, that THAT'S the company whose CEO got outed as the account holder behind Beff Jezos. Not holding my breath that that guy holds the key to the next big thing since sliced transistors.
@invizii2645 2 months ago
Nice
@H1kari_1 2 months ago
Whatever happened to Mythic AI Chips?
@monad_tcp 2 months ago
1:54 Why do I care as a consumer? When the AI spending hype is over, I want to have realtime raytracing in my games!
@cdkw2 2 months ago
Every time one of these videos rolls around, I think: why am I even doing my CS degree? Why not just become a math teacher?
@cdkw2 2 months ago
@@OverbiteGames oh
@kabargin 2 months ago
what about Tesla Dojo's D1 chip?
@DrW1ne 2 months ago
Good video. I am surprised you didn't talk about analog PCs for AI. Okay, you kinda did...
@latrechetaher6340 2 months ago
Aaah my brain
@pladselsker8340 2 months ago
I am so sceptical about Extropic. I'll just put them in the "physics-based voxel engines" and "flying cars" box and wait until I can buy a PCIe expansion card that reliably and purposefully accelerates generative models 1000x. I don't care anymore.
@rogerc7960 2 months ago
New Operating systems
@monad_tcp 2 months ago
No operating systems, just hardware and an external scheduler service
@Aurora-bv1ys 2 months ago
I want to be a part of the AI industry, how can I approach it?
@Raphy_Afk 2 months ago
Watch the choices that key figures made and do the same, preferably young researchers.
@JazevoAudiosurf 2 months ago
Like anything else, learn it from first principles. Learn how neural nets and transformers work, learn how to code, how to use the OpenAI API, etc.
@Raphy_Afk 2 months ago
@@JazevoAudiosurf Learning how to code at a low level may be a bad idea; at a high level, software engineering and algorithms are necessary, but I doubt coding in Python or C will be needed for much longer
@JazevoAudiosurf 2 months ago
@@Raphy_Afk Personally I chose TypeScript; any language works for the API. I did some neural networks but figured that understanding how they work is more valuable than implementing them (ML is hard). I just use the OpenAI API now for my projects
@rock3tcatU233 1 month ago
I took a shortcut, I slept my way into the industry.
@itzhexen0 2 months ago
Can't wait to see all of these people try to shove all of this into peoples lives. Grab the popcorn.
@carterprince8497 2 months ago
I don't see why you wouldn't just buy a 4090 instead of preordering the Mac thing.
@average_snmp_user 2 months ago
Funny how he didn't mention the fastest AI chip you can actually buy right now, AMD's MI300. And don't tell me the B200 is faster; I know, but if even Mark Zuckerberg can't buy it right now, then the B200 is not out, it is merely announced. Even on Nvidia's website you can't rent a GB200.
@monad_tcp 2 months ago
3:44 Hahah, TAKE THAT NVIDIA. Nvidia sells over-priced VRAM; someone actually did what I've been complaining about for years: why not use normal cheap memory from the system via a shared bus? (Yeah, I know about the Intel problem with PCs and the unified memory architecture that never comes; just don't use Intel then.)
@randomobserver9488 2 months ago
Because it's slow af. It works for LLM inference, where the problem is fitting the models in memory at all, but it would be insanely slow in most dGPU applications
@monad_tcp 2 months ago
@@randomobserver9488 No, I was talking about a computer that uses unified memory. On PCs, the problem is the bandwidth of the PCIe bus, which is never the same bandwidth as the GDDR. But GDDR itself isn't that much slower than normal DDR5. With smart caching done by the "motherboard", it's totally possible to pull that off. Nvidia really knows what they're doing with this myth that GDDR is expensive; it is not. They're the ones selling it as expensive, way above what it really costs, because you can't upgrade the memory.

That's why GPU memory paging to system RAM would be slow in a PC: there's not enough bandwidth in the memory controller inside the GPU, and even less on the PCIe bus, and I think that's on purpose. There's another detail: GPU memory is accessed 1024 bits in parallel, while an Intel CPU only accesses 512 bits in parallel. That's easily solvable using more memory channels.

With some cache and some driver smartness, it would be totally possible for an ARM CPU to share its RAM with the GPU. Unified addressing, as long as the GPU doesn't touch the memory directly, is doable. Not only is it doable, it has been done before: the PlayStation 5 runs like that; it has an I/O controller that sits between the memory and both the GPU and CPU to arbitrate the bus.
@monad_tcp 2 months ago
Alas, on mobile, all GPUs/CPUs share memory that way. It's not a matter of the main memory being slow; it's a matter of the architecture of the hardware. That's what I'm complaining about.
@monad_tcp 2 months ago
I want "VESA LOCAL BUS" back!
@randomobserver9488 2 months ago
@@monad_tcp CUDA has supported unified memory (single address space, on-demand paging) since about 2017, and it is limited by PCIe 4.0 x16 to 32 GB/s, which is about half of DDR5 bandwidth on consumer platforms. The GDDR bandwidth even on a mid-to-high-end gaming GPU is more than 10x the DDR bandwidth. You might as well use the CPU or iGPU when the DDR bandwidth would limit the dGPU to the same speed. The number of use cases that benefit from a large amount of very slow memory and need dGPU levels of compute is tiny.
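The bandwidth figures in this reply are easy to sanity-check with nominal spec-sheet numbers (real-world throughput is lower; the RTX 4090 figure is picked here as one example of a high-end gaming GPU):

```python
# Nominal peak bandwidths discussed in the thread, in GB/s.
pcie4_x16 = 16 * 2.0              # ~2 GB/s per lane x 16 lanes ≈ 32 GB/s
ddr5_dual_channel = 2 * 4.8 * 8   # 2 channels x 4800 MT/s x 8 bytes/transfer
gddr6x_rtx4090 = 1008.0           # RTX 4090 spec-sheet memory bandwidth

print(pcie4_x16)                              # 32.0 — matches the "32 GB/s" claim
print(round(ddr5_dual_channel, 1))            # 76.8 — so PCIe is ~half of DDR5
print(gddr6x_rtx4090 / ddr5_dual_channel)     # roughly 13x: the "more than 10x" gap
```

The ratios bear out both claims: PCIe 4.0 x16 delivers roughly half of dual-channel DDR5-4800, and high-end GDDR is more than an order of magnitude above system DDR.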
@user-zc6dn9ms2l 2 months ago
lol. The issue is the extremely monochrome nature of binary code
@superfliping 2 months ago
Build your team. Prove your LLM Super? 1. CodeCraft Duel: Super Agent Showdown 2. Pixel Pioneers: Super Agent AI Clash 3. Digital Duel: LLM Super Agents Battle 4. Byte Battle Royale: Dueling LLM Agents 5. AI Code Clash: Super Agent Showdown 6. CodeCraft Combat: Super Agent Edition 7. Digital Duel: Super Agent AI Battle 8. Pixel Pioneers: LLM Super Agent Showdown 9. Byte Battle Royale: Super Agent AI Combat 10. AI Code Clash: Dueling Super Agents Edition
@boxeryy6661 1 month ago
Why do you associate scientific theories with companies or any single organisation? Thermodynamic computing has been around for some time, and it has been pursued by academics too.
@axl1002 2 months ago
Tenstorrent cough cough...
@jyuseries8313 2 months ago
true? true random??
@joeyhandles 2 months ago
There was a major slowdown, and China is the only country that is going to push us forward