
CPU vs GPU: Which is More Powerful?

  108,387 views

Dave's Garage

1 year ago

Is an Nvidia 4080 faster than a Threadripper 3970X? Dave puts them to the test!
He explains the differences between how CPUs and GPUs operate and then explores whether the GPU can be leveraged to solve prime numbers faster than the CPU.

Comments: 395
@tails55
@tails55 1 year ago
5:56 correction: Numbers *can* have prime factors greater than their square root (e.g. 21=3*7, 7>5>sqrt(21)). It's just that every composite number has to have *at least one* prime factor less than or equal to its square root.
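The corrected statement is easy to verify by brute force. Here's a throwaway Python sketch (my illustration, not code from the video) that checks that every composite number has a smallest prime factor no larger than its square root, even when other factors (like 7 for 21) exceed it:

```python
import math

def smallest_prime_factor(n):
    """Return the smallest prime factor of n (n >= 2)."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f
        f += 1
    return n  # no factor found up to sqrt(n), so n is prime

# Every composite number has at least one prime factor <= sqrt(n).
for n in range(4, 10_000):
    spf = smallest_prime_factor(n)
    if spf != n:  # n is composite
        assert spf <= math.isqrt(n), n

print(smallest_prime_factor(21))  # 3, which is <= sqrt(21) ~ 4.58
```

This is also why trial division (and the sieve's marking loop) only needs to consider factors up to the square root.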
@DavesGarage
@DavesGarage 1 year ago
Good catch, thanks for clarifying!
@petergilliam4005
@petergilliam4005 1 year ago
Thank you! I spent way too long trying to figure out why the algorithm must work, given that the claim about prime numbers was mistaken.
@UmVtCg
@UmVtCg 1 year ago
I asked ChatGPT about this; it gave an example and stated that 3 is greater than the square root of 21: Prime factors of a number can indeed be greater than its square root. For example, let's consider the number 21. The square root of 21 is approximately 4.58. If we prime factorize 21, we get 3 * 7, where both 3 and 7 are prime numbers, and both are greater than the square root of 21. This illustrates that numbers can have prime factors that are greater than their square roots. It's just that, as per the fundamental theorem of arithmetic, every composite number must have at least one prime factor lesser than or equal to its square root.
@alexcourchesne2051
@alexcourchesne2051 1 year ago
@@UmVtCg I apologize if this is a dumb question; I understand everything up to both 3 and 7 being larger than the square root of 21. We established that 3 and 7 are prime factors of 21, and that the square root of 21 is ~4.58. Where I'm lost is that you're saying both 3 and 7 are greater than the square root of 21 (~4.58). Indeed 7 > 4.58, but 3 is not > 4.58?
@notarabbit1752
@notarabbit1752 1 year ago
@@alexcourchesne2051 You may have misread. They are saying that 21 has at least one prime factor (7) bigger than its square root, not that all of them are.
@whiterose7055
@whiterose7055 1 year ago
Thanks Dave, you never cease to amaze me with your ability to drill down into problems to reach practical solutions. Also thanks for the introduction to CUDA coding; I've been a fluent C++ programmer for decades, so this was right up my alley.
@susibakha
@susibakha 1 year ago
You're improving as a content creator, and are now my favorite one. I'm always eager to watch all of your new videos. 👍
@DavesGarage
@DavesGarage 1 year ago
Wow, thanks!
@coolbrotherf127
@coolbrotherf127 1 year ago
It's interesting that GPU programming isn't covered more often. I don't know many developers, especially the really old-school guys, who actually know much about it. My college professors knew more about programming Unix mainframes than GPUs.
@nathanfranck5822
@nathanfranck5822 1 year ago
Mostly due to general computing on GPUs being a relatively new thing; also, CUDA only works on Nvidia hardware, so you become super platform dependent.
@astroid-ws4py
@astroid-ws4py 1 year ago
I guess it's due to GPU programming being too proprietary. Also, only in recent years have GPUs gotten REALLY fast. CUDA is an Nvidia-only technology; there is an effort by AMD called HIP that strives to copy the CUDA API and make it applicable to both AMD and Nvidia GPUs. Maybe it will help change something. Also, most of what you would use this for is highly advanced stuff such as solving partial differential equations, simulating fluid dynamics, and similar physics-related areas, so it is much more niche, domain-specific knowledge than a general computing skill you learn in Computer Science.
@coolbrotherf127
@coolbrotherf127 1 year ago
@@astroid-ws4py True, in college I learned the basics in a computer graphics and simulation course and some in a machine learning course. Other than that, it wasn't particularly needed for a general understanding of programming fundamentals. It was treated more like something you learn if you need it.
@RiversJ
@RiversJ 1 year ago
It's possible to drive them with HLSL / GLSL as well, but those are explicitly graphics-focused languages, and mostly game / engine devs talk about them in their specialized forums, not on general programming boards. Half the knowledge isn't even written down anywhere; it's tribal knowledge only.
@0LoneTech
@0LoneTech 1 year ago
I guess this thread could use a gentle reminder that OpenCL, OpenMP and SPIR-V exist. You don't have to write everything in Glide. ;) I'd consider prototyping with CLyther or Futhark.
@pribeiro
@pribeiro 1 year ago
Please be aware someone posing as Dave is trying to scam us, saying we are the winner of the RTX 4080 and asking us to pay the shipping... (I was one of the users he tried to scam.)
@mohammedissam3651
@mohammedissam3651 1 year ago
These scams are everywhere
@aleksandrbmelnikov
@aleksandrbmelnikov 1 year ago
It's all over YouTube. You can hit the three dots next to their post and select [🏳Report], but they'll just create another disposable account. It's like swatting flies on roadkill.
@nene71286
@nene71286 1 year ago
Lol
@ame7165
@ame7165 1 year ago
I just wanted to say that it seems like you were born to do YouTube videos. You're really great at them. Keep it up! Thanks for taking the time to make great content; we all appreciate it!
@DavesGarage
@DavesGarage 1 year ago
Thank you too!
@Chris-ib8lw
@Chris-ib8lw 1 year ago
Absolutely love the content and what you do here, Dave. Your channel and Ben Eater's have quickly become my favorite computer science channels on YouTube. Keep up the great work!
@DavesGarage
@DavesGarage 1 year ago
Glad you enjoy it!
@lofasz_joska
@lofasz_joska 1 year ago
These two people helped me understand computers way more than my teachers in school ever managed... In retrospect, my former teachers seem lazy compared to the work these legends post on YouTube...
@Chris-ib8lw
@Chris-ib8lw 1 year ago
@@lofasz_joska 100%. Makes me wonder why I bothered to pay for the piece of paper on my wall to begin with, haha.
@UmVtCg
@UmVtCg 1 year ago
Great explanation of the differences between a GPU and a CPU, with some nice programming to elaborate. Great job Dave!
@kamtschatkas
@kamtschatkas 1 year ago
A few years ago I rewrote a program that was written in Mathematica in Java to run calculations in parallel and speed it up (a lot of matrix multiplications). For fun I also rewrote it in C# with CUDA support. Somehow the program finished so fast that I thought there was a bug. It took me some time to actually check the output and realize that it was correct; it was just blazingly fast...
@Drew-Dastardly
@Drew-Dastardly 1 year ago
Just imagine how cool it was in 386 days when we got a 387 "math co-processor" to do finite element analysis that would still take a day for the simplest task.
@liamconverse8950
@liamconverse8950 1 year ago
So with CUDA a GPU can do everything a CPU can?
@kamtschatkas
@kamtschatkas 1 year ago
@@liamconverse8950 No, it can't, but it can multiply a lot of numbers really fast, and that is mostly what my program did.
@liamconverse8950
@liamconverse8950 1 year ago
@@kamtschatkas What can't it do?
@Takyodor2
@Takyodor2 1 year ago
@@liamconverse8950 It _can_ do everything the CPU can, but it will not always be faster. (Well, it can do anything the CPU can in a computing sense; it can't process mouse movement or open files on a USB drive, since it lacks USB ports.) Many programs have large sections that are sequential in nature (the next step depends on the current step in some way), and in those cases the GPU will be a lot slower than the CPU.
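The sequential-vs-parallel distinction above can be made concrete with a toy example (illustrative Python, not from the video): a running sum can't be naively split across thousands of threads because each step needs the previous result, while squaring every element can be, since each element is independent.

```python
def running_sum(xs):
    # Inherently sequential: step i depends on the result of step i-1,
    # so GPU threads can't each naively compute one output element.
    out, acc = [], 0
    for x in xs:
        acc += x
        out.append(acc)
    return out

def square_all(xs):
    # Embarrassingly parallel: every element is independent, so a GPU
    # could assign one thread per element with no coordination at all.
    return [x * x for x in xs]

print(running_sum([1, 2, 3, 4]))  # [1, 3, 6, 10]
print(square_all([1, 2, 3, 4]))   # [1, 4, 9, 16]
```

(In practice GPUs do compute prefix sums quickly, but only via a restructured parallel-scan algorithm, not by parallelizing the naive loop.)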
@connecticutaggie
@connecticutaggie 1 year ago
Another interesting comparison is CPU vs FPU. A few years ago I was working on a project where we were using a PC to close the loop on a large number (>100) of servo motors. The goal was to close the PID loop on all of the motors at least 2,000 times per second. Everyone figured we had to use integer math to make the computation as efficient as possible, but then someone tried doing the computation with doubles instead, and to our surprise it was way faster (and, of course, way simpler). Always challenge your assumptions!
@pierQRzt180
@pierQRzt180 10 months ago
AFAIK integer math is faster than floating-point math? Furthermore, floating-point units are internal to the CPU (and GPU) nowadays; in the past (with separate math coprocessors) that wasn't the case.
@coolbrotherf127
@coolbrotherf127 1 year ago
Keep growing that beard out and you'd make a good Santa Claus.
@SeanBZA
@SeanBZA 1 year ago
Or Col. Sanders of KFC...
@chrisclawson296
@chrisclawson296 1 year ago
He isn't fat enough for Santa... good beard, though, agreed 😋
@jeffyp2483
@jeffyp2483 1 year ago
@@SeanBZA I was going to say 'or a chicken salesman' ;)
@volvo09
@volvo09 1 year ago
My uncle would grow a Santa beard every fall; he had the belly too 😆
@HalfbreedTrini
@HalfbreedTrini 1 year ago
I'll sit on his lap for a 4080 🤓
@wx39
@wx39 1 year ago
I'm absolutely loving these experiments you've been running. You're a really good presenter and make the topics much more approachable than I have seen elsewhere. Please keep up the cool tech and programming experiments. I love the puzzles they provide and being surprised by the results.
@PatrickStaight
@PatrickStaight 1 year ago
I stopped the video at 8:38 to think about how thousands of cores could write to memory all at once. My general guess is that the VRAM would have a clock and would only be able to complete one operation per clock cycle, in this case a write (or two ops in the case of hyperthreading, but anyway). Here are my theories so far:
• The VRAM clock is just thousands of times faster than the CUDA core clocks.
• There's some sort of high-impedance-state magic going on that allows multiple memory locations to change in a single clock cycle.
• The VRAM is actually a hierarchy of memory copies that merge together in a pipe.
• The CUDA cores are staggered somehow, so they aren't actually all trying to write at the same time.
I'll continue watching the video to see if this mystery is revealed.
@PatrickStaight
@PatrickStaight 1 year ago
The mystery was not revealed... Still a very informative video. I may need to do some Googling about how VRAM works now.
@0LoneTech
@0LoneTech 1 year ago
Hierarchy is basically right. Every compute unit has its own local memory, and there's a cache hierarchy before reaching the memory controllers, which are typically interleaved across the address space and use a whole bunch of distinct channels. Even within the compute unit, the multiprocessors often have operations across parallel threads of execution, like bitwise or. When you do get congestion (quite easy to do), your accesses will stall and stagger, which is often hidden by running other contexts at the time. Remember how the CPU runs a couple of threads per core? The GPU might do hundreds.
@stefanl5183
@stefanl5183 1 year ago
@@0LoneTech This raises the question: would this code actually run faster on some of the older data-center GPUs that have HBM2 memory? It might be interesting to try it on an old P100 or V100 and see how that compares. If memory bandwidth is a limiting factor, these cards, even though they are much older, might perform better.
@0LoneTech
@0LoneTech 1 year ago
@@stefanl5183 It might, depending on which cards you compare. The Radeon 6900 XT has about the same theoretical VRAM bandwidth as the Radeon R9 Nano (500 GB/s), easily beaten by e.g. the MI250X with 3 TB/s. However, local memory is far faster still (one or two orders of magnitude again), and the host bus (PCIe, now up to about 60 GB/s) far slower. The block version of this program should essentially operate wholly in local memory. Similarly, when running on a CPU, such a blocked version can keep all the bit-clear operations within L2 cache. There's no need for off-die RAM round trips where HBM's shorter bursts might help.
@Kristinapedia
@Kristinapedia 1 year ago
I just gotta say, I LOVE THIS CHANNEL!!! Aside from all the tech jargon (which I love, because why else would I be watching this channel), the videos are just simple (FWIW) and to the point. No flashy graphics, no loud over-the-top music. You speak fast enough to understand, but not so slow that I need to change the speed (like Technology Connections, but I love his channel too). You are well and clearly spoken. LOVE IT LOVE IT LOVE IT!
@DavesGarage
@DavesGarage 1 year ago
Awesome, thanks!
@Luredreier
@Luredreier 1 year ago
8:01 A CUDA core isn't a "core" per se; it's a better analogue to an execution unit within a CPU core. While yes, the GPU has more execution units than the CPU, the difference in actual *cores* isn't as big as people often think, and with good use of instruction-level parallelism you can get surprisingly close to GPU performance with CPUs. You usually have to buy expensive server-class CPUs to get computational capability equivalent to a GPU people have in their homes. Part of the price difference is that the CPU is more complex, but some of it is also just that server hosts are willing to pay a lot for them. OpenCL does a better job of exposing what is or isn't a core.
@ayoung7811
@ayoung7811 1 year ago
Dave, thank you for all of your work. As a bit of a geek and a bit of a teacher, I am impressed with your ability to take complex concepts and present them in a simple yet comprehensive manner. Your work here on this channel is an incredible resource for those aspiring to understand and grow in personal knowledge within computing. You are a blessing for those who have tried to receive understanding from "Computer Experts" but received only exasperated derision for their efforts. Keep up the great work!!!
@BleuSquid
@BleuSquid 1 year ago
This takes me back a bit. In the early days of CUDA, Nvidia themselves ported the SETI@home client to CUDA as a demonstration of CUDA and their cuFFT library. Afterwards, a bunch of us community members fixed up a few things. I adapted some of the initialization code to run on the GPU as well, since it was a several-minute-long function to generate an array.
@ceuabara159
@ceuabara159 1 year ago
Awesome video Dave, it was great to see this in action. 😊 Loved the intro and the tunes, haha.
@zacmitchell_1984
@zacmitchell_1984 1 year ago
Thank you for explaining the differences between a CPU and a GPU. Congrats to the winner!
@alliwantedisapepsi1492
@alliwantedisapepsi1492 1 year ago
Amazing example showing the difference between CPU & GPU. I really had no idea, and I started on a 486-66. Thanks.
@jeremywillis3434
@jeremywillis3434 1 year ago
I love that this channel bridges some of the gap between hardware and software. They have always been two vastly different worlds in my head.
@VysesDarkheart
@VysesDarkheart 1 year ago
Awesome video! However, I noticed that the giveaway rules changed from when it was first announced in the 'World's Smallest PC + 4080 GPU Giveaway!' video. Initially, we were asked to leave a compelling comment, and around 3,200 people participated. I must admit, I'm a little bummed about the change, but it's not a big deal - you're still cool in our books!
@AlwaysCensored-xp1be
@AlwaysCensored-xp1be 1 year ago
Finally an explanation I understand. I have to learn by visual means, and that picture makes sense.
@DavesGarage
@DavesGarage 1 year ago
Thanks! I was hoping the grid would make it clear!
@cliff8675
@cliff8675 1 year ago
@@DavesGarage Yes, that grid really did explain it, and along with a few details in the comments I now see how the square root works here as well. It's amazing how the right diagram or comment can clear up almost any question. The real trick I've found is listening for that comment.
@andersrimmer6675
@andersrimmer6675 1 year ago
I've made a Borland C / Aztec C compatible, ANSI C-compliant sieve as per the rules, which calculates primes to 1,000,000 at a rate of 2,900 iterations per second on my i9 13900KF. It's nowhere near the Zig implementation, but still nearly twice the speed of the solution offered as the best C implementation in the previous episode - AND it runs on my IBM 5150 just fine, with no adjustments, albeit somewhat slower, and needing space in RAM for the tracking. I think I should do a pull request to add my code to the list. Oh, and by the way, I forgot about the 4080 competition in the process of coding. I just had WAAAY too much fun for an old man to remember other minor details 👴🏽🤓🦤
@Trefall
@Trefall 1 year ago
Nice breakdown; it makes it easier for the average person to understand how each processor works. Very educational. 👍
@notation254
@notation254 1 year ago
I'm so glad the YouTube algorithm made me stumble across this gem of a channel. Great stuff!
@McGuire40695
@McGuire40695 1 year ago
Been having your videos pop up in my recommended feed for the past week or so, and I'm loving them! Computer science always interested me, but I never dove deep into it, so my understanding isn't too deep. Love the content, Dave!
@johndoh5182
@johndoh5182 1 year ago
This is neat and all, but IMO it's not CPU vs. GPU, because the code requires the CPU to run some part of it, which is the limitation of the GPU in different types of calculations. So it's CPU vs. CPU+GPU in highly parallel tasks where decision-making is minimal or absent.
@vitalyl1327
@vitalyl1327 1 year ago
Not really; it would have been better to run the first step on the GPU as well to avoid a significant PCIe transfer latency in between. It's quite common in GPGPU computing to have steps running on a single GPU core (very inefficiently vs. a CPU) simply to avoid costly transfers.
@johndoh5182
@johndoh5182 1 year ago
@@vitalyl1327 PCIe gen4 x16? Do the math to see the bandwidth. As far as I understand, you don't get DMA with a GPU. Am I wrong about this point? The initial data transfer to the GPU would be TRIVIAL, so much so it's not worth mentioning. If it were an issue, the results would show the latency problems. But when you can clock a single PCIe lane at 2 GB/s with gen4 and 4 GB/s with gen5, no, just no. Latency isn't an issue. Either way, the CPU is going to call data before it goes to the GPU. Isn't that what DirectStorage is supposed to change? And what kind of decision making can the GPU do? You read about GPUs and they can't do general-purpose computing. Well, that's typically a lack of ability to make decisions, or very limited decision making. It's a ONE-task system. It can process data and give that data back to you, but it's not making decisions based on that data. This is why the GPU cores, in this case Nvidia's CUDA cores, are very tiny. They exist to calculate and pass data on. So I don't see how you can get the GPU to do this entire task. But please, this is a lack of knowledge on my part if I'm incorrect about this. Do you have access to the dev tools, and can you list out the huge instruction set for CUDA? All the branches it can do?
@DavesGarage
@DavesGarage 1 year ago
Can you build a PC with just a GPU and no CPU? Then I'll run it on that :-)
@johndoh5182
@johndoh5182 1 year ago
@@DavesGarage I'm not being critical; I'm just pointing out that on a PC you need both to run such a task, because a GPU isn't a general-purpose compute device, at least not in a PC. I don't know all the details of server GPUs. I'll ask you the same question, with a slight modification: CAN you run a PC with just a GPU? In other words, does a GPU actually have an ISA that allows it the full complexity of running an OS and loading applications? I understand a GPU to be a single-task device with tiny cores that do math, and that's about it.
@vitalyl1327
@vitalyl1327 1 year ago
@@johndoh5182 Bandwidth is irrelevant for small buffers; there is a cost of about 1 ms just to initialise a PCIe transaction. If you have many CPU steps in your pipeline, it's not worth it, and it's better to do it slower on a GPU. Not sure what you're talking about regarding "decision making". There is a performance penalty for divergent branching on GPUs of this architecture, but in this scenario only one GPU core is working, so divergence is irrelevant. It is just a slower CPU.
@BrianJones-wk8cx
@BrianJones-wk8cx 1 year ago
Always love spending time in the garage. Also just noticed the "DANGER MEN COMPUTING" sign, ha!
@driver_8
@driver_8 1 year ago
Good video Dave. Enjoyed your interview on The Retro Hour Podcast today.
@David_Crayford
@David_Crayford 1 year ago
I know the difference between a CPU and a GPU, but I didn't know anything about CUDA code, nor that you only need to check factors up to the square root, so I learned something here. Thanks!
@wictimovgovonca320
@wictimovgovonca320 1 year ago
Now this one was not at all a surprise. "GPU" is a misnomer, because there are many non-graphics applications where this technology excels. Turning back the clock many decades, we had what were called array processors, or vector processors. The modern GPU is an evolution of that technology.
@AliNoh
@AliNoh 1 year ago
I appreciate how you deliver a ton of useful information in a way that's easily digestible
@naukowiec
@naukowiec 1 year ago
Thanks for a fun comparison; it is interesting to see how a SIMD (single instruction, multiple data) version compares to the CPU. If you split your data into float / int chunks, you can exploit the separate floating-point and integer arithmetic cores on the GPU, but I'm probably going beyond the practical parts of the CUDA API here... It would be interesting to see whether the float or int64 versions are faster too [ __dmul_rd() vs __umul64hi() instead of your factor * factor ]. Note that if you are defaulting to uint64 for all ops, you are running slower than you would on pure double-precision ops. Also, the latest GPUs allow tensor operations, so you could probably pre-sieve factors of low primes if you treat the input as an n-d tensor. The op is faster, but I'm not sure it would be efficient at these relatively low op counts. If anything, I would try replacing your integers with double-precision floats and see how much faster it gets; the compiler should optimize it to _dmul rather than _umul then.
@SmiliesGarage
@SmiliesGarage 1 year ago
This is great! I have been playing around with CUDA, and am applying it to my current research and development on assured positioning and timing. Modeling fast vehicles requires a fast update rate to accurately extrapolate the next position solution. Having each GPU core work on solving a specific mathematical problem helps speed up the time it takes to solve a new point. Also, I currently hate your Python script!
@jerryseinfeld6283
@jerryseinfeld6283 1 year ago
Never would have thought I'd be watching how to code a GPU to solve primes at 1 am
@KobkG
@KobkG 1 year ago
Another great video in a format that's easy for me to understand. Thank you for all your great work, and I look forward to learning much more with your help.
@UncleKennysPlace
@UncleKennysPlace 1 year ago
I come here (and to Ben's channel) to pick up the details that I missed, being self-taught. I worked at several engineering firms, and for the DoD, for a couple of decades (until the influx of outsourced "code by rote" workers killed the wages), as well as supporting two commercial packages of my own. I didn't know what I didn't know, and in my case, that was an advantage! Now I can learn the whole story.
@4Nanook
@4Nanook 1 year ago
For finding prime numbers I'm going to vote for the GPU, because this is a problem that can be solved with integer math, and the GPU allows massive parallelization of that.
@ReevansElectro
@ReevansElectro 1 year ago
Well presented as usual! I am keen to get my hands on that GPU. Thanks.
@TernaryM01
@TernaryM01 1 year ago
I still find it impossible to believe that Dave didn't handpick the winner. The algorithm he's talking about is literally called "Sieve of Eratosthenes"! Still, congratulations to the winner!
@andrewkepert923
@andrewkepert923 1 year ago
Same. But all power to him if he did - it's a great choice, and it's his channel.
@zwettemaanbenjamin
@zwettemaanbenjamin 1 year ago
Small nitpick on what was probably a slip of the tongue: numbers _can_ have a prime factor larger than their square root, but they cannot have _only_ prime factors larger than their square root. At least one prime factor has to be less than or equal to the square root.
@zwettemaanbenjamin
@zwettemaanbenjamin 1 year ago
Duh. What MishaMingazov said...
@DavesGarage
@DavesGarage 1 year ago
Correct, thanks for the clarification!
@Petch85
@Petch85 1 year ago
It has been on my list for years to try some matrix calculations on GPUs... but I have never gotten around to it. Maybe this video will give me the push I need. The timing is good, with a weekend coming up. 🤞
@kr0m-san
@kr0m-san 1 year ago
Dave, your channel is one of those most imbued with technical expertise and exciting content!!! Your effort and knowledge really show in the quality of the material and the production; I'm always excited to watch a new video. It is really intriguing to think how computing with a large number of CPU threads on Windows and Linux (user/kernel mode?) (and a custom bare-metal OS?) would stack up (with/without the hypervisor?), to see how the overhead (schedulers in the kernel, interrupt servicing, power management, hypervisor timeslicing the partitions) would get in the way :)
@sfbshoccho
@sfbshoccho 1 year ago
I think drawing the graph with shorter times lower ("lower is better") would have been more intuitive to follow.
@BillMurey-om3zw
@BillMurey-om3zw 1 year ago
I had to debate this 2 months ago; happy to see validation 😊
@Spielix
@Spielix 1 year ago
Not sure why you are using either "blocks" or "threads". To efficiently use the GPU, you will need to combine both. I might create a pull request later.
@DavesGarage
@DavesGarage 1 year ago
Have at it! The memory contention can be a bear though.
@Gregi555
@Gregi555 1 year ago
Hi Dave, you had an RTX 4080 for the testing, but you mentioned 20 GB of VRAM. This card has just 16 GB, so you must also have an RX 7900 XT somewhere behind the scenes (in your PC) ;)
@hancockautomotive1
@hancockautomotive1 1 year ago
*fingers crossed* I'm a subscriber! I'm hoping I'm a lucky one. Nice test build, and very effective benchmark metrics to showcase how each chip is meant to be utilized.
@CODEDSTUDIO
@CODEDSTUDIO 1 year ago
Nice explanation.
@GamingHelp
@GamingHelp 1 year ago
Holy crap! A SuperPET! Before my health went to hell and I retired, I came across six working SuperPETs, and I took one to the office. We stuck it in a corner to let random coders write little ditties on it. No disk/storage, so if the power was yanked, it was gone, but it was neat having a communal machine for coders to try and one-up each other on. :)
@Devo_gx
@Devo_gx 1 year ago
Your content is amazing, and I'm so glad I stumbled upon your channel a while back.
@MikeHarris1984
@MikeHarris1984 1 year ago
The 4080 GPU is years newer than the 3970X Threadripper, and the new Ryzen generation outperforms the Threadripper too. You should upgrade, then do the same test.
@DavesGarage
@DavesGarage 1 year ago
Got a contact at AMD? :-)
@LA-MJ
@LA-MJ 1 year ago
There is no newer non-Pro Threadripper, though. The market segment seemed to be dead, but Intel has finally released some HEDT stuff.
@CoolFire666
@CoolFire666 1 year ago
Not bad. I distinctly recall my first attempt at CUDA was orders of magnitude slower than the CPU version. It definitely takes some work to even get a real performance gain.
@RiversJ
@RiversJ 1 year ago
There should be an absolutely GIANT sign on every introductory GPU programming tutorial telling you to read up on SIMD architecture. Even an amateur who understands the architecture will create more performant code than someone with decades more experience in the language used. One can't parallelize if one doesn't understand what it actually means in terms of the hardware it's running on.
@PatrickStaight
@PatrickStaight 1 year ago
@@RiversJ Do you have a good resource for understanding VRAM? Particularly: how do thousands of CUDA cores all write to memory at once? I wrote a comment with some guesses elsewhere in this thread. I could paste it here if you are up for it.
@DavesGarage
@DavesGarage 1 year ago
Do you know SIMD? Why haven't you contributed a SIMD solution to the GitHub Primes project?
@WildRapier
@WildRapier 1 year ago
@Dave Going back a decade or so, there used to be "math coprocessors" for FPU calculations; is that what evolved into modern GPUs? My first dedicated card was a 4 MB Diamond Stealth II, and it changed the world for me. I do remember initially having 1 MB of onboard VRAM, but I think math coprocessors were before that. Probably closer to the time of "Hercules"... black and green... my kids will never understand!
@nikolaypaev6714
@nikolaypaev6714 1 year ago
I believe nowadays math coprocessors for floating-point operations are part of the CPU. The GPU is a different kind of coprocessor, optimized for matrix multiplications and the like, which can be easily parallelized.
@tyrantworm7392
@tyrantworm7392 1 year ago
Exactly that; they are big parallel FP32 & FP64 processors with input/output for specific instruction sets. Nvidia has considerable driver abstraction, but because you address it via CUDA, that's only an academic objection. OpenCL is similar but GPU-agnostic, with all the major players engaging with the Khronos working group. There have been some crazy advancements in throughput; doing one GPU's worth of FP grunt 25 years ago would have taken a supercomputer. The throughput enables more granularity/resolution for a lot of calculation types. I appreciate the "decade or so" :D
@niallrussell7184
@niallrussell7184 1 year ago
Good old days of the 8087, 80287, 80387, 80487. I remember piggybacking a 287 so I could raytrace back in the day. GPUs started out as 3D accelerators for vector/matrix math - T&L (transform and lighting). More features were added by Nvidia, like CUDA, which supports image processing, PhysX, etc. I think everything from the Pentium onwards had a built-in co-pro, ending the whole SX / DX chipsets.
@0LoneTech
@0LoneTech 1 year ago
FPUs were merged into the CPUs and extended for streaming and vector operations - e.g. MMX, 3DNow!, Neon, SSE and AVX. GPUs are more closely related to DSPs, extended for embarrassingly parallel tasks. There's a bunch of overlap, of course; we had accelerating processors on graphics cards long before GPUs existed (Nvidia marketed the term when they put transform and lighting into graphics accelerators), and FPUs optimized specifically for 3D, like the IIT 2C87, which could do 4x4 matrix by 4-vector multiplication. The Raspberry Pi uses a DSP called VideoCore IV to do its GPU work.
@stefanl5183
@stefanl5183 1 year ago
No. The FPU, or math coprocessor, got incorporated into the CPU with the 80486, although there were cheaper 486 SX units sold that had the internal FPU disabled. Anything from a 486 DX up has the FPU incorporated. GPUs started out as 3D accelerators, mainly focused on drawing polygons on screen as they should appear in a 3D environment. But many of the early "GPUs" were not really much like what we call GPUs today; for example, the famous 3dfx Voodoo cards were basically just two ASIC chips that did drawing and texture mapping. Anyway, the modern era of GPUs was ushered in by Nvidia with CUDA.
@God.Almighty
@God.Almighty 1 year ago
Very educational. He kinda gave away the winner pretty early on by mentioning that the GPU is optimized for repetitive tasks, and the sieve process is certainly very repetitive.
@hjplano
@hjplano 1 year ago
Hope your back is improving
@tcpnetworks
@tcpnetworks Жыл бұрын
Edge of my seat? I'm at a standing desk!!! :P
@TruWrecks
@TruWrecks Жыл бұрын
Great video. Good topic. I wish my AMD 6900 didn't crash so much doing heavy tasks. As a veteran on disability it would be nice to have better Hardaway to play with. Prime numbers are my favorite in math, but binary is what I use most with networking.
@jacoblf
@jacoblf Жыл бұрын
this is the first episode I've had trouble following. The math went over my head. Still, excellent content.
@hishnash
@hishnash Жыл бұрын
Interesting to run a GPU thread per prime; does this not result in memory contention when it comes to writing the flag, as multiple threads will be doing this at once (to the same memory pages/even the same number)? Did you consider inverting this: instead of giving each thread a single prime that it uses to scan the entire sieve, give each thread a dedicated subsection of the sieve and the full list of primes. This would avoid memory contention on the writes (no need to use atomics etc. as no other thread will be writing flags for this region of the sieve) and it will mean the writes are more local, which will help a lot with cache locality. In addition, splitting up the threads by sieve region means you can get better utilisation, as even if you do not have enough primes to fully occupy the GPU, you can reduce the amount of the sieve you give to each thread to fully saturate it.
@JonathanSwiftUK
@JonathanSwiftUK Жыл бұрын
"smaller, independent operations" - there's the rub. Only things which support parallelism. In our normal lives parallel operations can cause us to do a whole lot more checking to prevent blocking. Linear looping code has a simpler (less can go wrong) nature; once you decide to exploit the do-everything-at-once approach, well, hold on to the seat tightly in case the plane spirals into a rapid descent due to unanticipated and unexpected code clashes.
@asvarien
@asvarien Жыл бұрын
Could have sworn he said "we're not interested in how many times a multi-core processor can solve CRIMES in parallel"
@gledigondullinn
@gledigondullinn Жыл бұрын
This channel is the ultimate cozy place.
@rashie
@rashie Жыл бұрын
Excellent content, as always! Thanks!
@properjob2311
@properjob2311 Жыл бұрын
This was very interesting and informative. You clearly explain things.
@TheRealStructurer
@TheRealStructurer Жыл бұрын
This one was a bit deep for me, but I still like it. Good to know where your limits are. Thanks for sharing 👍🏼
@rvkasper
@rvkasper Жыл бұрын
this was an awesome watch!
@coopta7441
@coopta7441 Жыл бұрын
Great video! Are you planning on doing any neural network/LLM/diffusion etc. videos?
@j777
@j777 Жыл бұрын
Congrats to the winner!
@wngimageanddesign9546
@wngimageanddesign9546 Жыл бұрын
This was a great eye-opening video!!
@boydpukalo8980
@boydpukalo8980 Жыл бұрын
Another fascinating video.
@philmarsh7723
@philmarsh7723 Жыл бұрын
I feel that most people leave out the most important distinction between GPUs and CPUs. Multi-core processor (CPU cores): each core executes a different instruction stream on different data. GPU cores: all cores execute the same instruction stream synchronously, but on different data for each core.
@philmarsh7723
@philmarsh7723 Жыл бұрын
Then you have to worry about getting data to and from each core. Most CPUs and likely most GPU cores are limited by memory speed on many real-world problems.
@KevinDC5
@KevinDC5 Жыл бұрын
Wish you were in Texas, we'd make great neighbors. (Ext. LED home lighting.) Your vids on the WS2812s helped me so much and I'd love to try your "magicLED" board.
@mackfisher4487
@mackfisher4487 Жыл бұрын
After trying to follow Dave, I just feel stooped
@ironman5034
@ironman5034 Жыл бұрын
Ah nice, I am enjoying the videos again!
@johnrusselsmurf4842
@johnrusselsmurf4842 Жыл бұрын
Things I never wanted to know... 'til Dave brought it to my attention. Lol
@Horus9339
@Horus9339 Жыл бұрын
Well done winner of the GPU, you are the chosen one. Thanks Dave.
@xeridea
@xeridea Жыл бұрын
CPU has an abnormally high time increase jumping from 100M to 1B, likely due to 100M needing only 12.5MB of memory, easily fitting into the 32MB cache (per CCD) on the CPU. 1B would require 125MB of memory, exceeding the 32MB of cache. Does the CPU use a segmented sieve? Seems it should be faster. Just tested a very basic C++ implementation on a 5800X with slower 2666 memory: 2.6 seconds, so it would be perhaps 3-3.5s on a 3970x. This is only a 2x speedup from 32 cores / 64 threads. Noteworthy, though, is that the test is using a CPU whose architecture was released in July 2019, nearly 4 years ago, while the GPU tested is brand new. Obviously the GPU still wins, but it would be a smaller gap.
@connecticutaggie
@connecticutaggie Жыл бұрын
As you mentioned, once you get to 100 million (10^8), you have everything you need to get to 10^16 (10 quadrillion). So, after the CPU has searched 100 million numbers, could you change the CPU code from search mode to extend mode and process the additional 100-million-number pages one at a time, launching a GPU thread for the numbers you found on the search pass, then only save out the primes after each pass? It seems like the number of primes would be limited enough to fit in the CPU's memory - so, no paging.
@mikepict9011
@mikepict9011 Жыл бұрын
The new DirectX 12 update with Agility and heap ReBAR enabled, using my second GPU as my program GPU and the GPU plugged in as my rendering/physics GPU. Load balancing, because it switches back outside of the programs. Victory. Worked on that for weeks.
@warrenk9587
@warrenk9587 Жыл бұрын
Great video! Thank you for sharing.
@NullStaticVoid
@NullStaticVoid Жыл бұрын
HAHA, I did a prime sieve on my Atari 800 back in the day. It took me a long time since I couldn't afford the programming books at Radio Shack, so I'd go to the store and take notes, then go home and try stuff. I also spent way too much time trying to make it look cool before I realized that was going to rob cycles from how fast it could go, so I had to re-write it to be plainer and just update when memory was getting full. Wish I stuck with that, as well as my early experiments in making music on the 800. Instead I got obsessed with trying to program games on the Atari. I'd read somewhere about some kid who wrote a game and made a bunch of money. So I was hypnotized by the idea of getting rich from video games. But I really had no resources to know how to do it. Collision detection, how does that work? If I'd stuck with the nerdier coding and making programs for music, that could have gone someplace.
@lborate3543
@lborate3543 Жыл бұрын
So are you starting with an array of all numbers and then removing the multiples from the array as they come back from the GPU? I see how you can generate the non-primes quickly, but how are you marking the primes at the end?
@lborate3543
@lborate3543 Жыл бұрын
Anyone can answer obviously.
@Rhythmattica
@Rhythmattica Жыл бұрын
Task in hand, Is a Task indeed.
@jordanlake9151
@jordanlake9151 Жыл бұрын
I love this stuff, finally an explanation of C++ CUDA programming
@mrrolandlawrence
@mrrolandlawrence Жыл бұрын
Stuff I love. Also remember doing this when I was a kid on a 6502. This is literally a million times better.
@walkabout16
@walkabout16 Жыл бұрын
CPU versus GPU, the debate has begun Which one's more powerful, which one's more fun? The CPU's the brain, the GPU's the brawn Let's take a closer look, and find out what's going on The CPU's a multitasking machine It's fast and efficient, like a well-oiled dream It handles all the everyday tasks with ease From browsing the web, to running office suites But when it comes to graphics and gaming The CPU's power can be quite limiting That's where the GPU comes into play With its specialized hardware, it's built to slay The GPU's like a co-processor on steroids It handles graphics and video like no other toy With its thousands of cores, it's built for speed It renders images and videos, like a true pro indeed So which one's more powerful, you may ask It all depends on the task at hand, it's no easy task If it's everyday use, the CPU takes the crown But for graphics and gaming, the GPU is renowned In the end, it's not about which is better It's about using the right tool, to make things better Both CPU and GPU have their place And together, they create a powerful pace.
@siahsargus2013
@siahsargus2013 Жыл бұрын
CUDA Cores... got a love-hate relationship with 'em.
@RenormalizedAdvait
@RenormalizedAdvait Жыл бұрын
So basically Mr. Cray was wrong when he said that he preferred two strong bulls to a thousand chickens to plough a field. Here the two strong bulls, i.e. the CPU, lose to a thousand chickens, i.e. the GPU, once the field is large. However, the big question still remains: who wins in a MIMD vs SIMD face-off with a comparable number of cores? Please shed light on the issue; your response would be much appreciated.
@8bit711
@8bit711 Жыл бұрын
Now this is content! Nice work. I am defo not clever enough to fully understand but Cheers.
@P-G-77
@P-G-77 Жыл бұрын
Dave, please: have you worked on the process of BIOS --> Windows commands (case buttons, fans, energy saving, etc.)? All this world (jungle)... I'm really curious to understand how these processes are managed from one operating system to another and from one hardware platform to another.
@Muhammad-sx7wr
@Muhammad-sx7wr Жыл бұрын
I would love to have that 4080 GPU, oh my God. So much AI, so little time.
@prempink12311
@prempink12311 Жыл бұрын
Very educational info, thanks.
@broccoloodle
@broccoloodle Жыл бұрын
This test mostly measures VRAM performance, not GPU computation. It's possible to get a 100x speedup.
@x7heDeviLx
@x7heDeviLx Жыл бұрын
I'm in desperate need of the 4080. Also I've been subbed forever and love your content
@8bit711
@8bit711 Жыл бұрын
4080! Man, my GTX 950 is looking a little slim. My OpenCV project would love that!
@erikp6614
@erikp6614 Жыл бұрын
Nice video! I like watching your videos!
@TheNoodlyAppendage
@TheNoodlyAppendage Жыл бұрын
6:00 I think you mean no composite number will have a smallest prime factor higher than its square root.
@MagicPlants
@MagicPlants Жыл бұрын
Dave! Please let me get that GPU! I just got a new job as Creative Director making 3D models and promo videos, web design, custom apps and more plus marketing responsibilities and my car just died... I need to replace the car and upgrade my desktop so I can render faster and it would really make a difference. I have a 3060 12GB and it's just not enough sadly. Thanks for all your excellent videos over the years.
@MagicPlants
@MagicPlants Жыл бұрын
I'm also a self-taught programmer whose story might be interesting to your viewers. I worked in blackhat hacking, making bots for a while in the mid-2000s, but eventually settled into a full-stack programmer role for the last 15 years.