120x Faster Algorithm By Nested Loops

  336,659 views

ThePrimeTime

1 day ago

Recorded live on twitch, GET IN
/ theprimeagen
Reviewed video: • Adding Nested Loops Ma...
By: / @depth-buffer
MY MAIN YT CHANNEL: Has well edited engineering videos
/ theprimeagen
Discord
/ discord
Have something for me to read or react to?: / theprimeagenreact
Kinesis Advantage 360: bit.ly/Prime-Kinesis
Hey, I am sponsored by Turso, an edge database. I think they are pretty neat. Give them a try for free, and if you want you can get a decent amount off (the free tier is the best (better than PlanetScale or any other))
turso.tech/deeznuts

Comments: 304
@MichaelAbramo 8 months ago
It takes a true legend to make a 15 minute demo on optimization that he probably spent a month making and then at the end say "but instead you can just use this library that's 8x faster"
@khhnator 7 months ago
But only if your problem is algebra... the thing is that you will always have cache performance issues. It's really a gold mine of free performance gains if you can optimize the hotspots of your code.
@chrism3790 7 months ago
Well, those libraries were created by someone who went through the optimization process in a much more thorough way than presented here, and that is interesting in and of itself.
@lesscommonsense1804 7 months ago
True legends give the sauces.
@pithlyx9576 7 months ago
@@chrism3790 Yeah, but optimizing for your specific use case will be more performant than a catch-all solution in most situations. Additionally, I think the video is also telling a story that can be applied to many sectors of optimization.
@chrism3790 7 months ago
@@pithlyx9576 I disagree with that. Most of the time, people overestimate their ability to hard core optimize things. It takes deep knowledge to truly extract every ounce of performance from your hardware. If you're working on well known problems, trying to do these things yourself is a waste of time that will produce suboptimal results, almost guaranteed. Ask yourself, would you be able to optimize matrix multiplication yourself, to the degree that it was done in this video? And remember, even the author admits there is more under the hood of the referenced implementation.
@christophervsilvas 8 months ago
I’m doing full stack coming from a game development background. As a game developer I had to deal with this stuff all the time and so when I got my first full stack job I wrote everything like a high performance C++ algorithm even though it was just a typescript nodejs server. At first I thought I was always just a bad programmer and didn’t even know it but now I am realizing good code means different things for different contexts. In game dev it means minimal operations to achieve best result. In web it means minimal complexity and maximum maintainability by the broadest set of programmers. It’s almost like you have to program for the “worst” programmer in the “room”.
@boastfultoast 8 months ago
If you don't mind me asking - how did you swap from game dev to full stack? What steps did you take to learn/fill in gaps for full stack?
@christophervsilvas 8 months ago
@@boastfultoast Hey, the full story is a little complicated (I didn't exactly "swap", it's more like I got a good job from someone who didn't care what I knew, they just knew I could make what they needed happen). Anyway, to answer your question I would really just say have a problem that you are interested in and solve it! Build something! I didn't really have the luxury of learning full stack in the abstract. I started by creating a Shopify app with a TypeScript NestJS backend and a NextJS admin UI front end. Redis for caching. MongoDB for the db. You have to be okay with writing everything badly for a while so you see the problems and go look for a solution. I can't tell you how many times I have suffered through ugly programming experiences only to find a feature I didn't know about that makes everything so much easier. After two years of development I am currently on our third major refactor. Jump in with a simple project and be okay with sucking and not knowing how much you suck. Side note: if you come from a typed language I would really recommend learning TypeScript. It was a huge help for me to get started. (If you don't come from a typed language I would still recommend TypeScript :D). I just recently "relearned" how important it can be to have a project to work on in order to learn. I saw the Primeagen talking about Rust and decided to give it a try, but couldn't really make heads or tails of it for a whole week... Luckily my boss told me to make a simple service for querying every zipcode in the USA against the AT&T and Verizon coverage maps. I started dissecting both AT&T and Verizon's APIs using Python as a quick mockup. Then I decided to try to create it in Rust. At first it was really really slow... but then it just started to fall into place and after 3 days I knew Rust pretty well. That all may or may not be helpful. If you have any more specific questions, feel free to ask!
@doc8527 8 months ago
I think you still got some wrong ideas from the video (not everything, but how you view web and games in terms of difference). The video is the opposite of what you are doing. For a given "specific size of dataset" or "specific data structure", sometimes simply dumping the nested array runs faster (node.js in your example; V8 does some cache magic behind the scenes), which is what a lot of devs are doing in web. Even though they don't realize it, many in-place optimizations can lead to slow performance because the requirements change constantly: your old in-place optimization only works for a specific size of data, and then the stupid slow nested solution outperforms yours. The web is quite dynamic, so premature optimization is pretty bad. If you are working on a local offline game, not a game server with drastically changing traffic, then your in-place optimization will be best, because you know exactly what you will get. For web server development, it's often hard to predict the incoming or outgoing traffic (as the db size grows); the stupidest slow solution can have the maximum reliability (regardless of size and structure) and flexibility later if you need to make a change. The video is about how optimization doesn't work uniformly, size-wise or structure-wise: some approaches work best only for a given range or a given structure. I have the opposite experience from the other side: I actually want to try game development coming from full stack (well, traditionally just software development). I constantly see game devs over-optimizing without knowing the reason, leading to unreadable and slow code, and they tend to be so pretentious that they think they are the smartest, like those leetcode obsessives. Many lack computer science knowledge, but hey, game devs are really passionate (more than traditional software developers by a margin), and I really like them doing creative things.
In the end it's case by case, person by person. I wouldn't generalize my experience to the entire field, whether it's game development or web development, or who is smart or not; it's up to the specific area/problem you are working on.
@BeefIngot 8 months ago
I want to know this too. Not because I think it's impossible at all, just curious. I would imagine for a motivated individual it just takes building your own zero-user project/service to a reasonable enough quality to feel confident you know what is going on, front to back @@boastfultoast
@martinvuyk5326 8 months ago
Yeah, I also started on a more C performance track and then fell into web dev. It's absolutely about maintainability. But there are corners here and there where I still get a kick out of optimization, especially now that I'm doing Data engineering on DBs. I had to reimplement something with numpy the other day and it was not really as readable as before, but I'll take 22 seconds over an hour every day lol
@CattoFace 8 months ago
16:10 "Transposing the walk" works without hurting performance only when it doesn't cause cache misses. Earlier in the video he had not accounted for cache size yet, so it caused misses; but now the block is in cache, so the transposition doesn't cause extra misses.
@DexterMorgan 8 months ago
I'm a prewatch viewer. I watch it once, write down notes, then watch it again and pretend I know the answers while crossing my arms and nodding my head with a smug look on my face. Classic.
@ThePrimeTimeagen 8 months ago
That is the way to do it
@YDV669 4 months ago
A fellow after my heart. I do that with PVR. Once the thing watches the movie I told it to record, I feel I no longer have to and can get on with my day without too much waste of time.
@notapplicable7292 8 months ago
Cache optimization has been really eye opening for me over the last year.
@petrkiyashko4248 8 months ago
Same here, and also branch prediction and cpu pipelining lately
@alexhiatt3374 8 months ago
same! I think people focus too much on "heap vs. stack" instead of "cached vs. uncached". the stack is way faster because it's basically always in cache, but heap usage doesn't necessarily need to be slow.
@astronemir 8 months ago
It’s also all wasted effort whenever the cache layout, cache sizes, or CPU instructions change.
@yegorzakharov8514 8 months ago
Cache me daddy outside
@TheSnowMan2023 8 months ago
@@Jim-gyswry i think it depends on the problem you are trying to solve and the framework/languages and packages/libraries you are using.
@PythonPlusPlus 8 months ago
21 GB/s equates to 5.25 GFLOPS because floats are 4 bytes, and every operation has to load a float from memory. So 21/4 = 5.25
@markkalsbeek5883 14 days ago
But it's multiplication, so you need to fetch 2 floats, right?
@martijn3151 8 months ago
The single most important take here: always profile your code. Don’t assume, measure.
@unbreakablefootage 7 months ago
Yeah, and look at the instructions generated.
@DFPercush 7 months ago
-O3 or -Ofast, and -march=native, are good for squeezing the most performance out of your compiler on your test rig. If you control the server, or in source distros, you can leave it that way, but for public binary distributions it's good to have a well defined set of features to target, like plain x86, -mavx, etc.
@martijn3151 7 months ago
@DFPercush Those compiler flags do come with a risk. I’ve seen games suddenly behave differently when using -Ofast, for instance, which is definitely not something you want when in the optimization phase. There is a reason optimization flags can be switched on and off; for some applications the shortcuts generated by the compiler are acceptable and the speed increase is worth it. For other applications, the speed increase isn’t all that much, but the risks are. So, be careful with using them would be my advice.
@blindshellvideos 4 months ago
@@DFPercush What does the -march flag do anyway?
@DFPercush 4 months ago
@@blindshellvideos -m is used for a bunch of different options. I think it means "machine" or something like that. One of them is "arch" = "architecture", in other words what kind of CPU to compile for. This is how you can cross compile for other types of machines. But "native" means whatever processor the compiler is currently running on. This will allow it to take advantage of any advanced features you have on your machine, like AVX, encryption acceleration hardware, etc. Otherwise, you can specify those features with flags like -mavx2. Run man gcc | grep -C 20 -- '-m' | less and look for x86 options.
@m4rt_ 8 months ago
16:40 If you are doing game programming and know what size the matrices will be, for example 4x4, you can just hardcode the multiplication. Since the size is set you don't need General Matrix Multiplication (GEMM), which is the algorithm he is optimizing. With GEMM you don't optimize for a specific size; you make something that can handle all sizes, but then it may not have the best performance. When you know that the size will always be 4x4, you can make it do it the most optimal way.
@heavymetalmixer91 8 months ago
Asking from a novice's perspective: Can you put an example in C++ here of how you would hardcode a 4x4 optimized multiplication?
@theairaccumulator7144 8 months ago
​@@heavymetalmixer91use a library like glm
@Ulchie 8 months ago
@@heavymetalmixer91 My assumption is that you just manually unroll any loops and have every multiplication laid out, probably using SIMD intrinsics to really optimize it.
@m4rt_ 8 months ago
@@heavymetalmixer91 I don't have a C++ version handy, but I have a Jai version. It should be relatively readable. (btw --- means that it is uninitialized).

multiply :: (m: Matrix4, n: Matrix4) -> Matrix4 {
    result: Matrix4 = ---;
    result._11 = m._11*n._11 + m._12*n._21 + m._13*n._31 + m._14*n._41;
    result._21 = m._21*n._11 + m._22*n._21 + m._23*n._31 + m._24*n._41;
    result._31 = m._31*n._11 + m._32*n._21 + m._33*n._31 + m._34*n._41;
    result._41 = m._41*n._11 + m._42*n._21 + m._43*n._31 + m._44*n._41;
    result._12 = m._11*n._12 + m._12*n._22 + m._13*n._32 + m._14*n._42;
    result._22 = m._21*n._12 + m._22*n._22 + m._23*n._32 + m._24*n._42;
    result._32 = m._31*n._12 + m._32*n._22 + m._33*n._32 + m._34*n._42;
    result._42 = m._41*n._12 + m._42*n._22 + m._43*n._32 + m._44*n._42;
    result._13 = m._11*n._13 + m._12*n._23 + m._13*n._33 + m._14*n._43;
    result._23 = m._21*n._13 + m._22*n._23 + m._23*n._33 + m._24*n._43;
    result._33 = m._31*n._13 + m._32*n._23 + m._33*n._33 + m._34*n._43;
    result._43 = m._41*n._13 + m._42*n._23 + m._43*n._33 + m._44*n._43;
    result._14 = m._11*n._14 + m._12*n._24 + m._13*n._34 + m._14*n._44;
    result._24 = m._21*n._14 + m._22*n._24 + m._23*n._34 + m._24*n._44;
    result._34 = m._31*n._14 + m._32*n._24 + m._33*n._34 + m._34*n._44;
    result._44 = m._41*n._14 + m._42*n._24 + m._43*n._34 + m._44*n._44;
    return result;
}

and here is a simplified version of the struct Matrix4

Matrix4 :: struct {
    _11, _12, _13, _14 : float;
    _21, _22, _23, _24 : float;
    _31, _32, _33, _34 : float;
    _41, _42, _43, _44 : float;
}
@oblivion_2852 8 months ago
I have a strong suspicion that the SIMD registers being wider than the data, with the added latency, would actually make it slower than just raw assembly on the memory addresses.
@sinom 8 months ago
No, simply walking differently would mean you'd also have to walk differently on the other matrix, so that would just shift the problem from one matrix to the other. The reason it works later on is because of that temporary copy being made.
@sinom 8 months ago
Relating GB/s to GFLOPS is something that's often done in high performance multithreaded programming. If you look at a CPU that can handle 64 GFLOPS, every one of the floating point operations will be using 2 8-byte numbers and then outputting 1 8-byte number. That would mean if every operation required its own new data to be loaded, the required throughput would be 128 GB/s. But your actual bandwidth is probably gonna be more like 8 GB/s. So you somehow need to try to fully utilize all the flops with much less data. In matrix algorithms this is often done by basically loading tiles one after another and then iterating over those. This will introduce more nested loops, but it will also significantly reduce the amount of data loading required per floating point instruction, allowing for much higher utilization of the CPU. This is called tiling, and by now a lot of compilers do a simplified version of it on their own, but they don't know your exact flops and bandwidth specs, so if you do know them you can get out even more performance than your compiler can. Basically the exact same thing is also done when doing matrix maths on the GPU, since there's also a huge bandwidth limit there most of the time. Just in the GPU case you usually use a library that already implemented it for you, like Thrust or rocThrust.
@BeefIngot 8 months ago
God, what a name those libraries have. You really just thrust that upon us.
@gauravnegi6857 8 months ago
​@@BeefIngotas a computer science noob I can't understand a lot, but that's really crazy amazing.
@drooplug 8 months ago
Matt Parker wrote a python script that he admitted wasn't very good, but it got the job done. It took something like 30 days to complete on his laptop. He shared it with his viewers and a number of people rewrote it to run faster. I think the first rewrite reduced the run time to a few hours in python. Eventually, someone submitted code in another language that got it down to something like 2 ms.
@GabrielSoldani 7 months ago
That’s pretty different, though. Parker used an inefficient algorithm. The community first found more efficient algorithms and then optimized it for cache hits and parallel execution. This video uses the same algorithm from start to finish.
@Z3rgatul 4 months ago
@@GabrielSoldani Are you sure the library doesn't use Strassen's algorithm?
@Impatient_Ape 8 months ago
The FORTRAN spaghetti code I used for research calculations back in the day used the LAPACK library, which, at the time, was the most optimized code for linear algebra in existence.
@abdullahahmad1642 8 months ago
It still is in most cases, though there's a Rust library called Faer being developed that looks like it could give it a run for its money...
@emanuelfarauanu1760 8 months ago
One of my tasks during an internship way back in 2018 was to do such optimisations in C++ on a Raspberry PI, I learnt so much from that. Really cool algs.
@tordjarv3802 8 months ago
I think this level of optimization is more common in scientific computing than in game development, since SC uses way more data (think petabytes to exabytes) than game development.
@0dsteel 8 months ago
Those probably use specialized hardware (for massively parallel computations), aka not the CPU. CUDA magic and kernels go brrrr
@tordjarv3802 8 months ago
@@0dsteel I'm in that field, and yes, accelerators such as GPUs are sometimes used, but you'd be surprised how much research code still runs on CPU only.
@0dsteel 8 months ago
@@tordjarv3802 I can imagine how 10^n vs 10^(n+2) ticks of simulation per second makes a difference; games are pretty shy about how much happens per tick in comparison.
@OREYG 8 months ago
Depends how you define scientific computing. In gamedev you usually have to crunch a pretty considerable amount of data in 16 ms, so such optimizations are mandatory. In scientific computing you are not bound by the constraint of having to output something in real time. Then there are some numerically intensive applications that cannot use the fastest available option (for example, they require a certain accuracy). Then there are libraries available to scientists that already employ some of those techniques. What I'm trying to say is that in gamedev it's done as a necessity, while in scientific computing it's an afterthought.
@0dsteel 8 months ago
@@OREYG Depends how you define constraints. A game engine doesn't have to produce 60+ frames a second, but it helps with sales. Take an engine that runs simulations: 1 second vs 2 minutes to simulate the same model for 1 year of delta time is huge. Little Tommy experimenting with python scripts for his PhD thesis is not really what comes to my mind at least. Edit: ;)
@viniciuspjardim 8 months ago
In my understanding he only stopped transposing the matrix at the end because of the buffers. Without the buffers, a transposed copy of the matrix leads to more cache hits. With the buffer we already have a copy, so he fuses two steps into one, transposing into the buffer; it's still a transposed copy when accessing the buffer memory.
@gregorymorse8423 4 months ago
Transpose load from cache vs transpose load from RAM. I thought the same.
@manofacertainrage856 8 months ago
Wow. I haven't seen this stuff for a while ... or felt the terror of implementing the first few optimizations for a grade. That's after two parallel programming classes in my master's that included OpenMP and many matrix multiplication algorithms. Took a CUDA course a few years afterward and some of the access/transpose problems are still a problem, but the CUDA libs are the answer there too. Awesome video!
@PhilfreezeCH 7 months ago
Last year I helped optimize the math library for my university's RISC-V multi-core microprocessors (you can also buy commercial versions called GAP8 and GAP9). It was entirely this stuff, just with Fourier transforms and so on. We started with C code and by the end almost all the inner loops were assembly; it was pretty fun honestly.
@woolfel 8 months ago
It's a good video. Lots of people have talked about how badly cache misses affect performance, but the video explained it in a simple way that more people can understand. In the Java world, lots of talks about mechanical sympathy have addressed this topic.
@PiratePawsLive 8 months ago
This was awesome. I knew quite a bit about how memory and CPUs work, but the optimization was a thing of beauty, even if the code looks like an aesthetic war crime xD. Love it. Now I'm interested in learning more about optimizing again :).
@devaereo 8 months ago
the code aesthetics are really not good. I'm downvoting it
@Kotfluegel 8 months ago
For small matrices up to four dimensions, you usually employ specific types for each dimension, because then you can literally unroll all the loops. So there's no need to use block matrices for graphics applications.
@m.sierra5258 6 months ago
I optimized this very problem in a university competition, and won the competition. One thing they missed in this video: line stride. Not every memory address can be put into every cache line. In most cases, something like modulo 256 is applied to the address to decide which cache bucket it gets put into. Sadly, a matrix row length of 1024 means that every item of a column gets put into the same cache bucket. So for us it increased throughput substantially to apply a padding of one cache entry to every matrix row. This has the effect that columns can be cached efficiently.
@suede__ 8 months ago
That's a lot of computer science.
@ArkDEngal 8 months ago
Loop ordering (locality, locality, locality) was so important that I had at least 3 classes in college that covered it.
@leschopinesns100 8 months ago
I've literally implemented vanilla GEMM in a parallelism course at uni, multiplying big-ass matrices.
@zkreso 5 months ago
Can't wait to implement this optimization in the ERP software I write at work
@alfred.clement 8 months ago
I was mind blown when I performed byte shuffling using x64 SIMD with the SSSE3 instruction set, specifically the pshufb instruction. I've eventually optimized reordering/transformation of complex data with vector instructions, resulting in a significant impact on real performance. It was about x88 faster... It matters A LOT, especially when you're dealing with low-latency operations.
@alexcristache8830 6 months ago
This is really cool. I had to do a very similar thing for one of my university courses called High Performance Computing. Basically simulate fluids using Lattice Boltzmann on a cluster of the university's supercomputer. Was done in C and had to pay attention to all the optimizations he talks about there. Memory access, cache, using OMP and so on. It was really cool, but extremely challenging. I guess this is used more in academia for different simulations like the one I mentioned.
@rogierderuijter7009 8 months ago
Never seen Prime this quiet. Amazing video.
@chaselewis6534 8 months ago
On the question of whether this scales for N=4 in games or just larger matrices: N=4 fits into 4 SIMD registers, so you can do the entire set of operations in register. In fact you generally load both into registers, do a 4x4 transpose on one in register, then do the multiply, since transposing a 4x4 in register is only a couple of instructions and lets you do all the multiply-adds in parallel afterwards. So there is another level of optimization at that point, but that is just because the matrices fit entirely in registers. Once you get outside fitting everything in registers, this is how GEMM works. There are more optimal algorithms though that get into fancier math and let the matrix calculation scale below N^3, more like N^2.4 or so.
@theoballet1519 7 months ago
Another optimization not mentioned here is using the superscalar nature of our processors by creating a temp vector that stores the products of elements from the two lines: the multiplications will be performed in parallel, and the sums will be performed once the prior multiplies are done. (This results in more code and more loops, but more perf too, especially for long lines, since you won't have too many branch prediction misses.) This is an optimization used by BLAS. Another thing could be using an int representation, since modern CPUs have more int functional units than float ones.
@filiformis 8 months ago
I encountered this stuff in my Computer Architecture class in college, but I haven't had the opportunity to use this knowledge yet.
@besknighter 8 months ago
In games, we rarely use huge matrices, if ever. However, depending on the game, the number of matrix multiplications itself is so so soooo big that it might be useful to optimize them this much.
@Blubb3rbub 8 months ago
Matrix Multiplication becomes even more interesting if you reach the "I can't reasonably fit my matrix into memory"-size of matrices and need to employ MPI and distributed memory.
@TQuantP 8 months ago
I wanted to watch this video for a while now, just sitting there on a tab so old it was beginning to gather dust. Thankfully now I can! Thanks Prime 😂
@ArkhKGB 8 months ago
Blast from the past: I had to do the same kind of thing in engineering school. Try to optimize matrix multiplication, be happy with the results (same thing about finding cache sizes, but no SIMD at the time). Then you whip out BLAS and it outperforms your work by one or two orders of magnitude. And yeah, those big matrix multiplications are useful in most simulations, for things like weather forecasting or fluid simulation.
@HyperFocusMarshmallow 8 months ago
Watched this video a while back. It’s really good.
@QayLikeKay 1 month ago
Even small cache lines are 32 bytes long, and a typical one is 64 bytes. A 4x4 float matrix is 64 bytes, so on a 64-byte-cache-line system a whole matrix can sit in a single line. People in chat saying you're wrong forget that pretty much everything in computing is a trade-off, and many optimizations come at the cost of some fixed amount of setup time (or a setup time that scales slower than the normal case). As one of my professors once said, "Bubble sort beats small children" (English was not his first language). He meant that bubble sort beats algorithms like quicksort on small data sets, because it can just go: it doesn't have to allocate extra buffers or do anything else. That said, "bubble sort beats small children" sticks in your brain way better.
@josephlabs 8 months ago
That’s crazy 😂. I aspire to be doing this one day soon.
@BleachWizz 5 months ago
5:13 - I think the idea is really to transpose the matrix in memory: we want the columns to be close together, so when the first multiplication reads the first block into cache, all the numbers needed for the next multiplications are brought into cache with it. Otherwise the indices would be spaced at intervals of the matrix dimension and you would read far fewer useful numbers at a time, up to rereading memory on every multiplication. 16:17 - The problem with walking is that you have to check whether you CAN do that. In this case, localA and localB are both brought completely into the cache, because now you're calculating different parts of the matrix at a time, so the result needs to fit completely in the cache as well, or THIS PART will start causing problems.
@josefpharma4714 5 months ago
Transposing the memory layout of a matrix can improve cache hits a lot.
@GRHmedia 5 months ago
This used to be the standard of what was taught in college when it came to programming. It is why they used to require you to learn ASM before you could take C and C++. A project I worked on 6 years ago was to create a large maze with many rooms attached, about 40,000. The first version would take days to finish. So I rewrote it and it was taking hours. Then I rewrote it again and got it down to 30 minutes. Another rewrite and I was down to 8 min, then 2.5 min, then 30 seconds. The final version: 0.9 seconds. That project ran purely on a single thread at 3 GHz on an A8-3850. I've seen much larger changes in performance: stuff that would run for weeks and months reduced to seconds. Most of those also involved a change in the language used as well.
@zataritamods7499 6 months ago
Transposing the matrix reminds me a lot of the concept of Swizzling, also doing the blocks reminds me a lot of the tiling method used by the ps2. It feels like this is generalizing the concept for matrix multiplication. Also really helps me conceptualize why someone would do something like this.
@tomaszrusnak5684 7 months ago
As a civil engineer I work with programs that do structural strength calculations on exactly these kinds of matrices, and the more data (the finer the finite-element mesh of the structure), the more accurate the results we get. For over 10 years the computations haven't gotten any faster, and computing matrices of 200,000 x 200,000 elements is already a challenge, not to mention that many solvers are written without using multiple processors. I've decided to take learning programming seriously (not just as a hobby like until now) in order to start optimizing solvers in the construction industry. Great video.
@Peregringlk 3 months ago
7:24 If you don't care about the order of elements in the array, you can remove an element at any position by swapping it with the last value and then popping. That way you don't need to "move" things around.
@GRHmedia 5 months ago
The stated amount a CPU can do assumes the program runs entirely inside cache. If the program has to reach out to memory for new data, then it is limited by memory bandwidth.
@oliverer3 7 months ago
This is the stuff that makes me question if I can even call myself a programmer. Any optimization I've ever had to do has been very basic stuff. I can't even begin to wrap my head around this. Then again I've never really done any compute-heavy programming.
@omdevs 7 months ago
This is amazing!
@enviidakid 7 months ago
lmfaooo this video got me in tears right now
@swannie1503 8 months ago
I’ve had to do quite a bit of cache optimization. Rust is very good for it actually. I enjoy the problems. Similarity matrices with hundreds of thousands of members' features. It makes a huge difference.
@shapelessed 6 months ago
3:20 - Technically, if you have a 32-bit float, which is 4 bytes, you can squeeze about "5 GFLOPS" through a 20 GB/s memory bus. I think that's the analogy here...
@thapr0digy
@thapr0digy 8 months ago
8:43 Prime realizes that this guy's maths is blowing his mind.
@7th_CAV_Trooper
@7th_CAV_Trooper 4 months ago
Two guys are having an argument over which compiled language is faster, and this guy walks up, transposes some matrices, lights the computer on fire, tells them it could be faster if he just walked the array backwards, and walks out of the room.
@HyperFocusMarshmallow
@HyperFocusMarshmallow 8 months ago
8:35 This is the point where you see in Primeagen's eyes how his brain got fried by the maths.
@zayniddindev
@zayniddindev 8 months ago
In the first minute, I figured out that this video is not for me (yet) xD
@russelllapua4904
@russelllapua4904 8 months ago
Didn't even get that far 😂
@llothar68
@llothar68 8 months ago
You are never too young to learn about cache locality. The matrix multiplication is just a use case, no need to understand that.
@shahbazkhan-ek7hp
@shahbazkhan-ek7hp 8 months ago
28 seconds 😁
@ward7576
@ward7576 8 months ago
2:58 that's about the type of optimization that a company doing WordPress plugins would ask you to do in your technical interview
@ivanjelenic5627
@ivanjelenic5627 7 months ago
Also, if you know something about the matrices you can decompose them and do stuff more efficiently for some operations.
@Felix-ve9hs
@Felix-ve9hs 7 months ago
I barely know anything about algorithms, programming, math, or optimization, but it's pretty impressive how knowledge about hardware can improve performance :)
@MrHaggyy
@MrHaggyy 7 months ago
I'm doing automotive stuff. For small matrices, reorganizing doesn't work that well; you load or create the matrix in the right order to begin with. What we do is "slide" across the matrices by rotating pointers over the memory we are given. This lets you fill your SIMD operation with the demand from multiple mathematical operations. Also, there are many hardware solvers for matrix multiplication that can run in parallel to the CPU.
@burloiumarian2665
@burloiumarian2665 8 months ago
Some examples of those libraries you mentioned for Node.js? Just curious :D
@devaereo
@devaereo 8 months ago
This video made me feel dumb, and now I want to go back and study linear algebra again
@mubin-ansari
@mubin-ansari 8 months ago
Prime without a hoodie? This is going to be a great article
@richardpowell1425
@richardpowell1425 4 months ago
The applications for large matrix multiplication include engineering and science. It's important to get fast code.
@kitastro
@kitastro 7 months ago
9:14 I like that music that comes in
@hannessteffenhagen61
@hannessteffenhagen61 8 months ago
For small matrices the main thing is that parallelising the work of multiplying two matrices no longer makes sense; the overhead of distributing the work would far exceed the actual work that needs to be done. Some of the other things mentioned in the video may still make sense, though.
@forinda
@forinda 7 months ago
I have never seen @primeagen agree with almost everything like this. This guy is fluent in math
@johanngambolputty5351
@johanngambolputty5351 8 months ago
Did I see a Manjaro terminal, hell yeah! (oh and something like astronvim too, respect)
@guywithknife
@guywithknife 7 months ago
Regarding “is that what I said before, that we can walk the matrix in a different order instead of transposing”: no, the earlier algorithm required transposition because of memory prefetching; the reason he can walk it differently in the later version is that the block fits entirely into cache this time. Walking in a different order doesn’t affect cache when the entire thing fits in cache. So while you did say that you could do this, it doesn’t work for the initial version, only for the block version.
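The blocked walk being discussed can be sketched as a tiled multiply in plain Python (the block size `bs` is illustrative; real code sizes it so the working blocks fit in L1/L2 cache):

```python
def matmul_blocked(A, B, n, bs):
    """Multiply two n x n matrices (lists of lists) in bs x bs blocks.

    Each block of A and B is reused many times while it is hot in cache."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # multiply the (ii,kk) block of A by the (kk,jj) block of B
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C

# Sanity check against a naive triple loop
n = 6
A = [[i * n + j for j in range(n)] for i in range(n)]
B = [[(i + j) % 5 for j in range(n)] for i in range(n)]
naive = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
assert matmul_blocked(A, B, n, 2) == naive
```

The result is identical to the naive version; the win in a compiled language comes purely from the memory access pattern.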
@seancooper5007
@seancooper5007 8 months ago
That's some for loop nesting right there
@mannycalavera121
@mannycalavera121 8 months ago
I'd just be happy to make the visualisations
@user-ud8hw4gp6t
@user-ud8hw4gp6t 2 months ago
11:47 this is called a native array, since it is fixed in memory and you can't push
@nexovec
@nexovec 8 months ago
It's probably true that you don't have to transpose the matrix if it's a float32 4x4, because the entire thing is 64 bytes, which is one cache line, so in that extremely common case it can actually be made even faster: you just load it into the SIMD registers directly from the definition, done.
@vladlu6362
@vladlu6362 8 months ago
Except that, if you are running m×n matrices and not n×n matrices, it's not in a single cache line. It's only in the same cache line if you can guarantee that they are all allocated at once, and that's not always the case.
@nexovec
@nexovec 8 months ago
@@vladlu6362 I only meant 4x4 times 4x4, which is a pretty common case... What do you mean by "allocated all at once"? Never would I imagine anything other than having them contiguously in memory.
@kazedcat
@kazedcat 8 months ago
@@nexovec Cache alignment problem: for example, you have non-vector data, like the determinant, that you need for your calculation over and over, so you want it stored next to your 4x4, but the determinant value will cause your 4x4 to straddle two cache lines.
@nexovec
@nexovec 8 months ago
@@kazedcat Sure, always #align 64 or whatever the thing was.
@DFPercush
@DFPercush 7 months ago
@@kazedcat That's the main advantage of a struct-of-arrays approach. The matrices can live in an alignas(64) array, and the det's or whatever can live in an alignas(4) array. You just have to write container.matrix[i] and container.det[i] instead of container[i].matrix and container[i].det .
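A toy Python sketch of the array-of-structs vs struct-of-arrays layouts described above (illustrative only: real alignment control needs something like C++'s `alignas`; the container name and fields are made up):

```python
# Array-of-structs: each record interleaves a 64-byte matrix with a small
# scalar, so in a packed native layout the matrices would straddle cache lines.
aos = [{"matrix": [float(i)] * 16, "det": float(i)} for i in range(4)]

# Struct-of-arrays: one homogeneous array per field, so each array can be
# aligned on its own boundary (e.g. the matrix array on 64 bytes).
class MatrixContainer:
    def __init__(self, n):
        self.matrix = [[0.0] * 16 for _ in range(n)]
        self.det = [0.0] * n

soa = MatrixContainer(4)
for i in range(4):
    soa.matrix[i] = [float(i)] * 16
    soa.det[i] = float(i)

# Same data either way; only the access pattern changes:
# aos[i]["matrix"] / aos[i]["det"]  vs  soa.matrix[i] / soa.det[i]
assert soa.matrix[2] == aos[2]["matrix"]
assert soa.det[2] == aos[2]["det"]
```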
@rotors_taker_0h
@rotors_taker_0h 8 months ago
Then you find yourself counting execution port limits, micro-op cache, branchless logic, whatever. The very next level of optimization (for those curious enough) is to separate the "logic" of transforming data (matrix multiplication or whatever) from the "schedule" (correct blocking, SIMD, multithreading, etc.) and let a compiler or special optimizer do it for you. And they can optimize _several_ operations in the best way. Google for stuff like Halide, Taichi, TVM, the Tiramisu compiler, etc. This stuff can optimize for different hardware too (DSPs, GPUs, etc.). They make camera pipelines with that, neural networks, rendering, all kinds of stuff. Computers are crazy fast and crazy difficult to program.
@Gamex996
@Gamex996 1 month ago
I once optimized Python code from 50 min down to 5 min with some C-binding libraries plus multithreading and multiprocessing on a 12-core/24-thread CPU. It felt great, the best feeling I ever had, once that hot garbage worked; later I threw it at a colleague to finish.
@nocopyrightgameplaystockvi231
@nocopyrightgameplaystockvi231 8 months ago
Just going to use Matmul to escape this Armageddon
@russelllapua4904
@russelllapua4904 8 months ago
This video has solidified that maths just isn't for me.
@Veptis
@Veptis 8 months ago
This seems like great engineering. Good to be able to simply import MKL.
@gingeral253
@gingeral253 8 months ago
This cache optimization goes against the programming principles I had learned.
@d_9696
@d_9696 8 months ago
This level of optimization is probably only needed for numerical applications and gaming
@bennymountain1
@bennymountain1 7 months ago
optimization and gaming in one sentence LMAO
@spamviking8591
@spamviking8591 8 months ago
Some of the best devs I know suffer from premature optimization.
@Hr1s7i
@Hr1s7i 6 months ago
3:40 It's the same relation as the one between moments and velocity in mechanics. The CPU can perform that many billions of floating point operations in a single second, while the RAM can "process", i.e. store/cycle, that many in a single second. RAM is a capacitive, hard-wired reactive circuit, which means you have to wait for the rather long transient processes to do their thing. Couple that with the fact that you spend part of the bandwidth on communication (the CPU's memory controller is the equivalent of a train station tossing trains at other stations, be it memory, the GPU or in some cases SSDs) and you might end up with seemingly low performance under certain conditions. A good example of how this weakness can surface is when you operate on enormous active data sets. Meanwhile, your PC can look like an absolute monster when playing a video game, where once the data necessary for the simulation is loaded, very little of it changes on the logic side of things. As a matter of fact, most game behaviours are de facto hard-coded specifically to prevent overhead. In the meantime, driving a PID sim in Matlab can easily employ 4-5 cores when done in a non-cheaty way.
@michaelt126
@michaelt126 6 months ago
I'm pretty sure modern compilers already use a combination of prefetching and blocking for large arrays without you having to include additional libraries. Maybe this isn't standard for all languages, but it's worth knowing before adding a bunch of redundant code.
@plurallyplurally7948
@plurallyplurally7948 8 months ago
What material is required in order to learn about this?
@markuszeller_official
@markuszeller_official 8 months ago
Uncle Bob is going to die if he sees these nested loops.
@BrainySmurf77
@BrainySmurf77 8 months ago
I spent a large part of my PhD doing optimizations like this to my CFD codes so I could graduate faster.
@shams350z
@shams350z 7 months ago
How do they animate such stuff?
@KamillaMirabelle
@KamillaMirabelle 8 months ago
We needed to do that in a homework assignment for a course 😅
@mbarrio
@mbarrio 8 months ago
How would one cache-optimize using Rust?
@hkravch
@hkravch 8 months ago
Bro is using that Star Trek "insert technobabble" thing
@deldrinov
@deldrinov 8 months ago
I'm a novice programmer at best, but I've encountered something similar once. I wanted to add attributes to each member of an array which were themselves in an array; pretty much extending a 2D array, if my lingo is correct. I did that in two nested for loops, and the initial code ran like crap. I swapped the for statements without any other changes and the speed increased 40x.
@DFPercush
@DFPercush 7 months ago
There's a similar thing in image processing. It's usually better to loop for (y=...) on the outer loop and for (x=...) in the inner loop. That way you're accessing memory contiguously in a horizontal manner and not skipping several cache lines on every pixel.
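The loop-order effect both comments describe, as a small Python sketch (the answer is identical either way; on row-major data only the y-outer walk touches memory contiguously, which is where the speedup comes from in compiled code):

```python
def sum_row_major(img, h, w):
    # y outer, x inner: walks each row contiguously (cache-friendly)
    total = 0
    for y in range(h):
        for x in range(w):
            total += img[y][x]
    return total

def sum_col_major(img, h, w):
    # x outer, y inner: jumps a whole row stride on every step (cache-hostile)
    total = 0
    for x in range(w):
        for y in range(h):
            total += img[y][x]
    return total

h, w = 4, 3
img = [[y * w + x for x in range(w)] for y in range(h)]
assert sum_row_major(img, h, w) == sum_col_major(img, h, w) == 66
```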
@sanderbos4243
@sanderbos4243 8 months ago
7:03 If you have a Set of 10 items, rather than transforming it into an array and doing O(n) removal by shifting everything over, you can just make it O(1) using the "swap-remove" `arr[i] = arr[--len];` that swaps the last element into the position of the removed element. Note this presumes it's a Set, so the order doesn't matter, but you said it was a Set. :)
@YumekuiNeru
@YumekuiNeru 8 months ago
Wasn't the point that O-notation does not take into account how the data structure is stored in memory, so an array will probably be faster even if it naively says O(n) or whatever?
@Marius426
@Marius426 8 months ago
... unless you want to keep it sorted for binary search access, then you do have to move the elements. But again, for small enough sets it doesn't matter there either.
@sanderbos4243
@sanderbos4243 8 months ago
@@YumekuiNeru Sure, but my single-line suggested approach of removing items from the array is an improvement of the approach prime suggested where you're moving way more items. Why shift over say 5 items when you can just swap. It does depend on it being a Set, but that was prime's hypothetical.
@sanderbos4243
@sanderbos4243 8 months ago
@@Marius426 If your elements are integers and you want to find an element, you can do it in O(1) as well, rather than binary search's O(log N). You just need to maintain a second array of the same length, where you index it with the item you're looking for and it tells you what index it's at. No hashing required. This of course doesn't really matter for performance when dealing with 10 integers, but with more integers it can. I hear you say "but my elements often aren't integers", but you can let those integers be indices into an array of the structs or whatever you're looking for. Think of them as entity IDs. This costs you another level of indirection, but on the other hand swapping integers is faster than swapping large structs. But this is admittedly going quite far, just wanted to bring up the possibility. "We can solve any problem by introducing an extra level of indirection... except for the problem of too many levels of indirection."
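The index-map idea above is sometimes called a sparse set; a minimal Python sketch (class and method names are illustrative):

```python
class SparseSet:
    """O(1) add/contains/remove for small non-negative ints, no hashing.

    `dense` holds the members; `sparse[v]` holds v's index in `dense`."""

    def __init__(self, max_value):
        self.dense = []
        self.sparse = [0] * (max_value + 1)

    def contains(self, v):
        i = self.sparse[v]
        return i < len(self.dense) and self.dense[i] == v

    def add(self, v):
        if not self.contains(v):
            self.sparse[v] = len(self.dense)
            self.dense.append(v)

    def remove(self, v):
        if self.contains(v):
            i = self.sparse[v]
            last = self.dense.pop()  # swap-remove, as in the comment above
            if i < len(self.dense):
                self.dense[i] = last
                self.sparse[last] = i

s = SparseSet(100)
for v in (3, 7, 42):
    s.add(v)
s.remove(7)
assert s.contains(42) and not s.contains(7)
```

The values can double as entity IDs indexing into a separate array of structs, exactly as described.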
@austinbachurski7906
@austinbachurski7906 8 months ago
Just when I think I'm starting to learn a thing or two about programming, here comes this guy. Nope, I'm still an idiot...
@marble_wraith
@marble_wraith 8 months ago
yo dawg i heard you like nested loops. So i got some nested loops for your nested loops.
@S-we2gp
@S-we2gp 3 months ago
The world would grind to a halt if every programmer had to deal with this. Just solve it for me, bro, and give me a library and some rules.
@diegocassinera
@diegocassinera 8 months ago
You need to move the data from memory to registers, do the operation in registers, and move the result back to memory.
@cblair1353
@cblair1353 8 months ago
Linear algebra is fucking magic.
@resresres1
@resresres1 8 months ago
Modern programming at its finest: spend forever optimizing code, then find out there's a library for that and you wasted your time.
@yuack2398
@yuack2398 8 months ago
As a researcher using HPC computers daily for physics, this kind of performance gain is really helpful. But the problem is, most users of HPC platforms know nothing about computer science itself.
@priyeshagarwal2644
@priyeshagarwal2644 4 months ago
Would Strassen's matrix multiplication optimize it even further? O(n^2.8)
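For reference: Strassen replaces one of the eight block multiplications with extra additions, which is where the O(n^2.807) comes from, though the larger constant factor and worse cache behavior mean it only pays off for fairly large matrices. One Strassen step on a 2x2, sketched in pure Python (applied recursively to blocks, this is the full algorithm):

```python
def strassen_2x2(A, B):
    """One Strassen step on 2x2 matrices: 7 multiplications instead of 8."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

assert strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
```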
@user-ud8hw4gp6t
@user-ud8hw4gp6t 2 months ago
I just don't get it: if you can avoid redundancy, you use an adjacency-matrix nested loop; if not, you either use matrix multiplication with native arrays, or you use tensors
@TutoPanographic
@TutoPanographic 7 months ago
The optimization concept in everyone's mind is to try to make it faster, smaller, more efficient, etc., while the real problem is to first choose the right algorithm. Thanks, Donald. Also, the video doesn't mention that even if these low-level optimizations work, they probably compile differently on different machines. And every optimization is contextual, because every problem/need is different. And then there's reality: when the client code is not new, you know you'll witness schizophrenia...