C++ Weekly - Ep 435 - Easy GPU Programming With AdaptiveCpp (68x Faster!)

13,878 views

C++ Weekly With Jason Turner · 1 month ago

☟☟ Awesome T-Shirts! Sponsors! Books! ☟☟
Upcoming Workshop: C++ Best Practices, NDC TechTown, Sept 9-10, 2024
► ndctechtown.com/workshops/c-b...
Upcoming Workshop: Applied constexpr: The Power of Compile-Time Resources, C++ Under The Sea, October 10, 2024
► cppunderthesea.nl/workshops/
Episode details: github.com/lefticus/cpp_weekl...
Code Sample: github.com/lefticus/cpp_weekl...
T-SHIRTS AVAILABLE!
► The best C++ T-Shirts anywhere! my-store-d16a2f.creator-sprin...
WANT MORE JASON?
► My Training Classes: emptycrate.com/training.html
► Follow me on twitter: / lefticus
SUPPORT THE CHANNEL
► Patreon: / lefticus
► Github Sponsors: github.com/sponsors/lefticus
► Paypal Donation: www.paypal.com/donate/?hosted...
GET INVOLVED
► Video Idea List: github.com/lefticus/cpp_weekl...
JASON'S BOOKS
► C++23 Best Practices
Leanpub Ebook: leanpub.com/cpp23_best_practi...
► C++ Best Practices
Amazon Paperback: amzn.to/3wpAU3Z
Leanpub Ebook: leanpub.com/cppbestpractices
JASON'S PUZZLE BOOKS
► Object Lifetime Puzzlers Book 1
Amazon Paperback: amzn.to/3g6Ervj
Leanpub Ebook: leanpub.com/objectlifetimepuz...
► Object Lifetime Puzzlers Book 2
Amazon Paperback: amzn.to/3whdUDU
Leanpub Ebook: leanpub.com/objectlifetimepuz...
► Object Lifetime Puzzlers Book 3
Leanpub Ebook: leanpub.com/objectlifetimepuz...
► Copy and Reference Puzzlers Book 1
Amazon Paperback: amzn.to/3g7ZVb9
Leanpub Ebook: leanpub.com/copyandreferencep...
► Copy and Reference Puzzlers Book 2
Amazon Paperback: amzn.to/3X1LOIx
Leanpub Ebook: leanpub.com/copyandreferencep...
► Copy and Reference Puzzlers Book 3
Leanpub Ebook: leanpub.com/copyandreferencep...
► OpCode Puzzlers Book 1
Amazon Paperback: amzn.to/3KCNJg6
Leanpub Ebook: leanpub.com/opcodepuzzlers_book1
RECOMMENDED BOOKS
► Bjarne Stroustrup's A Tour of C++ (now with C++20/23!): amzn.to/3X4Wypr
AWESOME PROJECTS
► The C++ Starter Project - Gets you started with Best Practices Quickly - github.com/cpp-best-practices...
► C++ Best Practices Forkable Coding Standards - github.com/cpp-best-practices...
O'Reilly VIDEOS
► Inheritance and Polymorphism in C++ - www.oreilly.com/library/view/...
► Learning C++ Best Practices - www.oreilly.com/library/view/...

Comments: 72
@sqeaky8190 · 20 days ago
Hearing about your pains is comforting. Thank you for sharing that part. It must be humbling, but it is helpful to know I am not alone with GPU woes.
@cppweekly · 8 days ago
I knew I was in for a steep learning curve. I had been planning an episode like this for YEARS and finally just put in the effort to do it. Honestly one of the most difficult episodes for me to put together.
@TsvetanDimitrov1976 · 1 month ago
The fact that this is pure C++ code is actually quite impressive. It reminds me of C++ AMP back in 2011. Still, this kind of solution leaves a ton of performance on the table compared to hand-writing it using Vulkan or DX12 compute shaders, so I'm not really sure it's the right way forward for heterogeneous computing. I'd rather have the GPU vendors conform to a common ISA, so that we can program the GPUs directly instead of going through multiple layers of (black-box) abstractions.
@victotronics · 1 month ago
Given that NVIDIA dominates the market, they are not interested in a common ISA. But SYCL and Kokkos are such common ways of writing for multiple GPU brands.
@Illuhad · 1 month ago
AdaptiveCpp's C++ standard parallelism offloading is not intended for folks who are willing to hand-write shader code. It's for people who have a C++ application, want to remain at a high abstraction level, and perhaps get some speedup just by recompiling. If you want more control, AdaptiveCpp also supports SYCL as a programming model, which exposes much more control to the user. And you can mix both models in the same app: e.g., start at a high level, then move to SYCL if you want to optimize some kernel in particular. A common ISA for GPUs is... extremely unrealistic. Architectures are way too different, and vendors can't even agree on a common IR. AdaptiveCpp, by the way, supports a unified IR and code representation across all its targets (CPU as well as GPUs).
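For readers who haven't seen the two models side by side, here is a minimal sketch of the mixing Illuhad describes (illustrative only: the data, sizes, and kernel are made up; --acpp-stdpar is AdaptiveCpp's documented flag for standard-parallelism offload):

```cpp
#include <algorithm>
#include <execution>
#include <vector>
#include <sycl/sycl.hpp>

int main() {
  std::vector<float> data(1'000'000, 1.0f);

  // High level: plain C++ standard parallelism. Compiled with
  // AdaptiveCpp's --acpp-stdpar mode, this call may run on the GPU.
  std::transform(std::execution::par_unseq, data.begin(), data.end(),
                 data.begin(), [](float x) { return x * 2.0f; });

  // Lower level: the same update as an explicit SYCL 2020 kernel,
  // with explicit control over the queue and memory placement.
  sycl::queue q;
  float *dev = sycl::malloc_shared<float>(data.size(), q);
  std::copy(data.begin(), data.end(), dev);
  q.parallel_for(sycl::range<1>{data.size()},
                 [=](sycl::id<1> i) { dev[i] *= 2.0f; }).wait();
  std::copy(dev, dev + data.size(), data.begin());
  sycl::free(dev, q);
}
```

The point of the design is that both calls can live in one translation unit, so a hot algorithm can be migrated to the explicit form without rewriting the rest of the application.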
@TsvetanDimitrov1976 · 1 month ago
@Illuhad I totally agree, it's a great tool. My comment was more along the lines of a possible way for C++ to go into GPU programming while still being as close to the metal as possible. And for that to be possible we definitely need at least a stable ISA from each vendor, even if it's not common between NVIDIA/AMD/Intel/etc. I don't mind writing all the shaders and the surrounding infrastructure code, but I imagine a future where I could just write pure C++ code instead of HLSL, GLSL, Metal, etc. and leave all that work to the compiler without giving up control or performance.
@Illuhad · 1 month ago
@TsvetanDimitrov1976 But you can do this with AdaptiveCpp, though. It has a unified code representation based on LLVM IR, which is then JIT-compiled at runtime for GPUs from all vendors. And while what you saw in this video was fairly high-level, AdaptiveCpp also allows you much more low-level control if you like. The SYCL model that it supports is at a similar abstraction level to CUDA, so it might be pretty close to what you want... Aligning ISAs would require aligning hardware architectures. When you say ISA, I'm not sure you really mean ISA. For example, NVIDIA does not even have a well-documented, stable ISA. Their ISA (SASS) is proprietary and changes with each GPU generation. What NVIDIA has is an intermediate representation (IR) for all their GPUs, called PTX. AdaptiveCpp gives you an intermediate representation across all GPUs.
@TsvetanDimitrov1976 · 1 month ago
@Illuhad "For example NVIDIA does not even have a well-documented stable ISA. Their ISA (SASS) is proprietary and changes with each GPU version." That's exactly what I hate about it. I'd rather program the hardware than the OS/driver/whatever abstraction on top of the driver. This is the loss of control/performance I am talking about. I'm a game dev, so it's probably a very niche opinion, but I want total control over memory allocation, scheduling, and executing the code on the GPU. I kind of get it through Vulkan/DX12/etc., but that's at least two levels of indirection which I'd rather not have. ANY "magic" runtime is a non-starter, be it a driver, an API, or some state machine that assumes how I want to use the GPU. I hope that clarifies my stance.
@toast_on_toast1270 · 1 month ago
Will definitely be checking this out. In solving some parallelisable problems at my job, I went ahead and used Vulkan, essentially modifying the compute shader example on the Vulkan website, and saw something like a 100x speedup. However, it's pretty complicated to use: you have to write the shader code, compile it at runtime, and manage dispatch, memory, and synchronization. It's a pretty long way from standard C++, and there's a lot that can go wrong. If AdaptiveCpp can even come close to the performance I am getting with Vulkan, it's worth a shot, because it will simplify the codebase significantly. I would like to see how well it handles complex tasks, for example whether it can chunk through a lot of trig operations on data quickly, and how efficiently it handles the dispatch of successive "draw calls".
@avramlevitter6150 · 1 month ago
I'm curious what the performance is when running a natively written CUDA version of this code, and how it stacks up against the AdaptiveCpp versions. In general, I find that these "write once, compile for anywhere" systems tend to be useful only if your use case genuinely is not targeting a specific architecture; once you know you're going to be running on a particular architecture, it's almost always better to write something natively for it. I know AdaptiveCpp's claim is that it can even beat native code, but that's something I'd like to see benchmarks on.
@kikeekik · 1 month ago
There are benchmarks; look on Google Scholar. In the benchmarks my team did, SYCL DPC++ was ~20% slower than CUDA, but that was 3 years ago, and things have changed a lot in the last few years.
@geto6242 · 1 month ago
New to the channel. This is an instant subscribe. Thanks!
@GeorgeTsiros · 1 month ago
It's an expected result. The calculations are simple, and for each cell they are independent of the calculations for the other cells. So a boatload of (relatively) simple processors beats a handful (8, you said) of very complex processors. Truth be told, this is impressive: basically free, automatic parallelization. Twenty years ago this was the stuff of dreams.
@gast128 · 1 month ago
GPUs are cool, though be aware of applicability and the memory transfer overhead. Microsoft used to offer C++ AMP, which was a nice library to offload calculations to an accelerator. Unfortunately they withdrew that library.
@Xaymar · 1 month ago
Nice overview, though it is extremely unoptimized. The CPU versions can be sped up significantly by aligning cells to cache lines and processing a full cache line in one thread. On modern hardware you can store 512 cells, or 256 cells of current plus future state, in one cache line. Combined with SIMD, this lets you update up to 62x62 cells with just 8 cycles overall (AVX-512). It should be a significant improvement over the current code, but it isn't instantly transformable to compute code.
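To illustrate the bit-packing idea (a sketch of the general technique, not Xaymar's actual code): pack 64 cells per uint64_t so a whole row update touches only a few cache lines, and count neighbors for all 64 cells at once with carry-save adders. This version is scalar, one word per row, with word borders treated as dead; AVX-512 would process several such words per instruction.

```cpp
#include <cstdint>

// Next generation of one 64-cell row (bit j = column j), given the
// packed rows directly above and below. Shifted-in zeros make the
// leftmost/rightmost columns behave as if bordered by dead cells.
uint64_t life_row(uint64_t up, uint64_t cur, uint64_t down) {
  // Eight neighbor bitboards, aligned so bit j holds a neighbor of cell j.
  const uint64_t n[8] = {up << 1,   up,   up >> 1,
                         cur << 1,        cur >> 1,
                         down << 1, down, down >> 1};
  // Bitwise full adder: returns sum bit, writes carry bit.
  auto full = [](uint64_t a, uint64_t b, uint64_t c, uint64_t &carry) {
    carry = (a & b) | ((a ^ b) & c);
    return a ^ b ^ c;
  };
  uint64_t c0, c1, c2, cx, cy;
  const uint64_t s0 = full(n[0], n[1], n[2], c0); // partial weight-1 sums
  const uint64_t s1 = full(n[3], n[4], n[5], c1);
  const uint64_t s2 = full(n[6], n[7], 0, c2);    // n6 + n7 (half adder)
  const uint64_t b0 = full(s0, s1, s2, cx);       // bit 0 of neighbor count
  const uint64_t t  = full(c0, c1, c2, cy);       // sum of weight-2 bits
  const uint64_t b1 = t ^ cx;                     // bit 1 of neighbor count
  const uint64_t b2 = cy ^ (t & cx);              // bit 2 of neighbor count
  // Alive next generation iff count==3, or count==2 and currently alive.
  return b1 & ~b2 & (b0 | cur);
}
```

All 64 cells of the row advance in roughly a dozen bitwise operations, which is where the claimed CPU headroom comes from.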
@mytech6779 · 1 month ago
OpenSYCL was also known as hipSYCL (after AMD's HIP GPU framework), if that helps anyone trying to look up information. Also, SYCL's switch from version numbers to revision years (e.g., 2020) marked a complete change in the entire standard: it was based largely on OpenCL and is now fully independent. The year revisions are also intended as a way to stay somewhat aligned with ISO C++ revisions.
@callmeray7705 · 1 month ago
I remember reading on a Vulkan blog (might not have been a blog, it was a while ago) that the sweet spot was around 2 million concurrent floating-point operations, so I'm glad to see a proven sweet spot with a similar number.
@mytech6779 · 1 month ago
With which hardware?
@jacobmoore2036 · 18 days ago
You should really consider the MATAR C++ performance portability library developed by Los Alamos National Laboratory. It supports single-source portability for CPU and/or GPU hardware, with MPI wrapping.
@cppweekly · 8 days ago
Noted! I'll consider doing a follow-up.
@jacobmoore2036 · 8 days ago
@cppweekly Full disclosure: I'm biased, I use MATAR heavily. It's just really nice to go from serial, to threaded CPU parallel, to running on GPUs by just changing a compile flag. :)
@mathieu564 · 1 month ago
I haven't seen this video yet, but it would be great if this kind of video were done more often on C++ Weekly. It's the kind of video where even a flawed attempt is valuable, because the subject is not well covered. "How to use C++ with X and Y" would really be great for viewers.
@mjKlaim · 1 month ago
Wow, I wasn't aware of that heterogeneous compiler! I'll make a note to play with it someday, maybe mixing it with the new std::execution library }:D
@bevanweiss522 · 1 month ago
It would have been good to see the graph continue for a few more size iterations 'bigger' on the right. It hasn't clearly shown the intersection between GPU and CPU, where it appears the GPU 'bowl' is on the way back up. Perhaps it was just going to level out around the Clang CPU curve (suggesting there is some high-intensity CPU load associated with the larger grids, perhaps virtual memory paging, which I suspect is not CPU-parallelizable).
@Illuhad · 1 month ago
Yep, the hardware used was an APU. APUs/iGPUs typically have fairly limited amounts of dedicated memory, so I suspect that for the larger problem sizes we reach that limit and virtual memory shenanigans start. Apart from caching effects at small problem sizes, we usually don't see AdaptiveCpp slow down for larger problems as long as you remain within VRAM capacity.
@anon_y_mousse · 1 month ago
It is most curious to me that, using the same standard library implementation, the two compilers produce such wildly different results. I would assume it has something to do with how the code is optimized: Clang expects a particular organizational structure based on how its developers think optimization should work, and GCC uses something totally different that doesn't mesh when Clang is set to accommodate GCC's library. Although I'd bet that when I see your video it'll be something completely different.
@AxWarhawk · 1 month ago
Wait until he learns about Celerity 😉
@treelibrarian7618 · 1 month ago
For context, I sketched an AVX-512 asm version (in about an hour) and did 1024x1024x500 in 3.5 ms, in a SINGLE THREAD on my 11th-gen i5.
@GeorgeTsiros · 1 month ago
How many iterations? 500?
@treelibrarian7618 · 1 month ago
@GeorgeTsiros Yes.
@MrVladko0 · 22 days ago
@treelibrarian7618 Could you share your code?
@matrixstuff3512 · 1 month ago
I'd love to hear your thoughts, as a fresh user, comparing this with Kokkos.
@AusSkiller · 1 month ago
I wonder how the performance of this compares to just writing a fragment shader to do the computation. Honestly, I'm pretty surprised by how slow it is compared to what I would normally expect on a GPU; I was expecting well under 10 ms per iteration at 10,000x10,000. Then again, maybe it is limited by memory bandwidth, especially on an integrated GPU. I also tend to have pretty high-end GPUs, so my expectations are probably a little high for an integrated one.
@Illuhad · 1 month ago
Yeah, an APU will have a memory bandwidth of something like 30 GB/s, depending on the exact configuration... Also, keep in mind that these are not pure kernel timings but host-side timings; there may be offloading latencies, initial data transfer costs, etc. included as well. The code also seems optimized more for teaching than for performance: e.g., if I see it correctly, it does not generate the indices of the cells on the fly (using, say, an iota view) but stores them in memory, which is not needed. So it has to move more data than just the 10,000x10,000 grid and the associated stencil. AdaptiveCpp has been shown to deliver competitive performance on large HPC GPUs compared to other models like CUDA.
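A sketch of the on-the-fly index generation Illuhad suggests (hypothetical names; assumes a row-major char grid, and that the parallel algorithm implementation accepts iota_view's iterators, as AdaptiveCpp's stdpar examples do):

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <ranges>
#include <vector>

// One Game-of-Life step over a w*h grid, generating cell indices lazily
// with views::iota instead of materializing an index buffer in memory.
void step(const std::vector<char> &in, std::vector<char> &out,
          std::size_t w, std::size_t h) {
  auto idx = std::views::iota(std::size_t{0}, w * h);
  std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                [inp = in.data(), outp = out.data(), w, h](std::size_t i) {
                  const std::size_t x = i % w, y = i / w;
                  int alive = 0;
                  for (int dy = -1; dy <= 1; ++dy)
                    for (int dx = -1; dx <= 1; ++dx) {
                      if (dx == 0 && dy == 0) continue;
                      const std::size_t nx = (x + w + dx) % w; // toroidal wrap
                      const std::size_t ny = (y + h + dy) % h;
                      alive += inp[ny * w + nx];
                    }
                  outp[i] = (alive == 3) || (alive == 2 && inp[i]);
                });
}
```

Because the range is computed lazily, the only data the device has to touch is the two grids themselves.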
@literallynull · 1 month ago
Hey Jason, what do you think of Intel's DPC++?
@victotronics · 1 month ago
That's basically the same as SYCL, but without this ultra-cool trick of converting range algorithms.
@Spielix · 1 month ago
@victotronics In turn, Intel has its own version of parallel STL-like algorithms called oneDPL. It's basically Intel's answer to Thrust/rocThrust. Being able to just use the STL algorithms is pretty cool, but in many situations these libraries bring some extra features to the table as well, like segmented reductions, scans, and sorts.
@VioletGiraffe · 1 month ago
It would be interesting to know more about how it actually works. Does it store LLVM IR code and compile that for the GPU? Does it translate it into some other language first?
@Illuhad · 1 month ago
It supports multiple compilation flows; the default is a unified host-device compiler which indeed stores LLVM IR and JIT-compiles it at runtime for host CPU / NVIDIA PTX / amdgcn / SPIR-V, depending on what is needed. Details can be found in the project documentation on GitHub. There are also papers describing it in more detail: "One Pass to Bind Them: The First Single-Pass SYCL Compiler with Unified Code Representation Across Backends" discusses the unified host-device JIT compiler, and "AdaptiveCpp Stdpar: C++ Standard Parallelism Integrated Into a SYCL Compiler" discusses how the C++ standard parallelism offloading works. You can find both papers by googling.
@darkmagic543 · 1 month ago
Not bad, although it seems usable only in very specific scenarios? If you want to squeeze out maximum performance, you would just use something like CUDA, which gives you more control. If your task is a bit more complex, such that using simple standard algorithms would be hacky, it would also not be a great solution: what about concurrency/synchronization? So basically you would use it only if you have a simple problem that you want to speed up a little with minimal effort, but don't want to put in the work to speed it up more?
@VFPn96kQT · 1 month ago
CUDA works on NVIDIA GPUs only. SYCL is generic and compiles to CUDA, ROCm, SPIR-V, or OpenMP.
@Illuhad · 1 month ago
AdaptiveCpp also supports the SYCL programming model, which you can mix and match with standard C++ algorithms. SYCL exposes much more control, at a similar level to CUDA. For example, you could start developing your application with C++ standard algorithms and, if you find some performance bottlenecks, optimize those bits in SYCL.
@victotronics · 1 month ago
AdaptiveCpp is indeed an open implementation of SYCL. Unfortunately, SYCL is not an easy system to program in. There is also Kokkos (from Sandia Labs), which also targets multiple backends and is (imnsho) easier to program. And then there is OpenMP with offloading. With the exception of OpenMP, they are all "data parallel" systems, much like CUDA. In fact, if you squint a little, they all look so similar to CUDA that you could probably convert them automatically; Intel has such a tool for their version of SYCL. But that's the story for standard SYCL. The fact that you didn't have to change your code means that (and this I didn't know, and it is very cool!) AdaptiveCpp apparently translates C++ range algorithms to the (ugly, ugly) underlying SYCL code. I think this is specific to the ACPP compiler, and not behavior mandated by the SYCL standard. Cool.
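To make the contrast concrete, here is a sketch of roughly what a single parallel algorithm call corresponds to when hand-written against SYCL 2020's buffer/accessor API (my illustration, not the code ACPP actually generates):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
  std::vector<float> v(1024, 1.0f);
  {
    sycl::queue q;
    // The buffer takes ownership of the host data for its lifetime.
    sycl::buffer<float, 1> buf{v.data(), sycl::range<1>{v.size()}};
    q.submit([&](sycl::handler &cgh) {
      sycl::accessor acc{buf, cgh, sycl::read_write};
      cgh.parallel_for(sycl::range<1>{v.size()},
                       [=](sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  } // buffer destruction waits for the kernel and writes back to v
}
```

With the range-algorithm translation, all of this orchestration collapses into one std::transform call with an execution policy.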
@Spielix · 1 month ago
"Not easy to program in" is relative. As someone used to CUDA, SYCL actually looks quite nice to program in, and I think it also compares quite favorably to OpenCL. On the other hand, I have little experience actually using SYCL (or OpenCL) beyond reading some samples, so take these opinions with a grain of salt. NVIDIA nowadays provides an HPC toolkit including the nvc++ compiler (formerly PGI's HPC C++ compiler) that can also offload stdpar algorithms (only for NVIDIA hardware). It basically just interfaces to NVIDIA's Thrust library and replaces heap memory with CUDA UVM, i.e., memory that is accessible from both CPU and GPU. Allocations become quite expensive, though.
@victotronics · 1 month ago
@Spielix I've never written OpenCL, but it looks awful. SYCL and CUDA are both "apply this point function over this range". However, SYCL, unlike Kokkos (or CUDA), insists on making the task queue explicit, which complicates writing the kernels needlessly.
@Spielix · 1 month ago
@victotronics I guess you mean writing the host-side orchestration of kernels, buffers, and so on? Because it doesn't seem to influence how one writes the kernel functions themselves. And with SYCL 2020 USM and the terse syntax, it seems to me they fixed the worst boilerplate. I just looked at Intel's small SYCL tutorial, which mentions those 2020 features at the end. While CUDA streams are somewhat optional, you still want to use them in serious code for asynchronicity/overlapping multiple operations, so generally I don't see a problem with explicit queues, other than the amount of boilerplate with the classic syntax and buffers.
@mytech6779 · 1 month ago
@Spielix Old SYCL was basically a fancified OpenCL frontend. With the change from version numbers to year releases, the entire specification changed, so SYCL is now a standalone standard no longer sitting on the OpenCL backend. The year releases are also intended as a way to stay more clearly aligned with the ISO C++ revision cycle.
@dsecrieru · 1 month ago
Aren't there race conditions when processing a cell's neighborhood in parallel?
@Antagon666 · 1 month ago
Nope, since you don't modify the input buffer, and the outputs are unique.
@toast_on_toast1270 · 1 month ago
If the output of the cell at Tn depends only on its neighbours at Tn-1, then no: you can have multiple reads of the same data without causing a race condition. If, on the other hand, the output of a given cell depends on the output of the other cells, then GPU programming is not the tool for you.
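A minimal sketch of the double-buffering pattern both replies describe (illustrative names; the stencil body is elided since only the buffering matters here):

```cpp
#include <algorithm>
#include <execution>
#include <utility>
#include <vector>

// Generation n is read-only while generation n+1 is written, so the
// parallel writes never race with the reads; each output cell is
// written exactly once.
void run(std::vector<char> &cur, int generations) {
  std::vector<char> next(cur.size());
  for (int g = 0; g < generations; ++g) {
    std::transform(std::execution::par_unseq, cur.begin(), cur.end(),
                   next.begin(), [](char c) {
                     return c; // a real stencil reads neighbours of `cur` here
                   });
    std::swap(cur, next); // O(1): the new grid becomes the next input
  }
}
```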
@Spielix · 1 month ago
@toast_on_toast1270 "then GPU programming is not the tool for you" is a bit of hyperbole. First of all, this problem is not specific to GPUs, and second, there are solutions for parallel in-place updates, like coloring (see red-black Gauss-Seidel).
@toast_on_toast1270 · 1 month ago
@Spielix I meant not the tool for the job. If the output of each cell depends on the current iteration's output of its neighbours, then that is a sequential, not parallel, problem; see CFD simulation. Edit: well, I looked up your reference; yeah, maybe a flat "no" is a bit too far. But your example is not as direct an application of the hardware architecture: the solution is instead found using calculus, and the GPU is used for the linear algebra. The performance increase is less remarkable than in something inherently parallel like image processing. It also requires a maths degree to pull off!
@ekondis · 1 month ago
I didn't hear any mention of the GPU and CPU specs. If this was run on a high-end GPU, e.g. an RTX 4090, then the speedup doesn't look very impressive.
@Illuhad · 1 month ago
The specs are displayed in the spreadsheet; e.g., look at 8:14. It was a Ryzen 7 7730U APU, not a high-end GPU, so GPU memory bandwidth was the same as on the host CPU, and this is a memory-bound problem. I think the speedup is actually pretty great given the hardware.
@ekondis · 1 month ago
@Illuhad Yes, this is great. I didn't notice that on the spreadsheet.
@Antagon666 · 1 month ago
Lemme guess: Clang vectorizes the code, which is useless in this case.
@kikeekik · 1 month ago
AFAIK, acpp uses OpenMP or OpenCL as CPU backends, not TBB
@victotronics · 1 month ago
It has backends for OpenMP, CUDA, and HIP.
@Illuhad · 1 month ago
This is true. However, the parallel STL implementations in libstdc++ and libc++ rely on TBB. I think what was done here was to compare against acpp as a regular host compiler, using the PSTL from libstdc++ without offloading; that then goes through TBB due to libstdc++ internals. You can also use AdaptiveCpp to run the PSTL on the CPU via AdaptiveCpp's CPU support (OpenMP or OpenCL, as you say), but I don't think that was the focus here.
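A small example of that dependency: the program below never names TBB, yet compiled with g++ against libstdc++'s PSTL it must link against it (the acpp invocation is an assumption based on AdaptiveCpp's documented stdpar mode):

```cpp
// pstl_tbb.cpp
//   g++ -std=c++17 -O2 pstl_tbb.cpp -ltbb   # libstdc++ PSTL dispatches to TBB
//   acpp --acpp-stdpar pstl_tbb.cpp         # AdaptiveCpp offload (assumed flag)
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> v(1 << 20, 1.0);
  // The execution policy is all the source says about parallelism;
  // which runtime actually executes it is the compiler/library's choice.
  const double sum = std::reduce(std::execution::par, v.begin(), v.end());
  return sum == double(1 << 20) ? 0 : 1;
}
```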
@SillyOrb · 1 month ago
8:47 Just a minor nitpick: twice as fast isn't the same as twice faster. With that out of the way, that's curious; it would make for a good follow-up.
@frankreich5018 · 1 month ago
You don't have any devices natively running Linux? That is completely outrageous.
@vasylzaichenko3253 · 1 month ago
Just FYI: I have an ASUS ROG Flow X13 (AMD CPU + NVIDIA GPU), and it is much easier to build Intel's oneAPI LLVM toolchain, both for Windows and WSL2, with just the CUDA toolkit installed. From my point of view it is the easiest way to start playing with DPC++.
@markusasennoptchevich2037 · 1 month ago
There is a reason systems programmers don't like C++ for low-level stuff.
@sqlexp · 1 month ago
Skill issues.
@PopescuAlexandruCristian · 1 month ago
Imagine the lack of skill you must have to use some garbage like this. 68x compared to what a CPU does is pocket change for a GPU, unless you are a Java programmer.
@Illuhad · 1 month ago
Dude... this was done on an APU, not a powerful dedicated GPU. Memory bandwidth there is the same as on the CPU. And the application is hardly a benchmark; there are a couple of things in there that are not ideal and could be optimized. It's a simple example...
@Spielix · 1 month ago
Using abstractions is not about a lack of skill; it's about using your time as a developer efficiently. Once you have a working implementation you can start to benchmark/profile and optimize the actual bottlenecks, instead of wasting time reinventing the wheel.