Introduction to C++ Atomic Variables in Multithreaded Programming

Views: 33,196


A quick intro to C++ atomic variables and why you might want to use them when writing multithreaded code (or why you might NOT want to use them).
You should definitely check out this CppCon video for a full explanation of atomics, from the very basics right up to the advanced level.
CppCon 2017: Fedor Pikus “C++ atomics, from basic to advanced. What do they really do?”
Here is the Compiler Explorer link to the very simple program that shows the copy, modify and write operations.
godbolt.org/z/p3cu2m

Comments: 56

@njitgrad 2 years ago

A very simple and very easy to understand tutorial! Thanks.

@DavePoo 2 years ago

No problem

@cafelashowerezweb 4 years ago

Thanks, great explanation and healing voice!

@astral_md 4 years ago

You're excellent! Outstanding post!

@treyquattro 3 years ago

Good topic and example. I just have a couple of points.
The vast majority of programmers will be using Intel or AMD processors, either 32-bit x86 (IA-32) or 64-bit x64 (x86-64); I'll call both x86 from now on. The x86 instruction set provides atomic semantics on fundamental data types, meaning those that fit into the word size of the processor and can be fetched or stored in a single memory access, such as bool, char, word (short), dword (int/int32 on x86-32), qword (long long/int64 on x64) and so forth. It does this by asserting the lock prefix on a single instruction that performs the atomic operation, such as add, sub, inc(rement), dec(rement) and some more esoteric instructions such as compare-exchange. Instead of the loop that you described, an atomic add in a fully optimized build of a C++ program will do the following on an x86 machine:
lock add [variable],value
Various (ordinary x86) forms of addressing are available to the locked instruction, but basically we're addressing memory and adding an immediate or register value. Without the lock prefix the processor would not synchronize the operation with other processors and the data value could be corrupted (although, IIRC, aligned reads of fundamental types on x86 are inherently atomic). On older processors the lock prefix locked the bus between processors; modern processors instead take exclusive ownership of the cache line that holds the operand, so that no other core can access it for the duration of the instruction. An uncontended locked instruction is cheap on modern processors, but it is not free: it also acts as a full memory barrier, and when several cores update the same variable the cache line has to bounce between them. So on an x86 machine a thread performing a locked add pays far less than it would for a mutex, but measurably more than for the unlocked version, especially under contention.
You could measure this yourself by using a std::chrono clock and doing, say, a billion adds with a normal and an atomic integer variable and comparing the results. You mentioned mutexes, which are more sophisticated locking primitives and can cause a context switch (the processor will execute some other runnable thread or process) if this thread cannot acquire the mutex. This really will lead to measurably worse performance. However, a single mutex won't lead to a deadlock situation (meaning no threads can run because each is holding a lock primitive such as a mutex while trying to acquire one already held by a different thread).
Lastly, the volatile add was a good example of the sort of behavior you get on a register-based load/store architecture such as an ARM processor, but not on x86. It wasn't entirely accurate, since there was no comparing the added value with the expected result and looping until it became indivisibly set. Volatile tells the C++ compiler that the value of a variable might change between accesses, so it can't be cached in a register and must be read from or written to memory for every operation. So the traditional Intel add instruction, which does the read-add-write operation in a single instruction, is subdivided into three separate instructions. This is how it would appear on ARM or some other RISC architectures but not Intel, unless you were manipulating an atomic variable that was not a fundamental type, in which case the implementation might use a mutex or spinlock. FYI, use of the volatile keyword is discouraged in modern C++, and C++20 deprecates several of its uses (such as compound assignment and increment/decrement on volatile variables). A good understanding of the processor architecture and instruction set can be garnered from the Intel 64 and IA-32 Architectures Software Developer's Manual, freely available on the Intel site (ARM documentation is not nearly as detailed in my experience).
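
The experiment suggested above could be sketched roughly as follows. This is only a minimal single-threaded sketch: `AddTiming`, `time_plain_adds` and `time_atomic_adds` are hypothetical names made up for illustration, and the numbers it produces will depend on the compiler, the optimization flags and the CPU.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Result of one timing run (hypothetical helper type for this sketch).
struct AddTiming {
    std::uint64_t final_value;         // counter value after the loop
    std::chrono::nanoseconds elapsed;  // wall-clock time spent in the loop
};

// Plain adds. `volatile` is used only to stop the optimizer from collapsing
// the loop into a single add; it gives no thread safety.
AddTiming time_plain_adds(std::uint64_t iterations) {
    volatile std::uint64_t counter = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        counter = counter + 1;
    }
    const auto stop = std::chrono::steady_clock::now();
    return {counter,
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}

// Atomic adds on the same thread, so this measures the uncontended cost only.
AddTiming time_atomic_adds(std::uint64_t iterations) {
    std::atomic<std::uint64_t> counter{0};
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        counter.fetch_add(1);  // sequentially consistent read-modify-write
    }
    const auto stop = std::chrono::steady_clock::now();
    return {counter.load(),
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Calling both functions with the same iteration count and comparing `elapsed.count()` gives a first impression. Because only one thread touches the counter, this says nothing about the contended case, where several cores fight over the same cache line.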

@DavePoo 3 years ago

Wow, that's the biggest comment I've ever had on a video. So, optimising atomics is a whole thing and beyond the scope of this video, but suffice it to say you need to know when a value needs to be treated as an atomic, how it's used and what platform it's on to do that. It may be that certain platforms don't need any special treatment for certain operations, as the hardware is naturally able to provide the guarantees required. But when writing cross-platform software you would still have to go through some kind of library (or write your own) so that each platform gets the correct treatment. So recognising that a value needs to be atomic is the important part to start with. And it is for that reason I would never consider atomics to be free or very low cost: even if they were totally free on platform A, they might be expensive on platform B. So never use them where you don't need them; just like medicine for a headache, "more than enough isn't better than enough". As for deadlocks, yes, you are correct that you need more than one mutex to deadlock; I already did a video on that.

@treyquattro 3 years ago

@@DavePoo Hi Dave, generally I agree with what you say about needing to know when to use atomics, or protect code with critical sections (by implication), and cross-platform solutions. This is why it's a good idea to abstract out the semantics in the language, or rather in the standard library in the case of C++, into a templated type. That said, traditionally C and C++ programming has been aimed at close coupling with the underlying hardware. Basically they're systems programming languages where you're very aware of what architecture you're running on. This was especially true of C, which was a sort of high-level assembly language. The C++ committee is trying to have C++ be all things to all programmers, both a low-level systems language and a productive high-level general-purpose programming language, while still providing the most performant solution to any programming problem. In theory, as you implied in the video, we could have every access be atomic. Then we'd let the compiler figure out what really needed to be atomised (if you will) under the covers and what could be left to unlocked accesses. The language currently leaves that decision to the programmer, but it does mean that the programmer has to have a quite intimate understanding of their threading model, and even the underlying hardware. In fact I believe Bjarne Stroustrup himself (inventor of C++) always encourages programmers to understand what is happening with the code at machine level. Obviously this requirement is not present for higher-level languages, which abstract out the hardware fundamentals at ever greater levels.
At some point, possibly not too far in the future and possibly with the help of AI techniques, programmers may be able to divorce themselves entirely from concerns about the intricacies of hardware platforms (threading models, memory architectures, cache line sizes, paging and so on) and busy themselves with only the specifics of an algorithm, leaving everything else to the compiler in the case of C++ or the virtual environment in the case of more managed offerings like C#, Java, JavaScript, etc. (JavaScript, for example, has no real concept of threading but does traffic in async patterns such as promises and futures.) Anyway, my main point was that the x86 platform in particular, with which most of us are familiar, has fast hardware support for interlocked memory accesses which map very closely to the semantics of the atomic class for fundamental types. That support has been there since the original 8086, where it was used in communication with the 8087 floating-point coprocessor; now all that functionality is in the CPU proper. This is true also of C# with its Interlocked class. C# goes one better than C++ with its *lock* keyword for locking critical sections, which entirely abstracts out the locking mechanism to whatever is most performant or best suits the usage scenario under the covers. I expect C++ will likely go in a similar direction, as it is doing with coroutines for abstracting concurrency, as the committee works at making C++ an ever-higher-level language!

@aiviskri 3 years ago

Very nice explanation, thanks!

@ShaileshDagar 2 years ago

This is such a great video.

@ahmadafkande1662 4 years ago

Awesome video, Thanks!

@kodaloid 5 years ago

Great video man :)

@abdullahalmosalami2373 1 year ago

Thank you for the example! It would have been cool to also possibly inspect the generated assembly to reaaally get a feel of how the compiler is treating the atomic variable differently, but I know that's quite involved. Thanks for the link to the video as well, I'll be off to that now.
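
For anyone who wants to look at the generated assembly, a pair of functions like the ones below can be pasted into Compiler Explorer. This is a minimal sketch with made-up names (`plain_total`, `atomic_total`, `add_plain`, `add_atomic`), not the program from the video. On x86-64 with optimizations enabled, mainstream compilers typically emit an ordinary `add` for the first function and a `lock add` for the second, but the exact output depends on the compiler, flags and target.

```cpp
#include <atomic>

long plain_total = 0;               // ordinary variable, not thread safe
std::atomic<long> atomic_total{0};  // atomic variable

// Plain read-modify-write: another thread may interleave with it.
void add_plain(long value) {
    plain_total += value;
}

// Atomic read-modify-write: the whole update is indivisible.
void add_atomic(long value) {
    atomic_total += value;
}
```

Comparing the two function bodies side by side at `-O2` is usually enough to see how the compiler treats the atomic differently.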

@dorianmajerowski7895 2 years ago

Great video, thank you :)

@Srkulkarni 3 years ago

Very nice explanation. It would be nice if you could make a video on memory_order.

@ThePaullam328 4 years ago

Very clear. This video makes me think multi-threading isn't that hard after all. Very nice!

@DavePoo 4 years ago

Thanks. I kind of agree that multi-threading isn't as difficult and scary as it may first appear. However, it's worth noting that this video only really covers one of the aspects of writing code that is thread-safe. Actually writing good multi-threaded code that is fast and efficient and gets good concurrency is quite often where all the work is. It's very easy to end up writing multi-threaded code that doesn't get good concurrency and is worse than just running on a single thread. A good understanding of the underlying nuts and bolts of thread-safe code is a prerequisite for writing good multi-threaded code that gets good concurrency.

@ThePaullam328 4 years ago

@@DavePoo Good point, good concurrency is not easy to achieve for complex multi-threaded code. By the way, the atomic library looks fairly confusing at first glance, with load/store, concepts like memory order, and operations like acquire/release. It would be great if there were a video explaining those. :)
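
As a rough illustration of the load/store and acquire/release vocabulary mentioned above, here is a minimal sketch of one thread publishing a value to another. The function name `publish_and_read` is made up for this example, and it shows only the most common pattern rather than the full memory model.

```cpp
#include <atomic>
#include <thread>

// One thread writes ordinary data and then "publishes" it through an atomic
// flag; the other waits for the flag and then reads the data.
int publish_and_read(int value) {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // flag that publishes the payload
    int observed = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // wait until the flag is set
        }
        observed = payload;  // safe: the acquire load saw the release store
    });

    producer.join();
    consumer.join();
    return observed;
}
```

The release store guarantees that the write to `payload` is visible to any thread whose acquire load sees `ready == true`. Leaving the memory order arguments out gives the default, `std::memory_order_seq_cst`, which is stronger and also correct here.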

@DavePoo 4 years ago

@@ThePaullam328 I can recommend the book "C++ Concurrency In Action" by Anthony Williams. It covers everything multi-threaded and uses the std libraries.

@ThePaullam328 4 years ago

@@DavePoo Herb Sutter's "atomic<> Weapons" talk also explains how atomics work, and does it like a charm.

@MattPryze 5 years ago

Good to know! Thanks :)

@DavePoo 5 years ago

Glad it helps. It's quite a simple topic in the end, but something you should know if you are ever writing code that has to be multi-threaded. For example, if you look into the implementation of any reference-counted smart pointer, you'll probably find it uses atomic increments and decrements to update the reference count to ensure thread safety.
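
A heavily simplified sketch of that idea is shown below. `RefCount` and `hammer_ref_count` are hypothetical names for illustration; real smart pointers such as `std::shared_ptr` are considerably more involved, and their internals differ between standard library implementations.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical, simplified reference counter.
class RefCount {
public:
    // Take an additional reference.
    void acquire() { count_.fetch_add(1, std::memory_order_relaxed); }

    // Drop a reference; returns true if the caller dropped the last one.
    bool release() {
        return count_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    long use_count() const { return count_.load(std::memory_order_relaxed); }

private:
    std::atomic<long> count_{1};  // the creator holds the first reference
};

// Several threads take and drop references concurrently; returns the final
// count, which should be back at 1 (the creator's reference).
long hammer_ref_count(int thread_count, int rounds) {
    RefCount refs;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&refs, rounds] {
            for (int i = 0; i < rounds; ++i) {
                refs.acquire();
                refs.release();
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return refs.use_count();
}
```

With a plain `long` instead of `std::atomic<long>`, concurrent increments and decrements could be lost and the count could drift, which for a smart pointer means leaking the object or freeing it too early.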

@prashantaithal507 4 years ago

good one!

@gautierlathuiliere6072 3 years ago

Nice introduction to atomics! However, std::atomic can be an order of magnitude faster than mutexes because its implementation doesn't always rely on locks. Depending on the size of the atomic value and the hardware capabilities, you can in fact have pretty fast atomics. They're slower than plain variables but still way faster than mutexes.

@ahmadalastal5303 3 years ago

Mutexes can actually be faster than atomics in some cases. The only type guaranteed to be lock-free is std::atomic_flag; the other std::atomic specializations are not guaranteed to be lock-free and will fall back to locks if the operation cannot be implemented lock-free on the target. Take a look at Fedor Pikus's CppCon talk at the following link: kzfaq.info/get/bejne/kLd2rbCXra_cnps.html. As a rule of thumb, you need to profile both the mutex and the atomic implementation of your code. It is very true that atomics depend on hardware, so you need to know your hardware and the target hardware you are writing your application for, which in most cases is not available to you, not to mention that x86 and x64 handle atomic operations differently. Did you know that, before C++20, the C++ standard had no atomic arithmetic operations (such as fetch_add) for floating-point types?
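
Whether a given `std::atomic` specialization is lock-free can be checked rather than guessed. The sketch below assumes C++17 for `is_always_lock_free`; `Big`, `always_lock_free` and `atomic_add` are names made up for illustration, and the answers it gives are platform dependent.

```cpp
#include <atomic>

// Hypothetical 64-byte type, too large for a single atomic instruction on
// mainstream CPUs.
struct Big {
    char bytes[64];
};

// True if std::atomic<T> never needs a lock on this platform (C++17).
template <typename T>
constexpr bool always_lock_free() {
    return std::atomic<T>::is_always_lock_free;
}

// Atomic floating-point add that also works before C++20: retry a
// compare-exchange until no other thread changed the value in between.
double atomic_add(std::atomic<double>& target, double delta) {
    double expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + delta)) {
        // On failure, `expected` now holds the current value; try again.
    }
    return expected + delta;
}
```

On typical desktop targets `always_lock_free<int>()` is true and `always_lock_free<Big>()` is false, but the standard guarantees neither. From C++20 onwards, `std::atomic<double>` also offers `fetch_add` directly, so the compare-exchange loop is only needed for older standards.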

@__3093 3 years ago

good job!

@industrialdonut7681 3 years ago

Interesting! I was thinking it was like CUDA at first where you have hardware 'atomic operations' instead of atomic variables

@majestif 5 years ago

Would be nice if you used modern C++ (std::accumulate, std::rand, std::generate_n)

@DavePoo 5 years ago

Sorry, I'm not uber familiar with the std library as I haven't really used it much in my day job. I showed this example using std::atomic as many people will probably want to use the standard library to do this. If you were using the Windows API, for instance, you might call "InterlockedAdd" to do a similar operation.

@alexb6568 4 years ago

No offense to the Indians; however, finally a clear instructional video from someone who speaks English. Keep up the good work.

@sabyabhoi8841 4 years ago

None taken bruh. I'm an Indian myself and it's indeed quite irritating to parse the weird dialects while trying to concentrate.

@YahyaRahimov 1 year ago

Thanks for the great video! What if we created sum1, sum2 and sum3, one for each thread, then added them up and printed the result? Would that improve performance and prevent the incorrect results we got with a normal long? Sorry for the noob question 😂

@hordi1ful 1 year ago

Yes, it does, but the idea of this video is to explain how and when to use std::atomic types.
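
The per-thread sums idea from this exchange could look roughly like the sketch below. `parallel_sum` is a hypothetical function written for illustration, not the code from the video. Each thread adds up its own slice into a private variable, so no atomic is needed; the partial results are combined only after every thread has been joined.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `values` using `thread_count` threads, one partial sum per thread.
long long parallel_sum(const std::vector<int>& values, std::size_t thread_count) {
    if (thread_count == 0) {
        thread_count = 1;
    }
    std::vector<long long> partial(thread_count, 0);  // one slot per thread
    std::vector<std::thread> threads;
    const std::size_t chunk = (values.size() + thread_count - 1) / thread_count;

    for (std::size_t t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(t * chunk, values.size());
        const std::size_t end = std::min(begin + chunk, values.size());
        threads.emplace_back([&values, &partial, t, begin, end] {
            long long local = 0;  // accumulate privately first
            for (std::size_t i = begin; i < end; ++i) {
                local += values[i];
            }
            partial[t] = local;  // each thread writes only its own slot
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    // All threads have finished, so reading the slots here is safe.
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Accumulating into a local variable and writing the slot once also keeps the threads from repeatedly writing to neighbouring memory, which can otherwise slow things down through cache-line sharing.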

@mrreese2342 2 years ago

- I used to do a little coding myself... If you want to use multi-threading, I suggest atomic variables.
- Wait, that works?
- Yes, that's why I suggested it.
That's why I'm here 😂

@ahmadalastal5303 3 years ago

Passing an argument to a thread function that takes a const & will not pass it by reference: std::thread copies its arguments, so you need to use std::ref (or std::cref) to do so.
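
A minimal sketch of the point about `std::ref` follows; `add_all` and `sum_on_worker` are names made up for this example.

```cpp
#include <functional>
#include <thread>
#include <vector>

// Worker function: reads `values` and writes the result into `total`.
void add_all(const std::vector<int>& values, long& total) {
    for (int value : values) {
        total += value;
    }
}

long sum_on_worker(const std::vector<int>& values) {
    long total = 0;
    // std::thread copies its arguments by default. std::cref and std::ref
    // wrap them so the worker really sees the caller's objects.
    std::thread worker(add_all, std::cref(values), std::ref(total));
    worker.join();
    return total;
}
```

Without `std::cref`, the worker would receive its own copy of the vector, which is harmless here but wasteful. Without `std::ref`, the call would not compile at all, because the copied `long` cannot bind to the non-const reference parameter.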

@stansem6806 4 years ago

Finally clear English, what a relief....)))

@alltheway99 1 year ago

Sorry, why does adding random numbers always return the same result?

@alltheway99 1 year ago

Is the benefit of multithreading not canceled by the overhead of atomic?

@JaSamZaljubljen 4 years ago

I had to use more threads (5-8) before I started getting different numbers from time to time, whereas your program seems to generate different numbers every time with just number_of_threads=3. Why does that happen? I have a 4-core processor. Good video, by the way.

@DavePoo 4 years ago

I think that is one of the problems with multithreaded programming. You can write code that is incorrect and it can work fine "most" of the time and then just fail. Race conditions depend on whatever is happening at the time, and many things can affect the result (like the number of cores in the machine, the CPU vendor, the temperature in the room, or what your virus checker is doing in the background). You might find that your machine was actually running this program single-threaded most of the time (as other cores were busy? Is that a 2-core / 4-thread machine?), so you were not getting any conflicts (you got lucky). My machine has 8 cores / 16 threads it can run on, so it would almost always be able to run that code in parallel and show up bugs easily.

@JaSamZaljubljen 4 years ago

@@DavePoo AMD A10-5757M: Number of Cores / Threads: 4 / 4. Actually, I just tested my program and it turns out that the busier the machine got (running games, YouTube, various applications), the more often the error happened for n_of_threads=3. MUCH, MUCH more often than when I ran it while my PC was doing nothing (from roughly every 1000th try to every 20th; I was running it in a loop, of course, I didn't press a button 1000 times). Like you said, it looks like it's very architecture dependent.

@DavePoo 4 years ago

@@JaSamZaljubljen Well, that goes to show how hard it is to predict what will happen if the code is not correct. My last computer was an AMD A10.

@csmellow4644 3 years ago

Dumb question. Would it be a problem if multiThreadSum is accessed by multiple threads at the same time? I mean I read that we cannot access a memory location at the same time by multiple threads. Thanks :)

@DavePoo 3 years ago

Do you mean before I made it atomic? Once it was made atomic, it does get accessed by multiple threads safely (which was the point of the video). Anyway, if I assume you meant BEFORE I made it atomic, then you have to be clear what you mean by "access", as that could be reading, writing, or reading and writing. And just to make things a little more complex, whether the access is safe may depend on the processor architecture (meaning: will the processor always let you see the "whole" of a multi-byte value?). Assume the value in question is correctly aligned in memory and the CPU supports it. If you only want to read, then any number of threads can access the value safely. You may also be able to read from as many threads as you want while writing from one thread at the same time (again with the caveat that the CPU architecture needs to be OK with this if it's a multi-byte value, and usually the value would have to be correctly aligned; the C++ standard still treats this as a data race unless the variable is atomic). And finally, if you want to read and write from many threads, then no, it doesn't crash the CPU, but as you see in the video the results become undefined, as writes from one thread stomp on writes from another thread.
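
The safe version described here, where many threads read and write the same total, could be sketched as follows. `atomic_sum` is a hypothetical stand-in for the program in the video, whose exact code is not reproduced in this thread.

```cpp
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Every thread adds its share of `values` into one shared atomic total.
// With thread_count == 0 no work is done and the result is 0.
long atomic_sum(const std::vector<int>& values, std::size_t thread_count) {
    std::atomic<long> total{0};
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < thread_count; ++t) {
        threads.emplace_back([&values, &total, t, thread_count] {
            // Strided split: thread t handles elements t, t + thread_count, ...
            for (std::size_t i = t; i < values.size(); i += thread_count) {
                total += values[i];  // atomic read-modify-write, no lost updates
            }
        });
    }
    for (auto& thread : threads) {
        thread.join();
    }
    return total.load();
}
```

Replacing `std::atomic<long>` with a plain `long` turns the `+=` into an unsynchronized read-modify-write, which is the data race discussed above: the program still runs, but the total may come out wrong.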

@csmellow4644 3 years ago

I was referring to read but this detailed explanation was insightful. Thank you very much really appreciate it :).

@YouLilalas 4 years ago

Please increase the font size. It’s very difficult to read on a small screen.

@DavePoo 4 years ago

Sorry, but it's a bit late now; I did this over a year ago.

@DavePoo 4 years ago

You'll be pleased to know that in my latest video I went with a whopping 300% text zoom.

@czitels1856 3 years ago

What is that picture at 14:27-14:28?

@DavePoo 3 years ago

My desktop background, it's concept art from Halo 5

@rsarson 2 years ago

Nice explanation. One small criticism: the typing-mistake corrections on nearly everything you enter are distracting.

@DavePoo 2 years ago

I never get it right first time

@antonfernando8409 2 years ago

I am religiously devoted to the Linux POSIX pthread APIs, but I suppose there are benefits to using the C++ thread classes, especially thinking about pthread condition variables. Anyway, cool to learn about atomic methods. A general question: with multi-core architectures, how would these atomic operations work to achieve concurrency? I guess with native OS APIs, the number of cores and how threads are distributed onto them is hidden away inside the OS itself.

@softwareEngineer77 5 years ago

I stopped watching when I saw "void main".

@DavePoo 5 years ago

Daunting isn't it.

@josephlagrange9531 10 months ago

Electrons don't exist, dude. So C++ is a poorly optimized language.

@turdwarbler 1 year ago

This video is so overcomplicated it's untrue. It's 15 minutes long but you don't mention atomics until approximately 10:38. You have got yourself tied up in such a complicated example when all you need is 2 threads incrementing a volatile and a non-volatile variable and seeing the result. You don't need vectors, or rand(), or summing arrays or whatever. When teaching a new topic, find the simplest example you can to explain and demonstrate it, and get rid of all the nonsense you are spouting.