Branchless Programming in C++ - Fedor Pikus - CppCon 2021

150,096 views

CppCon

2 years ago

cppcon.org/
github.com/CppCon/CppCon2021
---
Have you ever written code like this:
void f(bool b, long x, long& s) { if (b) s += x; }
Probably you have. Would you like me to tell you how much performance you left on the table? With a small change, that function could be made 2.5 times faster.
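The exact transformation Fedor benchmarks comes later in the talk; as a sketch of the general idea, the conditional add can be replaced with an unconditional one scaled by the condition (this particular rewrite is an illustration, not necessarily the variant measured in the talk):

```cpp
// Branchy original from the abstract.
void f_branchy(bool b, long x, long& s) { if (b) s += x; }

// Branchless sketch: bool promotes to 0 or 1, so the add always
// executes and contributes either 0 or x. No branch to mispredict.
void f_branchless(bool b, long x, long& s) { s += b * x; }
```

Both versions compute the same sums; the second simply trades a conditional jump for a multiply the CPU can always execute.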
What about this code:
if (a[i] && b[i]) do_something(); else do_something_else();
Would you believe me if I told you that, under some not-so-exotic conditions, this line runs four times slower than it could? It’s true, and I’m going to show you when and why.
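One way this slowdown happens: `&&` short-circuits, which can compile to two data-dependent branches. When the operands are already-computed booleans with no side effects, `&` evaluates both and leaves at most one branch. A minimal sketch (the function name and data layout are illustrative, not from the talk):

```cpp
#include <cstddef>

// Counts elements where both flags are set.
// With `&&`, each operand can become its own conditional branch
// (short-circuit evaluation); with `&`, the two bools are combined
// arithmetically and the loop body itself is branch-free.
int count_both(const bool* a, const bool* b, std::size_t n) {
    int hits = 0;
    for (std::size_t i = 0; i < n; ++i)
        hits += a[i] & b[i];   // note: bitwise &, not &&
    return hits;
}
```

Whether this pays off depends on how predictable `a[i]` and `b[i]` are, which is the talk's point: measure.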
This presentation will explain how modern CPUs handle computations to ensure that the hardware is used efficiently (which is how you get high performance from the processor). We will then learn how conditional code disrupts the ideal flow of computations and the countermeasures employed by the CPU designers to retain good performance in the presence of such code. Sometimes these measures are sufficient, often with the help of the compiler. But when they aren’t, it is up to the programmer to recover lost performance by coding with fewer branches.
---
Fedor Pikus
Fedor G. Pikus is a Chief Engineering Scientist in the Design to Silicon division of Mentor Graphics (a Siemens business). His earlier positions include Senior Software Engineer at Google and Chief Software Architect for Calibre PERC, LVS, DFM at Mentor Graphics. He joined Mentor Graphics in 1998, when he switched from academic research in computational physics to the software industry. Fedor is a recognized expert on high-performance computing and C++; he has presented his work at CppCon, SD West, and DesignCon, published in Software Development Journal, and is an O’Reilly author. His responsibilities as a Chief Scientist include planning the long-term technical direction of the Calibre products, directing and training the engineers who work on them, the design and architecture of the software, and research into new design and software technologies. Fedor holds over 25 patents and has over 100 papers and conference presentations on physics, EDA, software design, and the C++ language.
---
Videos Filmed & Edited by Bash Films: www.BashFilms.com
YouTube Channel Managed by Digital Medium Ltd: events.digital-medium.co.uk

Comments: 176
@doc7115 · 1 year ago
That "session is over" was definitely not predicted.
@DavidsKanal · 4 days ago
It definitely wasn't branchless
@denispriyomov6086 · 2 years ago
Very interesting with a bit of a drama at the end: "Your session is over". 😏
@douggale5962 · 1 year ago
I suggest you look at `sudo perf top -e branch-misses` - it will tell you exactly where there are too many mispredicts. Hit Enter on the top entry and drill down to the function it's in. Build with debug info.
@playman350 · 1 year ago
I didn't even know you could use top with perf! Learn something new every day.
@happycats-go8sv · 1 year ago
1:03:30 He was ready for some more questions and then it ended suddenly with well-deserved applause. It's like games where a cut-scene would end a fight with infinitely spawning enemies. "Well"
@DasAntiNaziBroetchen · 1 year ago
Man that ending is depressing.
@garychap8384 · 7 months ago
Impressively, the talk ends in a branch prediction failure. Fedor's reaction is amazing : ) From now on, that's exactly what I'll imagine the CPU doing.
@stephenjames2951 · 2 years ago
Thanks for repeating the questions.
@cryp0g00n4 · 2 years ago
underappreciated comment
@binarytv2904 · 2 years ago
"The closer you sit, the better the performance" - I see what you did there :)
@Thiago1337 · 2 years ago
Session is over, thank you!
@spacelem · 1 year ago
Could have said "I'm sorry, we've run out of time, and we need to prepare the next session".
@tolkienfan1972 · 2 years ago
With x86, the biggest effect likely/unlikely has is to rearrange the instructions so that the unlikely branch is moved to a different area of the program. This makes the likely path a serial instruction stream, which is good for the instruction cache. It is also good for branch prediction in the case where there is no entry yet in the branch prediction table, e.g. the first time through.
@dagoberttrump9290 · 1 year ago
Thanks for the input!
@masondeross · 1 year ago
I watched this talk sometime back around when it was first uploaded, and I am almost certain I missed that joke at the beginning: "it's a talk on performance; the closer you sit, the better the performance." I am so glad YT recommended it to me again.
@serhii-ratz · 1 year ago
This is an absolutely mind-blowing session
@endian675 · 2 years ago
Awesome talk! It's rare that I would ever watch an hour-long video on YouTube, but by the time I got to the end of this one I wanted to go back and watch it again.
@CppCon · 2 years ago
Glad you enjoyed it!
@piotrarturklos · 2 years ago
Excellent talk about the branch predictor and the ways to take advantage of it. When I first saw the title, though, I initially thought the topic was functional programming.
@Ryan1729 · 2 years ago
Having looked at the comments before watching the entire talk, I was a little worried the talk would end before the speaker got to close the talk. So I'll mention here that the cut off happened during the question section at the end, after the last slide. Still somewhat abrupt!
@JohnDoe-ly4ls · 2 years ago
It was a great ride! Thanks Fedor & @CppCon 👍
@CppCon · 2 years ago
Our pleasure!
@GeorgeTsiros · 2 years ago
Always happy to see Fedor!
@Marius-ir1qn · 2 years ago
what a way to close the session
@GeorgeTsiros · 2 years ago
Seriously, what's up with that? I've seen surgeries with less strict timetables... okay, I haven't, but you get the idea.
@spacelem · 1 year ago
That was amazingly rude. "I'm sorry, we've run out of time and need to prepare the next session" would have been fine.
@jake1367 · 1 year ago
could not have predicted that
@ernststravoblofeld · 1 year ago
The Tyler Durden close.
@marcossidoruk8033 · 11 months ago
​@@GeorgeTsiros Most conferences work that way; it's the norm. You'd know it if you have participated in one.
@boguscoder · 2 years ago
The hand-drawn illustrations are a great touch
@bsuperbrain · 1 year ago
The Art of Writing Efficient Programs is a very good introductory book for this performance-focused universe, highly recommended
@ugurcan969 · 2 years ago
Great talk and I have learned a lot of stuff in only 1 hour :)
@AhmedSam · 2 years ago
Loved it. Thanks Fedor!
@CppCon · 2 years ago
Glad you enjoyed it!
@jbar7742 · 2 years ago
Great talk, unfortunate ending though :D Can anyone tell me which CPU it was that had the faulty branch predictor?
@vasiliynkudryavtsev · 2 years ago
Does that really matter? Are you going to write code only for one specific CPU, e.g. ARM?
@jbar7742 · 2 years ago
@@vasiliynkudryavtsev no, I just asked out of curiosity
@z140140 · 2 years ago
Any modern CPU will do the same; his code was written specifically to fool modern CPUs
@douggale5962 · 1 year ago
If you want to make it as hard as possible for old processors, do taken, taken, not taken, not taken, repeatedly. Every branch will mispredict. They had a two-bit counter that incremented if taken and decremented if not taken, and it predicted taken if the counter for the branch was >= 2. Once it has saturated one way, it has to mispredict twice in a row to start predicting the other way. Modern processors are not so dumb: they keep a bunch of histories and select a history based on whether the last few branches were taken or not (to capture correlation).
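The two-bit scheme described above is easy to simulate. One caveat: with the standard saturating counter (predict taken when the counter is 2 or 3), the taken-taken-not-not pattern misses 3 of every 4 branches in steady state rather than literally every one; a 100% miss rate requires the counter to ping-pong across the prediction threshold. A toy model (my sketch, not from the talk):

```cpp
#include <string>

// Toy 2-bit saturating-counter predictor: predict "taken" when counter >= 2.
struct TwoBitPredictor {
    int counter = 3;  // start in the "strongly taken" state
    bool predict() const { return counter >= 2; }
    void update(bool taken) {
        if (taken) { if (counter < 3) ++counter; }
        else       { if (counter > 0) --counter; }
    }
};

// Count mispredictions over a repeated outcome pattern, e.g. "TTNN".
int count_misses(const std::string& pattern, int repeats) {
    TwoBitPredictor p;
    int misses = 0;
    for (int r = 0; r < repeats; ++r)
        for (char c : pattern) {
            bool taken = (c == 'T');
            if (p.predict() != taken) ++misses;
            p.update(taken);
        }
    return misses;
}
```

Running the "TTNN" pattern through this model shows the steady-state 75% miss rate; an always-taken branch never misses.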
@kaychubies · 8 months ago
Best I can tell, it may have been the Intel Pentium 4 Willamette, also called NetBurst. I can't find anything online that says what the speaker said -- namely that it reported the opposite of what it meant to -- but apparently it had pretty bad branch prediction. (I apologize in advance if I've mixed up some terms. CPU architecture isn't my strong suit!)
@younghsiang2509 · 2 years ago
The session is over, thank you......
@KeyserTheRedBeard · 2 years ago
Cool upload, CppCon. I smashed that thumbs up on your video. Keep up the excellent work.
@CppCon · 2 years ago
Much appreciated
@leshommesdupilly · 1 year ago
Cpu and compiler engineers are mad geniuses lol
@thorham1346 · 11 months ago
Indeed.
@jimr5855 · 1 year ago
Really good presentation, thank you!
@huzifa30 · 2 years ago
Thank you very much, exciting talk
@innovationscode9909 · 2 years ago
Marvellous. Thanks. Deep thinking...
@CppCon · 2 years ago
You're most welcome
@DankAirsoft · 2 years ago
Great summary of the subject, I learnt a lot
@CppCon · 2 years ago
Glad to hear it!
@krellin · 2 months ago
These concepts apply not only to C++ but to many other languages. Very good presentation
@u263a3 · 2 years ago
The session had ended 🤣
@serhii-ratz · 1 year ago
“Predictor is good in prediction. We are not” - nice :)
@MMIIRRKKOO · 2 years ago
Great talk!
@douggale5962 · 1 year ago
Nice to see a talk like this when there are so many people scoffing at branch-free code because they saw one example where a perfectly predictable branch skipped a few ops and ended up slightly faster.
@lepidoptera9337 · 1 year ago
How many times in your life did you need those 0.3ns back? :-)
@toufikoran8416 · 2 years ago
Very informative, thanks
@CppCon · 2 years ago
Glad it was helpful!
@BenjaminBuch · 1 year ago
Truly great talk, thank you!
@bobby9568 · 2 years ago
Really well presented!
@CppCon · 2 years ago
Thank you kindly!
@PedroOliveira-sl6nw · 2 years ago
Excellent talk. Very sad to see it cut short but it is what it is. I have the book but haven't started yet. I would actually like to know/test if this can be done in GPUs too.
@jorcyd · 2 years ago
Indeed, a lot of GPU-based implementations of math functions (such as abs) are intrinsically branch-free, as divergence between threads in a GPU tends to degrade performance.
@DagarCoH · 2 years ago
It can, and it is done. Many aspects to talk about when optimizing for GPUs
@SianaGearz · 1 year ago
No need, pretty much? All shader instances within an execution unit will cause every branch whose prerequisites are fulfilled in any of the instances to be taken (effectively by all of them), so while you branch at source level, the underlying implementation is effectively branch-free by hardware and system design. Furthermore, the cost of branching, when it rarely does happen (aggregated), is pretty much zero: during the resulting latency window other work is performed, and another context is ready to switch in by then. GPUs are insanely good at latency hiding and at maintaining high nominal utilisation.

Branching at source level is actually recommended: if none of the instances in an execution group take a given branch, the whole group can skip it; otherwise the whole group effectively takes it. In the end you don't really have much control over it, and often you even pay the associated extra cost of branch-free execution (those other branches being executed as well) in spite of nominally branching.

It all goes back to the very first shader model, which had no branching provision at all: at machine-language level there were only selective mov instructions, no jump instructions, while the shader compiler allowed and recognised branches at source level and converted them correspondingly. Loops had to have a trivially provable maximum iteration count and would be fully unrolled by the compiler, naturally.
@robertjmccabe · 2 years ago
Very creative talk
@SerFluffikins · 2 years ago
In examples 1 and 2: isn't there an additional data dependency because both the 'if' and the 'else' branch write to a1? It seems a2 remains at 0 in those examples.
@omarkhan6217 · 2 years ago
give him another hour
@christophclear1438 · 2 years ago
Great talk! I would have liked to ask whether the array-of-two-bools indexing micro-optimization for removing a branch is generally preferable to what I would have done more naively: treating each bool as an int, multiplying the two values by 1 and 0, and adding or or'ing them.
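For reference, the two branchless shapes being compared look roughly like this (the function names are mine, and which one wins on a given compiler/CPU is exactly the kind of question Fedor would answer with "measure"):

```cpp
// Branchy baseline: pick x when b is true, else y.
long pick_branchy(bool b, long x, long y) { return b ? x : y; }

// Shape 1, from the talk: index an array of the two candidates.
long pick_indexed(bool b, long x, long y) {
    const long c[2] = { y, x };
    return c[b];            // bool converts to index 0 or 1
}

// Shape 2, from the comment: scale each candidate by the bool as 0/1.
long pick_multiplied(bool b, long x, long y) {
    return x * long(b) + y * long(!b);
}
```

All three are equivalent for any inputs; they differ only in the instructions the compiler emits (conditional jump vs. load vs. multiply/add).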
@EgD996 · 2 years ago
I guess the answer could be: measure it
@BowMcGee · 2 years ago
Super interesting topic. Great presentation. I wonder where the most gain is with current software (current hardware is criminally underused due to bad software). Is it branch prediction, is it parallelism, is it cache-oriented code? Or is it something else?
@dmitrykargin4060 · 2 years ago
Software is vastly different. We develop solvers for different forms of sparse linear equations. This particular case of algorithms is often memory-bound. CPU spends 100 cycles for next portion of data to be loaded from RAM and about 10 cycles to process. So branch misprediction is not a problem in our case. Cache misses and memory speed are.
@xeridea · 2 years ago
Depends on the code. General software probably branch prediction. Video, audio, graphics, or other things easily parallelizable probably parallelism and cache. Using SIMD instructions can easily provide an absolutely massive performance gain (perhaps 5-8x) where it makes sense. Multithreading can help with a lot of apps, not just those capable of benefiting from SIMD, though still not always a perfect or easy solution. Cache optimization can benefit, there are some general practices to prevent obvious slowdowns (such as how you iterate multi dimensional arrays), but can be a bit tricky otherwise.
@iuppiterzeus9663 · 2 years ago
I'd say memory layout and parallelization
@bva0 · 2 years ago
@@dmitrykargin4060 I find this quite interesting. Do you work with direct solvers or iterative solvers?
@SianaGearz · 2 years ago
I would say based on relative performance of different processors with different implementations of branch prediction, different number of cores, different SIMD configurations, different cache capacities: all of the above. But branch prediction has a knock on effect on everything, like your high memory latency through the cache hierarchy doesn't really matter so much if you don't do useless requests. On the other hand, bad memory locality just has such a devastating impact on performance that it's really unfortunate to ignore.
@uy-ge3dm · 2 years ago
Surprised that the humble s += b*x could be so much more efficient. I've been using that one for years
@jbar7742 · 2 years ago
Though I would have really expected the compiler to pick up on this
@47Mortuus · 2 years ago
What's even better is a negation of either 1 or 0 with a bitwise AND (-1 is all one-bits). 2-clock-cycle latency (NEG; AND) vs 3-4 (multiply): s += -b & x;
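The mask trick works because bool promotes to 0 or 1, so negating it yields 0 or an all-ones bit pattern; it applies to integer operands only (and relies on two's complement, which C++20 guarantees). A self-contained sketch:

```cpp
// Mask trick: -b is 0 or -1 (all one-bits), so (-b & x) is 0 or x.
long add_if_masked(bool b, long x, long s) {
    return s + (-static_cast<long>(b) & x);
}

// Equivalent branchy reference for comparison.
long add_if_branchy(bool b, long x, long s) { return b ? s + x : s; }
```

The two functions return identical results for all inputs, including negative `x`.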
@DasAntiNaziBroetchen · 1 year ago
@@47Mortuus I assume that one only makes sense if you are adding integers.
@OmarChida · 2 years ago
Great talk, as always quality content from Fedor! I will definitely buy his book as I'm really interested in these types of optimisations. Also, can someone explain why the optimisation with function pointers doesn't work when functions are inlined?
@rauldragu9447 · 2 years ago
Functions wouldn't work *ever*. A function call is a jump, so basically a branch. You could get the pointer branchlessly, but then you would have to stall until the actual instructions that make up the function are loaded. It's like loading an array: no matter how fast you get the pointer to the array, when you dereference it you still have to wait for the array to get cached. Same with functions.

If you just had if (cond) foo(); else bar(); you would always have a fast branch and a very slow branch, but if you introduce indirection with the function pointer you always have a pretty slow branch, having to cache the function instructions every single time, and only after you know where they are stored. If the condition was sufficiently unpredictable it might be worth it, but that seems unlikely.

For an inlined function it's even worse, because an inlined function doesn't need to build a new stack frame: all the instructions are right there, inlined. Maybe even the caching of instructions is faster this way. If you use pointers you are forcing the compiler not to inline, since it needs to be able to point to anything.
@qwertyuiop-cu2ve · 9 months ago
@@rauldragu9447 I think what may work is call both functions, store the results in the array, then use the condition to index into the result of the correct function. But if the function is more expensive than a branch misprediction on deciding which function to call, then you should just use the branch. Like Fedor Pikus said, measure! If performance doesn't matter enough for you to make a benchmark and measure, then you shouldn't do any special tricks either.
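A sketch of the compute-both-then-select idea from the comment above (the two functions are hypothetical stand-ins; as the comment says, this only pays off when running both is cheaper than a mispredict):

```cpp
// Hypothetical stand-ins for the two candidate functions.
long calc_a(long x) { return x * 3 + 1; }
long calc_b(long x) { return x * 5 - 7; }

long select_branchless(bool cond, long x) {
    // Evaluate BOTH functions unconditionally, then pick by index.
    // No conditional jump; the condition only selects an array slot.
    const long results[2] = { calc_b(x), calc_a(x) };
    return results[cond];   // cond converts to index 0 or 1
}
```

This trades wasted work (the discarded call) for the removal of an unpredictable branch; a benchmark decides whether that trade is profitable.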
@MarcoBergamin · 1 year ago
Very interesting, thanks
@CppCon · 1 year ago
Glad you think so!
@nitsanbh · 3 months ago
18:30 Register aliasing blew my mind. I had no idea eax is a logical register!
@Minty_Meeo · 2 years ago
In PowerPC, the branch conditional instruction has the prediction baked into it. I've seen, for example, MetroWerks compiler output where its static predictions were very primitive (99% of branches forward were unlikely, 99% of branches backward were likely). I've yet to use the C++20 attributes, but [[likely]] and [[unlikely]] probably give you manual control of the prediction bit for those branches, which is neat. Debug assertions and nullptr checks were totally the first thing I thought of when I learned about these attributes. Even if the compiler is smart enough to recognize a nullptr check and mark it as unlikely (I'm sure it is), it is nice to be able to self-document it with an attribute.
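As a sketch of the self-documenting use the comment describes, assuming a C++20 compiler (the function is hypothetical):

```cpp
#include <stdexcept>

// C++20 [[unlikely]] marks the error path as cold; compilers typically
// move it out of the hot instruction stream, and it documents intent.
int deref_checked(const int* p) {
    if (p == nullptr) [[unlikely]]
        throw std::invalid_argument("null pointer");
    return *p;
}
```

The attribute is a hint, not a command: the generated code must still handle both paths correctly, and (per the thread below) it should only be applied after profiling.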
@marcossidoruk8033 · 11 months ago
Yes, the compiler will always assign a lower probability to a null pointer check indeed. And no, never use likely or unlikely without profiling and looking at the assembly. The C++ standard says that likely/unlikely means "arbitrarily likely", and what that means in clang and gcc is 80%/20% respectively. It turns out that, because of things such as null pointer checks and early returns, the compiler can assign a branch a probability lower than 20%; but because unlikely always means 20%, marking it as unlikely will actually make it *more* likely. The standard should have a feature to tell the compiler exactly how likely a branch is ([[likely 5%]] or something like that), but it doesn't, and thus unlikely may not do what you expect it to do.
@mytech6779 · 7 months ago
@@marcossidoruk8033 That is much more of an implementation failure than a standard failure.
@fzz5964 · 2 years ago
cool!
@ciCCapROSTi · 1 year ago
Really great topic and good info from Fedor, but his style takes some getting used to.
@47Mortuus · 2 years ago
The part about the compiler built-in likely/unlikely hints is completely off. It is just meta-information for the resulting ORDERING OF INSTRUCTIONS in the machine code. This is due to the programmer wanting to avoid (instruction) cache misses, since the branch marked as more likely will immediately follow the conditional check. Furthermore, the unlikely branches may be put at the very end of a procedure (function) or loop, with an unconditional jump back to where the branch ends. It may even prevent function inlining in order to improve performance that way (yes, sometimes forcing not to inline improves performance).
@vladimir0rus · 2 years ago
"Your session is over" :(
@GordonAitchJay · 2 years ago
What a damn shame that happened!
@Revoker1221 · 2 years ago
Another talk related to this speculative branching topic was given by Chandler Carruth here: kzfaq.info/get/bejne/aKuHmM2e0LHQqKc.html I'd recommend giving it a watch first if some of the foundational material at the beginning of this talk seemed hard to understand. It's a similar talk, except with a focus on loops rather than if-else branches.
@jieliu756 · 1 year ago
Thank you CppCon. Are the presentation slides available anywhere?
@CppCon · 1 year ago
Thank you for your comment. We are investigating this and hope to resolve the issue shortly.
@AlFredo-sx2yy · 11 months ago
@@CppCon idk what issue you're even trying to resolve lmao. They asked for the slides, this ain't a bug report :s Gotta love automated bot comments...
@bishop3000home · 2 years ago
Serious question: we repeat all the time "measure before changing", and compilers and processors may do a better job, so why do we think that after we change the code once, it stays the fastest code? We made the code less readable, removed a branch, and added more work. What if a new processor or a new compiler optimization comes along and the original code becomes better?
@SianaGearz · 2 years ago
You cannot strictly rely on the code being perpetually faster, indeed, but what choice do you have? You can't optimise for the unknown, and pessimising software in the hope that an architecture someday comes along that magically makes your slow code fast obviously doesn't help either.

One somewhat warranted assumption you can make is that processors will be optimised towards current, existing code; that they will be similar to today's processors, just better and faster, implementing similar techniques but making them more robust. When processors get released and benchmarked by the press, they have to compete on old code, which will be a mix of code profile-optimised for last-gen processors and canonical-style code. A related assumption is that even if processors open up a different way to do things, the old way will not be slower in absolute terms, just in relative terms.

So say your software's minimum specification is an AMD Bulldozer and you optimise towards that: you have eliminated the performance weakness on the weakest hardware you target, and that's enough. You don't actually care that the software fails to perform much quicker on a nominally faster processor; you have created software that is palatable for the entire userbase. I optimised towards the Pentium 4 in the past (well over 10 years ago), knowing it was the chip family most likely to struggle among all the processors people were likely to use, and it didn't make nearly as much of a difference either way on anything like an Athlon XP, a Pentium M, or a Core. Neither of these assumptions is absolutely true, but they are useful guiding principles in the absence of better information.

Of course, if you have the freedom - and there exists code which actually does this, though it's exceptionally rare - you can ship several different implementations in your software and profile them on startup or when you detect a system configuration change. That has its own hazards, as you introduce an indirection at entry into that code, so better be sure it's worth it.

Less risky is automating the benchmark of optimised vs. canonical code in your test suite, outside the shipped software. I think a performance benchmark should be part of your CI procedure: if there are performance regressions - usually an undesired side effect of feature work or bugfixing rather than any platform change - you'll know about them. These regressions are a day-to-day reality rather than a hypothetical differently-behaving future platform, and your software will gradually turn into a disaster if you ignore them.
@Adowrath · 2 years ago
That's why you don't just measure once, change, and then leave it. You continually measure. If you upgrade your compiler and there's a 5% performance drop, you measure, find the (potentially new) bottleneck, and change. Don't forget that also 99% of the time you shouldn't need to think about this in the first place.
@mytech6779 · 7 months ago
There are many cases where the hardware target is well known and exact: an embedded device, a controller for some factory, or a transportation system (e.g. flight management systems cannot make hardware version changes without major testing). Or, at the big end, when you are optimizing for a supercomputer, a mainframe, or a large server-farm deployment (matching hardware), the electrical costs alone can more than offset the cost of targeted optimization and a custom compile. (These cases spend tens of thousands of USD per month on electricity; a 1% efficiency gain over a year is worth a week of developer time.) In a few cases the hardware may even be purchased after software profiling and optimization results obtained on a selection of sample machines.
@skybuck2000 · 2 years ago
Why does the function pointer trick not work? Perhaps an expensive memory lookup for the function?
@GordonAitchJay · 2 years ago
Fedor explains why in the last 3 paragraphs of chapter 3: CPU Architecture, Resources, and Performance - Branchless computing. In short, function pointer calls prevent inlining, which is a more beneficial optimisation than avoiding mispredicted branches.
@shilangyu · 1 year ago
Why does code like `rand() & 0x1` generate branches that can be mispredicted? Doesn't this piece of code produce a branchless stream of instructions (with a bitwise AND)?
@DasAntiNaziBroetchen · 1 year ago
I'm not an expert, but it depends on the implementation of "rand()", and it may not be inlined by the compiler
@Sidelobes · 2 years ago
Interesting and informative talk! I like the hands-on, example-driven approach. What I don’t like is the constant interruptions (esp. ~ mins 30-40) from the audience questions. These are very hard to follow as a remote viewer and disrupt the flow.
@vasiapatov4544 · 2 years ago
I agree that it reduces the quality of our experience as remote viewers, but you have to remember that these conferences are optimized for the experience of the in-person attendees (as they should be, it's an organic, and expensive event)
@leonid998 · 2 months ago
18:57 Should it be register renaming?
@VitisCZ · 6 months ago
31:21 anyone knows what CPU he is talking about? I'm quite curious
@HFsrj · 2 years ago
In the first example, how do you get only 10% misprediction if the predicate is random?
@theeinhaender5132 · 2 years ago
There are multiple branches. For example, there could be four branches that can be predicted with ~100% accuracy, and the fifth and final branch is predicted with 50% accuracy. Or just one branch that’s 100% accurate and it just gets evaluated 4x more often than the 50% one.
@giacomo.delazzari · 2 years ago
Perf says that 10% of the branches are mispredicted. However the "interesting" branch was just one of the many branches in the program. There's quite a lot of other code involved, think about the benchmarking framework being used.. For every benchmark instance it might be running quite a bit of logic. Also the for loop termination branch is present. Also, the arrays are filled up before doing the experiment itself, and this also involves some branches (for loop and stuff). All of these branches are well predicted, so the "interesting" ones that are mispredicted are just a part of all the branches of the program.
@meekrab9027 · 2 years ago
It is all branches for the entire program, which includes a lot more than just the code being benchmarked.
@tomasdzetkulic9871 · 2 years ago
There are other branches in the code: branches in the loop around the `if` statement, branches in the benchmark framework, etc.
@froody7 · 2 years ago
10% of all branches executed were mispredicted, but the branch in question is only one of several executed in the loop, so it might be 50% mispredicted but when aggregated with a bunch of predicted branches at other sites the final rate could be 10%.
@multiHappyHacker · 2 years ago
I so wanted the bool conditional to be the solution at 44:12
@edgararakelyan9326 · 1 year ago
Awkward ending, to say the least
@tomaszstanislawski457 · 2 years ago
It can't optimize `c[i] = rand() >= 0` away because `rand()` is a deterministic (pseudo-)random generator: the internal state of the RNG must be advanced even though the returned value is discarded. The best one can get is `rand(); c[i] = 1;`
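A minimal illustration of the point: the comparison itself folds to a constant (`rand()` returns a value in [0, RAND_MAX], so it is never negative), but the call must stay for its side effect on the generator's hidden state:

```cpp
#include <cstdlib>
#include <vector>

// What the optimizer may legally reduce `c[i] = rand() >= 0;` to:
// the comparison becomes the constant true, but the call remains,
// because it advances the RNG state (an observable side effect).
std::vector<bool> always_true_flags(int n) {
    std::vector<bool> c(n);
    for (int i = 0; i < n; ++i) {
        std::rand();    // kept only for its side effect
        c[i] = true;    // rand() >= 0 is always true
    }
    return c;
}
```

Every flag comes back true, and the random sequence observed by any later `rand()` calls is the same as if the comparisons had been performed.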
@somehrjdbddhsjsbnsndndk751 · 2 years ago
Deterministic, I don't get it and I don't understand why?
@gianni50725 · 2 years ago
@@somehrjdbddhsjsbnsndndk751 Computers, at least without special hardware, are incapable of making truly random numbers. So rand() with the same given seed will give you the same sequence of "random" numbers every time you call it. If the compiler optimized out the rand() call, then it would be violating one of the rules of the compiler since it would be affecting the output of the program.
@niles_5003 · 1 year ago
@@somehrjdbddhsjsbnsndndk751 Tomasz was saying that `rand()` has side effects, so it can't be eliminated entirely. `rand()` has internal state that holds the last generated number in order to produce the next one. If you didn't call rand() at all, that internal state would not be advanced properly, and that could result in observable differences between optimized and unoptimized code, namely that the sequence of random numbers generated for a given seed would be applied to different vectors in his program. Now, of course, the programmer might not care about that and would like this kind of optimization to occur, but the compiler has no way of asking or knowing that.

Regardless, I would argue that a compiler warning would be better in this context than an optimization. If some condition buried in my code is always true, then that condition is probably written incorrectly, or I could delete it myself. There are tools which warn when a condition is always true, but often the tool does not have a type that can describe what the real outputs look like; i.e. if something returns an "int" or a "number" rather than a "positive integer", the tool has no way to determine that the returned value is always greater than or equal to 0.
@KX36 · 2 years ago
My low-mid-level (second from the bottom) interview started with 70% leadership questions, followed by 30% extremely basic technical ones. I didn't expect leadership questions at such a low level, and I messed up that interview so hard, but fortunately they hired all 4 people that made it to interview, so it's all good!
@lepidoptera9337 · 1 year ago
You know why, right? Because an extremely technical person has a tendency to get lost in minutia and might spend hours, days and sometimes months thinking about how they can save a few nanoseconds in a program that needs to have 100% uptime, instead. Like this guy.
@DasAntiNaziBroetchen 1 year ago
@@lepidoptera9337 lol whatever makes you feel better.
@AbstractObserver 2 years ago
How did they manage to have an in-person convention in 2021?
@JoseyWales93 2 years ago
The power of C++.
@jean-luclachance7242 1 year ago
Why can't compilers do this type of optimization?
@PrzemyslawSliwinski 2 years ago
53:10 - can this technique be used to write effective Meltdown/Spectre-proof code?
@chriswysocki8816 2 years ago
1. This video was made in 2021, with supposedly a not-ancient CPU in the test system. I've heard so much about modern CPUs having redundant pipelines that keep evaluating both sides of a branch (while still keeping the branch prediction circuitry, in case the pipelines are "shorter" than the branch code paths). If that's true, why doesn't that make the handling of the compiler-generated assembly much more efficient? 10% (a minority of) wrong predictions should still lead to high efficiency. 2. Why isn't the C compiler taking advantage of SIMD in the fastest branchless C version? Isn't there some compiler option you could have turned on to get the compiler to generate code that performs closer to the optimal assembly implementation? ... Are compiler implementors and CPU designers lying to us, or are they optimizing for unrepresentative/narrow test cases?
@simivb 2 years ago
1. Evaluating both sides of the branch is going to perform worse than branch prediction for the vast majority of branches. As Fedor said in the talk, a 1% mispredict rate is already a really bad case, so if you always take both paths, you do unnecessary work for 99% of all branches. 1% of your code becomes faster and 99% of it becomes slower, which is not good enough to use as a general solution. I also don't know which CPUs you are talking about.

2. You can tell your compiler to autovectorize using flags, but I believe it's not done by default. One reason for this is that the available SIMD instruction sets differ between processors, so the compiler will of course avoid generating code that your CPU may not even be able to run; you have to tell it which SIMD ISA you are targeting. It also absolutely can't do as good a job as optimal assembly, because 1) it doesn't have as much information as you do, so it can't make the same optimizations: the guarantees the code makes are different from the guarantees the algorithm makes, and the strongest optimizations often change how the program runs internally, which the compiler cannot do, because it only applies transformations that keep the observable program behaviour identical. And 2) SIMD code needs certain preconditions to excel, which are often not present in normal code. So the "optimal assembly" might involve preparing the data in a different way and transforming it differently, and that code might be scattered across multiple functions. A programmer can make these changes because they can reason that the end result is identical, but it might be way too many instructions for the optimizer to reason about. There really is no reason to attribute malice to compiler or CPU vendors here.
@lepidoptera9337 1 year ago
If you depend on a machine that needs to be 99% efficient to keep up with your problem, then you have a poor hardware design to begin with, and it will fail the next time your requirements change. In most applications the CPU is waiting most of the time, so why in the world do you care? Gaming didn't get faster because they optimized the hell out of branches; it got faster because they are selling GPUs.
@PiotrKundu 2 months ago
1:03:28 EOF
@user-nm7sp5xj7q 2 years ago
2:58 beginning
@none_of_your_business 2 years ago
bool(x) + bool(y) is a bit scary; booleans shouldn't have arithmetic addition defined, they should have only boolean operations defined.
@XKS99 1 year ago
It’s crazy the quality of people Russia has lost
@skybuck2000 2 years ago
Do your cats own laptops ? LOL.
@lepidoptera9337 1 year ago
You know that you have no real problems in your life if you are focused on the branch prediction pipeline of your CPU. ;-)
@mytech6779 7 months ago
Just ran perf stat on a python3 hello world and got 6% branch misses. pypy3 had 3%, but way more total branches. With g++ or clang++ at -O0 or -O3 I still get 2.5% misses (though a lot fewer total branches: 800k vs 30M) using iostream with endl; same branches and misses for printf(). Also, 3M instructions for printing hello world? Hmmm... 950k instructions and 220k branches for `int main(){ return 0; }` must be overhead from perf or the OS; the assembly is basically zero eax then ret.
@mytech6779 7 months ago
Also just tested some assembly loops. I'd say around 10G instructions or 1G branches per run is a good level to wash out the bulk of the overhead. Maybe bump that an order of magnitude for fine tuning (but then you're also going to want some control over CPU core/thread scheduling to prevent context switching from molesting your cache).
@736939 2 years ago
How does a C++ developer find a job? I'm not joking; I'm a dotnet developer, and I'm asking a serious question. Companies (as usual) want to see professional C++ developers straight out of university, which is impossible.
@BarafuAlbino 1 year ago
Ignore all conditions written in the vacancy description. Most companies don't care what they have written there.
@kilasuelika6955 2 years ago
Let me make a summary: one can use a function table to replace branches.
@greob 2 years ago
58:15 didn't he say that doesn't work though?
@iUniversEi 2 years ago
That would be a really bad summary, because this is something that does NOT work, as presented in the talk.
@kebien6020 2 years ago
He specifically mentioned this as something that almost never works 58:19
@Derheight 2 years ago
I think a function table is a kind of branching mechanism because the jump address depends on a runtime value which must be predicted to avoid flushing the pipeline.
@MMIIRRKKOO 2 years ago
A table of function *results*, which forces you to evaluate everything. He specifically said that a table of function pointers is usually bad, and if those calls could have been inlined, it's usually terrible.
@paulfunigga 1 year ago
You write this kind of code, and then some other person looks at it and goes, "was this guy on drugs when he wrote this?". I think all this makes C++ a horrid language.