## Misprediction-poor looping

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
hgm
Posts: 24443
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

### Misprediction-poor looping

Suppose we have a loop, the body of which should only be performed for some iterations, and it is difficult to predict on which:

Code: Select all

``for(i=0; i<MAX; i++) if(Condition(i)) Action(i);``
We could minimize the number of branch mispredictions by first figuring out on which iterations the action should be performed:

Code: Select all

``````int todo[MAX];
int n=0;
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) != 0); }
for(i=0; i<n; i++) Action(todo[i]);
``````
This uses branches for the looping only, which, if the loops typically do many iterations, should be predicted as 'always taken', causing mispredicts only at the loop end. (And if the loops are unrolled, and have a well-predictable number of iterations, not even that.)

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

### Re: Misprediction-poor looping

Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:

Code: Select all

``````for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
``````
Condition(i) ? 1 : 0 will probably compile into a cmov instruction, so you won't have any branches either. In fact, you probably can use the if() else {} code and rely upon compiler optimizations to use cmov instead of a branch.

EDIT: Godbolt says that GCC uses sete flag, which is still branchless.

--------

I see the theory, and I think it could be useful! Interesting food for thought.

mar
Posts: 2108
Joined: Fri Nov 26, 2010 1:00 pm
Location: Czech Republic
Full name: Martin Sedlak

### Re: Misprediction-poor looping

dragontamer5788 wrote:
Mon Dec 09, 2019 10:25 pm
Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:

Code: Select all

``````for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
``````
Condition(i) ? 1 : 0 will probably compile into a cmov instruction, so you won't have any branches either. In fact, you probably can use the if() else {} code and rely upon compiler optimizations to use cmov instead of a branch.

EDIT: Godbolt says that GCC uses sete flag, which is still branchless.

--------

I see the theory, and I think it could be useful! Interesting food for thought.
Yes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.

My experience:
- don't use likely/unlikely compiler hints (I remember I got a slightly slower code this way even if I thought the hint was appropriate, at instruction level this means extra byte (prefix) before the jump instruction); I don't know if "forward branches assumed not taken" is still a deal now that we got branch predictors (I guess not)
- for well predicted trivial branches, a branch may actually be faster than branchless magic that taxes ALU more (example: sample clipping loop in audio mixer)
- of course, profile, profile, profile. I remember I have outsmarted myself serveral times, thinking I came up with something clever that has to be faster but profiling showed otherwise
Martin Sedlak

bob
Posts: 20896
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: Misprediction-poor looping

One other thing. Modern hardware can predict 100% on loops that are less than N iterations. Not sure what N is today, a few years ago it was 10-12 depending on the microprocessor. If the iteration count is reasonably long, you get N-1 predictions and 1 misprediction for each time you execute the loop, which is still not very bad. And the neat thing is that if you have that loop in several places, it will predict all the same way, not having to learn what is going on beyond the first loop completing.

hgm
Posts: 24443
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

### Re: Misprediction-poor looping

dragontamer5788 wrote:
Mon Dec 09, 2019 10:25 pm
Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:

Code: Select all

``````for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
``````
I think that (A != 0) used as an operand by definition means (A ? 1 : 0), so the compiler will consider it equivalent, and generate the same code for it.

There are several possibilities to do this without branches. You already mentioned CMOV, but there is an older instruction SETcc (in this case SETNE) which loads 1 or 0 in the AL register depending on the condition codes. So it is always as easy to branch on a condition (with BNE) as to convert it to a truth value. Another trick is to compute the condition and then add -1 (= 0xFFFF...) to it, which only does not overflow (and then leaves the carry condition code cleared) is the condition was zero. You can then add 0 with carry to n (ADC) to only increment it when the condition was non-zero.
mar wrote:
Mon Dec 09, 2019 11:29 pm
Yes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.
Indeed, I used the function calls only as a means to write compact pseudo-code; they were supposed to be simple code sections. BTW, all calls to Action are done in the same order as they would be in the original code. So side effects there are not dangerous per se. But Condition and Action should not have side effects on each other's behavior.

mar
Posts: 2108
Joined: Fri Nov 26, 2010 1:00 pm
Location: Czech Republic
Full name: Martin Sedlak

### Re: Misprediction-poor looping

hgm wrote:
Mon Dec 09, 2019 9:36 pm

Code: Select all

``````int todo[MAX];
int n=0;
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) != 0); }
for(i=0; i<n; i++) Action(todo[i]);
``````
I use something similar but with different motivation. I often extract elements/pointers based on a condition into a (thread)local array for further processing.
This is beneficial if you want to do several loops or parallelize the loop.
Or even if you want to expose this to a higher-level scripting language where iterating would mean calling C/C++ iterator code periodically.
EDIT: in my case, Action is typically non-trivial
Martin Sedlak

flok
Posts: 222
Joined: Tue Jul 03, 2018 8:19 am
Full name: Folkert van Heusden
Contact:

### Re: Misprediction-poor looping

hgm wrote:
Tue Dec 10, 2019 8:47 am

Code: Select all

``````for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
``````

Code: Select all

``````for(i=0; i<MAX; i++) { todo[n] = i; n += !!Condition(i); }
``````
www.vanheusden.com: Micah / Embla / PuppetMaster / DeepBrutePos / Pos / Feeks

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

### Re: Misprediction-poor looping

hgm wrote:
Tue Dec 10, 2019 8:47 am
mar wrote:
Mon Dec 09, 2019 11:29 pm
Yes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.
Indeed, I used the function calls only as a means to write compact pseudo-code; they were supposed to be simple code sections. BTW, all calls to Action are done in the same order as they would be in the original code. So side effects there are not dangerous per se. But Condition and Action should not have side effects on each other's behavior.
If this simplicity can be assured, then we open up SIMD-compute capabilities. Wait, wait wait... don't run away. I'm not talking GPUs here, just pointing out that the compiler will autovectorize the loop.

Godbolt says the following code is SIMD, compiling into XMM-registers (SSE, 128-bit assembly).

The "SIMD" part is simply:

Code: Select all

``````
void Action(int& x){
x = x+10;
}

...

#pragma omp simd
for(int i=0; i<n; i++) Action(todo[i]);
``````
GCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason, but the #pragma hint is all GCC needs to start using SSE. With "-mavx2", you'll get AVX2 (256-bit YMM registers). With "-mavx512f", you'll get 512-bit ZMM registers.

The Condition() loop is obviously stream-compaction, which requires prefix-sum as a primitive. This will be tricky, requiring #pragma omp simd capabilities (well supported by CLANG and GCC, but not by Microsoft Visual C++). But I assert that the Condition() loop can also be formed into SIMD while retaining work-efficiency, albeit with a bit of extra effort, some helper functions, and some extra memory (Maybe a second "todo" array to hold the prefix-sum results);

mar
Posts: 2108
Joined: Fri Nov 26, 2010 1:00 pm
Location: Czech Republic
Full name: Martin Sedlak

### Re: Misprediction-poor looping

dragontamer5788 wrote:
Tue Dec 10, 2019 3:05 pm
GCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason
why? the code you posted is auto-vectorized by gcc regardless of any pragma, clang even unrolls the loop as well as msc
Martin Sedlak

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

### Re: Misprediction-poor looping

mar wrote:
Tue Dec 10, 2019 3:34 pm
dragontamer5788 wrote:
Tue Dec 10, 2019 3:05 pm
GCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason
why? the code you posted is auto-vectorized by gcc regardless of any pragma, clang even unrolls the loop as well as msc
Interesting. I stand corrected. I tested it earlier, but maybe I herp-derp'd and messed up my reading of the assembly earlier.