Compiler Optimization Question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

bob wrote: Mon Apr 13, 2020 4:26 am Here's the simple answer from a compiler guy. Small functions are just fine. If they SHOULD be inlined, the compiler will do it during the compilation process. If it won't help, it won't. Not inlining can improve cache efficiency if the function is executed infrequently. Better to have the natural instruction stream sequential rather than broken up into blocks with interspersed dead code.

With today's optimizers, you can do pretty well to just write readable code you can understand and work on. Let the compiler do its thing.

For example

Code: Select all

if (condition) {

do some stuff 

}

Might be compiled as is. Or if the condition is mostly false, the compiler might turn that into this:

Code: Select all

if (! condition)
    goto xx;
getback:


Somewhere else in memory:

xx:
    do some stuff
    goto getback;
    
I used to have to write code like the latter in the early FORTRAN days; the optimizers were not very good. Today the compilers can do this so that your code looks readable, and then gets reorganized when converting from source to object code, leaving the source easy to read.

Yes, there are places you might do better, in particular if you have information the compiler doesn't have. I.e., you know that ALL the cases in your switch statement are covered explicitly. The compiler still has to check for the rest and skip all the way to the closing "}". You don't. That can eliminate instructions.
Thanks Bob! I understand about the compiler not knowing about the limits of a case statement. There is an attribute (I think that is what it is called) but I can't remember its name and I can't find it in my notes or online. It is something like this:
default:
_unreachable;
That is supposed to give the compiler more information.
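If I have it right, the exact spelling is compiler-specific. A minimal sketch of what I mean (GCC/Clang provide __builtin_unreachable(), MSVC has __assume(0); the move_type enum and the values here are made up just for illustration):

Code: Select all

enum move_type { QUIET, CAPTURE, PROMOTION, CASTLE };

int score_move(enum move_type mt) {
    switch (mt) {
    case QUIET:     return 0;
    case CAPTURE:   return 100;
    case PROMOTION: return 800;
    case CASTLE:    return 10;
    default:
#if defined(_MSC_VER)
        __assume(0);             /* promise MSVC the default can't be reached */
#else
        __builtin_unreachable(); /* same promise for GCC/Clang                */
#endif
    }
    return 0; /* not reached */
}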

Also, in this function board[t][fs] and board[t][ts] are referenced so many times that I was about to create a pointer s32* boardptr = board[t]; and replace the board references with boardptr[sq], roughly as sketched below. Is that a good idea or simply not necessary? Thanks again!
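The sketch (the array shape, EMPTY, and the s32 typedef are placeholders; t, fs and ts are the same variables as above):

Code: Select all

typedef int s32;

#define MAX_THREADS 8   /* placeholder value */
#define EMPTY       0   /* placeholder value */

s32 board[MAX_THREADS][64];

void make_simple_move(int t, int fs, int ts) {
    s32 *boardptr = board[t];     /* hoist the board[t] row lookup once      */

    boardptr[ts] = boardptr[fs];  /* every later access is just boardptr[sq] */
    boardptr[fs] = EMPTY;
}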
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
cucumber
Posts: 144
Joined: Sun Oct 14, 2018 8:21 pm
Full name: JSmith

Re: Compiler Optimization Question

Post by cucumber »

Michael Sherwin wrote: Mon Apr 13, 2020 1:11 am A really large function has many local variables and accesses many global variables. Will modern compilers be able to optimize register usage effectively or is it better to break up the large function into a few subfunctions with fewer variables so the compiler can better optimize register usage?
Register allocation is *more likely* to break down across call boundaries when inlining isn't performed, so you're fine. There are many, many, many caveats to that statement depending on the compiler, compiler version, translation units, whether these functions are static or not, compilation settings, etc., but more often than not a call boundary will move you further from register-allocation optimality. LLVM at least is capable of "outlining" functions when instruction cache thrashing is a concern.

Nonetheless, LLVM and GCC will make well-written C code run fast most of the time. A slightly better register allocation means nothing if the instruction cache is constantly being thrashed because of function bloat. And register allocation is almost certainly going to be fine no matter what. Focus on code clarity and the compiler will generally handle things for you.

Just make sure you use a modern compiler.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

cucumber wrote: Mon Apr 13, 2020 5:51 am
Michael Sherwin wrote: Mon Apr 13, 2020 1:11 am A really large function has many local variables and accesses many global variables. Will modern compilers be able to optimize register usage effectively or is it better to break up the large function into a few subfunctions with fewer variables so the compiler can better optimize register usage?
Register allocation is *more likely* to break down across call boundaries when inlining isn't performed, so you're fine. There are many, many, many caveats to that statement depending on the compiler, compiler version, translation units, whether these functions are static or not, compilation settings, etc., but more often than not a call boundary will move you further from register-allocation optimality. LLVM at least is capable of "outlining" functions when instruction cache thrashing is a concern.

Nonetheless, LLVM and GCC will make well-written C code run fast most of the time. A slightly better register allocation means nothing if the instruction cache is constantly being thrashed because of function bloat. And register allocation is almost certainly going to be fine no matter what. Focus on code clarity and the compiler will generally handle things for you.

Just make sure you use a modern compiler.
Thanks. I'm using MSVC 2019 community.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
cucumber
Posts: 144
Joined: Sun Oct 14, 2018 8:21 pm
Full name: JSmith

Re: Compiler Optimization Question

Post by cucumber »

Michael Sherwin wrote: Mon Apr 13, 2020 6:03 am
cucumber wrote: Mon Apr 13, 2020 5:51 am
Michael Sherwin wrote: Mon Apr 13, 2020 1:11 am A really large function has many local variables and accesses many global variables. Will modern compilers be able to optimize register usage effectively or is it better to break up the large function into a few subfunctions with fewer variables so the compiler can better optimize register usage?
Register allocation is *more likely* to break down across call boundaries when inlining isn't performed, so you're fine. There are many, many, many caveats to that statement depending on the compiler, compiler version, translation units, whether these functions are static or not, compilation settings, etc., but more often than not a call boundary will move you further from register-allocation optimality. LLVM at least is capable of "outlining" functions when instruction cache thrashing is a concern.

Nonetheless, LLVM and GCC will make well-written C code run fast most of the time. A slightly better register allocation means nothing if the instruction cache is constantly being thrashed because of function bloat. And register allocation is almost certainly going to be fine no matter what. Focus on code clarity and the compiler will generally handle things for you.

Just make sure you use a modern compiler.
Thanks. I'm using MSVC 2019 community.
I've never worked with MSVC, but it sounds like it's up to date, so I'm guessing you're fine :P
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

And yet another question. Since I'm using simple arrays instead of an array of structures, the move stack is:
frsq[t][10000]
tosq[t][10000]
trgt[t][10000]
type[t][10000]
scor[t][10000]
And this is done:
if (score != -INFINITY && (score > alpha || depth > 0)) {
frsq[t][eye] = fs; // had to use eye instead of i because of forum italics command
tosq[t][eye] = ts;
trgt[t][eye] = target;
type[t][eye] = typ;
scor[t][eye] = score;
i++;
}
Would it be better for optimization or cache pressure to do this instead:
move[t][i++] = fs;
move[t][i++] = ts;
move[t][i++] = target;
move[t][i++] = typ;
move[t][i++] = score;
Sorry for all the questions but finding this kind of info online seems impossible.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
cucumber
Posts: 144
Joined: Sun Oct 14, 2018 8:21 pm
Full name: JSmith

Re: Compiler Optimization Question

Post by cucumber »

Michael Sherwin wrote: Mon Apr 13, 2020 6:20 am And yet another question. Since I'm using simple arrays instead of an array of structures, the move stack is:
frsq[t][10000]
tosq[t][10000]
trgt[t][10000]
type[t][10000]
scor[t][10000]
And this is done:
if (score != -INFINITY && (score > alpha || depth > 0)) {
frsq[t][eye] = fs; // had to use eye instead of i because of forum italics command
tosq[t][eye] = ts;
trgt[t][eye] = target;
type[t][eye] = typ;
scor[t][eye] = score;
i++;
}
Would it be better for optimization or cache pressure to do this instead:
move[t][i++] = fs;
move[t][i++] = ts;
move[t][i++] = target;
move[t][i++] = typ;
move[t][i++] = score;
Sorry for all the questions but finding this kind of info online seems impossible.
I'm a bit delirious from lack of sleep, so I can only speculate; you should definitely use a profiler to tell for sure.

I don't know how well MSVC will handle the i++s. The former might be brutal with small caches b/c of bad locality, whereas the latter preserves some locality but could be less than ideal if MSVC actually increments a counter each time: on top of the extra additions, you now have a brutal data dependency chain.

Optimized right, the latter is just a series of `lea`s, `movaps`s, or the like.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

cucumber wrote: Mon Apr 13, 2020 6:34 am
Michael Sherwin wrote: Mon Apr 13, 2020 6:20 am And yet another question. Since I'm using simple arrays instead of an array of structures, the move stack is:
frsq[t][10000]
tosq[t][10000]
trgt[t][10000]
type[t][10000]
scor[t][10000]
And this is done:
if (score != -INFINITY && (score > alpha || depth > 0)) {
frsq[t][eye] = fs; // had to use eye instead of i because of forum italics command
tosq[t][eye] = ts;
trgt[t][eye] = target;
type[t][eye] = typ;
scor[t][eye] = score;
i++;
}
Would it be better for optimization or cache pressure to do this instead:
move[t][i++] = fs;
move[t][i++] = ts;
move[t][i++] = target;
move[t][i++] = typ;
move[t][i++] = score;
Sorry for all the questions but finding this kind of info online seems impossible.
I'm a bit delirious from lack of sleep, so I can only speculate; you should definitely use a profiler to tell for sure.

I don't know how well MSVC will handle the i++s. The former might be brutal with small caches b/c of bad locality, whereas the latter preserves some locality but could be less than ideal if MSVC actually increments a counter each time: on top of the extra additions, you now have a brutal data dependency chain.

Optimized right, the latter is just a series of `lea`s, `movaps`s, or the like.
So you're feeling a little pickled right now? :P
Thanks, you are the third person to tell me to find out for myself by using a profiler. I guess that I should take the hint! It's just that I like knowing ahead of time. :lol:
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
cucumber
Posts: 144
Joined: Sun Oct 14, 2018 8:21 pm
Full name: JSmith

Re: Compiler Optimization Question

Post by cucumber »

Michael Sherwin wrote: Mon Apr 13, 2020 6:42 am
cucumber wrote: Mon Apr 13, 2020 6:34 am
Michael Sherwin wrote: Mon Apr 13, 2020 6:20 am And yet another question. Since I'm using simple arrays instead of an array of structures, the move stack is:
frsq[t][10000]
tosq[t][10000]
trgt[t][10000]
type[t][10000]
scor[t][10000]
And this is done:
if (score != -INFINITY && (score > alpha || depth > 0)) {
frsq[t][eye] = fs; // had to use eye instead of i because of forum italics command
tosq[t][eye] = ts;
trgt[t][eye] = target;
type[t][eye] = typ;
scor[t][eye] = score;
i++;
}
Would it be better for optimization or cache pressure to do this instead:
move[t][i++] = fs;
move[t][i++] = ts;
move[t][i++] = target;
move[t][i++] = typ;
move[t][i++] = score;
Sorry for all the questions but finding this kind of info online seems impossible.
I'm a bit delirious from lack of sleep, so I can only speculate; you should definitely use a profiler to tell for sure.

I don't know how well MSVC will handle the i++s. The former might be brutal with small caches b/c of bad locality, whereas the latter preserves some locality but could be less than ideal if MSVC actually increments a counter each time: on top of the extra additions, you now have a brutal data dependency chain.

Optimized right, the latter is just a series of `lea`s, `movaps`s, or the like.
So you're feeling a little pickled right now? :P
Thanks, you are the third person to tell me to find out for myself by using a profiler. I guess that I should take the hint! It's just that I like knowing ahead of time. :lol:
:P

It's useful advice! Profilers save a lot of guesswork.

Nonetheless, I took a peek at how MSVC compiles the latter on Compiler Explorer (godbolt.org) and was pretty disappointed. It introduces a dependency chain like I feared.

See: https://godbolt.org/z/dG6yJw
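One way around that, as a sketch only (assuming a flat s32 move[][] buffer with five slots per move in the order you listed; the sizes and the helper name are mine), is to index from one base and bump the counter once:

Code: Select all

typedef int s32;

enum { MAX_THREADS = 8, MAX_MOVES = 10000 };   /* assumed sizes */

s32 move[MAX_THREADS][5 * MAX_MOVES];

/* returns the updated index; i still advances by 5 per move, as in your version */
static int push_move(int t, int i, s32 fs, s32 ts, s32 target, s32 typ, s32 score) {
    s32 *m = &move[t][i];   /* one base address for the whole move            */
    m[0] = fs;              /* fixed offsets: the five stores don't depend on */
    m[1] = ts;              /*   one another, so no serial i++ chain          */
    m[2] = target;
    m[3] = typ;
    m[4] = score;
    return i + 5;
}

Still worth profiling what it actually buys you, of course.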
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin »

cucumber wrote: Mon Apr 13, 2020 7:03 am
Michael Sherwin wrote: Mon Apr 13, 2020 6:42 am
cucumber wrote: Mon Apr 13, 2020 6:34 am
Michael Sherwin wrote: Mon Apr 13, 2020 6:20 am And yet another question. Since I'm using simple arrays instead of an array of structures, the move stack is:
frsq[t][10000]
tosq[t][10000]
trgt[t][10000]
type[t][10000]
scor[t][10000]
And this is done:
if (score != -INFINITY && (score > alpha || depth > 0)) {
frsq[t][eye] = fs; // had to use eye instead of i because of forum italics command
tosq[t][eye] = ts;
trgt[t][eye] = target;
type[t][eye] = typ;
scor[t][eye] = score;
i++;
}
Would it be better for optimization or cache pressure to do this instead:
move[t][i++] = fs;
move[t][i++] = ts;
move[t][i++] = target;
move[t][i++] = typ;
move[t][i++] = score;
Sorry for all the questions but finding this kind of info online seems impossible.
I'm a bit delirious from lack of sleep, so I can only speculate; you should definitely use a profiler to tell for sure.

I don't know how well MSVC will handle the i++s. The former might be brutal with small caches b/c of bad locality, whereas the latter preserves some locality but could be less than ideal if MSVC actually increments a counter each time: on top of the extra additions, you now have a brutal data dependency chain.

Optimized right, the latter is just a series of `lea`s, `movaps`s, or the like.
So you're feeling a little pickled right now? :P
Thanks, you are the third person to tell me to find out for myself by using a profiler. I guess that I should take the hint! It's just that I like knowing ahead of time. :lol:
:P

It's useful advice! Profilers save a lot of guesswork.

Nonetheless, I took a peek at how MSVC compiles the latter on Compiler Explorer (godbolt.org) and was pretty disappointed. It introduces a dependency chain like I feared.

See: https://godbolt.org/z/dG6yJw
Last post before bed.
I have done some assembler. In around 2003 I wrote a single-threaded bench (perft) in a little C but mostly 32-bit x86 assembly on an 80386, using Turbo C and Turbo Assembler. It used no bulk counting and did make/unmake for every move. The make/unmake does everything except update a transposition hash--it does not have one. I just did a bench 6 in the original position on my new Ryzen 3950X. It took 1.669 seconds while running at 73,601,718 nodes/sec. Here is the code for the queen. It is 100% jump table.

Code: Select all

wqmf        dd          0
            dd          wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd
            dd          wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd
            dd          0,0,0
            dd          wqmrm
            dd          0
            dd          wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc
            dd          wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wmcbki
            dd          wnxtm
            
wqcf        dd          0
            dd          wqcnd,wqcnd,wqcnd,wqcnd,wqcnd,wqcnd,wqcnd,wqcnd
            dd          wqcnd,wqcnd,wqcnd,wqcnd,wqcnd,wqcnd,wqcnd,wqcnd
            dd          0,0,0
            dd          wqcns
            dd          0
            dd          wqcrc,wqcrc,wqcrc,wqcrc,wqcrc,wqcrc,wqcrc,wqcrc
            dd          wqcrc,wqcrc,wqcrc,wqcrc,wqcrc,wqcrc,wqcrc,wccbki
            dd          wnxtc

wmg:        push        ebp
            push        edi
            push        esi
            mov         ebp,[ply]
            mov         eax,[first+ebp*4]
            mov         [lis+ebp*4],eax
            mov         edi,[nxt]
            jmp         [ptmf+edi*4]

wnxtm:      mov         edi,[nxt+edi*4]
            jmp         [ptmf+edi*4]
        
wqm:        mov         ecx,[ps+edi*4]
            mov         esi,[qol+ecx*4]
            movsx       ebx,[qns+esi+ecx]
            mov         edx,[brd+ebx*4]
            jmp         [wqmf+edx*4]
        
wqmrm:      mov         [tree.fsq+eax*8],cl
            mov         [tree.tsq+eax*8],bl
            mov         [tree.typ+eax*8],QMOV
            inc         eax
            movsx       ebx,[qns+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wqmf+edx*4]

wqmrc:      mov         [tree.fsq+eax*8],cl
            mov         [tree.tsq+eax*8],dl
            mov         [tree.typ+eax*8],QCAP
            inc         eax
wqmnd:      movsx       ebx,[qnd+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wqmf+edx*4]           
            
If you are interested I can send it to you. 8-)
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Compiler Optimization Question

Post by bob »

Michael Sherwin wrote: Mon Apr 13, 2020 5:12 am
bob wrote: Mon Apr 13, 2020 4:26 am Here's the simple answer from a compiler guy. Small functions are just fine. If they SHOULD be inlined, the compiler will do it during the compilation process. If it won't help, it won't. Not inlining can improve cache efficiency if the function is executed infrequently. Better to have the natural instruction stream sequential rather than broken up into blocks with interspersed dead code.

With today's optimizers, you can do pretty well to just write readable code you can understand and work on. Let the compiler do its thing.

For example

Code: Select all

if (condition) {

do some stuff 

}

Might be compiled as is. Or if the condition is mostly false, the compiler might turn that into this:

Code: Select all

if (! condition)
    goto xx;
getback:


Somewhere else in memory:

xx:
    do some stuff
    goto getback;
    
I used to have to write code like the latter in the early FORTRAN days; the optimizers were not very good. Today the compilers can do this so that your code looks readable, and then gets reorganized when converting from source to object code, leaving the source easy to read.

Yes, there are places you might do better, in particular if you have information the compiler doesn't have. I.e., you know that ALL the cases in your switch statement are covered explicitly. The compiler still has to check for the rest and skip all the way to the closing "}". You don't. That can eliminate instructions.
Thanks Bob! I understand about the compiler not knowing about the limits of a case statement. There is an attribute (I think that is what it is called) but I can't remember its name and I can't find it in my notes or online. It is something like this:
default:
_unreachable;
That is supposed to give the compiler more information.

Also, in this function board[t][fs] and board[t][ts] are referenced so many times that I was about to create a pointer s32* boardptr = board[t]; and replace the board references with boardptr[sq]. Is that a good idea or simply not necessary? Thanks again!
I think there is a #pragma you can use to tell the compiler "don't check for out-of-bounds switch values" but I have not looked for that in years... May well not exist today for all I know...