lithander wrote: ↑Wed Sep 29, 2021 2:24 pm
This is just another example of an expression that the C compiler had already optimized but the C# compiler didn't.
I've tried many... _MANY_ optimizations I found across the internet to make my engine faster. "Use powers of 2 for the number of TT-entries, so you can use a bitmask, which is faster than modulo", for example. Tried it. Failed. Then I found a post on Talkchess that modulo can be replaced by bitshifting, and this works for any number of TT entries. I would not be surprised if LLVM (rustc's compiler backend) knows this trick, and already applies it in place of the modulo operator.
This way I've encountered many different optimizations that in the end didn't give me any improvements. The only thing that really gave me an improvement was removing heap-allocated vectors and replacing them with a struct that keeps an array and a counter on the stack. I now only use vectors in places where 1) speed doesn't matter (parsing incoming FEN strings, for example) or 2) I need the variable size (appending a child PV to the parent PV). I even tried to optimize 2) by initializing the vector with a certain capacity, and by using a huge array that I then handled as if it were a vector, but it brought no improvement, because the PV (apparently) doesn't change often enough to make an impact.
So I basically stopped trying to hand-optimize my code. I just take care to not allocate and de-allocate memory in hotspot functions, and up until now, I'm fine.
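The array-plus-counter replacement described above can be sketched in Rust like this (a hypothetical MoveList; the names, the u16 move encoding, and the capacity are illustrative, not the engine's actual code):

```rust
// A fixed-capacity list on the stack: an array plus a counter,
// replacing a heap-allocated Vec in hot paths such as move generation.
const MAX_MOVES: usize = 255; // illustrative upper bound

#[derive(Clone, Copy)]
pub struct MoveList {
    moves: [u16; MAX_MOVES], // moves encoded as u16 here for brevity
    count: usize,
}

impl MoveList {
    pub fn new() -> Self {
        Self { moves: [0; MAX_MOVES], count: 0 }
    }

    // No allocation happens here; Rust still bounds-checks the index,
    // so exceeding the capacity panics instead of corrupting memory.
    pub fn push(&mut self, m: u16) {
        self.moves[self.count] = m;
        self.count += 1;
    }

    pub fn len(&self) -> usize {
        self.count
    }

    pub fn get(&self, i: usize) -> u16 {
        self.moves[i]
    }
}

fn main() {
    let mut list = MoveList::new();
    list.push(42);
    list.push(7);
    assert_eq!(list.len(), 2);
    assert_eq!(list.get(0), 42);
}
```

Because the whole struct lives on the stack (or inside whatever owns it), creating one per search node costs no allocator traffic at all.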
Here are some things that helped me:
When I have a struct/class, I test whether I can make every member static.
If I have a struct/class where every member is static, I can convert it to a namespace.
For me and MSVC at least, whether something like Lookup::Rook lived in a namespace or in a struct made a difference of about 5 percent. You are correct in the assumption that a modulo with a constant power of 2 gets compiled away into something better.
Small constant multiplications likewise get converted to shifts.
The biggest improvements come from templates. Too often I see: if (color == White) pawn << 8; else pawn >> 8;
This can become a template parameter: if constexpr (color == White) pawn << 8; else pawn >> 8;. At the end you can just call the template function again with Move<!White>(), which will be very clean, and you have inlined all color-dependent stuff. This will be more than twice as fast!
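For readers following along in Rust (the language of the engine discussed earlier in the thread), the closest analogue to this C++ template trick is a const generic parameter; this is a sketch, with pawn_push as an illustrative name:

```rust
// Rust sketch of the C++ template trick: the color is a const generic,
// so the branch is resolved at compile time and each monomorphized
// instantiation contains only one of the two shifts.
fn pawn_push<const WHITE: bool>(pawns: u64) -> u64 {
    if WHITE {
        pawns << 8 // white pawns move one rank up the board
    } else {
        pawns >> 8 // black pawns move one rank down
    }
}

fn main() {
    // All white pawns on rank 2 push to rank 3:
    assert_eq!(pawn_push::<true>(0x0000_0000_0000_FF00), 0x0000_0000_00FF_0000);
    // All black pawns on rank 7 push to rank 6:
    assert_eq!(pawn_push::<false>(0x00FF_0000_0000_0000), 0x0000_FF00_0000_0000);
}
```

One caveat: calling the function recursively with the flipped color (the Move<!White>() pattern) takes more ceremony in stable Rust, because const generic arguments cannot yet be arbitrary expressions of another const parameter.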
dangi12012 wrote: ↑Thu Sep 30, 2021 2:09 pm
The biggest improvements come from templates. Too often I see: if (color == White) pawn << 8; else pawn >> 8;
This can become a template parameter: if constexpr (color == White) pawn << 8; else pawn >> 8;. At the end you can just call the template function again with Move<!White>(), which will be very clean, and you have inlined all color-dependent stuff. This will be more than twice as fast!
Rust has const fn functions, and I use them in some places. I'll have to look into them more to see if I can use them in more places. They are not templates, but they do get evaluated at compile time.
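As a sketch of what const fn can do (the rank-mask table here is an illustrative example, not code from the engine), a lookup table can be built entirely at compile time:

```rust
// const fn is evaluated at compile time when used in a const context.
// Here a small lookup table of rank masks is baked into the binary,
// so no runtime initialization is needed at all.
const fn rank_mask(rank: usize) -> u64 {
    0xFF_u64 << (rank * 8)
}

const fn build_rank_masks() -> [u64; 8] {
    let mut masks = [0u64; 8];
    let mut r = 0;
    while r < 8 { // for-loops aren't allowed in const fn; while is
        masks[r] = rank_mask(r);
        r += 1;
    }
    masks
}

// The entire table is a compile-time constant.
const RANK_MASKS: [u64; 8] = build_rank_masks();

fn main() {
    assert_eq!(RANK_MASKS[0], 0x0000_0000_0000_00FF);
    assert_eq!(RANK_MASKS[7], 0xFF00_0000_0000_0000);
}
```

Unlike C++ templates, this doesn't specialize code per parameter value, but for precomputed tables it serves the same "pay at compile time" purpose.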
dangi12012 wrote: ↑Thu Sep 30, 2021 2:09 pm
The biggest improvements come from templates. Too often I see: if (color == White) pawn << 8; else pawn >> 8;
This can become a template parameter: if constexpr (color == White) pawn << 8; else pawn >> 8;. At the end you can just call the template function again with Move<!White>(), which will be very clean, and you have inlined all color-dependent stuff. This will be more than twice as fast!
Very cool!!
But I don't see how something equivalent could be done in C#. Or is there?
dangi12012 wrote: ↑Thu Sep 30, 2021 2:09 pm
The biggest improvements come from templates. Too often I see: if (color == White) pawn << 8; else pawn >> 8;
This can become a template parameter: if constexpr (color == White) pawn << 8; else pawn >> 8;. At the end you can just call the template function again with Move<!White>(), which will be very clean, and you have inlined all color-dependent stuff. This will be more than twice as fast!
Very cool!!
But I don't see how something equivalent could be done in C#. Or is there?
In C# and other languages you'd just have two methods "MoveWhite()" and "MoveBlack()", and either accept or work around the code duplication.
mvanthoor wrote: ↑Thu Sep 30, 2021 12:53 am
I've tried many... _MANY_ optimizations I found across the internet to make my engine faster. "Use powers of 2 for the number of TT-entries, so you can use a bitmask, which is faster than modulo", for example. Tried it. Failed. Then I found a post on Talkchess that modulo can be replaced by bitshifting, and this works for any number of TT entries.
You may benefit from using a prime-sized TT. While integer division is slow, it's possible to do division through multiplication and shifting. Is that what you meant by "modulo can be replaced by bitshifting"?
mvanthoor wrote: ↑Thu Sep 30, 2021 12:53 am
I've tried many... _MANY_ optimizations I found across the internet to make my engine faster. "Use powers of 2 for the number of TT-entries, so you can use a bitmask, which is faster than modulo", for example. Tried it. Failed. Then I found a post on Talkchess that modulo can be replaced by bitshifting, and this works for any number of TT entries.
You may benefit from using a prime-sized TT. While integer division is slow, it's possible to do division through multiplication and shifting. Is that what you meant by "modulo can be replaced by bitshifting"?
The compiler does that automatically. A power of 2 has some special properties: div/rem (division and modulo) can be expressed as:
x / 8 => x >> 3
x % 8 => x & 7
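These identities are easy to sanity-check; the key point is that the compiler performs the substitution itself whenever the divisor is a compile-time constant power of two:

```rust
fn main() {
    let x: u64 = 12345;
    // For a power-of-two divisor 2^k, division is a right shift by k
    // and modulo is a mask with 2^k - 1.
    assert_eq!(x / 8, x >> 3);
    assert_eq!(x % 8, x & 7);

    // The same holds for any power of two, e.g. a TT with 2^20 entries:
    let entries: u64 = 1 << 20;
    let key: u64 = 0x9D39_247E_3377_7E95;
    assert_eq!(key % entries, key & (entries - 1));
}
```

So writing the mask by hand buys nothing over `%` when the size is a constant power of two; the generated code is identical.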
I'll have to try a power of two TT some day again. The thing I dislike about it is the discrepancy in memory usage. Most of the time, the TT will be smaller than the size you specified.
mvanthoor wrote: ↑Thu Sep 30, 2021 5:04 pm
I'll have to try a power of two TT some day again. The thing I dislike about it is the discrepancy in memory usage. Most of the time, the TT will be smaller than the size you specified.
What I'm saying is that a non-power-of-two size can be almost as efficient: you can avoid the slow division by using multiply + shift, which for 32-bit integers produces an equivalent index mapping.
Yeah, the compiler will do that for you if the size is known at compile time. And a non-power-of-two sized TT may lead to a lot better TT utilization (though this might not be the case in your engine; it'll depend on how you generate the keys).
And of course, don't repeat the entropy used to calculate the index in the hashed value.
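The multiply + shift mapping being referred to is presumably the widely used trick (often attributed to Daniel Lemire) of scaling a 32-bit hash into the table range with one 64-bit multiplication; a sketch, with tt_index as an illustrative name:

```rust
// Multiply + shift index mapping: scales a 32-bit hash onto 0..entries
// without any division or modulo, and works for table sizes that are
// NOT powers of two (e.g. a prime-sized TT).
fn tt_index(hash: u32, entries: u32) -> usize {
    // (hash / 2^32) is a fraction in [0, 1); multiplying by `entries`
    // and truncating yields an index in [0, entries).
    (((hash as u64) * (entries as u64)) >> 32) as usize
}

fn main() {
    let entries = 1_000_003; // a prime-sized table, as suggested above
    assert!(tt_index(u32::MAX, entries) < entries as usize);
    assert_eq!(tt_index(0, entries), 0);
}
```

Note this uses the *upper* bits of the hash most strongly, which is exactly why the index entropy and the stored verification value should come from different parts of the key.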
mvanthoor wrote: ↑Thu Sep 30, 2021 5:04 pm
I'll have to try a power of two TT some day again. The thing I dislike about it is the discrepancy in memory usage. Most of the time, the TT will be smaller than the size you specified.
What I'm saying is that a non-power-of-two size can be almost as efficient: you can avoid the slow division by using multiply + shift, which for 32-bit integers produces an equivalent index mapping.
Yeah, the compiler will do that for you if the size is known at compile time. And a non-power-of-two sized TT may lead to a lot better TT utilization (though this might not be the case in your engine; it'll depend on how you generate the keys).
And of course, don't repeat the entropy used to calculate the index in the hashed value.
That is indeed what I meant by bit-shifting, and I tried it manually because it was a performance tip I found somewhere. It may have worked in the past, but current-day compilers can do this on their own, so I don't try these micro-optimizations anymore. With regard to the entropy: I calculate the index from the upper 32-bit part of the Zobrist key, and the verification hash from the lower 32-bit part, so there is no repetition.
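That split can be sketched as follows (illustrative code, not the poster's actual implementation; the multiply + shift index calculation is just one way to map the upper half into range):

```rust
// Splitting a 64-bit Zobrist key: the upper 32 bits pick the TT slot,
// the lower 32 bits are stored as a verification value, so the two
// never share entropy.
fn tt_slot(zobrist: u64, entries: u32) -> (usize, u32) {
    let upper = (zobrist >> 32) as u32; // used only for the index
    let verification = zobrist as u32;  // used only for verification
    let index = ((upper as u64 * entries as u64) >> 32) as usize;
    (index, verification)
}

fn main() {
    let key: u64 = 0x9D39_247E_3377_7E95;
    let (index, verification) = tt_slot(key, 1 << 20);
    assert!(index < 1 << 20);
    assert_eq!(verification, 0x3377_7E95);
}
```

Storing only the 32-bit verification value (instead of the full key) halves that field's size in the entry, at the cost of a tiny collision probability on verification.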
mvanthoor wrote: ↑Thu Sep 30, 2021 5:45 pm
That is indeed what I meant by bit-shifting, and I tried it manually because it was a performance tip I found somewhere. It may have worked in the past, but current-day compilers can do this on their own, so I don't try these micro-optimizations anymore.