Why C++ instead of C#?

R. Tomasi · Post by **R. Tomasi** » Thu Sep 30, 2021 5:55 pm

klx wrote: ↑Thu Sep 30, 2021 5:52 pm
mvanthoor wrote: ↑Thu Sep 30, 2021 5:45 pm That is what I mean with bit-shifting indeed, and I tried it manually because that was a performance tip I found somewhere. It may have worked in the past, but current-day compilers can do this on their own. Therefore I don't try these micro-optimizations anymore.
Have you checked what your code does?
Code:
(key % (self.total_entries as u64)) as usize
Looks to me like the divisor is not known or easily deduced compile time, so I doubt the compiler can fix this one for you.

I'm currently using power-of-two masks in Pygmalion, but I'm considering changing that to support arbitrary sizes. I somehow doubt it's possible to have these be compile-time constants if you intend to make the size a parameter that can be configured via some GUI. However, I doubt it's that much of an impact: for me, the really time-consuming part of a TT lookup is checking the legality of the TT move, which overshadows any gain from such tweaks by orders of magnitude.

Edit: Actually, when profiling with VTune, the legality check of TT moves is the hottest part of my code (about 30% of the time is spent there).
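
For reference, the two indexing schemes being compared can be sketched roughly as follows. This is only a minimal sketch, not Pygmalion's or anyone else's actual code; the function names `index_mask` and `index_modulo` are hypothetical.

```rust
/// Index into a table whose entry count is a power of two:
/// a single bitwise AND with (entries - 1).
fn index_mask(key: u64, entries: usize) -> usize {
    debug_assert!(entries.is_power_of_two());
    (key & (entries as u64 - 1)) as usize
}

/// Index into a table of arbitrary size: since the divisor is only
/// known at run time, this needs a real integer division.
fn index_modulo(key: u64, entries: usize) -> usize {
    (key % (entries as u64)) as usize
}
```

For a power-of-two entry count both functions return the same slot; only `index_modulo` also works for other sizes.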

klx · Post by **klx** » Thu Sep 30, 2021 6:06 pm

R. Tomasi wrote: ↑Thu Sep 30, 2021 5:55 pm I'm currently using power-of-two masks in Pygmalion, but I'm considering changing that to support arbitrary sizes. I somehow doubt it's possible to have these be compile-time constants if you intend to make the size a parameter that can be configured via some GUI. However, I doubt it's that much of an impact

Oh, but it is. Through Emanuel's Self-Re-Compiling TT Division Scheme.

Also, if your TT size leads to significantly increased risk of collision, it's not a micro-optimization anymore. Which might be the case for non-prime sizes. (Again, depending on how you generate the key.)

amanjpro · Post by **amanjpro** » Thu Sep 30, 2021 6:13 pm

klx wrote: ↑Thu Sep 30, 2021 6:06 pm
R. Tomasi wrote: ↑Thu Sep 30, 2021 5:55 pm I'm currently using power-of-two masks in Pygmalion, but I'm considering changing that to support arbitrary sizes. I somehow doubt it's possible to have these be compile-time constants if you intend to make the size a parameter that can be configured via some GUI. However, I doubt it's that much of an impact
Oh, but it is. Through Emanuel's Self-Re-Compiling TT Division Scheme.

Also, if your TT size leads to significantly increased risk of collision, it's not a micro-optimization anymore. Which might be the case for non-prime sizes. (Again, depending on how you generate the key.)

Have you seasoned yourself already?

mvanthoor · Post by **mvanthoor** » Thu Sep 30, 2021 6:30 pm

klx wrote: ↑Thu Sep 30, 2021 5:52 pm
mvanthoor wrote: ↑Thu Sep 30, 2021 5:45 pm That is what I mean with bit-shifting indeed, and I tried it manually because that was a performance tip I found somewhere. It may have worked in the past, but current-day compilers can do this on their own. Therefore I don't try these micro-optimizations anymore.
Have you checked what your code does?
Code:
(key % (self.total_entries as u64)) as usize
Looks to me like the divisor is not known or easily deduced compile time, so I doubt the compiler can fix this one for you.

This post talks about replacing modulo with bitshifts. If this does work for any size (I haven't checked this myself), I wouldn't be surprised if the compiler does this, or something similar.

I tried that bitshift method, and while it works, it didn't provide a speedup.

dangi12012 · Post by **dangi12012** » Thu Sep 30, 2021 6:38 pm

mvanthoor wrote: ↑Thu Sep 30, 2021 6:30 pm
klx wrote: ↑Thu Sep 30, 2021 5:52 pm
mvanthoor wrote: ↑Thu Sep 30, 2021 5:45 pm That is what I mean with bit-shifting indeed, and I tried it manually because that was a performance tip I found somewhere. It may have worked in the past, but current-day compilers can do this on their own. Therefore I don't try these micro-optimizations anymore.
Have you checked what your code does?
Code:
(key % (self.total_entries as u64)) as usize
Looks to me like the divisor is not known or easily deduced compile time, so I doubt the compiler can fix this one for you.
This post talks about replacing modulo with bitshifts. If this does work for any size (I haven't checked this myself), I wouldn't be surprised if the compiler does this, or something similar.

I tried that bitshift method, and while it works, it didn't provide a speedup.

That just shows how important it is to set up a benchmark for a specific function - or to use a profiler - or look at the disassembly.
Could be that a div gets replaced by a shift - but that speedup is just 1/10th of a percent, and while that is faster, there are many other places that are an order of magnitude slower - and you just don't see it.

Also keep in mind that while making code faster is good, the fastest code is nonexistent code. Meaning that avoiding a line of code or a function call altogether is always fastest.
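
A benchmark for one specific function could look roughly like the sketch below. It is an illustration under assumptions, not a rigorous harness: the name `time_indexing` is hypothetical, the xorshift generator is only a stand-in for real hash keys, and a single timing run like this is still noisy.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Times `iters` calls of the indexing function `f` on varying keys.
/// Returns the elapsed time plus a checksum of all results, so the
/// compiler cannot discard the work being measured.
fn time_indexing(
    f: impl Fn(u64, usize) -> usize,
    entries: usize,
    iters: u64,
) -> (Duration, usize) {
    let mut key: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut checksum = 0usize;
    let start = Instant::now();
    for _ in 0..iters {
        // xorshift step: cheap pseudo-random keys (stand-in for hash keys)
        key ^= key << 13;
        key ^= key >> 7;
        key ^= key << 17;
        // black_box hides the inputs from the optimizer
        checksum = checksum.wrapping_add(f(black_box(key), black_box(entries)));
    }
    (start.elapsed(), checksum)
}
```

Calling it once with a modulo closure and once with a mask closure on the same power-of-two size should give identical checksums, which is a quick sanity check that both variants compute the same thing before any timings are compared.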

klx · Post by **klx** » Thu Sep 30, 2021 6:59 pm

amanjpro wrote: ↑Thu Sep 30, 2021 6:13 pm Have you seasoned yourself already?

I will not be responding to any more of your posts!

klx · Post by **klx** » Thu Sep 30, 2021 6:59 pm

mvanthoor wrote: ↑Thu Sep 30, 2021 6:30 pm This post talks about replacing modulo with bitshifts. If this does work for any size (I haven't checked this myself), I wouldn't be surprised if the compiler does this, or something similar.

I tried that bitshift method, and while it works, it didn't provide a speedup.

I see. Thanks for sharing. No, that's not the same as what I posted. It's not modulo at all. But if your keys are well spread out, yeah, it's a great option.

And no, the compiler will definitely not do this for you if you ask it to do modulo, since again, it's not modulo. Your code will most likely result in slow integer division.
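
The linked post is not reproduced in this thread, so the following is an assumption about the kind of technique meant: one well-known division-free approach is multiply-shift range reduction, which takes the high 64 bits of the 128-bit product of key and table size. The name `index_mul_shift` is hypothetical.

```rust
/// Maps a 64-bit key into 0..entries without any division, by taking
/// the high 64 bits of the 128-bit product key * entries.
/// Assumption: this is a sketch of multiply-shift range reduction,
/// not necessarily the exact method from the linked post.
fn index_mul_shift(key: u64, entries: usize) -> usize {
    (((key as u128) * (entries as u128)) >> 64) as usize
}
```

Since only the high bits of the key decide the slot, a small key such as 5 lands in slot 0 for a table of 1000 entries, whereas `5 % 1000` is 5. That is why the result is not a modulo, and why the keys need to be well spread over the full 64-bit range.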

Just to be clear: What is the process you use to determine if a change provides a speedup or not? It's surprisingly hard to measure that accurately. Most people do it wrong.

R. Tomasi · Post by **R. Tomasi** » Thu Sep 30, 2021 7:08 pm

klx wrote: ↑Thu Sep 30, 2021 6:59 pm Just to be clear: What is the process you use to determine if a change provides a speedup or not? It's surprisingly hard to measure that accurately. Most people do it wrong.

FWIW, I tend to rely on Intel's VTune performance analyzer when I'm trying to profile individual functions.

mvanthoor · Post by **mvanthoor** » Thu Sep 30, 2021 10:57 pm

klx wrote: ↑Thu Sep 30, 2021 6:59 pm And no, the compiler will definitely not do this for you if you ask it to do modulo, since again, it's not modulo. Your code will most likely result in slow integer division.

Then that's what it's going to be, because....

klx wrote: ↑Thu Sep 30, 2021 6:59 pm Just to be clear: What is the process you use to determine if a change provides a speedup or not? It's surprisingly hard to measure that accurately. Most people do it wrong.

I converted the engine to assign a TT with a number of entries equal to a power of 2. So, if I entered 256 MB for the TT, the engine would calculate the number of entries needed to create a 256 MB TT, and then round that down to the nearest power of two.

1 - This gives weird hash table sizes, always much smaller than the amount of memory I entered.
2 - When I assigned my previous version the same amount of memory as the power-of-two version and then ran a time-to-depth test on a number of positions, the power-of-two version was somewhat faster.
3 - When both versions have equal memory available, the faster version does gain about 25 Elo.
4 - However, when just assigning a TT size and letting both engines do their thing (with the new version thus allocating a slightly faster, but almost half-sized TT), the new version loses about 25 Elo.

Just as in my previous test, this change would be detrimental to my engine if a user just assigns 256 MB expecting to get a 256 MB TT, where in reality it is only a bit bigger than half the requested size. So yes, the newer version is a bit stronger than the old version, IF you set the old version to the same TT size as the newer version is ACTUALLY assigning. If you don't, the newer version will have a smaller TT, which hurts more than the speedup gains.
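
The sizing step described above can be sketched like this. The function names and the entry size are hypothetical, chosen only to illustrate how much memory rounding down can cost; they are not taken from the engine.

```rust
/// Number of entries that fit into `megabytes` MB for a given entry
/// size in bytes (the entry size is a hypothetical parameter).
fn entries_for(megabytes: usize, entry_size: usize) -> usize {
    megabytes * 1024 * 1024 / entry_size
}

/// Largest power of two that is less than or equal to `n`.
fn floor_power_of_two(n: usize) -> usize {
    assert!(n > 0, "n must be non-zero");
    // Position of the highest set bit, counted from bit 0.
    1 << (usize::BITS - 1 - n.leading_zeros())
}
```

With an assumed 24-byte entry, a 256 MB request fits 11,184,810 entries, which rounds down to 8,388,608 entries, or 192 MB actually used. When the entry count sits just below a power of two, nearly half of the requested memory goes unused.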

The speedup and Elo gain in this case are not worth the inconvenience. Maybe I'll try the option from the linked topic some day, but that's material for a different topic.

klx · Post by **klx** » Fri Oct 01, 2021 4:05 am

mvanthoor wrote: ↑Thu Sep 30, 2021 10:57 pm I converted the engine to assign a TT with a number of entries equal to a power of 2.

OK, thanks for the details, but you're talking about a different change now? I thought we were talking about changing division to shifting, while still keeping your non-power-of-two tables.

mvanthoor wrote: ↑Thu Sep 30, 2021 10:57 pm 3 - When both versions have equal memory available, the faster version does gain about 25 Elo.

So this right here tells me that you are most likely NOT testing this correctly. When it comes to pure performance, +25 Elo corresponds to 30% faster overall. There is very little chance that removing a modulo call in the TT probe would make your entire engine 30% faster.
