strategies for finding slowdowns in lazy smp

Posted: Tue Jun 04, 2019 7:06 pm
by flok
Hi,

tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.

Number of threads versus nps on a threadripper 1950x
Image

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding what causes this slowdown?
The threads share no common variables apart from the transposition table. The TT has no locks; it uses the XOR trick.
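For readers unfamiliar with the xor-trick mentioned above, here is a minimal sketch of the idea. It is not Embla's actual code: `Entry`, `LocklessTable`, `store` and `probe` are hypothetical names, and the real Tpt.cpp layout may differ. The key is stored XORed with the data, so an entry that was half-overwritten by another thread fails the check on probe and is treated as a miss.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot of a lockless table (hypothetical layout, for illustration only).
// The first word holds key XOR data; the second holds the data itself.
struct Entry {
    std::atomic<std::uint64_t> key_xor_data{0};
    std::atomic<std::uint64_t> data{0};
};

class LocklessTable {
public:
    explicit LocklessTable(std::size_t slots) : entries_(slots) {
        assert(slots > 0);
    }

    // Two independent relaxed stores: no lock, so another thread may
    // overwrite one of the two words in between.
    void store(std::uint64_t key, std::uint64_t data) {
        Entry& e = entries_[key % entries_.size()];
        e.key_xor_data.store(key ^ data, std::memory_order_relaxed);
        e.data.store(data, std::memory_order_relaxed);
    }

    // Returns true and fills `out` only if the two words are consistent
    // with `key`. A torn entry (words from two different stores) will
    // almost certainly fail this check and is reported as a miss.
    // Note: an empty slot (both words 0) matches key 0 in this sketch.
    bool probe(std::uint64_t key, std::uint64_t& out) const {
        const Entry& e = entries_[key % entries_.size()];
        const std::uint64_t d = e.data.load(std::memory_order_relaxed);
        const std::uint64_t k = e.key_xor_data.load(std::memory_order_relaxed);
        if ((k ^ d) != key) return false;
        out = d;
        return true;
    }

private:
    std::vector<Entry> entries_;
};
```

Being lockless avoids blocking, but every store still dirties a cache line that other cores may hold, so the table can still cost something under many threads.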

Re: strategies for finding slowdowns in lazy smp

Posted: Tue Jun 04, 2019 7:51 pm
by abik
flok wrote:
Tue Jun 04, 2019 7:06 pm
Now my question is: what are strategies for finding what causes this slowdown?
The threads share no common variables apart from the transposition table. The TT has no locks; it uses the XOR trick.
This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (just making the access lockless does not mean there is not a penalty moving data between processors). Or CPU throttling may kick in if the system runs hot.

I would start eliminating the issues one by one. For example, start running all threads with no or a private transposition table (yes, I know, that is awful for actual chess performance), just to see if you observe the same slowdown.

Re: strategies for finding slowdowns in lazy smp

Posted: Tue Jun 04, 2019 10:20 pm
by jdart
A profiler might also be helpful. You could try OProfile for Linux (http://oprofile.sourceforge.net/news/).

--Jon

Re: strategies for finding slowdowns in lazy smp

Posted: Wed Jun 05, 2019 12:25 am
by Dann Corbit
flok wrote:
Tue Jun 04, 2019 7:06 pm
Hi,

tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.

Number of threads versus nps on a threadripper 1950x
Image

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding what causes this slowdown?
The threads share no common variables apart from the transposition table. The TT has no locks; it uses the XOR trick.
Interesting that there are 16 cores and 32 active threads for that CPU.
There is a huge nosedive at 33 cores.
I think that the graph is exactly what we would expect.

Re: strategies for finding slowdowns in lazy smp

Posted: Wed Jun 05, 2019 12:27 am
by Dann Corbit
Err, I have a question.
I assumed that the graph was NPS per core. Is that correct?
If that is NPS for the program, then something is totally broken.

Re: strategies for finding slowdowns in lazy smp

Posted: Wed Jun 05, 2019 3:27 am
by bob
Good observation. Either this is (a) total-NPS divided by threads or else (b) it is broken. About all NPS is useful for is to detect architectural issues, such as cache thrashing / false sharing or bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it becomes interesting to figure out what is going on sometimes.
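One of the causes listed above, false sharing, is easy to introduce with per-thread node counters that sit next to each other in memory. The sketch below shows the usual remedy of giving each counter its own cache line. It assumes a 64-byte line (typical on x86-64) and C++17 so that `std::vector` honours the alignment; `PaddedCounter` and `count_nodes` are hypothetical names, not taken from any engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is typical on x86-64.
constexpr std::size_t kCacheLine = 64;

// alignas pads each counter to a full cache line, so two threads never
// write to the same line when they update their own counter.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t nodes = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");

// Each thread increments only its own counter; the total is summed
// after all threads have been joined.
std::uint64_t count_nodes(unsigned threads, std::uint64_t per_thread) {
    assert(threads > 0);
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&counters, t, per_thread] {
            for (std::uint64_t i = 0; i < per_thread; ++i)
                ++counters[t].nodes;  // stands in for "one node searched"
        });
    }
    for (std::thread& th : pool) th.join();

    std::uint64_t total = 0;
    for (const PaddedCounter& c : counters) total += c.nodes;
    return total;
}
```

Removing the `alignas` (and the `static_assert`) and timing both variants with a large `per_thread` is a cheap way to see whether this effect is measurable on a given machine.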

Re: strategies for finding slowdowns in lazy smp

Posted: Wed Jun 05, 2019 5:54 am
by Dann Corbit
I went to the github site:
https://github.com/flok99/Embla
And expected to see threads.h or something of that nature for the threading code.
Where is your SMP stuff at?

Re: strategies for finding slowdowns in lazy smp

Posted: Wed Jun 05, 2019 7:42 am
by flok
Hi Aart,
abik wrote:
Tue Jun 04, 2019 7:51 pm
flok wrote:
Tue Jun 04, 2019 7:06 pm
Now my question is: what are strategies for finding what causes this slowdown?
The threads share no common variables apart from the transposition table. The TT has no locks; it uses the XOR trick.
This is always an interesting computer architecture question, but you give very few details on the system used. Many factors can cause this. For example, cache coherence overhead may play a role here (just making the access lockless does not mean there is not a penalty moving data between processors). Or CPU throttling may kick in if the system runs hot.
This happens on many systems: my laptop (Core i7) and desktop (Threadripper 1950X), but also the Dell i5 at work. For the first two (which run Linux) the CPU frequency governor has been set to "performance".
I would start eliminating the issues one by one. For example, start running all threads with no or a private transposition table (yes, I know, that is awful for actual chess performance), just to see if you observe the same slowdown.
Good point!
Will try that.

I'll also pin each thread to a core/thread.
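A minimal sketch of what pinning could look like on Linux, assuming glibc's `sched_setaffinity`, where a pid of 0 refers to the calling thread. The helper names `pin_current_thread` and `first_allowed_cpu` are made up for this example; each search thread would call the first one with its own CPU number at startup. Windows would need a different call (`SetThreadAffinityMask`).

```cpp
#include <sched.h>   // sched_setaffinity, sched_getaffinity, CPU_* macros (Linux, glibc)
#include <cassert>

// Pin the calling thread to a single logical CPU.
// Returns true if the kernel accepted the new affinity mask.
bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread" for this system call.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Return the lowest-numbered logical CPU the calling thread is currently
// allowed to run on, or -1 if the mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}
```

On a CPU with 2 threads per core, which logical CPU numbers share a physical core depends on the system, so it is worth checking the topology (for example with `lscpu -e`) before deciding which numbers to hand out.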

Re: strategies for finding slowdowns in lazy smp

Posted: Wed Jun 05, 2019 8:06 am
by flok
Dann Corbit wrote:
Wed Jun 05, 2019 12:25 am
flok wrote:
Tue Jun 04, 2019 7:06 pm
Hi,

tpoppins from ccrl noticed that Embla slows down when the number of threads increases (when using lazy smp).
I thought I had seen that this only happened on windows but that is not correct: it also happens on Linux.

Number of threads versus nps on a threadripper 1950x
Image

The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

Now my question is: what are strategies for finding what causes this slowdown?
The threads share no common variables apart from the transposition table. The TT has no locks; it uses the XOR trick.
Interesting that there are 16 cores and 32 active threads for that CPU.
Yes, this cpu has 2 threads per core.
There is a huge nosedive at 33 cores.
That's at 32 actually.
I think that the graph is exactly what we would expect.
Is it? Stockfish, for example, shows noise in the nps but no nosedive (well, a tiny one, but it had to share that laptop with a browser and other stuff):

  threads      nps  nps/thread
        1  1924447     1924447
        2  3516980     1758490
        3  5726175     1908725
        4  7295187     1823796
        5  9950080     1990016
        6  9704585     1617430
Dann Corbit wrote:
Wed Jun 05, 2019 12:27 am
Err, I have a question.
I assumed that the graph was NPS per core. Is that correct?
If that is NPS for the program, then something is totally broken.
bob wrote:
Wed Jun 05, 2019 3:27 am
Good observation. Either this is (a) total-NPS divided by threads or else (b) it is broken. About all NPS is useful for is to detect architectural issues, such as cache thrashing / false sharing or bandwidth issues, processor throttling due to heat, memory bottlenecks, etc. The number of cores is getting large enough that it becomes interesting to figure out what is going on sometimes.
That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:

Image
Dann Corbit wrote:
Wed Jun 05, 2019 5:54 am
I went to the github site:
https://github.com/flok99/Embla
And expected to see threads.h or something of that nature for the threading code.
Where is your SMP stuff at?
The version on GitHub is a new rewrite, not the version I'm currently working on. I'm going to drop that rewrite as its movegen is slower than the previous version.

Anyway, this is the code: https://vanheusden.com/Embla/files/embla-2.0.8.tgz
Brain.cpp contains search and eval and threading.
There's a "thread()" function and a "calculateMove" method which do the searching. calculateMove invokes search() and starts n - 1 threads via the thread() function.
Tpt.cpp is the hash table. As you can see, the transposition table has been disabled for the tests.

Re: strategies for finding slowdowns in lazy smp

Posted: Wed Jun 05, 2019 8:12 am
by Dann Corbit
That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:
Something is very wrong with the calculation.
The aggregate NPS is the sum of the NPS for all threads.
How can it be less than the NPS for one thread?
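To make the two quantities under discussion concrete, here is a small sketch of how aggregate and per-thread NPS relate. It assumes every thread keeps its own node count and that all threads ran for the same wall-clock time; `NpsReport` and `summarize` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

struct NpsReport {
    std::uint64_t total_nps;       // aggregate: all nodes / elapsed time
    std::uint64_t per_thread_nps;  // aggregate divided by thread count
};

// nodes_per_thread[i] is the number of nodes thread i searched during
// `seconds` of wall-clock time.
NpsReport summarize(const std::vector<std::uint64_t>& nodes_per_thread,
                    double seconds) {
    assert(!nodes_per_thread.empty() && seconds > 0.0);
    const std::uint64_t total_nodes =
        std::accumulate(nodes_per_thread.begin(), nodes_per_thread.end(),
                        std::uint64_t{0});
    const auto total_nps = static_cast<std::uint64_t>(total_nodes / seconds);
    return {total_nps, total_nps / nodes_per_thread.size()};
}
```

With the Stockfish numbers quoted earlier in the thread, 2 threads at 3516980 nps in total give 3516980 / 2 = 1758490 nps per thread. The aggregate can only drop below the single-thread figure if the threads together search fewer nodes per second than one thread did alone.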