strategies for finding slowdowns in lazy smp

flok · Post by **flok** » Wed Jun 05, 2019 10:29 am

Hi Dann,

Dann Corbit wrote: ↑Wed Jun 05, 2019 10:12 am
That graph showed the nps for 1 thread.
This new graph shows the average nps for all threads:
Something is very wrong with the calculation.
The aggregate NPS is the sum of the NPS for all threads.
How can it be less than the NPS for one thread?

In that graph it is not the aggregate; it is the average.

Here's a combined graph of the average and the sum:
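To make the distinction between the two curves concrete, here is a minimal sketch of how the aggregate and the average relate. The struct and function names are made up for illustration, and the node counts in the usage note are placeholders, not measurements from the graphs.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical summary type; not taken from any engine in this thread.
struct NpsSummary {
    double aggregate;  // nodes per second summed over all threads
    double average;    // aggregate divided by the number of threads
};

// nodes_per_thread[i] = nodes searched by thread i during the same interval.
NpsSummary summarize_nps(const std::vector<std::uint64_t>& nodes_per_thread,
                         double elapsed_seconds) {
    assert(!nodes_per_thread.empty() && elapsed_seconds > 0.0);
    const std::uint64_t total = std::accumulate(
        nodes_per_thread.begin(), nodes_per_thread.end(), std::uint64_t{0});
    const double aggregate = static_cast<double>(total) / elapsed_seconds;
    return {aggregate,
            aggregate / static_cast<double>(nodes_per_thread.size())};
}
```

With four threads searching 1,000,000, 900,000, 800,000 and 700,000 nodes in one second, the aggregate is 3,400,000 nps while the average is 850,000 nps, so the average can sit below a single-thread figure even though the aggregate is well above it.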

smatovic · Post by **smatovic** » Wed Jun 05, 2019 10:44 am

flok wrote: ↑Tue Jun 04, 2019 9:06 pm Now my question is: what are strategies for finding what causes this slow down?

- implement a benchsmp command to reproduce results quickly on the command line
- as always in engine debugging, turn every extension off, bench with only a
basic engine, and turn the extensions back on step by step; you can also bench smp nps
with the TT off

***edit***
- if it's not the TT or the extensions, then the IDF loop and the starting/terminating of threads are what is left
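A rough sketch of what such a benchsmp command could measure, assuming a fixed amount of work per thread. `dummy_search`, `BenchResult` and `bench_smp` are hypothetical names; a real version would call the engine's own fixed-depth search on fixed positions instead of the stand-in loop.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in for the engine's search: burns CPU and counts "nodes".
std::uint64_t dummy_search(std::uint64_t node_budget) {
    std::uint64_t nodes = 0, x = 88172645463325252ULL;
    while (nodes < node_budget) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;  // xorshift step as fake work
        ++nodes;
    }
    volatile std::uint64_t sink = x;  // keeps the compiler from dropping the loop
    (void)sink;
    return nodes;
}

struct BenchResult {
    unsigned threads;
    std::uint64_t total_nodes;
    double seconds;
    double nps() const { return seconds > 0.0 ? total_nodes / seconds : 0.0; }
};

// Runs the same workload on thread_count threads and times the whole batch.
BenchResult bench_smp(unsigned thread_count, std::uint64_t nodes_per_thread) {
    assert(thread_count > 0);
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> workers;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned i = 0; i < thread_count; ++i)
        workers.emplace_back([&] { total += dummy_search(nodes_per_thread); });
    for (auto& w : workers) w.join();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return {thread_count, total.load(), elapsed.count()};
}
```

Calling `bench_smp` for 1, 2, 4, ... threads and comparing `nps()` between runs, once per feature toggle (extensions off, TT off), gives the kind of quick, repeatable comparison described above.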

--
Srdja

mar · Post by **mar** » Wed Jun 05, 2019 12:34 pm

flok wrote: ↑Tue Jun 04, 2019 9:06 pm The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).

First of all, don't mess with affinity (especially if you don't understand how it works).
Let's say your CPU has 2 logical cores per physical core. If you set the affinity mask for one worker to bit 0 and for another to bit 1, you force them to run on a single physical core, which is certainly not what you want.
So unless you know exactly what you're doing, simply trust the scheduler.
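The pitfall can be checked on paper with a small helper. This is only a sketch: the logical-to-physical table is an assumed layout in which adjacent logical CPUs are siblings, and `physical_cores_used` is a made-up name. Real numbering differs between platforms and would have to be queried from the OS.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// physical_of[l] = physical core that logical CPU l belongs to (assumed layout).
// pinned_logical = the logical CPUs the workers were pinned to, one per worker.
std::size_t physical_cores_used(const std::vector<int>& physical_of,
                                const std::vector<int>& pinned_logical) {
    std::set<int> used;
    for (int l : pinned_logical) {
        assert(l >= 0 && static_cast<std::size_t>(l) < physical_of.size());
        used.insert(physical_of[static_cast<std::size_t>(l)]);
    }
    return used.size();  // number of distinct physical cores doing the work
}
```

With the layout `{0, 0, 1, 1}`, pinning two workers to logical CPUs 0 and 1 uses one physical core, while pinning them to 0 and 2 uses two.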

flok · Post by **flok** » Wed Jun 05, 2019 12:42 pm

mar wrote: ↑Wed Jun 05, 2019 12:34 pm
flok wrote: ↑Tue Jun 04, 2019 9:06 pm The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).
First of all, don't mess with affinity (especially if you don't understand how it works).
Let's say your CPU has 2 logical cores per physical core. If you set the affinity mask for one worker to bit 0 and for another to bit 1, you force them to run on a single physical core, which is certainly not what you want.

But: let's say I have a system with 32 threads (16 physical cores) on which I want to run 32 threads. In that case there's always a case of 2 on the same physical core.
Or are you suggesting not to use threading but only 1 thread per core?

flok · Post by **flok** » Wed Jun 05, 2019 12:45 pm

smatovic wrote: ↑Wed Jun 05, 2019 10:44 am
flok wrote: ↑Tue Jun 04, 2019 9:06 pm Now my question is: what are strategies for finding what causes this slow down?
- implement a benchsmp command to reproduce results quickly on the command line
- as always in engine debugging, turn every extension off, bench with only a
basic engine, and turn the extensions back on step by step; you can also bench smp nps
with the TT off

***edit***
- if it's not the TT or the extensions, then the IDF loop and the starting/terminating of threads are what is left

what is an IDF loop? my googling did not turn up anything on that

starting/term. threads: I start them once at the start of the whole calculation and stop them when time is up

mar · Post by **mar** » Wed Jun 05, 2019 1:01 pm

flok wrote: ↑Wed Jun 05, 2019 12:42 pm
mar wrote: ↑Wed Jun 05, 2019 12:34 pm
flok wrote: ↑Tue Jun 04, 2019 9:06 pm The dramatic slow-down is probably because other things were running on it (e.g. the chrome browser).
First of all, don't mess with affinity (especially if you don't understand how it works).
Let's say your CPU has 2 logical cores per physical core. If you set the affinity mask for one worker to bit 0 and for another to bit 1, you force them to run on a single physical core, which is certainly not what you want.
But: let's say I have a system with 32 threads (16 physical cores) on which I want to run 32 threads. In that case there's always a case of 2 on the same physical core.
Or are you suggesting not to use threading but only 1 thread per core?

Of course I'm not, I'm suggesting you don't mess with affinity and let the scheduler do its job!
Let's say I have 8 logical cores and 4 physical:

Code:

L0 L1 L2 L3 L4 L5 L6 L7
P0 P0 P1 P1 P2 P2 P3 P3

And I want to run a 4-CPU tournament. The way you allocate the logical cores, you end up with the thread masks
L0, L1, L2 and L3, but that restricts the threads to only two physical cores instead of four. A better mask would be
L0L1 for thread 0, L2L3 for thread 1, and so on. (Of course, you could have more than 2 logical cores per physical core, so this is just an example.)

So simply let the OS scheduler handle it (plus it's less code).
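For anyone who still wants to see what the "better" masks look like, here is a sketch that only computes them from the layout above. `masks_per_physical_core` is a hypothetical helper and the layout table is the assumed 8-logical/4-physical example; actually applying a mask is platform-specific and is exactly the part the advice above says to leave to the scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// physical_of[l] = physical core of logical CPU l (assumed example layout).
// Returns one bit mask per physical core, covering all of its logical CPUs,
// ordered by physical core id.
std::vector<std::uint64_t> masks_per_physical_core(
        const std::vector<int>& physical_of) {
    assert(physical_of.size() <= 64);  // one bit per logical CPU in a 64-bit mask
    std::map<int, std::uint64_t> by_core;
    for (std::size_t l = 0; l < physical_of.size(); ++l)
        by_core[physical_of[l]] |= std::uint64_t{1} << l;
    std::vector<std::uint64_t> masks;
    for (const auto& entry : by_core) masks.push_back(entry.second);
    return masks;
}
```

For the layout `{0, 0, 1, 1, 2, 2, 3, 3}` this yields the masks 0x03, 0x0C, 0x30 and 0xC0, i.e. L0L1 for thread 0, L2L3 for thread 1, and so on.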

smatovic · Post by **smatovic** » Wed Jun 05, 2019 1:20 pm

flok wrote: ↑Wed Jun 05, 2019 12:45 pm
smatovic wrote: ↑Wed Jun 05, 2019 10:44 am
flok wrote: ↑Tue Jun 04, 2019 9:06 pm Now my question is: what are strategies for finding what causes this slow down?
- implement a benchsmp command to reproduce results quickly on the command line
- as always in engine debugging, turn every extension off, bench with only a
basic engine, and turn the extensions back on step by step; you can also bench smp nps
with the TT off

***edit***
- if it's not the TT or the extensions, then the IDF loop and the starting/terminating of threads are what is left
what is an IDF loop? my googling did not turn up anything on that

starting/term. threads: I start them once at the start of the whole calculation and stop them when time is up

IDF - Iterative Deepening Framework

https://www.chessprogramming.org/Iterative_Deepening

Not sure what a lazy smp implementation looks like without Iterative Deepening,
but if you have ID implemented, then maybe you want to implement a termination
strategy for all threads, for the case a thread finishes the search of the
current ID iteration...but this stuff may vary between lazy smp derivatives.
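One possible shape for such a termination strategy, as a sketch only: every thread publishes the depth it has completed, and threads still working on that depth or a shallower one treat their iteration as stale. `IterationState`, `publish_completed` and `iteration_is_stale` are hypothetical names, and other lazy smp variants may handle this differently.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical shared state between all search threads.
struct IterationState {
    std::atomic<int> completed_depth{0};  // deepest iteration finished by any thread
};

// Called by a thread that has just finished iteration `depth`.
// Raises completed_depth to `depth` unless another thread already went deeper.
void publish_completed(IterationState& s, int depth) {
    int seen = s.completed_depth.load();
    while (seen < depth &&
           !s.completed_depth.compare_exchange_weak(seen, depth)) {
        // On failure compare_exchange_weak reloads `seen`; retry while it is lower.
    }
}

// Polled inside the search: true when some thread has already finished this
// depth, so the caller can abandon it and continue at completed_depth + 1.
bool iteration_is_stale(const IterationState& s, int depth) {
    return s.completed_depth.load() >= depth;
}
```

A thread that finds its iteration stale would unwind its search and restart at the next depth, reusing whatever the others left in the shared TT.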

--
Srdja

flok · Post by **flok** » Wed Jun 05, 2019 1:27 pm

smatovic wrote: ↑Wed Jun 05, 2019 1:20 pm IDF - Iterative Deepening Framework
https://www.chessprogramming.org/Iterative_Deepening
Not sure what a lazy smp implementation looks like without Iterative Deepening,

Oh it has IDF, I just didn't know it was called IDF. Thought ID. But never mind.

but if you have ID implemented, then maybe you want to implement a termination
strategy for all threads, for the case a thread finishes the search of the
current ID iteration...but this stuff may vary between lazy smp derivatives.

Currently my main thread is the master-thread. If that one decides the search is finished, then all others terminate as well.
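Roughly, that scheme looks like the sketch below. It is a simplified stand-in, not the engine's actual code: `run_with_master_stop` is a made-up name and the "search" is just a node counter.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// The calling thread acts as master; helpers run until the master says stop.
// Returns the total number of "nodes" counted by all threads.
std::uint64_t run_with_master_stop(unsigned helper_count,
                                   std::uint64_t master_nodes) {
    std::atomic<bool> stop{false};
    std::atomic<std::uint64_t> total{0};
    std::vector<std::thread> helpers;
    for (unsigned i = 0; i < helper_count; ++i) {
        helpers.emplace_back([&] {
            std::uint64_t nodes = 0;
            do {
                ++nodes;               // placeholder for searching one node
            } while (!stop.load());    // helpers poll the master's flag
            total += nodes;
        });
    }
    std::uint64_t nodes = 0;
    while (nodes < master_nodes) ++nodes;  // placeholder for the master's search
    stop.store(true);                      // master decides the search is over
    for (auto& h : helpers) h.join();
    total += nodes;
    return total.load();
}
```

Each helper counts its nodes in a local variable and adds them to the shared total only once at the end, so the shared counter is not written on every node.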

smatovic · Post by **smatovic** » Wed Jun 05, 2019 1:33 pm

flok wrote: ↑Wed Jun 05, 2019 1:27 pm Currently my main thread is the master-thread. If that one decides the search is finished, then all others terminate as well.

And what happens if a helper finishes its search?

--
Srdja

flok · Post by **flok** » Wed Jun 05, 2019 1:36 pm

smatovic wrote: ↑Wed Jun 05, 2019 1:33 pm
flok wrote: ↑Wed Jun 05, 2019 1:27 pm Currently my main thread is the master-thread. If that one decides the search is finished, then all others terminate as well.
And what happens if a helper finishes its search?

It goes on with the next iteration if applicable. Else it'll busy-loop until the main-thread catches up.
