Lazy SMP >4 Thread Slowdown

nitrocan
Posts: 17
Joined: Fri Sep 15, 2017 9:52 pm

Lazy SMP >4 Thread Slowdown

Post by nitrocan » Wed Nov 29, 2017 1:41 am

I have recently implemented Lazy SMP and it works remarkably well with up to 4 cores. One odd thing I noticed, though, is that as soon as I start using 8 or 16 cores, the NPS actually starts decreasing instead of increasing. The same goes for the depth it reaches per unit of time. Any ideas on why this might be happening? Could it be that too many threads are contending for the transposition table? All ideas are very welcome, thanks!

smatovic
Posts: 479
Joined: Wed Mar 10, 2010 9:18 pm
Location: Germany

Re: Lazy SMP >4 Thread Slowdown...NUMA?

Post by smatovic » Wed Nov 29, 2017 6:41 am

...in the case of AMD Threadripper/Epyc as cpu,

do you use all 4 memory channels?

http://talkchess.com/forum/viewtopic.ph ... 00&t=64980

...Threadripper needs to be in quad channel to move from around 18'000 kn/s to over 30'000, it is what it is.

Not sure about this, but maybe each 4-core unit of Threadripper has to be treated as a NUMA node...

https://www.anandtech.com/show/11551/am ... analysis/2

http://talkchess.com/forum/viewtopic.ph ... tion+table

--
Srdja

abulmo2
Posts: 131
Joined: Fri Dec 16, 2016 10:04 am

Re: Lazy SMP >4 Thread Slowdown

Post by abulmo2 » Wed Nov 29, 2017 6:41 am

Good explanation here of what may happen:

https://www.youtube.com/watch?v=WDIkqP4JbkE
Richard Delorme

nitrocan
Posts: 17
Joined: Fri Sep 15, 2017 9:52 pm

Re: Lazy SMP >4 Thread Slowdown...NUMA?

Post by nitrocan » Wed Nov 29, 2017 6:46 am

smatovic wrote:...in the case of AMD Threadripper/Epyc as cpu,

do you use all 4 memory channels?

http://talkchess.com/forum/viewtopic.ph ... 00&t=64980

...Threadripper needs to be in quad channel to move from around 18'000 kn/s to over 30'000, it is what it is.

Not sure about this, but maybe each 4-core unit of Threadripper has to be treated as a NUMA node...

https://www.anandtech.com/show/11551/am ... analysis/2

http://talkchess.com/forum/viewtopic.ph ... tion+table

--
Srdja

I'm not actually running it on my own machine at all; I use AWS instances for testing. The particular instance works perfectly fine with Stockfish (NPS increases almost linearly with the number of threads), but not with my engine :( haha.

abulmo2 wrote:Good explanation here of what may happen:

https://www.youtube.com/watch?v=WDIkqP4JbkE

Will take a look, thanks!

syzygy
Posts: 4253
Joined: Tue Feb 28, 2012 10:56 pm

Re: Lazy SMP >4 Thread Slowdown

Post by syzygy » Wed Nov 29, 2017 10:20 am

Shared cache lines. Use per-thread node counters, etc.

Are you using locked instructions or heavily accessed atomic variables?

On Linux, use "perf c2c" to find the cache lines that are shared, either because of a real shared variable or because of falsely shared variables (thread-specific ones that happen to sit in the same cache line).
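
As a rough illustration of the per-thread counter advice, here is a minimal C++17 sketch. The names `PaddedCounter`, `kCacheLine` and `count_nodes` are made up for this example, and the 64-byte cache-line size is an assumption (typical on x86-64, not guaranteed). Each thread increments only its own counter, and each counter is aligned so that it occupies a cache line by itself, which avoids both true and false sharing on the hot path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is common on x86-64 but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// One counter per thread, aligned so no two counters share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t nodes = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should fill exactly one cache line");

// Runs num_threads workers, each incrementing only its own counter,
// then sums the counters once all workers have joined.
std::uint64_t count_nodes(unsigned num_threads, std::uint64_t nodes_per_thread) {
    assert(num_threads > 0);
    // C++17 is needed so std::vector honors the over-aligned element type.
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, nodes_per_thread] {
            for (std::uint64_t i = 0; i < nodes_per_thread; ++i)
                ++counters[t].nodes;  // stand-in for "searched one node"
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (const auto& c : counters) total += c.nodes;
    return total;
}
```

In a real engine the total is usually read while the search is still running (for NPS output), in which case the per-thread counters would need to be atomics with relaxed ordering or be read in some other race-free way; the sketch sidesteps that by summing only after the join.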

Dann Corbit
Posts: 8663
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: Lazy SMP >4 Thread Slowdown

Post by Dann Corbit » Wed Nov 29, 2017 4:02 pm

nitrocan wrote:I have recently implemented Lazy SMP and it works remarkably well with up to 4 cores. One odd thing I noticed, though, is that as soon as I start using 8 or 16 cores, the NPS actually starts decreasing instead of increasing. The same goes for the depth it reaches per unit of time. Any ideas on why this might be happening? Could it be that too many threads are contending for the transposition table? All ideas are very welcome, thanks!

Perhaps there is a shared variable that is a bottleneck.

E.g. an eval hash (if global) might be better as a thread local storage so that each thread gets its own.

Besides the main hash table, what else is a public object in your program?
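
A minimal sketch of the thread-local eval hash idea, assuming a small direct-mapped cache keyed by a 64-bit position hash. `EvalEntry`, `kEvalCacheSize`, `evaluate_slow` and `evaluate_cached` are hypothetical names, and `evaluate_slow` is only a placeholder for a real evaluation function. Because the cache is declared `thread_local`, every search thread gets its own copy and no cache line is ever written by two threads.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One cached evaluation; 'valid' distinguishes an empty slot from key 0.
struct EvalEntry {
    std::uint64_t key = 0;
    int score = 0;
    bool valid = false;
};

// Number of slots per thread; must be a power of two for the index mask.
constexpr std::size_t kEvalCacheSize = std::size_t{1} << 12;
static_assert((kEvalCacheSize & (kEvalCacheSize - 1)) == 0,
              "cache size must be a power of two");

// Placeholder for the engine's real (expensive) static evaluation.
int evaluate_slow(std::uint64_t key) {
    return static_cast<int>(key % 200) - 100;
}

// Looks the position up in this thread's private cache and falls back to
// the slow evaluation on a miss, overwriting whatever was in the slot.
int evaluate_cached(std::uint64_t key) {
    thread_local std::array<EvalEntry, kEvalCacheSize> cache{};
    EvalEntry& entry = cache[key & (kEvalCacheSize - 1)];
    if (!entry.valid || entry.key != key) {
        entry.key = key;
        entry.score = evaluate_slow(key);
        entry.valid = true;
    }
    return entry.score;
}
```

The trade-off is that threads no longer benefit from each other's cached evaluations and memory use grows with the thread count, so the per-thread table is usually kept small.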
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

nitrocan
Posts: 17
Joined: Fri Sep 15, 2017 9:52 pm

Re: Lazy SMP >4 Thread Slowdown

Post by nitrocan » Wed Nov 29, 2017 6:20 pm

Dann Corbit wrote:
Perhaps there is a shared variable that is a bottleneck.

E.g. an eval hash (if global) might be better as a thread local storage so that each thread gets its own.

Besides the main hash table, what else is a public object in your program?

That was pretty much exactly it. After making the following thread-specific:

- Node counter
- Killer moves
- Counter moves
- History

the engine now scales with the number of threads as intended. I'm sure there are other things I should look into as well, but so far so good! Thanks everyone!

If anyone's interested, here's the pull request that I have made that addresses this issue:

https://github.com/nitrocan/sctr/pull/24/files
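
For readers who do not want to dig through the diff, the general shape of such a change can be sketched as follows. This is not taken from the pull request: the `ThreadData` layout, the `Move` encoding, `kMaxPly`, and the helpers `make_thread_data` and `record_cutoff` are all assumptions made for illustration (C++17). The point is only that the node counter, killers, counter moves and history live in one per-thread object instead of in globals.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Move = std::uint16_t;   // hypothetical 16-bit move encoding
constexpr int kMaxPly = 128;  // assumed maximum search ply

// Everything a search thread writes on its hot path, owned by that thread.
// The 64-byte alignment (an assumed cache-line size) keeps neighbouring
// ThreadData objects from sharing a cache line at their boundaries.
struct alignas(64) ThreadData {
    std::uint64_t nodes = 0;
    std::array<std::array<Move, 2>, kMaxPly> killers{};      // [ply][slot]
    std::array<std::array<Move, 64>, 64> counter_moves{};    // [from][to] of previous move
    std::array<std::array<std::int32_t, 64>, 64> history{};  // [from][to]
};

// Creates one zero-initialized ThreadData per search thread.
std::vector<ThreadData> make_thread_data(unsigned num_threads) {
    assert(num_threads > 0);
    return std::vector<ThreadData>(num_threads);
}

// Updates only the calling thread's tables after a quiet move causes a cutoff.
void record_cutoff(ThreadData& td, int ply, Move move, int from, int to, int depth) {
    assert(ply >= 0 && ply < kMaxPly);
    assert(from >= 0 && from < 64 && to >= 0 && to < 64);
    if (td.killers[ply][0] != move) {
        td.killers[ply][1] = td.killers[ply][0];  // demote the old first killer
        td.killers[ply][0] = move;
    }
    td.history[from][to] += depth * depth;  // simple depth-squared bonus
}
```

Each search thread would receive a reference to its own element, e.g. `tds[thread_index]`, so only the transposition table remains shared between threads.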
