Memory management and threads

Discussion of chess software programming and technical issues.

jdart
Posts: 3991
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Memory management and threads

Post by jdart » Wed Sep 16, 2020 2:54 am

How do we manage increased memory access outside of cache with increasing thread count?
A global hash table can create a lot of memory contention. When I have profiled, this was certainly the primary memory-related hot spot.

There are a few things you can do, including pinning threads to cores (preventing migration of threads across cores) and hash striping (ensuring different parts of the hash table are loaded by different threads, so that at least some parts are local or proximate to each running thread). This probably has some effect, but I think it's fairly minor. Also, pinning only really works well in an environment where the engine has complete use of all the system resources.
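A minimal sketch of the striping idea (the layout and names are illustrative, not Arasan's actual scheme): the table is split into one stripe per thread, the top bits of the Zobrist key pick the stripe, and the low bits pick the slot. On a NUMA machine, having each worker thread zero ("first-touch") its own stripe encourages the OS to place those pages near that thread.

```rust
use std::thread;

// Toy sizes for illustration; a real engine would size these from UCI options.
const STRIPES: usize = 4;
const STRIPE_ENTRIES: usize = 1 << 16;

// Top bits of the key select the stripe...
fn stripe_of(key: u64) -> usize {
    (key >> 48) as usize % STRIPES
}

// ...and the low bits select the slot within it.
fn slot_of(key: u64) -> usize {
    key as usize % STRIPE_ENTRIES
}

fn main() {
    // Each worker first-touches its own stripe before searching, so the pages
    // backing that stripe are faulted in by (and placed near) that thread.
    let handles: Vec<_> = (0..STRIPES)
        .map(|id| {
            thread::spawn(move || {
                let stripe = vec![0u64; STRIPE_ENTRIES]; // first touch happens here
                (id, stripe.len())
            })
        })
        .collect();
    for h in handles {
        let (id, len) = h.join().unwrap();
        println!("stripe {id}: {len} entries");
    }

    // Probe mapping: stripe from high bits, slot from low bits.
    let key: u64 = 0x9e3779b97f4a7c15;
    println!("key {key:#x} -> stripe {}, slot {}", stripe_of(key), slot_of(key));
}
```

This only localizes allocation; probes for a position can still land in another thread's stripe, which is one reason the effect tends to be minor.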

chrisw
Posts: 3629
Joined: Tue Apr 03, 2012 2:28 pm

Re: Memory management and threads

Post by chrisw » Wed Sep 16, 2020 9:21 am

mar wrote:
Tue Sep 15, 2020 9:07 pm
thanks.

values of 5-yr old cheng (lazy smp) on my 8-core Ryzen, 4MB hash, startpos to depth 20:

Code: Select all

1T: time 54363 nodes 131382573 nps 2416764
2T: time 35140 nodes 166246539 nps 4730977
4T: time 37180 nodes 338705372 nps 9109880
8T: time 18374 nodes 292388814 nps 15913182
so nps scaling is as follows:

Code: Select all

1T: 1.0x
2T: 1.96x
4T: 3.77x
8T: 6.58x
not great, not terrible, but with little effort back then

so your <2x speedup on 4 cores seems a bit low indeed (assuming 4 physical cores)
now that I look at your results, it seems like some test suite of >300 positions, I assume you get the same scaling for a fixed position
searched for a longer time? (say startpos + 30 seconds)
Well, serious thanks for that. Talking round the problem often gets past underlying assumptions that can be wrong, like mine in this case. I needed to run for longer than d=13, which was my quick benchmark default.

Testing from start position, run to depth=31, which at threads=1 took about 60 seconds. 6 Core Intel.

Code: Select all

Threads=1 Mnps=2.17 1X
Threads=2 Mnps=4.20 1.94X
Threads=3 Mnps=6.00 2.76X
Threads=4 Mnps=7.76 3.58X
Threads=5 Mnps=9.18 4.23X
Threads=6 Mnps=10.77 5.05X
Threads=7 Mnps=10.40 4.80X
Threads=8 Mnps=10.50 4.84X
which looks comparable to your results, other than that T=7 and T=8 give nothing on a 6-core processor. Problem solved; there was no problem!

Joost Buijs
Posts: 1145
Joined: Thu Jul 16, 2009 8:47 am
Location: Almere, The Netherlands

Re: Memory management and threads

Post by Joost Buijs » Wed Sep 16, 2020 3:07 pm

chrisw wrote:
Wed Sep 16, 2020 9:21 am
Testing from start position, run to depth=31, which at threads=1 took about 60 seconds. 6 Core Intel.

Code: Select all

Threads=1 Mnps=2.17 1X
Threads=2 Mnps=4.20 1.94X
Threads=3 Mnps=6.00 2.76X
Threads=4 Mnps=7.76 3.58X
Threads=5 Mnps=9.18 4.23X
Threads=6 Mnps=10.77 5.05X
Threads=7 Mnps=10.40 4.80X
Threads=8 Mnps=10.50 4.84X
These results look a lot better than what you showed before.

Since I never checked my engine on the i9-10980XE and was curious about the results, I checked it with the start-position at 20 ply:

threads 1 2.999 Mnps
threads 2 5.949 Mnps
threads 4 11.653 Mnps
threads 8 21.935 Mnps
threads 16 37.956 Mnps

Threads 1 to 8 are pretty linear; with 16 threads it suddenly drops. This is something weird that I don't see when I test on my TR-3970X. It could be that the boost frequency is much lower with 16 cores loaded, or that it has something to do with the rather small caches of the 10980XE. I run the 3970X with precision boost turned off; maybe I have to repeat the test on the 10980XE with a fixed clock frequency.

mvanthoor
Posts: 303
Joined: Wed Jul 03, 2019 2:42 pm
Full name: Marcel Vanthoor

Re: Memory management and threads

Post by mvanthoor » Wed Sep 16, 2020 10:32 pm

Today I've added multi-threading to my Perft routine. (In Rust you'll get hit on the head for some time by the compiler and borrow checker, but once you figure out the correct way of "fearless concurrency in Rust", it becomes ridiculously easy and logical to do.) I just created a function that splits the initial move list of the position into one chunk per thread, and then launches perft for each chunk. It doesn't use a transposition table yet, but I think the results are still interesting, because they fluctuate measurably depending on the position you're testing and the number of threads you use.
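The chunk-per-thread scheme described above can be sketched with scoped threads. To keep this self-contained, the toy perft() below counts leaves of a uniform tree with branching factor 8 instead of generating real chess moves; the structure (split root moves into chunks, search each chunk on its own thread, sum the counts) is the point, and all names are illustrative rather than Rustic's actual code.

```rust
use std::thread;

// Stand-in for a real move generator: every node has BRANCH children,
// so perft(d) == BRANCH^d. A real engine would make/unmake moves here.
const BRANCH: usize = 8;

fn perft(depth: u32) -> u64 {
    if depth == 0 {
        return 1;
    }
    let mut nodes = 0;
    for _ in 0..BRANCH {
        nodes += perft(depth - 1);
    }
    nodes
}

// Split the root move list into one chunk per thread; each thread runs
// perft(depth - 1) for every root move in its chunk, and the per-chunk
// totals are summed. Assumes depth >= 1.
fn perft_mt(depth: u32, threads: usize) -> u64 {
    let root_moves: Vec<usize> = (0..BRANCH).collect();
    let chunk_size = (root_moves.len() + threads - 1) / threads;
    thread::scope(|s| {
        let mut handles = Vec::new();
        for chunk in root_moves.chunks(chunk_size) {
            let n = chunk.len();
            handles.push(s.spawn(move || (0..n).map(|_| perft(depth - 1)).sum::<u64>()));
        }
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let d = 5;
    let single = perft(d);
    let multi = perft_mt(d, 4);
    assert_eq!(single, multi);
    println!("perft({d}) = {multi}");
}
```

One consequence of this fixed split is visible in the numbers below: if one chunk's subtrees are much bigger than another's, some threads finish early and idle, which is exactly the uneven-move-distribution effect noted for the start position.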

The test was run on an Intel Core i7-6700K: 4 physical cores plus 4 hyper-threaded logical cores. No bulk counting, hashing, or other tricks.

Before the implementation of multi-threading, Rustic's perft routine ran at +/- 40 - 41 million leaves per second (Mlps). In that case the perft function ran within the main thread. After implementation, Perft always uses its own thread(s), even if only running on one thread. That did cost about 2% of performance. (The Kiwipete position would normally complete perft 6 in 198.x-200.x seconds when running single-threaded on the main thread; now it runs at +/- 204 seconds when running single-threaded in its own thread.)

Startpos Perft depth 7:

Code: Select all

Physical cores:
1T: 38.72 Mlps		1.00x
2T: 73.64 Mlps		1.90x
3T: 99.36 Mlps		2.57x
4T: 118.33 Mlps		3.06x

Logical cores:
5T: 132.11 Mlps		3.41x
6T: 129.04 Mlps		(probably poor move distribution on this position with this thread count)
7T: 145.94 Mlps		3.76x
8T: 148.11 Mlps		3.83x
The start position doesn't do that well, but I've noticed more often than not that it isn't the best position to test with because it's so simple. Kiwipete (CPW perft results position 2) is often much more representative.

Kiwipete, perft depth 6:

Code: Select all

Physical cores:
1T: 39.26 Mlps		1.00x
2T: 78.16 Mlps		1.99x
3T: 113.20 Mlps		2.88x
4T: 132.41 Mlps		3.37x

Logical cores:
5T: 135.73 Mlps		3.46x
6T: 147.23 Mlps		3.75x
7T: 159.49 Mlps		4.06x
8T: 166.18 Mlps		4.23x
So, the speed keeps increasing, even when using the logical cores. When running more than 8 threads, performance tanks. That's no surprise because at that point the CPU gets overloaded. It is clear that scaling is noticeably better on the Kiwipete position than it is on the starting position. (The starting position has a drop at 6 threads; I assume that move distribution among the threads is not so great, and that some threads are returning early.)

Maybe the scaling will change after I implement a hash table. I suspect it will, but I don't know by how much.
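For a table shared by many threads, one lock-free option (a hedged sketch, not Rustic's code or a recommendation of its exact memory orderings for every use case) is the classic XOR-validation trick: store key ^ data next to data, so an entry torn between two racing writers fails validation and reads as a miss instead of returning garbage.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const TT_SIZE: usize = 1 << 16; // illustrative size

struct TtEntry {
    check: AtomicU64, // key ^ data
    data: AtomicU64,  // e.g. a packed (depth, node count) for perft
}

struct PerftTt {
    entries: Vec<TtEntry>,
}

impl PerftTt {
    fn new() -> Self {
        let entries = (0..TT_SIZE)
            .map(|_| TtEntry {
                check: AtomicU64::new(0),
                data: AtomicU64::new(0),
            })
            .collect();
        PerftTt { entries }
    }

    fn store(&self, key: u64, data: u64) {
        let e = &self.entries[key as usize % TT_SIZE];
        // Two separate relaxed stores: a reader that sees a mix of old and
        // new values will fail the check below and treat it as a miss.
        e.check.store(key ^ data, Ordering::Relaxed);
        e.data.store(data, Ordering::Relaxed);
    }

    fn probe(&self, key: u64) -> Option<u64> {
        let e = &self.entries[key as usize % TT_SIZE];
        let data = e.data.load(Ordering::Relaxed);
        if e.check.load(Ordering::Relaxed) == key ^ data {
            Some(data)
        } else {
            None // empty slot, different position, or a torn racing write
        }
    }
}

fn main() {
    let tt = PerftTt::new();
    tt.store(0xdeadbeef, 12345);
    assert_eq!(tt.probe(0xdeadbeef), Some(12345));
    assert_eq!(tt.probe(0x12345678), None);
    println!("roundtrip ok");
}
```

The trade-off is the one jdart raised at the top of the thread: every thread hammering one shared table is a memory-contention hot spot, so the hash-table speedup has to outweigh the extra cache traffic.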
