Lazy SMP and 44 cores

Posted: Tue Aug 07, 2018 8:51 pm
by sandermvdb
At the moment my chess engine is participating in the TCEC tournament, which uses a machine that has 44(!) cores. I was wondering what the best approach would be to use these cores effectively with lazy SMP. Currently, half of the threads search the actual depth and the other half searches at depth + 1. I thought this is what most engines do, but I can imagine it isn't very useful when there are 44 threads.

Any advice?

Re: Lazy SMP and 44 cores

Posted: Tue Aug 07, 2018 8:59 pm
by AndrewGrant
Assuming you let the threads loose, and don't force 50% to always be mainDepth + 1, then this system would probably work out pretty well for 43 cores.

I do the following, which is inspired by Demolito. It basically says: if 50% of the threads are already working on an equal or higher depth, then this thread should skip ahead to a higher depth as well.

Code: Select all

    for (depth = 1; depth < MAX_PLY; depth++){

        // Always acquire the lock before setting thread->depth. thread->depth
        // is needed by others to determine when to skip certain search iterations
        pthread_mutex_lock(&LOCK);

        thread->depth = depth;

        // Helper threads are subject to skipping depths in order to better help
        // the main thread, based on the number of threads already on some depths
        if (!mainThread){

            for (count = 0, i = 1; i < thread->nthreads; i++)
                count += thread != &thread->threads[i] && thread->threads[i].depth >= depth;

            if (count >= thread->nthreads / 2){
                thread->depth = depth + 1;
                pthread_mutex_unlock(&LOCK);
                continue;
            }
        }

        // Drop the lock as we have finished depth scheduling
        pthread_mutex_unlock(&LOCK);

        // ... the actual search of this iteration (at thread->depth) goes here ...
    }

I have explored replacing this with a preset array of depth skips, in a way a simplified version of what Stockfish does ...

Something like ...

Code: Select all


const int SkipSize = 16;

const int SkipStep[SkipSize] = {0, 1, 1, 2, 1, 1, 2, 1, 1, 2, 3, ..... }; // Pick some pattern you like...

depth = mainDepth + SkipStep[threadIndex % SkipSize]; // threadIndex = this thread's index

Re: Lazy SMP and 44 cores

Posted: Tue Aug 07, 2018 10:54 pm
by sandermvdb
AndrewGrant wrote: Tue Aug 07, 2018 8:59 pm Assuming you let the threads loose, and don't force 50% to always be mainDepth + 1, then this system would probably work out pretty well for 43 cores.

I do the following, which is inspired by Demolito. It basically says: if 50% of the threads are already working on an equal or higher depth, then this thread should skip ahead to a higher depth as well.
I don't quite understand what you mean:
"... let the threads loose ..."
"... this system ..." your system or my system?

Do you abort all threads when a particular thread has found the bestmove?

Re: Lazy SMP and 44 cores

Posted: Wed Aug 08, 2018 3:54 am
by AndrewGrant
By system, I was referring to your current method of picking the thread for each depth.

As for letting them loose ... some people will restart each thread whenever the main thread finishes an iteration (an awful idea). In that case, a different scheme is probably best.

Re: Lazy SMP and 44 cores

Posted: Wed Aug 08, 2018 5:13 am
by mkchan
sandermvdb wrote: Tue Aug 07, 2018 8:51 pm At the moment my chess engine is participating in the TCEC tournament, which uses a machine that has 44(!) cores. I was wondering what the best approach would be to use these cores effectively with lazy SMP. Currently, half of the threads search the actual depth and the other half searches at depth + 1. I thought this is what most engines do, but I can imagine it isn't very useful when there are 44 threads.

Any advice?
One thing you could try is letting each thread do its own iterative deepening, starting from varying depths. That should give a nice spread with a good mix of aspiration windows, although I wouldn't try to predict how it measures up against the usual +0/+1 depth method. Just an idea.
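
A very rough sketch of what I mean (made-up names; aspiration_search is just assumed to exist):

Code: Select all

#include <pthread.h>

#define MAX_PLY 128

typedef struct {
    int index;              // thread number, 0 .. nthreads-1
    volatile int stop;      // set by the time manager when the search should end
    // ... rest of the per-thread search state ...
} Thread;

// assumed to exist elsewhere: one aspiration-window search at the given depth
void aspiration_search(Thread *thread, int depth);

// entry point passed to pthread_create for every search thread
void *search_thread(void *arg) {
    Thread *thread = (Thread *) arg;

    // spread the threads out: each iterative deepening loop starts at a
    // slightly different depth (thread 0 -> 1, thread 1 -> 2, thread 2 -> 3, ...)
    int startDepth = 1 + (thread->index % 3);

    for (int depth = startDepth; depth < MAX_PLY && !thread->stop; depth++)
        aspiration_search(thread, depth);

    return NULL;
}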

Re: Lazy SMP and 44 cores

Posted: Wed Aug 08, 2018 7:21 am
by elcabesa
mkchan wrote: Wed Aug 08, 2018 5:13 am
One thing you could try is letting each thread do its own iterative deepening, starting from varying depths. That should give a nice spread with a good mix of aspiration windows, although I wouldn't try to predict how it measures up against the usual +0/+1 depth method. Just an idea.
I think this is the approach of Stockfish: N parallel iterative deepening loops, and at the end the result of the best one is chosen.
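
Something like this I think, very simplified (just the idea, not the real Stockfish code):

Code: Select all

// Pick the best thread at the end of the search: prefer the thread that
// completed the greatest depth, break ties by score.
typedef struct {
    int completedDepth;     // last fully searched iteration
    int score;              // score of that iteration
    int bestMove;           // encoded best move of that iteration
} ThreadResult;

const ThreadResult *pick_best_thread(const ThreadResult *threads, int nthreads) {
    const ThreadResult *best = &threads[0];
    for (int i = 1; i < nthreads; i++)
        if ( threads[i].completedDepth >  best->completedDepth
         || (threads[i].completedDepth == best->completedDepth
          && threads[i].score          >  best->score))
            best = &threads[i];
    return best;
}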

Re: Lazy SMP and 44 cores

Posted: Wed Aug 08, 2018 7:23 am
by AndrewGrant
elcabesa wrote: Wed Aug 08, 2018 7:21 am
mkchan wrote: Wed Aug 08, 2018 5:13 am
One thing you could try is letting each thread do its own iterative deepening, starting from varying depths. That should give a nice spread with a good mix of aspiration windows, although I wouldn't try to predict how it measures up against the usual +0/+1 depth method. Just an idea.
I think this is the approach of Stockfish: N parallel iterative deepening loops, and at the end the result of the best one is chosen.
I've tried doing the "pick best thread" type stuff that Stockfish does -- and it never works for me. Perhaps it just takes too many cores to test, or maybe I've done the logic wrong.

As of now, I only listen to the main thread, which clearly works reasonably well.

Re: Lazy SMP and 44 cores

Posted: Wed Aug 08, 2018 4:33 pm
by jstanback
In Wasp, I start all threads at depth=1 and let them do their own ID loop. At the start of each iteration beyond 6 (I think) I count the number of threads already searching this depth or higher and if the count is more than 50-67% of the total (I'm not sure what is best) then the thread will skip to the next iteration. For a previous TCEC season I skipped 2 iterations if 67% of the threads were already searching at a given depth but this did not work well. Before Wasp 3.0 I let any thread update the global score and pv, but now I only let a "master" thread do this. Both methods seem about the same in strength.
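
In pseudo-C the skip rule looks roughly like this (illustrative names, not the actual Wasp code):

Code: Select all

// Each thread publishes the iteration it is currently working on.
typedef struct {
    int depth;              // iteration this thread is currently searching
    // ... rest of the thread's search state ...
} ThreadData;

// Called at the top of each iterative deepening iteration. Beyond depth 6,
// count how many other threads are already searching this depth or higher;
// if more than roughly half are, skip ahead to the next iteration.
int should_skip_iteration(const ThreadData *threads, int nthreads,
                          int self, int depth) {
    if (depth <= 6)
        return 0;

    int count = 0;
    for (int i = 0; i < nthreads; i++)
        if (i != self && threads[i].depth >= depth)
            count++;

    return count > nthreads / 2;    // threshold somewhere in the 50-67% range
}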

BTW, I had a "bug" in the Wasp I sent for TCEC 12 which reduced the search speed by almost 50%. I have a structure called "EvalData" used in many places which I previously allocated on the stack at the start of eval(). But in a brain-dead moment, I changed this to an array of structures indexed by thread. I think this causes the memory for all threads to be associated with the NUMA node that first writes to this array. Threads not on that NUMA node will have slow access to this data. I fixed this in Wasp 3.0 by including the EvalData structure in the ThreadData structure which is allocated and written to by each thread when it is first started. I have some other global variables which are initialized by the master thread and used, but not changed, by all the threads during search. I think I'll try duplicating these in the ThreadData structure to see if I can get more speedup....
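
The layout change was roughly this (illustrative names, not the exact Wasp code):

Code: Select all

#include <stdlib.h>

// Previously a single global "EvalData evalData[MAX_THREADS]" array: whichever
// thread first writes that memory decides which NUMA node it lands on, so the
// other threads get slow remote accesses.
typedef struct {
    int scratch[64];        // ... pawn/king-safety scratch space, etc. ...
} EvalData;

// Now the EvalData sits inside the thread's own data.
typedef struct {
    EvalData evalData;
    // ... killers, history, node counters, ...
} ThreadData;

// Each search thread allocates and first-touches its own ThreadData, so the
// OS places those pages on the NUMA node the thread runs on.
void *thread_main(void *arg) {
    (void) arg;
    ThreadData *td = calloc(1, sizeof(ThreadData));
    // ... run the iterative deepening loop, passing &td->evalData to eval() ...
    free(td);
    return NULL;
}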

John

Re: Lazy SMP and 44 cores

Posted: Thu Aug 09, 2018 7:29 am
by PK

jstanback wrote: Wed Aug 08, 2018 4:33 pm At the start of each iteration beyond 6 (I think) I count the number of threads already searching this depth or higher and if the count is more than 50-67% of the total (I'm not sure what is best) then the thread will skip to the next iteration.
I just implemented this and tested it overnight. Really helpful! Until now Rodent had shown no gain beyond 16 threads.

It seems that TCEC has made forty-something threads the new standard.

Re: Lazy SMP and 44 cores

Posted: Wed Nov 21, 2018 2:20 pm
by sandermvdb
I am still struggling to improve my SMP implementation:
I used to spawn new threads using modulo 2 to assign the depth. If a certain thread finished, I would abort ALL threads and start ALL threads again. I've changed this so that I let all threads 'loose' and use the formula described above (depth + 1 if half of the threads are searching at the current depth or above). I also updated my transposition table from a two-tier replacement scheme (always-replace and depth-replace) to buckets, as most engines do. But my new implementation still performs worse with 4 threads than my old implementation.

Some other info:
- I use 2 arrays in my hashtable (keys[] and values[]) which are XORed to prevent race conditions (the xor trick is sketched below this list)
- killers, countermoves, history heuristics are per thread
- evalcache, materialcache and pawnevalcache are not per thread
- PV is only updated by the master thread
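
The xor trick on the two arrays looks roughly like this (simplified to one entry per slot):

Code: Select all

#include <stdint.h>

#define TT_SIZE (1 << 20)                 // number of slots, a power of two

static uint64_t keys[TT_SIZE];            // stores key ^ value
static uint64_t values[TT_SIZE];          // stores the packed entry data

// Lockless hashing: the key is stored xorred with the value, so an entry that
// is torn by a concurrent write from another thread simply fails the key
// check on probe instead of being used as valid data.
void tt_store(uint64_t key, uint64_t value) {
    uint64_t idx = key & (TT_SIZE - 1);
    keys[idx]   = key ^ value;
    values[idx] = value;
}

int tt_probe(uint64_t key, uint64_t *value) {
    uint64_t idx = key & (TT_SIZE - 1);
    uint64_t v = values[idx];
    if ((keys[idx] ^ v) == key) {         // key and value belong together
        *value = v;
        return 1;
    }
    return 0;
}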

The trace below shows when a thread starts searching a particular depth and when it finishes it; the two numbers after start/done are the thread id and the depth (aspiration windows have been disabled).
As you can see, the threads sometimes finish a particular depth at around the same time (20, 22, ...).
Sometimes, however, a slave thread finishes way earlier than the master thread (21, 23, ...).

Code: Select all

23:15:11,474 - start 1 1
23:15:11,475 - start 3 1
23:15:11,476 - start 0 1
23:15:11,477 - start 2 1
23:15:11,481 - done  2 1
23:15:11,482 - done  1 1
23:15:11,482 - done  0 1
23:15:11,483 - done  3 1
23:15:11,484 - start 1 2
23:15:11,486 - start 2 2
23:15:11,487 - start 3 3
23:15:11,490 - done  1 2
23:15:11,491 - start 1 3
23:15:11,492 - done  2 2
23:15:11,492 - start 2 4
23:15:11,501 - done  1 3
23:15:11,502 - start 1 4
23:15:11,503 - done  3 3
23:15:11,507 - start 3 5
23:15:11,515 - done  2 4
23:15:11,516 - start 2 5
23:15:11,516 - done  1 4
23:15:11,552 - start 1 6
23:15:11,553 - done  3 5
23:15:11,554 - done  2 5
23:15:11,555 - start 2 7
23:15:11,556 - start 3 6
23:15:11,557 - info depth 1 seldepth 4 time 151 score cp 188 nps 4099 nodes 620 hashfull 0 pv e1g1 
23:15:11,560 - start 0 2
23:15:11,561 - done  0 2
23:15:11,562 - info depth 2 seldepth 10 time 193 score cp 35 nps 39958 nodes 7712 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,564 - start 0 3
23:15:11,565 - done  0 3
23:15:11,566 - info depth 3 seldepth 10 time 193 score cp 35 nps 40424 nodes 7803 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,568 - start 0 4
23:15:11,569 - done  0 4
23:15:11,569 - info depth 4 seldepth 10 time 193 score cp 35 nps 40886 nodes 7892 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,571 - start 0 5
23:15:11,572 - done  0 5
23:15:11,573 - info depth 5 seldepth 10 time 194 score cp 35 nps 41170 nodes 7987 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,575 - start 0 6
23:15:11,576 - done  0 6
23:15:11,577 - done  3 6
23:15:11,580 - done  1 6
23:15:11,582 - info depth 6 seldepth 10 time 205 score cp 17 nps 79941 nodes 16389 hashfull 0 pv b1c3 b8c6 c1e3 e8g8 e1g1 f8e8 
23:15:11,587 - start 0 7
23:15:11,588 - start 3 7
23:15:11,589 - start 1 8
23:15:11,590 - done  2 7
23:15:11,590 - start 2 8
23:15:11,593 - done  3 7
23:15:11,594 - start 3 9
23:15:11,595 - done  0 7
23:15:11,595 - info depth 7 seldepth 10 time 226 score cp 2 nps 130367 nodes 29464 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 
23:15:11,602 - start 0 8
23:15:11,603 - done  0 8
23:15:11,603 - done  2 8
23:15:11,612 - start 2 9
23:15:11,613 - info depth 8 seldepth 12 time 257 score cp 18 nps 167474 nodes 43041 hashfull 0 pv e1g1 e8g8 e2e6 f7e6 b1c3 b8c6 c1e3 a7a6 
23:15:11,620 - start 0 9
23:15:11,621 - done  2 9
23:15:11,622 - start 2 10
23:15:11,623 - done  3 9
23:15:11,624 - start 3 10
23:15:11,629 - done  1 8
23:15:11,630 - start 1 11
23:15:11,631 - done  0 9
23:15:11,632 - info depth 9 seldepth 13 time 289 score cp 5 nps 205249 nodes 59318 hashfull 0 pv e1g1 e6e2 d3e2 b8c6 c1e3 c6b4 b1a3 c7c6 c2c3 
23:15:11,634 - start 0 10
23:15:11,846 - done  0 10
23:15:11,847 - done  3 10
23:15:11,849 - done  2 10
23:15:11,850 - info depth 10 seldepth 14 time 510 score cp 7 nps 402264 nodes 205155 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:11,857 - start 3 11
23:15:11,858 - start 0 11
23:15:11,858 - start 2 12
23:15:11,953 - done  3 11
23:15:11,954 - done  0 11
23:15:11,954 - info depth 11 seldepth 15 time 617 score cp 7 nps 477685 nodes 294733 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:11,957 - start 0 12
23:15:11,958 - start 3 12
23:15:12,219 - done  1 11
23:15:12,222 - start 1 13
23:15:12,272 - done  0 12
23:15:12,274 - done  2 12
23:15:12,275 - done  3 12
23:15:12,276 - start 3 14
23:15:12,277 - start 2 13
23:15:12,278 - info depth 12 seldepth 17 time 936 score cp 13 nps 547231 nodes 512209 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:12,281 - start 0 13
23:15:12,579 - done  0 13
23:15:12,580 - done  2 13
23:15:12,580 - done  1 13
23:15:12,581 - start 2 14
23:15:12,582 - start 1 15
23:15:12,583 - info depth 13 seldepth 19 time 1243 score cp 1 nps 655860 nodes 815236 hashfull 0 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:12,586 - start 0 14
23:15:13,245 - done  0 14
23:15:13,246 - done  2 14
23:15:13,247 - done  3 14
23:15:13,248 - start 2 15
23:15:13,248 - info depth 14 seldepth 19 time 1910 score cp 1 nps 770541 nodes 1471736 hashfull 1 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 a7a6 b5d6 c7d6 
23:15:13,251 - start 3 16
23:15:13,252 - start 0 15
23:15:13,955 - done  0 15
23:15:13,956 - done  2 15
23:15:13,956 - start 2 16
23:15:13,957 - info depth 15 seldepth 21 time 2619 score cp 0 nps 964234 nodes 2525330 hashfull 2 pv b1c3 b8c6 c3b5 e8g8 c2c3 f8e8 e2e6 e8e6 c1e3 f6g4 
23:15:13,960 - start 0 16
23:15:13,962 - done  1 15
23:15:13,963 - start 1 17
23:15:14,759 - done  2 16
23:15:14,760 - done  0 16
23:15:14,761 - start 2 17
23:15:14,762 - info depth 16 seldepth 26 time 3424 score cp -1 nps 1377666 nodes 4717134 hashfull 5 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 b5d6 c7d6 
23:15:14,764 - start 0 17
23:15:14,843 - done  3 16
23:15:14,844 - start 3 18
23:15:14,932 - done  1 17
23:15:14,933 - done  0 17
23:15:14,934 - done  2 17
23:15:14,935 - start 1 18
23:15:14,936 - start 2 19
23:15:14,936 - info depth 17 seldepth 27 time 3596 score cp 0 nps 1442885 nodes 5188618 hashfull 5 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 b5d6 c7d6 
23:15:14,939 - start 0 18
23:15:16,764 - done  0 18
23:15:16,765 - done  3 18
23:15:16,766 - start 3 19
23:15:16,766 - info depth 18 seldepth 29 time 5428 score cp 7 nps 1844129 nodes 10009937 hashfull 11 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 b5d6 c7d6 
23:15:16,769 - start 0 19
23:15:16,792 - done  1 18
23:15:16,792 - start 1 20
23:15:18,299 - done  2 19
23:15:18,300 - start 2 20
23:15:18,301 - done  3 19
23:15:18,301 - start 3 21
23:15:18,302 - done  0 19
23:15:18,302 - info depth 19 seldepth 29 time 6963 score cp 16 nps 2065770 nodes 14383963 hashfull 17 pv b1c3 b8c6 c3b5 e8g8 e2e6 f7e6 c2c3 e6e5 e1g1 a8e8 
23:15:18,305 - start 0 20
23:15:26,033 - done  3 21
23:15:26,034 - start 3 22
23:15:28,618 - done  1 20
23:15:28,619 - done  2 20
23:15:28,619 - done  0 20
23:15:28,620 - start 2 22
23:15:28,621 - start 1 21
23:15:28,622 - info depth 20 seldepth 33 time 17282 score cp 18 nps 2580668 nodes 44599118 hashfull 54 pv b1c3 e8g8 e2e6 f7e6 c3b5 b8c6 c2c3 e6e5 e1g1 f6e8 
23:15:28,625 - start 0 21
23:15:29,344 - done  1 21
23:15:29,345 - done  0 21
23:15:29,346 - start 1 23
23:15:29,346 - info depth 21 seldepth 33 time 18008 score cp 18 nps 2594858 nodes 46728213 hashfull 57 pv b1c3 e8g8 e2e6 f7e6 c3b5 b8c6 c2c3 e6e5 e1g1 f6e8 
23:15:29,349 - start 0 22
23:15:31,842 - done  2 22
23:15:31,843 - done  0 22
23:15:31,844 - start 2 23
23:15:31,844 - info depth 22 seldepth 33 time 20506 score cp 18 nps 2625330 nodes 53835034 hashfull 65 pv b1c3 e8g8 e2e6 f7e6 c3b5 b8c6 c2c3 e6e5 e1g1 f6e8 
23:15:31,847 - start 0 23
23:15:33,258 - done  3 22
23:15:33,259 - start 3 24
23:15:51,006 - done  1 23
23:15:51,007 - done  2 23
23:15:51,008 - start 1 24
23:15:51,009 - start 2 25
23:16:00,974 - done  0 23
23:16:00,976 - info depth 23 seldepth 40 time 49639 score cp 19 nps 2783156 nodes 138153116 hashfull 168 pv b1c3 e6e2 c3e2 e8g8 e1g1 b8d7 c1e3 f8e8 h2h3 c7c6 
The only causes I can think of are my hashtable implementation or the way I spawn threads, which is now about the same as in other engines. Any ideas/suggestions?