RMO - Randomized Move Order - yet another Lazy SMP derivate

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by bob »

I would like to add that I have been using a variant of DTS for a long time. Primary difference between Crafty's parallel search and Cray Blitz's search (the original DTS search):

In Cray Blitz, when a thread became idle, it could look at the complete tree data for all other threads and find the best place to help. It could use an existing split point or it could create a brand new one without the original thread knowing until it backed up to that node. Very efficient, but it required the mentioned iterative (rather than recursive) search, since it is not possible to either back up or go forward independently of the call stack. So iterated was the approach (and since FORTRAN back then did not support recursive/reentrant code anyway), there was no choice.

In Crafty, since it is not possible to jump around in the tree for the call stack reason quoted above, I decided to take a different angle to the same problem. At "good" split points, in a serial search thread, I have that thread emit what I call a "gratuitous split point", which is a split point with no helpers. But now this is globally visible and any thread can attach to that split point without needing the call stack data for plies prior to that split point. So any thread can jump in at any of these split points, or if it is not happy with any, it can request that each active thread emit such a split point "right now" and then choose one of these. I limit how many of these gratuitous split points are emitted to save memory and wasted execution cycles (it is not free to create one, but it is also not cpu intensive since I spent a lot of time trying to minimize the split point data).

In performance tests, I can not tell a lot of difference between the two approaches (DTS and almost-DTS). I tested this for tens of thousands of games on the cluster, using up to 8 threads / cores for each pair of programs, and found no significant Elo difference. For that reason, you don't see an iterative Crafty since it was MUCH messier than a purely recursive search.

I would not begin to say this is the ultimate parallel search, but several (including a couple here) tested it through 16 cores (and my own testing which also used 32 cores) and found the parallel speedup to be really good. You will have to look back through old posts here to see the exact numbers, I really don't remember.

The only down-side is that locks and memory contention do have an effect. I spent a lot of time the last year I worked on the code heavily, removing as much serial code (lock protected) as possible, which is why we saw the pretty efficient speedups through 32 cores. Beyond 32 it will definitely begin to show the effects of memory and lock contention. There are some options to probably take it beyond 32, at least close to 64. I did not take the time since I had no real way to measure the performance to see what tweaks were needed...

Clearly lazy has less lock contention, and the only real memory contention seems to be the global hash table. This implies that it ought to continue to scale far beyond 64 cores, easily. But as I mentioned, at some point there is a crossover point where Lazy will out-perform DTS/pseudo-DTS, although I don't think the exact point is known. But I would be willing to bet it is not below 32 cores and probably not below 64. But it is there. A classic case of brute force winning out over cleverness. Exactly what happened in the 70's when clever forward pruning gave way to a full-width search. Reduced my code size by 90% and played better. Lazy reduces the code size for the parallel part of the program significantly, which is definitely an advantage in terms of debugging.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by dragontamer5788 »

For that reason, you don't see an iterative Crafty since it was MUCH messier than a purely recursive search.
I dunno if its "much" messier. Here's the general prototype for an AB Iterative search.

Code: Select all

struct SavedABState{
    Position position;
    int16_t alpha;
    int16_t beta;
    vector<uint16_t> moves;

    int8_t depthRemaining;
};

struct ABStack{
    static constexpr uint16_t END=128;
    SavedABState stack[END];
    uint8_t stackPtr;
};

SavedABState& stackTop(ABStack& s){
	return s.stack[stackPtr];
}

// Covers all the code "before" the recursive call
int16_t ab_preamble(ABStack& stack, SavedABState state){
    int depthRemaining = state.depthRemaining; // Note in my notation, depthRemaining goes negative during QSearch
    int16_t alpha = state.alpha;
    int16_t beta = state.beta;

	while(depthRemaining >=0 || QSearchExtensionCondition()){
		createMoves(state); // Creates moves + sorts moves into state.moves vector
		stack.stackPtr--; // "Push" onto stack
		stack.stack[stack.stackPtr] = state; // Copy state onto the stack. This "deep clones" the moves vector
		
		Position nextPos;
		if(stackTop(stack).moves.size() > 0){		
			state.position = applyNextMove(stackTop(stack));;
			state.alpha = -beta;
			state.beta = -alpha;
			state.moves.clear();
			state.depthRemaining = depthRemaining -1;
			
			depthRemaining--;
			continue;
		} else {
			// No moves available means checkmate or stalemate
			return static_eval(stackTop);
		}
	}
	
	return static_eval(stackTop);
}

int16_t iterative_absearch(Position init, int16_t alpha, int16_t beta, int depthRemaining){
	ABStack stack;
	stack.stackPtr = stack.END;
	
	SavedABState state;
	state.position = init;
	state.alpha = INT16_MIN;
	state.beta = INT16_MAX;
	state.depthRemaining = depthRemaining;
	
	int16_t lastReturn = ab_preamble(stack, state);
	
	while(stack.stackPtr != stack.END){
		lastReturn = -lastReturn;
		
		topStack(stack).alpha = max(topStack(stack).alpha, lastReturn);
		
		if(topStack(stack).alpha >= topStack(stack).beta){ // Betacutoff
			lastReturn = topStack(stack).beta;
			stack.stackPtr++; // And pop
			continue; // Simulates a "return" in iterative form
		}
		
		if(topStack(stack).moves.size() > 0){
			state.position = applyNextMove(topStack(stack));
			state.alpha = -topStack(stack).beta;
			state.beta = -topStack(stack).alpha;
			state.moves.clear();
			state.depthRemaining = topStack(stack).depthRemaining - 1;
			
			lastReturn = ab_preamble(stack, state);
			continue;  // Simulate a "return" in iterative form
		}
		
		// No beta cutoff, all moves complete. Return alpha
		lastReturn = topStack(stack).alpha;
		stack.stackPtr++;
		continue;
	}
	
	return lastReturn;
}
The main benefit of the "explicit stack" style is that you can pause, split, fork, or join ABstacks. For example, if you know that the current path has "become speculative", you can fork off the stack, "block" the speculative task, and switch over to a new ABStack which remains work efficient.

A "true hardrware thread" is an extremely heavy operation, requiring tens-of-microseconds to complete. Linux / Windows has to create a new process, inform the scheduler, setup thread local storage, etc. etc. etc. Many operations which frankly, AB Search does not need. By keeping the stack explicit, it is now possible to wrap the above state into Packaged_tasks, and build a basic work-queue system with thread-pools.
The only down-side is that locks and memory contention do have an effect. I spent a lot of time the last year I worked on the code heavily, removing as much serial code (lock protected) as possible, which is why we saw the pretty efficient speedups through 32 cores. Beyond 32 it will definitely begin to show the effects of memory and lock contention. There are some options to probably take it beyond 32, at least close to 64. I did not take the time since I had no real way to measure the performance to see what tweaks were needed...
Controlling locks and memory contention is why I like the iterative style so much. You can tightly control which hardware threads are in "control" of an ABStack. You can choose to minimize the locks and passing of data as well.

For example, it is a well known fact in parallelism circles that a single-producer+single-consumer queue can be written in an efficient lock-free manner. If you have 64 hardware threads, with 63 producer-consumer queues in a pipeline arrangement, you can minimize contention by having Thread1 communicate only with Thread2 (through the producer-consumer queue).

Thread1.write(queue1);
Thread2.read(queue1); Thread2.write(queue2);
Thread3.read(queue2); Thread3.write(queue3);
Etc. etc. For every hardware thread

The only messages that really need to be sent are:

1. "Load Balancing" messages. Thread1.write(&ABStack), as Thread1 "offloads excess work". Only pass the 8-byte pointer on shared-memory systems. But ABStacks are ~2kB in the worst-case scenario, so this is probably an efficient message even over Infiniband / Ethernet + other non-shared memory systems.

2. "Return Value" messages. Whenever an ABStack is finished executing, it needs to inform any children that were forked-off of the return value, so that those children know how to continue their execution.

Thread1.write(return value for ABStack that was offloaded). Thread1 needs to remember which ABStacks were offloaded to another thread, and send the 16-bit return value for the search tree so that the next thread can continue.

Well, the pipeline of producer-consumer queues is a bit theoretical. Its a methodology of load-balancing I'm working on right now. But the iterative style turned out to be far simpler than I expected. AB-search only recursively calls a new ABSearch in a limited number of circumstances, and all of those "recursive calls" seem like they can be effectively emulated with the ab_preamble() helper function.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Michel wrote: Sat Jan 04, 2020 11:02 am
Ronald wrote: Fri Jan 03, 2020 11:35 pm
Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small, that (combined with a few primitive tricks to avoid that threads do too much double work) it wins against (some?) smarter work dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared by Lazy SMP is the hashtable. In plain Lazy SMP every thread executes an identical search, the hashtable is what makes multiple threads behave differently, and a mayor part of that is that together they create a "richer" tree. Even 2 threads on a fixed depth perform much better than 1 thread, and this has nothing to do with TTD. So important questions are why is the tree richer, and how different is this richer tree from the 1 thread tree, so that we can try to create this richer tree with 1 thread also.
A "richer" tree is an undefined concept. I assume you mean a wider tree.

I think it still needs to be established that Lazy indeed searches a substantially wider tree (this is testable to some extent). It is not very obvious to me which mechanism might cause this as all the threads use basically the same search. I seem to recall some tests on Fishtest that tried to explicitly widen the tree for some of the threads and they failed. I am sure others have tried this too.

If Lazy searches indeed a wider tree then the question remains (as many people have pointed out) why a similar widening could not be implemented in a single threaded search.
Has anyone ever tested SHT-2 threads vs SHT-1 thread at fixed depth like Ronald mentioned? Maybe this is a very easy way to prove/disprove
if the tree is "richer" or "wider". We can do the same for YBW as a baseline ...
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by dragontamer5788 »

Daniel Shawul wrote: Sat Jan 04, 2020 8:33 pm Has anyone ever tested SHT-2 threads vs SHT-1 thread at fixed depth like Ronald mentioned? Maybe this is a very easy way to prove/disprove
if the tree is "richer" or "wider". We can do the same for YBW as a baseline ...
I mean, if you just want to measure "tree richness", you only need to do two things:

1. Log whenever an item is kicked out of the shared hash table.
2. Dump the shared hash table at the end of the algorithm.

This is obviously infeasible for very large searches, but it should remain valid for 20 seconds or so. In effect, you'd be logging every single node traversed, roughly in time-order, by the parallel algorithm. Then you can run analysis on the log to see average moves searched, average "redundancy", etc. etc.

A typical computer does... what? 10-million nodes/second, with maybe 32-bytes per node? That's only 320 MB/s, which is slow enough to be written to a SSD. If you cut it down to ~10-bytes per node through compression, you might be able to write it to a hard drive (around 100MB/s). Alternatively, 20-seconds of 320MB/s is only 6.4GBs, possible to fit inside of RAM.

A bad idea for people who have faster, 100-million nodes/second computers. But us "normal" people with slower computers can actually log every node traversed. :D :D
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Daniel Shawul wrote: Sat Jan 04, 2020 8:33 pm
Michel wrote: Sat Jan 04, 2020 11:02 am
Ronald wrote: Fri Jan 03, 2020 11:35 pm
Michel wrote: Fri Jan 03, 2020 9:50 pm Occam's razor says that we should first look for the simplest possible explanation. And the simplest possible explanation is that Lazy has almost zero overhead. The overhead is so small, that (combined with a few primitive tricks to avoid that threads do too much double work) it wins against (some?) smarter work dividing algorithms that have much more overhead.
I think you use Occam's razor on the wrong aspect. The only thing that is shared by Lazy SMP is the hashtable. In plain Lazy SMP every thread executes an identical search, the hashtable is what makes multiple threads behave differently, and a mayor part of that is that together they create a "richer" tree. Even 2 threads on a fixed depth perform much better than 1 thread, and this has nothing to do with TTD. So important questions are why is the tree richer, and how different is this richer tree from the 1 thread tree, so that we can try to create this richer tree with 1 thread also.
A "richer" tree is an undefined concept. I assume you mean a wider tree.

I think it still needs to be established that Lazy indeed searches a substantially wider tree (this is testable to some extent). It is not very obvious to me which mechanism might cause this as all the threads use basically the same search. I seem to recall some tests on Fishtest that tried to explicitly widen the tree for some of the threads and they failed. I am sure others have tried this too.

If Lazy searches indeed a wider tree then the question remains (as many people have pointed out) why a similar widening could not be implemented in a single threaded search.
Has anyone ever tested SHT-2 threads vs SHT-1 thread at fixed depth like Ronald mentioned? Maybe this is a very easy way to prove/disprove
if the tree is "richer" or "wider". We can do the same for YBW as a baseline ...
I did a quick test of 8-threads SHT vs 1-thread SHT at a fixed depth of 8, so far

Code: Select all

Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 99 - 107 - 94  [0.487] 300
Tree ain't wider or richer with 8 threads (if anything it is slightly messier). Is this what we were looking for lol :)
YBW split depth is 8 so I didn't do that test but there is no reason to expect behavior will be different.
I will try with larger depth but it will take time.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

The story for Scorpio is the same at depth=14 too

Code: Select all

Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 78 - 90 - 132  [0.480] 300
BUT for stockfish, I was shocked to find out some +170 elo at fixed depth=8.
Then I decided to investigate further by doing fixed depth tests starting from depth=1

Code: Select all

Depth 1 = Score of stockfish-8-thread vs stockfish-1-thread: 1258 - 1233 - 527  [0.504] 3018
Depth 2 = Score of stockfish-8-thread vs stockfish-1-thread: 1730 - 1551 - 365  [0.525] 3646  
Depth 3 = Score of stockfish-8-thread vs stockfish-1-thread: 1564 - 1288 - 355  [0.543] 3207
Depth 4 = Score of stockfish-8-thread vs stockfish-1-thread: 1606 - 1048 - 356  [0.593] 3010  => huge leap
Depth 5 = Score of stockfish-8-thread vs stockfish-1-thread: 1923 - 770 - 405  [0.686] 3098   => another huge leap
Depth 6 = Score of stockfish-8-thread vs stockfish-1-thread: 2065 - 581 - 468  [0.738] 3114   => a modest improvement now
Depth 7 = Score of stockfish-8-thread vs stockfish-1-thread: 1966 - 524 - 525  [0.739] 3015   => completely leveled off here
Depth 8 = Score of stockfish-8-thread vs stockfish-1-thread: 2028 - 479 - 649  [0.745] 3156
Depth 9 = Score of stockfish-8-thread vs stockfish-1-thread: 1892 - 398 - 724  [0.748] 3014
I have some theories:

a) First off with the ridiculous theory, Stockfish does not respect fixed depth -- LazySMP with its (d/d+1) depths could be problematic for
fixed depth search. Maybe it is matching d+1 (8-threads) vs d (1-thread)? The single thread search respects depth but the multi-threaded search does d+1. Also, d vs d+1 matches may level off after some depth, like seen in the above patter, in a single-threaded engine as well ... but I have to test first.

b) The one thing staring in our face is the aggressive pruning for depth <= 8. The huge leap from depth 4 to depth 7 suggests that this is the cause of widening. So it is indeed wider/richer BUT why is Scorpio not benefiting from this even though I prune equally aggressively I believe.
There is that difference of SHT vs LazySMP between Scorpio and stockfish though. If this theory turns out to be the case, at least now we know where to fix the single-threaded search. I also suspect a bug in LMR/pruning with the parallel search e.g when you pick a move in parallel it may not have the right index for LMR reductions unless one is very careful. The single-threaded search is what one intends the search to look like, so lessening reductions/pruning via a chaotic parallel search is weird, to say the least.

Let me know what you think.

Daniel
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

On the other hand, all my algorithms do not widen even at fixed depth of 14

Code: Select all

Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 78 - 90 - 132  [0.480] 300
Score of scorpio-abdada-8-thread vs scorpio-abdada-1-thread: 111 - 107 - 106  [0.506] 324
Score of scorpio-ybw-8-thread vs scorpio-ybw-1-thread: 92 - 94 - 114  [0.496] 300
So I feel like we are chasing a bug in Stockfish, or searching at two depths d/d+1 magically "widens/enriches" your search.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Michel »

Daniel Shawul wrote: Sat Jan 04, 2020 11:02 pm The story for Scorpio is the same at depth=14 too

Code: Select all

Score of scorpio-sht-8-thread vs scorpio-sht-1-thread: 78 - 90 - 132  [0.480] 300
BUT for stockfish, I was shocked to find out some +170 elo at fixed depth=8.
Then I decided to investigate further by doing fixed depth tests starting from depth=1

Code: Select all

Depth 1 = Score of stockfish-8-thread vs stockfish-1-thread: 1258 - 1233 - 527  [0.504] 3018
Depth 2 = Score of stockfish-8-thread vs stockfish-1-thread: 1730 - 1551 - 365  [0.525] 3646  
Depth 3 = Score of stockfish-8-thread vs stockfish-1-thread: 1564 - 1288 - 355  [0.543] 3207
Depth 4 = Score of stockfish-8-thread vs stockfish-1-thread: 1606 - 1048 - 356  [0.593] 3010  => huge leap
Depth 5 = Score of stockfish-8-thread vs stockfish-1-thread: 1923 - 770 - 405  [0.686] 3098   => another huge leap
Depth 6 = Score of stockfish-8-thread vs stockfish-1-thread: 2065 - 581 - 468  [0.738] 3114   => a modest improvement now
Depth 7 = Score of stockfish-8-thread vs stockfish-1-thread: 1966 - 524 - 525  [0.739] 3015   => completely leveled off here
Depth 8 = Score of stockfish-8-thread vs stockfish-1-thread: 2028 - 479 - 649  [0.745] 3156
Depth 9 = Score of stockfish-8-thread vs stockfish-1-thread: 1892 - 398 - 724  [0.748] 3014
I have some theories:

a) First off with the ridiculous theory, Stockfish does not respect fixed depth -- LazySMP with its (d/d+1) depths could be problematic for
fixed depth search. Maybe it is matching d+1 (8-threads) vs d (1-thread)? The single thread search respects depth but the multi-threaded search does d+1. Also, d vs d+1 matches may level off after some depth, like seen in the above patter, in a single-threaded engine as well ... but I have to test first.

b) The one thing staring in our face is the aggressive pruning for depth <= 8. The huge leap from depth 4 to depth 7 suggests that this is the cause of widening. So it is indeed wider/richer BUT why is Scorpio not benefiting from this even though I prune equally aggressively I believe.
There is that difference of SHT vs LazySMP between Scorpio and stockfish though. If this theory turns out to be the case, at least now we know where to fix the single-threaded search. I also suspect a bug in LMR/pruning with the parallel search e.g when you pick a move in parallel it may not have the right index for LMR reductions unless one is very careful. The single-threaded search is what one intends the search to look like, so lessening reductions/pruning via a chaotic parallel search is weird, to say the least.

Let me know what you think.

Daniel
Interesting! I looked in the SF source for the place where it actually sets the depth of the worker threads but I could not find it :( (the whole thing looks rather over engineered). I thought that SF does not quite follow the d/d+1 paradigm but I am not sure now.

In any case fixed depth comparisons are dangerous if some threads search deeper (I do not know if SF does this).
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by Daniel Shawul »

Interesting! I looked in the SF source for the place where it actually sets the depth of the worker threads but I could not find it :( (the whole thing looks rather over engineered). I thought that SF does not quite follow the d/d+1 paradigm but I am not sure now.

In any case fixed depth comparisons are dangerous if some threads search deeper (I do not know if SF does this).
It looks "ridiclous theory number 1" could be the winner :) I redid the test at fixed depth=8 but this time constraining all 8 threads to 1 cores using
"taskset -c 0 ./cutechess-cli", and i can see only 100% cpu usage when the previous test was 800% cpu usage.
So both the single thread and multi-thread search get one core.

Code: Select all

Score of stockfish-8-thread vs stockfish-1-thread: 1113 - 467 - 497  [0.656] 2077
So to improve stockfish search, all one has to do is run multi-threaded but with 1-core. So that is my winning patch to stockfish to be sent later :)
I will do a timed match (not fixed depth) before I sent the patch though lol
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: RMO - Randomized Move Order - yet another Lazy SMP derivate

Post by petero2 »

Daniel Shawul wrote: Sat Jan 04, 2020 8:33 pm
Michel wrote: Sat Jan 04, 2020 11:02 am I think it still needs to be established that Lazy indeed searches a substantially wider tree (this is testable to some extent). It is not very obvious to me which mechanism might cause this as all the threads use basically the same search.
Has anyone ever tested SHT-2 threads vs SHT-1 thread at fixed depth like Ronald mentioned? Maybe this is a very easy way to prove/disprove
if the tree is "richer" or "wider". We can do the same for YBW as a baseline ...
I tested this about 2 years ago just before Texel 1.07 was released. Texel does not use a pure SHT search though, but something I called approximate ABDADA. Anyway, I got this result:

Code: Select all

2 threads vs 1 thread:
  d=10: +24.3 elo, 3594 games
  d=11:  +9.1 elo, 3524 games
  d=12: +18.9 elo, 3892 games
  d=13: +11.8 elo, 3630 games
4 threads vs 1 thread:
  d=10: +31.4 elo, 3590 games
  d=11: +20.9 elo, 3540 games
  d=12: +23.5 elo, 3574 games
  d=13: +17.5 elo, 3522 games
8 threads vs 1 thread:
  d=10: +48.5 elo, 4102 games
  d=11: +38.8 elo, 5698 games
  d=12: +32.3 elo, 3568 games
  d=13: +30.7 elo, 4174 games
16 threads vs 1 thread:
  d=10: +71.0 elo, 10104 games
  d=11: +62.0 elo, 5718 games
  d=12: +52.3 elo, 3546 games
  d=13: +44.8 elo, 3654 games
24 threads vs 1 thread:
  d=10: +89.8 elo, 9475 games
  d=11: +75.3 elo, 5686 games
  d=12: +63.5 elo, 3500 games
  d=13: +52.9 elo, 3822 games
The search algorithm is really a mix of ABDADA and lazy since the ABDADA exclusion is only used at a node if the remaining depth is >= 7 and the node already has an entry in the hash table (which happens on average around 95% of the time, but much more often for larger remaining depth).

Also the ABDADA exclusion does not enforce sequential consistency between threads (for performance reasons), so a thread could sometimes fail to "see" an exclusion flag recently set by a different thread. (That may not happen in practice though unless you run on a cluster or on an ARM CPU.)

I speculated why lazy SMP could cause elo to increase with increased number of threads at fixed depth here:
petero2 wrote: Sat Aug 19, 2017 9:56 am 1. It might be the case that lazy SMP (and SHT) works better if you have a per-thread history table. The late moves in ALL nodes are mostly ordered by the history heuristic, and if the history table is thread local, there is a decent chance that it becomes different in different threads. This would have the effect that different threads often don't search the same moves at the same time, and would therefore be able to pick up results from other threads from the transposition table.

2. If you do heavy LMR/LMP type reductions, and the reduction a move receives depends on the move order, a side effect of a thread local history table is that different threads will use different reduction amounts for the same move. Since the moves searched early by each thread typically get a lower reduction, and those results are later picked up through the hash table by other threads, the effect is as if the average LMR/LMP reduction is lower than in a single threaded search. It is already known that LMR/LMP is a big loss for fixed depth games, but a big gain in fixed time games, so I expect this effect would cause some elo increase, although not as big an elo increase as when using an SMP algorithm that avoids duplicated work without decreasing LMR/LMP.