Lazy SMP and 44 cores

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Lazy SMP and 44 cores

Post by jdart »

I still have a pretty simple-minded approach to assigning depths to threads.

But one thing I have looked at recently is the selection logic when the search is done and there are results from multiple threads. It is possible that some of these are failing low. I extend time if the main thread is failing low, but there is a limit on the extension, so it might still be failing low at search end.

What to do in this case? You really only have an upper bound for the score, so there is no telling how bad the score might become if the search iteration were to complete. Therefore it could be risky to take that search result (also, it may have no PV, so you won't get a ponder hit). You could just refuse to accept a fail-low score... but due to the time limit, it is even possible that all threads are failing low at search end (although that is not likely at all with 44 cores).
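One way to sketch this selection logic (names and tie-breaking are illustrative, not Arasan's actual code): prefer the deepest non-fail-low result, and fall back to the main thread only if every thread failed low.

```cpp
#include <vector>

// Hypothetical per-thread search result.
struct ThreadResult {
    int depth;       // last fully searched depth
    int score;       // score (only an upper bound if failedLow is set)
    bool failedLow;  // true if this thread was failing low at search end
};

// Prefer the deepest result that is not failing low; break depth ties
// by score. If every thread failed low, fall back to the main thread
// (index 0), whose score is at least a known upper bound.
int selectResult(const std::vector<ThreadResult>& results) {
    int best = -1;
    for (int i = 0; i < (int)results.size(); ++i) {
        if (results[i].failedLow) continue;
        if (best == -1 || results[i].depth > results[best].depth ||
            (results[i].depth == results[best].depth &&
             results[i].score > results[best].score))
            best = i;
    }
    return best == -1 ? 0 : best;  // all failed low: use main thread
}
```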

--Jon
User avatar
Ronald
Posts: 160
Joined: Tue Jan 23, 2018 10:18 am
Location: Rotterdam
Full name: Ronald Friederich

Re: Lazy SMP and 44 cores

Post by Ronald »

Hi Sander,

I took a quick look at your code, and it looks like you have one global iterative deepening loop in which you start new threads for your calculateBestmove search on every iteration. Most lazy SMP implementations (I think :D ) have their own iterative deepening loop for every thread; in other words, all threads execute exactly the same iterative deepening search as you would normally do for a single thread. You can assign one of the threads to be the boss, which updates the score and PV and makes the decision to stop searching.
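A minimal sketch of that structure (searchRoot and the depth bookkeeping are placeholders, not any engine's actual code): every thread runs its own iterative deepening loop over a shared stop flag, and only thread 0, the boss, publishes results and decides when to stop.

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<bool> stopFlag{false};
std::atomic<int> bestDepth{0};

int searchRoot(int depth) { return depth; }  // stand-in for a real search

// Each thread runs the same iterative deepening loop; only the boss
// (thread 0) updates the published result and sets the stop flag.
void idLoop(int threadId, int maxDepth) {
    bool isBoss = (threadId == 0);
    for (int depth = 1; !stopFlag.load(); ++depth) {
        int score = searchRoot(depth);
        (void)score;
        if (isBoss) {
            bestDepth.store(depth);
            if (depth >= maxDepth) stopFlag.store(true);
        }
    }
}

void runLazySmp(int numThreads, int maxDepth) {
    stopFlag = false;
    std::vector<std::thread> pool;
    for (int t = 0; t < numThreads; ++t)
        pool.emplace_back(idLoop, t, maxDepth);
    for (auto& th : pool) th.join();
}
```

In a real engine the helpers would share the transposition table, so their deeper entries speed up the boss's iterations.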
sandermvdb
Posts: 160
Joined: Sat Jan 28, 2017 1:29 pm
Location: The Netherlands

Re: Lazy SMP and 44 cores

Post by sandermvdb »

Ronald wrote: Wed Nov 21, 2018 4:47 pm Hi Sander,

I took a quick look at your code, and it looks like you have one global iterative deepening loop in which you start new threads for your calculateBestmove search on every iteration. Most lazy SMP implementations (I think :D ) have their own iterative deepening loop for every thread; in other words, all threads execute exactly the same iterative deepening search as you would normally do for a single thread. You can assign one of the threads to be the boss, which updates the score and PV and makes the decision to stop searching.
1.11 is indeed using my 'old' approach, which I am trying to update in the way you are suggesting. Unfortunately no success so far, which is why I re-opened this topic ;)
Ratosh
Posts: 77
Joined: Mon Apr 16, 2018 6:56 pm

Re: Lazy SMP and 44 cores

Post by Ratosh »

Hi Sander,

In Pirarucu I have a similar search setup:
- Only the pawn cache and transposition table are shared;
- No eval cache or material cache;
- PV and score are updated by the master thread and taken from the TT.

I have 2 different kinds of threads running (main search and helper search):
- The main thread controls the stop conditions, PV and score of the search. There is no difference here between single-threaded and multi-threaded use.
- The helpers basically populate the TT with deeper search information. A helper stops searching when it reaches max depth (127) or when the stop condition is set. Each helper picks its depth individually using a depth-skip logic (based on Laser), so different helpers go deeper at different rates and keep searching deeper than the main search.
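The depth-skip idea could be sketched like this (a hypothetical scheme in the same spirit, not Pirarucu's or Laser's actual code): each helper starts at a slightly different depth and advances at a different rate.

```cpp
#include <vector>

const int MAX_HELPER_DEPTH = 127;  // cap mentioned above

// Hypothetical depth-skip scheme for helper threads: stagger both the
// starting depth and the step size by thread id, so helpers populate
// the TT at depths ahead of (and different from) the main search.
std::vector<int> helperDepths(int threadId) {
    int depth = 1 + (threadId % 2);      // stagger the starting depth
    int step  = 1 + (threadId / 2) % 3;  // stagger the rate of deepening
    std::vector<int> depths;
    while (depth <= MAX_HELPER_DEPTH) {
        depths.push_back(depth);
        depth += step;
    }
    return depths;
}
```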

It is pretty simple and seems to scale well.
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Lazy SMP and 44 cores

Post by cdani »

In Andscacs there is only one iterative deepening loop in the main thread, which signals all the threads to start searching.
Regarding the depth assigned to each thread:

Code: Select all

	// Default: odd-numbered threads search one ply deeper than even ones.
	for (int i = 1; i < NumThreads; i++) {
		incrementthread[i] = i & 1;
	}

	// With few threads, give each thread its own distinct increment.
	if (NumThreads <= 4) {
		incrementthread[0] = 0;
		incrementthread[1] = 1;
		incrementthread[2] = 2;
		incrementthread[3] = 3;
	}
	// With more threads, spread increments 0..4 over the first 8 threads.
	else {
		incrementthread[0] = 0;
		incrementthread[1] = 1;
		incrementthread[2] = 0;
		incrementthread[3] = 2;
		incrementthread[4] = 1;
		incrementthread[5] = 3;
		incrementthread[6] = 2;
		incrementthread[7] = 4;
	}

	// Beyond 16 threads, cycle through increments 0..7.
	if (NumThreads > 16) {
		for (int i = 16; i < NumThreads; i++) {
			incrementthread[i] = i & 7;
		}
	}
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Lazy SMP and 44 cores

Post by D Sceviour »

sandermvdb wrote: Tue Aug 07, 2018 8:51 pm At the moment my chessengine participates in the TCEC tournament which uses a machine that has 44(!) cores. I was wondering what would be the best approach on how to use these cores effectively using lazy SMP. At the moment half of the threads search the actual depth and the other half searches at depth + 1. I thought this is what most engines do but I can imagine it isn't very useful when there are 44 threads.

Any advice?
Each independent thread should find its own maximum internal depth based on reductions, so what is the point in varying the depth at the root? Interesting, though, is the idea of an aspiration spread. Has anyone tested a method of threading based on the aspiration window rather than the iterative depth? Maybe something like the following is worth trying for large numbers of threads:

t = thread number
1.414 = square root of 2
MINIMUM_ASPIRATION_WINDOW = 5
MATE_SCORE = 32768

aspiration(t) = (int) (1.414 * t) ^ 2

with the limits of,
MINIMUM_ASPIRATION_WINDOW < aspiration(t) < MATE_SCORE
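As a concrete sketch of the formula above (reading "^2" as squaring, and clamping the result strictly between the stated limits):

```cpp
#include <algorithm>

const int MINIMUM_ASPIRATION_WINDOW = 5;
const int MATE_SCORE = 32768;

// aspiration(t) = ((int)(1.414 * t))^2, kept strictly between the
// minimum window and the mate score as proposed above.
int aspiration(int t) {
    int a = static_cast<int>(1.414 * t);  // sqrt(2) * thread number
    int w = a * a;                        // "^2" means squaring, not XOR
    return std::max(MINIMUM_ASPIRATION_WINDOW + 1,
                    std::min(w, MATE_SCORE - 1));
}
```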

I have a 4 core machine, so testing 43 threads is only academic. Maybe someone with large hardware might report findings on this.
jorose
Posts: 358
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: Lazy SMP and 44 cores

Post by jorose »

There are a few things which I could imagine make Lazy SMP less effective in Winter. E.g. there are no reductions at the root node, and Winter does not use the history heuristic in any way at the moment, so the search is less stochastic.

The scheme I ended up coming up with was to use the one described in this thread (skip a depth iff more than half the other threads are searching that depth or higher) with an additional modification that was necessary in Winter's case. At every depth where a search actually occurs in a helper thread, there is a 25% chance to perturb the root move list for that thread in the following way:

Code: Select all

  for (int i = moves.size()-1; i > 0; i--) {
    if (rng() % 2) {
      std::swap(moves[i-1], moves[i]);
    }
  }
This perturbation has some nice properties imo. A move can only get knocked back by at most one rank, so the best move will get back to the front of the list quickly if it is incorrectly demoted. On the other hand, any move can end up at an arbitrarily high rank, but with exponentially lower probability the further back in the list it is. Since it is stochastic, you also don't need to come up with a scheme to make sure different threads aren't doing exactly the same thing.

Originally I did this perturbation with probability 1/depth since one expects the move lists to be more correct over time, but I found the search scaled worse with more time than with a fixed 25% probability.
-Jonathan
jstanback
Posts: 130
Joined: Fri Jun 17, 2016 4:14 pm
Location: Colorado, USA
Full name: John Stanback

Re: Lazy SMP and 44 cores

Post by jstanback »

jorose wrote: Thu Nov 22, 2018 12:52 pm There are a few things which I could imagine make Lazy SMP less effective in Winter. E.g. there are no reductions at the root node, and Winter does not use the history heuristic in any way at the moment, so the search is less stochastic.

The scheme I ended up coming up with was to use the one described in this thread (skip a depth iff more than half the other threads are searching that depth or higher) with an additional modification that was necessary in Winter's case. At every depth where a search actually occurs in a helper thread, there is a 25% chance to perturb the root move list for that thread in the following way:

Code: Select all

  for (int i = moves.size()-1; i > 0; i--) {
    if (rng() % 2) {
      std::swap(moves[i-1], moves[i]);
    }
  }
This perturbation has some nice properties imo. A move can only get knocked back by at most one rank, so the best move will get back to the front of the list quickly if it is incorrectly demoted. On the other hand, any move can end up at an arbitrarily high rank, but with exponentially lower probability the further back in the list it is. Since it is stochastic, you also don't need to come up with a scheme to make sure different threads aren't doing exactly the same thing.

Originally I did this perturbation with probability 1/depth since one expects the move lists to be more correct over time, but I found the search scaled worse with more time than with a fixed 25% probability.
I do something similar in Wasp to try to get threads to search moves in a different order. For each thread the moves have a "score" from the previous iteration and are searched in order of score. Any move that did not fail low has a valid score, while moves that failed low have a score of -12000-N, where N is the ordinal of that move in the list. Before searching an iteration, the slave threads subtract a random number between 0 and 16 from the score of each move. If there is more than one decent move, threads might search those moves in different orders. The remaining moves will definitely be searched in a different order by each thread. I don't know if this helps much since I don't do a lot of MT testing.
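A sketch of that jitter idea (struct and function names are illustrative, not Wasp's actual code): subtract a small random amount from each root move's score, then sort by score, so threads with several close moves likely search them in different orders.

```cpp
#include <algorithm>
#include <random>
#include <vector>

struct RootMove {
    int move;   // encoded move (placeholder)
    int score;  // score from the previous iteration
};

// Helper threads subtract a random 0..16 from each score before
// sorting, so moves whose scores are within 16 of each other may
// end up in a different order on each thread.
void jitterAndSort(std::vector<RootMove>& moves, std::mt19937& rng) {
    std::uniform_int_distribution<int> dist(0, 16);
    for (auto& m : moves) m.score -= dist(rng);
    std::stable_sort(moves.begin(), moves.end(),
                     [](const RootMove& a, const RootMove& b) {
                         return a.score > b.score;
                     });
}
```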

John
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Lazy SMP and 44 cores

Post by lucasart »

jstanback wrote: Wed Aug 08, 2018 4:33 pm BTW, I had a "bug" in the Wasp I sent for TCEC 12 which reduced the search speed by almost 50%. I have a structure called "EvalData" used in many places which I previously allocated on the stack at the start of eval(). But in a brain-dead moment, I changed this to an array of structures indexed by thread. I think this causes the memory for all threads to be associated with the NUMA node that first writes to this array. Threads not on that NUMA node will have slow access to this data. I fixed this in Wasp 3.0 by including the EvalData structure in the ThreadData structure which is allocated and written to by each thread when it is first started. I have some other global variables which are initialized by the master thread and used, but not changed, by all the threads during search. I think I'll try duplicating these in the ThreadData structure to see if I can get more speedup....
50% seems a bit high. Although in your case, the offending data structure was likely on the hot path (used throughout the eval).

In Demolito, I'm making the mistake you describe with per-worker data structures, such as the Pawn HT, History table, etc.

It would be easy to "fix" this, but I don't have a NUMA machine to test it, so I'll probably leave it as is until I can measure it.
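The first-touch fix described above could look something like this (names and sizes are illustrative): each thread allocates and initializes its own data after it starts, so the pages are mapped on that thread's NUMA node rather than on the node of whichever thread initialized a global array first.

```cpp
#include <algorithm>
#include <memory>
#include <thread>
#include <vector>

struct EvalData { char scratch[4096]; };  // stand-in for the real struct

struct ThreadData {
    std::unique_ptr<EvalData> eval;
    void init() {
        // Allocation and first write happen on the owning thread,
        // so (with a first-touch policy) the pages land on its node.
        eval = std::make_unique<EvalData>();
        std::fill(std::begin(eval->scratch), std::end(eval->scratch), 0);
    }
};
```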
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
User avatar
Ronald
Posts: 160
Joined: Tue Jan 23, 2018 10:18 am
Location: Rotterdam
Full name: Ronald Friederich

Re: Lazy SMP and 44 cores

Post by Ronald »

Moving data from RAM to the processor cache is very expensive, so you need to keep related data close together in order to minimize transfers. I understand that you have 2 different arrays, one for the keys and one for the values, which will be far apart in RAM; this means you need at least 2 transfers from RAM to cache per probe. I don't know how you implemented the bucket approach; if that info is also not within the same cache line (or it overlaps multiple cache lines) you might have even more transfers. Ideally a hash entry should be 64 bytes and cache aligned, so that it is not spread over 2 cache lines. I don't know if this will solve your "problem" but it will probably help.
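For illustration, such a cache-line-sized bucket might look like this (the field layout is hypothetical; the point is that key and value live in the same 64-byte aligned line):

```cpp
#include <cstdint>

// One 16-byte entry: key and data together, so a probe that finds the
// key already has the value in cache.
struct HashEntry {
    uint64_t key;
    int16_t  score;
    int16_t  move;
    uint8_t  depth;
    uint8_t  flags;
    uint16_t age;
};

// Four entries per bucket = 64 bytes, aligned to one cache line, so a
// probe never straddles two lines.
struct alignas(64) HashBucket {
    HashEntry entries[4];
};
```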

Concerning the other shared hash tables, you might try to make them thread local, especially the eval cache, because it will be overwritten so much that the hit rate will probably be low. I don't know about the others, but Raoni says a global pawn hash works in Pirarucu.

In Rofchade a different aspiration window is used for 50% of the threads. The boss thread and 50% - 1 of the workers use the regular aspiration window, the others use bigger and smaller windows. After some/a lot of tweaking it seems to gain around 25 to 30 elo.

It would be nice if we could test once in a while on TCEC-like hardware with many cores and multiple processors. For instance, in the current "TCEC version" the hash table is also used in quiescence, but I don't know whether that is better on the TCEC hardware. Does anybody have experience with that?