Questions on SMP search

Discussion of chess software programming and technical issues.

Moderator: Ras

uaf
Posts: 98
Joined: Sat Jul 31, 2010 8:48 pm
Full name: Ubaldo Andrea Farina

Re: Questions on SMP search

Post by uaf »

To update the history heuristic I just do:
history[index] += depth * depth;

You are right that this is not atomic, but I will not use a lock; it would be a waste of time.
The history heuristic is already a mess anyway, so I don't care if I add a bit more noise :)
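As an aside, here is a minimal sketch (array size and names are illustrative, not from any engine) of how the same "noisy" update could be made free of data races without a lock, using a relaxed atomic increment. On x86 this compiles to a LOCK ADD: cheap, but not free, which is why one might still prefer to accept the noise.

```cpp
#include <atomic>

// Per-move-index history scores; zero-initialized at program start.
std::atomic<int> history[4096];

void update_history(int index, int depth) {
    // No ordering is needed between threads; we only want the
    // read-modify-write itself not to tear or lose increments.
    history[index].fetch_add(depth * depth, std::memory_order_relaxed);
}
```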
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

mcostalba wrote:
bob wrote: The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.

The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.
Apart from the fact that bus snooping is heavier for shared variables than for thread-local ones, your argument on cache pressure is at least dubious.
Sorry, but my observations on cache issues are _spot on_. A couple of years back, AMD used a very clever cache-snooping tool they have to analyze a different issue in Crafty, and in doing so we learned exactly what was hurting what. But here's a hint: these things have shared caches. Local data is replicated data. When you replicate something 12 times and beyond, there is a cost in terms of cache pressure, a cost that is much more significant than the MOESI traffic shared data causes. And you seem to forget that with 99% hits there are no snoop/invalidate issues at all; it is only for that 1% (or less) that you have to write...

But the key reason is that once we get a pointer to a pawn hash table entry we can use it much later in the evaluation (unlike the main hash, where you just quickly read the value and TT move). For instance, passed pawns are stored in the pawn hash entry, and so is some king-safety related stuff that is accessed and used only long after the pawn entry pointer has been grabbed... and we really don't want to worry about someone changing the pawn entry under our feet while we are still working in the evaluation code.
Do as I do and when you probe and hit, copy that entry to a local variable. And as long as the pawn hash signature doesn't change, you do no pawn hash probe, and nobody can modify your data at all. And since the pawn hash signature is in the pawn hash entry, every time you hit the pawn evaluation and want to check the hash, you first just check to see if the signature matches what is in the local entry. If so, no probing or anything. And it works _well_.
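A minimal sketch of the local-copy scheme described above, with a hypothetical entry layout and table size (Crafty's real structures differ):

```c
#include <stdint.h>

// Hypothetical pawn hash entry; real engines pack more data.
typedef struct {
    uint64_t signature;   // pawn hash key of the structure evaluated
    int16_t  score;       // cached pawn-structure score
    uint8_t  passed[2];   // e.g. passed-pawn files per side
} PawnEntry;

PawnEntry pawn_table[1 << 16];  // shared table
PawnEntry local;                // per-thread copy (thread-local in real code)

int probe_pawns(uint64_t pawn_key) {
    // As long as the pawn signature is unchanged, reuse the local copy:
    // no shared-memory access at all, and no other thread can race us.
    if (local.signature == pawn_key)
        return 1;  // hit on the local copy
    PawnEntry *e = &pawn_table[pawn_key & ((1 << 16) - 1)];
    if (e->signature == pawn_key) {
        local = *e;  // copy once; later reads never touch shared memory
        return 1;
    }
    return 0;  // miss: caller computes and stores a fresh entry
}
```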
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

uaf wrote:To update the history heuristic I just do:
history[index] += depth * depth;

You are right that this is not atomic, but I will not use a lock; it would be a waste of time.
The history heuristic is already a mess anyway, so I don't care if I add a bit more noise :)
That was _exactly_ my comment with respect to noisy history data...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Questions on SMP search

Post by mcostalba »

bob wrote: Do as I do and when you probe and hit, copy that entry to a local variable.
This means using a local per-thread cache for a single entry, and _if_ this works well, as you stated, it means that the part of the pawn hash table present in cache at any given moment is far smaller than the full table size; if this were not true, you'd have a high frequency of accesses to different pawn hashes, hence a huge amount of copying, and you'd experience slowdowns from copying entries to a local variable.

So _if_ this does not occur, it means that with per-thread pawn hashes too, only a few lines of each table are cached on average, and the overhead for many CPUs is really low.

The bottom line is that when you copy an entry to a local variable you are just trying to mimic, manually and in a crude way, what the hardware cache is built for and can do in a much more efficient and sophisticated way... and because we prefetch the pawn hash access, the job is even done in a shadowed way.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

mcostalba wrote:
bob wrote: Do as I do and when you probe and hit, copy that entry to a local variable.
This means using a local per-thread cache for a single entry, and _if_ this works well, as you stated, it means that the part of the pawn hash table present in cache at any given moment is far smaller than the full table size; if this were not true, you'd have a high frequency of accesses to different pawn hashes, hence a huge amount of copying, and you'd experience slowdowns from copying entries to a local variable.
That is not what is happening. Suppose you do a fixed-depth search with no extensions, so that when you get to ply N, you make a move and then call evaluate... Then you make another move and call evaluate. I generate piece moves before pawn moves, so the first half of the moves (or more) have _identical_ pawn structures. The second half of the moves produce nothing but pawn signature changes since they are pawn moves. And those don't take advantage of that single-entry cache. During that heavy-use period of time, it doesn't matter whether another thread decides to overwrite the corresponding pawn hash entry or not. My local copy is not affected, and I keep right on running at max speed with no cache invalidate/forward traffic at all.

So _if_ this does not occur, it means that with per-thread pawn hashes too, only a few lines of each table are cached on average, and the overhead for many CPUs is really low.

The bottom line is that when you copy an entry to a local variable you are just trying to mimic, manually and in a crude way, what the hardware cache is built for and can do in a much more efficient and sophisticated way... and because we prefetch the pawn hash access, the job is even done in a shadowed way.
I am not quite sure how you can reliably "pre-fetch" unless you write in asm and do some careful hardware performance-monitor analysis. My point is that I don't want replicated data if possible. It is purely wasted space and it impacts cache. The pawn hash is basically a read-only data structure, since there are very few writes once it is populated; well over 95% of accesses are reads, which don't cause local processor caches to generate a lot of invalidate messages.

I'm simply explaining what I do, and what has been tested a zillion times over the years on various architectures dating back to the Cray. Anyone is free to use any scheme they want. Threads and shared data is the way to go for most things, if it is done properly. EGTBs included.
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Questions on SMP search

Post by Houdini »

bob wrote:The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).
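For reference, the lockless ("XOR") scheme under discussion, as published by Hyatt and Mann, stores the key XORed with the data word, so a torn write (key from one thread, data from another) fails validation instead of being used. A sketch with illustrative names:

```cpp
#include <cstdint>

struct TTEntry {
    uint64_t keyXorData;  // key ^ data
    uint64_t data;        // packed move/score/depth/bound
};

void tt_store(TTEntry &e, uint64_t key, uint64_t data) {
    e.data = data;
    e.keyXorData = key ^ data;  // two plain stores, no lock
}

bool tt_probe(const TTEntry &e, uint64_t key, uint64_t &data) {
    data = e.data;
    // If the two words came from different writes, the XOR check fails
    // and the entry is simply treated as a miss.
    return (e.keyXorData ^ data) == key;
}
```

The two XORs per store/probe are the "essentially free" extra work referred to above; whether they are measurable clearly depends on the engine.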
bob wrote:The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.
Sure, but 24 times 2 MB is still only 48 MB, nowadays that's become "quantité négligeable".
The main issue in systems with higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA node-specific memory allocation. It's better for a thread to use the thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.

Robert
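A sketch of the NUMA-node-specific allocation idea, relying on Linux's default first-touch page placement rather than any NUMA library: each search thread allocates and zeroes its own pawn hash, so the pages are faulted in on (and thus placed on) that thread's own node. Names and sizes are hypothetical, not Houdini's.

```c
#include <stdlib.h>
#include <string.h>
#include <pthread.h>

#define PAWN_HASH_BYTES (2u << 20)  // 2 MB per thread, as discussed

typedef struct {
    char *pawn_hash;  // this thread's private pawn hash
} ThreadData;

static void *search_thread(void *arg) {
    ThreadData *td = (ThreadData *)arg;
    td->pawn_hash = malloc(PAWN_HASH_BYTES);
    // First touch from this thread faults the pages in here, so the
    // kernel places them on this thread's local NUMA node.
    memset(td->pawn_hash, 0, PAWN_HASH_BYTES);
    /* ... search using td->pawn_hash ... */
    return NULL;
}
```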
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Questions on SMP search

Post by mcostalba »

bob wrote: I am not quite sure how you can reliably "pre-fetch" unless you write in asm and do some careful hardware performance monitor analysis.
I have used VTune to validate the change and here is the prefetch:

Code:

#include <xmmintrin.h>

/// prefetch() preloads the given address into the L1/L2 cache. This is a
/// non-blocking function: it does not stall the CPU waiting for data to
/// be loaded from memory, which can be quite slow.

void prefetch(char* addr) {

#if defined(__INTEL_COMPILER) || defined(__ICL)
   // This hack prevents the prefetches from being optimized away by the
   // Intel compiler. Both MSVC and gcc seem unaffected.
   __asm__ ("");
#endif

  _mm_prefetch(addr, _MM_HINT_T2);
  _mm_prefetch(addr+64, _MM_HINT_T2); // 64 bytes ahead
}

BTW, the order in which moves are generated means nothing, because search proceeds in a depth-first fashion: at whatever depth you generate the piece move, you _first_ go down to qsearch, where there are only captures (ordered in a way totally independent of how the moves were generated), and then you undo moves until you come back to the starting depth before trying the next piece move, so the cache has already changed considerably by that point.
bhlangonijr
Posts: 482
Joined: Thu Oct 16, 2008 4:23 am
Location: Milky Way

Re: Questions on SMP search

Post by bhlangonijr »

Houdini wrote:
bob wrote:The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).
bob wrote:The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.
Sure, but 24 times 2 MB is still only 48 MB, nowadays that's become "quantité négligeable".
The main issue in systems with higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA node-specific memory allocation. It's better for a thread to use the thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.

Robert
Robert, concerning SMP code, what is the technique you have implemented in Houdini?
How does it scale (considering the number of threads/cores)?
How do you test SMP code changes?
What about the questions in the first post of this topic (I would like to know how you do it in Houdini)? :)

Regards,
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

mcostalba wrote:
bob wrote: I am not quite sure how you can reliably "pre-fetch" unless you write in asm and do some careful hardware performance monitor analysis.
I have used VTune to validate the change and here is the prefetch:

Code:

#include <xmmintrin.h>

/// prefetch() preloads the given address into the L1/L2 cache. This is a
/// non-blocking function: it does not stall the CPU waiting for data to
/// be loaded from memory, which can be quite slow.

void prefetch(char* addr) {

#if defined(__INTEL_COMPILER) || defined(__ICL)
   // This hack prevents the prefetches from being optimized away by the
   // Intel compiler. Both MSVC and gcc seem unaffected.
   __asm__ ("");
#endif

  _mm_prefetch(addr, _MM_HINT_T2);
  _mm_prefetch(addr+64, _MM_HINT_T2); // 64 bytes ahead
}

BTW, the order in which moves are generated means nothing, because search proceeds in a depth-first fashion: at whatever depth you generate the piece move, you _first_ go down to qsearch, where there are only captures (ordered in a way totally independent of how the moves were generated), and then you undo moves until you come back to the starting depth before trying the next piece move, so the cache has already changed considerably by that point.
Incorrect. The _last_ ply is the one that repeatedly calls Evaluate(). In a typical position there are 38 legal moves, and maybe 1/4 of those are pawn moves. Depth-first has nothing to do with evaluate; evaluate is called along the frontier and beyond, and there the order is significant.

Just dump a tree and see how many of the q-search captures are pawn captures. And then measure the performance of 'the trick' at frontier and beyond, to get a feel for how effective it is. I'd be willing to bet that will save you a measurable amount of time. It did me. And it is free with trivial effort required to do it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Questions on SMP search

Post by bob »

Houdini wrote:
bob wrote:The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).
bob wrote:The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.
Sure, but 24 times 2 MB is still only 48 MB, nowadays that's become "quantité négligeable".
The main issue in systems with higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA node-specific memory allocation. It's better for a thread to use the thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.

Robert
If you use only 2 MB, you are faced with this:

log.001: time=3:00 mat=0 n=441934327 fh=91% nps=2.5M
log.002: time=3:00 mat=0 n=490668472 fh=91% nps=2.7M

Both searches are for 180 seconds. Which one would you rather toss out in a game? The first uses hashp=2m, the second uses hashp=128m. I know which one _I_ want: I'll take the almost 10% faster version myself. And when you have 24 of those, you are impacting cache significantly. And with no down-side to sharing, this is an idea that makes no sense to me. If it were a good idea, why not local transposition tables as well?

You won't see significant NUMA issues until you go beyond two physical processor chips (not cores). And the pawn hash trick I explained will eliminate about half of the pawn hash table accesses anyway. The main trans/ref table is _much_ worse because it is continually getting written to, roughly once per node searched. The pawn hash doesn't get modified much if it is big enough to prevent excessive overwrites.

But to each his own, of course...