history[index] += depth * depth;
You are right that this is not atomic, but I will not use a lock; it would be a waste of time.
The history heuristic is already a mess anyway, so I don't care if I add a bit more noise.
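For reference, here is a minimal sketch (hypothetical names, not uaf's actual code) of how the same update could stay lock-free while avoiding the data race, using a relaxed atomic add; the cost is one atomic read-modify-write per update instead of a lock:

Code:

#include <atomic>
#include <cstddef>

// Hypothetical history table shared by all search threads.
constexpr std::size_t HISTORY_SIZE = 64 * 64;   // e.g. from-square x to-square
std::atomic<int> history[HISTORY_SIZE];

void update_history(std::size_t index, int depth) {
    // memory_order_relaxed: no lock and no ordering guarantees, just an
    // atomic increment so concurrent updates are not lost.
    history[index].fetch_add(depth * depth, std::memory_order_relaxed);
}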
Sorry, but my observations on cache issues are _spot on_. A couple of years back AMD used a very clever cache-sniffing tool they have to analyze a different issue in Crafty, and in doing so we learned exactly what was hurting what. But here's a hint: these things have shared caches. Local data is replicated data. When you replicate something 12 times and beyond, there is a cost in terms of cache pressure, a cost that is much more significant than the MOESI traffic shared data causes. And you seem to forget that with 99% hits there are no snoop/invalidate issues at all; they arise only for the 1% (or less) of accesses that actually write.
mcostalba wrote: Apart from the fact that bus sniffing is heavier for shared variables than for thread-local ones, your argument on cache pressure is at least dubious.
bob wrote: The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
The down-side of local pawn hashes shows up on a 12-core or 24-core box: 24 separate pawn hashes can burn some memory. And for any configuration, they have a somewhat negative effect on cache.
Do as I do: when you probe and hit, copy that entry to a local variable. As long as the pawn hash signature doesn't change, you do no pawn hash probe at all, and nobody can modify your data. Since the pawn hash signature is stored in the pawn hash entry, every time you reach the pawn evaluation and want to check the hash, you first check whether the signature matches what is in the local entry. If it does, there is no probing or anything else. And it works _well_.
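A minimal sketch of the pattern described above (names and sizes are hypothetical, not Crafty's actual code): each thread keeps a private copy of the last pawn entry it used and goes back to the shared table only when the pawn signature changes.

Code:

#include <cstdint>
#include <cstddef>

struct PawnHashEntry {
    std::uint64_t key;     // pawn hash signature
    int           score;   // cached pawn-structure evaluation
    // ... passed-pawn masks, king-shelter data, etc.
};

constexpr std::size_t PAWN_TABLE_SIZE = 1 << 15;   // entries, power of two
PawnHashEntry pawn_table[PAWN_TABLE_SIZE];         // shared by all threads
thread_local PawnHashEntry local_entry;            // per-thread cached copy

PawnHashEntry compute_pawn_entry(std::uint64_t key);  // full pawn evaluation, not shown

const PawnHashEntry* probe_pawn_hash(std::uint64_t pawn_key) {
    if (local_entry.key == pawn_key)
        return &local_entry;            // signature unchanged: no table access at all

    std::size_t idx = pawn_key & (PAWN_TABLE_SIZE - 1);
    if (pawn_table[idx].key == pawn_key)
        local_entry = pawn_table[idx];  // copy once; later writers cannot touch our copy
    else {
        local_entry = compute_pawn_entry(pawn_key);   // miss: evaluate from scratch...
        pawn_table[idx] = local_entry;                // ...and publish to the shared table
    }
    return &local_entry;
}

Note that the copy from the shared table can itself race with a concurrent writer; that tearing problem is what the lockless (XOR) hashing discussed elsewhere in the thread addresses.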
But the key reason is that once we get a pointer to a pawn hash table entry, we can use it much later in the evaluation (unlike the main hash, where you just quickly read the value and the TT move). For instance, passed pawns are stored in the pawn hash entry, and so is some king-safety related information that is accessed and used long after the pawn entry pointer has been grabbed... and we really don't want to worry about someone changing the pawn entry under our feet while we are still working in the evaluation code.
That was _exactly_ my comment with respect to noisy history data...
uaf wrote: To update the history heuristic I just do:
history[index] += depth * depth;
You are right that this is not atomic, but I will not use a lock; it would be a waste of time.
The history heuristic is already a mess anyway, so I don't care if I add a bit more noise.
This means using a local per-thread cache for a single entry, and _if_ this works well, as you stated, then the part of the pawn hash table present in cache at any given moment is far smaller than the full table size. If that were not true, you would have a high frequency of accesses to different pawn hashes, and therefore a lot of copying activity, and you would see slowdowns from copying entries to a local variable.
bob wrote: Do as I do and when you probe and hit, copy that entry to a local variable.
That is not what is happening. Suppose you do a fixed-depth search with no extensions, so that when you get to ply N, you make a move and then call evaluate... then you make another move and call evaluate. I generate piece moves before pawn moves, so the first half of the moves (or more) have _identical_ pawn structures. The second half of the moves produce nothing but pawn signature changes, since they are pawn moves, and those don't take advantage of that single-entry cache. During that heavy-use period it doesn't matter whether another thread decides to overwrite the corresponding pawn hash entry or not: my local copy is not affected, and I keep right on running at max speed with no cache invalidate/forward traffic at all.
mcostalba wrote: This means using a local per-thread cache for a single entry, and _if_ this works well, as you stated, then the part of the pawn hash table present in cache at any given moment is far smaller than the full table size. If that were not true, you would have a high frequency of accesses to different pawn hashes, and therefore a lot of copying activity, and you would see slowdowns from copying entries to a local variable.
bob wrote: Do as I do and when you probe and hit, copy that entry to a local variable.
I am not quite sure how you can reliably "pre-fetch" unless you write in asm and do some careful hardware performance-monitor analysis. My point is that I don't want replicated data if possible. It is purely wasted space and it impacts cache. The pawn hash is basically a read-only data structure, since very few writes are done once it is populated, and well over 95% of accesses are reads, which don't cause the local processor caches to generate many invalidate messages.
So _if_ this does not occur, it means that even with a per-thread pawn hash only a few lines of each table are cached on average, and the overhead with many CPUs is really low.
The bottom line is that when you copy an entry to a local variable, you are just trying to mimic manually, and in a crude way, what the hardware cache is built for and does in a far more efficient and sophisticated way... and because we prefetch the pawn hash access, the work is even done in the shadow of other computation.
In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).
bob wrote: The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
Sure, but 24 times 2 MB is still only 48 MB; nowadays that has become "quantité négligeable" (a negligible amount).
bob wrote: The down-side of local pawn hashes shows up on a 12-core or 24-core box: 24 separate pawn hashes can burn some memory. And for any configuration, they have a somewhat negative effect on cache.
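For context, the XOR scheme being measured here is, roughly, the lockless transposition-table trick Hyatt describes for Crafty: store the key XORed with the data, and reject any entry whose key and data no longer match. A minimal sketch (field layout is illustrative, not any engine's actual format):

Code:

#include <cstdint>

struct TTEntry {
    std::uint64_t key_xor_data;   // position key XORed with the packed data
    std::uint64_t data;           // packed score, depth, best move, bound, ...
};

void tt_store(TTEntry& e, std::uint64_t key, std::uint64_t data) {
    // Two plain 64-bit stores, no lock. If another thread's store interleaves
    // with ours, the pair will fail the check below and simply be ignored.
    e.key_xor_data = key ^ data;
    e.data         = data;
}

bool tt_probe(const TTEntry& e, std::uint64_t key, std::uint64_t& data_out) {
    std::uint64_t d = e.data;
    if ((e.key_xor_data ^ d) != key)
        return false;             // torn or unrelated entry: treat as a miss
    data_out = d;
    return true;                  // the XORs here are the extra work being measured
}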
I have used VTune to validate the change, and here is the prefetch:
bob wrote: I am not quite sure how you can reliably "pre-fetch" unless you write in asm and do some careful hardware performance-monitor analysis.
Code:

#include <xmmintrin.h>  // _mm_prefetch(), _MM_HINT_T2

/// prefetch() preloads the given address into the L1/L2 cache. This is a
/// non-blocking function that does not stall the CPU waiting for data to be
/// loaded from memory, which can be quite slow.
void prefetch(char* addr) {

#if defined(__INTEL_COMPILER) || defined(__ICL)
  // This hack prevents the prefetches from being optimized away by the
  // Intel compiler. Both MSVC and gcc seem not to be affected.
  __asm__ ("");
#endif

  _mm_prefetch(addr, _MM_HINT_T2);
  _mm_prefetch(addr + 64, _MM_HINT_T2); // 64 bytes ahead
}
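To make the "shadowed" part concrete: the idea is to call prefetch() as soon as the new pawn key is known while the move is being made, long before evaluate() reads the entry. A rough, hypothetical call site (names assumed, not Stockfish's actual code):

Code:

#include <cstdint>

struct PawnEntry { std::uint64_t key; int score; };

extern PawnEntry pawn_table[1 << 15];   // shared pawn hash table (assumed)
void prefetch(char* addr);              // the function defined above

void on_pawn_key_updated(std::uint64_t new_pawn_key) {
    // Kick off the load now; by the time evaluate() probes this entry the
    // cache line should already be on its way, or present.
    PawnEntry* e = &pawn_table[new_pawn_key & ((1 << 15) - 1)];
    prefetch(reinterpret_cast<char*>(e));
}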
Robert, concerning SMP code, what is the technique you have implemented in Houdini?
Houdini wrote: In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).
bob wrote: The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
Sure, but 24 times 2 MB is still only 48 MB; nowadays that has become "quantité négligeable" (a negligible amount).
bob wrote: The down-side of local pawn hashes shows up on a 12-core or 24-core box: 24 separate pawn hashes can burn some memory. And for any configuration, they have a somewhat negative effect on cache.
The main issue in systems with a higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA-node-specific memory allocation. It is better for a thread to use a thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.
Robert
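A minimal sketch of the NUMA-local allocation idea described above, assuming Linux and libnuma (hypothetical helper, not Houdini's actual code); the key point is that the allocation, and the first touch of the pages, happens on the thread that will use the table:

Code:

#include <numa.h>      // libnuma, Linux only; link with -lnuma
#include <cstddef>
#include <cstdlib>
#include <cstring>

void* alloc_thread_local_table(std::size_t bytes) {
    if (numa_available() < 0)
        return std::calloc(1, bytes);    // no NUMA support: plain allocation

    // numa_alloc_local() places the pages on the NUMA node of the calling
    // thread, so call this from the search thread itself, after it has been
    // pinned to its core/node. Release later with numa_free(ptr, bytes).
    void* table = numa_alloc_local(bytes);
    if (table)
        std::memset(table, 0, bytes);    // touch the pages while they are still local
    return table;
}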
Incorrect. The _last_ ply is the one that repeatedly calls Evaluate(). In a typical position there are 38 legal moves, and maybe 1/4 of those are pawn moves. Depth-first has nothing to do with evaluate; evaluate is called along the frontier and beyond, and there the order is significant.
mcostalba wrote: I have used VTune to validate the change, and here is the prefetch:
bob wrote: I am not quite sure how you can reliably "pre-fetch" unless you write in asm and do some careful hardware performance-monitor analysis.
Code:

#include <xmmintrin.h>  // _mm_prefetch(), _MM_HINT_T2

/// prefetch() preloads the given address into the L1/L2 cache. This is a
/// non-blocking function that does not stall the CPU waiting for data to be
/// loaded from memory, which can be quite slow.
void prefetch(char* addr) {

#if defined(__INTEL_COMPILER) || defined(__ICL)
  // This hack prevents the prefetches from being optimized away by the
  // Intel compiler. Both MSVC and gcc seem not to be affected.
  __asm__ ("");
#endif

  _mm_prefetch(addr, _MM_HINT_T2);
  _mm_prefetch(addr + 64, _MM_HINT_T2); // 64 bytes ahead
}
BTW the order in which moves are generated means nothing, because the search proceeds in a depth-first fashion: at whatever depth you generate the piece move, you _first_ go down to qsearch, where there are only captures (ordered in a way totally independent of how the moves were generated), and then you undo moves until you come back to the starting depth before trying the next piece move, so the cache contents have already changed considerably by that point.
if you use only 2 MB, you are faced with this:
Houdini wrote: In Houdini I do find a drop in NPS between XORing hash entries and not XORing hash entries. The lockless hashing scheme on the main hash produces a small but clear performance hit (I can activate the option with a compilation switch).
bob wrote: The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.
Sure, but 24 times 2 MB is still only 48 MB; nowadays that has become "quantité négligeable" (a negligible amount).
bob wrote: The down-side of local pawn hashes shows up on a 12-core or 24-core box: 24 separate pawn hashes can burn some memory. And for any configuration, they have a somewhat negative effect on cache.
The main issue in systems with a higher number of cores is to separate the memory usage per NUMA node, so that each thread uses its own NUMA-node-specific memory allocation. It is better for a thread to use a thread-specific pawn hash in its own NUMA node than to use a shared pawn hash that will mostly be outside its NUMA node.
Robert