bob wrote:bhlangonijr wrote:bob wrote:
What exactly do you mean by "atomic operations"??? Using a lock to protect an entry?
You don't need to use a lock to protect a history entry, since it is usually an integer type, so the CPU will use only a single instruction to update it. There is no concurrency issue, as it is done atomically. I suppose that is what he means...
He's wrong. You have to read a value from memory, update it, and write it back. That is _not_ done atomically unless you do something like an add with a lock prefix. This is subject to the classic "interleaved update" problem. Two cpus, A and B. A and B simultaneously read a value for X. Each adds 1 to that value, then each writes X back to memory. When you are done, X was incremented by 1, not 2.
If a single instruction such as add [history + 8 * edx], eax is executed, the CPU still has to do a read and a write, and those most definitely are _not_ done atomically unless you request it with a lock prefix. And a compiler doesn't do that normally.
While it's true that the whole operation (read-modify-write) is not atomic, I thought he just meant that the read is atomic and the write is atomic.
So at worst you'll miss an increment on the "interleaved update". As long as you don't do anything with the counters besides read them and increment them.
(I guess it might be different if some occasional pass is triggered, like "divide all of the history counters by two" -- in that case, interleaved updates during the rescaling might be a small but real problem.)
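To make that concrete, here's a minimal C++ sketch of the two flavours being argued about -- the table shape and the depth*depth bonus are just placeholders, not anyone's actual update rule. A plain int update compiles to a separate load, add and store (the interleaving bob describes), while a relaxed std::atomic fetch_add becomes a single lock-prefixed add on x86 and never drops an increment:

#include <atomic>

// Plain counters: "history[idx] += bonus" compiles to a separate load,
// add and store.  Two threads can interleave those steps and one bonus
// is lost (and in C++ it is formally a data race).
int plain_history[64 * 64];

// Atomic counters: a relaxed fetch_add is enough for a pure statistics
// table -- on x86 it becomes a single lock-prefixed add, so no update
// is lost and no mutex is needed.
std::atomic<int> atomic_history[64 * 64];

void update_history(int from, int to, int depth) {
    int idx   = from * 64 + to;
    int bonus = depth * depth;                   // placeholder bonus formula
    plain_history[idx] += bonus;                                     // racy
    atomic_history[idx].fetch_add(bonus, std::memory_order_relaxed); // lock add
}

// The aging pass is the riskier spot: halving entries while other threads
// keep incrementing them still interleaves, so engines typically let a
// single thread do this, or just accept the small error.
void age_history() {
    for (auto &h : atomic_history)
        h.store(h.load(std::memory_order_relaxed) / 2,
                std::memory_order_relaxed);      // load+store, not one atomic op
}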
Houdini wrote:bob wrote:if you use only 2mb, you are faced with this:
log.001: time=3:00 mat=0 n=441934327 fh=91% nps=2.5M
log.002: time=3:00 mat=0 n=490668472 fh=91% nps=2.7M
Both searches are for 180 seconds. Which one would you rather toss out in a game? The first uses hashp=2m, the second uses hashp=128m. I know which one _I_ want. I'll take the almost 10% faster version myself. And when you have 24 of those, you are impacting cache significantly. And with no down-side to a shared cache, this is an idea that makes no sense to me.
You report results obtained with Crafty, I report Houdini results.
Houdini uses a 2 MB pawn hash per thread and typically spends less than 2 % of the time in the pawn hash code.
While your results are certainly interesting, they're not relevant for Houdini.
bob wrote:If this is a good idea, why not local transposition tables as well?
There is a fundamental difference between the pawn hash and the main transposition table.
The pawn hash is just for speed-up, it doesn't alter the actual search process.
The main hash modifies the search process very significantly; not sharing it would be counterproductive.
bob wrote:You won't see significant NUMA issues until you go beyond two physical processor chips (not cores). And the pawn hash trick I explained will eliminate about half of the pawn hash table accesses anyway. The main trans/ref table is _much_ worse because it is continually getting written to, roughly once per node searched. The pawn hash doesn't get modified much if it is big enough to prevent excessive overwrites.
My NUMA development and tests are on a 16 core box with 4 NUMA nodes.
Like Marco explained, your pawn hash trick is just a manual caching method that can be dealt with very efficiently by the normal memory caching mechanism including prefetching.
To each his own...
Robert
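Bob's pawn hash trick isn't spelled out in this thread, but "manual caching method" suggests it's essentially a thread-local copy of the most recently probed entry, which skips the shared table whenever consecutive positions have the same pawn structure. A rough sketch under that assumption -- the entry layout, table size and names are invented for illustration:

#include <cstdint>

// Hypothetical pawn hash entry; a real one also holds passed-pawn masks,
// king-safety terms, etc.
struct PawnEntry {
    uint64_t key;
    int      score;
};

// Globally shared table: 2^17 entries * 16 bytes = 2 MB.
static PawnEntry shared_pawn_table[1 << 17];

// One-entry, thread-local cache of the last probe.  Consecutive positions
// in a search very often have the same pawn structure, so this answers a
// big chunk of probes without touching the shared table at all.
thread_local PawnEntry last_probe = {0, 0};

int probe_pawn_hash(uint64_t pawn_key) {
    if (last_probe.key == pawn_key)              // hit in the local copy
        return last_probe.score;

    PawnEntry &e = shared_pawn_table[pawn_key & ((1 << 17) - 1)];
    if (e.key != pawn_key) {                     // miss: (re)compute and store
        e.key   = pawn_key;
        e.score = 0;  // evaluate_pawns(...) would go here
    }
    last_probe = e;                              // refresh the local copy
    return e.score;
}

Robert's point, as I read it, is that the hardware caches plus prefetching already do roughly this for a small shared table, so the manual layer isn't worth much to him.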
If your pawn hash were globally shared and only 2 MB, the entire thing could fit in an L3 cache, which is often 6-8 MB or larger. But if you have twelve of those, that's 24 MB and they won't fit.
Regardless, bob is right that you're increasing the cache pressure by not sharing them. Writes are rare, so they shouldn't cause much contention. You get pawn hash hits >95% of the time, right? With a shared table, some of those hits from different threads land on the same cache lines, which is cheaper than each read evicting something and going out to a deeper level of cache or to main memory. With separate tables per thread, probes for the same pawn structure will definitely land in different cache lines and increase pressure on the smaller caches.
In fact... depending on how you allocated the 2 MB buffers, it's possible (and even rather likely) that when looking up the same pawn hash key in two different threads, there is a false conflict due to cache set aliasing, and one thread's copy of the entry for that key gets evicted from L1 or L2 to make room for the other thread's independent entry for the same key. I don't know how commonly this occurs, but it's surely an ironic outcome if the intention behind not sharing the tables was to reduce cache contention.
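For anyone wondering how two private copies of "the same" entry end up fighting over cache: the set index of a line comes from the low address bits, and a given pawn key sits at the same offset in every thread's 2 MB table, so if the table bases differ by a multiple of the set-index period the copies map to the same set. A toy calculation with made-up addresses and assumed (typical) cache parameters:

#include <cstdint>
#include <cstdio>

// Assumed geometry: 64-byte lines, 32 KB 8-way L1 = 64 sets,
// so the set index repeats every 64 * 64 = 4096 bytes.
constexpr uint64_t LINE    = 64;
constexpr uint64_t L1_SETS = 64;

uint64_t l1_set(uint64_t addr) { return (addr / LINE) % L1_SETS; }

int main() {
    // Two hypothetical per-thread 2 MB pawn tables.  Large allocations
    // usually come back page-aligned, so the bases differ by a multiple
    // of 4096 bytes.
    uint64_t table_a = 0x7f0000000000;           // made-up base addresses
    uint64_t table_b = 0x7f0000200000;           // 2 MB further along

    uint64_t offset = 0x1a40;                    // same key -> same offset in both
    std::printf("thread A: set %llu, thread B: set %llu\n",
                (unsigned long long)l1_set(table_a + offset),
                (unsigned long long)l1_set(table_b + offset));
    // Both land in the same set, so the two private copies of "the same"
    // entry evict each other; a single shared table would need only one
    // cache line for that key.
    return 0;
}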
