multi-threading performance

jdart · Post by **jdart** » Mon Aug 11, 2008 6:11 pm

I have recently been working on other things but I have started now to look a little closer at Arasan's multi-threading performance. I recently got a quad machine (Q6700) and am seeing some pretty bad scaling numbers. Latest test I ran it was doing about 1.7M nps on 2 processors but with four it is barely breaking 2.0M - and I've not seen anything close to 3.0M. So something is way wrong.

Now, the obvious culprit is locking. There are several lock points in the program but the only one I can think of that's likely to affect things is the main hash table, which is shared. All subsidiary tables like pawn hash, killers etc. are per-thread and so don't lock. There are also locks on data structures used for thread management: the thread pool, the child thread list in each search structure, etc. But those should be hit mainly when splitting or joining. And when a node is being searched by multiple threads the node structure is locked when updates occur to the PV or best move. My first take is all these locks should be hit with low frequency relative to the hash table but it's possible I'm wrong about that or there's some bug.

I suppose cache contention could also be a factor since the hash table is many times the cache size and also there are multiple other tables: a scoring cache used in the q-search, king/pawn and pawn tables, attack tables, etc.

Any hints on what to look for or how best to find contention points? I understand Intel has some threading tools in their compiler suite but I don't have experience with those.

bob · Post by **bob** » Mon Aug 11, 2008 7:02 pm

jdart wrote:I have recently been working on other things but I have started now to look a little closer at Arasan's multi-threading performance. I recently got a quad machine (Q6700) and am seeing some pretty bad scaling numbers. Latest test I ran it was doing about 1.7M nps on 2 processors but with four it is barely breaking 2.0M - and I've not seen anything close to 3.0M. So something is way wrong.

Now, the obvious culprit is locking. There are several lock points in the program but the only one I can think of that's likely to affect things is the main hash table, which is shared. All subsidiary tables like pawn hash, killers etc. are per-thread and so don't lock. There are also locks on data structures used for thread management: the thread pool, the child thread list in each search structure, etc. But those should be hit mainly when splitting or joining. And when a node is being searched by multiple threads the node structure is locked when updates occur to the PV or best move. My first take is all these locks should be hit with low frequency relative to the hash table but it's possible I'm wrong about that or there's some bug.

I suppose cache contention could also be a factor since the hash table is many times the cache size and also there are multiple other tables: a scoring cache used in the q-search, king/pawn and pawn tables, attack tables, etc.

Any hints on what to look for or how best to find contention points? I understand Intel has some threading tools in their compiler suite but I don't have experience with those.

Here are a couple of things to look for...

1. arrays of the form array_something[NTHREADS]. Anytime you have something indexed by thread number, you have a potential problem. If thread 1 modifies thread[1], and thread 2 modifies thread[2] you generate a ton of MOESI cache-coherency protocol traffic. I did this with an array nodes[thread] where each thread incremented its own node counter once per ply. Now what I ended up with was 8 ints in consecutive memory addresses, that all fit into the same cache line. And that line was bouncing around among the various caches like mad. The solution is to make sure that such "entries" are separated by at least 64 bytes (128 bytes on PIV) so that the counters are in different cache lines. But anywhere that you index an array by a thread ID and it is done often, you can see this issue.

2. excessive locking. At some point you are going to want to use a global lock, but not for accessing data that only a couple of threads might be sharing, such as a "split block" in crafty terminology. I buried the necessary locks to control access to the split block, _in_ the split block, to reduce lock contention.

3. No locks on the hash table. Intel lock is _way_ too slow and bypasses cache to boot for the atomic read/write.

In general, global data causes no issues if it is not modified inside the search. Global data that is modified is a bit problematic since that kind of data is not indexed by a thread ID which means a central lock is needed.

Also, in general, if a thread is going to modify something, it needs to be at least 64 bytes away from any data another thread might either read or write. Because when thread 1 writes to a cache line, it is invalidated in all other caches, and has to be sent from the current cache to anyone that requests it later by accessing it. So any store basically wipes out the surrounding 64 bytes on all other caches. Done carefully, there is no problem. Done carelessly, and you get tons of cache coherency traffic, and little else.

bob · Post by **bob** » Mon Aug 11, 2008 8:01 pm

I should also add that AMD helped me find a couple of contention issues, they have some sort of non-public tool that can find cache hot-spots and such, they found the nodes[threadid] issue I had, I had just not thought about how bad that could be. And changing from node[8] to node[128] and then changing the subscript from threadid to threadid*16 did the trick (we didn't actually do it that way, we turned it into an array of structs with 56 bytes of filler between each 8 byte node counter.)

jdart · Post by **jdart** » Mon Aug 11, 2008 11:17 pm

How can you not lock the hash table if it's shared?

--Jon

bob · Post by **bob** » Tue Aug 12, 2008 3:01 am

jdart wrote:How can you not lock the hash table if it's shared?

--Jon

OK, first let's agree on what can go wrong with a hash entry. The most likely candidate is where multiple CPUs store at the same table address at the same time, and the stores get interleaved. In Crafty, a hash entry is 16 bytes. On the PC, an 8-byte write is an atomic operation. That is, if you write the first byte of an address divisible by 8, you will write the next 7 before anyone else can slip in and write to them. But an entry is 16 bytes, which means two atomic write operations. And that is a problem. So the issue is that you could get two 8-byte pieces that do not "go together.

There are three possibilities:

(1) if the above can not "break" your program, such as might happen if the hash signature does not go with the best move so that you get an illegal best move from this corrupted entry, then you can ignore it. Remember the paper I did with Cozzie that showed that occasional hash errors do not hurt a chess engine at all, even one error every 1,000 probes has no significant effect. If you can deal with this, then you need no locking or anything.

(2) there is another validity check, the "lockless" algorithm used in older crafty versions. When you store the hash signature, xor it with the other 8-byte value you store. Then when you do a lookup, before matching the table entry do the same xor again. If the two parts of the entry don't go together, the xor result will not match the current hash signature and you just won't use these rare corrupted positions.

(3) you can do a lock, but if you have one lock for the table, you are going to have a ton of lock conflicts...

jdart · Post by **jdart** » Tue Aug 12, 2008 3:40 am

The most likely candidate is where multiple CPUs store at the same table address at the same time, and the stores get interleaved.

The problem is not just multiple writers, but the writer/reader problem - one thread can be reading while another thread is writing. I guess you are saying if the data is on an 8-byte boundary then effectively you have a lock anyway (for Intel CPU I assume).

Arasan has code for validating killer moves so it could do that for hash moves too - I can try that, so it won't crash from a bad move. But getting a bad score would also be a problem.

--Jon

bob · Post by **bob** » Tue Aug 12, 2008 4:47 am

jdart wrote:
The most likely candidate is where multiple CPUs store at the same table address at the same time, and the stores get interleaved.
The problem is not just multiple writers, but the writer/reader problem - one thread can be reading while another thread is writing. I guess you are saying if the data is on an 8-byte boundary then effectively you have a lock anyway (for Intel CPU I assume).

Arasan has code for validating killer moves so it could do that for hash moves too - I can try that, so it won't crash from a bad move. But getting a bad score would also be a problem.

--Jon

A bad score is not much of a problem. One bad score every 1,000 nodes doesn't have any significant effect on the search (the hash collisions paper in JICCA or on my web site gives some really good test results on how significant this is). And you are not going to get that many such read/write problems or write/write problems. With a decent sized table, what is the probability that any two (or more) processors is going to write to the same position at the same time? Roughly 1/N^2 (1/N x 1/N) for the two updater case. With typical hash sizes, say a minimum of 1M entries, this is close enough to zero that you might see it happen 2-3-4 times per game. I have actually measured that event by validating the info from the table to see if it matches what should be there and count the number of times it doesn't. The number is tiny over long games, and is zero in many. You really can ignore it.

jdart · Post by **jdart** » Tue Aug 12, 2008 5:08 am

I'm not sure you can just do a probabalistic calculation here, since
if you split a node and you have two searchers going off the same position, they are not really independent. My first take was that you'd get a collision pretty readily given that one search can be hitting positions that are reached by transposition in the other search. But for this to be significant you'd have to have them doing that at the same time, and that could well be unlikely.

Thanks for the info and pointers.

bob · Post by **bob** » Tue Aug 12, 2008 8:50 am

jdart wrote:I'm not sure you can just do a probabalistic calculation here, since
if you split a node and you have two searchers going off the same position, they are not really independent. My first take was that you'd get a collision pretty readily given that one search can be hitting positions that are reached by transposition in the other search. But for this to be significant you'd have to have them doing that at the same time, and that could well be unlikely.

Thanks for the info and pointers.

There are several assumptions that might not hold water. My simple probability was based on uniform distribution of probe addresses. Nothing says that will happen, because the Zobrist numbers might not have big hamming distances, or the hashing result always has some bits in common, or never has some specific bits in common, etc. Lots of variability. But the key for me was to simply validity check the entry before using it, and if it is found to be bad, just output an error and check after a hundred games to see how many of those you get...

A dozen per game is the same as zero. And I have never seen a dozen per game even...

multi-threading performance

multi-threading performance

Re: multi-threading performance

Re: multi-threading performance

Re: multi-threading performance

Re: multi-threading performance

Re: multi-threading performance

Re: multi-threading performance

Re: multi-threading performance

Re: multi-threading performance