Crafty Transpostion Table Question

bob · Post by **bob** » Fri Jun 04, 2010 8:51 pm

mcostalba wrote:
bob wrote:
FrancoisK wrote:In BugChess2 I validate the move after each TT probe on a single thread (no SMP yet) to protect the program against a very rare crash due to hash key collision.
The validating code is really straightforward and I am pretty sure it has a completely negligible impact on performance. I can check if you are interested.

François
My issue is the "best move". I use Make/Unmake, which means an illegal move can wreck the chessboard (for example, the move O-O after the king has moved elsewhere will results in a second king on the board. Rather than validating the move, I prevent the out-of-order-stores caused by multiple threads from producing the problem in the first place.
I guess that is clear for you that you have not answered the question.

Here the issue is hash key aliasing, nothing to do with SMP, it has to do with the fact that relation between a position and its hash key is not univoque.

Hope it is clear, in case is not I state more clearly: you cannot avoid checking for illegality even in single thread case as long as hash keys are not univocal to positions (and are not).

Of course you can say that is a very very very rare event, but this doesn't change the fact that you cannot skip legality check if you want to be sure that a crash will _never_ occur.

Collisions are infrequent. _very_ infrequent. Those that might cause a crash are so infrequent that I have not had one in over 50,000,000 games played on our cluster so far. However, I did remove the lockless hash code and had _immediate_ problems in SMP code. I could barely play a game without a crash with the lockless hash code removed. Also, that is not the _only_ hash table where this is an issue. Pawn hashing has the same issue if there is data that can be corrupted and that may cause a crash or hang.

This is a simple way to avoid a _major_ number of errors. In the case of pawn hash, I do not want to look at all the information I store and validate it as I may just as well re-compute the score from scratch.

In the case of Crafty, I have always checked the validity of the hash best move and I even log an error if the move fails the legality check, but the pawn hash is a killer for me if it gets corrupted. Fortunately with 64 bits, the pawn collisions are way infrequent, if they happen at all, since you don't really need 64 bits to represent all possible pawn positions.

jwes · Post by **jwes** » Fri Jun 04, 2010 10:09 pm

bob wrote: To properly validate a move in Crafty, it was > 1% overhead, which is not something I would ignore. And when it provides a little more safety in hashing when doing SMP searches, it is even better. The "lockless hash" idea introduces one extra XOR on a store, and on a probe. That's cheaper than any sort of validation procedure I could imagine, by orders of magnitude

You could partially validate the move. I think that checking that the from square has the right type of piece and that the to square has the captured piece or is empty for a non-capture would be enough to ensure the program did not crash and should not be "orders of magnitude" more expensive.

bob · Post by **bob** » Fri Jun 04, 2010 10:32 pm

jwes wrote:
bob wrote: To properly validate a move in Crafty, it was > 1% overhead, which is not something I would ignore. And when it provides a little more safety in hashing when doing SMP searches, it is even better. The "lockless hash" idea introduces one extra XOR on a store, and on a probe. That's cheaper than any sort of validation procedure I could imagine, by orders of magnitude
You could partially validate the move. I think that checking that the from square has the right type of piece and that the to square has the captured piece or is empty for a non-capture would be enough to ensure the program did not crash and should not be "orders of magnitude" more expensive.

The XOR adds 1 instruction.

Just checking the from and 2 squares is not enough. Already debugged a case of that a few months back where if you annotate a game, since you continually flip sides, it is possible to end up with a position where a pawn can move backward (this particular case was a killer move which also has to be validates since I search them before generating moves). The current ValidMove() function uses at least 100x that many cycles due to the procedure call and the various tests for each piece type, etc. The XOR is the cheapest thing you can do on the CPU.

Onno Garms · Post by **Onno Garms** » Sat Jun 05, 2010 12:30 am

bob wrote:(1) validate the move, assuming that making an illegal move will corrupt your board or other data structures;

(2) use an atomic lock to make sure that the move part of an entry goes with the corresponding signature part;

(3) use the lockless approach which is very low cost.

Any of the above will work just fine, but (1) introduces overhead for the validation procedure, while 2 introduces tons of memory/cache issues as well as stalling processes when there really is no conflict. I just took the solution with the least impact on NPS since all work equally well.

Why does (2) stall processes when there is no conflict? Why does it introduce cache issues? Are you talking of a single lock for the whole trans map?

Sure, (3) is the fastest. But it introduces errors on the hashfull-counter (which is required by the uci protocol). Also, locking is easier to understand.

Btw., a purist might argue that you need to check the trans move anyway because of the possibility of two different positions with same hash. If you store all 64 bits of the hash key in the entry this might be negligible, if you store just 32 bit it seems important to me:

Sooner or later the hash table is full. So when I store an entry, I find the entry for some position in its place. There is a 1:2^32 chance that this position has the same 32-bit hash. Assuming 1 Mp/s, this problem can be expected every 1000 seconds (16:40 min). This must be corrected by the propability of hash hits (the entry really belongs to the same position), but the magnitude is not affected by that.

I have never tested, but because of above reasoning I decided to check the trans move.

wgarvin · Post by **wgarvin** » Sat Jun 05, 2010 1:13 am

mcostalba wrote: Hope it is clear, in case is not I state more clearly: you cannot avoid checking for illegality even in single thread case as long as hash keys are not univocal to positions (and are not).

Of course you can say that is a very very very rare event, but this doesn't change the fact that you cannot skip legality check if you want to be sure that a crash will _never_ occur.

I disagree with this. The chance of an actual collision (where you have two positions with identical Zobrist key but they are not actually the same) is no bigger than 1 / 2^60, and might even be a lot less due to the dependence between positions being searched. You could probably run your engine for a thousand years without this type of collision ever occurring.

However, the problem that Dr. Hyatt is defending against is a different problem altogether; it is memory re-ordering effects due to multi-threaded or SMP memory accesses. He stores a hash entry using two atomic 64-bit writes, but several things (compiler optims, OS thread scheduling, store gathering, cache effects) can cause these to occur out of order or with arbitrary delays between them. So you can easily read a corrupted hash table entry after its Zobrist key is written but BEFORE the data is written, or vice versa. The engine will be confused into thinking it has a valid move when it has garbage from another position. The probability of this occurring is not related at all to zobrist key sizes, and Dr. Hyatt found that it happens quite frequently unless he guards against it somehow. The XOR trick lets him atomically read two words which might or might not have come from the same hash entry, except if they really did come from two different entries, then the Zobrist word will be garbage after the XOR so he will treat it the same as any other hash miss.

So this XOR trick is extremely cheap, and completely solves the problem The only possible risk I can think of is if XORing the wrong data causes a different Zobrist key to become the one you expected it to be, but this is extremely improbable and you could guard against it by storing your hash move in the same bits of one word, as the part of the zobrist key from which you form your hash index. In other words, if the hashMove did NOT match the one in the same hash entry that the zobrist key went with, it would change enough bits of the zobrist key to guarantee 100% that it was treated as a hash miss. I don't know if Crafty does it that way or not.

ogarms wrote: Why does (2) stall processes when there is no conflict? Why does it introduce cache issues? Are you talking of a single lock for the whole trans map?

Touching cache lines that other processes are using is expensive. Hence locking is expensive even when there is no contention--even to check if some other thread owns the lock, requires an RFO on the cacheline (Read For Ownership). The beauty of the XOR trick is that it does not touch any memory other than the cache line with the actual hash entry in it, which you are obviously touching anyway !

bob · Post by **bob** » Sat Jun 05, 2010 4:34 am

Onno Garms wrote:
bob wrote:(1) validate the move, assuming that making an illegal move will corrupt your board or other data structures;

(2) use an atomic lock to make sure that the move part of an entry goes with the corresponding signature part;

(3) use the lockless approach which is very low cost.

Any of the above will work just fine, but (1) introduces overhead for the validation procedure, while 2 introduces tons of memory/cache issues as well as stalling processes when there really is no conflict. I just took the solution with the least impact on NPS since all work equally well.
Why does (2) stall processes when there is no conflict? Why does it introduce cache issues? Are you talking of a single lock for the whole trans map?

You can lock however you want. You certainly do not want a lock for each hash entry. The lock access is _terribly_ slow. I have (in the past) tried a single lock for the entire table, which begins to break down beyond 4 processors as the cache traffic becomes horrendous. I have used "block" locks so that there was a lock for each N hash entries (where N was reasonably large to reduce the number of locks to something managable). But again, the lock mechanism is a hardware nightmare and floods the cache controllers with cache-to-cache traffic out the wazoo.

You will have some conflicts for the same cache line coming from main memory when two (or more) threads access the same 64 bytes of memory, but for reads, you need no protection using the XOR trick, you never modify the hash entry at the lookup point (or you might rarely modify it to update the "age" field perhaps). But if you have a lock, you modify it _every_ time you access it, and there goes the cache coherency traffic again.

For storing, you have a cache coherency issue on the block that contains the modified entry, but there is little conflict on a single entry with a large entry. Some, but not a large amount. But a lock is way more expensive, and by definition has way more conflicts, hence way more coherency traffic.

Sure, (3) is the fastest. But it introduces errors on the hashfull-counter (which is required by the uci protocol). Also, locking is easier to understand.

(a) who the heck cares. Lie to the UCI protocol if you want. It could care less about that number. All it does is display the value. (b) why would it be wrong? I just run thru the hash checking the "age" field and count the number of entries where entry->age == current_age. Notice that I XOR to modify the signature that is stored, the data is never modified.

Btw., a purist might argue that you need to check the trans move anyway because of the possibility of two different positions with same hash. If you store all 64 bits of the hash key in the entry this might be negligible, if you store just 32 bit it seems important to me:

Sooner or later the hash table is full. So when I store an entry, I find the entry for some position in its place. There is a 1:2^32 chance that this position has the same 32-bit hash. Assuming 1 Mp/s, this problem can be expected every 1000 seconds (16:40 min). This must be corrected by the propability of hash hits (the entry really belongs to the same position), but the magnitude is not affected by that.

I have never tested, but because of above reasoning I decided to check the trans move.

I check the trans/ref best move as well. And when (if) it fails, I write out a log entry. On our cluster testing, after each test is done, I extract any ERROR messages in the logs before tossing them. The only ones I have seen in the last 30M games or so were related to an illegal move to ponder. Caused by a bad call to HashStore() after an EGTB probe would hit at ply=2, and I unintentionally tried to store a wrong best move that was invalid. Have not seen a single "ERROR: bad move from hash table" message since I cleared the ERROR messages about 30M games ago. Have not seen any ERROR messages for almost 4M games now since the EGTB issue was fixed.

In Crafty I store the entire 64 bit signature. A few years ago, several of us on r.g.c.c ran exhaustive tests with 64 bit hashes. I burned weeks of Cray time, where I stored the normal hash and a real FEN board description as well, and always matched the FEN if the sigs matched. I searched many trillions of nodes and never had one collision when playing real games.

For me. pawn hashing is the real issue because I store a bunch of data, and some of it can't be wrong or it will cause some infinite loops in the passed pawn scoring if a mask says there is a passed pawn on a file, but there is no pawn there at all. I could either validate the data as I use it, which would slow it down, or I could validate the data when I look it up, which would involve re-computing the data losing the hash benefit. The lockless hash trick is used on the pawn hash table in exactly the same way, to solve the same problem. Only difference is, it is critical for me on pawn hashing, not that important on regular hash.

mcostalba · Post by **mcostalba** » Sat Jun 05, 2010 1:11 pm

bob wrote: For me. pawn hashing is the real issue

Have you tried with one pawn hash table per thread ?

Onno Garms · Post by **Onno Garms** » Sat Jun 05, 2010 1:19 pm

bob wrote: You can lock however you want. You certainly do not want a lock for each hash entry. The lock access is _terribly_ slow.

Well, I have a 1-byte lock for every cluster of 4 entries (64 byte, one cacheline) in the first entry. Lock of entries 1 to 3 is unused. Obviously it does cost some performance but it does not nuke the overall performance.

I have a macro to switch off everything that is only needed for multithreading, including the hash lock. The overall performance loss is below 2%.

I have (in the past) tried a single lock for the entire table, which begins to break down beyond 4 processors as the cache traffic becomes horrendous.

ACK

(a) who the heck cares. Lie to the UCI protocol if you want. It could care less about that number. All it does is display the value.

OK, that is an option.

(b) why would it be wrong? I just run thru the hash checking the "age" field and count the number of entries where entry->age == current_age. Notice that I XOR to modify the signature that is stored, the data is never modified.

If you count the entries every time you want to send hashful, it won't be wrong. But this costs performance. I count while writing (one counter per thread). If two threads write the same entry, both increment their counter, though only one entry makes it into the hash table.

In Crafty I store the entire 64 bit signature.

I think that is the difference, the reason why you almost never observe two positions with identical hash.

bob · Post by **bob** » Sat Jun 05, 2010 4:47 pm

mcostalba wrote:
bob wrote: For me. pawn hashing is the real issue
Have you tried with one pawn hash table per thread ?

Why would I want to do that? Pawn scoring is a global concept, not thread-local. That is a similar case to those that say "don't store mate scores in the hash" and such. It is easier to do it right, and is certainly way more cache-friendly on todays chips that have shared cache between multiple cores.

bob · Post by **bob** » Sat Jun 05, 2010 4:52 pm

Onno Garms wrote:
bob wrote: You can lock however you want. You certainly do not want a lock for each hash entry. The lock access is _terribly_ slow.
Well, I have a 1-byte lock for every cluster of 4 entries (64 byte, one cacheline) in the first entry. Lock of entries 1 to 3 is unused. Obviously it does cost some performance but it does not nuke the overall performance.

I have a macro to switch off everything that is only needed for multithreading, including the hash lock. The overall performance loss is below 2%.

I have (in the past) tried a single lock for the entire table, which begins to break down beyond 4 processors as the cache traffic becomes horrendous.
ACK

(a) who the heck cares. Lie to the UCI protocol if you want. It could care less about that number. All it does is display the value.
OK, that is an option.

(b) why would it be wrong? I just run thru the hash checking the "age" field and count the number of entries where entry->age == current_age. Notice that I XOR to modify the signature that is stored, the data is never modified.
If you count the entries every time you want to send hashful, it won't be wrong. But this costs performance. I count while writing (one counter per thread). If two threads write the same entry, both increment their counter, though only one entry makes it into the hash table.

That is semantics. It is still wrong. And since absolute accuracy is not important, it doesn't matter IMHO. However, if you count while writing, that must mean you are doing a test to detect overwriting an old entry (which should inc the counter) while not incrementing if you are overwriting a current entry? I'd move all the overhead out of hash store completely and only send meaningful hashfull messages between iterations, and just sent the previous message if the GUI wants an update.

In Crafty I store the entire 64 bit signature.
I think that is the difference, the reason why you almost never observe two positions with identical hash.

If you use a 16M table, you use the rightmost 24 bits of the signature as the address. So storing a 32 bit signature still gives you 56 bits. You are not far from 64. And since you do know that the rightmost N bits have to match, this is really a 64 bit signature match. If you run on a big memory machine and use 4 gigs of entries, you use the entire 64 bits.

Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question

Re: Crafty Transpostion Table Question