Lazy SMP and shared hash table

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

connor_mcmonigle
Posts: 543
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Lazy SMP and shared hash table

Post by connor_mcmonigle »

amanjpro wrote: Wed Aug 11, 2021 12:46 am
connor_mcmonigle wrote: Tue Aug 10, 2021 11:47 pm
amanjpro wrote: Tue Aug 10, 2021 11:26 pm Read a few older posts, and that is what I will do:

- XOR to make sure that the data read is not corrupt
- Test the hell out of SMP on my machine and merge it

Another PR (follows immediately):
- Test pseudo-legality for hash moves (as well as killers)
- Use killers before generating quiet moves (if possible)

Thanks, everyone, for your contributions on the subject; it really means a lot to me!
I've found the XOR "trick" to be useless. On x86, you have certain aligned-write atomicity guarantees which just make it pointless. Perhaps it holds some value on other platforms, though I'm still doubtful. Regardless, you'll still need resiliency to incorrect TT data due to the possibility of Zobrist collisions. Therefore, the best solution is to ensure your engine doesn't crash even when faced with a TT full of junk. I would recommend experimenting with the XOR trick only after you have such an implementation.
The XOR trick only improves the chance that the key (checksum) actually belongs to the data... which, agreed, won't protect the engine from crashing, but does protect it from reading bad data... so yeah, I agree with you: move pseudo-legality (or make-move protection, really) is a must, but the XOR is just a cheap bonus that I don't think is going to hurt and can only add value...

The atomicity guarantee only guarantees that each single word is written atomically, not that the key and the data together are written atomically.
Perhaps it can only add value, though my experience has been that it adds no value (which isn't a contradiction). I believe you're correct that the key and data could be written separately, yielding a key/data mismatch, assuming the standard 16-byte entry size. However, I didn't find this to be an issue in practice. Likely the probability of this occurring is very low, and the probability of a mismatch actually meaningfully influencing the search is also low.
Jakob Progsch
Posts: 40
Joined: Fri Apr 16, 2021 4:44 pm
Full name: Jakob Progsch

Re: Lazy SMP and shared hash table

Post by Jakob Progsch »

amanjpro wrote: Tue Aug 10, 2021 3:42 pm That is an interesting use of buckets! Unfortunately I am not using buckets in my TT, so changing to this representation is a big change, as I am already introducing a whole new world of changes with Lazy SMP... but will see what happens, as I am still exploring ;)
Well, I guess the buckets are just a detail. The point is that unless your hash entry cleanly fits into an "atomically sized" piece of memory, you have to at least reason about what happens when a data race causes you to observe a "corrupt" entry. I'm currently working under the assumption that any hardware I'd plausibly want to run multithreaded on also has 64-bit atomic loads/stores. So typical TT entry sizes will be split over at most two atomically readable locations.

If you can squeeze the entire TT entry into 64 bits, then you don't have to do anything. However, most engines seem to opt for entry sizes in the 80-96 bit range, so your entry will forcibly be split over two atomically accessible memory locations. Even worse, if you densely pack them, the alignment requirements will result in them being split into these locations differently, which is messy to reason about: for one entry you can atomically read half the signature and the move, while for the next one the entire signature can be read atomically but the move and score end up in a separate atomic load, and so on.

Which is why I went with the approach of packing all the metadata of three entries into one 64-bit metadata chunk and then having three equally structured 64-bit chunks for the rest. That way the "split" is always the same. Having those form a bucket is just a convenient thing to do; I could also treat them as three non-bucketed entries... but then I'd have to index modulo 3.

Another option would be to simply have two arrays, I guess, both of which get indexed "in parallel". That has the downside that the two chunks making up one entry don't share the same cache line.

But the important thing really is that this establishes a layout you can reason about in terms of which data is accessed together atomically. Then you can figure out how to distribute the data among these two locations to make your engine resilient to "corrupt" entries. For example, if one chunk contained the entire signature while the move and score were in the other, that would probably be dangerous, since in a race condition you would observe a "valid" signature paired with a wrong move and score. If, on the other hand, the signature is distributed over those two chunks, you'd probably detect the mismatch, since the chance that the two "racing" entries have signatures that magically combine into the signature of your current position is very small.
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Lazy SMP and shared hash table

Post by amanjpro »

Quick update:

- I implemented the XOR trick, which forced me to pack my data in 64 bits
- This resulted in a much more efficient TT storage
- Single-threaded got slightly stronger (due to the above point): 13.8 +/- 15.6 after 805 games (so far). I have the feeling the rating will stick, but you never know (the match is still ongoing)
- Two-threaded is at 57.2 +/- 13.6 after 1200 games so far (still ongoing), so this is within the expected margin too
- Next, I'll implement a move pseudo-legality check, check hash moves for legality, and play killers before generating quiet moves (finally finishing my partial multi-stage movegen)