syzygy wrote: ↑Sun Jun 08, 2025 3:22 pm
Your engines don't have an array of Zobrist keys in RAM?
They have it in cache; I consider it part of the 'core data' that I try to always keep in L1. Since i386 or amd64 CPUs don't use write-through caching, the keys are written directly into L1 after having been calculated, and from that time on just stay there.
For some variants the Zobrist table would occupy an unacceptably large fraction of the cache (e.g. if there are 60 piece types on a 256-square board). For those the keys are calculated on the fly, by multiplying a 32-bit piece key with a 32-bit square key. That then requires only (60+256)×4 bytes instead of 60×256×8 bytes.
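A minimal sketch of that scheme in C (the table sizes and the PRNG are illustrative assumptions, not actual engine code): the 32-bit × 32-bit multiply yields a 64-bit product, which serves as the key.

```c
#include <stdint.h>

/* Hypothetical sizes for a large variant: 60 piece types, 256 squares. */
#define NPIECE  60
#define NSQUARE 256

static uint32_t pieceKey[NPIECE];   /* 60 x 4 bytes  */
static uint32_t squareKey[NSQUARE]; /* 256 x 4 bytes */

/* Simple xorshift PRNG to fill the key tables (illustrative only). */
static uint32_t rnd32(void) {
    static uint32_t s = 0x9e3779b9u;
    s ^= s << 13; s ^= s >> 17; s ^= s << 5;
    return s;
}

static void init_keys(void) {
    for (int i = 0; i < NPIECE; i++)  pieceKey[i]  = rnd32() | 1; /* keep nonzero */
    for (int i = 0; i < NSQUARE; i++) squareKey[i] = rnd32() | 1;
}

/* Compute a 64-bit key on the fly: the 32x32-bit product is 64 bits
   wide, so no 60*256*8-byte precomputed table is needed. */
static uint64_t zobrist(int piece, int square) {
    return (uint64_t)pieceKey[piece] * (uint64_t)squareKey[square];
}
```

The tables fit in a handful of cache lines, at the cost of one multiply per key lookup.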
Another method for saving some space in less extreme cases is to use random keys that are only 8 bits instead of 64, and access those as *(uint64_t *)(Zobrist + index).
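The same trick can be written portably with memcpy instead of the raw cast (the table size of 768 is an arbitrary assumption here; note that neighboring indices share 7 of their 8 key bytes, so the keys are correlated):

```c
#include <stdint.h>
#include <string.h>

/* One random byte per entry, plus 7 spare bytes so an 8-byte read
   starting at the last valid index stays inside the table. */
static uint8_t Zobrist8[768 + 7];

/* Read 8 consecutive bytes as one 64-bit key. memcpy compiles to a
   single unaligned load on x86 and avoids the strict-aliasing and
   alignment problems that *(uint64_t *)(Zobrist8 + index) can cause. */
static uint64_t key8(int index) {
    uint64_t k;
    memcpy(&k, Zobrist8 + index, sizeof k);
    return k;
}
```

This shrinks the table by a factor of 8 at the cost of weaker key independence.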
How does that relate to "(And you would calculate the keys only on loading, not for every new game.)"?
If you calculate them "only on loading" and do not repeat this when starting a new game, you clearly mean "at engine startup", and you will have no way to get around storing them in an array in RAM. Clearly this takes more time than having the array already in RAM as part of the executable.
Starting a new game doesn't flush the cache, right? The cache content would only be written back to RAM when it gets replaced by some other data. But if the Zobrist table is frequently used it would never be selected for replacement. That honor would go to cache lines that hold hash entries and such.
hgm wrote: ↑Sun Jun 08, 2025 10:08 pm
Writing the first key to a cache line forces the CPU to load the whole cache line from RAM into L1. You might as well use that load to bring precomputed data from RAM into L1 and skip the calculation. If you really want to save cycles, the Zobrist keys can be prefetched into L1 while the CPU is doing some other kind of engine initialization.
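Such a warm-up could look like the sketch below (the table dimensions are a hypothetical chess layout; __builtin_prefetch is a GCC/Clang intrinsic, and the prefetch is only a hint to the hardware):

```c
#include <stddef.h>
#include <stdint.h>

static uint64_t Zobrist[12 * 64]; /* hypothetical: 12 piece types x 64 squares */

/* Touch one address per 64-byte cache line so the whole table is
   pulled toward L1 while other initialization work goes on.
   Returns the number of lines prefetched, which is handy to check. */
static size_t warm_zobrist(void) {
    size_t lines = 0;
    for (size_t i = 0; i < sizeof Zobrist; i += 64, lines++)
        __builtin_prefetch((const char *)Zobrist + i, /*rw=*/0, /*locality=*/3);
    return lines;
}
```

Locality hint 3 asks the CPU to keep the data in all cache levels, which matches the "always keep in L1" goal.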
I suppose that the CPU write logic is such that when you write an entire cache line, it will not be so stupid as to first load data from RAM just to overwrite it all. Otherwise the CPU developers would have missed some really low hanging fruit. Many programs sequentially write a stretch of memory.
The store unit maintains a queue where data to be stored resides before it reaches any cache. This queue is associative memory, because a read from an address that is scheduled for storing must be served from the queue. So a write to an address in the same cache line as data already scheduled for storing can easily be detected, and could be merged into the same queue element rather than allocating a new one. This is useful anyway, since the cache unit emptying the queue then needs only one cache access to process both. If an element grows into a complete 64-byte item, the cache unit could see that, and suppress the load from a higher cache level.
hgm wrote: ↑Mon Jun 09, 2025 10:25 am
It's not. I guess CPU developers are not as intelligent as you.
To write to a location in RAM, a CPU core first needs to issue a "request for ownership" on the location's cache line. An RFO fetches the cache line from RAM to cache.
It seems Intel has a patent on doing an RFO_NODATA, which does not fetch the cache line's content from RAM. But it seems this was intended for implementing special instructions which perform "non-temporal" writes, such as MOVNTDQA. https://www.felixcloutier.com/x86/movntdqa
I think that the optimization you propose does not play well with the strongly ordered memory model of x86/x86-64, unless the CPU (or the compiler) can predict that the full cache line will be written, so that it knows in advance that an RFO_NODATA suffices, but even then there might be complications.
As an alternative, one can use crc32c keys. For example, in my Othello engine Edax I no longer use Zobrist keys, but crc32c keys to index the hash table. Since modern CPUs have fast instructions to compute crc32c, this speeds my engine up a little (by a few percent). On the other hand, I get more hash collisions than with the previous Zobrist implementation (about a dozen per billion accesses versus less than one). Since in Othello it does not cost much to store the whole board, those collisions do not impact the search much, so it was an overall improvement.
I am not sure this is worth it in chess. In Othello, moves are more complex than in chess, with several squares modified each turn. This makes a Zobrist key harder to update incrementally after each move than in chess. In practice, it is easier to compute the hash key from the full board after each move.
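A sketch of such a crc32c board hash, assuming the board is represented as two bitboards (the function name is mine, not Edax's; the intrinsic requires SSE4.2, hence the target attribute or -msse4.2):

```c
#include <stdint.h>
#include <nmmintrin.h> /* SSE4.2 intrinsics: _mm_crc32_u64 */

/* Hash the full board (player and opponent bitboards) with the
   hardware crc32c instruction, instead of updating a Zobrist key
   incrementally. Two dependent crc32 instructions, a few cycles. */
__attribute__((target("sse4.2")))
static uint32_t board_hash(uint64_t player, uint64_t opponent) {
    uint64_t h = _mm_crc32_u64(0, player); /* crc32c over first 8 bytes */
    h = _mm_crc32_u64(h, opponent);        /* chain in the second word  */
    return (uint32_t)h;                    /* crc state is 32 bits wide */
}
```

Because the whole board is rehashed each time, no per-move bookkeeping is needed, which is exactly the property that makes it attractive for Othello's complex moves.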
syzygy wrote: ↑Mon Jun 09, 2025 7:23 pm