Gerd Isenberg wrote:
What matters is that hashentry + 4*N lies on a 64-byte aligned address ((address & 0x3f) == 0), so that four consecutive slots not only total 64 bytes, but also fit completely inside one 64-byte cache line.
Yes, I can guarantee this. Actually, the hash entries are grouped into a cluster that is 64-byte aligned, and I access the whole cluster, not the individual entries.
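To make the alignment guarantee concrete, here is a minimal sketch of a cluster of four 16-byte entries pinned to a 64-byte boundary. The names (TTEntry, TTCluster, TranspositionTable) and the field layout are assumptions for illustration only, not the actual engine code; the point is just the static_assert checks and the manual alignment of the table base.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

constexpr std::size_t CacheLineSize = 64;
constexpr std::size_t ClusterSize   = 4;

// Hypothetical 16-byte entry; the field layout is an assumption.
struct TTEntry {
    std::uint64_t key;
    std::uint32_t data;
    std::uint16_t move;
    std::int16_t  value;
};
static_assert(sizeof(TTEntry) == 16, "entry must be exactly 16 bytes");

// Four entries grouped so that one cluster is exactly one cache line.
struct alignas(CacheLineSize) TTCluster {
    TTEntry entry[ClusterSize];
};
// Compilation fails if the cluster ever grows past (or shrinks below) 64 bytes.
static_assert(sizeof(TTCluster) == CacheLineSize, "cluster must fill one cache line");
static_assert(alignof(TTCluster) == CacheLineSize, "cluster must be 64-byte aligned");

class TranspositionTable {
public:
    // clusterCount must be a power of two so that the key can be masked.
    explicit TranspositionTable(std::size_t clusterCount)
        : raw_(clusterCount * sizeof(TTCluster) + CacheLineSize - 1),
          mask_(clusterCount - 1) {
        assert(clusterCount != 0 && (clusterCount & (clusterCount - 1)) == 0);
        // Round the buffer start up to the next 64-byte boundary.
        auto addr = reinterpret_cast<std::uintptr_t>(raw_.data());
        addr = (addr + CacheLineSize - 1) & ~std::uintptr_t(CacheLineSize - 1);
        table_ = reinterpret_cast<TTCluster*>(addr);
        for (std::size_t i = 0; i < clusterCount; ++i)
            new (&table_[i]) TTCluster();  // start the lifetime of each cluster
    }
    TranspositionTable(const TranspositionTable&) = delete;
    TranspositionTable& operator=(const TranspositionTable&) = delete;

    // The whole cluster is addressed, never a single entry.
    TTCluster* cluster(std::uint64_t key) const { return table_ + (key & mask_); }

private:
    std::vector<unsigned char> raw_;  // over-allocated backing storage
    std::uint64_t mask_;
    TTCluster* table_;
};
```

With this layout, every pointer returned by cluster() satisfies (address & 0x3f) == 0, so probing one cluster touches exactly one cache line.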
Gerd Isenberg wrote:
Too many prefetches are counterproductive, since they may pollute L1. They likely also interact with the hardware prefetcher.
I have tried prefetcht1 instead of prefetcht0, so as to leave the cache line in L2, but it was a bit slower. The hardware prefetcher, as far as I understand, works well with regular access patterns, but hash table access is almost random, so I would not expect it to help much.
Gerd Isenberg wrote:
What do you do between the prefetch and actually reading the four 16-byte slots via four movdqa? Likely you prefetch after updating the Zobrist key in make move, but before updating other stuff and checking for repetitions.
Actually I do a lot of other stuff, because I have tried to prefetch as early as possible, as you said, just after updating the key in do_move().
I have even shuffled the code inside do_move() so as to update the key as early as possible, then prefetch, then do all the other work after the prefetch, so that as much as possible gets done before returning.
I have tested prefetching in different places, and in the end it turned out that the earlier the prefetch, the better.
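The ordering described above (update the key, prefetch, then everything else) might look roughly like the sketch below. Everything here is a hypothetical stand-in: the Position, Move and Cluster types, the fixed-size global table, and the prefetch() wrapper around the GCC/Clang builtin are assumptions made so the example compiles on its own, not the real do_move().

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for a cluster of four 16-byte entries (assumption).
struct alignas(64) Cluster { std::uint64_t words[8]; };

constexpr std::size_t ClusterCount = 1024;  // power of two, illustrative size
static Cluster g_table[ClusterCount];

inline Cluster* first_cluster(std::uint64_t key) {
    return &g_table[key & (ClusterCount - 1)];
}

// Hint the CPU to start loading the line; does nothing on other compilers.
inline void prefetch(const void* addr) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 0, 3);  // read access, keep in all cache levels
#else
    (void)addr;
#endif
}

// Hypothetical move: squares plus the precomputed Zobrist key change.
struct Move { int from; int to; std::uint64_t zobristDelta; };

struct Position {
    std::uint64_t key = 0;
    int board[64] = {};
    int ply = 0;

    void do_move(const Move& m) {
        // 1. Update the hash key before anything else.
        key ^= m.zobristDelta;
        // 2. Issue the prefetch as soon as the new key is known.
        prefetch(first_cluster(key));
        // 3. All remaining bookkeeping overlaps with the memory fetch.
        board[m.to] = board[m.from];
        board[m.from] = 0;
        ++ply;
    }
};
```

The more work that sits between step 2 and the later probe of the table, the more of the memory latency is hidden, which matches the observation that earlier is better.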
I have no idea why two prefetches are better than one, but they are. I am sure the cluster of entries fits in 64 bytes (there is a compile error otherwise).
And I am sure I read only one cluster when probing the hash table.
It is as if the processor, when accessing some data in RAM, also tries to fetch the next cache line in addition to the requested one, even though I have not asked for it, so that prefetching the next line as well seems to help.
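For reference, a minimal sketch of what "two prefetches" could mean in code: one for the line holding the cluster and one for the line 64 bytes further on. The helper names (line_of, next_line, prefetch2) are made up for this example, and the prefetch() wrapper assumes the GCC/Clang builtin. Whether the second prefetch helps depends on the hardware and is worth measuring rather than assuming.

```cpp
#include <cstdint>

constexpr std::uintptr_t CacheLineSize = 64;

// Hint the CPU to start loading the line; does nothing on other compilers.
inline void prefetch(const void* addr) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(addr, 0, 3);  // read access, keep in all cache levels
#else
    (void)addr;
#endif
}

// Start address of the cache line that contains p.
inline std::uintptr_t line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) & ~(CacheLineSize - 1);
}

// Start address of the cache line immediately after the one containing p.
inline std::uintptr_t next_line(const void* p) {
    return line_of(p) + CacheLineSize;
}

// Prefetch the requested line and the one following it.
inline void prefetch2(const void* addr) {
    prefetch(reinterpret_cast<const void*>(line_of(addr)));
    prefetch(reinterpret_cast<const void*>(next_line(addr)));
}
```

A prefetch is only a hint and does not fault, so asking for the line after the last cluster of the table is harmless.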