hgm wrote:
> With what you say, it should indeed not matter much whether you have a hit or a miss. The only thing I can think of is that you might have a hardware prefetcher that is too smart for its own good, and predicts that after n, n+1, n+2 and n+3, it is likely that n+4 will be needed next, and already starts fetching it from memory.

A more prosaic explanation is that TT hits by definition have been written to, perhaps quite recently. Unless evicted, such TT entries remain in the processor's data caches, ready for fast reading.
Misses are different. In a short search with an enormous TT, most slots have not been written to at all. 'Miss' timings therefore reflect the cost of reading from RAM -- hundreds of clock cycles.
hgm wrote:
> ...it is better to test n, n^1, n^2 and n^3 (in a 64-byte aligned TT), without any fudging of n.

I implemented this suggested trick, but it had no effect on timings.
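For reference, a sketch of the bucket probe hgm describes, under the assumption of 16-byte entries (so four entries fill one 64-byte cache line); the names TTEntry, tt_init, tt_probe and tt_store are mine, not from the thread:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed 16-byte entry: four per 64-byte cache line. */
typedef struct {
    uint64_t key;   /* full hash key, for verification */
    uint64_t data;  /* would pack move/score/depth in a real engine */
} TTEntry;

static TTEntry *tt;     /* 64-byte aligned table */
static size_t tt_mask;  /* entry count minus one (power of two) */

static int tt_init(size_t entries) {  /* entries must be a power of two */
    tt = aligned_alloc(64, entries * sizeof(TTEntry));  /* C11 */
    if (!tt) return -1;
    for (size_t i = 0; i < entries; i++) tt[i].key = 0;
    tt_mask = entries - 1;
    return 0;
}

static TTEntry *tt_probe(uint64_t key) {
    size_t n = (size_t)key & tt_mask;
    /* n^0, n^1, n^2, n^3 enumerate exactly the four slots of the
     * 64-byte line containing n -- no fudging of n itself. */
    for (size_t i = 0; i < 4; i++)
        if (tt[n ^ i].key == key)
            return &tt[n ^ i];
    return NULL;
}

static void tt_store(uint64_t key, uint64_t data) {
    size_t n = (size_t)key & tt_mask;
    /* trivial replacement policy: always overwrite slot n */
    tt[n].key = key;
    tt[n].data = data;
}
```

Because XOR with 0..3 only flips the low two index bits, all four probed slots share the upper index bits and therefore sit in one aligned cache line, so at most one memory access is paid per probe.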
Lastly, there is the puzzle of why 64-byte alignment did not significantly reduce TT probe times on my Core i5. It turns out that it does, but only if I also prefetch the first entry with a carefully placed __builtin_prefetch().
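For what it's worth, a sketch of such a placement (GCC/Clang builtin; tt, tt_mask and tt_prefetch are illustrative names, not from my engine). The idea is to issue the prefetch as soon as the child position's hash key is known, so the cache line has arrived by the time the probe actually reads it:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static uint64_t *tt;    /* TT viewed as raw 64-bit words for brevity */
static size_t tt_mask;  /* entry count minus one (power of two) */

static inline void tt_prefetch(uint64_t key) {
    /* rw=0 (read), locality=3 (keep in all cache levels).
     * One call pulls the whole 64-byte line, i.e. the entire
     * aligned bucket, not just the first entry. */
    __builtin_prefetch(&tt[key & tt_mask], 0, 3);
}
```

The "carefully placed" part would be calling tt_prefetch() right after make_move() computes the new key, then doing other work (legality checks, repetition detection) before the probe, so the memory latency overlaps useful computation.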
These results are for a 1024 MB TT; measurement conditions are as described in the original posting.
Code:

                    probe time (ns)
prefetch  aligned    hit    miss    Mnps
no        no          64     103    4.10
no        yes         63     101    4.12
yes       no          24      55    4.13
yes       yes         24      43    4.22
A shame that it gives me only a 3% increase in nps.
Robert P.