Direct timing of TT probes

Discussion of chess software programming and technical issues.


micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: Direct timing of TT probes

Post by micron »

hgm wrote:With what you say, it should indeed not matter much whether you have a hit or a miss. The only thing I can think of is that you might have a hardware prefetcher that is too smart for its own good, and predicts that after n, n+1, n+2 and n+3, it is likely that n+4 will be needed next, and already starts fetching it from memory.
A more prosaic explanation is that TT hits by definition have been written to, perhaps quite recently. Unless evicted, such TT entries remain in the processor's data caches ready for fast reading.
Misses are different. In a short search with an enormous TT, most slots have not been written to at all. 'Miss' timings therefore reflect the cost of reading from RAM -- hundreds of clock cycles.

I implemented this suggested trick:
hgm wrote:...it is better to test n, n^1, n^2 and n^3 (in a 64-byte aligned TT), without any fudging of n.
but it had no effect on timings.
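For reference, hgm's n^1/n^2/n^3 scheme can be sketched roughly like this (the names TTEntry, tt and tt_probe are mine for illustration, not from any engine):

```c
/* Hypothetical sketch of the suggestion: with 16-byte entries in a
 * 64-byte-aligned table, the slots n, n^1, n^2 and n^3 differ only in
 * their low two bits, so all four lie in the same cache line and a
 * whole-bucket probe costs at most one memory access. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;    /* full hash signature */
    uint64_t data;   /* score, depth, move and flags packed together */
} TTEntry;           /* 16 bytes -> 4 entries per 64-byte cache line */

#define TT_BYTES (1u << 20)  /* 1 MB here; micron's table was 1024 MB */
static TTEntry tt[TT_BYTES / sizeof(TTEntry)] __attribute__((aligned(64)));
static const size_t tt_mask = TT_BYTES / sizeof(TTEntry) - 1;

TTEntry *tt_probe(uint64_t hash)
{
    size_t n = (size_t)hash & tt_mask;
    for (size_t i = 0; i < 4; i++) {     /* probes n, n^1, n^2, n^3 */
        TTEntry *e = &tt[n ^ i];
        if (e->key == hash)
            return e;                    /* hit */
    }
    return NULL;                         /* miss */
}
```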


Lastly, there is the puzzle of why 64-byte alignment did not significantly reduce TT probe times on my Core i5. It turns out that it does, iff I prefetch the first entry with a carefully placed __builtin_prefetch().
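A minimal sketch of the idea (my names; the point of "carefully placed" is to issue the prefetch as soon as the position's hash key is known, so the fetch from RAM overlaps with other work done before the probe):

```c
/* Sketch of prefetching a TT slot ahead of the probe.  tt_prefetch
 * would be called right after the hash key of the next position is
 * computed, well before tt_probe_hit reads the entry. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t key, data; } TTEntry;

#define TT_ENTRIES 65536   /* tiny table just for illustration */
static TTEntry tt[TT_ENTRIES] __attribute__((aligned(64)));
static const size_t tt_mask = TT_ENTRIES - 1;

void tt_prefetch(uint64_t hash)
{
    /* rw = 0 (read), locality = 3 (keep in all cache levels) */
    __builtin_prefetch(&tt[hash & tt_mask], 0, 3);
}

int tt_probe_hit(uint64_t hash, uint64_t *data_out)
{
    TTEntry *e = &tt[hash & tt_mask];
    if (e->key != hash)
        return 0;              /* miss */
    *data_out = e->data;
    return 1;                  /* hit */
}
```

The prefetch is only a hint (it never faults, even on a bad address), so the worst case is a wasted fetch.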

These results are for 1024 MB TT, measurement conditions as described in original posting.

Code:

                    probe time (ns)
prefetch  aligned    hit  miss        Mnps 
    no        no      64   103        4.10
    no       yes      63   101        4.12
   yes        no      24    55        4.13
   yes       yes      24    43        4.22
Wow, look at all those nanoseconds saved by prefetching! And more by aligning to 64 bytes! As Senator Dirksen should have said: "A nanosecond here, a nanosecond there; pretty soon you're talking real speed-up".
A shame that it gives me only 3% increase in nps.
Robert P.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Direct timing of TT probes

Post by bob »

micron wrote:
hgm wrote:With what you say, it should indeed not matter much whether you have a hit or a miss. The only thing I can think of is that you might have a hardware prefetcher that is too smart for its own good, and predicts that after n, n+1, n+2 and n+3, it is likely that n+4 will be needed next, and already starts fetching it from memory.
A more prosaic explanation is that TT hits by definition have been written to, perhaps quite recently. Unless evicted, such TT entries remain in the processor's data caches ready for fast reading.
Misses are different. In a short search with an enormous TT, most slots have not been written to at all. 'Miss' timings therefore reflect the cost of reading from RAM -- hundreds of clock cycles.

I implemented this suggested trick:
hgm wrote:...it is better to test n, n^1, n^2 and n^3 (in a 64-byte aligned TT), without any fudging of n.
but it had no effect on timings.


Lastly, there is the puzzle of why 64-byte alignment did not significantly reduce TT probe times on my Core i5. It turns out that it does, iff I prefetch the first entry with a carefully placed __builtin_prefetch().

These results are for 1024 MB TT, measurement conditions as described in original posting.

Code:

                    probe time (ns)
prefetch  aligned    hit  miss        Mnps 
    no        no      64   103        4.10
    no       yes      63   101        4.12
   yes        no      24    55        4.13
   yes       yes      24    43        4.22
Wow, look at all those nanoseconds saved by prefetching! And more by aligning to 64 bytes! As Senator Dirksen should have said: "A nanosecond here, a nanosecond there; pretty soon you're talking real speed-up".
A shame that it gives me only 3% increase in nps.
Robert P.
But you have learned something important. That idea of "prefetching" (not the prefetch instruction, but the hardware pulling in all 64 bytes of a line when you access any byte in that block) matters beyond the TT. If you arrange memory so that variables used close together in time are also close together in memory, you save some more nanoseconds, because the first access may miss, but the ones close around it will not.

I've spent a ton of time doing this to Crafty, and occasionally go back and rearrange for that very reason.
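That kind of rearrangement can be sketched like this (hypothetical struct and field names, not Crafty's actual layout):

```c
/* Illustrative layout: group the fields that are read together on
 * every node at the front of the struct, so a single 64-byte line
 * fill brings in all of them, and push large or rarely-touched data
 * behind them so it never shares a line with the hot fields. */
#include <stddef.h>
#include <stdint.h>

struct SearchState {
    /* hot: touched together at every node */
    uint64_t hash_key;
    int16_t  static_eval;
    int16_t  ply;
    uint32_t node_flags;
    /* cold: bulk data follows the hot fields */
    uint64_t pv_moves[64];
    uint64_t nodes_searched;
};

/* compile-time check that the hot fields fit in one 64-byte line */
_Static_assert(offsetof(struct SearchState, pv_moves) <= 64,
               "hot fields spill past the first cache line");
```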