Direct timing of TT probes

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Direct timing of TT probes

Post by micron »

An optimization that has been discussed
http://www.talkchess.com/forum/viewtopic.php?t=36516
is to align TT buckets to a cache-line (64 bytes). With the usual 16-byte TT entry, this should have the cache-friendly effect of loading 4 adjacent entries (slots) simultaneously. The slots can then be searched for a TT hit without invoking more cache traffic.
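
Something like this is what I mean by an aligned bucket (a sketch with made-up names, not my actual code):

Code: Select all

#include <stdint.h>
#include <stdlib.h>

typedef struct
{   uint64_t hashCode;            /* verification key                   */
    uint64_t data;                /* move, score, depth, bound, age ... */
} TTEntry;                        /* 16 bytes                           */

typedef struct
{   TTEntry slot[4];              /* 4 x 16 = 64 bytes = one cache line */
} TTBucket;

static TTBucket *AllocTT( size_t nBuckets )
{
    void *p = NULL;
    /* posix_memalign guarantees a 64-byte-aligned base address,
       so every bucket starts exactly on a cache-line boundary.  */
    if ( posix_memalign( &p, 64, nBuckets * sizeof(TTBucket) ) != 0 )
        return NULL;
    return (TTBucket *)p;
}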

In testing whether my recent alignment code was correct, I somehow ended up directly measuring the times taken by TT probes.

Code: Select all

               nanoseconds per TT probe
           16-byte-aligned    64-byte-aligned
TT MB       avg  hit  miss     avg  hit  miss
 2048       116   68   132     109   66   124
 1024       104   61   118      98   60   111
  512        93   56   106      89   56   100
  256        88   54    99      85   55    95
   64        78   49    88      75   49    84
   32        74   46    83      69   45    77
   16        66   40    74      60   37    66
    8        53   33    59      48   31    52
    4        33   25    35      32   24    33
    2        28   24    29      28   24    28
TT entry is 16 bytes. There are 4 slots per bucket.
2.66 GHz Core i5 with 8 MB L3 cache; cache line is 64 bytes.
All searches were single-threaded, nominal 15 ply from the start position. There were about 3.2E+6 TT probes in 2.8--3.0 s.
Each run was done three times, the run with the smallest average (avg) result being chosen for the table.
Individual timings:

Code: Select all

uint64_t start = mach_absolute_time(); // the best timer on a Mac
// ... search up to 4 TT slots for a hit; sets the 'hit' flag
uint64_t end = mach_absolute_time();
double elapsedNanoseconds = gTimerScale*(end - start - gTimerLatency);
if ( hit ) 
  { gTotalTTHitProbeTime += elapsedNanoseconds;
    gNumTTHits++;
  }
else
  { gTotalTTMissProbeTime += elapsedNanoseconds;
    gNumTTMisses++;
  }
where gTimerLatency is precomputed by timing empty loops.
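
For completeness, the calibration looks roughly like this (a sketch, not my exact code; presumably gTimerScale comes from mach_timebase_info, and the latency is the cheapest of many empty start/stop pairs):

Code: Select all

#include <stdint.h>
#include <mach/mach_time.h>

static double   gTimerScale;     /* converts mach ticks to nanoseconds    */
static uint64_t gTimerLatency;   /* overhead of an empty start/stop pair  */

static void CalibrateTimer( void )
{
    mach_timebase_info_data_t tb;
    mach_timebase_info( &tb );
    gTimerScale = (double)tb.numer / (double)tb.denom;

    uint64_t best = UINT64_MAX;              /* keep the cheapest empty pair */
    for ( int i = 0; i < 1000; i++ )
    {   uint64_t t0 = mach_absolute_time();
        uint64_t t1 = mach_absolute_time();
        if ( t1 - t0 < best ) best = t1 - t0;
    }
    gTimerLatency = best;                    /* in ticks, like end - start   */
}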

Conclusion 1: 64-byte alignment indeed has a small beneficial effect, especially on TT misses (for which all 4 slots have to be read). The effect of alignment is negligible for really small TTs, presumably because they fit entirely in L3 cache. So far, so comprehensible...

Conclusion 2: the time for a TT miss is longer than for a hit. To be expected, but why, in a huge TT, is a miss so much (nearly 2x) more expensive than a hit?

Conclusion 3: probe time increases remorselessly with the size of the TT, this effect being much larger than the piddling result of alignment. Hmmm, that's unexpected. What would cause it?

Robert P.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Direct timing of TT probes

Post by Rémi Coulom »

micron wrote: Conclusion 2: the time for a TT miss is longer than for a hit. To be expected, but why, in a huge TT, is a miss so much (nearly 2x) more expensive than a hit?
If the first probe is not aligned with the cache line, then it is likely that a TT miss will have to load two lines. It might be a good idea to align the address of every first probe, too.

Just a guess, I did not test it.

Rémi
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Direct timing of TT probes

Post by hgm »

Which 4 entries do you probe? If you probe n, n+1, n+2 and n+3, where n can be anything, 64-byte alignment of the TT base address really should not have any effect at all. Only when (as Rémi suggests) you align both the TT start address to a 64-byte boundary and n to a multiple of 4 (n &= ~3) is there really any alignment.

But it is better to test n, n^1, n^2 and n^3 (in a 64-byte aligned TT), without any fudging of n. That will always stay within the cache line (as the XOR will never change the higher-order bits of the index, just the lowest two), but has the advantage that you are more likely to find an entry where you stored it on the first try.
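
In code the loop is just something like this (a sketch with illustrative names; TTEntry, sTT, hashCode and the index mask stand for whatever your engine uses):

Code: Select all

TTEntry *ProbeTT( uint64_t key )
{
    size_t n = (size_t)( key & sTTIndexMask );   /* raw slot index, no n &= ~3 */
    for ( int i = 0; i < 4; i++ )
    {   TTEntry *e = &sTT[ n ^ i ];   /* n, n^1, n^2, n^3: same 64-byte line */
        if ( e->hashCode == key )
            return e;                 /* hit, usually already at i == 0      */
    }
    return NULL;                      /* miss                                */
}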

The time of the probes increases because probes outside L3 are very slow, and you get a larger fraction of them as the table size increases. E.g. with 4 MB of L3 and a 4 MB table nearly everything comes from cache, but with an 8 MB table only half, with 16 MB only 25%, etc. So the average increases, and asymptotically it should approach the RAM probe time. A second effect is that at some point you will also start to get TLB misses in the paging unit, which also take time to service, possibly from RAM.
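
A back-of-envelope model of the first effect (the latencies are assumed numbers, only the shape of the curve matters):

Code: Select all

/* Assumed latencies, purely illustrative. */
#include <stdio.h>

int main( void )
{
    const double L3_MB = 8.0;    /* L3 size of the Core i5 above      */
    const double t_L3  = 25.0;   /* assumed cost of an L3 hit, in ns  */
    const double t_RAM = 130.0;  /* assumed cost of a RAM access, ns  */
    const double tableMB[] = { 2, 8, 32, 256, 2048 };

    for ( int i = 0; i < 5; i++ )
    {   double f = L3_MB / tableMB[i];            /* fraction of probes served by L3 */
        if ( f > 1.0 ) f = 1.0;
        double avg = f*t_L3 + (1.0 - f)*t_RAM;    /* expected probe latency */
        printf( "%5.0f MB table: ~%.0f ns per probe\n", tableMB[i], avg );
    }
    return 0;
}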
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Direct timing of TT probes

Post by bob »

micron wrote: [...]

Conclusion 2: the time for a TT miss is longer than for a hit. To be expected, but why, in a huge TT, is a miss so much (nearly 2x) more expensive than a hit?

Conclusion 3: probe time increases remorselessly with the size of the TT, this effect being much larger than the piddling result of alignment. Hmmm, that's unexpected. What would cause it?
The problem is that the virtual-to-real translation process can, in the worst case, take 4 memory probes just to translate to a physical memory address. And you _still_ have to go to memory (or cache) to get the actual data. As you increase the table size, the number of virtual pages climbs. And once you exceed the number of TLB entries, you start to incur the above penalties. That is why some resort to using the "huge page size" that Intel/AMD hardware supports. If you make the pages 4M rather than 4K, you have reduced the number of pages by 99.9%, and thereby reduced the stress on the TLB.
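
On Linux the table can be put on huge pages with mmap(), roughly like this (a sketch; it needs huge pages reserved by the administrator, e.g. via /proc/sys/vm/nr_hugepages, and it does not apply to the OS X setup used above):

Code: Select all

#include <stddef.h>
#include <sys/mman.h>

static void *AllocTTHuge( size_t bytes )
{
    void *p = mmap( NULL, bytes, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0 );
    if ( p == MAP_FAILED )                       /* no huge pages available  */
        p = mmap( NULL, bytes, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0 );
    return ( p == MAP_FAILED ) ? NULL : p;       /* fall back to normal pages */
}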

All I can guess for the hit vs miss question is that on a hit, with a bucket size of 4 entries, on average you will examine 2.5 entries before finding the hit (min=1, max=4, average=2.5). On a miss, you have to examine all 4, which takes longer.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Direct timing of TT probes

Post by rbarreira »

Will operating systems start using bigger pages any time soon (by default) to mitigate these TLB effects?
micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: Direct timing of TT probes

Post by micron »

hgm wrote: Which 4 entries do you probe? If you probe n, n+1, n+2 and n+3, where n can be anything, 64-byte alignment of the TT base address really should not have any effect at all. Only when (as Rémi suggests) you align both the TT start address to a 64-byte boundary and n to a multiple of 4 (n &= ~3) is there really any alignment.
With 64-byte alignment, I do indeed ensure that n is a multiple of 4, then search n, n+1, n+2, n+3, all of which should lie in the same cache line. Moreover, the bucket is Really Truly aligned:

Code: Select all

assert( n % 4 == 0 );
intptr_t x = (intptr_t)&sTT[n].hashCode;
assert( x % 64 == 0 ); // 64-byte alignment
hgm wrote: But it is better to test n, n^1, n^2 and n^3 (in a 64-byte aligned TT), without any fudging of n. That will always stay within the cache line (as the XOR will never change the higher-order bits of the index, just the lowest two), but has the advantage that you are more likely to find an entry where you stored it on the first try.
Interesting. I'll give that a try.
Robert P.
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Direct timing of TT probes

Post by hgm »

With what you say, it should indeed not matter much whether you have a hit or a miss. The only thing I can think of is that you might have a hardware prefetcher that is too smart for its own good, and predicts that after n, n+1, n+2 and n+3 it is likely that n+4 will be needed next, and already starts fetching it from memory.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Direct timing of TT probes

Post by bob »

rbarreira wrote:Will operating systems start using bigger pages any time soon (by default) to mitigate these TLB effects?
I suspect not. They come with their own set of problems. For example, a 4 MB page comprises 1024 4 KB pages. That means that none of those physical pages can be allocated to any other process, or else the entire 4 MB page is not usable. Then there is the fragmentation (inside a page) issue. You won't want 4 MB every time you create something, but you will get it, and waste part of the page. That's why we have seen page sizes vary over the years, from the 512-byte pages on the VAX to the large pages Intel supports.

It is not so easy to let a program start off using small pages and then switch. All of the old physical pages have to go away and be replaced by a smaller number of large pages. That is inconvenient, to say the least. How do you copy the data from 1024 small pages scattered all over physical RAM into one large 4 MB (or 2 MB) page? Does this happen suddenly when something triggers the system to say "hey, big pages would be better because of ..... (fill in your reason here)"?

It will happen over time, no doubt, but not really yet.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Direct timing of TT probes

Post by rbarreira »

bob wrote:
rbarreira wrote:Will operating systems start using bigger pages any time soon (by default) to mitigate these TLB effects?
I suspect not. They come with their own set of problems. For example, a 4 MB page comprises 1024 4 KB pages. That means that none of those physical pages can be allocated to any other process, or else the entire 4 MB page is not usable. Then there is the fragmentation (inside a page) issue. You won't want 4 MB every time you create something, but you will get it, and waste part of the page. That's why we have seen page sizes vary over the years, from the 512-byte pages on the VAX to the large pages Intel supports.

It is not so easy to let a program start off using small pages and then switch. All of the old physical pages have to go away and be replaced by a smaller number of large pages. That is inconvenient, to say the least. How do you copy the data from 1024 small pages scattered all over physical RAM into one large 4 MB (or 2 MB) page? Does this happen suddenly when something triggers the system to say "hey, big pages would be better because of ..... (fill in your reason here)"?

It will happen over time, no doubt, but not really yet.
I was thinking of something like 8-32 KB, not the huge 4 MB pages.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Direct timing of TT probes

Post by bob »

rbarreira wrote:
I was thinking of something like 8-32 KB, not the huge 4 MB pages.
You only have normal and huge pages, not arbitrary sizes. It is a real pain to design the architecture to work with two page sizes, much less a dozen. I think the newer processors from Intel can actually use a 1 GB page size, which is effectively a flat memory model with minimal page-translation overhead. The usual x86-64 page tables are 4 levels of 2^9 entries each, which gives an address space of 2^36 pages; with 4 KB pages that is the 2^48-byte virtual address space, the maximum you can use today. With 1 GB pages (2^30 bytes) you are left with just 2^18 pages, which could be handled with a 2-level map. Of course no memory is that large yet, so in reality with 1 GB pages you will have at most 256 pages on real hardware, and that the TLB might deal with, although on existing hardware the TLB is "split", with part handling small pages and part handling large pages...
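
Spelling that arithmetic out:

Code: Select all

4 KB pages: 12-bit offset + 4 levels x 9 bits of index = 48-bit virtual address (2^36 pages)
1 GB pages: 30-bit offset + 18 bits of index           = 48 bits (2^18 pages, a 2-level map)
256 GB of physical RAM => at most 256 resident 1 GB pages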

Anyway, enough about the MMU hardware for the moment. You can get three page sizes on x86-64: 4 KB, 4 MB and 1 GB, with 2 MB possible instead of 4 MB when PAE is used (if you know what PAE is).

But it does complicate things, since using a big page means none of the constituent small pages can be used...