Cache pollution when reading/writing hash table

wgarvin · Post by **wgarvin** » Mon Aug 10, 2009 11:56 pm

mcostalba wrote:1) Under Windows Vista is impossible to speed test with precision because OS throttles the CPU at its wishes and I didn't found a way from preventing it to do it.

You can try setting your thread priority way up to realtime. Make sure its not an infinite loop though, or you might have some difficulty killing the process

Also, don't ever call the windows API function Sleep(), or Windows will throttle you. GetMessage() might also be bad. In the context of games, where monopolizing one or more cores is normal and acceptable, you can write apps that never call either of these functions. For normal, well-behaved apps you might not have any choice but to use them and sometimes get throttled.

bob · Post by **bob** » Tue Aug 11, 2009 2:15 am

Zach Wegner wrote:Or he is using a P4. EDIT: I thought I remembered the P4 cache line being 32 bytes, but it appears it's actually 64. Nevermind.

And 128 in L2. My office box is a dual PIV. Not the best machine I have ever run across either.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Tue Aug 11, 2009 7:33 am

mcostalba wrote:
Gian-Carlo Pascutto wrote: You forgot to align your hashtable entries to 64 byte boundaries.
I don't think so. At the very least, I wouldn't believe this claim without evidence

The evidence is that starting a prefetch 32 bytes later speeds up your program.

mcostalba · Post by **mcostalba** » Tue Aug 11, 2009 9:49 am

Gian-Carlo Pascutto wrote:
mcostalba wrote:
Gian-Carlo Pascutto wrote: You forgot to align your hashtable entries to 64 byte boundaries.
I don't think so. At the very least, I wouldn't believe this claim without evidence
The evidence is that starting a prefetch 32 bytes later speeds up your program.

This is an odd thing of prefetching is not an evidence that access is not aligned.

IOW I have also tested with only 1 prefetch at +32 offset and program slows down again.

Currently I use two prefetches at 0 and at +64, it seems comparable to 0+32 in terms of speed and it looks nicer

I have also tested 1 prefetch at 128 byte alignement but again it was slower.

I am also sure my hash entry cluster fits the 64 bytes (I read 4 entry in one go) so that one prefetch of 64 bytes would suffice and there would be no need of a second one at the next line, at least there is no evidence in the code for the need of a second sequential access.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Aug 11, 2009 1:50 pm

mcostalba wrote:
Gian-Carlo Pascutto wrote:
mcostalba wrote:
Gian-Carlo Pascutto wrote: You forgot to align your hashtable entries to 64 byte boundaries.
I don't think so. At the very least, I wouldn't believe this claim without evidence
The evidence is that starting a prefetch 32 bytes later speeds up your program.
This is an odd thing of prefetching is not an evidence that access is not aligned.

IOW I have also tested with only 1 prefetch at +32 offset and program slows down again.

Currently I use two prefetches at 0 and at +64, it seems comparable to 0+32 in terms of speed and it looks nicer

I have also tested 1 prefetch at 128 byte alignement but again it was slower.

I am also sure my hash entry cluster fits the 64 bytes (I read 4 entry in one go) so that one prefetch of 64 bytes would suffice and there would be no need of a second one at the next line, at least there is no evidence in the code for the need of a second sequential access.

Important is that hashentry + 4*N is on a 64-byte aligned address (address & 0x3f == 0), so that 4 consecutive slots are not only 64 byte, but completely fit inside one 64-byte cacheline. Too many prefetches are counter productive, since they may pollute L1. They likely interact (as well with a hardware prefetcher).

What do you do between prefetch and actually reading four 16 byte slots via four movdqa? Likely you use prefetch after updating the zobristkey by make move, but before updating other stuff and checking for repetitions.

Code: Select all

update zobrist by move
baseAddress =  hashtable + &#40;zobrist & pow2sizeMinus1 & ~63&#41;;
assert (&#40;baseAddress & 63&#41; == 0&#41;;

prefetch baseAddress

check for repetitions
update other stuff by move

... do some other usefull stuff, even if not always needed, e.g. in case of hashhit
... but don't do too many other memory references to nullify the prefetch

// probe 4 aligned slots
movdqa xmm0, &#91;baseAddress&#93;
movdqa xmm1, &#91;baseAddress + 16&#93;
movdqa xmm2, &#91;baseAddress + 32&#93;
movdqa xmm3, &#91;baseAddress + 48&#93;
look for hit by pcmpeqxx with zobkey, and pmovmskb, etc...

mcostalba · Post by **mcostalba** » Tue Aug 11, 2009 2:21 pm

Gerd Isenberg wrote: Important is that hashentry + 4*N is on a 64-byte aligned address (address & 0x3f == 0), so that 4 consecutive slots are not only 64 byte, but completely fit inside one 64-byte cacheline.

yes, I can guarantee this, acutally the hash entries are grouped inside a cluster that is 64 bytes aligned, I access the whole cluster, not the single entries.

Gerd Isenberg wrote: Too many prefetches are counter productive, since they may pollute L1. They likely interact (as well with a hardware prefetcher).

I have tried with prefetcht1 instead of prefetcht0 so to leave the cache line in L2 but it was a bit slower. The hardware prefetcher, for what I have understood, works well with regular access patterns, but accessing hash table is almost a random process, so I would not think it could help a lot.

Gerd Isenberg wrote: What do you do between prefetch and actually reading four 16 byte slots via four movdqa? Likely you use prefetch after updating the zobristkey by make move, but before updating other stuff and checking for repetitions.

Actually I do a lot of other stuff because I have tried to prefetch as early as possible, as you said, just after updating the key in do_move().

I have even shuffled the code inside do_move() so to update the key as early as possible, then prefetch, then move all the other stuff after the prefecth so to do how much as possible before to return.

I have tested to prefetch in different places and at the end it turn out that earlier was the prefetch the better.

I have no idea why two prefetches are better the one. But they are. I am sure the cluster of entries fits the 64 bytes (there is a compile error otherwise).

And I am sure to read only one cluster when probing the hash table.

It is like the processor when accessing some data in RAM tries to fetch also the next cache line of data apart from the requested one, even if I have not asked for it, so that prefetching also the next line it seems to help.

hgm · Post by **hgm** » Tue Aug 11, 2009 3:06 pm

Have you tried two prefetches from the same address? A prefetch can be ignored if the CPU considers the memory system so heavily loaded with real reads that prefetching would be counterproductive. The second prefetches might grab a second "window of opportunity".

(This is just a wild guess.)

For similar reasons I once tried to use a store in a dummy byte of the cache line to force the line into cache, as a store could not be ignored by the CPU, but has the advantage that you don't have to wait out the latency, as in a load. This worked on P-IV, where predetching turned out to be a no-op (when the prefetch missed the TLB, which for hash probes was always the case).

mcostalba · Post by **mcostalba** » Tue Aug 11, 2009 3:34 pm

hgm wrote:Have you tried two prefetches from the same address? A prefetch can be ignored if the CPU considers the memory system so heavily loaded with real reads that prefetching would be counterproductive. The second prefetches might grab a second "window of opportunity".

(This is just a wild guess.)

Yes I have tried also this one

No luck, it was slower. Like if it was a single prefetch.

hgm wrote: For similar reasons I once tried to use a store in a dummy byte of the cache line to force the line into cache, as a store could not be ignored by the CPU, but has the advantage that you don't have to wait out the latency, as in a load. This worked on P-IV, where predetching turned out to be a no-op (when the prefetch missed the TLB, which for hash probes was always the case).

This is another interesting trick.

One thing (among many) that I still don't understund is why on Windows Vista MSVC compiled it works while on Linux Intel and gcc compiled it doesn't. I have already noticed prefetch was _optimized_ away, but now I have forced it manually with an asm volatile statement and I have verified prefetcht0 is in the code.

wgarvin · Post by **wgarvin** » Tue Aug 11, 2009 7:13 pm

hgm wrote:Have you tried two prefetches from the same address? A prefetch can be ignored if the CPU considers the memory system so heavily loaded with real reads that prefetching would be counterproductive. The second prefetches might grab a second "window of opportunity".

(This is just a wild guess.)

For similar reasons I once tried to use a store in a dummy byte of the cache line to force the line into cache, as a store could not be ignored by the CPU, but has the advantage that you don't have to wait out the latency, as in a load. This worked on P-IV, where predetching turned out to be a no-op (when the prefetch missed the TLB, which for hash probes was always the case).

Rather than prefetch, it might make sense to do an actual read through a "pointer to volatile char" or something like that. So the optimizer can't throw it out, the TLB miss won't cause it to be discarded, and the out-of-order execution can still do a whole bunch of stuff before you get to the second ("actual") read.

Zach Wegner · Post by **Zach Wegner** » Tue Aug 11, 2009 7:42 pm

Can you really do that much with out of order execution? Fetching from RAM is a few hundred cycles, can the processor really keep executing for that whole time? I don't know much about microarchitecture, but I'd think it would stall at some point.

HG's idea of storing sounds like it might work better, but then you have to write a byte where you might not want to (like if you use the full cache line). It might screw with cache coherency too, making other processors invalidate their cache entries in case the entry is already cached (and in this case prefetching wouldn't really help anyways).

Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table

Re: Cache pollution when reading/writing hash table