Cache pollution when reading/writing hash table

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Cache pollution when reading/writing hash table

Post by hgm »

I agree with your objections. I had a byte to spare, though (seven 9-byte entries in a cache line). And you should only use it if later on you have to write the cache line anyway, like you do in hashing, even if there was a miss.

Re-order buffers had only some 76 entries at the time I tried this; at 3 instructions per clock this would stall the CPU on a cache-missing read after about 25 clocks, way before the actual data came in. Out-of-order execution is good for hiding L2 latency, but against DRAM access it doesn't stand much of a chance.
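
In code it amounted to something like this (just a sketch; the Bucket layout and all names here are hypothetical, but the idea of the spare pad byte is the same):

Code: Select all

typedef unsigned long long uint64;

/* hypothetical layout: seven 9-byte entries plus one spare pad byte
   = exactly one 64-byte cache line */
typedef struct {
    unsigned char entry[7][9];
    unsigned char pad;              /* the byte to spare */
} Bucket;

extern Bucket *hashTable;
extern uint64  hashMask;

void Probe(uint64 key)
{
    Bucket *b = &hashTable[key & hashMask];
    b->pad = 0;   /* dummy store: the CPU cannot drop it, and unlike a
                     load there is no latency to wait out; only
                     defensible if the line will be written anyway */
    /* ... plenty of other work here while the line is fetched ... */
    /* ... then the real probe of b->entry[0..6] follows ... */
}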
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Cache pollution when reading/writing hash table

Post by mcostalba »

Thanks for your suggestions.

I have tried the "write a char" trick because the last bytes of the cluster are just padding.

It works a bit, but not as well as the prefetch. I think what you suggested is happening: the out-of-order execution tries to hide the pending write as long as it can, but in the end the CPU stalls.

I have now realized that on a modern CPU a RAM access is _really_ a no-no.

It can easily stall the CPU for hundreds of cycles or even more. I have put the prefetch very, very early; there is a ton of stuff to do before the data is accessed by the hash probe. Nevertheless, I think the CPU still stalls for a bit when probing the hash table, although we are almost there.
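
The scheme is roughly this (a sketch, not the actual engine code; TT, ttMask and tt_probe are invented names, and __builtin_prefetch is the GCC intrinsic, _mm_prefetch being the MSVC counterpart):

Code: Select all

#include <stdint.h>

extern char    *TT;        /* transposition table, 64-byte clusters */
extern uint64_t ttMask;    /* number of clusters minus one          */

void search_node(uint64_t posKey)
{
    /* issue the prefetch as early as possible, as soon as the key
       of the position to be probed is known */
    __builtin_prefetch(TT + (posKey & ttMask) * 64);

    /* ... move generation, legality checks, everything that has to
       happen before the probe: the more work sits here, the more
       of the DRAM latency is hidden ... */

    /* by the time the real probe runs, the cluster is (with luck)
       already in cache */
    /* tt_probe(posKey); */
}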
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Cache pollution when reading/writing hash table

Post by hgm »

For this reason some people do not probe the hash at all in QS (which has the advantage that there is then also no point in storing QS entries). But you say this slows the engine down (it was also true for Joker, btw). You might try a kind of intermediate approach, which I never tried in an engine, but which did work quite well in perft:

You could keep QS nodes out of the main hash, and store them in a separate small hash table that permanently resides in L2 (like 1/4 the size of L2, so that it leaves plenty of space for the code and core data). With a 4MB L2 such a table could still hold 16K cache lines. Your hit rate on QS positions would surely drop compared to storing them in the main hash, but it would still be significant (perhaps half of what it would otherwise be), and the probing would be so fast as to be essentially free. So it might pay off.
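
A sketch of what I mean (the sizes and names are just for illustration; 16K lines of 64 bytes is 1MB, a quarter of a 4MB L2):

Code: Select all

#include <stdint.h>

#define QS_LINES (16 * 1024)        /* 16K cache lines = 1 MB */

typedef struct {
    uint64_t key;
    int16_t  score;                 /* padded to 16 bytes, so 4 per line */
} QSEntry;

/* small enough to stay resident in L2 next to the code and core data */
static QSEntry qsTable[QS_LINES * 4];

int qs_probe(uint64_t key, int *score)
{
    QSEntry *e = &qsTable[key & (QS_LINES * 4 - 1)];
    if (e->key == key) {
        *score = e->score;
        return 1;                   /* hit: essentially free            */
    }
    return 0;                       /* miss: hit rate is lower than in
                                       the main hash, but no DRAM stall */
}

void qs_store(uint64_t key, int score)
{
    QSEntry *e = &qsTable[key & (QS_LINES * 4 - 1)];
    e->key = key;                   /* always-replace, for simplicity   */
    e->score = (int16_t)score;
}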
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cache pollution when reading/writing hash table

Post by bob »

hgm wrote:I agree with your objections. I had a byte to spare, though (seven 9-byte entries in a cache line). And you should only use it if later on you have to write the cache line anyway, like you do in hashing, even if there was a miss.

Re-order buffers had only some 76 entries at the time I tried this; at 3 instructions per clock this would stall the CPU on a cache-missing read after about 25 clocks, way before the actual data came in. Out-of-order execution is good for hiding L2 latency, but against DRAM access it doesn't stand much of a chance.
The "byte to spare" is not the issue. For anyone intending to develop a parallel search, this is a very bad idea.

First, there are somewhat more reads than writes to the hash table. For misses, or for insufficient depth (most of the cases), we do a read at the front of the search and a store before we exit. A few searches get out without doing the store, because the probe returned a useful depth and a bound/score that was good enough to abort the current search.

Storing a single byte just to prefetch the cache line is bad, because the store invalidates that line in every other L1/L2 cache in the system, and it does this for every read you do, because you do the store first. That most likely makes the cure worse than the disease in all cases.

Storing into something that is shared, just to prefetch it, is ugly in forwarding cache systems. Or non-forwarding for that matter...
Bo Persson
Posts: 243
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Cache pollution when reading/writing hash table

Post by Bo Persson »

wgarvin wrote:
hgm wrote:Have you tried two prefetches from the same address? A prefetch can be ignored if the CPU considers the memory system so heavily loaded with real reads that prefetching would be counterproductive. A second prefetch might grab a second "window of opportunity".

(This is just a wild guess.)

For similar reasons I once tried to use a store into a dummy byte of the cache line to force the line into cache: a store cannot be ignored by the CPU, and it has the advantage that you don't have to wait out the latency, as with a load. This worked on the P-IV, where prefetching turned out to be a no-op (when the prefetch missed the TLB, which for hash probes was always the case).
Rather than prefetch, it might make sense to do an actual read through a "pointer to volatile char" or something like that. So the optimizer can't throw it out, the TLB miss won't cause it to be discarded, and the out-of-order execution can still do a whole bunch of stuff before you get to the second ("actual") read.
If the optimizer is smart enough, it can detect that the real data isn't volatile and still throw the read out. Trying to outsmart the compiler is often counterproductive.
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Cache pollution when reading/writing hash table

Post by hgm »

I don't think an optimizer could ever decide something is non-volatile. There could be memory-mapped hardware at that address.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Cache pollution when reading/writing hash table

Post by wgarvin »

hgm wrote:I agree with your objections. I had a byte to spare, though (seven 9-byte entries in a cache line). And you should only use it if later on you have to write the cache line anyway, like you do in hashing, even if there was a miss.

Re-order buffers had only some 76 entries at the time I tried this; at 3 instructions per clock this would stall the CPU on a cache-missing read after about 25 clocks, way before the actual data came in. Out-of-order execution is good for hiding L2 latency, but against DRAM access it doesn't stand much of a chance.
Okay, I see what you mean. The issue is retirement. I wonder if there is a way to issue a "real read" followed by a clear of that register? I'm not sure if dead reads have to go all the way to the retirement stage, but in any event, it's getting very architecture-specific at that point, and it might be too risky to rely on across x86 chips in general. [Edit: Core2 reorder buffers have 96 entries now, but your conclusions are still right.]
Bo wrote:If the optimizer is smart enough, it can detect that the real data isn't volatile and still throw the read out. Trying to outsmart the compiler is often counterproductive.
It's not allowed to do that. "Pointer to volatile" essentially means "whatever is accessed through this pointer must be accessed as if it were volatile". In the same way that you can make a const pointer to data that is modifiable elsewhere, but you just can't modify it *through that const pointer*, the optimizer may mess with the accesses you do through some other pointer to the same address, but it can't eliminate the access through the pointer-to-volatile.
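
In other words, something like this (a sketch; HashEntry and all the names are made up):

Code: Select all

#include <stdint.h>

typedef struct { uint64_t key; uint64_t data; } HashEntry;

void probe(HashEntry *entry)
{
    /* dummy read through a pointer-to-volatile: the compiler must
       emit it, even though the value is thrown away */
    volatile char *touch = (volatile char *)entry;
    (void)*touch;

    /* ... a whole bunch of work the out-of-order core can do while
       the cache line travels up the memory hierarchy ... */

    /* the second, "actual" read should then hit in cache */
    if (entry->key != 0) {
        /* ... compare the key, use entry->data, etc. ... */
    }
}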
Bo Persson
Posts: 243
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Cache pollution when reading/writing hash table

Post by Bo Persson »

hgm wrote:I don't think an optimizer could ever decide something is non-volatile. There could be memory-mapped hardware at that address.
This was about using pointer-to-volatile, to persuade the compiler to actually read/write the data.

However, the compiler just might be smart enough to detect that the pointer always points into a malloc'ed hash table, which isn't volatile.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Cache pollution when reading/writing hash table

Post by wgarvin »

Bo Persson wrote:
hgm wrote:I don't think an optimizer could ever decide something is non-volatile. There could be memory-mapped hardware at that address.
This was about using pointer-to-volatile, to persuade the compiler to actually read/write the data.

However, the compiler just might be smart enough to detect that the pointer always points into a malloc'ed hash table, which isn't volatile.
Smart has nothing to do with it: if you write to a volatile-qualified lvalue (such as *myVolatilePtr), then the compiler is not allowed to optimize away that access. It doesn't matter what myVolatilePtr points to; it could point to a local variable declared in this function, and the compiler must still not optimize away that access or re-order it with respect to other volatile accesses or sequence points (such as function calls, and in C++ also function returns). Like const, you can add or remove the volatile qualifier on your pointers with a simple (though non-portable) cast. The qualifier only affects the compiler's behaviour when accessing the storage through that name. You can declare a "volatile int x;" and then "int* ptr = (int*)&x;", and then reads and writes of *ptr will be non-volatile.

Volatile doesn't provide atomicity and it doesn't provide any ordering guarantees with respect to non-volatile accesses, so it's not particularly helpful for multi-threaded programming. Also, most production-quality compilers have optimization bugs related to volatile. It's easy to fall into the trap of thinking that volatile guarantees things about your program's behaviour that it doesn't actually guarantee, and I've heard that it's more or less impossible to write a "completely portable C or C++ app" that uses volatile (because it's difficult to do anything useful with volatile without invoking implementation-defined or undefined behavior). Despite all this, it's still occasionally useful... :lol:
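
To make the const analogy concrete, a small sketch (whether stripping the qualifier off a genuinely volatile object is strictly legal is a separate question; this only shows what the qualifier attaches to):

Code: Select all

volatile int x;                /* x itself is declared volatile      */
int *plain = (int *)&x;        /* cast strips the qualifier          */

int y;
volatile int *vp = (volatile int *)&y;  /* cast adds the qualifier   */

void demo(void)
{
    x = 1;        /* volatile access: must be emitted                */
    *plain = 2;   /* non-volatile lvalue: the compiler may treat
                     this like any ordinary store                    */
    *vp = 3;      /* accessed as if volatile: cannot be eliminated
                     or re-ordered across other volatile accesses    */
}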
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cache pollution when reading/writing hash table

Post by bob »

wgarvin wrote:
Bo Persson wrote:
hgm wrote:I don't think an optimizer could ever decide something is non-volatile. There could be memory-mapped hardware at that address.
This was about using pointer-to-volatile, to persuade the compiler to actually read/write the data.

However, the compiler just might be smart enough to detect that the pointer always points into a malloc'ed hash table, which isn't volatile.
Smart has nothing to do with it: if you write to a volatile-qualified lvalue (such as *myVolatilePtr), then the compiler is not allowed to optimize away that access. It doesn't matter what myVolatilePtr points to; it could point to a local variable declared in this function, and the compiler must still not optimize away that access or re-order it with respect to other volatile accesses or sequence points (such as function calls, and in C++ also function returns). Like const, you can add or remove the volatile qualifier on your pointers with a simple (though non-portable) cast. The qualifier only affects the compiler's behaviour when accessing the storage through that name. You can declare a "volatile int x;" and then "int* ptr = (int*)&x;", and then reads and writes of *ptr will be non-volatile.

Volatile doesn't provide atomicity and it doesn't provide any ordering guarantees with respect to non-volatile accesses, so it's not particularly helpful for multi-threaded programming. Also, most production-quality compilers have optimization bugs related to volatile. It's easy to fall into the trap of thinking that volatile guarantees things about your program's behaviour that it doesn't actually guarantee, and I've heard that it's more or less impossible to write a "completely portable C or C++ app" that uses volatile (because it's difficult to do anything useful with volatile without invoking implementation-defined or undefined behavior). Despite all this, it's still occasionally useful... :lol:
Actually it is _critical_ for parallel programming. If there were no "volatile" qualifier, spin locks would be impossible: the compiler would make the program hang on the first contested lock, since it would think the lock value never changed, read it only once, and spin on that stale value forever.

The bad thing is the occasional misuse of volatile int *x, as opposed to int * volatile x, or even volatile int * volatile x. They are a long way from being the same thing, and I have seen several cases over the past few years where the wrong one was used. Is the pointer volatile, does it point to something that is volatile, or are both true?
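
Spelled out, a sketch (note the caveat about atomicity: volatile alone does not make the acquire safe):

Code: Select all

volatile int *a;           /* pointer to volatile int: the pointee is
                              volatile, the pointer itself is not     */
int * volatile b;          /* volatile pointer to plain int: the
                              pointer is volatile, the pointee is not */
volatile int * volatile c; /* both the pointer and the pointee        */

/* the lock case: the spinning thread must re-read the lock word on
   every iteration, which is what volatile on the pointee buys you    */
void spin_until_free(volatile int *lock)
{
    while (*lock)          /* without volatile, the compiler could
                              hoist the read and loop forever          */
        ;
    /* a real lock acquire also needs an atomic test-and-set (e.g. an
       xchg instruction); volatile alone is not enough on SMP          */
}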