hgm wrote: Or is the problem the specific way it is programmed in Crafty, not making a copy, but following the pointer twice, relying on the optimizer having made the copy in a register anyway? The latter seems risky. Doing a second memory (= L1 cache) access is actually the fastest way to do this on many x86 CPUs; code that only uses architectural registers often runs very slowly there, and you seem to need a mix of memory operands and register operands to reach maximum throughput. If I were to hand-optimize the code, I would probably opt for a second memory access, especially if the 64-bit data part was in a register but I needed only the single byte of it that stores the draft. Just reading the required byte from memory would definitely consume fewer CPU resources than shifting and masking the value stored in the register.
Yes, that was the point I was trying to make about L1 being so fast -- the compiler likely knows that reading from it a second time is quite cheap.
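To make that concrete, here is a minimal sketch in C. The entry layout and field names (a 64-bit key plus a 64-bit data word with the draft in bits 48..55) are assumptions for illustration, not Crafty's actual layout, and the byte-load variant additionally assumes a little-endian target.

[code]
#include <stdint.h>

/* Hypothetical packed hash entry: 64-bit key plus a 64-bit data
   word, with the draft assumed to live in bits 48..55.           */
typedef struct {
    uint64_t key;
    uint64_t data;
} HashEntry;

/* Extract the draft from a data word already held in a register:
   costs a shift and a mask.                                       */
static inline int draft_from_register(uint64_t data) {
    return (int)((data >> 48) & 0xFF);
}

/* Re-read just the one byte from memory instead: the entry was
   touched a moment ago, so this is almost certainly an L1 hit,
   and no shift/mask work is needed (little-endian assumed).       */
static inline int draft_from_memory(const HashEntry *e) {
    return ((const unsigned char *)&e->data)[6];
}
[/code]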
But without the volatile, even making a copy of the hash entry into a local variable may not protect you from compiler optimizations. By alias analysis, the compiler knows that nothing else has written to the original hash slot (it considers only the "single-threaded" code you are executing and isn't aware of other threads). It might choose to keep each member of the structure in a register, so the "copy" consists of read(s) from the hash table into registers. It might then decide to split the live range of those registers and/or spill them to the stack. At that point it will notice that it can re-create the value by reading it straight from the original memory, and if it does that it doesn't have to actually spill the register (it saves a store by rematerializing from the original location instead of loading from the local variable on the stack). Since it knows the original location was recently accessed, it knows it will hit in L1. It only has to preserve the abstract C machine behavior for a "single-threaded" machine; it doesn't know or care that you might actually have other threads. It can easily turn the one read in your source code into two or more reads in the machine code.
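A sketch of the pattern being described, with hypothetical names (hash_table, HashEntry, and the key field assumed to store signature ^ data as in the XOR trick); the comment marks where the compiler is allowed to turn the single source-level copy into multiple reads of the table:

[code]
#include <stdint.h>

typedef struct {
    uint64_t key;    /* assumed to hold signature ^ data (XOR trick) */
    uint64_t data;
} HashEntry;

extern HashEntry hash_table[];   /* shared between threads, NOT volatile */

/* Looks safe: copy the entry into a local, then work on the copy. */
int probe_copy(uint64_t signature, uint64_t idx, uint64_t *data_out) {
    HashEntry local = hash_table[idx];

    /* As far as alias analysis can tell, nothing has written to
       hash_table[idx] since the copy, so under register pressure
       the compiler may rematerialize local.key or local.data by
       reloading from the table instead of spilling to the stack.
       If another thread rewrites the entry in between, the data
       that passed the XOR check and the data stored to *data_out
       can come from two different versions of the entry.          */
    if ((local.key ^ local.data) == signature) {
        *data_out = local.data;
        return 1;
    }
    return 0;
}
[/code]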
Volatile is at least a partial solution, and since the hash really is accessed by multiple threads, marking it as volatile seems like a good idea.
Volatile means two things: (1) the compiler can't delete accesses that were in your source code or insert accesses that weren't (so it can't merge them or split them [edit: in particular, it can't hoist them out of loops!]), and (2) it can't re-order volatile accesses with respect to OTHER volatile accesses in your source code. Note that it can and will re-order them with respect to other computations, including non-volatile loads and stores! But marking the hash as volatile should prevent any of the compiler shenanigans I can think of that would break the XOR trick.
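For completeness, a hedged sketch of the same probe with the table volatile-qualified (names and layout again assumed, not Crafty's actual code). The point is just that each source-level read now maps to exactly one machine-level read, so an entry overwritten between the two loads fails the XOR check and is treated as a miss:

[code]
#include <stdint.h>

typedef struct {
    uint64_t key;    /* assumed to hold signature ^ data (XOR trick) */
    uint64_t data;
} HashEntry;

extern volatile HashEntry hash_table[];   /* shared between threads */

int probe_volatile(uint64_t signature, uint64_t idx, uint64_t *data_out) {
    /* Each of these is one volatile read: the compiler may neither
       drop it, duplicate it, nor "refresh" the locals from the
       table later on.                                              */
    uint64_t local_key  = hash_table[idx].key;
    uint64_t local_data = hash_table[idx].data;

    /* If another thread overwrote the entry between the two loads,
       the mixed key/data pair (almost certainly) fails this check,
       so the torn entry is simply treated as a hash miss.          */
    if ((local_key ^ local_data) == signature) {
        *data_out = local_data;
        return 1;
    }
    return 0;
}
[/code]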
At least, as long as your compiler implements volatile correctly...
