petero2 wrote:
mar wrote:
petero2 wrote: On current CPU architectures they are therefore free.
I'm not quite sure about that. If you have many threads that RMW the same cache line, false sharing will kill performance.
There are no RMW operations involved here though; the code only performs relaxed loads and stores. Here are the relevant parts of the source code I use:
Code:
#include <atomic>
#include <cstdint>

using U64 = std::uint64_t;   // 64-bit unsigned integer

// One slot in the shared transposition table. The key slot holds key ^ data
// so that a torn (half-updated) entry can be detected when probing.
struct TTEntryStorage {
    std::atomic<U64> key;
    std::atomic<U64> data;
};

// Thread-local copy of one entry. (TTEntry is a nested class of
// TranspositionTable in the full source; only the relevant members are shown.)
class TTEntry {
    U64 key;
    U64 data;
};

void
TranspositionTable::TTEntry::store(TTEntryStorage& ent) {
    ent.key.store(key ^ data, std::memory_order_relaxed);
    ent.data.store(data, std::memory_order_relaxed);
}

void
TranspositionTable::TTEntry::load(const TTEntryStorage& ent) {
    key = ent.key.load(std::memory_order_relaxed);
    data = ent.data.load(std::memory_order_relaxed);
    key ^= data;
}
When I compile this I get the following assembly code:
Code:
store:
        movq 8(%rdi), %rax        # rax = this->data
        xorq (%rdi), %rax         # rax = this->key ^ this->data
        movq %rax, (%rsi)         # ent.key = key ^ data    (plain store)
        movq 8(%rdi), %rax        # redundant reload of this->data
        movq %rax, 8(%rsi)        # ent.data = data          (plain store)
        ret
load:
        movq (%rsi), %rax         # rax = ent.key            (plain load)
        movq %rax, (%rdi)         # this->key = ent.key
        movq 8(%rsi), %rax        # rax = ent.data           (plain load)
        xorq %rax, (%rdi)         # this->key ^= ent.data    (second write to this->key)
        movq %rax, 8(%rdi)        # this->data = ent.data
        ret
There are no locked instructions and no memory barriers. I get almost the same assembly code if I change the TTEntryStorage members from std::atomic<U64> to plain U64, as sketched below.
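The non-atomic variant looks roughly like this (a reconstruction for this post; the only real change is dropping the std::atomic wrappers and the explicit load/store calls):
Code:
struct TTEntryStorage {
    U64 key;
    U64 data;
};

void
TranspositionTable::TTEntry::store(TTEntryStorage& ent) {
    ent.key = key ^ data;
    ent.data = data;
}

void
TranspositionTable::TTEntry::load(const TTEntryStorage& ent) {
    key = ent.key;
    data = ent.data;
    key ^= data;
}
Compiling that variant gives: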
Code:
store:
        movq 8(%rdi), %rax        # rax = this->data
        movq %rax, %rdx
        xorq (%rdi), %rdx         # rdx = this->key ^ this->data
        movq %rax, 8(%rsi)        # ent.data = data
        movq %rdx, (%rsi)         # ent.key = key ^ data
        ret
load:
        movq (%rsi), %rax         # rax = ent.key
        movq 8(%rsi), %rdx        # rdx = ent.data
        xorq %rdx, %rax           # rax = ent.key ^ ent.data = original key
        movq %rdx, 8(%rdi)        # this->data = data
        movq %rax, (%rdi)         # this->key = key
        ret
The only difference is that the compiler misses an optimization opportunity when using memory_order_relaxed: it performs an extra read from thread-local memory in the store case, and an extra write to thread-local memory in the load case (both of which will be L1 cache hits). I doubt this has a measurable speed impact, but it can be fixed at the cost of making the source code slightly longer:
Code:
void
TranspositionTable::TTEntry::store(TTEntryStorage& ent) {
    U64 k = key;     // read the members into locals once, so the
    U64 d = data;    // compiler does not reload them from memory
    ent.key.store(k ^ d, std::memory_order_relaxed);
    ent.data.store(d, std::memory_order_relaxed);
}

void
TranspositionTable::TTEntry::load(const TTEntryStorage& ent) {
    U64 k = ent.key.load(std::memory_order_relaxed);
    U64 d = ent.data.load(std::memory_order_relaxed);
    k ^= d;          // reconstruct the real key in a register
    key = k;         // single write to each member
    data = d;
}
With that source code I get the following assembly code:
Code:
store:
        movq 8(%rdi), %rax
        movq %rax, %rdx
        xorq (%rdi), %rdx
        movq %rdx, (%rsi)
        movq %rax, 8(%rsi)
        ret
load:
        movq (%rsi), %rax
        movq 8(%rsi), %rdx
        xorq %rdx, %rax
        movq %rdx, 8(%rdi)
        movq %rax, (%rdi)
        ret
Now the only difference is that the compiler swapped the order of the last two move instructions in the store function.
In conclusion, using memory_order_relaxed for lockless hashing can be made to have zero overhead on the x64 architecture, compared to using plain non-atomic loads and stores.
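For completeness, here is a minimal self-contained sketch of why the key is XOR-ed with the data in the first place: the reader reconstructs the key and compares it with the probed position key, so a torn (half-updated) entry is simply rejected as a miss. The names Slot, publish, probe and posKey are made up for this example; they are not from my engine.
Code:
#include <atomic>
#include <cstdint>

using U64 = std::uint64_t;

// One shared table slot; same idea as TTEntryStorage above.
struct Slot {
    std::atomic<U64> key;    // holds posKey ^ data
    std::atomic<U64> data;
};

// Writer: publish (posKey, data) using plain relaxed stores.
void publish(Slot& s, U64 posKey, U64 data) {
    s.key.store(posKey ^ data, std::memory_order_relaxed);
    s.data.store(data, std::memory_order_relaxed);
}

// Reader: reconstruct the key and compare it with the probed position key.
// If the two loads see parts of two different writes, the reconstructed key
// will almost certainly not match posKey, so the entry is treated as a miss.
bool probe(const Slot& s, U64 posKey, U64& dataOut) {
    U64 k = s.key.load(std::memory_order_relaxed);
    U64 d = s.data.load(std::memory_order_relaxed);
    dataOut = d;
    return (k ^ d) == posKey;
}
Since the keys are 64 bits, the chance that a torn entry still reconstructs to the probed key is negligible in practice.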