multi thread

Discussion of chess software programming and technical issues.

Moderator: Ras

Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: multi thread

Post by Zach Wegner »

Gian-Carlo Pascutto wrote:
Zach Wegner wrote: 1. We are looking up the same position that we stored last time. The other CPU that overwrote the entry read the entry that we stored, failed the hash key check, and then stored the results for a different position once it was done. Technically the other processor could be storing the same position if we stored the entry after it probed, but this case will be pretty rare, and it hurts anyway. So the entry we needed got overwritten with a useless one.

2. We are looking up a different position. There are two sub-cases:
2a. The other CPU overwrote the entry with the position we want.
2b. The other CPU overwrote the entry with a third position (or possibly the first, due to a race condition).

1 is obviously bad. 2b is bad too, but probably 2a happens more, and that is the only case that helps. I'd say that 1 is much more common than 2 though.
Only if you're using tiny tables because they're CPU-local...with a large table and hit rates over 90%, the typical case should be 2, no?
Not sure I understand what you mean--it seems like the opposite to me. The smaller the table is, the fewer bits you are using to index it, and the more likely the top bits are different. As an extreme example, a table with 1 entry will have virtually 100% collisions, and one with 2^64 should theoretically get exactly 0% (if your hash function is good enough). So this makes it more likely that if we're accessing the same index in the table (=same cache line) that it's the same position.
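
To put the indexing point in code form, here's a minimal sketch (the names are made up, not any particular engine's code) of a probe that uses the low bits of the key as the table index and the full key for verification:

Code:

/* The low LOG2_ENTRIES bits of the 64-bit key select the slot (and thus
   the cache line); the full key verifies the entry. A smaller table uses
   fewer index bits, so more distinct positions share each slot and the
   verification check fails more often. */

#include <stdint.h>

typedef struct {
    uint64_t key;     /* full Zobrist key, for verification */
    int32_t  score;   /* cached pawn-structure score        */
} pawn_entry_t;

#define LOG2_ENTRIES 20                         /* 2^20-entry table */
#define TABLE_MASK   ((1ULL << LOG2_ENTRIES) - 1)

static pawn_entry_t pawn_table[1ULL << LOG2_ENTRIES];

/* Returns 1 and fills *score on a hit; 0 on an empty slot or collision. */
int pawn_probe(uint64_t key, int32_t *score)
{
    pawn_entry_t *e = &pawn_table[key & TABLE_MASK];
    if (e->key == key) {
        *score = e->score;
        return 1;
    }
    return 0;
}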

Just to make sure we're on the same page, I'll lay out exactly what's happening in each case:

Case 1:
  • CPU0 probes pawn position P in the pawn table entry which maps to cache line L and misses. L is marked as exclusive.
  • CPU0 evaluates the position and writes it back to memory. L is marked as modified.
  • CPU1 probes pawn position P', which also maps to L; L is invalid in 1's cache. L is marked as owned in 0 and shared in 1.
  • CPU1 evaluates and stores the entry. L is set to modified in 1 and invalid in 0.
  • CPU0 probes again with position P; L is set to owned in 1 and shared in 0.
3 cache misses total, 1 that goes to main memory and 2 to the other cache. 3 pawn table misses, 0 hits.

Case 2a:
  • CPU0 probes pawn position P' in the pawn table entry which maps to cache line L and misses. L is marked as exclusive.
  • CPU0 evaluates the position and writes it back to memory. L is marked as modified.
  • CPU1 probes pawn position P, which also maps to L. L is marked as owned in 0 and shared in 1.
  • CPU1 evaluates and stores the entry. L is set to modified in 1, and invalid in 0.
  • CPU0 probes again, this time with position P; L is set to owned in 1 and shared in 0, and CPU0 gets a pawn table hit.
3 misses, 1 from main memory, 2 from other caches. 2 pawn table misses, 1 hit.

Case 2b:
  • CPU0 probes pawn position P' in the pawn table entry which maps to cache line L and misses. L is marked as exclusive.
  • CPU0 evaluates the position and writes it back to memory. L is marked as modified.
  • CPU1 probes pawn position P'', which also maps to L. L is marked as owned in 0 and shared in 1.
  • CPU1 evaluates and stores the entry. L is set to modified in 1, and invalid in 0.
  • CPU0 probes again, this time with position P; L is set to owned in 1 and shared in 0.
3 misses, 1 from main memory, 2 from other caches. 3 pawn table misses, 0 hits.

You or Bob might need to check the coherency states; I think that's all right, though.
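
For reference, "one entry maps to one cache line" above assumes an entry padded and aligned to 64 bytes, along these lines (an illustrative layout, not Crafty's or anyone else's actual one):

Code:

/* A pawn hash entry padded and aligned to one 64-byte cache line, so each
   probe or store touches exactly one line and the state transitions above
   apply to exactly one table slot. GCC/Clang attribute syntax. */

#include <stdint.h>

typedef struct {
    uint64_t key;         /* verification key              */
    int32_t  score;       /* pawn-structure score          */
    uint8_t  passed[2];   /* e.g. passed-pawn file masks   */
    uint8_t  pad[50];     /* pad the entry out to 64 bytes */
} __attribute__((aligned(64))) pawn_entry_t;

/* Compile-time check that the entry really is one cache line. */
typedef char entry_fills_one_line[sizeof(pawn_entry_t) == 64 ? 1 : -1];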
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: multi thread

Post by Gian-Carlo Pascutto »

Ah, we are talking completely past each other. In my opinion, the case where you have a pawn hash collision is uninteresting because it happens so seldom when you have a big table and good hit rates.

We got confused because I said the interesting case is the one where cache coherency matters. But maybe it's a misunderstanding on my part. The case that interests me is CPU1 searching the position and storing the result. This would cause it to be put into its cache (L1 and possibly shared L2/L3 depending on CPU vendor). Now CPU2 does a probe on the same position. What happens now?

From what I understand, CPU2 will do a memory read and CPU1 will snoop that request and return the result (from its cache), adjusting states.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: multi thread

Post by Zach Wegner »

Gian-Carlo Pascutto wrote:Ah, we are talking completely past each other. In my opinion, the case where you have a pawn hash collision is uninteresting because it happens so seldom when you have a big table and good hit rates.

We got confused because I said the interesting case is the one where cache coherency matters. But maybe it's a misunderstanding on my part. The case that interests me is CPU1 searching the position and storing the result. This would cause it to be put into its cache (L1 and possibly shared L2/L3 depending on CPU vendor). Now CPU2 does a probe on the same position. What happens now?

From what I understand, CPU2 will do a memory read and CPU1 will snoop that request and return the result (from its cache), adjusting states.
OK, I see what you mean now. I was talking about a cache line that was already in the cache, but was invalidated by another store. I don't know enough about microarchitectures to say whether that happens or not. I'd say most definitely if it's a shared L2/L3, but I'm really not sure about multi-socket situations (or even Intel's 2x2-core quad-core chips). It seems to me that would be too slow.

And also, since you have good hit rates, the case you are talking about is going to be pretty rare too, since most likely CPU2 has already stored it in the table. :)
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: multi thread

Post by Zach Wegner »

bob wrote:
Zach Wegner wrote:1. We are looking up the same position that we stored last time. The other CPU that overwrote the entry read the entry that we stored, failed the hash key check, and then stored the results for a different position once it was done. Technically the other processor could be storing the same position if we stored the entry after it probed, but this case will be pretty rare, and it hurts anyway. So the entry we needed got overwritten with a useless one.
This can be solved. I always check to see if the current pawn hash signature is the same as the signature the last time I did a probe. If so, I use the copy of that entry I cleverly saved in thread-local memory and don't go back to the table itself, where the entry might well be gone by now... A great majority of the probes are satisfied by this trick, since most moves are not pawn moves.
This sounded right at first, but it by no means solves the problem. Exactly how many times does that happen? In other words, what is the hit rate (one CPU) on a pawn hash table with 1 entry?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: multi thread

Post by bob »

Zach Wegner wrote:
bob wrote:
Zach Wegner wrote:1. We are looking up the same position that we stored last time. The other CPU that overwrote the entry read the entry that we stored, failed the hash key check, and then stored the results for a different position once it was done. Technically the other processor could be storing the same position if we stored the entry after it probed, but this case will be pretty rare, and it hurts anyway. So the entry we needed got overwritten with a useless one.
This can be solved. I always check to see if the current pawn hash signature is the same as the signature the last time I did a probe. If so, I use the copy of that entry I cleverly saved in thread-local memory and don't go back to the table itself, where the entry might well be gone by now... A great majority of the probes are satisfied by this trick, since most moves are not pawn moves.
This sounded right at first, but it by no means solves the problem. Exactly how many times does that happen? In other words, what is the hit rate (one CPU) on a pawn hash table with 1 entry?
Very high. I don't have data handy, but you can certainly test the idea. Remember, this is not quite the same as one entry, since each thread would have its own one-entry table, plus a big one behind it that is shared. It is a significant speedup, however.
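
A minimal sketch of the two-level scheme being described (the names and layout are mine, not Crafty's actual code): each thread checks a private copy of its last probe before touching the shared table.

Code:

/* Per-thread one-entry cache in front of the shared pawn hash table.
   Most moves don't change the pawn structure, so the private copy
   satisfies the common case with no shared-memory traffic at all.
   (__thread is the GCC/Clang spelling of thread-local storage; a real
   SMP engine would also guard the shared read against torn entries.) */

#include <stdint.h>

typedef struct {
    uint64_t key;
    int32_t  score;
} pawn_entry_t;

extern pawn_entry_t shared_pawn_table[];   /* the big shared table  */
extern uint64_t     table_mask;            /* number of entries - 1 */

static __thread pawn_entry_t last_entry;   /* this thread's saved copy */

int pawn_probe(uint64_t key, int32_t *score)
{
    if (last_entry.key == key) {           /* same signature as last probe */
        *score = last_entry.score;
        return 1;
    }
    pawn_entry_t e = shared_pawn_table[key & table_mask];
    if (e.key == key) {
        last_entry = e;                    /* refresh the private copy */
        *score = e.score;
        return 1;
    }
    return 0;                              /* miss: caller evaluates and stores */
}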

Here's the issue that is being overlooked, however: this is a trade-off. On one side, we save a _lot_ of effort when we get a pawn hash hit. It lets us get away with a lot of very time-consuming code in EvaluatePawns(), because it is executed so rarely that it doesn't affect CPU time at all. On the other side, we are talking about cache issues. Compare the time for one cache to forward a block to another cache against the advantage of globally never evaluating the same pawn structure more than once, and the cache issue becomes irrelevant.

Consider how many _new_ hash entries you add to this table, and it becomes apparent that this data is almost exclusively read-only. So the cache forwarding issue is not significant.

This is a real issue when we talk about a single counter mixed in with a lot of read-only values, so that every time the counter is modified, the entire block has to be forwarded all over creation because it is in every cache. But here it does not appear to be a problem. My NPS scaling is nearly perfect, which suggests that there are no excessive cache issues as the number of threads goes up.
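
The "single counter mixed in with read-only values" problem is the classic false-sharing pattern. A sketch of the problem and the usual fix (illustrative layouts, not Crafty's):

Code:

/* False sharing: a constantly-written counter sitting in the same cache
   line as read-mostly data forces the whole line to bounce between
   caches on every increment. */

#include <stdint.h>

struct bad_layout {
    uint64_t node_count;    /* written millions of times per second...  */
    uint64_t zobrist[16];   /* ...dragging these read-only values along */
};

/* The usual fix: give each hot, per-thread counter its own cache line,
   so one thread's writes never invalidate another's data. */
#define MAX_THREADS 64

struct padded_counter {
    uint64_t count;
    char     pad[64 - sizeof(uint64_t)];
} __attribute__((aligned(64)));

struct padded_counter node_counts[MAX_THREADS];   /* one per thread */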
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: multi thread

Post by bob »

Zach Wegner wrote:
Gian-Carlo Pascutto wrote:
Zach Wegner wrote: 1. We are looking up the same position that we stored last time. The other CPU that overwrote the entry read the entry that we stored, failed the hash key check, and then stored the results for a different position once it was done. Technically the other processor could be storing the same position if we stored the entry after it probed, but this case will be pretty rare, and it hurts anyway. So the entry we needed got overwritten with a useless one.

2. We are looking up a different position. There are two sub-cases:
2a. The other CPU overwrote the entry with the position we want.
2b. The other CPU overwrote the entry with a third position (or possibly the first, due to a race condition).

1 is obviously bad. 2b is bad too, but probably 2a happens more, and that is the only case that helps. I'd say that 1 is much more common than 2 though.
Only if you're using tiny tables because they're CPU-local...with a large table and hit rates over 90%, the typical case should be 2, no?
Not sure I understand what you mean--it seems like the opposite to me. The smaller the table is, the fewer bits you are using to index it, and the more likely the top bits are different. As an extreme example, a table with 1 entry will have virtually 100% collisions, and one with 2^64 should theoretically get exactly 0% (if your hash function is good enough). So this makes it more likely that if we're accessing the same index in the table (=same cache line) that it's the same position.

Just to make sure we're on the same page, I'll lay out exactly what's happening in each case:

Case 1:
  • CPU0 probes pawn position P in the pawn table entry which maps to cache line L and misses. L is marked as exclusive.
  • CPU0 evaluates the position and writes it back to memory. L is marked as modified.
  • CPU1 probes pawn position P', which also maps to L; L is invalid in 1's cache. L is marked as owned in 0 and shared in 1.
  • CPU1 evaluates and stores the entry. L is set to modified in 1 and invalid in 0.
  • CPU0 probes again with position P; L is set to owned in 1 and shared in 0.
3 cache misses total, 1 that goes to main memory and 2 to the other cache. 3 pawn table misses, 0 hits.

Case 2a:
  • CPU0 probes pawn position P' in the pawn table entry which maps to cache line L and misses. L is marked as exclusive.
  • CPU0 evaluates the position and writes it back to memory. L is marked as modified.
  • CPU1 probes pawn position P, which also maps to L. L is marked as owned in 0 and shared in 1.
  • CPU1 evaluates and stores the entry. L is set to modified in 1, and invalid in 0.
  • CPU0 probes again, this time with position P; L is set to owned in 1 and shared in 0, and CPU0 gets a pawn table hit.
3 misses, 1 from main memory, 2 from other caches. 2 pawn table misses, 1 hit.

Case 2b:
  • CPU0 probes pawn position P' in the pawn table entry which maps to cache line L and misses. L is marked as exclusive.
  • CPU0 evaluates the position and writes it back to memory. L is marked as modified.
  • CPU1 probes pawn position P'', which also maps to L. L is marked as owned in 0 and shared in 1.
  • CPU1 evaluates and stores the entry. L is set to modified in 1, and invalid in 0.
  • CPU0 probes again, this time with position P; L is set to owned in 1 and shared in 0.
3 misses, 1 from main memory, 2 from other caches. 3 pawn table misses, 0 hits.

You or Bob might need to check the coherency states; I think that's all right, though.
It is pretty close. Intel doesn't have an O state, however; that is AMD. Intel uses F (forward), which is roughly the same, in that the F state says this cache has to snoop reads and forward the block to any other cache that tries to read it from main memory...

However, see my other post. We already know we get over 99% phash hits. I just ran a quick test and it is actually over 99.9% for the short test I ran. This means that less than 0.1% of the time we actually store something, which makes the store/forward/invalidate stuff _extremely_ rare. Compare that very rare case against the benefit of _never_ evaluating a pawn position more than once...
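
For what it's worth, numbers like these come from trivial instrumentation along these lines (the counter names are mine):

Code:

/* Simple probe/hit counters for the pawn hash. Under SMP these
   unsynchronized increments can drop a few counts, but that is fine
   for a rough hit-rate measurement like the one above. */

#include <stdint.h>
#include <stdio.h>

static uint64_t pawn_probes, pawn_hits;

/* In pawn_probe(): pawn_probes++ on entry, pawn_hits++ on a hit. */

void report_pawn_hash_stats(void)
{
    if (pawn_probes)
        printf("pawn hash: %.3f%% hits (%llu of %llu probes)\n",
               100.0 * (double)pawn_hits / (double)pawn_probes,
               (unsigned long long)pawn_hits,
               (unsigned long long)pawn_probes);
}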
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: multi thread

Post by bob »

Zach Wegner wrote:
Gian-Carlo Pascutto wrote:Ah, we are talking completely past each other. In my opinion, the case where you have a pawn hash collision is uninteresting because it happens so seldom when you have a big table and good hit rates.

We got confused because I said the interesting case is the one where cache coherency matters. But maybe it's a misunderstanding on my part. The case that interests me is CPU1 searching the position and storing the result. This would cause it to be put into its cache (L1 and possibly shared L2/L3 depending on CPU vendor). Now CPU2 does a probe on the same position. What happens now?

From what I understand, CPU2 will do a memory read and CPU1 will snoop that request and return the result (from its cache), adjusting states.
OK, I see what you mean now. I was talking about a cache line that was already in the cache, but was invalidated by another store. I don't know enough about microarchitectures to say whether that happens or not. I'd say most definitely if it's a shared L2/L3, but I'm really not sure about multi-socket situations (or even Intel's 2x2-core quad-core chips). It seems to me that would be too slow.

And also, since you have good hit rates, the case you are talking about is going to be pretty rare too, since most likely CPU2 has already stored it in the table. :)
That's what makes the Xeon MP processors more expensive and later to arrive than the initial versions. Yes, with a multi-socket system the caches have to "chat". And they have to do it in a big hurry, with low latency.