multi thread
Moderator: Ras
-
- Posts: 1922
- Joined: Thu Mar 09, 2006 12:51 am
- Location: Earth
Re: multi thread
Gian-Carlo Pascutto wrote:
If you think about this more, it should be obvious that the case in which the cache coherency is going to penalize you is exactly the case where you really want a shared table (a miss followed by a store). So the locality for pawn hash could probably go one way or the other, but the cache coherency still matters.

Is it? I don't think so. Just for simplicity, assume 1 entry per cache line (having more entries per line could only hurt anyway). So we probe, and there's a cache miss due to the cache line being invalid. Since it had to be in the cache in the first place for this to happen, that means that the last thing we did was store an entry into it (or zero it, which is irrelevant here). Now there are two cases:
1. We are looking up the same position that we stored last time. The other CPU that overwrote the entry read the entry that we stored, failed the hash key check, and then stored the results for a different position once it was done. Technically the other processor could be storing the same position if we stored the entry after it probed, but this case will be pretty rare, and it hurts anyway. So the entry we needed got overwritten with a useless one.
2. We are looking up a different position. There are two sub-cases:
2a. The other CPU overwrote the entry with the position we want
2b. The other CPU overwrote the entry with a third position (or possibly the first, due to a race condition)
1 is obviously bad. 2b is bad too, but probably 2a happens more, and that is the only case that helps. I'd say that 1 is much more common than 2, though.
Bob does have a good point about shared L3, where each CPU increases the cache footprint. But depending on the size of the table, it's not clear to me that dividing the table up (or even making it shared between a subset of the processors) wouldn't be better.
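[Editor's note: to make the probe-then-store pattern under discussion concrete, here is a minimal C sketch using the same 1-entry-per-cache-line assumption. Every name in it (PawnHashEntry, pawn_table, pawn_probe, pawn_store) is hypothetical, not taken from any engine.]

```c
#include <stdint.h>

typedef struct {
    uint64_t key;            /* pawn hash signature */
    int16_t  score_mg;       /* cached middlegame pawn score */
    int16_t  score_eg;       /* cached endgame pawn score */
    uint8_t  pad[64 - 12];   /* pad the entry to one 64-byte cache line */
} PawnHashEntry;

#define PAWN_TABLE_ENTRIES (1u << 16)                  /* must be a power of two */
static PawnHashEntry pawn_table[PAWN_TABLE_ENTRIES];   /* shared by all threads */

/* Probe: on a hit, copy the entry out. On a miss the caller evaluates the
   pawn structure and calls pawn_store; that store is exactly the write that
   invalidates the line in every other core's cache, which is the coherency
   cost being debated above. Races are ignored in this sketch; a real engine
   would add a lock, a lockless XOR trick, or at least a key sanity check. */
static int pawn_probe(uint64_t key, PawnHashEntry *out)
{
    PawnHashEntry *e = &pawn_table[key & (PAWN_TABLE_ENTRIES - 1)];
    if (e->key == key) { *out = *e; return 1; }
    return 0;
}

static void pawn_store(const PawnHashEntry *fresh)
{
    pawn_table[fresh->key & (PAWN_TABLE_ENTRIES - 1)] = *fresh;  /* miss + store */
}
```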
-
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: multi thread
I have used globally shared pawn hash and eval hashtables for some time. When I switched to allocating separate tables, it made a big difference in NPS scaling. I always wondered why I couldn't get NPS scaling close to 4 on quads (it was 3.2 max). After that I tested again by allocating 2x bigger pawn/eval cache sizes, but still the NPS scaling was not as good. Also, if you share them, you can have collisions which could break things if you don't care to lock them or do a sanity check. Lesson learned: don't share anything except the main transposition table.
Daniel
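[Editor's note: a hedged sketch of what "allocate separate tables" can look like in practice. The types and the thread_alloc_tables helper are invented for illustration; this is not Scorpio's actual code.]

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint64_t key; int32_t score; } PawnHashEntry;  /* simplified */
typedef struct { uint64_t key; int32_t score; } EvalHashEntry;  /* simplified */

typedef struct {
    PawnHashEntry *pawn_table;   /* thread-private: no coherency traffic, no races */
    EvalHashEntry *eval_table;   /* thread-private */
    size_t pawn_mask, eval_mask;
    /* ... other per-thread search state ... */
} SearchThread;

/* Entry counts must be powers of two so (key & mask) indexes the table. */
static int thread_alloc_tables(SearchThread *t, size_t pawn_entries, size_t eval_entries)
{
    t->pawn_table = calloc(pawn_entries, sizeof *t->pawn_table);
    t->eval_table = calloc(eval_entries, sizeof *t->eval_table);
    t->pawn_mask  = pawn_entries - 1;
    t->eval_mask  = eval_entries - 1;
    return t->pawn_table && t->eval_table;
}
```

The trade-off, as discussed above, is that N private copies enlarge the total cache footprint (relevant with a shared L3) and no thread benefits from work another thread already did.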
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: multi thread
bah. didn't see that page 2 at all
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: multi thread
Zach Wegner wrote:
Gian-Carlo Pascutto wrote:
If you think about this more, it should be obvious that the case in which the cache coherency is going to penalize you is exactly the case where you really want a shared table (a miss followed by a store). So the locality for pawn hash could probably go one way or the other, but the cache coherency still matters.
Is it? I don't think so. Just for simplicity, assume 1 entry per cache line (having more entries per line could only hurt anyway). So we probe, and there's a cache miss due to the cache line being invalid. Since it had to be in the cache in the first place for this to happen, that means that the last thing we did was store an entry into it (or zero it, which is irrelevant here). Now there are two cases:
1. We are looking up the same position that we stored last time. The other CPU that overwrote the entry read the entry that we stored, failed the hash key check, and then stored the results for a different position once it was done. Technically the other processor could be storing the same position if we stored the entry after it probed, but this case will be pretty rare, and it hurts anyway. So the entry we needed got overwritten with a useless one.
2. We are looking up a different position. There are two sub-cases:
2a. The other CPU overwrote the entry with the position we want
2b. The other CPU overwrote the entry with a third position (or possibly the first, due to a race condition)
1 is obviously bad. 2b is bad too, but probably 2a happens more, and that is the only case that helps. I'd say that 1 is much more common than 2, though.
Bob does have a good point about shared L3, where each CPU increases the cache footprint. But depending on the size of the table, it's not clear to me that dividing the table up (or even making it shared between a subset of the processors) wouldn't be better.

This can be solved. I always check to see if the current pawn hash signature is the same as the signature the last time I did a probe. If so, I use the copy of that entry I cleverly saved in thread-local memory and don't go back to the table itself, where the entry might well be gone by now... A great majority of the positions are reached using this trick, since most moves are not pawn moves.

One big benefit, however: a cache miss does not mean a memory access. It may well mean (in your last two cases above) that the data gets forwarded from the cache with the good copy, which is way faster than a memory access.
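[Editor's note: a sketch of the thread-local "last entry" trick Bob describes. The idea is his; the names (PawnHashEntry, pawn_probe, shared_pawn_table) and the exact code are invented for illustration.]

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;       /* pawn hash signature */
    int32_t  score;     /* ... plus whatever else the evaluation needs ... */
} PawnHashEntry;

extern PawnHashEntry shared_pawn_table[];  /* shared table, defined elsewhere */
extern size_t shared_pawn_mask;

/* One saved entry per thread; _Thread_local (C11) gives each searcher its own. */
static _Thread_local PawnHashEntry last_entry;

const PawnHashEntry *pawn_probe(uint64_t pawn_key)
{
    /* Most moves are not pawn moves, so this test usually succeeds and we
       never re-read the shared entry, which may have been overwritten. */
    if (last_entry.key == pawn_key)
        return &last_entry;

    PawnHashEntry *e = &shared_pawn_table[pawn_key & shared_pawn_mask];
    if (e->key == pawn_key) {
        last_entry = *e;            /* snapshot the whole entry locally */
        return &last_entry;
    }
    return NULL;                    /* miss: caller recomputes and stores */
}
```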
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: multi thread
Daniel Shawul wrote:
I have used globally shared pawn hash and eval hashtables for some time. When I switched to allocating separate tables, it made a big difference in NPS scaling. I always wondered why I couldn't get NPS scaling close to 4 on quads (it was 3.2 max). After that I tested again by allocating 2x bigger pawn/eval cache sizes, but still the NPS scaling was not as good. Also, if you share them, you can have collisions which could break things if you don't care to lock them or do a sanity check. Lesson learned: don't share anything except the main transposition table.

I have a shared pawn hash table and it has no effect on my NPS scaling whatsoever, which has always been near-optimal. For 8-core boxes it is about as good as it can be. I don't see how pawn hash would cause an NPS scaling issue unless you are using bits and pieces of the hash entry throughout your evaluation, in which case you should make a local copy of the entire entry anyway.
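[Editor's note: a sketch of the contrast Bob draws between reading "bits and pieces" of a shared entry and snapshotting the whole entry once. All names are made up.]

```c
#include <stdint.h>

typedef struct {
    uint64_t key;
    int16_t  score_mg, score_eg;
} PawnHashEntry;

/* Risky under sharing: another thread may store into *shared between the two
   reads, so the two scores can come from two different positions. */
int eval_bits_and_pieces(const PawnHashEntry *shared)
{
    int s = shared->score_mg;
    /* ... lots of other evaluation work; a store may land here ... */
    return s + shared->score_eg;   /* possibly from a different position now */
}

/* Better: one local copy used everywhere, with one key check validating it
   all. (The copy itself can still be torn by a concurrent store; engines
   tolerate the rare garbage, lock, or use a lockless XOR scheme to detect it.) */
int eval_local_copy(const PawnHashEntry *shared, uint64_t key, int *valid)
{
    PawnHashEntry local = *shared;
    *valid = (local.key == key);
    return local.score_mg + local.score_eg;
}
```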
-
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: multi thread
I keep local copies of hash table entries after a successful probe. There is one pawn_record_entry local to each thread, which it uses to keep a copy. In the case of the eval table, I have copies for each thread at each ply, so that I can compare the effect of moves on evaluations. These copies are not shared between the searchers. A mistake I made when comparing NPS scaling was forgetting to allocate 2x larger tables for the dual-core speedup test, 4x for quad, etc. But increasing the table sizes while sharing them was not as effective as un-sharing them and allocating the tables separately.
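[Editor's note: a rough sketch of the per-thread/per-ply copy layout Daniel describes. Every name here (PawnRecord, EvalRecord, Searcher, MAX_PLY) is a guess for illustration, not Scorpio's code.]

```c
#include <stdint.h>

#define MAX_PLY 128

typedef struct { uint64_t key; int32_t score; } PawnRecord;
typedef struct { uint64_t key; int32_t score; } EvalRecord;

typedef struct {
    PawnRecord pawn_record;           /* one copy per thread */
    EvalRecord eval_stack[MAX_PLY];   /* one copy per ply of this thread,
                                         so move effects on the evaluation
                                         can be compared up the stack */
    int ply;
} Searcher;                           /* never shared between threads */
```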
Redoing the test with a shared pawn table did not affect the NPS scaling much (same result as you). But sharing the eval cache did impact it a lot. I have this result but can only guess at the causes:
a) lower hit rate for the eval TT compared to the pawn TT
b) the different storage structure for the pawn/eval copies I mentioned above
I can post numbers after I get my computer back in a few days.
I think that not sharing is also better for cluster computing, NUMA, etc.
Last edited by Daniel Shawul on Thu Jun 04, 2009 2:46 am, edited 1 time in total.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: multi thread
wgarvin wrote:
bah. didn't see that page 2 at all

??
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: multi thread
bob wrote:
wgarvin wrote:
bah. didn't see that page 2 at all
??

I posted the direct link to your lockless hashing page, not noticing that he had already found it and replied.

On the plus side, it led to me wandering through the DTS page for the first time in a couple of years. That page is always interesting to read (though I am in over my head when I try to understand the little details).
-
- Posts: 1260
- Joined: Sat Dec 13, 2008 7:00 pm
Re: multi thread
Zach Wegner wrote:
1. We are looking up the same position that we stored last time. The other CPU that overwrote the entry read the entry that we stored, failed the hash key check, and then stored the results for a different position once it was done. Technically the other processor could be storing the same position if we stored the entry after it probed, but this case will be pretty rare, and it hurts anyway. So the entry we needed got overwritten with a useless one.
2. We are looking up a different position. There are two sub-cases:
2a. The other CPU overwrote the entry with the position we want
2b. The other CPU overwrote the entry with a third position (or possibly the first, due to a race condition)
1 is obviously bad. 2b is bad too, but probably 2a happens more, and that is the only case that helps. I'd say that 1 is much more common than 2, though.

Only if you're using tiny tables because they're CPU-local... with a large table and hit rates over 90%, the typical case should be 2, no?