Crafty Transposition Table Question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Crafty Transposition Table Question

Post by mcostalba »

bob wrote:
mcostalba wrote:
bob wrote: For me, pawn hashing is the real issue.
Have you tried one pawn hash table per thread?
Why would I want to do that? Pawn scoring is a global concept, not thread-local. That is a similar case to those who say "don't store mate scores in the hash" and such. It is easier to do it right, and it is certainly way more cache-friendly on today's chips that have a shared cache between multiple cores.
Because it could result in a faster engine ;-)

SF (also Glaurung) uses this concept. I understand your reasoning, but unfortunately computers are illiterate and don't care much about good arguments.

Please don't get angry, I'm just joking. What I would like to say is that in this kind of test our intuition can be fallible; for instance, relatively small pawn hash tables and higher cache invalidation rates can be an argument in favor of per-thread pawn hashes.

I am sure you have your "on paper" counter-argument, but the bottom line is, IMHO, that until you test you don't know for sure.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty Transposition Table Question

Post by bob »

mcostalba wrote:
bob wrote:
mcostalba wrote:
bob wrote: For me, pawn hashing is the real issue.
Have you tried one pawn hash table per thread?
Why would I want to do that? Pawn scoring is a global concept, not thread-local. That is a similar case to those who say "don't store mate scores in the hash" and such. It is easier to do it right, and it is certainly way more cache-friendly on today's chips that have a shared cache between multiple cores.
Because it could result in a faster engine ;-)

SF (also Glaurung) uses this concept. I understand your reasoning, but unfortunately computers are illiterate and don't care much about good arguments.

Please don't get angry, I'm just joking. What I would like to say is that in this kind of test our intuition can be fallible; for instance, relatively small pawn hash tables and higher cache invalidation rates can be an argument in favor of per-thread pawn hashes.

I am sure you have your "on paper" counter-argument, but the bottom line is, IMHO, that until you test you don't know for sure.
The thing is, I _have_ tested it, and found it worse in terms of performance. It is not a huge issue, because the speed loss/gain is not really large, but it does have an effect, particularly on processors with a shared L2 (or even L3) cache. For example, our 8-core cluster has dual quad-core chips; each chip has 4 cores with their own L1 data and L1 instruction caches, but a shared L2 cache. Reasonably sized pawn hashes will blow out L2 easily, when in reality you are fetching the same entry from different threads.

Even more importantly, 99% of accesses are probes that hit, so there are very few stores to cause cache invalidation / read-for-ownership / whatever your chip of choice uses for SMP cache coherency. Invalidate traffic is very low for pawn hashing. It is much higher for the normal hash, particularly if one hashes in the q-search (which I do not do). When I originally tried this on our 128-node dual-cpu cluster, there was very little difference; the shared table was just a tiny bit faster overall. But when we got the cluster with the shared-L2 chips, the difference was more pronounced. If it were not for the almost 100% hit rate, this would be an issue, since this is happening out in the q-search, which means a very high frequency of accesses. But the stores become pretty infrequent, and in long searches they often drop way off. I just ran a test that prints a message every time 1,000 stores have been done, at around 5M nodes per second on my dual-core laptop. At 1 minute per move, by the time I got to move 8 or so, I was under 1,000 stores per second in the opening. If you take positions like Kopec #22, this drops way off, as the pawn structure is a little less dynamic compared to the opening positions.

Just my $.02
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty Transposition Table Question

Post by bob »

bob wrote:
mcostalba wrote:
bob wrote:
mcostalba wrote:
bob wrote: For me, pawn hashing is the real issue.
Have you tried one pawn hash table per thread?
Why would I want to do that? Pawn scoring is a global concept, not thread-local. That is a similar case to those who say "don't store mate scores in the hash" and such. It is easier to do it right, and it is certainly way more cache-friendly on today's chips that have a shared cache between multiple cores.
Because it could result in a faster engine ;-)

SF (also Glaurung) uses this concept. I understand your reasoning, but unfortunately computers are illiterate and don't care much about good arguments.

Please don't get angry, I'm just joking. What I would like to say is that in this kind of test our intuition can be fallible; for instance, relatively small pawn hash tables and higher cache invalidation rates can be an argument in favor of per-thread pawn hashes.

I am sure you have your "on paper" counter-argument, but the bottom line is, IMHO, that until you test you don't know for sure.
The thing is, I _have_ tested it, and found it worse in terms of performance. It is not a huge issue, because the speed loss/gain is not really large, but it does have an effect, particularly on processors with a shared L2 (or even L3) cache. For example, our 8-core cluster has dual quad-core chips; each chip has 4 cores with their own L1 data and L1 instruction caches, but a shared L2 cache. Reasonably sized pawn hashes will blow out L2 easily, when in reality you are fetching the same entry from different threads.

Even more importantly, 99% of accesses are probes that hit, so there are very few stores to cause cache invalidation / read-for-ownership / whatever your chip of choice uses for SMP cache coherency. Invalidate traffic is very low for pawn hashing. It is much higher for the normal hash, particularly if one hashes in the q-search (which I do not do). When I originally tried this on our 128-node dual-cpu cluster, there was very little difference; the shared table was just a tiny bit faster overall. But when we got the cluster with the shared-L2 chips, the difference was more pronounced. If it were not for the almost 100% hit rate, this would be an issue, since this is happening out in the q-search, which means a very high frequency of accesses. But the stores become pretty infrequent, and in long searches they often drop way off. I just ran a test that prints a message every time 1,000 stores have been done, at around 5M nodes per second on my dual-core laptop. At 1 minute per move, by the time I got to move 8 or so, I was under 1,000 stores per second in the opening. If you take positions like Kopec #22, this drops way off, as the pawn structure is a little less dynamic compared to the opening positions.

Just my $.02
The change is simple enough that I made it and tried it on my Core 2 Duo, which has a shared L2. It runs at 7.6M normal, 7.5M with the dual pawn hashes, on one position. Each test showed about that same performance loss. I did not try it on anything more exotic, since it would likely be a little worse. I did try it on a box with a much smaller L2 cache, and there it was 3-4% slower overall, probably due to excessive cache thrashing, since everything ends up duplicated in both tables, and in cache.

Still doesn't work for me, it seems.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Crafty Transposition Table Question

Post by mcostalba »

bob wrote: Still doesn't work for me, it seems.
I have done my little homework too. I have changed the code a bit so that on a pawn hash miss, before calculating the new pawn structure value and storing it, I now probe the other threads' tables and check whether I can find a hit there.

Note that at the end I always recalculate the pawn structure and save it in my own pawn table, so I don't get any performance gain; I am only interested in how the hit rate changes in this theoretical optimal case (optimal because in a real implementation I would not recalculate the pawn hash but simply keep the value found elsewhere).
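As a sketch of what this sibling-probing experiment might look like (all names here are hypothetical and illustrative, not the actual SF code; the mask assumes a power-of-two table size):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One pawn table per thread; on a miss in our own table, also probe
// the other threads' tables before recomputing the pawn evaluation.
struct PawnEntry {
    uint64_t key = 0;
    int      value = 0;
};

struct PawnTable {
    std::vector<PawnEntry> entries;   // size must be a power of two
    explicit PawnTable(std::size_t size) : entries(size) {}
    PawnEntry* probe(uint64_t key) {
        PawnEntry* e = &entries[key & (entries.size() - 1)];
        return e->key == key ? e : nullptr;
    }
};

// On a miss, scan the sibling tables; a hit there corresponds to the
// extra ~2% hit rate measured in the post.
PawnEntry* probe_with_siblings(std::vector<PawnTable>& tables,
                               std::size_t myThread, uint64_t key) {
    if (PawnEntry* e = tables[myThread].probe(key))
        return e;                        // normal per-thread hit
    for (std::size_t t = 0; t < tables.size(); ++t)
        if (t != myThread)
            if (PawnEntry* e = tables[t].probe(key))
                return e;                // sibling hit
    return nullptr;                      // true miss: recompute the pawn eval
}
```

In the experiment described above, a sibling hit is still followed by a recompute-and-store, so only the hit-rate counters change, not the speed.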

Without the patch I have a hit rate of 93% on both the pawn and material tables; with the sibling-probing patch I go to 95% in both cases. So not a big gain in hit rate. Anyhow, your argument about a resource-limited L2 may have a point for the final performance difference... although XOR-ing at each access is also slow on 32-bit systems (where I am developing), so it is still an open-ended story...

I also have to store much more than 64 bits of data; a pawn hash entry is actually 64 bytes (including the 8 bytes for the key), so I would think the XOR-ing trick is not applicable.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty Transposition Table Question

Post by bob »

mcostalba wrote:
bob wrote: Still doesn't work for me, it seems.
I have done my little homework too. I have changed the code a bit so that on a pawn hash miss, before calculating the new pawn structure value and storing it, I now probe the other threads' tables and check whether I can find a hit there.

Note that at the end I always recalculate the pawn structure and save it in my own pawn table, so I don't get any performance gain; I am only interested in how the hit rate changes in this theoretical optimal case (optimal because in a real implementation I would not recalculate the pawn hash but simply keep the value found elsewhere).

Without the patch I have a hit rate of 93% on both the pawn and material tables; with the sibling-probing patch I go to 95% in both cases. So not a big gain in hit rate. Anyhow, your argument about a resource-limited L2 may have a point for the final performance difference... although XOR-ing at each access is also slow on 32-bit systems (where I am developing), so it is still an open-ended story...

I also have to store much more than 64 bits of data; a pawn hash entry is actually 64 bytes (including the 8 bytes for the key), so I would think the XOR-ing trick is not applicable.
The XOR trick works for any size; you just XOR together all the 8-byte chunks. A pawn hash entry in Crafty is 32 bytes, BTW, so I just do 3 XORs of the signature against the other 3 chunks of 8 bytes...
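The lockless XOR trick described here can be sketched for a 32-byte entry as follows (field layout and names are illustrative, not Crafty's actual struct):

```cpp
#include <cstdint>

// The stored key is the signature XORed with the three data words, so
// a torn read or write (two threads racing on the same slot without a
// lock) makes the signature check fail instead of returning corrupt
// data. XOR is its own inverse, so a consistent entry XORs back
// exactly to the probing signature.
struct Entry {
    uint64_t key;      // signature ^ data[0] ^ data[1] ^ data[2]
    uint64_t data[3];  // 24 bytes of payload
};

void store(Entry& e, uint64_t sig, const uint64_t d[3]) {
    e.data[0] = d[0];
    e.data[1] = d[1];
    e.data[2] = d[2];
    e.key = sig ^ d[0] ^ d[1] ^ d[2];   // the "3 XORs" from the post
}

bool probe(const Entry& e, uint64_t sig, uint64_t out[3]) {
    out[0] = e.data[0];
    out[1] = e.data[1];
    out[2] = e.data[2];
    // A torn or mismatched entry will not XOR back to the signature.
    return (e.key ^ out[0] ^ out[1] ^ out[2]) == sig;
}
```

A larger entry just means more XOR terms; the check generalizes to any number of 8-byte chunks.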

I don't quite understand your hit rate, however. Mine sticks at 99%+ whenever I watch it... You are just hashing pawn stuff using just pawn locations, I assume??? It would seem that if you probe the other tables, you would be better off just having one shared table, as you still get the read-for-ownership traffic (or whatever your cache does; it is different for AMD, new Intel, and old Intel processors)...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Crafty Transposition Table Question

Post by mcostalba »

bob wrote: You are just hashing pawn stuff using just pawn locations, I assume???
Sorry, I don't understand this; could you please explain?

BTW, what is your pawn hash size (in terms of number of entries)?

Here we have:

Code: Select all

  // Sizes of pawn and material hash tables
  const int PawnTableSize = 16384;
  const int MaterialTableSize = 1024;

Regarding the pawn key, it is incrementally updated in do_move(), which I guess is the standard way of doing it:

Code: Select all

      // Update pawn hash key
      st->pawnKey ^= zobrist[us][PAWN][from] ^ zobrist[us][PAWN][to];
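For completeness, here is a hedged sketch of how the incremental update extends beyond the quiet pawn push in the quoted line: captures and promotions need extra XORs. The zobrist/PAWN names mirror the snippet above; everything else (the flags, `them`, the function itself) is a hypothetical illustration, not the actual code:

```cpp
#include <cstdint>

enum { PAWN = 1 };  // piece-type index, as in the quoted snippet

// Incremental pawn-key update covering quiet pushes, captures of enemy
// pawns, and promotions (where the pawn leaves the board entirely).
uint64_t update_pawn_key(uint64_t pawnKey, const uint64_t zobrist[2][8][64],
                         int us, int them, int from, int to,
                         bool movingPawn, bool capturedPawn, bool promotion) {
    if (movingPawn) {
        pawnKey ^= zobrist[us][PAWN][from];    // our pawn leaves 'from'
        if (!promotion)
            pawnKey ^= zobrist[us][PAWN][to];  // ...and lands on 'to'
    }
    if (capturedPawn)
        pawnKey ^= zobrist[them][PAWN][to];    // enemy pawn removed from 'to'
    return pawnKey;
}
```

Because XOR is its own inverse, applying the same update twice restores the original key, which is what makes the scheme trivially reversible in undo_move().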
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty Transposition Table Question

Post by bob »

mcostalba wrote:
bob wrote: You are just hashing pawn stuff using just pawn locations, I assume???
Sorry, I don't understand this; could you please explain?
What I meant was: does your pawn hash signature include just the locations of the pawns and nothing else? Those low numbers sound like either a very small hash table, or else you are hashing in the king location, which will significantly reduce hits.


BTW, what is your pawn hash size (in terms of number of entries)?
Default is 16K entries, but I generally use something bigger, say 1M entries (32 MB total size). I made the defaults small years ago because users wanted to be able to start it up on small machines. 16K is probably no longer necessary and I should increase it.
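As a quick sanity check of these numbers, assuming the 32-byte Crafty pawn entry mentioned earlier in the thread:

```cpp
#include <cstddef>

// Table memory = number of entries * bytes per entry.
constexpr std::size_t pawnTableBytes(std::size_t entries,
                                     std::size_t entryBytes = 32) {
    return entries * entryBytes;
}

// 16K entries -> 512 KB (the small default); 1M entries -> 32 MB.
constexpr std::size_t smallTable = pawnTableBytes(16384);    // 524288 bytes
constexpr std::size_t largeTable = pawnTableBytes(1 << 20);  // 33554432 bytes
```

These are the same 512K and 32MB figures that appear in the test data later in the thread.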


Here we have:

Code: Select all

  // Sizes of pawn and material hash tables
  const int PawnTableSize = 16384;
  const int MaterialTableSize = 1024;

Regarding the pawn key, it is incrementally updated in do_move(), which I guess is the standard way of doing it:
Same here, if all you use is pawn locations and you don't include the king.

Code: Select all

      // Update pawn hash key
      st->pawnKey ^= zobrist[us][PAWN][from] ^ zobrist[us][PAWN][to];
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Crafty Transposition Table Question

Post by mcostalba »

bob wrote: Default is 16K entries, but I generally use something bigger, say 1M entries (32 MB total size). I made the defaults small years ago because users wanted to be able to start it up on small machines. 16K is probably no longer necessary and I should increase it.
If you already have 99% hits you don't need to increase anything, IMHO; perhaps it could even be better to decrease it a bit to reduce cache pressure.

My 93% comes from the fact that I measure while searching to a fixed depth on a set of different positions, so it is worse than in the real-game case where, from move to move, the positions change less.

I got 93% searching to depth 12 on some midgame positions. If I measure on endgame positions at depth 18, I get 99%.

I have also tested searching with one or two threads, but the results don't change. I think that with bigger search depths (or in real games) the hit rate increases.

BTW, when you speed-tested with one table per thread, did you remove the XOR-ing code? I think you should have; otherwise the test is meaningless.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty Transposition Table Question

Post by bob »

mcostalba wrote:
bob wrote: Default is 16K entries, but I generally use something bigger, say 1M entries (32 MB total size). I made the defaults small years ago because users wanted to be able to start it up on small machines. 16K is probably no longer necessary and I should increase it.
If you already have 99% hits you don't need to increase anything, IMHO; perhaps it could even be better to decrease it a bit to reduce cache pressure.
I have not verified 99% with the default size, since I never use that. I'll try to collect some data and post it later today, using short and long time controls and small, medium, and large pawn hash sizes, to see what the hit rate is in all cases.

Note that there are a lot of possible pawn positions. Something like 2^40 is a rough approximation if you assume 2.5 bits per pawn square x 16 pawn squares. Many of those are meaningless (positions where pawns have "passed through" each other or captured pieces in order to "pass"), but there are a lot of them. If the table is big enough, eventually you store almost all reachable positions. I'll try to quantify the cache pressure issue on my laptop, since it has a 6M L2... More later today.

My 93% comes from the fact that I measure while searching to a fixed depth on a set of different positions, so it is worse than in the real-game case where, from move to move, the positions change less.
I agree there. My observations are gathered while playing real games, where Crafty logs the pawn hash hit rate after each move, which provides realistic continuity between searches and positions.

I got 93% searching to depth 12 on some midgame positions. If I measure on endgame positions at depth 18, I get 99%.

I have also tested searching with one or two threads, but the results don't change. I think that with bigger search depths (or in real games) the hit rate increases.

BTW, when you speed-tested with one table per thread, did you remove the XOR-ing code? I think you should have; otherwise the test is meaningless.
Yes I did, but I can't even measure the cost of the XOR. It is 3 instructions folded into the pipeline and intermingled with other instructions anyway, since no program keeps all the pipes busy all the time. I tested the version without the XOR (not all of the 22.x versions had it; I added new code that broke on bad data somewhere, either at the end of 22.x or in 23.0, and had to add the XOR trick to eliminate the problem). In any case, removing it is a 2-line change, and it doesn't affect the speed at all.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty Transposition Table Question - some data

Post by bob »

OK, I ran 6 tests: three with 512K of memory (16K entries) and three with 32MB (1M entries). I simply played a game at 5 secs/move with each pawn hash size, and I personally played the same moves each time. No pondering. I started at move 10 in a common book position, then played 30 moves beyond that point so that there was "continuity" between each move/position. I then repeated the test at 15 secs/move, and finally at 60 secs/move.

Results look like this:

1. The pawn hash hit percentage is quite high _overall_ for either size. Over the course of the game, I counted total pawn evals and total hash hits, and did not reset the counters. For the three time controls:

512K: 98% 98% 98%
32M: 99% 99% 99%

So not much gain overall. For the first 3 moves of each game:

512K: 91% 94% 91%
32M: 95% 95% 95%

So there is a difference early on, when the pawn structure is more dynamic, but they even out as the structure becomes fixed.

For NPS:

early average:
512K: 3.0M 2.9M 3.1M
32M: 2.9M 3.2M 3.2M

late average:
512K: 3.6M 3.6M 3.6M
32M: 3.7M 3.7M 3.8M

So bigger seems a little faster at longer time controls, which makes sense, since I end up doing more full pawn evals with the smaller hash.