PV-move ordering necessary if you have TT-move ordering?

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PV-move ordering necessary if you have TT-move ordering?

Post by hgm »

lithander wrote: Fri Jul 02, 2021 4:28 pm Now in stage 2 the PV is played. It's taken from the tri-table.
I am still a bit confused about what you are doing. When you take it from the triangular array, where exactly do you take the move from? If you have pv[node][ply], the normal procedure is that you would take pv[ROOT][ply] as move to follow in the first branch of the tree that you search, and when you reach the end of that branch, stop looking at it altogether. So it doesn't matter that the triangular array is constantly updated.

If you keep using the moves from the triangular array in all nodes of your search, many nodes would find it already changed compared to what it was in the previous iteration, at least if the new iteration overturns the PV. And do you always use the root PV, or do you also use other rows of the triangular array?
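
(For concreteness, a minimal sketch of that procedure. The names are illustrative, and it assumes the caller also clears followPv as soon as it starts on a second move anywhere along the first branch:)

Code: Select all

Move[] previousPv; // pv[ROOT][..] saved before the new iteration starts
bool followPv;     // true at the root of each iteration, cleared on deviation

Move FirstMoveToTry(int ply, Move ttMove)
{
    if (followPv && ply < previousPv.Length)
        return previousPv[ply]; // leftmost branch: play the old PV move first
    followPv = false;           // off the PV: stop looking at it for good
    return ttMove;              // fall back to the TT move / normal ordering
}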
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: PV-move ordering necessary if you have TT-move ordering?

Post by lithander »

I use the move at pv[Index(depth)] where Index(depth) = depth + (depth - 1) + (depth - 2) + ... + 1;

The above Index(depth) maps to the root of the sub-table for the given depth. I always take move[0] of the respective sub-table, so that it fits the current depth. So, yes: after the first branch is explored, the moves I play are no longer from the PV sequence that the last iteration determined, but whatever a different branch at the same depth has since established as the current PV move.
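
(In closed form that sum is just the triangular number, so the lookup is a one-liner; a sketch, not necessarily the literal MMC source:)

Code: Select all

// Index(depth) = depth + (depth - 1) + ... + 1 = depth * (depth + 1) / 2,
// i.e. slot 0 of the sub-table that belongs to 'depth'
static int Index(int depth) => depth * (depth + 1) / 2;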
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: PV-move ordering necessary if you have TT-move ordering?

Post by lithander »

hgm wrote: Fri Jul 02, 2021 11:43 pm the normal procedure is that you would take pv[ROOT][ply] as move to follow in the first branch of the tree that you search, and when you reach the end of that branch, stop looking at it altogether. So it doesn't matter that the triangular array is constantly updated.
So I've made four builds and kept them running on the same testset tonight:

A: PV-move first (in my old way of doing it), no TT.
B: PV-move in the first branch (what you said is the normal procedure), after that only captures. No TT.
C: PV-move in the first branch, after that the best move from the TT if available.
D: TT move first if available, then a PV-move (in my old way of doing it). [this is MMC 0.5 with no changes]

A is worse than B. And C is the best of all.

So my conclusion is that playing the PV move my way was not providing any benefit at all after the first branch. But playing the old PV first (in the first branch) is such a huge benefit that it masked the inefficiency of continuing to play PV moves after that.

When I tried to rely on TT moves only and didn't play the PV from the tri-table anymore, sometimes a PV move had already been lost from the TT. That's why the combined technique, despite the inefficiency of continuing to play PV moves after the first branch, was providing me with a benefit. And in a close call like that, being able to skip the move generation in 50% of the nodes also helped to make the PV-playing seem worthwhile.

But now I have tested all versions: B is better than A, and C is the best of all.

===D===
Searched 879 positions to depth 11. 32 749 036K nodes visited. Took 27008.855 seconds!
Best move found in 552 / 879 positions!
Operation took 27796.6767s

===C===
Searched 879 positions to depth 11. 24 411 759K nodes visited. Took 23879.176 seconds!
Best move found in 549 / 879 positions!
Operation took 23881.9067s

D is equivalent to the current version 0.5 of MinimalChess. But C explores 30% fewer nodes, is still 15% faster, and found the best move in 2 more positions.

So your expectations were correct: PV-move ordering is not beneficial if the PV move is still in the TT. If the TT is too small or too simple, like in my case, it might be beneficial...
hgm wrote: Thu Jul 01, 2021 9:58 pm ...to make absolutely sure it is, some engines stuff back the PV from the triangular array into the TT after each iteration. The search itself then only has to bother with the TT move.
like hgm suggested in the 2nd post.

...the same effort could probably be expended to come up with a way to better protect the PV moves in the TT (somehow), extract the PV for the GUI directly from the TT, and get rid of the tri-table altogether.

You have both been right from the beginning, but it was still a journey to uncover why exactly my PV-playing version was doing better than the one without it. Now I know! Thanks!

I'll make one of the two above-mentioned changes for MMC 0.5.1, and I will also try using the TT in the Q-search.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PV-move ordering necessary if you have TT-move ordering?

Post by hgm »

lithander wrote: Sat Jul 03, 2021 11:07 am So my conclusion is that playing the PV move my way was not providing any benefit at all after the first branch. But playing the old PV first (in the first branch) is such a huge benefit that it masked the inefficiency of continuing to play PV moves after that.
That makes sense.

And what you say is true: the very simple replacement scheme you are using ('always replace', which I use in micro-Max too, btw) greatly increases the probability that the PV from the previous iteration gets flushed out of the TT, as it was stored in the first branch, and then the entire remainder of the tree got the chance to overwrite it. But if the TT is large enough, this problem disappears; having a better replacement algorithm just means you can greatly shrink the size of the TT to get the same performance.

You must decide for yourself which approach conforms best to your measure of simplicity:

1) Putting extra code in Search() to follow the PV from the triangular array in the first branch of the tree, next to probing the TT and using the TT move.
2) Stuffing the PV from the triangular array back into the TT after each search.
3) Using a TT with buckets of 2, and replace the least deep of those.
4) Just require a very large TT and ignore the problem.

I am biased in favor of (3), as it not only solves the PV problem in the first branch, but also increases the efficiency of the TT through the entire tree. (But in the limit of a very large TT, this advantage of course disappears entirely.) In itself 'shallowest-of-two' replacement is not very complex; it just requires two extra lines of code for manipulating the index to which you map a position:

Code: Select all

index = KEY_TO_INDEX(hashKey);
signature = KEY_TO_SIGNATURE(hashKey); // can, but doesn't have to be entire key
if(TT[index].signature != signature) index ^= 1; // try other entry in 'bucket' if not in first one
if(TT[index].signature == signature) {
  // treat hash hit
} else { // hash miss; prepare replacement
  if(TT[index].depth > TT[index^1].depth) index ^= 1; // prepare to replace least deep entry in bucket
}
If you clear the TT after every move, that is all there is to it, and it is probably simpler and less bug-prone than solution (1). But if you don't clear the TT, the deepest entry in each bucket will eventually get filled with results from deep searches on positions earlier in the game that are no longer reachable, so that your current search can effectively use only one entry of each bucket. Then you are back in the always-replace situation, but with only half the table size. Therefore people who use such 'depth protection' in their replacement schemes usually combine it with 'aging', where they keep track (in the TT) of whether an entry is used in a search. If it has not been used for a number of searches, it gets cleared or replaced.
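
(Such aging could look like this; the age field, searchNr, and the threshold of 4 are all illustrative:)

Code: Select all

int searchNr; // incremented once per search, i.e. once per game move

// on every hash hit, stamp the entry as recently used:
TT[index].age = searchNr;

// when preparing a replacement, treat long-unused entries as empty:
if(searchNr - TT[index^1].age > 4) index ^= 1; // threshold is a guess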

This complicates the code, and what is even worse, requires extra space in your TT entry to keep track of the aging. But there is a simple solution that achieves a good degree of depth protection without the problem that the TT eventually gets poisoned by useless deep entries: 'undercut replacement'. This is similar to 'shallowest-of-two', except that you don't refrain from replacing the deepest of the two when the new depth is exactly one less than the stored depth. So the last line in the code above would become

Code: Select all

  if(TT[index].depth != depth+1 && TT[index].depth > TT[index^1].depth) index ^= 1; // a new depth of exactly stored-1 may replace the deeper entry
This protects the deeper entries, which divert attempts to overwrite them to the other entry in the bucket, but not absolutely. Eventually they will get flushed out of the TT, but the higher their depth, the longer that will take, as stores with the depth required to do it become rarer.
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: PV-move ordering necessary if you have TT-move ordering?

Post by lithander »

hgm wrote: Sat Jul 03, 2021 12:31 pm You must decide for yourself which approach conforms best to your measure of simplicity:

1) Putting extra code in Search() to follow the PV from the triangular array in the first branch of the tree, next to probing the TT and using the TT move.
2) Stuffing the PV from the triangular array back into the TT after each search.
3) Using a TT with buckets of 2, and replace the least deep of those.
4) Just require a very large TT and ignore the problem.
Thanks! I appreciate that you keep my fool's errand of preferring minimal solutions in mind. ;)

(1) is basically version C from my nightly test. (2) is also easy to do, because I already have the TT. Both could serve as reference implementations, so I know what to aim for. (4) I don't like, because using vast amounts of memory isn't minimal. MMC also runs fine on small computers like the Raspberry Pi, and I'd like to keep it that way. I also believe smaller hash tables might have the advantage of being more cache-friendly and thus faster.

So I'm also in favor of option (3).
hgm wrote: Sat Jul 03, 2021 12:31 pm I am biased in favor of (3), as it not only solves the PV problem in the first branch, but also increases the efficiency of the TT through the entire tree. (But in the limit of a very large TT, this advantage of course disappears entirely.) In itself 'shallowest-of-two' replacement is not very complex; it just requires two extra lines of code for manipulating the index to which you map a position:

Code: Select all

index = KEY_TO_INDEX(hashKey);
signature = KEY_TO_SIGNATURE(hashKey); // can, but doesn't have to be entire key
if(TT[index].signature != signature) index ^= 1; // try other entry in 'bucket' if not in first one
if(TT[index].signature == signature) {
  // treat hash hit
} else { // hash miss; prepare replacement
  if(TT[index].depth > TT[index^1].depth) index ^= 1; // prepare to replace least deep entry in bucket
}
That's interesting! So each slot serves the double role of a 1st bucket (for all positions where KEY_TO_INDEX(hashKey) maps to their index directly) and of a 2nd bucket (for all positions where KEY_TO_INDEX(hashKey) ^ 1 maps to their index)? Or does KEY_TO_INDEX only produce even indices, so that there are little gaps in the mapping reserved for buckets?

Having these little gaps is what I wanted to try, because I think that approach could offer a neat solution to the problem of the table filling up with old, deep positions.

Imagine that the KEY_TO_INDEX always maps to even indices. The odd indices map to the 2nd bucket of the preceding index. The 1st bucket would only be overwritten when the depth is bigger.
But after you have computed a move, at the point where you would consider clearing the TT to flush out the old, deep positions, you instead "change" the KEY_TO_INDEX mapping to now always map to odd indices. Now the 1st bucket (the one that is going to store high-depth entries) maps to the former 2nd bucket, where it will find only rather low-depth positions from the previous search. The 2nd bucket will contain the former high-depth entries, but these are now always replaced.

You don't have to clear the table and you don't have to keep track of the age. You just have to toggle between mapping A and B from time to time, and I would expect to do it whenever a search concludes. (Which is also where you would consider clearing the TT, if you did such a thing.)

Currently with one bucket my Index is computed like this:

Code: Select all

int Index(ulong hash, ulong table_size) => (int)(hash % table_size); //indices are [0, 1, 2, 3, ...]
With two buckets that do not alternate it would be like this:

Code: Select all

int Index(ulong hash, ulong table_size) => 2 * (int)(hash % (table_size >> 1)); //indices are [0, 2, 4, 6, ...]
And with two buckets where you can toggle the mapping every new search it would be like this:

Code: Select all

int Index(ulong hash, ulong table_size, bool tick_tock) => 2 * (int)(hash % (table_size >> 1)) + (tick_tock ? 1 : 0); //indices are either [0, 2, 4, 6, ...] or [1, 3, 5, 7, ...]
You just have to make sure tick_tock alternates between 'true' and 'false' between searches. While the old PV would still be there when you need it (at the start of the search), it then gets flushed out of the system eventually, as it is now sitting in the always-replace bucket.
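
(Flipping the flag is then a one-liner once per completed search, e.g.:)

Code: Select all

bool tick_tock = false;

//after each search (i.e. after each game move) the two mappings swap roles
void OnSearchCompleted() => tick_tock = !tick_tock;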

Does that make sense? Or am I overlooking something?
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: PV-move ordering necessary if you have TT-move ordering?

Post by mvanthoor »

lithander wrote: Sat Jul 03, 2021 2:18 pm So I'm also in favor of option (3).
Darn. That took a lot of convincing. Again :evil: :P :lol:

Now I'm wondering what number of entries per bucket is optimal.

In Rust, it's hard to predict how big a struct will be, because Rust can re-arrange and pad the struct members as it sees fit to optimize the layout. Sometimes I get really strange results. I put 17 bytes into a struct and would expect it to be rounded up to the next multiple of 8, which is 24, but the size_of() function says 16, so it is obviously "collapsing" / interleaving bytes in the struct somehow. I've also got a struct somewhere that contains only 12 bytes, which is already a multiple of 4, but Rust pads it to 24 bytes.

I am assuming it doesn't do that just for fun. It makes me wonder if "keep your entire bucket under 64 bytes" is still applicable when using Rust, because the compiler does exactly what it wants.

I could make Rust leave the structs alone and lay them out as C would, with the given order of the members, by using "repr(C)". That is not recommended, however: "repr(C)" is intended to create a C-type struct in case you need to interface with C code that either takes or returns structs.

I now use 4 entries/bucket (a bucket sits at an index, obviously). The TT generally is 256 MB, because CCRL uses that size, even though the engine only gets it filled up half-way in 10s+0.1s testing. Maybe I should test versions with 3 and 2 entries/bucket to see if they are faster.

Also the question... you can have so many entries per bucket that you end up with a "vertical" TT... where there is only one index/bucket with all the entries in it. I wonder where the point lies where an increase in entries/bucket is detrimental to TT performance. As you increase the number of entries/bucket, it takes more and more time trawling through all the entries to find the one to replace. Same as with the killers: if you have more than 2, maybe 3, keeping them all unique costs so much time that the advantage of the extra killer(s) is negated by the time it costs to maintain them.
hgm wrote: Sat Jul 03, 2021 12:31 pm I am biased in favor of (3), as it not only solves the PV problem in the first branch, but also increases the efficiency of the TT through the entire tree.
So you think that having buckets of 2 and aging (which I don't have yet) is enough to keep the PV move from being overwritten in the TT, i.e., that it's not useful to re-stuff the PV into the TT?
lithander wrote: Sat Jul 03, 2021 2:18 pm You don't have to clear the table and you don't have to keep track of the age. You just have to toggle between mapping A and B from time to time, and I would expect to do it whenever a search concludes. (Which is also where you would consider clearing the TT, if you did such a thing.)
Interesting idea... in my tests, clearing the TT and not clearing the TT did not really make a difference when using a 256 MB TT with buckets at a 10s + 0.1s time control. Even using more entries/bucket, or no buckets at all didn't really make a difference. The largest difference from these tests was 7 Elo, but with a 10 Elo error bar.
Currently with one bucket my Index is computed like this:
Are you putting a bucket "across" two indices, if I'm seeing this correctly?

In my case, I actually have a "Bucket" struct, which sits at the TT index, and it contains an array with N entries. (Where N >= 1, at compile time.)
You just have to make sure tick_tock alternates between 'true' and 'false' between searches. While the old PV would still be there when you need it (at the start of the search), it then gets flushed out of the system eventually, as it is now sitting in the always-replace bucket.

Does that make sense? Or am I overlooking something?
It's an interesting idea. You're basically clearing half the TT for each search.

To me, it still feels strange to clear any part of the TT during search, except for overwriting entries that are not useful anymore... Maybe you could try it and see if it's an improvement. I doubt it would be a big improvement when playing super-fast games. The TT doesn't seem to affect those games much. Just _having_ a TT of a somewhat decent size near the end game seems to be enough.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PV-move ordering necessary if you have TT-move ordering?

Post by hgm »

lithander wrote: Sat Jul 03, 2021 2:18 pm Does that make sense? Or am I overlooking something?
What happens in practice with the code I posted is that one of the two slots in the bucket starts to act as an always-replace slot (but you don't know which one), while the other retains the largest depth that ever mapped to the bucket. Almost all entries produced in the search will have depth 0 or 1, and will continually overwrite each other in the always-replace slot. Very rarely a store will overwrite the always-replace slot with a depth higher than what is in the other slot. Then the slots swap roles, and what used to be the depth-preferred slot will very quickly be overwritten with depth 0 or 1.

It doesn't really matter whether the 'primary slot' that you map a position to is even-only or can be both even or odd; which of the slots in the end will contain the high depth that is worth preserving depends on whether the order of the positions mapping to the bucket was such that the highest depth in the bucket was superseded an even or odd number of times.

If I understood the tick-tock method correctly, you want to assign a fixed purpose to the two slots. This would be almost equivalent to erasing all deep entries from the previous iteration. Except that you postpone it until after the PV has been used for the first branch. But a quite large fraction of the erased positions might have been useful. Erasing in general is not good. Except for 'stale' entries that you can be sure are never needed again (e.g. because the root already has fewer pieces). That they were created in the previous search (let alone iteration) is not yet an indication for that. In general entries will be useful for several searches in a row.

If you don't want to keep track of aging in the entry, you could erase them 'stochastically', at such a low rate that on average they will survive for, say, 5 searches. Or just keep track of how many entries of each depth are in the table, and overwrite entries of a given depth when there get to be too many of them.
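
(Stochastic erasure could be a one-liner added to the replacement code posted earlier; the rate is illustrative and would have to be tuned so that deep entries survive about 5 searches on average:)

Code: Select all

Random rng = new Random(); // engine-global

// in the hash-miss branch, after picking the normal victim:
if(TT[index].depth > TT[index^1].depth) index ^= 1; // victimize the shallower slot
if(rng.Next(1000) == 0) index ^= 1;                 // but very rarely the deeper one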
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PV-move ordering necessary if you have TT-move ordering?

Post by hgm »

mvanthoor wrote: Sat Jul 03, 2021 3:03 pm Also the question... you can have so many entries per bucket that you end up with a "vertical" TT... where there is only one index/bucket with all the entries in it. I wonder where the point lies where an increase in entries/bucket is detrimental to TT performance. As you increase the number of entries/bucket, it takes more and more time trawling through all the entries to find the one to replace. Same as with the killers: if you have more than 2, maybe 3, keeping them all unique costs so much time that the advantage of the extra killer(s) is negated by the time it costs to maintain them.
The common design uses linear search within the bucket, which quickly becomes intolerable when the number of slots/bucket goes up. You can tolerate some inefficiency here, because to get the bucket into cache is nearly two orders of magnitude more expensive than anything you do within it (provided you stay in the same cache line). And as cache lines are only 64 bytes, you can never cram so many entries in it that even a linear search would significantly drive up the cost. (In another thread I treated a case where you could hash 41 EGT exceptions into one cache line; there it made sense to treat the bucket as a miniature hash table, so that you could find the sought entry (or conclude it was not there) with just 4 probes.)
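
(Concretely, the probe loop is just this; SLOTS and the 16-byte entry size are for illustration only:)

Code: Select all

const int SLOTS = 4; // 4 x 16-byte entries = one 64-byte cache line
int bucket = KEY_TO_INDEX(hashKey) * SLOTS;
int hit = -1;
for(int i = 0; i < SLOTS; i++)                // a few compares cost next to
    if(TT[bucket + i].signature == signature) // nothing compared to the cache
        { hit = bucket + i; break; }          // miss that loaded the line
// hit == -1 means a miss; the caller then selects a victim slot in this bucket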

I once did test 'shallowest-of-7' replacement; it was worse than 'shallowest-of-4'. But that was not so much because of the increased time it took to find a position within the bucket; it was more a matter of the space that could be used by low-depth entries becoming too small. It is possible to cram 7 entries into a cache line, but then it is better to limit the possible places where a particular entry could be stored to 3 or 4.

You cannot see the difference between various replacement schemes until the nodecount of the tree is at least 10 times larger than the number of TT slots.
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: PV-move ordering necessary if you have TT-move ordering?

Post by mvanthoor »

hgm wrote: Sat Jul 03, 2021 4:51 pm You cannot see the difference between various replacement schemes until the nodecount of the tree is at least 10 times larger than the number of TT slots.
I've already been thinking about doing some tests with a 96 MB or 64 MB TT, because Rustic can almost exactly fill up (= use each entry at least once) a 128 MB TT during a 10s + 0.1s game on an i7-6700K. Then I can test what's better in my case: 2, 3, or 4 entries/bucket. The reason why I want to test it is that I can't correctly determine the size of my TT entry (and thus the entire bucket), because Rust orders and pads struct members how it sees fit to get the best speed. I don't want to get into hand-optimizing struct member order as if this were C in 1976. I'd rather just write my struct with the least amount of bytes possible, then let the compiler have a go at it, and determine empirically whether 2, 3 or 4 entries/bucket is best.

And yes, I've thought about making the bucket another TT, but didn't try it yet... I found that a bit decadent, having a hash table that has hash tables in its entries. I have no proof for it, but for some reason it feels as if that has to be incredibly slow.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
hgm
Posts: 28396
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: PV-move ordering necessary if you have TT-move ordering?

Post by hgm »

mvanthoor wrote: Sat Jul 03, 2021 5:00 pm And yes, I've thought about making the bucket another TT, but didn't try it yet... I found that a bit decadent, having a hash table that has hash tables in its entries. I have no proof for it, but for some reason it feels as if that has to be incredibly slow.
When an entry could be stored anywhere in the bucket, little improvement is possible over a linear search: most hash probes will be misses, and to conclude a miss you have to check all slots where the entry could possibly have gone. And anything you do inside a cache line takes negligible time compared to what it took to access that cache line in the first place.