Idea #8430: Optimizing move ordering, very slowly

Discussion of chess software programming and technical issues.

Moderator: Ras

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Idea #8430: Optimizing move ordering, very slowly

Post by Michael Sherwin »

bob wrote:
matthewlai wrote:
sje wrote:Idea #8430: Optimizing move ordering, very slowly

An experiment:

Step 1: Collect a small set of test positions, preferably those where a traditional search will change its PV prediction several times. Run each of these through a program to a fixed depth with A/B pruning disconnected, instead relying upon a full minimax search at each node. During each search and at each searched node, record the moves at each node along with the move's backed-up minimax evaluation. Much patience and fat disks will be necessary.

Step 2: Again, run each position through the program at a fixed depth with A/B turned on this time. Have each search access the stored search results to supply "perfect" move ordering. Record the total searched node counts to get a baseline figure, the best which can be had with all else held equal.

Step 3: Repeatedly run each position through the program again at the same fixed depth with A/B turned on while tinkering with the move ordering. Compare the node counts with the baseline (best) figures. Iteratively modify the move ordering code by examining those nodes where the ordering was very different from the baseline ordering. Perhaps an automated tuning algorithm could help with this.
I have been thinking about something like that for a long time. I will probably try it some time in the next few months, but using a neural network instead, which should be more flexible, and should train much faster.
I have said many times, we are losing SO MUCH information it is not funny. We search billions of nodes, to get a couple of dozen moves for the PV. There's a wealth of information hidden in the tree. Singular extensions was one idea that comes from this sort of "data mining" idea, let the tree tell you which moves are more important. Ditto for which moves to reduce or not reduce, etc...

I personally think there is a lot to be had there, but it is not easy since the tree is so incredibly large today. Can't see the forest for all the damned trees.
My very first attempt at producing a working chess program may contain a clue. It was not a fully legal program, as it could not castle, capture en passant or even promote. It did not even have an evaluation other than material. However, it was slightly more than just a material searcher. It selected between otherwise equal moves based upon one statistic: it kept track of the number of beta cutoffs (+computer/-opponent) for each root move and would play the move that generated the best cutoff score. Captures would only be played if they were materially better. The search algorithm was simple fixed-depth alpha-beta and nothing else. The surprising thing is that the program could hold its own against Chessmaster 5000 (the current Chessmaster at the time) and once, playing Black, was even winning towards the endgame when it encountered a position that it could not handle legally and had to forfeit.

This simple program played very differently from cm5000. Its play was surreal and beautiful in a strange way. It moved its center pawns first, avoided weak pawns and loved forming pawn chains. It also developed its pieces to attacking positions and would punish a careless human. I ran this on a 50 MHz 80386, so it could only search a few plies deep. I think it is time for this to be looked at again.
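The selector described above can be sketched on abstract game trees (a minimal sketch, not chess: leaves are integers holding a material score from the computer's point of view, internal nodes are lists of children, and the +computer/-opponent tally and the tie-break rule are my reading of the description):

```python
INF = 10**9

def alphabeta(node, alpha, beta, maximizing, tally):
    """Fixed-depth alpha-beta over a nested-list tree; every beta cutoff
    in the subtree is credited to the root move being searched:
    +1 when the computer (maximizer) fails high, -1 for the opponent."""
    if isinstance(node, int):          # leaf: material score
        return node
    best = -INF if maximizing else INF
    for child in node:
        v = alphabeta(child, alpha, beta, not maximizing, tally)
        if maximizing:
            best = max(best, v)
            alpha = max(alpha, best)
            if alpha >= beta:
                tally[0] += 1          # computer-side cutoff
                break
        else:
            best = min(best, v)
            beta = min(beta, best)
            if beta <= alpha:
                tally[0] -= 1          # opponent-side cutoff
                break
    return best

def pick_move(root_moves):
    """Among root moves of equal material score, prefer the one whose
    subtree generated the best cutoff balance."""
    scored = {}
    for name, sub in root_moves.items():
        tally = [0]
        scored[name] = (alphabeta(sub, -INF, INF, False, tally), tally[0])
    best = max(s for s, _ in scored.values())
    return max((m for m in scored if scored[m][0] == best),
               key=lambda m: scored[m][1])

# Both root moves score 5 on material; "a" generated a computer-side
# cutoff in the opponent's second reply, so it is preferred.
choice = pick_move({"a": [[3, 5], [7, 2]], "b": [[5, 1]]})
```

On a tie in material, the move whose subtree produced more of our cutoffs (and fewer of the opponent's) wins, which is what gives the statistic its sense of mobility and threat.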
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Prediction failure

Post by bob »

sje wrote:
bob wrote:I have said many times, we are losing SO MUCH information it is not funny. We search billions of nodes, to get a couple of dozen moves for the PV. There's a wealth of information hidden in the tree.
As we all know, if a perfectly ordered A/B tree has n nodes, then the corresponding minimax tree will have O(n^2) nodes.

So if an eight ply A/B search sees a million nodes, then its minimax tree will have on the order of a trillion nodes and will require a disk farm for storage. Just writing the tree to disk could take weeks.
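As an aside (an editor's sketch, not part of the post): the quadratic relation can be checked against the Knuth-Moore minimal-tree formula, which for uniform branching factor b and depth d gives b^ceil(d/2) + b^floor(d/2) - 1 leaves for a perfectly ordered A/B tree versus b^d for full minimax:

```python
def minimal_leaves(b, d):
    # Knuth-Moore bound: leaves visited by perfectly ordered alpha-beta
    return b ** ((d + 1) // 2) + b ** (d // 2) - 1

def minimax_leaves(b, d):
    return b ** d

# An 8-ply search with ~35 legal moves per position:
ab = minimal_leaves(35, 8)      # about 3 million leaves
mm = minimax_leaves(35, 8)      # about 2.25 trillion leaves
```

The two counts differ by roughly a factor of the square root, matching the million-versus-trillion arithmetic above.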

The alternative approach of saving the worst prediction failures based on the mis-ranking of cut-off moves is far more tractable.

For a particular program, how often is a cut-off move ranked, say, fifth, so that four other moves were tried before the node could be tossed? What does the histogram of failures look like? Also, how many moves are tried on average before one of them can even raise alpha? What are the features of a position that most contribute to prediction failure? We should know more about this.
I've actually analyzed the fail high data. I did it two ways.

(1) capture the number of moves searched when a fail-high happens. We already know the first move produces the cutoff in roughly 90% of all fail-high positions. The fall-off is pretty quick after that, but then the counts smoothly decrease as far out as you care to measure, or until you go beyond the maximum branching factor for all nodes in your test.

Here's one sample run from something I was doing yesterday:

fh: 57.1M 4.9M 1.6M 797.8K 498.0K 373.0K 293.0K 257.2K 235.9K 206.9K 178.6K 153.6K 116.4K 101.7K 91.7K 77.5K 69.2K 59.9K 271.6K

57.1M fail highs on first move, 4.9M on second, etc. The last one is bigger because anything > 19 was lumped into that counter.
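A counter array like the "fh:" line above takes only a few lines in the search (a toy negamax over random trees, not Crafty's actual code; note that with no move ordering at all, the first-move share will come out far below the ~90% a real engine gets):

```python
import random

MAX_SLOT = 19        # everything beyond move 19 lumped into the last counter
INF = 10**9
hist = {i: 0 for i in range(1, MAX_SLOT + 1)}

def negamax(node, alpha, beta):
    # node: int leaf or list of child nodes
    if isinstance(node, int):
        return node
    best = -INF
    for i, child in enumerate(node, start=1):
        v = -negamax(child, -beta, -alpha)
        best = max(best, v)
        alpha = max(alpha, v)
        if alpha >= beta:
            hist[min(i, MAX_SLOT)] += 1   # fail high on move i
            break
    return best

def tree(depth, width):
    if depth == 0:
        return random.randint(-100, 100)
    return [tree(depth - 1, width) for _ in range(width)]

random.seed(1)
best = negamax(tree(6, 8), -INF, INF)
```

Printing `hist` in the "fh:" style then shows at a glance how far down the move list the cutoffs are coming from.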

(2) (just did this one yesterday for a different reason). Crafty has "phases" of move ordering. (a) hash move; (b) non-losing captures; (c) killer moves; (d) history-ordered moves (roughly 1/2 of total moves to search here) and then (e) remaining moves. No surprise there except that the hash move is not the biggest phase for producing fail-highs. Non-losing captures is a clear first.

Here's the raw data:

fh: 1:16 2:22.1M 4:17.3M 5:7.9M 6:1.4M 7:620.1K 8:289.9K 10:19.3K 11:17.7M

the format is phase:fail-highs

1 is a root-move only phase. 16 times a move beyond the first move failed high. These are ordered separately so there is no capture/hash/etc move ordering here.

2 is the hash move. 4 is non-losing captures, 5-6-7-8 are the four killer moves, 10 is history moves, and 11 is remaining moves.

Here's one showing non-losing captures as better than hash move:

fh: 2:5.3M 4:28.3M 5:4.8M 6:519.4K 7:225.5K 8:121.1K 10:18.6K 11:3.2M

(note, in the last one I had excluded the root move fail highs as that was not important, it reduced the output just a bit.)

I was fiddling around with a modification to YBW that I had called "young brothers wait longer". I.e. rather than splitting after just one move is searched, I would defer splitting until 2 or even 3 moves had been searched. Overall it is certainly worse. You do far fewer incorrect splits, which translates to reduced search overhead, but you accumulate more wait time while waiting for that condition to be satisfied. I was playing around to see if there was any potential "indicator" that would say "better be a little more cautious splitting here and search a couple of moves first" (my initial thought was that without a hash move, the likelihood of a poor first move goes up; that turns out not to be so true, as captures are still the bigger producer of fail highs).

Enough for now. But there is a ton of data hidden inside these enormous trees, waiting for someone to extract it and use it to improve the search.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Idea #8430: Optimizing move ordering, very slowly

Post by bob »

stegemma wrote:
bob wrote:[...]I personally think there is a lot to be had there, but it is not easy since the tree is so incredibly large today. Can't see the forest for all the damned trees.
Just open a grandmaster's book and you'll find all of what we miss in our software. It's the old conflict between knowledge and brute force, I think.
I don't think today we are missing ANYTHING a GM sees. The problem is there is still much more to see if we directed the search a bit better. Imagine what might happen if we reduced 90% of the moves that should be reduced, and none of the moves that should not. Right now we probably get close to that 90%, but we also reduce too many that should not be reduced, and those reductions hide tactics.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Idea #8430: Optimizing move ordering, very slowly

Post by bob »

Michael Sherwin wrote:
bob wrote:
matthewlai wrote:
sje wrote:Idea #8430: Optimizing move ordering, very slowly

An experiment:

Step 1: Collect a small set of test positions, preferably those where a traditional search will change its PV prediction several times. Run each of these through a program to a fixed depth with A/B pruning disconnected, instead relying upon a full minimax search at each node. During each search and at each searched node, record the moves at each node along with the move's backed-up minimax evaluation. Much patience and fat disks will be necessary.

Step 2: Again, run each position through the program at a fixed depth with A/B turned on this time. Have each search access the stored search results to supply "perfect" move ordering. Record the total searched node counts to get a baseline figure, the best which can be had with all else held equal.

Step 3: Repeatedly run each position through the program again at the same fixed depth with A/B turned on while tinkering with the move ordering. Compare the node counts with the baseline (best) figures. Iteratively modify the move ordering code by examining those nodes where the ordering was very different from the baseline ordering. Perhaps an automated tuning algorithm could help with this.
I have been thinking about something like that for a long time. I will probably try it some time in the next few months, but using a neural network instead, which should be more flexible, and should train much faster.
I have said many times, we are losing SO MUCH information it is not funny. We search billions of nodes, to get a couple of dozen moves for the PV. There's a wealth of information hidden in the tree. Singular extensions was one idea that comes from this sort of "data mining" idea, let the tree tell you which moves are more important. Ditto for which moves to reduce or not reduce, etc...

I personally think there is a lot to be had there, but it is not easy since the tree is so incredibly large today. Can't see the forest for all the damned trees.
My very first attempt at producing a working chess program may contain a clue. It was not a fully legal program, as it could not castle, capture en passant or even promote. It did not even have an evaluation other than material. However, it was slightly more than just a material searcher. It selected between otherwise equal moves based upon one statistic: it kept track of the number of beta cutoffs (+computer/-opponent) for each root move and would play the move that generated the best cutoff score. Captures would only be played if they were materially better. The search algorithm was simple fixed-depth alpha-beta and nothing else. The surprising thing is that the program could hold its own against Chessmaster 5000 (the current Chessmaster at the time) and once, playing Black, was even winning towards the endgame when it encountered a position that it could not handle legally and had to forfeit.

This simple program played very differently from cm5000. Its play was surreal and beautiful in a strange way. It moved its center pawns first, avoided weak pawns and loved forming pawn chains. It also developed its pieces to attacking positions and would punish a careless human. I ran this on a 50 MHz 80386, so it could only search a few plies deep. I think it is time for this to be looked at again.
Again, as I said. Number of fail highs is something that is generally ignored with alpha/beta. But it might contain useful information that can be extracted. Another issue I don't like is that LMR tends to screw up anything using node counts. The last move searched at the root will have the fewest nodes in the sub-tree, mainly because it occurs so late and gets reduced more. What if it is a potentially good move?
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Idea #8430: Optimizing move ordering, very slowly

Post by Michael Sherwin »

bob wrote:
Michael Sherwin wrote:
bob wrote:
matthewlai wrote:
sje wrote:Idea #8430: Optimizing move ordering, very slowly

An experiment:

Step 1: Collect a small set of test positions, preferably those where a traditional search will change its PV prediction several times. Run each of these through a program to a fixed depth with A/B pruning disconnected, instead relying upon a full minimax search at each node. During each search and at each searched node, record the moves at each node along with the move's backed-up minimax evaluation. Much patience and fat disks will be necessary.

Step 2: Again, run each position through the program at a fixed depth with A/B turned on this time. Have each search access the stored search results to supply "perfect" move ordering. Record the total searched node counts to get a baseline figure, the best which can be had with all else held equal.

Step 3: Repeatedly run each position through the program again at the same fixed depth with A/B turned on while tinkering with the move ordering. Compare the node counts with the baseline (best) figures. Iteratively modify the move ordering code by examining those nodes where the ordering was very different from the baseline ordering. Perhaps an automated tuning algorithm could help with this.
I have been thinking about something like that for a long time. I will probably try it some time in the next few months, but using a neural network instead, which should be more flexible, and should train much faster.
I have said many times, we are losing SO MUCH information it is not funny. We search billions of nodes, to get a couple of dozen moves for the PV. There's a wealth of information hidden in the tree. Singular extensions was one idea that comes from this sort of "data mining" idea, let the tree tell you which moves are more important. Ditto for which moves to reduce or not reduce, etc...

I personally think there is a lot to be had there, but it is not easy since the tree is so incredibly large today. Can't see the forest for all the damned trees.
My very first attempt at producing a working chess program may contain a clue. It was not a fully legal program, as it could not castle, capture en passant or even promote. It did not even have an evaluation other than material. However, it was slightly more than just a material searcher. It selected between otherwise equal moves based upon one statistic: it kept track of the number of beta cutoffs (+computer/-opponent) for each root move and would play the move that generated the best cutoff score. Captures would only be played if they were materially better. The search algorithm was simple fixed-depth alpha-beta and nothing else. The surprising thing is that the program could hold its own against Chessmaster 5000 (the current Chessmaster at the time) and once, playing Black, was even winning towards the endgame when it encountered a position that it could not handle legally and had to forfeit.

This simple program played very differently from cm5000. Its play was surreal and beautiful in a strange way. It moved its center pawns first, avoided weak pawns and loved forming pawn chains. It also developed its pieces to attacking positions and would punish a careless human. I ran this on a 50 MHz 80386, so it could only search a few plies deep. I think it is time for this to be looked at again.
Again, as I said. Number of fail highs is something that is generally ignored with alpha/beta. But it might contain useful information that can be extracted. Another issue I don't like is that LMR tends to screw up anything using node counts. The last move searched at the root will have the fewest nodes in the sub-tree, mainly because it occurs so late and gets reduced more. What if it is a potentially good move?
I agree totally. However, the information is definitely useful; how else would such a simple and limited algorithm be able to "hold its own" against a state-of-the-art program of the time like cm5000? Contained within the statistic is a sense of development, mobility, safety, attack, pawn play and more. Its play, while a bit surreal, looked very intelligent, not at all like a material-only searcher. It even appeared to play with a plan.

So all I'm saying is that the example algorithm may contain a clue that statistical information can be valuable when deciding between moves. If someone has a very simple chess engine that they could modify to duplicate the described algorithm, I promise that it would be very entertaining and maybe inspiring as well. :)

Edit: And yes LMR at the root would make no sense in the algorithm that I described.
hgm
Posts: 28484
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Idea #8430: Optimizing move ordering, very slowly

Post by hgm »

matthewlai wrote:Don't we just have to sort the moves of each sub-tree afterwards, then do an alpha-beta search using those orderings, so that the resulting tree will be minimal?
The point is that sorting by score in general does not give you the smallest tree at all. To get a small tree you should not use the best move as cut move, but the move with the smallest sub-tree that still gets you above beta.

The problem is that this is window-dependent, and the window is determined by what you searched before. It can easily happen that in a cut-node with a large window it would be better not to start searching the cut-move immediately (i.e. with the large window), but first search a cheap move (e.g. one that starts trading Queens, to reduce the branching factor) that will produce a PV score (and thus raise alpha), so that the expensive cut-move can be searched with the smaller window {alpha, beta} instead of {-INF, beta}.

It is a highly non-trivial problem.
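The point in miniature (the scores and subtree sizes below are invented for illustration, and in reality the sizes are themselves window-dependent):

```python
def cheapest_cut_move(moves, beta):
    """moves: (name, true score, subtree node count). Among the moves
    that refute (score >= beta), pick the one with the smallest tree."""
    refuters = [m for m in moves if m[1] >= beta]
    return min(refuters, key=lambda m: m[2], default=None)

moves = [
    ("QxQ",  120, 400),    # small margin, but the queen trade keeps the tree tiny
    ("Nxe5", 300, 9000),   # highest score, but an expensive subtree
    ("h3",   -50, 700),    # does not get above beta at all
]
by_score = max(moves, key=lambda m: m[1])       # score sorting picks Nxe5
by_size = cheapest_cut_move(moves, beta=100)    # size-aware choice picks QxQ
```

Sorting by score searches 9000 nodes to get the cutoff; the size-aware choice gets the same cutoff for 400.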
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Idea #8430: Optimizing move ordering, very slowly

Post by sje »

hgm wrote:The problem is that this is window-dependent, and that the window is determined by what you searched before.
I don't see this.

Starting with a full minimax tree with the move ordering at each node determined beforehand by fully backed-up terminal evaluations, the α/β scan beginning with a (-∞, +∞) window will examine all the moves of each PV node, all the moves of each ALL node, and only the first move of each CUT node. It has to do these moves to prove it found the correct PV. Other than these, no other moves need to be searched. Note that if the tree is only N ply deep, then there are only N PV nodes. Each of these will be visited on the first path down from the root.
hgm
Posts: 28484
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Idea #8430: Optimizing move ordering, very slowly

Post by hgm »

You say "only the first move of each CUT node". But how do you determine what is "first"? Taking the move with the highest score will almost certainly not give you the smallest tree, if you do it in all CUT nodes.

You might be right that doing it in PV nodes is always optimal, though. This would immediately lead you to the PV score. You could then work your way back from the leaves to determine in any CUT node which move that gets you above the PV score has the smallest tree, and propagate that information back towards the root (adding the tree sizes of all moves in ALL nodes).
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Idea #8430: Optimizing move ordering, very slowly

Post by sje »

hgm wrote:You say "only the first move of each CUT node". But how do you determine what is "first"? Taking the move with the highest score will almost certainly not give you the smallest tree, if you do it in all CUT nodes.
At every node, the moves are perfectly ordered by their backed-up values as determined by a prior, full minimax search. So we know that the first move is the best possible, although the α/β scan doesn't know this.

If at a CUT node there are multiple moves with the same highest backed-up value, it still doesn't matter, because any of those equi-optimal moves will cause a cut-off.

Each move from a CUT node will either be a terminal node (bottoming out) or an ALL node.

At each ALL node, the α/β scan must search every move to prove that none will make it into the window or cause a cut-off. Because of the prior minimax full search and the PV already established by the α/β scan, we know what the scan doesn't yet know: none of the moves will qualify.

And of course, each move from an ALL node leads either to a terminal node or a CUT node.

Take a big piece of paper and write down a tree several plies deep with several moves at each node. Randomly assign scores to the terminal nodes. Take a second piece of paper and copy the tree to it, re-ordering all the moves at every node based on their backed-up values. Finally, take a third piece of paper and pretend that you're an α/β search running through the tree from the second piece of paper. Writing down this third tree as it's explored by α/β rules, you'll see that the PV, CUT, and ALL nodes are explored just as I've described.
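The three pieces of paper translate almost directly into code (an editor's sketch: a random uniform tree, children re-ordered best-first by their backed-up minimax values, then plain alpha-beta run over both orderings with a node counter):

```python
import random

INF = 10**9
random.seed(7)

def tree(depth, width):
    # first paper: a tree with random terminal scores
    if depth == 0:
        return random.randint(-100, 100)
    return [tree(depth - 1, width) for _ in range(width)]

def order_by_minimax(node, maximizing):
    """Second paper: back up minimax values and sort the moves of every
    node best-first for the side to move. Returns (value, reordered tree)."""
    if isinstance(node, int):
        return node, node
    scored = [order_by_minimax(c, not maximizing) for c in node]
    scored.sort(key=lambda vc: vc[0], reverse=maximizing)
    return scored[0][0], [c for _, c in scored]

def alphabeta(node, alpha, beta, maximizing, counter):
    """Third paper: plain alpha-beta, counting every node it visits."""
    counter[0] += 1
    if isinstance(node, int):
        return node
    best = -INF if maximizing else INF
    for child in node:
        v = alphabeta(child, alpha, beta, not maximizing, counter)
        if maximizing:
            best = max(best, v)
            alpha = max(alpha, best)
        else:
            best = min(best, v)
            beta = min(beta, best)
        if alpha >= beta:
            break
    return best

t = tree(6, 6)
value, ordered = order_by_minimax(t, True)
raw, perfect = [0], [0]
v1 = alphabeta(t, -INF, INF, True, raw)
v2 = alphabeta(ordered, -INF, INF, True, perfect)
# perfect[0] is the perfect-ordering baseline; raw[0] is typically far larger
```

Instrumenting `alphabeta` to log which of the PV/CUT/ALL cases each visited node falls into reproduces the pattern described above.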
AlvaroBegue
Posts: 932
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Idea #8430: Optimizing move ordering, very slowly

Post by AlvaroBegue »

sje wrote: If at a CUT node there are multiple moves with the same, highest backed-up value, it still doesn't matter because any of those equi-optimal moves will cause a cut-off.
Here's where the optimality might not be achieved by sorting things by score. In a CUT node you are looking for a move that refutes what was just played. There might be a move M1 that refutes it by a large margin of score and another move M2 that refutes it by a small margin, but whose subtree has a small branching factor. In those conditions, you would have explored fewer nodes if you had explored M2 first, but your method would have tried M1 first.

If you have transposition tables, things can get much much more difficult, because whether you explored M1 or M2 first might have an effect on how much effort it is to refute other moves later on. At that point it's hopeless.