Position learning and opening books

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Position learning and opening books

Post by Zenmastur »

mvk wrote:
mvk wrote:I'm currently in the midst of quantifying some of the book improvement results. One coarse measurement I did last month is where I pitted my engine with different versions of the book (from 4 different years) against each other, and one without book. No opponent learning was active in these test matches, just a static book. 'b<year>' means that a book from that year was used. Results are:

Code: Select all

Rank Name                Elo    +    - games score oppo. draws 
   1 rookie3.9x1-b2015     0   12   12  2000   54%   -22   48% 
   2 rookie3.9x1-b2014   -10   10   10  3000   55%   -41   47% 
   3 rookie3.9x1-b2013   -34    9    9  4000   57%   -79   44% 
   4 rookie3.9x1-b2012   -88   10   10  3000   50%   -88   43% 
   5 rookie3.9x1-bnone  -219   13   13  2000   26%   -61   36% 
This encourages me to believe that there is some kind of progression in the books, but also that that effect starts to taper off.
Returning to the above results once more: I have recently completed a higher resolution repeat of the above experiment. For that I picked 1 book version from every month in the past 4 years (namely, the last version of each month). Then I pitted them against each other with cutechess-cli, always using the same version of my program but letting that play with a different book version. Maximum allowable path error for the move selection is 0.1p. Each match had 1,000 games at 90s+0.5s time control for a total of 159,000 games. There are too many entrants for a full round robin test, therefore each program/book combo 'N' played against version N-1, N-2, N-4, N-8, N-16, etc (and, consequently, also against N+1, N+2, N+4, etc). Unlike the earlier test, I didn't enter a no-book variant.
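The power-of-two pairing scheme described above (each entrant N playing N-1, N-2, N-4, ...) can be sketched roughly as follows; the function name is hypothetical and this only illustrates the idea, not mvk's actual tooling:

```python
def power_of_two_pairings(n_versions):
    """Hypothetical sketch of the pairing scheme described above: each
    version i plays versions i-1, i-2, i-4, i-8, ... (and therefore also
    i+1, i+2, i+4, ...). This needs only ~log2(n) opponents per entrant
    instead of a full round robin, while still linking distant versions
    so bayeselo can fit them all onto a single rating scale."""
    pairs = set()
    for i in range(n_versions):
        gap = 1
        while i - gap >= 0:
            pairs.add((i - gap, i))
            gap *= 2
    return sorted(pairs)

# 48 monthly book versions give 225 matches instead of the 1128
# a full round robin would need.
schedule = power_of_two_pairings(48)
```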

The result is depicted in the plot below. The ratings and margins were calculated by bayeselo with default settings.

[Image: plot of the monthly book versions' Elo ratings with error margins]

Both the books' Elo increase over the years and the recent taper-off effect are confirmed by the second experiment. I tried to find a correlation between the delta-Elo and the expansion methods I used in each book version (e.g. learning, dropout expansion, importing playchess games, etc.), but it is not so clear what helps most, unfortunately. Maybe more on that later.

In the meantime I have streamlined my original set of scripts and I'm deepening the dropout searches of the repertoire. There is now a live visualisation of this update process running here, where I sometimes go watch when I'm bored :-). The yellow arrows represent moves that stay in the book (without indication if they are good or bad btw) and the red arrow is the current dropout move for that position (that is, the best move leaving book).

Finally, I did a test to see if there is really an impact of allowing a certain path error, and the corresponding score-wise position degradation, in exchange for opening line variability. For this I played the same program with the same, most recent, book version against itself while applying different maximum allowable path errors in book move selection: I tested 0.10, 0.08, 0.06, 0.04 and 0.02 pawn maximum path error. On the servers I normally use 0.1 pawn, which is very wide. During tournaments I normally use a much smaller value. There was a claim before that doing this weakens the program in theory and practice. The latter can be tested, which I did here. See results below. My conclusion is that, within a reasonable number of games, there is no observable correlation between performance and allowable path error in this range. (Note that a 0.1 pawn disadvantage would normally correspond to a 10 Elo loss.)

Code: Select all

Rank Name               Elo    +    - games score oppo. draws 
   1 rookie3.9x1-m040     1    3    3 34000   50%     0   32% 
   2 rookie3.9x1-m020     0    4    4 24000   50%     0   34% 
   3 rookie3.9x1-m100     0    4    4 24000   50%    -1   33% 
   4 rookie3.9x1-m080    -1    3    3 34000   50%     0   32% 
   5 rookie3.9x1-m060    -1    3    3 44000   50%     0   32% 
Hmmm...

I did some testing, and in the opening the standard deviation of the evaluation seems to be about 10 to 20 centipawns over a 5-ply interval from 15 ply to 45 ply, depending on the program (i.e. the SD between the evaluations at 15 and 20 ply, or at 40 and 45 ply, using the same program). So a path error of 10 centipawns or less seems too low. I would run tests, doubling the path error each time, until I got a detectable (i.e. greater than the error bands) drop in Elo. Once you're sure you have a small Elo drop, do several tests between the last two bounds to find the highest path error with no detectable drop in Elo. The point is to maximize the number of playable lines. This makes the program less predictable and does a better job of exploring move alternatives. It won't explore them optimally, but it will be better than what you are doing now.
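The doubling-then-bisecting procedure suggested here could be sketched like this; `measure_elo_drop` is a hypothetical callback standing in for a real engine match at each path-error setting:

```python
def find_max_path_error(measure_elo_drop, error_bar=5.0, start=0.10):
    """Sketch of the doubling-then-bisecting protocol described above.

    `measure_elo_drop` is a hypothetical callback: it plays a match at the
    given maximum path error (in pawns) and returns the measured Elo loss
    versus a near-zero-path-error baseline. `error_bar` is the size of the
    error bands below which a drop counts as undetectable.
    """
    # Phase 1: keep doubling the path error until the Elo drop is detectable.
    lo, hi = start, start
    while measure_elo_drop(hi) <= error_bar:
        lo, hi = hi, hi * 2.0
    # Phase 2: bisect between the last two bounds.
    for _ in range(6):  # a few rounds suffice at match-level resolution
        mid = (lo + hi) / 2.0
        if measure_elo_drop(mid) <= error_bar:
            lo = mid
        else:
            hi = mid
    return lo  # largest tested path error with no detectable Elo drop
```

In practice each call to the callback is a full match, so the bisection depth would be limited by how many games one can afford per measurement.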

If I recall, you stated that you use random selection from the available moves that meet the path error requirements. I don't think random move selection is optimal. I would try some other method of move selection that attempts to maximize game scores.

I liked the link you added. It says ~38Mnps. What kind of hardware is that running on? What's the difference between the red and yellow arrows?

Regards,

Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
mvk
Posts: 589
Joined: Tue Jun 04, 2013 10:15 pm

Re: Position learning and opening books

Post by mvk »

Zenmastur wrote:I did some testing and in the opening the standard deviation of the evaluation seems to be about 10 to 20 centi-pawns over a 5 ply interval from 15 ply to 45 ply depending on the program. i.e. the SD between the evaluation at 15 ply and 20 ply (or 40 and 45 ply) using the same program. So, a path error of 10 centi-pawns or less seems too low. I think I would do some tests while doubling the path errors until I got a detectable (i.e. greater than the error bands) drop in ELO. Once you're sure you have a small ELO drop do several tests between the last two bounds to find the highest path error with no detectable drop in ELO. The point is to maximize the number of playable lines.
For me that is not the goal. Sufficient variation is good enough. Above that my tree becomes too large to maintain/expand. Put another way: if there are so many lines prepared that most of them won't be played, the effort is not as effective as it could have been. I'm about to decrease the maximum allowable path error on the servers for that reason. The repertoire is becoming too big. That is also the reason I can't easily increase this value to see what happens: many book nodes that were never part of the repertoire have been analysed only with reduced depth. To widen the margin, all of those would need to be re-analyzed first.
If I recall, you stated that you use random selection from the available moves that meet the path error requirements. I don't think random move selection is optimal. I would try some other method of move selection that attempts to maximize game scores.
My primary move selection criterion is based on recent game history against the opponent for each of the available moves in the repertoire (if there is more than 1). As a tie breaker I select using a uniform random distribution, for example when playing a new opponent, or in a position that hasn't been played against that opponent before.
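The selection rule described here could be sketched as follows; the data shapes and names are my assumptions, not Rookie's actual code:

```python
import random

def pick_book_move(candidates, history):
    """Sketch of the selection rule as described: prefer the candidate with
    the best recent score against this opponent, breaking ties uniformly at
    random. `candidates` are the book moves within the allowed path error;
    `history` maps move -> recent average score against this opponent (moves
    never tried against them are absent). All names here are hypothetical.
    """
    known = [(m, history[m]) for m in candidates if m in history]
    if known:
        best = max(score for _, score in known)
        pool = [m for m, score in known if score == best]
    else:
        # New opponent, or a position never played against them before:
        # fall back to a uniform random choice among all candidates.
        pool = candidates
    return random.choice(pool)
```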
I liked the link you added. It says ~38Mnps. What kind of hardware is that running on?
Just a mix of a few otherwise idle machines, new and old.
Whats the difference between the red and yellow arrows?
mvk wrote:The yellow arrows represent moves that stay in the book (without indication if they are good or bad btw) and the red arrow is the current dropout move for that position (that is, the best move leaving book).
[Account deleted]
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Position learning and opening books

Post by Zenmastur »

mvk wrote:
Zenmastur wrote:I did some testing and in the opening the standard deviation of the evaluation seems to be about 10 to 20 centi-pawns over a 5 ply interval from 15 ply to 45 ply depending on the program. i.e. the SD between the evaluation at 15 ply and 20 ply (or 40 and 45 ply) using the same program. So, a path error of 10 centi-pawns or less seems too low. I think I would do some tests while doubling the path errors until I got a detectable (i.e. greater than the error bands) drop in ELO. Once you're sure you have a small ELO drop do several tests between the last two bounds to find the highest path error with no detectable drop in ELO. The point is to maximize the number of playable lines.
For me that is not the goal.
Hmmm...

If making your program win more games isn't your goal, then what is?
mvk wrote: Sufficient variation is good enough.
"good enough" for what purpose?
mvk wrote: Above that my tree becomes too large to maintain/expand. Put another way: if there are so many lines prepared that most of them won't be played, the effort is not as effective as it could have been.
In what way is it too large? Do you mean that the amount of computing power required to generate the various variations becomes more than your budget will support? Or are you saying that the file space required to store the book becomes excessive?

For me, file space is a non-issue. If it takes 150 gigabytes, so be it; if it takes more than 8 terabytes I might think about it for a while before I ordered new drives. I probably wouldn't think about it for too long though, because I already have plans for an 18-24 terabyte RAID array. But for now I'm nowhere near needing that kind of space for an opening book or the supporting data.

mvk wrote: I'm about to decrease the maximum allowable path error on the servers for that reason. The repertoire is becoming too big. That is also the reason I can't easily increase this value to see what happens: many book nodes that were never part of the repertoire have been analysed only with reduced depth. To widen the margin, all of those would need to be re-analyzed first.
I guess this is why I don't like off-line analysis as the main method of extending the repertoire.
If I recall, you stated that you use random selection from the available moves that meet the path error requirements. I don't think random move selection is optimal. I would try some other method of move selection that attempts to maximize game scores.
mvk wrote: My primary move selection criterion is based on recent game history against the opponent for each of the available moves in the repertoire (if there is more than 1). As a tie breaker I select using a uniform random distribution, for example when playing a new opponent, or in a position that hasn't been played against that opponent before.
I don't store the opening play of opponents because I see no benefit in it. I suppose if you play the same set of opponents all the time there could be some advantage to be gained. Random move selection I think is bad even if the circumstances for its use are limited. Its only redeeming value is unpredictability, which is gained at the expense of non-optimal play. A MAB selection algorithm will be unpredictable to an opponent if the moves are deemed of equal value. If, however, one or more moves are superior to the others, these will be played proportionately more often, but still can't be predicted by an opponent. The exception is when one move is so superior that it will be played predominantly. This isn't a problem, because if the move is that much superior to the others it SHOULD be played; it's clearly the BEST move.

Regards,

Zen
mvk

Re: Position learning and opening books

Post by mvk »

Zenmastur wrote:
mvk wrote:
Zenmastur wrote:The point is to maximize the number of playable lines.
For me that is not the goal.
If making your program win more games isn't your goal
That doesn't follow.
mvk wrote: Sufficient variation is good enough.
"good enough" for what purpose?
Good enough to avoid being exploitable by repeating the exact same book mistakes. That allows unsupervised play, without worries, on public chess servers before the next graph update is available. And good enough to explore a stronger opponent's weaker spots by focussing on alternative lines after losing, and then steering into those lines that give better results. Rookie does both, much to the chagrin of some operators who don't use book learning, or have broken book learning. They have to patch their books manually all the time.
mvk wrote: Above that my tree becomes too large to maintain/expand. Put another way: if there are so many lines prepared that most of them won't be played, the effort is not as effective as it could have been.
In what way is it too large? Do you mean that the amount of computing power required to generate the various variations becomes more than your budget will support?
Moore's Law has been slowing down for a while, and it is becoming very visible recently: Intel has just postponed their 10nm generation indefinitely, while the 28nm generation is still the sweet spot in the cost/performance spectrum for chip makers. But Moore's Law hasn't stopped quite yet; it is just slowing down. That means that any investment in computation time is still subject to deflation. It is still only a matter of time before any offline preparation is defeated by what you can search during the game. Opening preparation therefore has a maximum practical size, above which the player will eventually catch up with the booker.
Or, are you saying that the files space required to store the book becomes excessive?
Space, query time and transfer rates are not an issue anymore, and are a waste of time to worry about.
mvk wrote: I'm about to decrease the maximum allowable path error on the servers for that reason. The repertoire is becoming too big. That is also the reason I can't easily increase this value to see what happens: many book nodes that were never part of the repertoire have been analysed only with reduced depth. To widen the margin, all of those would need to be re-analyzed first.
I guess this is why I don't like off-line analysis as the main method of extending the repertoire.
If I recall, you stated that you use random selection from the available moves that meet the path error requirements. I don't think random move selection is optimal. I would try some other method of move selection that attempts to maximize game scores.
mvk wrote: My primary move selection criterion is based on recent game history against the opponent for each of the available moves in the repertoire (if there is more than 1). As a tie breaker I select using a uniform random distribution, for example when playing a new opponent, or in a position that hasn't been played against that opponent before.
I don't store the opening play of opponents because I see no benefit in it. I suppose if you play the same set of opponents all the time there could be some advantage to be gained. Random move selection I think is bad even if the circumstances for its use are limited. Its only redeeming value is unpredictability, which is gained at the expense of non-optimal play.
Well, about that I can only say that I have measured the elo effect and I trust my results :-)
Zenmastur

Re: Position learning and opening books

Post by Zenmastur »

mvk wrote:
mvk wrote: My primary move selection criterion is based on recent game history against the opponent for each of the available moves in the repertoire (if there is more than 1). As a tie breaker I select using a uniform random distribution, for example when playing a new opponent, or in a position that hasn't been played against that opponent before.
I don't store the opening play of opponents because I see no benefit in it. I suppose if you play the same set of opponents all the time there could be some advantage to be gained. Random move selection I think is bad even if the circumstances for its use are limited. Its only redeeming value is unpredictability, which is gained at the expense of non-optimal play.
Well, about that I can only say that I have measured the elo effect and I trust my results :-)
I have no doubt that your results are what you've claimed. I just think they could be improved with little effort by a different move selection algorithm, an enhanced learning capability, or both. The former is the easier to implement; the latter removes the need for off-line analysis but is far more work to implement.

Regards,

Zen
mvk

Re: Position learning and opening books

Post by mvk »

Zenmastur wrote:I have no doubt that your results are what you've claimed. I just think they could be improved
Improved by how much? How do you quantify and measure learning performance in your proposal, and what do you estimate the performance difference to be by that metric? Do you have quantitative results to share already?
Zenmastur

Re: Position learning and opening books

Post by Zenmastur »

mvk wrote:
Zenmastur wrote:I have no doubt that your results are what you've claimed. I just think they could be improved
Improved by how much? How do you quantify and measure learning performance in your proposal, and what do you estimate the performance difference to be by that metric? Do you have quantitative results to share already?
No, I don't have any real-world results yet. I do have some partially completed test results showing that engine search results are less-than-optimal predictors of game scores, at least when games are played at time controls less than or approximately equal to the search depth used in making a book. I haven't yet tested TCs longer than about 12 minutes, as I'm short of computing resources. I plan on testing 600+12.25 increment games at a later date, and maybe a smaller number of longer games.

An example of a test I used: select half a dozen opening positions from your repertoire that can each be found about 100 times in a large opening database. These positions should NOT be related, so pick different openings. They shouldn't be in a forcing sequence of moves, or a position in which the side to move is in check, etc. Make every legal move for the side to move and record the FENs. This should give you around 180-240 positions. Analyze each to the depth of your choice. Since these come from your opening repertoire this should already be done, but the scores need to be of the same depth. I used 15 to 35 plies in 5-ply increments. You could also use a timed analysis. Save the evaluation(s) for each position. Now play 100 games from each position at three (or more) time controls: very fast, fast, and medium. I used 120+2.4 for medium, 25+.5 for fast and 5+.1 for very fast. With an engine that has an EBF of 1.5 this is approximately 4 plies difference in search depth between the time controls. The 120+2.4 games lasted about 12 minutes on average.
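The position-generation step above can be sketched as follows, assuming a board object with the python-chess `Board` interface (this is my sketch, not part of the original test setup):

```python
def positions_after_every_move(board):
    """Generate the test positions described above: from one opening
    position, make every legal move for the side to move and record the
    resulting FENs. `board` is assumed to follow the python-chess Board
    interface (`legal_moves`, `push`, `pop`, `fen`); with roughly 30-40
    legal moves per opening position, six unrelated root positions yield
    the ~180-240 positions mentioned."""
    fens = []
    for move in list(board.legal_moves):
        board.push(move)          # make the move
        fens.append(board.fen())  # record the resulting position
        board.pop()               # take it back
    return fens
```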

I used a spreadsheet to collect the game scores and search evaluations, one line per position, with related positions grouped together. Calculate the average game score for each position; do this for each group of positions and each time control used. Now use your favorite equation to predict what the evaluation of each position should be given the average game score produced by these games. This gives one evaluation per time control per position, so something like 540 to 720 evaluations in total. I used the generic equation -400*LOG(1/(average game score)-1)/100. Now sort each group of positions (by rows) so the search evaluations (not the ones just calculated; the ones done prior to game play) are in descending order. If search evaluations in the opening are a good predictor of game results, then the average game scores, as well as the evaluations you calculated from them, should end up sorted in descending order too. Do the same for each time control and each group of positions.
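The conversion and the sorting check described here can be sketched like this; `rank_agreement` is a hypothetical helper of my own, not part of the original spreadsheet procedure:

```python
import math

def score_to_eval(avg_score):
    """The generic equation quoted above: an average game score in (0, 1)
    mapped to an evaluation in pawns, eval = -400*log10(1/score - 1)/100."""
    return -400.0 * math.log10(1.0 / avg_score - 1.0) / 100.0

def rank_agreement(search_evals, avg_scores):
    """Hypothetical helper for the sorting check described: order positions
    by search evaluation (descending) and report the fraction of adjacent
    pairs whose average game scores are also in descending order.
    1.0 means the two orderings agree perfectly."""
    order = sorted(range(len(search_evals)), key=lambda i: -search_evals[i])
    in_order = sum(1 for a, b in zip(order, order[1:])
                   if avg_scores[a] >= avg_scores[b])
    return in_order / max(1, len(order) - 1)
```

For instance, score_to_eval(0.64) is roughly 1.0, i.e. under this model a 64% score corresponds to about a one-pawn advantage.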

It does take some time to run this test, as it requires about 12,000 games per position you select from your repertoire, across the 3 time controls. If you do this test, I think you will find that search evaluations aren't that great a predictor for opening play. The one thing I'm afraid of is that the results I've gotten so far will change dramatically at longer time controls. I don't think they will, but until I complete at least some LTC games from a few different positions I won't be confident in the results. As it is right now, I have very limited resources to do LTC tests, and I have other things on my agenda.

New/more hardware is one of several things I'm addressing.

Regards,

Zen
mvk

Re: Position learning and opening books

Post by mvk »

Zenmastur wrote:Now, use your favorite equation to predict what the evaluation of the position should be given the average game score produced by these games. This should produce one evaluation per time control per position. So something like 540 to 720 evaluations in total. I used the generic equation -400*LOG(1/(average game score)-1)/100.
[...]
If you do this test, I think you will find that search evaluations aren't that great for opening play.
You picked my favourite equation. If the equation and the game results mismatch, I consider that an evaluation function problem. Actually, my evaluator is tuned to minimise precisely this error, using this same equation. (For that reason it also doesn't need a fixed pawn value, such as 100.)
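The tuning idea described here, minimising the mismatch between predicted and actual game scores, is often called Texel-style tuning; a minimal sketch (my reading of the remark, not a description of Rookie's internals):

```python
import math

def expected_score(eval_pawns, k=1.0):
    """Inverse of the equation above: evaluation in pawns -> expected game
    score, s = 1 / (1 + 10^(-k*eval/4)). The scaling constant `k` can be
    tuned along with everything else, which is one way to read the remark
    that no fixed pawn value is needed (an assumption on my part)."""
    return 1.0 / (1.0 + 10.0 ** (-k * eval_pawns / 4.0))

def tuning_loss(evals, results, k=1.0):
    """Mean squared error between predicted and actual game scores: the
    quantity a Texel-style evaluation tuner minimises over its parameters.
    A sketch of the general idea only."""
    return sum((expected_score(e, k) - r) ** 2
               for e, r in zip(evals, results)) / len(evals)
```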

I'm rethinking my dropout-expansion node selection criterion, BTW. I now want to attempt a more Bayesian approach for the node selection, instead of just the fixed window. I have the impression it has recently been spending too much of its workload on deep positions.
Zenmastur

Re: Position learning and opening books

Post by Zenmastur »

mvk wrote:
Zenmastur wrote:Now, use your favorite equation to predict what the evaluation of the position should be given the average game score produced by these games. This should produce one evaluation per time control per position. So something like 540 to 720 evaluations in total. I used the generic equation -400*LOG(1/(average game score)-1)/100.
[...]
If you do this test, I think you will find that search evaluations aren't that great for opening play.
You picked my favourite equation. If the equation and the game results mismatch, I consider that an evaluation function problem. ...
I originally thought this as well, until I did this test. Now it seems to me that the time control used plays a major role in how well the evaluation function matches the game results. So without further testing at long and very long time controls, I would be a little leery of making major changes to an evaluation function based ONLY on this evidence. My suspicion is this: as the time controls are slowly increased (during testing), the "average" game scores will oscillate (sometimes wildly) around the score calculated from the output of the evaluation function. The magnitude of the oscillations decreases as the time controls increase. If the evaluation function is accurate, then, when all oscillations cease at very long time controls, the two values should converge to the same value. If they don't, you still may not have a problem: it could be that the problem lies in the constants used in the equation. I use a "generic" equation because I'm not designing my algorithm to fit a specific engine. In your case, I would think you would optimize at least one of the constants in the equation to better fit your program's evaluation function. If this removes most of the remaining error then you're good to go. If not, I'll leave you to your own devices as to how to proceed.

There are also a couple of other things that could happen, other than the oscillations dying down as time controls increase: they could get worse at longer time controls, or they could stay about the same regardless of time control. If either of these happens, it could be the search that is causing the problem. If they get worse, I would look very hard at the search, assuming you have no instability. If you have no oscillations at all, your engine is VERY special! :lol:

mvk wrote:... Actually, my evaluator is tuned to minimise precisely this error, using this same equation. (It doesn't need a "fixed pawn value", such as 100 for example, for that reason also.).

I'm rethinking my dropout expansion node selection criterium BTW. I now want to attempt a more bayesian approach for the node selection here, instead of just the fixed window. I have the impression it spends too much of its workload on deep positions recently.
Since you've been at this for quite a while, I'm sure your understanding of the problem is markedly different from what it was several years ago. It's always good to rethink things after your understanding of the problem has changed. Many times an insight will be triggered by recounting what you have thought, done, or observed while working on various parts of the problem.

In any case, good luck with your Bayesian excursion!

Regards,

Zen