Extensions, anyone?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

zamar wrote:
bob wrote: A large number of games, using a large number of starting positions (playing each twice with alternating colors to eliminate bias from unbalanced positions) and with a significant number of different opponents is the way to go. You need more than one or two opponents as I have made changes that helped against A, but hurt against B, C and D. If you only play A you can reach the wrong conclusion and make a tuning mistake that will hurt in a real tournament.
Ouch. I'm currently testing an automated tuning system, where the basic idea is to test whether a randomly modified engine can beat the original one. But if what you say is true, then I'm doomed to fail, because the changes might make the engine play worse against all the other engines. Is that kind of behaviour common? I guess you must have tested many variables for your paper.
Anthony Cozzie worked on this idea using simulated annealing (several years ago). What we got, after I modified Crafty to make each evaluation term individually changeable via a command, was a lot of noise, caused by the inherent randomness chess engines show when using time as the search-limiting control.

I currently do _some_ eval tuning by running successive matches (32K games, one hour per match) with different values to see which direction the term needs to move for improvement...

Until I started this about 2 years ago, I had no idea how hard it would be to actually determine whether a change is good or bad, unless it is a _major_ change...
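
To put some numbers on "how hard": here is a quick sketch using nothing but binomial statistics (draws ignored, which only overstates the error a little) of the 95% error bar, in Elo, on a match between two roughly equal engines:

import math

def elo_error_95(n_games, score=0.5):
    # Standard error of the score fraction over n_games independent games
    # (draws ignored, which makes this a slight over-estimate).
    se = math.sqrt(score * (1.0 - score) / n_games)
    lo, hi = score - 1.96 * se, score + 1.96 * se
    to_elo = lambda s: 400.0 * math.log10(s / (1.0 - s))   # score fraction -> Elo difference
    return (to_elo(hi) - to_elo(lo)) / 2.0                  # 95% half-width in Elo

for n in (200, 2000, 32000):
    print(n, round(elo_error_95(n), 1))   # ~48.5, ~15.2 and ~3.8 Elo respectively

So a 200-game match cannot resolve anything much smaller than a 50 Elo change, while the kind of eval tweaks being discussed here are typically worth only a few Elo.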
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Extensions, anyone?

Post by Tord Romstad »

bob wrote:I realize playing large numbers of games is difficult if not impossible for most. But the alternative is to accept bad changes and reject good changes based purely on a coin flip, which is all 200 games is.
If you do some random change and run 200 games, then yes, I agree. But that's not how I work, nor how anybody else I know works. What I do is to make some change that I feel reasonably confident should improve the strength, and then run 200 games. If the "improved" version wins by (say) 104-96, I keep the change. The result by itself says very little, but when combined with my intuition, it is more likely than not (but of course still far from sure) that the change was an improvement. Without a doubt, it does happen that one of the changes I add this way is harmful, but in the long run the result is steady progress.
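
To put a number on "very little": a quick sketch, treating all 200 games as decisive (the worst case for noise), of how often a genuinely equal engine would score 104 or better:

from math import comb

def p_at_least(points, games, p=0.5):
    # Exact binomial tail: chance that an engine with true score p gets
    # at least `points` wins out of `games` decisive games.
    return sum(comb(games, k) * p**k * (1 - p)**(games - k)
               for k in range(points, games + 1))

print(round(p_at_least(104, 200), 2))   # about 0.31

Roughly a one-in-three chance, in other words; the prior confidence in the change is doing most of the work.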

This is not an opinion, but an easily verifiable fact: My program has improved by hundreds of rating points over the last few years, as have the programs of many other authors. If running thousands of test games for every little change were the only viable approach, Crafty would be the only program which was able to improve at all. Clearly, this is not the case.

Tord
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

Tord Romstad wrote:
bob wrote:I realize playing large numbers of games is difficult if not impossible for most. But the alternative is to accept bad changes and reject good changes based purely on a coin flip, which is all 200 games is.
If you do some random change and run 200 games, then yes, I agree. But that's not how I work, nor how anybody else I know works. What I do is to make some change that I feel reasonably confident should improve the strength, and then run 200 games. If the "improved" version wins by (say) 104-96, I keep the change. The result by itself says very little, but when combined with my intuition, it is more likely than not (but of course still far from sure) that the change was an improvement. Without a doubt, it does happen that one of the changes I add this way is harmful, but in the long run the result is steady progress.

This is not an opinion, but an easily verifiable fact: My program has improved by hundreds of rating points over the last few years, as have the programs of many other authors. If running thousands of test games for every little change were the only viable approach, Crafty would be the only program which was able to improve at all. Clearly, this is not the case.

Tord
We don't make "random changes" either. I gave some clear non-random examples that intuition says should be better, but which testing shows to be worse...

Your argument also has one major flaw. You assume that because you have improved, your choices were good. There's absolutely no evidence to support that, however. Perhaps many of your changes were "good". But I'd bet, based on the unexpected results I have seen over the past two years, that a +lot+ of your changes were worse, and that even the good ones could have been better. If you make several changes over time with Elo effects of +7, -5, +3, -1, +0, +4, -3 and -2, overall you come out better. But eliminating the negative changes makes it much better. That was my point...
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Extensions, anyone?

Post by Tord Romstad »

bob wrote: We don't make "random changes" either. I gave some clear non-random examples that intuition says should be better, but which testing shows to be worse...

Your argument also has one major flaw. You assume that because you have improved, your choices were good.
No, not at all: only that they are good more often than not (or at least that the sum of all the positive changes outweighs the sum of all the negative changes), which is basically what you are saying, too.
There's absolutely no evidence to support that, however. Perhaps many of your changes were "good". But I'd bet, based on the unexpected results I have seen over the past two years, that a +lot+ of your changes were worse, and that even the good ones could have been better. If you make several changes over time with Elo effects of +7, -5, +3, -1, +0, +4, -3 and -2, overall you come out better. But eliminating the negative changes makes it much better. That was my point...
Sure, if it is possible to play thousands of games within a reasonable time, it is of course advantageous to do so. However, it is not the only way to improve, nor even the most efficient way to improve, for the vast majority of programmers. You are of course right that eliminating the negative changes is better, but only if the cost of eliminating them is not too high. For me -- and almost everybody else -- the cost would be several months of testing between every tiny change. Relying on intuition backed up by a few short test matches is more efficient. :)

Tord
bhlangonijr
Posts: 482
Joined: Thu Oct 16, 2008 4:23 am
Location: Milky Way

Re: Extensions, anyone?

Post by bhlangonijr »

bob wrote: Anthony Cozzie worked on this idea using simulated annealing (several years ago). What we got, after I modified Crafty to make each evaluation term individually changeable via a command, was a lot of noise, caused by the inherent randomness chess engines show when using time as the search-limiting control.
I read a paper some time ago, "KnightCap: A chess program that learns by combining TD(λ) with minimax search", in which the authors claim a gain of 450 Elo points [page 12] from tuning the evaluation term weights through a process called temporal-difference learning (a form of reinforcement learning), after 300 games. Personally, I doubt it. :)

http://cs.anu.edu.au/~Lex.Weaver/pub_se ... ghtcap.pdf

Anyway, this approach seems more plausible than just adjusting the weights randomly.
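
For reference, the core of what that paper does (they call it TDLeaf(λ)) looks roughly like the sketch below. This is a simplified illustration rather than the paper's actual code, assuming a linear evaluation squashed through tanh, with made-up names and parameter defaults:

import numpy as np

def tdleaf_update(features, outcome, w, alpha=1.0, lam=0.7, beta=0.0025):
    # features : one feature vector per position seen during the game
    # outcome  : +1 win, 0 draw, -1 loss, from the learner's point of view
    # w        : current evaluation weights (linear eval = w . features)
    w = np.asarray(w, dtype=float)
    phi = np.asarray(features, dtype=float)
    J = np.tanh(beta * (phi @ w))          # squash evals into (-1, 1), like a game result
    J[-1] = outcome                        # the final "prediction" is the actual result

    d = np.diff(J)                         # temporal differences d_t = J_{t+1} - J_t
    grad = beta * (1.0 - J[:-1] ** 2)[:, None] * phi[:-1]   # dJ_t/dw for tanh(beta * w.phi)

    # each position is credited with the lambda-discounted sum of later prediction errors
    credit = np.array([sum(lam ** (j - t) * d[j] for j in range(t, len(d)))
                       for t in range(len(d))])
    return w + alpha * (credit[:, None] * grad).sum(axis=0)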
hgm
Posts: 28426
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Extensions, anyone?

Post by hgm »

He must have started with really lousy parameter values then... :lol:

In Joker80, the improvement I got from carefully measured piece values (so not obtained indirectly through learning, but by direct measurement) compared to educated guesses was only 100 Elo.
Dann Corbit
Posts: 12808
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Extensions, anyone?

Post by Dann Corbit »

bhlangonijr wrote:
bob wrote: Anthony Cozzie worked on this idea using simulated annealing (several years ago). What we got, after I modified Crafty to make each evaluation term individually changeable via a command, was a lot of noise, caused by the inherent randomness chess engines show when using time as the search-limiting control.
I read a paper some time ago, "KnightCap: A chess program that learns by combining TD(λ) with minimax search", in which the authors claim a gain of 450 Elo points [page 12] from tuning the evaluation term weights through a process called temporal-difference learning (a form of reinforcement learning), after 300 games. Personally, I doubt it. :)

http://cs.anu.edu.au/~Lex.Weaver/pub_se ... ghtcap.pdf

Anyway, this approach seems more plausible than just adjusting the weights randomly.
See also:
http://www.cs.ualberta.ca/~mburo/ps/ordinal.ps.gz
http://www.cs.ualberta.ca/~tony/RecentP ... ending.pdf
https://chessprogramming.wikispaces.com/Learning
http://bace.bowron.us/BACE.ps
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Extensions, anyone?

Post by zamar »

bob wrote: Anthony Cozzie worked on this idea using simulated annealing (several years ago). What we got, after I modified Crafty to make each evaluation term individually changeable via a command, was a lot of noise, caused by the inherent randomness chess engines show when using time as the search-limiting control.
Wow, this sounds very close to what I'm currently doing :) I've modified Glaurung to make each evaluation parameter modifiable through UCI, and I have already tested simulated annealing: it truly cannot be used directly for engine parameter tuning.

Instead I'm using a "simplified" approach (pseudocode):

x is the variable we want to tune.
std_random(mean, std_dev) returns a normally distributed random number.
engine1, engine2 are copies of the engine we want to tune.

apply_factor   = 0.005
cooling_factor = 1 / 50000
game_count     = 100000

x     = starting value
x_dev = x * 0.05

for (i = 0; i < game_count; i++) {
    // the perturbation shrinks as the run "cools down"
    diff = std_random(0, x_dev * pow(2, -i * cooling_factor))

    engine1.x = x + diff
    engine2.x = x - diff

    play_match(engine1, engine2)

    // move x a small step towards the value that won the match
    if (engine1 wins)
        x += diff * apply_factor
    else if (engine2 wins)
        x -= diff * apply_factor
}

Of course you can tune many variables simultaneously, and it's advisable to do so. I'm quite sure that what I'm currently seeing is not just random noise, but I'm not sure whether the changes help against other engines or at longer time controls (or, in the worst case, there is a bug hidden somewhere in my code and I'm only seeing the result of a systematic error!)
Joona Kiiski
hgm
Posts: 28426
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Extensions, anyone?

Post by hgm »

I would say this method has no chance of succeeding at all, unless you are prepared to play several billion games. You are trying to search for the maximum of a function that is heavily contaminated with shot noise by looking in the neighborhood of the maximum, where the variation of the function is only of second order in a small quantity, and thus swamped by the shot noise. Such an approach is doomed.

Have you tried this method on a model problem where you know the answer? E.g. suppose you already know that the winning probability for parameter value x is 60%*(1-(1-x)^2). (I.e. optimal value is x=1). When you apply the annealing by drawing random 'game results' with this probability, how many games do you need to converge to x=1? Does it converge at all?
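
For what it's worth, here is a throwaway sketch of exactly that kind of model test, reusing the update loop from the earlier post; the way the two parameter values are turned into a single game result, 0.5 + (s(a) - s(b)) / 2, is an arbitrary modelling choice of mine, not something from the thread:

import random

def s(x):
    # model from above: "winning probability" for parameter value x, optimum at x = 1
    return 0.6 * (1.0 - (1.0 - x) ** 2)

def model_annealing(x=0.3, games=100_000, apply_factor=0.005, cooling=1.0 / 50_000):
    x_dev = x * 0.05
    for i in range(games):
        diff = random.gauss(0.0, x_dev * 2.0 ** (-i * cooling))
        a, b = x + diff, x - diff
        # arbitrary head-to-head model: engine1 (parameter a) wins with
        # probability 0.5 + (s(a) - s(b)) / 2
        if random.random() < 0.5 + (s(a) - s(b)) / 2.0:
            x += diff * apply_factor
        else:
            x -= diff * apply_factor
    return x

print(model_annealing())   # does it get anywhere near the known optimum x = 1?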

A better method (but likely still not feasible) would be to try to estimate the position of the peak by finding two equally spoiled parameter values (values that perform equally badly) far out in the opposite wings of the performance peak. Out there the performance varies at first order in the parameter rather than second order, which would make you far less sensitive to the unavoidable noise.
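
A rough sketch of how such a wing-based estimate could look; measure(x, n) is a hypothetical helper that plays n games at parameter value x against a fixed opponent and returns the score fraction:

def estimate_peak(measure, c_lo, c_hi, half_width, games=20_000, iters=8):
    # Bisect on the centre c: if the right-hand probe (c + half_width) scores
    # better than the left-hand probe, the peak must lie to the right of c,
    # assuming a single symmetric peak somewhere between c_lo and c_hi.
    for _ in range(iters):
        c = 0.5 * (c_lo + c_hi)
        if measure(c + half_width, games) > measure(c - half_width, games):
            c_lo = c
        else:
            c_hi = c
    return 0.5 * (c_lo + c_hi)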
Last edited by hgm on Tue Feb 24, 2009 9:36 pm, edited 1 time in total.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Extensions, anyone?

Post by zamar »

hgm wrote: A better method (but likely still not feasible) would be to try to estimate the position of the peak by finding two equally spoiled parameter values (values that perform equally badly) far out in the opposite wings of the performance peak. Out there the performance varies at first order in the parameter rather than second order, which would make you far less sensitive to the unavoidable noise.
If I could be sure about symmetry, I would of course use this approach.
Joona Kiiski