Extensions, anyone?

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Extensions, anyone?

Post by hgm »

Even the asymmetry would be better estimated by measuring performance over a wider range of the parameter, over which it changes a lot, then fitting some higher-order curve through the measurements and determining the maximum of that.
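For the simplest case, three equally spaced measurements, the maximum of the fitted parabola follows directly from the finite differences. A minimal sketch, with invented scores:

Code: Select all

#include <cstdio>

// Estimate the optimum by fitting a parabola through the scores
// measured at three equally spaced parameter values x1-h, x1, x1+h.
int main() {
    const double h  = 0.2;                         // spacing of tested values
    const double x1 = 1.0;                         // middle tested value
    const double y0 = 0.54, y1 = 0.58, y2 = 0.55;  // measured scores (invented)

    // Second difference ~ curvature; must be negative for a maximum.
    double curv = y0 - 2.0 * y1 + y2;
    if (curv < 0.0)
        std::printf("estimated optimum near x = %.3f\n",
                    x1 + h * (y0 - y2) / (2.0 * curv));
    else
        std::printf("no interior maximum in the measured range\n");
    return 0;
}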

Note that your last posting crossed my editing of mine.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

bhlangonijr wrote:
I've read a paper some time ago, "KnightCap: A chess program that learns by combining TD(λ) with minimax search", in which the author claims a gain of 450 Elo points [page 12] by tuning the evaluation term weights through a process called temporal-difference learning (a sort of ANN), after 300 games. Personally, I doubt it. :)

http://cs.anu.edu.au/~Lex.Weaver/pub_se ... ghtcap.pdf

Anyway, this approach seems more plausible than just adjusting the weights randomly.
Anthony Cozzie worked on this idea using simulated annealing several years ago. What we got, after I modified Crafty to make each evaluation term individually changeable via command, was a lot of noise, caused by the inherent randomness chess engines show when using time as the search-limiting control.

The issue is this: how is the improvement actually made? If you start off a program with all weights set to zero, it is quite likely that automated tuning could produce a significant improvement. The problem Cozzie and I found, however, was that we have a _ton_ of weights that can be adjusted, and the annealer would treat them all in the same way. We got some of the most bizarre piece/square tables, because you can encode things like "rook on the 7th" there in a simple way, and that then influences the normal rook-on-7th scoring. The bottom line was that we never found an improvement for Crafty. The tuned versions were never any worse as far as I could tell, but they were also never any better, which was disappointing.
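To make concrete what the annealer was doing, here is a rough sketch of such a loop; the names and the toy objective are invented for illustration, not Cozzie's actual code. The weak point is visible in the accept test, which compares two noisy match scores:

Code: Select all

#include <cmath>
#include <random>
#include <vector>

static std::mt19937 rng(42);

// Stand-in for "play a match with these weights": a smooth optimum at
// w[i] = 1.0 plus Gaussian noise, mimicking the randomness of timed games.
static double play_match(const std::vector<double>& w) {
    double s = 0.6;
    for (double wi : w)
        s -= 0.01 * (wi - 1.0) * (wi - 1.0);
    std::normal_distribution<double> noise(0.0, 0.02);
    return s + noise(rng);
}

static std::vector<double> anneal(std::vector<double> w, int steps) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::normal_distribution<double> step(0.0, 0.1);
    double temp = 0.05;
    double score = play_match(w);
    for (int i = 0; i < steps; ++i, temp *= 0.999) {
        std::vector<double> cand = w;
        cand[rng() % cand.size()] += step(rng);   // perturb one random weight
        double cand_score = play_match(cand);
        // Accept improvements always, regressions with Boltzmann probability.
        // Because both scores are noisy, a genuinely bad change of a weight
        // is accepted far too often, and vice versa.
        if (cand_score >= score ||
            u(rng) < std::exp((cand_score - score) / temp)) {
            w = cand;
            score = cand_score;
        }
    }
    return w;
}

int main() {
    std::vector<double> w(20, 0.0);   // 20 weights, all started at zero
    w = anneal(w, 5000);
    return 0;
}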
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Extensions, anyone?

Post by Dirt »

bob wrote:If you make several changes over time, with a +7, a -5, a +3, a -1, a +0, a +4, a -3 and a -2 Elo change, overall you come out better. But eliminating those negative changes makes it much better. That was my point...
How confident can we be that the changes will combine in anything close to a linear manner? The sum of the changes you list is only +3, but it wouldn't surprise me if the actual combined effect were +12. An effect like this seems the easiest way to explain Tord's ability to improve Glaurung by such large amounts.

From what I remember of your testing, you've been choosing to remove evaluation terms if testing shows the same strength with or without them. I'm wondering if that might be backwards.
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Extensions, anyone?

Post by zamar »

hgm wrote: Have you tried this method on a model problem where you know the answer? E.g. suppose you already know that the winning probability for parameter value x is 60%*(1-(1-x)^2). (I.e. optimal value is x=1). When you apply the annealing by drawing random 'game results' with this probability, how many games do you need to converge to x=1? Does it converge at all?
Thanks a lot for this testing idea!

I simulated what I'm currently doing with that test problem:

starting value = 1.1 (since the values in chess programs are often only guesses, a 10% error must be quite common)
apply_factor = 0.005
game_number = 100000
cooling_factor = 0.000033

results for test runs:

1.06669502721239
1.09071359807219
1.05565070212092
1.09805863970575
1.07415661781521
1.07555396261112
1.07370599940547
1.06222552640416
1.10923052459681
1.07502243412874
1.05256581699926
1.03905816117969
1.09975969393299
1.06116394341623
1.06818713036942
1.11009953930953
1.05479705688086
1.07618830145693
1.07424096789635
1.06704461339334
1.05517567344185
1.05273080438802
1.07107781350668
1.06796286024463
1.10499407093071
1.05709955654527
1.09126213614311
1.05104686517821
1.03536649302966
1.07387081078621
1.06586306997842
1.07767047173808
1.09509721178012
1.06821800289923
1.07654689293
1.04742202936307
1.07623531665651
1.07826294176999
1.06429249814433
1.06135594045394
1.07465673719625
1.06688892620663
1.07144414863126

Conclusion: it did not converge, but clear progress was usually made.

starting value=1.03 (value is already close to optimum)
apply_factor = 0.005
game_number = 100000
cooling_factor = 0.000033

results for test runs:
1.01424274043369 (good)
1.03946473098555 (bad)
0.96028795881353 (bad)
1.0043281671954 (good)
1.0363814805147 (bad)
1.0050361072137 (good)
1.00130330151639 (good)
0.99359945330747 (good)
1.03981095415478 (bad)
0.97767835610430 (good)
1.02684625445085 (good)
1.0322709434732 (bad)
1.01948365226043 (good)
1.05377683784951 (bad)
1.0416868384411 (bad)
1.01541210682659 (good)
1.03275511131925 (bad)
1.0099650046204 (good)
1.00434409050246 (good)
1.04206986583233 (bad)
1.02317994724969 (good)
1.01820291437869 (good)
0.986799997590354 (good)

Conclusion: if the value is already close to the optimum, it's unclear whether the method does more harm than good. (But then, why would we need balancing?)

I guess that to keep the "random noise" away one needs to keep apply_factor small (and, of course, the number of games big).
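Since I did not post the code, here is a minimal sketch of the kind of loop I mean; the probe-and-nudge update rule is one plausible reading of apply_factor and cooling_factor (essentially a noisy gradient step) and may differ in details from what I actually run:

Code: Select all

#include <cstdio>
#include <random>

// HGM's model problem: winning probability, maximal (60%) at x = 1.
static double win_prob(double x) {
    return 0.60 * (1.0 - (1.0 - x) * (1.0 - x));
}

int main() {
    std::mt19937 rng(std::random_device{}());
    std::uniform_real_distribution<double> u(0.0, 1.0);

    double x = 1.1;                    // starting value
    double apply_factor = 0.005;
    const double cooling_factor = 0.000033;
    const int game_number = 100000;

    for (int g = 0; g < game_number; ++g) {
        // Probe a random direction, draw one "game result" there, and
        // move toward the probe on a win, away from it on a loss.
        int d = (u(rng) < 0.5) ? 1 : -1;
        bool won = u(rng) < win_prob(x + d * apply_factor);
        x += (won ? 1 : -1) * d * apply_factor;
        apply_factor *= 1.0 - cooling_factor;   // cooling schedule
    }
    std::printf("final value after %d games: %.14f\n", game_number, x);
    return 0;
}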
Joona Kiiski
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

Dirt wrote:
bob wrote:If you make several changes over time, with a +7, a -5, a +3, a -1, a +0, a +4, a -3 and a -2 Elo change, overall you come out better. But eliminating those negative changes makes it much better. That was my point...
How confident can we be that the changes will combine in anything close to a linear manner? The sum of the changes you list is only +3, but it wouldn't surprise me if the actual combined effect were +12. An effect like this seems the easiest way to explain Tord's ability to improve Glaurung by such large amounts.

From what I remember of your testing, you've been choosing to remove evaluation terms if testing shows the same strength with or without them. I'm wondering if that might be backwards.
If an eval term can be set to zero with no strength change, and it is resistant to changes in general (no change improves, most hurt), then I feel pretty confident in removing that term.

Yes, there are interactions between terms, which is one of the problems the annealing approach seemed to have trouble with: you can change one term in one direction and another in the opposite direction, and get no change at all in skill level.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Extensions, anyone?

Post by michiguel »

Dirt wrote:
bob wrote:If you make several changes over time, with a +7, a -5, a +3, a -1, a +0, a +4, a -3 and a -2 Elo change, overall you come out better. But eliminating those negative changes makes it much better. That was my point...
How confident can we be that the changes will combine in anything close to a linear manner? The sum of the changes you list is only +3, but it wouldn't surprise me if the actual combined effect were +12. An effect like this seems the easiest way to explain Tord's ability to improve Glaurung by such large amounts.

From what I remember of your testing, you've been choosing to remove evaluation terms if testing shows the same strength with or without them. I'm wondering if that might be backwards.
It is backwards if human knowledge tells you that the evaluation term is safe and beneficial. There are many terms that provide maybe 1 Elo point or less; however, we know that they never hurt and that once in a while they do good. For instance, knowledge about the wrong bishop with only a rook pawn left: I do not think you can measure any Elo difference, but the parameter should stay. Once you have added several of these small items, you get an engine that plays better endgames and is, who knows, maybe 20 points stronger. Most endgame knowledge is ant work.
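The rule itself is only a few lines; a minimal sketch, with the square encoding and names invented for illustration:

Code: Select all

// Squares are encoded 0..63 as file + 8 * rank (a1 = 0, h8 = 63).
static bool is_light_square(int sq) {
    return ((sq >> 3) + (sq & 7)) & 1;
}

// True if a lone bishop plus rook pawn cannot win because the bishop
// does not control the promotion corner. The caller must still check
// that the defending king reaches that corner.
static bool wrong_bishop(int pawn_sq, int bishop_sq, bool white) {
    int file = pawn_sq & 7;
    if (file != 0 && file != 7)
        return false;                  // only rook pawns have a wrong corner
    int promo_sq = white ? 56 + file : file;
    return is_light_square(promo_sq) != is_light_square(bishop_sq);
}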

Miguel
Greg Strong
Posts: 388
Joined: Sun Dec 21, 2008 6:57 pm
Location: Washington, DC

Re: Extensions, anyone?

Post by Greg Strong »

Ok, new results, this time with a much larger set: 384 games instead of 80. I added the 6 positions from Nunn set 2 for a total of 16 positions, and played against 12 different programs; 16 x 2 x 12 = 384. Time controls were set at game in 2 minutes (2 minutes for each engine, that is). With these settings a run takes almost exactly 24 hours to complete.

I was only able to use two computers, so I ran only two tests: the standard default settings (no Mate Threat extension), and the full Mate Threat extension (a full one-ply extension at all nodes). The win percentage was 79.04% with the default settings and 79.95% with the full Mate Threat extension.
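For anyone unfamiliar with the term: the mate threat extension is usually driven by the null-move search. Roughly like this; the engine interface here is made up, and this is the general idea, not Glaurung's actual source:

Code: Select all

// Hypothetical engine interface, for illustration only.
struct Position {
    void do_null_move();     // give the opponent a free move
    void undo_null_move();
};
int search(Position& pos, int alpha, int beta, int depth);  // engine-specific

const int MATE = 30000;

// If passing (the null move) lets the opponent mate us, the position is
// critical: grant one extra ply of extension for the real search.
int mate_threat_extension(Position& pos, int beta, int depth) {
    pos.do_null_move();
    int v = -search(pos, -beta, -beta + 1, depth - 4);  // reduced-depth probe
    pos.undo_null_move();
    return (v < -MATE + 100) ? 1 : 0;   // mate score against us = mate threat
}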

Mate Threat 0/0: 273-61-50 (79.0365%)
Mate Threat 2/2: 278-58-48 (79.9479%)

I still wouldn't claim that this is 100% conclusive, but it is definitely starting to look like the mate threat extension does help, at least a little. I do need to do more testing, though. Unfortunately, two of the 12 engines I played Glaurung against clearly weren't strong enough to be of any use, as Glaurung won 100% of the games against both of them in both runs. I'll have to find better engines to replace them, and, of course, more test positions would be good too.
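A rough error estimate supports the caution; treating each run's score as a simple proportion (which ignores draws and the shared openings, so it is only an approximation):

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    const double n  = 384.0;
    const double p1 = 0.790365;   // score, Mate Threat 0/0
    const double p2 = 0.799479;   // score, Mate Threat 2/2
    double se = std::sqrt(p1 * (1.0 - p1) / n + p2 * (1.0 - p2) / n);
    std::printf("difference %.4f vs. standard error %.4f (z = %.2f)\n",
                p2 - p1, se, (p2 - p1) / se);   // z is about 0.3: well within noise
    return 0;
}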

Full results for Mate Threat 0/0 (default settings):

Code: Select all

Glaurung-w32 - AnMon 5.60            : 28.5/32 27-2-3 (1101111=11111111111==11011111111)  89%  +363
Glaurung-w32 - Crafty-22.10-win32    : 16.0/32 13-13-6 (=1100100=00=0=011011=0100=111101)  50%    ±0
Glaurung-w32 - Dragon 4.6            : 29.0/32 29-3-0 (11111111111111111011101110111111)  91%  +402
Glaurung-w32 - Fruit-2-3-1           : 22.5/32 21-8-3 (==11101011111=100101011110011111)  70%  +147
Glaurung-w32 - Hermann 2.4 32 bit    : 25.5/32 24-5-3 (01111=1111=1111=1111011110111100)  80%  +241
Glaurung-w32 - Nejmet 3.07           : 32.0/32 32-0-0 (11111111111111111111111111111111) 100% +1200
Glaurung-w32 - ProDeo 1.5            : 24.0/32 23-7-2 (111011111111011=11110110101110=0)  75%  +191
Glaurung-w32 - Ruffian 1.0.5         : 23.5/32 20-5-7 (0110001=====111111111111==111110)  73%  +173
Glaurung-w32 - Rybka 2.2 32 bit      : 17.0/32 12-10-10 (0100111=000==10==0=11=1=101110==)  53%   +21
Glaurung-w32 - Scorpio               : 28.0/32 26-2-4 (=110111110111111=1111=11111111=1)  88%  +346
Glaurung-w32 - SOS 5.1 for Arena     : 32.0/32 32-0-0 (11111111111111111111111111111111) 100% +1200
Glaurung-w32 - Stockfish_12_win32_ja : 20.0/32 14-6-12 (=01111=011=0111101=1==10=====01=)  63%   +92
Full results for Mate Threat 2/2:

Code: Select all

Glaurung-w32 MT2_2 - AnMon 5.60            : 27.0/32 25-3-4 (=11101==111111111111=11111011011)  84%  +288
Glaurung-w32 MT2_2 - Crafty-22.10-win32    : 21.5/32 19-8-5 (1010011011=11=01=11100=1111110=1)  67%  +123
Glaurung-w32 MT2_2 - Dragon 4.6            : 26.5/32 25-4-3 (01=1110111=1=1111111101110111111)  83%  +275
Glaurung-w32 MT2_2 - Fruit-2-3-1           : 22.5/32 21-8-3 (01=1111111111001=00=101101110111)  70%  +147
Glaurung-w32 MT2_2 - Hermann 2.4 32 bit    : 26.5/32 24-3-5 (1=1111110==110111111=111011111=1)  83%  +275
Glaurung-w32 MT2_2 - Nejmet 3.07           : 32.0/32 32-0-0 (11111111111111111111111111111111) 100% +1200
Glaurung-w32 MT2_2 - ProDeo 1.5            : 26.0/32 25-5-2 (111011101110111111110111101==111)  81%  +252
Glaurung-w32 MT2_2 - Ruffian 1.0.5         : 21.5/32 20-9-3 (01111101110=11110011=0110110101=)  67%  +123
Glaurung-w32 MT2_2 - Rybka 2.2 32 bit      : 18.0/32 14-10-8 (==1010100===1=01=1111=1000111010)  56%   +42
Glaurung-w32 MT2_2 - Scorpio               : 30.0/32 28-0-4 (1111111=11111=1111111111111=1=11)  94%  +478
Glaurung-w32 MT2_2 - SOS 5.1 for Arena     : 32.0/32 32-0-0 (11111111111111111111111111111111) 100% +1200
Glaurung-w32 MT2_2 - Stockfish_12_win32_ja : 18.5/32 13-8-11 (10===111==1010110=01=010==10==11)  58%   +56
If someone can recommend a couple more strong (and free!) engines, and possibly a couple more test positions, that would be a help.

Enjoy!
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Extensions, anyone?

Post by Edsel Apostol »

You could try Thinker, Bright and Spike, and if you want a private Twisted Logic version just PM me.

For the test positions you could try the Noomen Test Suite 2008 or the Gambit Suite 2008, available on the Rybka web page: http://www.rybkachess.com/index.php?auswahl=Downloads
Dann Corbit
Posts: 12781
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Extensions, anyone?

Post by Dann Corbit »

michiguel wrote:
Dirt wrote:
bob wrote:If you make several changes over time, with a +7, a -5, a +3, a -1, a +0, a +4, a -3 and a -2 Elo change, overall you come out better. But eliminating those negative changes makes it much better. That was my point...
How confident can we be that the changes will combine in anything close to a linear manner? The sum of the changes you list is only +3, but it wouldn't surprise me if the actual combined effect were +12. An effect like this seems the easiest way to explain Tord's ability to improve Glaurung by such large amounts.

From what I remember of your testing, you've been choosing to remove evaluation terms if testing shows the same strength with or without them. I'm wondering if that might be backwards.
It is backwards if human knowledge tells you that the evaluation term is safe and beneficial. There are many terms that provide maybe 1 Elo point or less; however, we know that they never hurt and that once in a while they do good. For instance, knowledge about the wrong bishop with only a rook pawn left: I do not think you can measure any Elo difference, but the parameter should stay. Once you have added several of these small items, you get an engine that plays better endgames and is, who knows, maybe 20 points stronger. Most endgame knowledge is ant work.

Miguel
I think there are many reasons why these changes are hard to measure.
Another obvious one is that the positions containing the evaluation terms arise only in rare circumstances. In your example of the wrong bishop with a rook pawn, how many games will we have to examine before we find such a game? Many games are won in the middlegame, so these endgame terms look like noise, or may even appear to make the program slightly weaker, if we constantly evaluate terms that occur rarely. But when the conditions do occur, the correct evaluation terms can easily be the difference between winning and losing. (Obviously, we can also improve things by having evaluation functions that only look at the applicable terms by game phase or other suitable criteria.)
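The usual mechanical trick for that last point is to score every term twice, once for the middlegame and once for the endgame, and blend by a material-based phase; a minimal sketch with invented conventions:

Code: Select all

// Blend a term's midgame and endgame values by game phase, where phase
// runs from 256 (full midgame) down to 0 (bare endgame), typically
// derived from the remaining material.
static int tapered(int mg, int eg, int phase) {
    return (mg * phase + eg * (256 - phase)) / 256;
}
// E.g. the wrong-bishop term would get mg = 0, so it only kicks in late.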

Consider something simple like never putting your knight on a1, a8, h1, or h8, except in some dire emergency and with plans to escape immediately. Probably in 99% of games the case would simply be eliminated by search or other simple things, so a metric for it might seem useless. But should it happen in the endgame and escape detection because of the horizon, it could easily result in a lost game.
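As a toy illustration, a piece-square table already expresses this rule with no special-case code; the values are invented:

Code: Select all

// Knight piece-square bonus table, indexed a1 = 0 .. h8 = 63.
// The corners carry a penalty stiff enough that the search only parks
// a knight there when everything else is worse.
static const int knight_pst[64] = {
    -50, -30, -20, -20, -20, -20, -30, -50,   // a1 ... h1
    -30,  -5,   0,   0,   0,   0,  -5, -30,
    -20,   0,   5,  10,  10,   5,   0, -20,
    -20,   0,  10,  15,  15,  10,   0, -20,
    -20,   0,  10,  15,  15,  10,   0, -20,
    -20,   0,   5,  10,  10,   5,   0, -20,
    -30,  -5,   0,   0,   0,   0,  -5, -30,
    -50, -30, -20, -20, -20, -20, -30, -50,   // a8 ... h8
};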

Most important (I think) is that good evaluation terms lead to better-looking chess. Missing or badly scaled terms will (on those few rare occasions when they come into play) make the engine play ugly chess. Many people may not care at all if there is little statistical significance, but I think that beauty can be just as important as math from time to time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Extensions, anyone?

Post by bob »

Greg Strong wrote:Ok, new results, this time with a much larger set: 384 games instead of 80. I added the 6 positions from Nunn set 2 for a total of 16 positions, and played against 12 different programs; 16 x 2 x 12 = 384. Time controls were set at game in 2 minutes (2 minutes for each engine, that is). With these settings a run takes almost exactly 24 hours to complete.

I was only able to use two computers, so I ran only two tests: the standard default settings (no Mate Threat extension), and the full Mate Threat extension (a full one-ply extension at all nodes). The win percentage was 79.04% with the default settings and 79.95% with the full Mate Threat extension.

Mate Threat 0/0: 273-61-50 (79.0365%)
Mate Threat 2/2: 278-58-48 (79.9479%)

I still wouldn't claim that this is 100% conclusive, but it is definitely starting to look like the mate threat extension does help, at least a little. I do need to do more testing, though. Unfortunately, two of the 12 engines I played Glaurung against clearly weren't strong enough to be of any use, as Glaurung won 100% of the games against both of them in both runs. I'll have to find better engines to replace them, and, of course, more test positions would be good too.

If someone can recommend a couple more strong (and free!) engines, and possibly a couple more test positions, that would be a help.

Enjoy!
The set of almost 4,000 positions I use is on my ftp machine at ftp.cis.uab.edu/pub/hyatt...