Beginners testing methodology

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Dan Honeycutt
Posts: 5258
Joined: Mon Feb 27, 2006 4:31 pm
Location: Atlanta, Georgia

Re: Beginners testing methodology

Post by Dan Honeycutt »

michiguel wrote:"2400 positions to start matches in pgn format.
Thanks, Miguel. Ask and ye shall receive.

Best
Dan H.
User avatar
Rebel
Posts: 7514
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Some questions.

Post by Rebel »

Ajedrecista wrote:Hello Ed et al!

Sorry for bumping this topic. I am conducting fixed-depth testing (depth d vs. depth (d - 1)) using Quazar 0.4 w32 and I took a look at this experiment of Ed's. My questions are the following:

· How is the progress of this experiment?
· Have you reached some conclusions?
Truth is, I had forgotten all about the 4 matches I still had to run :mrgreen:

I have restarted the experiment (9 vs 11 running).

Thanks for the reminder.
I read chapter 2 (diminishing return overview) of the *.1 matches (DEPTH + 1) and made an artificial rating list, starting with depth 6:

Code: Select all

Depth:     Rating:
------     -------

  6            0
  7          180
  8          327
  9          478
 10          607
 11          734
I compute ratings as simple sums: 0 (the offset point); 0 + 180 = 180; 0 + 180 + 147 = 327; etc. I put these ratings on the y axis and take ln(depth_i) on the x axis, so the x axis is on a logarithmic scale. I fit the data points with a line by the method of least squares and get a coefficient of determination R² ~ 0.9991 using Excel (very good!). The important part of this line is the slope, because the intercept varies with the offset point: Elo variations with depth will be proportional to the slope and inversely proportional to depth.

I choose a line for a few reasons: when depth d tends to infinity, ln(d) ~ 1 + 1/2 + 1/3 + ... + 1/d and ln(d - 1) ~ 1 + 1/2 + 1/3 + ... + 1/(d - 1), so delta_x = ln(d) - ln(d - 1) = ln[d/(d - 1)] ~ 1/d. If Y(x) = mx + n, then dY/dx = m, and the estimated Elo gain is delta_Y = m*delta_x ~ m/d ---> 0 as d ---> infinity (diminishing returns exist with this model).

A quadratic function fails the same analysis: Y(x) = ax² + bx + c; dY/dx = 2ax + b; delta_x ~ 1/d (the same as before); the estimated Elo gain is delta_Y = (dY/dx)*delta_x = (2ax + b)/d ~ {2a*[d + (d - 1)]/2 + b}/d ~ 2a = constant, so diminishing returns do not exist with this model (the same holds for higher-degree polynomials). In dY/dx, I choose the arithmetic mean x ~ [d + (d - 1)]/2 because it makes sense to me.

With the data points of the code box, I get Y(x) ~ 1206.5x - 2169.1 with Excel, where x_i = ln(depth_i). Of course, I do not take error bars into account; they should be more or less ±20 Elo for around 800 games at 95% confidence in the cases of depth = 6 and depth = 11 (for the rest of the tested depths, error bars should be more or less ±14 Elo for around 1600 games at 95% confidence, but I am relying on my memory, which is very risky).
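The same fit is easy to reproduce outside Excel. A minimal Python/numpy sketch, using only the rating table from the code box above:

Code: Select all

import numpy as np

# Ratings from the code box above (cumulative sums, depth 6 as the offset point)
depths  = np.array([6, 7, 8, 9, 10, 11], dtype=float)
ratings = np.array([0, 180, 327, 478, 607, 734], dtype=float)

x = np.log(depths)                            # x_i = ln(depth_i)
slope, intercept = np.polyfit(x, ratings, 1)  # least-squares line Y(x) = m*x + n

predicted = slope * x + intercept
ss_res = np.sum((ratings - predicted) ** 2)
ss_tot = np.sum((ratings - ratings.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                    # coefficient of determination

print(f"Y(x) ~ {slope:.1f}*x {intercept:+.1f},  R^2 ~ {r2:.4f}")
# Prints values very close to the Excel fit quoted above.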

What is curious is that I get a similarly high R² value with my own data points and least-squares line so far (I compute my ratings with BayesElo, but I probably do not use the best commands)... so I guess I am not doing things terribly badly! :) I am tempted to start a new topic in this subforum with my unfinished results and let people post their own data and/or conclusions. But first I want to know if this kind of model/approach is reasonably good. Please answer with your suggestions, report possible errors in my explanation, etc. Thanks in advance!
I don't know. There are so many Elo systems; I just picked the simplest one, as displayed at the end of the page. Do the numbers differ much?
Regards from Spain.
And from Holland in return.
User avatar
Rebel
Posts: 7514
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Re: Beginners testing methodology

Post by Rebel »

Dan Honeycutt wrote:Hi Ed,

A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.

Best
Dan H.
Hi Dan,

I never test with book, always from a PGN file. In my case I simply use grandmaster games starting from move 11.

There is so much stuff available, just a few:

http://www.chessgameslinks.lars-balzer.info/
http://www.chess-poster.com/english/pgn/pgn_files.htm
User avatar
Ajedrecista
Posts: 2201
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Some questions.

Post by Ajedrecista »

Hi again:
Rebel wrote:I don't know. There are so many Elo systems; I just picked the simplest one, as displayed at the end of the page. Do the numbers differ much?
I used exactly your numbers from the end of the page, so the numbers do not differ. I was asking about the validity (or not) of my model of a least-squares line, but anyway I think it is good enough within my limitations. With your data, if I take Y(depth_i) ~ 1206.5*ln(depth_i) - 2169.1 and calculate errors (rounded to 0.1 Elo) as error_i = Y(depth_i) - rating_i:

Code: Select all

Depth:     Rating:     Error:
------     -------     ------

  6            0        -7.3
  7          180        -1.4
  8          327        12.7
  9          478         3.9
 10          607         2.0
 11          734       -10.0
Errors are not too large IMHO, fortunately. With this model, I expect an advantage of around 105 Elo for depth 12 over depth 11 in ProDeo 1.74: Y(12) - Y(11) ~ 1206.5*ln(12/11) ~ 105 Elo. Who knows? There are still too few data points...
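That arithmetic is easy to check; a short Python sketch that recomputes the error column and the predicted depth-12 gain from the fitted line quoted above:

Code: Select all

import math

slope, intercept = 1206.5, -2169.1               # fitted line quoted above
ratings = {6: 0, 7: 180, 8: 327, 9: 478, 10: 607, 11: 734}

for depth, rating in ratings.items():
    predicted = slope * math.log(depth) + intercept
    print(f"depth {depth:2d}: error {predicted - rating:+6.1f} Elo")

# Predicted gain of depth 12 over depth 11; the intercept cancels out.
print(f"depth 12 vs depth 11: ~{slope * math.log(12 / 11):.0f} Elo")   # ~105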

I will upload my games and results in a new, separate topic: cutechess-cli 0.5.1 has played 22500 games so far! I want to add 2500 or 5000 more games in the next months (once CuteChess GUI 1.0 is out with the feature of pausing and resuming matches), but I probably will not manage it because each new 2500-game match takes an insane amount of time for me.

Regards from Spain.

Ajedrecista.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Beginners testing methodology

Post by Adam Hair »

Dan Honeycutt wrote:Hi Ed,

A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.

Best
Dan H.
Here is another source for starting positions: http://kirill-kryukov.com/chess/tools/opening-sampler/

The best pgns from the link are those from Cody Rowland, Aser Huegra, and Frank Quisinsky, for they all took the time to filter out bad positions. The positions that I contributed are distilled from all of the openings used by various rating lists. In theory, they should be pretty good. In practice, there are a few warts. But, if you need thousands of positions for some tests, they will work for you.
User avatar
Dan Honeycutt
Posts: 5258
Joined: Mon Feb 27, 2006 4:31 pm
Location: Atlanta, Georgia

Re: Beginners testing methodology

Post by Dan Honeycutt »

Thanks Adam and Ed for the position links. I think I'm set for life :-)

Best
Dan H.
brianr
Posts: 540
Joined: Thu Mar 09, 2006 3:01 pm
Full name: Brian Richardson

Starting Position Test Sets Re:Beginners testing methodology

Post by brianr »

I have been fiddling with Tinker once again this summer. A while back I started testing with much larger numbers of games and used Bayeselo/LOS to see if things improved. At that time I noticed that the results often changed as the number of games increased, such that version A looked better after a few hundred games and version B after a few thousand. This was generally done with Arena.

Eventually, I noticed that I simply could not run enough games to resolve the +/- margins. This was typically with self-testing version A v version B only. Of course, I then tried using a gauntlet of more opponents. This is where things became problematic. Many engines simply don't support fast testing very well. I had many very old versions of numerous engines, but very few seemed to continue running overnight without problems. Years ago I only tested what would be considered long TC games today and the old engine versions were fine.

Recently, I have been using the latest Winboard as a TM, which I like very much for much faster time controls (there is a lot of overhead in Arena, which is useful for other things). I have two quads and typically test at 4 sec + 0.4 inc on the faster and 6 sec + 0.6 on the slower (speed difference is 3:2). I could go faster, especially for eval only changes, but I often change the search and would like to give things a chance to kick in. And, I have 8-10 stable opponents to test with new v old versions.

The latest wrinkle is that the starting positions seem to impact the results more than I would like. Granted, there should be many more games with more opponents, but look at these results (self-play only):

4,000 positions from Adam-Hair-12moves-397457.pgn from
http://kirill-kryukov.com/chess/tools/opening-sampler/

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker852x64    50   12   12  1168   58%    -9   27%
   2 Tinker863x64    -9    8    8  2519   49%     1   31%
   3 Tinker850x64   -40   11   11  1351   46%    -9   34%
ResultSet-EloRating>los
              Ti Ti Ti
Tinker852x64     99 99
Tinker863x64   0    99
Tinker850x64   0  0
The entire Adam-Hair-12moves-397457.pgn

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker850x64    11   10   10  1415   50%     8   49%
   2 Tinker863x64     8    8    7  2853   52%    -4   46%
   3 Tinker852x64   -18   10   10  1438   46%     8   44%
ResultSet-EloRating>los
              Ti Ti Ti
Tinker850x64     65 99
Tinker863x64  34    99
Tinker852x64   0  0
4,000 Bob Hyatt starting positions

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker850x64    22   10   10  1525   51%    18   39%
   2 Tinker863x64    18    8    8  2523   52%    -3   32%
   3 Tinker852x64   -40   13   13   998   44%    18   22%
ResultSet-EloRating>los
              Ti Ti Ti
Tinker850x64     68 99
Tinker863x64  31    99
Tinker852x64   0  0
Perhaps I should simply follow Adam's advice:
The best pgns from the link are those from Cody Rowland, Aser Huegra, and Frank Quisinsky, for they all took the time to filter out bad positions.
Comments or suggestions welcome.

PS: Testing is very exacting. It is all too easy to make a mistake with the TC, the starting positions file, where the results go, which engine versions are used, etc. I also use SCID to remove time-loss games and PGN-Extract to remove duplicates. Oh, and I still do not understand the various Bayeselo options, but this is very helpful:
http://adamsccpages.blogspot.com/p/comp ... ams.html#k

I generally use just mm for a "small" number of games and mm 1 1 for 100,000+ with exactdist. I am not sure that Winboard actually does use the same position for both engines when asked, since I run 4 concurrently (one with fast board updates to see something and 3 with /noGUI) and the .trn file is updated whenever. Will have to look at the pgn's to be sure.
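For what it's worth, a typical Bayeselo session built from the commands mentioned above looks something like this, with mm 1 1 in place of mm for the very large runs (a sketch; results.pgn stands for whatever PGN the games were saved to, and Bayeselo's own help lists the full set of options):

Code: Select all

bayeselo
ResultSet>readpgn results.pgn
ResultSet>elo
ResultSet-EloRating>mm
ResultSet-EloRating>exactdist
ResultSet-EloRating>ratings
ResultSet-EloRating>los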
User avatar
hgm
Posts: 28464
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Starting Position Test Sets Re:Beginners testing methodo

Post by hgm »

brianr wrote:I am not sure that Winboard actually does use the same position for both engines when asked since I run 4 concurrently (one with fast board updates to see something and 3 with /noGUI) and the .trn file is updated whenever. Will have to look at the pgn's to be sure.
It should do that. So if you find it is not working, let me know, because then it is a bug I should fix!

(WinBoard broadcasts the random seed in the tourney file when the latter is first created, and all instances running from the file should then use that same random seed to derive secondary random seeds before each game, so that no matter on which instance a game is played, it always uses the same random seed.)
jdart
Posts: 4427
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Starting Position Test Sets Re:Beginners testing methodo

Post by jdart »

A few comments:

> 4 sec + 0.4 inc on the faster and 6 sec + 0.6

I think most testers are using something faster than this.

Re starting positions: I don't think it should matter that much. Recently I just use unique 12-ply game positions from a large PGN set, sorted by frequency, and take the top N positions down to a minimum frequency of occurrence, in random order. The PGN is here: http://www.arasanchess.org/testpos.pgn (about 4500 positions).

It is a pretty good variety, is representative of actual games, and will tend to exclude bad positions because those will not occur at high frequency.
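That kind of extraction is straightforward to script. A rough sketch with the python-chess library (just one possible tool, not necessarily what was used to build testpos.pgn; games.pgn below is a placeholder input file):

Code: Select all

import collections
import random

import chess.pgn   # python-chess package


def popular_positions(pgn_path, plies=12, top_n=4000):
    """Collect the position after `plies` half-moves of every game, keep the
    `top_n` most frequent ones, and return them in random order."""
    counts = collections.Counter()
    with open(pgn_path) as handle:
        while True:
            game = chess.pgn.read_game(handle)
            if game is None:
                break
            board = game.board()
            moves = list(game.mainline_moves())
            if len(moves) < plies:
                continue
            for move in moves[:plies]:
                board.push(move)
            # EPD drops the move counters, so the same position always maps to one key
            counts[board.epd()] += 1
    popular = [epd for epd, _ in counts.most_common(top_n)]
    random.shuffle(popular)   # random order, as described above
    return popular


positions = popular_positions("games.pgn")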

Also, I have found this much easier to do on Linux. I have a few scripts now that distribute a new version to the test machines, run the tests (using "disown" so I don't need to stay connected to the shell that starts the tests) and gather up and analyze the results. If I need to monitor it I can get a snapshot of the current results with another perl script.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Starting Position Test Sets Re:Beginners testing methodo

Post by bob »

brianr wrote:I have been fiddling with Tinker once again this summer. A while back I started testing with much larger numbers of games and used Bayeselo/LOS to see if things improved. At that time I noticed that the results often changed as the number of games increased, such that version A looked better after a few hundred games and version B after a few thousand. This was generally done with Arena.

Eventually, I noticed that I simply could not run enough games to resolve the +/- margins. This was typically with self-testing version A v version B only. Of course, I then tried using a gauntlet of more opponents. This is where things became problematic. Many engines simply don't support fast testing very well. I had many very old versions of numerous engines, but very few seemed to continue running overnight without problems. Years ago I only tested what would be considered long TC games today and the old engine versions were fine.

Recently, I have been using the latest Winboard as a TM, which I like very much for much faster time controls (there is a lot of overhead in Arena, which is useful for other things). I have two quads and typically test at 4 sec + 0.4 inc on the faster and 6 sec + 0.6 on the slower (speed difference is 3:2). I could go faster, especially for eval only changes, but I often change the search and would like to give things a chance to kick in. And, I have 8-10 stable opponents to test with new v old versions.

The latest wrinkle is that the starting positions seem to impact the results more than I would like. Granted, there should be many more games with more opponents, but look at these results (self-play only):

4,000 positions from Adam-Hair-12moves-397457.pgn from
http://kirill-kryukov.com/chess/tools/opening-sampler/

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker852x64    50   12   12  1168   58%    -9   27%
   2 Tinker863x64    -9    8    8  2519   49%     1   31%
   3 Tinker850x64   -40   11   11  1351   46%    -9   34%
ResultSet-EloRating>los
              Ti Ti Ti
Tinker852x64     99 99
Tinker863x64   0    99
Tinker850x64   0  0
The entire Adam-Hair-12moves-397457.pgn

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker850x64    11   10   10  1415   50%     8   49%
   2 Tinker863x64     8    8    7  2853   52%    -4   46%
   3 Tinker852x64   -18   10   10  1438   46%     8   44%
ResultSet-EloRating>los
              Ti Ti Ti
Tinker850x64     65 99
Tinker863x64  34    99
Tinker852x64   0  0
4,000 Bob Hyatt starting positions

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker850x64    22   10   10  1525   51%    18   39%
   2 Tinker863x64    18    8    8  2523   52%    -3   32%
   3 Tinker852x64   -40   13   13   998   44%    18   22%
ResultSet-EloRating>los
              Ti Ti Ti
Tinker850x64     68 99
Tinker863x64  31    99
Tinker852x64   0  0
Perhaps I should simply follow Adam's advice:
The best pgns from the link are those from Cody Rowland, Aser Huegra, and Frank Quisinsky, for they all took the time to filter out bad positions.
Comments or suggestions welcome.

PS: Testing is very exacting. It is all too easy to make a mistake with the TC, the starting positions file, where the results go, which engine versions are used, etc. I also use SCID to remove time-loss games and PGN-Extract to remove duplicates. Oh, and I still do not understand the various Bayeselo options, but this is very helpful:
http://adamsccpages.blogspot.com/p/comp ... ams.html#k

I generally use just mm for a "small" number of games and mm 1 1 for 100,000+ with exactdist. I am not sure that Winboard actually does use the same position for both engines when asked, since I run 4 concurrently (one with fast board updates to see something and 3 with /noGUI) and the .trn file is updated whenever. Will have to look at the pgn's to be sure.
One note. To produce test positions, one typically extracts them from PGN. If you want to extract the most popular 4000 (just a number chosen out of thin air), you end up sorting and then counting duplicates. This tends to put similar positions close together.

For my test suite, I had a large "pool" of positions after sorting and extracting. I then randomized the set after duplicates were removed, and extracted the first 4,000. My intent (and I was maybe partially successful) was to address the first point you mentioned: the results change too much between 200 games and 4000 games played. By randomizing the positions, I see less of this, although there are still fluctuations as the error bar shrinks while the number of games played climbs. But not nearly so much as with the originally sorted data, which for whatever reason appeared to have the positions Crafty was poor at near the top of the list, so that it would start off badly but then climb as games were played.
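In script form that shuffle-and-truncate step is tiny; a sketch with made-up file names (pool.epd holding the full extracted pool, one position per line):

Code: Select all

import random

with open("pool.epd") as f:                      # hypothetical input file
    pool = list(dict.fromkeys(line.strip() for line in f if line.strip()))  # de-duplicate, keep order

random.seed(2012)        # any fixed seed keeps the suite reproducible
random.shuffle(pool)     # break up runs of similar positions
with open("suite_4000.epd", "w") as out:         # hypothetical output file
    out.write("\n".join(pool[:4000]) + "\n")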

I still consider 4000 games to be too small, as your LOS scores show.