Thanks, Miguel. Ask and ye shall receive.

michiguel wrote:
"2400 positions to start matches in pgn format."
Best
Dan H.
Truth is I forgot all about it; I still had 4 matches to run.

Ajedrecista wrote:
Hello Ed et al!
Sorry for bumping this topic. I am running a fixed-depth test (depth d vs. depth (d - 1)) using Quazar 0.4 w32 and I took a look at this experiment of Ed's. My questions are the following:
· How is the progress of this experiment?
· Have you reached some conclusions?
I don't know. There are so many elo-systems, I just picked the most simple one as displayed at the end of the page. Do the numbers differ much?

I read chapter 2 (diminishing return overview) of the matches *.1 (DEPTH + 1) and made an artificial rating list, starting with depth 6:
I compute ratings as simple sums: 0 (the offset point); 0 + 180 = 180; 0 + 180 + 147 = 327; etc. I put these ratings on the y axis, while I take ln(depth_i) on the x axis, so there is a logarithmic scale on the x axis. I fit the data points with a line by the method of least squares and I get a coefficient of determination R² ~ 0.9991 using Excel (very good!). The important thing about this line is the slope, because the intercept varies with the offset point: Elo variations with depth will be proportional to the slope and inversely proportional to the depth.

Code: Select all
Depth:  Rating:
------  -------
   6       0
   7      180
   8      327
   9      478
  10      607
  11      734
I chose a line for several reasons: when depth d tends to infinity, ln(d) ~ 1 + 1/2 + 1/3 + ... + 1/d and ln(d - 1) ~ 1 + 1/2 + 1/3 + ... + 1/(d - 1), so delta_x = ln(d) - ln(d - 1) = ln[d/(d - 1)] ~ 1/d. If Y(x) = mx + n, then dY/dx = m; the estimated Elo gain is delta_Y = m*delta_x ~ m/d ---> 0 as d ---> infinity (so diminishing return exists with this model).
A quadratic function fails the same analysis: Y(x) = ax² + bx + c; dY/dx = 2ax + b; delta_x ~ 1/d (the same as before); the estimated Elo gain is delta_Y = (dY/dx)*delta_x = (2ax + b)/d ~ {2a*[d + (d - 1)]/2 + b}/d ~ 2a = constant: diminishing return does not exist with this model (the same holds for other polynomials of higher degree). In dY/dx, I choose the midpoint x ~ [d + (d - 1)]/2 because it makes sense to me.
With the data points of the code box, I get Y(x) ~ 1206.5x - 2169.1 with Excel, where x_i = ln(depth_i). Of course, I do not take error bars into account; they should be more or less ±20 Elo for around 800 games at 95% confidence in the cases of depth = 6 and depth = 11 (for the rest of the tested depths, error bars should be more or less ±14 Elo for around 1600 games at 95% confidence, but I am relying on my memory, which is risky).
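For what it's worth, the fit and the diminishing-return prediction described above can be reproduced in a few lines of plain Python; a minimal sketch using only the ratings from the code box above (variable names are my own; the slope ~1206.5, intercept ~-2169.1 and R² ~ 0.999 match the values quoted here):

```python
import math

# Artificial ratings from the code box above (depth 6..11).
depths = [6, 7, 8, 9, 10, 11]
ratings = [0, 180, 327, 478, 607, 734]

# Least-squares fit of rating against x = ln(depth).
xs = [math.log(d) for d in depths]
n = len(xs)
mx = sum(xs) / n
my = sum(ratings) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ratings))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ratings)

m = sxy / sxx                  # slope, ~1206.5
b = my - m * mx                # intercept, ~-2169.1
r2 = sxy * sxy / (sxx * syy)   # coefficient of determination, ~0.999

print(f"Y(x) ~ {m:.1f}*x + ({b:.1f}), R^2 = {r2:.4f}")

# Diminishing return: the model predicts a gain of
# m*ln(d/(d-1)) ~ m/d Elo per extra ply, shrinking as depth grows.
for d in range(7, 12):
    print(d, round(m * math.log(d / (d - 1)), 1))
```

The last loop makes the m/d argument concrete: the predicted per-ply gain falls monotonically with depth under this model.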
What is curious is that I get a similarly high R² value with my own data points and a least-squares fitted line so far (I compute my ratings with BayesElo, though I probably do not use the best commands)... so I guess I am not doing things too badly!

I am tempted to start a new topic in this subforum with my unfinished results and let people post their own data and/or conclusions. But first I want to know if this kind of model/approach is reasonably good. Please answer with your suggestions, reporting possible errors in my explanation, etc. Thanks in advance!
And from Holland in return.

Regards from Spain.
Hi Dan,

Dan Honeycutt wrote:
Hi Ed,
A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.
Best
Dan H.
Rebel wrote:
I don't know. There are so many elo-systems, I just picked the most simple one as displayed at the end of the page. Do the numbers differ much?

I used exactly your numbers from the end of the page, so the numbers do not differ. I was asking about the validity (or not) of my least-squares line model, but anyway I think it is good enough within my limitations. With your provided data, if I do Y(depth_i) ~ 1206.5*ln(depth_i) - 2169.1 and calculate the errors (rounded to 0.1 Elo) as error_i = Y(depth_i) - rating_i:
Code: Select all
Depth:  Rating:  Error:
------  -------  ------
   6       0      -7.3
   7      180     -1.4
   8      327     12.7
   9      478      3.9
  10      607      2.0
  11      734    -10.0

Here is another source for starting positions: http://kirill-kryukov.com/chess/tools/opening-sampler/

Dan Honeycutt wrote:
Hi Ed,
A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.
Best
Dan H.
Code: Select all
Rank Name          Elo    +    -  games  score  oppo.  draws
   1 Tinker852x64   50   12   12   1168    58%     -9    27%
   2 Tinker863x64   -9    8    8   2519    49%      1    31%
   3 Tinker850x64  -40   11   11   1351    46%     -9    34%
ResultSet-EloRating>los
               Ti  Ti  Ti
Tinker852x64       99  99
Tinker863x64    0      99
Tinker850x64    0   0
Code: Select all
Rank Name          Elo    +    -  games  score  oppo.  draws
   1 Tinker850x64   11   10   10   1415    50%      8    49%
   2 Tinker863x64    8    8    7   2853    52%     -4    46%
   3 Tinker852x64  -18   10   10   1438    46%      8    44%
ResultSet-EloRating>los
               Ti  Ti  Ti
Tinker850x64       65  99
Tinker863x64   34      99
Tinker852x64    0   0
Code: Select all
Rank Name          Elo    +    -  games  score  oppo.  draws
   1 Tinker850x64   22   10   10   1525    51%     18    39%
   2 Tinker863x64   18    8    8   2523    52%     -3    32%
   3 Tinker852x64  -40   13   13    998    44%     18    22%
ResultSet-EloRating>los
               Ti  Ti  Ti
Tinker850x64       68  99
Tinker863x64   31      99
Tinker852x64    0   0
Comments or suggestions welcome.

The best pgns from the link are those from Cody Rowland, Aser Huegra, and Frank Quisinsky, for they all took the time to filter out bad positions.
It should do that. So if you find it is not working, let me know, because then it is a bug I should fix!

brianr wrote:
I am not sure that Winboard actually does use the same position for both engines when asked since I run 4 concurrently (one with fast board updates to see something and 3 with /noGUI) and the .trn file is updated whenever. Will have to look at the pgn's to be sure.
One note. To produce test positions, one typically extracts them from PGN. If you want to extract the most popular 4000 (as just a random number chosen out of thin air) you end up sorting and then counting duplicates. This tends to put similar positions close together.

brianr wrote:
I have been fiddling with Tinker once again this summer. A while back I started testing with much larger numbers of games and used Bayeselo/LOS to see if things improved. At that time I noticed that the results often changed as the number of games increased, such that version A looked better after a few hundred games and version B after a few thousand. This was generally done with Arena.
Eventually, I noticed that I simply could not run enough games to resolve the +/- margins. This was typically with self-testing version A v version B only. Of course, I then tried using a gauntlet of more opponents. This is where things became problematic. Many engines simply don't support fast testing very well. I had many very old versions of numerous engines, but very few seemed to continue running overnight without problems. Years ago I only tested what would be considered long TC games today and the old engine versions were fine.
Recently, I have been using the latest Winboard as a TM, which I like very much for much faster time controls (there is a lot of overhead in Arena, which is useful for other things). I have two quads and typically test at 4 sec + 0.4 inc on the faster and 6 sec + 0.6 on the slower (speed difference is 3:2). I could go faster, especially for eval only changes, but I often change the search and would like to give things a chance to kick in. And, I have 8-10 stable opponents to test with new v old versions.
The latest wrinkle is that the starting positions seem to impact the results more than I would like. Granted, there should be many more games with more opponents, but look at these results (self-play only):
4,000 positions from Adam-Hair-12moves-397457.pgn from
http://kirill-kryukov.com/chess/tools/opening-sampler/

The entire Adam-Hair-12moves-397457.pgn:

Code: Select all
Rank Name          Elo    +    -  games  score  oppo.  draws
   1 Tinker852x64   50   12   12   1168    58%     -9    27%
   2 Tinker863x64   -9    8    8   2519    49%      1    31%
   3 Tinker850x64  -40   11   11   1351    46%     -9    34%
ResultSet-EloRating>los
               Ti  Ti  Ti
Tinker852x64       99  99
Tinker863x64    0      99
Tinker850x64    0   0

4,000 Bob Hyatt starting positions:

Code: Select all
Rank Name          Elo    +    -  games  score  oppo.  draws
   1 Tinker850x64   11   10   10   1415    50%      8    49%
   2 Tinker863x64    8    8    7   2853    52%     -4    46%
   3 Tinker852x64  -18   10   10   1438    46%      8    44%
ResultSet-EloRating>los
               Ti  Ti  Ti
Tinker850x64       65  99
Tinker863x64   34      99
Tinker852x64    0   0

Perhaps I should simply follow Adam's advice:

Code: Select all
Rank Name          Elo    +    -  games  score  oppo.  draws
   1 Tinker850x64   22   10   10   1525    51%     18    39%
   2 Tinker863x64   18    8    8   2523    52%     -3    32%
   3 Tinker852x64  -40   13   13    998    44%     18    22%
ResultSet-EloRating>los
               Ti  Ti  Ti
Tinker850x64       68  99
Tinker863x64   31      99
Tinker852x64    0   0

Comments or suggestions welcome.

The best pgns from the link are those from Cody Rowland, Aser Huegra, and Frank Quisinsky, for they all took the time to filter out bad positions.
PS: Testing is very exacting. It is all too easy to make a mistake with the TC, the starting positions file, where the results go, which engine versions are used, etc. I also use SCID to remove time loss games and PGN-Extract to remove duplicates. Oh, and I still do not understand the various Bayeselo options, but this is very helpful:
http://adamsccpages.blogspot.com/p/comp ... ams.html#k
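For reference, a typical BayesElo session of the kind that produces the ResultSet-EloRating> output shown above might look like this; a sketch, where games.pgn is a placeholder file name and which of mm / mm 1 1 / exactdist to use is exactly the open question discussed here:

```text
bayeselo
ResultSet>readpgn games.pgn
ResultSet>elo
ResultSet-EloRating>mm
ResultSet-EloRating>exactdist
ResultSet-EloRating>ratings
ResultSet-EloRating>los
ResultSet-EloRating>x
ResultSet>x
```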
I generally use just mm for a "small" number of games and mm 1 1 for 100,000+ with exactdist. I am not sure that Winboard actually does use the same position for both engines when asked, since I run 4 concurrently (one with fast board updates to see something and 3 with /noGUI) and the .trn file is updated whenever. Will have to look at the pgn's to be sure.
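The "sorting and then counting duplicates" step mentioned in the note above about producing test positions can be sketched in a few lines; a hypothetical illustration (the function name and the toy games are made up, and real PGN parsing is left out), assuming each game has already been parsed into a list of half-moves:

```python
from collections import Counter

def most_popular_openings(games, plies, n):
    """Count how often each opening prefix (first `plies` half-moves)
    occurs and return the n most common ones, most frequent first."""
    counts = Counter(tuple(moves[:plies]) for moves in games)
    return counts.most_common(n)

# Toy data: each game is a list of half-moves in SAN.
games = [
    ["e4", "e5", "Nf3", "Nc6"],
    ["e4", "e5", "Nf3", "Nf6"],
    ["e4", "e5", "Nf3", "Nc6"],
    ["d4", "d5", "c4", "e6"],
]

# The two most common 4-ply openings; in a sorted dump of these keys,
# lines sharing a prefix sit next to each other, which is the
# clustering effect mentioned in the note above.
top = most_popular_openings(games, 4, 2)
print(top)
```

With a real 12-move position set one would use a much larger `plies` and `n`, and shuffle the selected positions afterwards to break up the clusters of similar openings.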