I am planning to write a couple of technical pages for starting programmers. One is already finished and posted here, The value of an evaluation function; currently in progress is a study about depth and diminishing returns.
I also want to write something about a good testing methodology for the average chess programmer who has only 1 or 2 quads. I am a newbie myself here: in my active days, playing thousands of games on a Pentium 90 / 266 / 450 was not an option, so I can use some advice. Two questions:
1. Say you play 1/1 games on a quad then how many threads can you start? When you use 4 threads are the timings still reliable with all the interventions of the OS? Or is it just better to use 3 threads?
2. Is there something like an increasing reliability when you increase the time control? An example:
Playing 5000 1/all games is as reliable as 2500 5/all games ?
Beginners testing methodology
Moderator: Ras
-
Rebel
- Posts: 7514
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
-
velmarin
- Posts: 1600
- Joined: Mon Feb 21, 2011 9:48 am
Re: Beginners testing methodology
Thanks Ed,
I followed the first one; very interesting.
Sometimes we tend to evaluate too much, and it may not be necessary.
Thank you very much for all your pages. Very instructive.
-
Kempelen
- Posts: 620
- Joined: Fri Feb 08, 2008 10:44 am
- Location: Madrid - Spain
Re: Beginners testing methodology
Rebel wrote: I am planning to write a couple of technical pages for starting programmers. One is already finished and posted here, The value of an evaluation function; currently in progress is a study about depth and diminishing returns.
Hi Ed, your analysis and results are very interesting.
Rebel wrote: I also want to write something about a good testing methodology for the average chess programmer who has only 1 or 2 quads. I am a newbie myself here: in my active days, playing thousands of games on a Pentium 90 / 266 / 450 was not an option, so I can use some advice. Two questions:
This would also be interesting for non-beginner programmers, at least for me.
Rebel wrote: 1. Say you play 1/1 games on a quad then how many threads can you start? When you use 4 threads are the timings still reliable with all the interventions of the OS? Or is it just better to use 3 threads?
My experience is with testing on a Core Duo, where timing conditions also have an effect. In my opinion, based on a mix of intuition and observation, timing disturbances are somewhat random and so affect all engines equally. In a long tournament one could therefore say the effect is negligible.
Rebel wrote: 2. Is there something like an increasing reliability when you increase the time control? An example:
Playing 5000 1/all games is as reliable as 2500 5/all games ?
I suspect longer time controls favor weak engines, because with more time it is easier to spot short tactics.
I would also give you a few ideas about testing for your guide, based on my experience:
* It would be nice to point out when it is safe to stop and dismiss an ongoing tournament. I usually check whether the current Elo is outside the expected value plus the error-bar margin after a reasonable number of games (e.g. 1000).
* Related, though not strictly about testing: have a good system to keep track of changes and their Elo gain/loss. Such a notebook is very useful over time, especially if you repeat tests.
* Also good source versioning and storage (and backups). I have my own tool that stores and retrieves the version I want from a repository.
* Some tips on when to use depth-based testing, short time controls, long time controls, or node-based testing.
* Something I suspect would be useful, but have never experimented with: how to account for draws.
* Tips on when it would be reasonable to increase the number of games to reach a sound conclusion.
regards,
Fermin
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Beginners testing methodology
Rebel wrote: I am planning to write a couple of technical pages for starting programmers. One is already finished and posted here, The value of an evaluation function; currently in progress is a study about depth and diminishing returns.
I also want to write something about a good testing methodology for the average chess programmer who has only 1 or 2 quads. I am a newbie myself here: in my active days, playing thousands of games on a Pentium 90 / 266 / 450 was not an option, so I can use some advice. Two questions:
1. Say you play 1/1 games on a quad then how many threads can you start? When you use 4 threads are the timings still reliable with all the interventions of the OS? Or is it just better to use 3 threads?
Test it. On some machines, I can run one position and get an NPS of N. I then run the same position 4 times in parallel, and still get an NPS of N. 4 threads is perfectly safe. On other machines, this is not true, and 4 threads will each run slower. Memory bandwidth is one issue, cache is another, and how smart the operating system is about recognizing duplicate executable pages and not replicating them in memory is yet another.
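The "test it" advice can be sketched as a small script. This is only an illustration under stated assumptions: it times a fixed amount of work instead of reading NPS from an engine (for a fixed workload the two are equivalent), and the workload is a stand-in Python loop. The command `["./myengine", "bench"]` mentioned in the comment is hypothetical; substitute whatever fixed-depth benchmark your own engine offers.

```python
import subprocess
import sys
import time


def run_parallel(cmd, copies):
    """Start `copies` instances of cmd at once; return wall-clock seconds until all have finished."""
    start = time.perf_counter()
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        for _ in range(copies)
    ]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


def slowdown(cmd, copies):
    """Time for `copies` concurrent runs divided by the time for one run.

    A value close to 1.0 suggests the copies do not disturb each other;
    a clearly larger value suggests that many threads are too many for this machine.
    """
    alone = run_parallel(cmd, 1)
    together = run_parallel(cmd, copies)
    return together / alone


if __name__ == "__main__":
    # Stand-in CPU-bound workload (an assumption, not an engine).
    # Replace with your engine's benchmark, e.g. the hypothetical ["./myengine", "bench"].
    workload = [sys.executable, "-c", "sum(i * i for i in range(3_000_000))"]
    for copies in (2, 3, 4):
        print(copies, "copies: slowdown", round(slowdown(workload, copies), 2))
```

A single measurement is noisy, so it is worth repeating the run a few times and with a workload that lasts at least several seconds before trusting the ratio.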
Rebel wrote: 2. Is there something like an increasing reliability when you increase the time control? An example:
No. Different time controls will test different parts of the engine. Very fast time controls test the efficiency of the main engine control more than anything else, as it is not easy to do 1ms searches without a lot of time jitter that can cause problems. I've measured this specific thing for millions of games. Fast games are, in general, just as accurate as long games in measuring improvement. Unless you are doing something that has an exponential characteristic to it, such as monkeying around with search extensions, where the deeper you go, the more you can extend, which changes the shape of the tree and might look better or worse at deeper (or shallower) time controls. But even MOST search changes don't have problems with rapid testing, in my experiments.
Rebel wrote: Playing 5000 1/all games is as reliable as 2500 5/all games ?
5000 games is simply not enough unless you are looking for 20 Elo type improvements. Most are a fraction of that, requiring many games. Either get more cores, or use much shorter time controls, don't monkey with the number of games.
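To get a rough feel for how the number of games limits the Elo difference that can be resolved, here is a minimal sketch. It assumes independent games, a match score near 50%, a normal approximation, and the standard logistic Elo formula linearised at 50%. The function names and the `draw_ratio` parameter are illustrative and not taken from any rating tool.

```python
import math

# Slope of the logistic Elo curve, in Elo per unit of score, at a 50% score.
ELO_PER_SCORE_AT_50 = 400.0 / math.log(10) * 4


def elo_margin(games, draw_ratio=0.0, z=1.96):
    """Approximate margin of error in Elo (z=1.96 gives about 95%) for a match score near 50%."""
    # With equal win and loss chances, the per-game variance of the score is 0.25 * (1 - draw_ratio).
    var_per_game = 0.25 * (1.0 - draw_ratio)
    return z * ELO_PER_SCORE_AT_50 * math.sqrt(var_per_game / games)


def games_needed(margin_elo, draw_ratio=0.0, z=1.96):
    """Games required to shrink the margin of error to `margin_elo` under the same assumptions."""
    var_per_game = 0.25 * (1.0 - draw_ratio)
    return math.ceil(var_per_game * (z * ELO_PER_SCORE_AT_50 / margin_elo) ** 2)


if __name__ == "__main__":
    for n in (1000, 5000, 20000):
        print(n, "games: about +/-", round(elo_margin(n, draw_ratio=0.3), 1), "Elo")
    print("games for +/-5 Elo:", games_needed(5, draw_ratio=0.3))
```

The margin shrinks only with the square root of the number of games, so halving it takes four times as many games. When two versions are each measured in their own match, the margin on their difference is larger again by a factor of about sqrt(2).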
-
Rebel
- Posts: 7514
- Joined: Thu Aug 18, 2011 12:04 pm
- Full name: Ed Schröder
Re: Beginners testing methodology
This was useful. Thanks Bob.
-
jdart
- Posts: 4427
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: Beginners testing methodology
bob wrote: Unless you are doing something that has an exponential characteristic to it, such as monkeying around with search extensions, where the deeper you go, the more you can extend, which changes the shape of the tree and might look better or worse at deeper (or shallower) time controls. But even MOST search changes don't have problems with rapid testing, in my experiments.
Most engines now do things like scaling LMR with increasing depth, which is not going to happen at low depths, and there are other typically depth-dependent things such as IID, null-move verification, etc.
I think there is probably some minimum search depth, below which you cease to be exercising significant parts of the search. Do you have such a minimum you'd recommend?
I have recently used game in 10sec + 0.1 sec increment, which gets me a depth between 12 (opening) and 20-30 (endgame). This seems like enough depth to me but I may be conservative here.
--Jon
-
Ajedrecista
- Posts: 2201
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Some questions.
Hello Ed et al!
Sorry for bumping this topic. I am conducting a fixed-depth test (depth d vs. depth (d - 1)) using Quazar 0.4 w32, and I took a look at Ed's experiment. My questions are the following:
· How is the experiment progressing?
· Have you reached any conclusions?
I read chapter 2 (diminishing return overview) of matches *.1 (DEPTH + 1) and made an artificial rating list, starting with depth 6:
Code: Select all
Depth:  Rating:
------  -------
   6        0
   7      180
   8      327
   9      478
  10      607
  11      734
I compute ratings as simple sums: 0 (the offset point); 0 + 180 = 180; 0 + 180 + 147 = 327; etc. I put these ratings on the y axis and ln(depth_i) on the x axis, so the x axis has a logarithmic scale. I fit a line to the data points by the method of least squares and get a coefficient of determination R² ~ 0.9991 using Excel (very good!). The important thing about this line is the slope, because the intercept varies with the offset point: Elo variations with depth will be proportional to the slope and inversely proportional to depth.
I chose a line for several reasons. When depth d tends to infinity, ln(d) ~ 1 + 1/2 + 1/3 + ... + 1/d and ln(d - 1) ~ 1 + 1/2 + 1/3 + ... + 1/(d - 1), so delta_x = ln(d) - ln(d - 1) = ln[d/(d - 1)] ~ 1/d. If Y(x) = mx + n, then dY/dx = m and the estimated Elo gain is delta_Y = m*delta_x ~ m/d ---> 0 as d ---> infinity (diminishing returns exist with this model).
A quadratic function fails under the same analysis: Y(x) = ax² + bx + c; dY/dx = 2ax + b; delta_x ~ 1/d (the same as before); estimated Elo gain = delta_Y = (dY/dx)*delta_x = (2ax + b)/d ~ {2a*[d + (d - 1)]/2 + b}/d ~ 2a = constant: diminishing returns do not exist with this model (the same holds for polynomials of higher degree). In dY/dx, I chose the mean x ~ [d + (d - 1)]/2 because it makes sense to me.
With the data points of the code box, I get Y(x) ~ 1206.5x - 2169.1 with Excel, where x_i = ln(depth_i). Of course, I do not take error bars into account, which should be more or less ±20 Elo for around 800 games and 95% confidence in the cases of depth = 6 and depth = 11 (for the rest of the tested depths, error bars should be more or less ±14 Elo for around 1600 games and 95% confidence, but I am relying on my memory, something very risky).
What is curious is that I have a similar, very high R² value with my own data points and least-squares line up to now (I compute my ratings with BayesElo but I probably do not use the best commands)... so I guess that I am not doing things extremely badly!
I am tempted to start a new topic in this subforum with my unfinished results and let people post their own data and/or conclusions. But first I want to know if this kind of model/approach is reasonably good. Please answer with your suggestions, reporting possible errors in my explanation, etc. Thanks in advance!
Regards from Spain.
Ajedrecista.
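The least-squares fit described in this post can be reproduced without Excel. The sketch below uses only the six data points from the code box and plain Python; the helper names `fit_line` and `predicted_gain` are illustrative. The extrapolation to deeper plies relies entirely on the post's log-linear model and ignores the error bars.

```python
import math

# Data points from the post: cumulative rating at each fixed depth.
depths = [6, 7, 8, 9, 10, 11]
ratings = [0, 180, 327, 478, 607, 734]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = sxy * sxy / (sxx * syy)
    return slope, intercept, r2


# x axis is ln(depth), as in the post.
slope, intercept, r2 = fit_line([math.log(d) for d in depths], ratings)


def predicted_gain(depth):
    """Model estimate of the Elo gain of `depth` over `depth - 1`: slope * ln(depth / (depth - 1))."""
    return slope * math.log(depth / (depth - 1))


if __name__ == "__main__":
    print("slope", round(slope, 1), "intercept", round(intercept, 1), "R^2", round(r2, 4))
    for d in (12, 16, 20):
        print("depth", d, "vs", d - 1, ":", round(predicted_gain(d)), "Elo")
```

The fitted slope and intercept should come out close to the 1206.5 and -2169.1 reported above, and `predicted_gain` makes the m/d shrinkage of the per-ply gain concrete.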
-
CRoberson
- Posts: 2095
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: Beginners testing methodology
On a quad, I run 4 games at a time. Pondering is turned off and all programs are single threaded.
Here is how I test.
First test: A standard set of benchmarks. Some changes should improve speed, but not node counts!
Assuming all went well, Second test: Large number of high speed games. I will watch the result go by for the first 20 games just to be sure things are on track.
Assuming all went well, Third test: Smaller number of games against different opponents at much longer time controls.
Assuming all went well, Fourth test: Games on ICC or FICS or ... with humans and other computers.
During each of the 2nd - 4th tests, I will take some of the losses and look them over myself to see how the changes modified the play.
Now, here is the difference between what I do and what some others do. It is not until now that I release a new version to some of the rating groups. I am amazed at the guys that release a much larger number of versions and can't tell you if the new version is better or not.
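The first step above (a benchmark where speed may change but node counts should not) could be checked mechanically along these lines. This is a sketch under assumptions: the position names and figures are made up for illustration, and how you collect nodes and seconds from your own engine is left open.

```python
def compare_bench(old, new):
    """Compare two benchmark runs.

    `old` and `new` map a position name to a (nodes, seconds) tuple.
    Returns the list of positions whose node counts differ, plus the overall speedup
    (total old time divided by total new time).
    """
    mismatches = [name for name in old if old[name][0] != new[name][0]]
    old_time = sum(seconds for _, seconds in old.values())
    new_time = sum(new[name][1] for name in old)
    return mismatches, old_time / new_time


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    before = {"pos1": (1_200_000, 2.0), "pos2": (800_000, 1.5)}
    after = {"pos1": (1_200_000, 1.8), "pos2": (800_000, 1.2)}
    changed, speedup = compare_bench(before, after)
    print("node count changed in:", changed)
    print("speedup:", round(speedup, 2))
```

For a change meant to be a pure speed-up, an empty mismatch list is the expected result; any position listed there means the search itself changed and the later game tests are needed to judge it.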
-
Dan Honeycutt
- Posts: 5258
- Joined: Mon Feb 27, 2006 4:31 pm
- Location: Atlanta, Georgia
Re: Beginners testing methodology
Hi Ed,
A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.
Best
Dan H.
-
michiguel
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Beginners testing methodology
Dan Honeycutt wrote: Hi Ed,
A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.
Best
Dan H.
https://sites.google.com/site/gaviotach ... ects=0&d=1
"2400 positions to start matches in pgn format. Most are book positions after ~10 moves. Randomly sorted. "
from
https://sites.google.com/site/gaviotach ... e/download
Miguel