Beginners testing methodology

Rebel · Post by **Rebel** » Tue Jun 19, 2012 11:12 am

I am planning to write a couple of technical pages for beginning programmers. One is already finished and posted here: The value of an evaluation function. Currently in progress is a study about depth and diminishing returns.

I also want to write something about a good testing methodology for the average chess programmer who has only 1 or 2 quads. I am a newbie here myself: in my active days, playing thousands of games on a Pentium 90 / 266 / 450 was not an option, so I can use some advice. Two questions:

1. Say you play 1/1 games on a quad then how many threads can you start? When you use 4 threads are the timings still reliable with all the interventions of the OS? Or is it just better to use 3 threads?

2. Is there something like an increasing reliability when you increase the time control? An example:

Is playing 5000 1/all games as reliable as playing 2500 5/all games?

velmarin · Post by **velmarin** » Tue Jun 19, 2012 12:21 pm

Thanks Ed,

I followed the first one; very interesting.
Sometimes we tend to evaluate too much, when it may not be necessary.

Thank you very much for all your pages. Very instructive.

Kempelen · Post by **Kempelen** » Tue Jun 19, 2012 1:56 pm

Rebel wrote: I am planning to write a couple of technical pages for beginning programmers. One is already finished and posted here: The value of an evaluation function. Currently in progress is a study about depth and diminishing returns.

Hi Ed, your analysis and results are very interesting.

Rebel wrote: I also want to write something about a good testing methodology for the average chess programmer who has only 1 or 2 quads. I am a newbie here myself: in my active days, playing thousands of games on a Pentium 90 / 266 / 450 was not an option, so I can use some advice. Two questions:

This would be interesting for non-beginner programmers too, at least for me.

Rebel wrote: 1. Say you play 1/1 games on a quad then how many threads can you start? When you use 4 threads are the timings still reliable with all the interventions of the OS? Or is it just better to use 3 threads?

My experience comes from testing on a Core Duo, where timing conditions are also affected. In my opinion, based on a mix of intuition and observation, timing disturbances are somewhat random and therefore affect all engines equally. So in a long tournament one could say the effect is negligible.

Rebel wrote: 2. Is there something like an increasing reliability when you increase the time control? An example:

Is playing 5000 1/all games as reliable as playing 2500 5/all games?

I suspect longer time controls favor weak engines, because with more time it is easier to spot short tactics.

I would also give you a few ideas about testing for your guide, based on my experience:
* It would be nice to point out when it is safe to stop and dismiss an ongoing tournament. I usually check whether the current Elo is outside the expected Elo ± error-margin window after a reasonable number of games (e.g. 1000).
* Related, though not strictly about testing: have a good system to keep track of changes and their Elo gain/loss. Such a notebook is very useful over time, especially if you repeat tests.
* Also good source versioning and storage (and backups). I have my own tool that stores and retrieves the version I want from a repository.
* Some tips on when to use fixed-depth testing, short time controls, long time controls, or node-based testing.
* Something I suspect would be useful, but have never toyed with: how to account for draws.
* Tips on when it would be reasonable to increase the number of games to reach a sound conclusion.
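The stopping check in the first bullet can be sketched in Python. This is only an illustration under stated assumptions: the names `elo_with_margin` and `can_stop` are hypothetical, the interval uses a normal approximation with z = 1.96 (about 95%), and the 1000-game minimum is taken from the bullet above.

```python
import math

def elo_from_score(score):
    """Convert a score fraction (0 < score < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) for a match result.

    Assumption: normal approximation of the mean score; z=1.96 gives
    a two-sided interval of roughly 95%.
    """
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / games
    stderr = math.sqrt(var / games)
    # The interval is built on the score and then mapped to Elo.
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return elo_from_score(score), low, high

def can_stop(wins, draws, losses, expected_elo=0.0, min_games=1000):
    """True when enough games were played and expected_elo lies outside the interval."""
    if wins + draws + losses < min_games:
        return False
    _, low, high = elo_with_margin(wins, draws, losses)
    return expected_elo < low or expected_elo > high

if __name__ == "__main__":
    elo, low, high = elo_with_margin(400, 300, 300)
    print(f"Elo {elo:+.1f}  interval [{low:+.1f}, {high:+.1f}]")
    print("stop:", can_stop(400, 300, 300))
```

Checking the interval repeatedly while a match is running raises the chance of a false positive, so a rule like this is best treated as a rough guide rather than a strict test.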

regards,
Fermin

bob · Post by **bob** » Tue Jun 19, 2012 6:04 pm

Rebel wrote: I am planning to write a couple of technical pages for beginning programmers. One is already finished and posted here: The value of an evaluation function. Currently in progress is a study about depth and diminishing returns.

I also want to write something about a good testing methodology for the average chess programmer who has only 1 or 2 quads. I am a newbie here myself: in my active days, playing thousands of games on a Pentium 90 / 266 / 450 was not an option, so I can use some advice. Two questions:

1. Say you play 1/1 games on a quad then how many threads can you start? When you use 4 threads are the timings still reliable with all the interventions of the OS? Or is it just better to use 3 threads?

Test it. On some machines, I can run one position and get an NPS of N. I then run the same position 4 times in parallel, and still get an NPS of N. 4 threads is perfectly safe. On other machines, this is not true, and 4 threads will each run slower. Memory bandwidth is one issue, cache is another, and how smart the operating system is about recognizing duplicate executable pages and not replicating them in memory is yet another.
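A minimal sketch of that kind of check, in Python. It assumes a plain busy loop as a stand-in for an engine search (the helper names `busy_work` and `relative_speed` are hypothetical). A busy loop barely touches memory, so it will not show the bandwidth and cache effects described above; running the engine's own benchmark in parallel is the more faithful test.

```python
import multiprocessing as mp
import time

def busy_work(iterations):
    """CPU-bound stand-in for a search; returns iterations per second."""
    start = time.perf_counter()
    total = 0
    for i in range(iterations):
        total += i * i
    elapsed = time.perf_counter() - start
    return iterations / elapsed

def relative_speed(single_rate, parallel_rates):
    """Slowest parallel rate as a fraction of the single-process rate."""
    return min(parallel_rates) / single_rate

if __name__ == "__main__":
    iterations = 5_000_000
    single = busy_work(iterations)
    for workers in (2, 3, 4):
        # Run the same workload in `workers` processes at once.
        with mp.Pool(workers) as pool:
            rates = pool.map(busy_work, [iterations] * workers)
        ratio = relative_speed(single, rates)
        print(f"{workers} workers: {ratio:.2f} of single-process speed")
```

A ratio close to 1.00 with 4 workers suggests that 4 concurrent games are safe on that machine; a clearly lower ratio suggests dropping to 3.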

2. Is there something like an increasing reliability when you increase the time control? An example:

Is playing 5000 1/all games as reliable as playing 2500 5/all games?

No. Different time controls will test different parts of the engine. Very fast time controls test the efficiency of the main engine control more than anything else, as it is not easy to do 1ms searches without a lot of time jitter that can cause problems. I've measured this specific thing over millions of games. Fast games are, in general, just as accurate as long games in measuring improvement. Unless you are doing something that has an exponential characteristic to it, such as monkeying around with search extensions, where the deeper you go, the more you can extend, which changes the shape of the tree and might look better or worse at deeper (or shallower) time controls. But even MOST search changes don't have problems with rapid testing, in my experiments.

5000 games is simply not enough unless you are looking for 20 Elo type improvements. Most are a fraction of that, requiring many more games. Either get more cores or use much shorter time controls; don't monkey with the number of games.
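The relationship between number of games and measurable Elo difference can be sketched in Python. The assumptions here are mine, not from the post: two engines of roughly equal strength, a draw ratio of 30%, and a normal approximation with z = 1.96 (about 95%). The function names are hypothetical.

```python
import math

# Slope of the Elo curve at a 50% score: d(Elo)/d(score) = 400 / (ln(10) * 0.25).
ELO_PER_SCORE = 400.0 / (math.log(10) * 0.25)

def elo_margin(games, draw_ratio=0.3, z=1.96):
    """Approximate +/- Elo error bar after `games` games between equal engines."""
    # Per-game variance when wins and losses are equally likely.
    var = (1.0 - draw_ratio) / 4.0
    return z * math.sqrt(var / games) * ELO_PER_SCORE

def games_needed(margin_elo, draw_ratio=0.3, z=1.96):
    """Approximate number of games needed for a +/- margin_elo error bar."""
    var = (1.0 - draw_ratio) / 4.0
    score_margin = margin_elo / ELO_PER_SCORE
    return math.ceil(var * (z / score_margin) ** 2)

if __name__ == "__main__":
    for n in (1000, 5000, 20000, 80000):
        print(f"{n:6d} games: about +/- {elo_margin(n):.1f} Elo")
    print("games for +/- 5 Elo:", games_needed(5.0))
```

Under these assumptions 5000 games give an error bar of about ± 8 Elo, and halving the error bar takes four times as many games. Comparing two versions that each carry their own error bar needs a wider gap still, which fits the point that small improvements require far more games.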

Rebel · Post by **Rebel** » Tue Jun 19, 2012 9:36 pm

This was useful. Thanks Bob.

jdart · Post by **jdart** » Thu Jun 21, 2012 3:51 pm

bob wrote: Unless you are doing something that has an exponential characteristic to it, such as monkeying around with search extensions, where the deeper you go, the more you can extend, which changes the shape of the tree and might look better or worse at deeper (or shallower) time controls. But even MOST search changes don't have problems with rapid testing, in my experiments.

Most engines now do things like scaling LMR with increasing depth, which is not going to happen at low depths, and there are other typically depth-dependent things such as IID, null-move verification etc.

I think there is probably some minimum search depth, below which you cease to be exercising significant parts of the search. Do you have such a minimum you'd recommend?

I have recently used game in 10sec + 0.1 sec increment, which gets me a depth between 12 (opening) and 20-30 (endgame). This seems like enough depth to me but I may be conservative here.

--Jon

Ajedrecista · Post by **Ajedrecista** » Fri Aug 24, 2012 8:09 pm

Hello Ed et al!

Sorry for bumping this topic. I am running a fixed-depth test (depth d vs. depth d - 1) using Quazar 0.4 w32, and I took a look at Ed's experiment. My questions are the following:

· How is the progress of this experiment?
· Have you reached some conclusions?

I read chapter 2 (diminishing return overview) of matches *.1 (DEPTH + 1) and made an artificial rating list, starting with depth 6:

Depth:     Rating:
------     -------

  6            0
  7          180
  8          327
  9          478
 10          607
 11          734

I compute ratings as cumulative sums: 0 (the offset point); 0 + 180 = 180; 0 + 180 + 147 = 327; etc. I put these ratings on the y axis and ln(depth_i) on the x axis, so the x axis has a logarithmic scale. I fit a line to the data points by the method of least squares and get a coefficient of determination R² ~ 0.9991 using Excel (very good!). The important quantity of this line is the slope, because the intercept varies with the offset point: Elo variations with depth will be proportional to the slope and inversely proportional to depth.

I choose a line for the following reason: when depth d tends to infinity, ln(d) ~ 1 + 1/2 + 1/3 + ... + 1/d and ln(d - 1) ~ 1 + 1/2 + 1/3 + ... + 1/(d - 1), so delta_x = ln(d) - ln(d - 1) = ln[d/(d - 1)] ~ 1/d. If Y(x) = mx + n, then dY/dx = m; estimated Elo gain = delta_Y = m*delta_x ~ m/d ---> 0 if d ---> infinity (diminishing returns exist with this model).

A quadratic function fails with the same previous analysis: Y(x) = ax² + bx + c; dY/dx = 2ax + b; delta_x ~ 1/d (the same as before); estimate Elo gain = delta_Y = (dY/dx)*delta_x = (2ax + b)/d ~ {2a*[d + (d - 1)]/2 + b}/d ~ 2a = constant: diminishing return does not exist with this model (the same with other polynomials of higher degree). In dY/dx, I choose the average mean x ~ [d + (d - 1)]/2 because it makes sense to me.

With the data points of the rating list above, I get Y(x) ~ 1206.5x - 2169.1 with Excel, where x_i = ln(depth_i). Of course, I do not take error bars into account; they should be more or less ± 20 Elo for around 800 games at 95% confidence for depth = 6 and depth = 11 (for the other tested depths, error bars should be more or less ± 14 Elo for around 1600 games at 95% confidence, but I am relying on my memory, which is very risky).
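For anyone who wants to reproduce the fit without Excel, here is a minimal Python sketch using only the six data points from the rating list. The helper names `fit_line` and `predicted_gain` are hypothetical, and the fit is plain ordinary least squares with no error bars, as in the description above.

```python
import math

depths = [6, 7, 8, 9, 10, 11]
ratings = [0, 180, 327, 478, 607, 734]

def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept; also returns R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy * sxy / (sxx * syy)
    return slope, intercept, r_squared

def predicted_gain(depth, slope):
    """Model's Elo gain from depth - 1 to depth: slope * ln(d / (d - 1))."""
    return slope * math.log(depth / (depth - 1))

# x axis is ln(depth), y axis is the cumulative rating.
slope, intercept, r_squared = fit_line([math.log(d) for d in depths], ratings)

if __name__ == "__main__":
    print(f"slope = {slope:.1f}, intercept = {intercept:.1f}, R^2 = {r_squared:.4f}")
    for d in (12, 16, 20):
        print(f"depth {d - 1} -> {d}: about {predicted_gain(d, slope):.0f} Elo")
```

The predicted gains beyond depth 11 are extrapolations of the model, not measurements.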

What is curious is that I get a similar, very high R² value with my own data points and least-squares line so far (I compute my ratings with BayesElo, though I probably do not use the best commands)... so I guess that I am not doing things extremely badly!

I am tempted to start a new topic in this subforum with my unfinished results and let people post their own data and/or conclusions. But first I want to know whether this kind of model/approach is reasonably good. Please answer with your suggestions, reports of possible errors in my explanation, etc. Thanks in advance!

Regards from Spain.

Ajedrecista.

CRoberson · Post by **CRoberson** » Fri Aug 24, 2012 9:39 pm

On a quad, I run 4 games at a time. Pondering is turned off and all programs are single threaded.

Here is how I test.

First test: A standard set of benchmarks. Some changes should improve speed, but not node counts!

Assuming all went well, Second test: Large number of high speed games. I will watch the results go by for the first 20 games just to be sure things are on track.

Assuming all went well, Third test: Smaller number of games against different opponents at much longer time controls.

Assuming all went well, Fourth test: Games on ICC or FICS or ... with humans and other computers.

During each of the 2nd - 4th tests, I will take some of the losses and look them over myself to see how the changes modified the play.

Now, here is the difference between what I do and what some others do. It is not until now that I release a new version to some of the rating groups. I am amazed at the guys that release a much larger number of versions and can't tell you if the new version is better or not.

Dan Honeycutt · Post by **Dan Honeycutt** » Sat Aug 25, 2012 2:47 am

Hi Ed,

A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.

Best
Dan H.

michiguel · Post by **michiguel** » Sat Aug 25, 2012 5:06 am

Dan Honeycutt wrote:Hi Ed,

A few pointers about how to set up a test with popular GUIs would be nice if that fits with your intent. I, for one, have never figured out how to get Arena to use a book for a tournament. I do know how to get it to use different starting positions (Engines .. Tournament .. Options) but where does one get good starting positions? Mine are just some I threw together, I'm sure there must be better out there somewhere.

Best
Dan H.

https://sites.google.com/site/gaviotach ... ects=0&d=1

"2400 positions to start matches in pgn format. Most are book positions after ~10 moves. Randomly sorted. "

from
https://sites.google.com/site/gaviotach ... e/download

Miguel
