nondeterministic testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jswaff

nondeterministic testing

Post by jswaff »

I am in the process of developing a better strategy for testing eval changes than the ad-hoc methods I've used forever. I've read on several occasions that most people use Nunn-type matches to remove the randomness of opening book selection.

I ran Prophet vs. GNU 5.05 at 10 5 (10 minutes plus a 5-second increment). The results were far from deterministic. I'm repeating the experiment at 15 10. Even if 15 10 does prove a bit more deterministic, running matches at 15 10 against 5 or 6 opponents (which seems to be the recommended approach) will take about a week of CPU time.

What time controls do others use? Is there a "shortcut" I'm missing for testing eval changes? Seems to be as much art as science. :-/

Oh - before anyone says anything - 3 of the 5 matches don't show a full 40 games. I'm still investigating why. On the last match, which shows 38 games, the PGN actually had 39 games. One of those was adjourned. It was GNU's turn in what looked like a threefold repetition. I'm not sure yet why the game was adjourned, or what happened to the 40th game. Figuring that out would obviously affect the score for that particular match (somewhat), but the fact that it differs from the other matches still shows nondeterministic behaviour.

--
James

Code: Select all

james@smeagol ~/prophet/scripts/10_5 $ ../pgn prophet-gnu505-200708092115.pgn

==========================================================
           Total played: 40 (unique games: 40)
           Note, only unique games are scored.
==========================================================
          Player  Wins  Losses  Draws  Score    Percent
==========================================================
  GNU Chess 5.05   16     16      8   20.0/40  (50.00%)
         prophet   16     16      8   20.0/40  (50.00%)

ELO Diff: 0.00

james@smeagol ~/prophet/scripts/10_5 $ ../pgn prophet-gnu505-200708112116.pgn

==========================================================
           Total played: 40 (unique games: 40)
           Note, only unique games are scored.
==========================================================
          Player  Wins  Losses  Draws  Score    Percent
==========================================================
  GNU Chess 5.05   18     13      9   22.5/40  (56.25%)
         prophet   13     18      9   17.5/40  (43.75%)

ELO Diff: 43.66

james@smeagol ~/prophet/scripts/10_5 $ ../pgn prophet-gnu505-200708122159.pgn

==========================================================
           Total played: 38 (unique games: 38)
           Note, only unique games are scored.
==========================================================
          Player  Wins  Losses  Draws  Score    Percent
==========================================================
  GNU Chess 5.05   19     10      9   23.5/38  (61.84%)
         prophet   10     19      9   14.5/38  (38.16%)

ELO Diff: 83.88

james@smeagol ~/prophet/scripts/10_5 $ ../pgn prophet-gnu505-200708131502.pgn

==========================================================
           Total played: 37 (unique games: 37)
           Note, only unique games are scored.
==========================================================
          Player  Wins  Losses  Draws  Score    Percent
==========================================================
  GNU Chess 5.05   15     15      7   18.5/37  (50.00%)
         prophet   15     15      7   18.5/37  (50.00%)

ELO Diff: 0.00

james@smeagol ~/prophet/scripts/10_5 $ ../pgn prophet-gnu505-200708140915.pgn

==========================================================
           Total played: 38 (unique games: 38)
           Note, only unique games are scored.
==========================================================
          Player  Wins  Losses  Draws  Score    Percent
==========================================================
  GNU Chess 5.05   20     14      4   22.0/38  (57.89%)
         prophet   14     20      4   16.0/38  (42.11%)

ELO Diff: 55.32
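
For a rough sense of how much a 40-game result can swing purely from noise, here is a quick back-of-the-envelope sketch (not part of the match tooling above, just the usual normal approximation, with the 16-16-8 and 10-19-9 runs from the output plugged in):

Code: Select all

import math

# Treat each game as one sample scoring 1, 0.5 or 0 and use the normal
# approximation to get a ~95% confidence interval on the match score.
def match_interval(wins, losses, draws):
    n = wins + losses + draws
    p = (wins + 0.5 * draws) / n                       # score fraction
    var = (wins * (1.0 - p) ** 2 +
           draws * (0.5 - p) ** 2 +
           losses * (0.0 - p) ** 2) / n                # per-game variance
    se = math.sqrt(var / n)                            # standard error of p
    lo, hi = p - 1.96 * se, p + 1.96 * se
    to_elo = lambda s: -400 * math.log10(1 / s - 1)    # score -> Elo difference
    return p, lo, hi, to_elo(max(lo, 1e-6)), to_elo(min(hi, 1 - 1e-6))

for wins, losses, draws in [(16, 16, 8), (10, 19, 9)]:  # Prophet's view of two runs
    p, lo, hi, elo_lo, elo_hi = match_interval(wins, losses, draws)
    print(f"{wins}-{losses}-{draws}: score {p:.1%}, "
          f"95% CI {lo:.1%}..{hi:.1%}, Elo {elo_lo:+.0f}..{elo_hi:+.0f}")

Even the dead-even 16-16-8 run is consistent with anything from roughly -100 to +100 Elo, so consecutive 40-game matches bouncing around like this is exactly what noise alone would produce.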

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

jswaff wrote:I am in the process of developing a better strategy for testing eval changes than the ad-hoc methods I've used forever. I've read on several occasions that most people use Nunn-type matches to remove the randomness of opening book selection.

I ran Prophet vs. GNU 5.05 at 10 5 (10 minutes plus a 5-second increment). The results were far from deterministic. I'm repeating the experiment at 15 10. Even if 15 10 does prove a bit more deterministic, running matches at 15 10 against 5 or 6 opponents (which seems to be the recommended approach) will take about a week of CPU time.

What time controls do others use? Is there a "shortcut" I'm missing for testing eval changes? Seems to be as much art as science. :-/

Oh - before anyone says anything - 3 of the 5 matches don't show a full 40 games. I'm still investigating why. On the last match, which shows 38 games, the PGN actually had 39 games. One of those was adjourned. It was GNU's turn in what looked like a threefold repetition. I'm not sure yet why the game was adjourned, or what happened to the 40th game. Figuring that out would obviously affect the score for that particular match (somewhat), but the fact that it differs from the other matches still shows nondeterministic behaviour.

--
James


You need lots of games. The closer the two programs are in skill, the more games you need. I mean thousands, not hundreds. You need a common set of starting positions to eliminate book randomness and book learning. Learning has to be disabled. No SMP search, which adds more non-determinism. The list goes on and on, and it is a far bigger problem than most here are aware of. I'm running matches using 40 starting positions, two games per position (alternating colors), and I repeat that 32 times, for each of four opponents. Even then there is still a slight bit of uncertainty left...
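
To see roughly why it takes thousands, here is a small sketch (assuming a per-game score standard deviation of about 0.4, which is typical when a fair fraction of games are drawn) of how many games it takes before a small Elo edge stands clear of the noise at roughly 95% confidence:

Code: Select all

import math

# How many games before a true Elo edge sits ~2 standard errors away from 50%?
# per_game_sd = 0.4 is an assumption; the real value depends on the draw rate.
def games_needed(elo_diff, per_game_sd=0.4, z=1.96):
    p = 1 / (1 + 10 ** (-elo_diff / 400))   # expected score of the stronger side
    edge = p - 0.5
    return math.ceil((z * per_game_sd / edge) ** 2)

for d in (5, 10, 20, 50):
    print(f"+{d:2d} Elo -> roughly {games_needed(d)} games")

By that rough arithmetic a 10 Elo change needs on the order of 3,000 games and a 5 Elo change well over 10,000, which is the scale of the 40 x 2 x 32 x 4 = 10,240-game runs described above.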
User avatar
pedrox
Posts: 1056
Joined: Fri Mar 10, 2006 6:07 am
Location: Basque Country (Spain)

Re: nondeterministic testing

Post by pedrox »

What time controls do you use for the games?

The number of games will depend on the changes that you make in your program. With big changes you will need fewer games, and with small changes you will need more.

60 Nunn games, 3 or 4 opponents, 40/4 is ok for me.
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: nondeterministic testing

Post by Uri Blass »

pedrox wrote:What time controls do you use for the games?

The number of games will depend on the changes that you make in your program. With big changes you will need fewer games, and with small changes you will need more.

60 Nunn games, 3 or 4 opponents, 40/4 is ok for me.
He said he used a Fischer time control of 10+5, and he is thinking of repeating the matches at a time control of 15+10.

One of my usual methods for testing small changes is to play against the previous version at a fixed number of nodes.

It is productive for detecting big bugs: if I see the new version lose by a big margin like 20:4, then I can be practically sure that I have a significant bug, and I do not continue the match.
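
As a rough check of that rule of thumb (my own back-of-the-envelope arithmetic, not Uri's tooling): if the two versions were actually equal and we look only at decisive games, a result of 20:4 or worse is very unlikely to happen by chance:

Code: Select all

from math import comb

# Probability of one side scoring k or more of n decisive games when both
# versions are truly equal (p = 0.5). Draws are ignored for simplicity.
def tail_prob(k, n, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(f"P(>= 20 of 24 decisive games at 50%): {tail_prob(20, 24):.4f}")  # ~0.0008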

Changes in the evaluation usually will not give big results, even if there is a bug.

I choose to test evaluation changes, after checking that they work correctly on some positions, with matches at a fixed number of nodes. If I both believe in the change and get a result of more than 50% against the previous version, I choose the winner without being sure that it is really better. I believe that if I waited for enough games to know whether a change is better, also by testing against other opponents, progress in improving the evaluation would be slower: it is better to make 60 changes that are productive by 1 Elo and 40 changes that are counterproductive by 1 Elo (so you get a 20 Elo improvement) than to make one step that is productive by 1 Elo.

After many changes I may run a more serious test to see if I got an improvement relative to the previous version tested by the CCRL.

Uri
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: nondeterministic testing

Post by brianr »

I use a hierarchy of tests that has been very slowly evolving from more to less ad hoc.
There are about a half dozen hard-coded test positions in Tinker, covering various game stages,
that I have become familiar with over the years, so I can see if something odd seems to be happening.

I also do a "mirror" test to make sure the evaluation is not broken.
Then, I do a quick WAC 300 test.
Finally, the tedious testing begins.

I used to test against just one other engine, initially with the Nunn2 (20 positions, not 25?), so 40 games (both black and white) at three different time controls: 0:10/1, 2:00/1, and 5:00/3, usually with ponder off (although I have a dual CPU system). These results were non-reproducible with a margin of error that was often larger than the apparent differences between versions.

Then, I went to more positions adding the Noomen 2006 (30) positions and playing 100 games. Again, not reproducible.

Finally, about a year ago, instead of even more positions and time controls against just one opponent, I went to a pool of opponents for 100 games each (Nunn2 plus Noomen, both black and white), but at a fairly fast time control (1min+2sec), with pondering (more realistic stress test, albeit less deterministic).

I started with 16 opponents (several stronger, several weaker, and several about the same), and am still working on a suitable list of engines. The more ambitious testers' results are very helpful in identifying suitable opponents (ChessWar, NIL, etc.). I do not test Tinker against its own prior versions (assuming minor differences might be either too close to notice or magnified too much; not sure which, or both).

Naturally, it takes several days for a full 1,600 game run, but sometimes trends emerge early if a change seems much worse or better. Still non-deterministic results, but margin of error seems small enough that relative ranking and scoring seems useful. I will continue to distill down the list of opponents to hopefully around 9 "well chosen" ones and see how that does for a while. Of course, incremental tweaks, ongoing book frustration, magic bitboard experimentation, wanting to get back to 64 bits, dreams of parallel search and other distractions will continue to delay things :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

pedrox wrote:What time controls do you use for the games?

The number of games will depend on the changes that you make in your program. With big changes you will need fewer games, and with small changes you will need more.

60 Nunn games, 3 or 4 opponents, 40/4 is ok for me.
Run the same test 4 times and then report back. I already _know_ what the result is going to be. The variability for that small number of games is great.

Time control has little effect on the determinism of the results. I have tried all sorts of time controls from 1+0 (very fast) to 60+60 (very slow) and the variability is not significantly better at one than at the other.

I can show you consecutive runs, 40 positions, 2 games per position for 80 games total, where the first run shows A is much better, the second shows B is much better, and a long match shows they are pretty equal. That small a sample is almost random noise.
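
A quick way to convince yourself of this is to simulate it. A minimal sketch (the 32% draw rate is an assumed, arbitrary figure) that plays imaginary 80-game matches between two genuinely equal engines and counts how often one run comes out looking 5% better or worse purely by luck:

Code: Select all

import random

# Simulate 80-game matches between two truly equal engines and count how many
# runs drift at least 5 percentage points away from the true 50% score.
def lopsided_fraction(games=80, draw_rate=0.32, trials=10000):
    lopsided = 0
    for _ in range(trials):
        score = 0.0
        for _ in range(games):
            r = random.random()
            if r < draw_rate:
                score += 0.5                              # draw
            elif r < draw_rate + (1 - draw_rate) / 2:
                score += 1.0                              # win for side A
        if abs(score / games - 0.5) >= 0.05:
            lopsided += 1
    return lopsided / trials

print(f"80-game runs off by 5% or more: {lopsided_fraction():.0%}")   # around a quarter to a third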
User avatar
pedrox
Posts: 1056
Joined: Fri Mar 10, 2006 6:07 am
Location: Basque Country (Spain)

Re: nondeterministic testing

Post by pedrox »

I know that if you repeat the Nunn test twice you obtain different values, and I also know that more games and more opponents are better; on that we agree. But if you have a single computer, playing 2,500 games at 15+10 is impossible.

When your engine is weak, I do not believe that you must play such a number of games to improve it; it is possible that with 50-100 blitz games you can already see that you are progressing.

On many occasions we can see, for example in WBEC, a league where an engine has played 80 games and we already know that it has progressed; with a second tournament we often already have the verification. The 200 games that CCRL uses to give a fixed rating suggest that they also consider 200 games more or less sufficient.

When you do not have new techniques and ideas to use in your engine, you often just try to tune values, and then, yes, you need many more games to verify that you are progressing. (This is not a dig at Crafty; I admire Crafty, and with good hardware, 64 bits, several processors, Crafty is strong, and I know that you continue trying a thousand ideas nowadays.)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

pedrox wrote:I know that if you repeat the Nunn test twice you obtain different values, and I also know that more games and more opponents are better; on that we agree. But if you have a single computer, playing 2,500 games at 15+10 is impossible.

When your engine is weak, I do not believe that you must play such a number of games to improve it; it is possible that with 50-100 blitz games you can already see that you are progressing.

On many occasions we can see, for example in WBEC, a league where an engine has played 80 games and we already know that it has progressed; with a second tournament we often already have the verification. The 200 games that CCRL uses to give a fixed rating suggest that they also consider 200 games more or less sufficient.

When you do not have new techniques and ideas to use in your engine, you often just try to tune values, and then, yes, you need many more games to verify that you are progressing. (This is not a dig at Crafty; I admire Crafty, and with good hardware, 64 bits, several processors, Crafty is strong, and I know that you continue trying a thousand ideas nowadays.)
I'm sorry, but that is simply not going to work. If your engine is weak, and you make a change you want to evaluate, that "change" is going to produce at best a small change in the program's skill level. It takes way more than 100 games to determine if it is really better or not. And I do mean "way".

Again, to see why, just run your test 4 times and look at the different results you will get. If you only run it once, you might get the run that says the change is better when it is not, or that the change is worse when it is not, or that it is better when it is, or that it is worse when it is. If you are going to improve, you must get that right. 100 games is not going to do it.
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: nondeterministic testing

Post by Uri Blass »

bob wrote:
pedrox wrote:I know that if you repeat the Nunn test twice you obtain different values, and I also know that more games and more opponents are better; on that we agree. But if you have a single computer, playing 2,500 games at 15+10 is impossible.

When your engine is weak, I do not believe that you must play such a number of games to improve it; it is possible that with 50-100 blitz games you can already see that you are progressing.

On many occasions we can see, for example in WBEC, a league where an engine has played 80 games and we already know that it has progressed; with a second tournament we often already have the verification. The 200 games that CCRL uses to give a fixed rating suggest that they also consider 200 games more or less sufficient.

When you do not have new techniques and ideas to use in your engine, you often just try to tune values, and then, yes, you need many more games to verify that you are progressing. (This is not a dig at Crafty; I admire Crafty, and with good hardware, 64 bits, several processors, Crafty is strong, and I know that you continue trying a thousand ideas nowadays.)
I'm sorry, but that is simply not going to work. If your engine is weak, and you make a change you want to evaluate, that "change" is going to produce at best a small change in the program's skill level. It takes way more than 100 games to determine if it is really better or not. And I do mean "way".

Again, to see why, just run your test 4 times and look at the different results you will get. If you only run it once, you might get the run that says the change is better when it is not, or that the change is worse when it is not, or that it is better when it is, or that it is worse when it is. If you are going to improve, you must get that right. 100 games is not going to do it.
If the engine is weak, then there may be changes for which 100 games are enough to be sure that there is an improvement.

If you get an improvement of 100 Elo from a single change, then 100 games can clearly be enough to see that you got an improvement.
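
A rough sanity check of that (using the same assumed per-game standard deviation of ~0.4 as in the sketch further up): a true +100 Elo edge corresponds to an expected score of about 64%, which is several standard errors away from 50% in a 100-game match:

Code: Select all

import math

elo = 100
p = 1 / (1 + 10 ** (-elo / 400))      # expected score at +100 Elo, ~0.64
se = 0.4 / math.sqrt(100)             # standard error over 100 games (sd ~0.4 assumed)
print(f"expected score {p:.1%}, about {(p - 0.5) / se:.1f} standard errors from 50%")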

Uri
MartinBryant

Re: nondeterministic testing

Post by MartinBryant »

I have to agree with Robert here. Nowadays I'm convinced that the only way to get some confidence in any change is to play massive numbers of games (thousands).

For what it's worth, here's the sort of testing regime I use...

First, very importantly, get yourself a dedicated test machine. You do NOT want to tie up your main development machine for days running tests; it's just too depressing. I have an old 1GHz PC on my home network acting as a server, so most of the time it used to sit there doing nothing. Nowadays it sits there running test matches nearly 24x7. If you haven't got an old one, you can pick up a cheap PC suitable for testing for less than 200 pounds. You don't need a monitor, just use RDP to check on it occasionally from your main machine.

I typically run 1000-game test matches against a variety of opponents at or near the strength of my engine, using a standard set of openings.
I play them at either 1m + 1s or even faster at 0.5m + 0.5s. A thousand-game match then typically takes anything from 36-72 hours.
Personally I do not believe it matters whether you test a change at bullet, blitz or tournament time controls. If it's a good improvement, you'll see a plus score at any time control.

When judging the results of a match you must be true to the science, not your heart. If it scores worse, then no amount of "well, everybody else says it's better" or "I was SURE that'd help!" will convince me to use it.
The only time I let my subjectivity into the equation is if the score is 'marginal'. (For 'marginal' you need to understand all this stats stuff about confidence intervals and what have you... I'm no expert on it but I think I've got a reasonable feel for it now.) So if I see a result that is marginally better (but not 'statistically significant') and my gut feel says it should be an improvement and I feel it isn't some monstrous piece of logic that's gonna cause me buggy grief later, I might still adopt it.
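
For the 'marginal' case, one common way to put a number on it is the likelihood of superiority (LOS), i.e. the probability that the new version really is the stronger one given the match result. A minimal sketch using the usual normal approximation over decisive games (a generic formula, not something Martin says he uses):

Code: Select all

import math

# Likelihood of superiority from wins/losses (draws carry no information here).
def los(wins, losses):
    return 0.5 * (1 + math.erf((wins - losses) / math.sqrt(2 * (wins + losses))))

# e.g. a hypothetical 1000-game match ending +280 -260 =460
print(f"LOS: {los(280, 260):.1%}")   # ~80%: suggestive, but not conclusive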

Also for confidence building after matches I run the changes against a large number of testsets. There are literally tens of thousands of test positions in numerous sets out there. You don't have to give your engine long on each position so you can run thousands of positions in a day. If the results show a plus score over the version without the change then you can have even more confidence in it. If it doesn't I may well run another 1000 game match.

Also NEVER pull the plug on a match early, even if it looks disastrously worse or massively better. I've seen bed-wettingly exciting ELO advantages after 100-200 games eroded to NOTHING after a thousand. Don't prejudge the test. Let it do its job.
Also I've seen two engines swap leadership numerous times up to maybe 600+ games before one finally pulls out a small but consistent lead.

And I test EVERY change with at least one 1000 game match. Don't bundle a whole bunch of things together just to try to cut corners.
Also I test each change individually from some baseline version, not incrementally building up the changes on top of each other.

Going back to Robert's point about the number of games: I'd love to play 10,000-game matches but that's too tedious even at 0.5+0.5. Why? Well, I had a match against Ruffian which, just for curiosity, I extended to about 4,700 games before I gave up. Graphing the score game by game showed a gradual but certain erosion of 10 ELO points from the score at 1000 games. Was it still eroding at 4,700 games? I don't know. But from that I conclude that even 4,700 games isn't 'enough' on all occasions.
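
That game-by-game graph is easy to reproduce; a minimal sketch (the results list here is fabricated placeholder data, not the actual Ruffian match) that prints the running score and Elo estimate every 500 games:

Code: Select all

import math, random

results = [random.choice([1.0, 0.5, 0.0]) for _ in range(4700)]   # placeholder per-game scores

total = 0.0
for n, r in enumerate(results, start=1):
    total += r
    if n % 500 == 0:
        p = min(max(total / n, 1e-6), 1 - 1e-6)                   # clamp to avoid log(0)
        print(f"after {n:4d} games: score {p:.1%}, Elo {-400 * math.log10(1 / p - 1):+.0f}")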

And finally, before releasing my most recent version, I played another 5,600 games, with all the individual changes I had found to be good, put in together. Only when those games too showed a small improvement was I happy to release a new version.

Finally remember that engine writers live in a community with a massive untapped resource. There are hundreds of 'testing addicts' out there with PCs dying to run test matches JUST FOR YOU!
A wonderful chap called Paul Ward was very excited when Colossus came back on the scene and offered his services to me for testing. He regularly runs 1000 game test matches for me (many thanks again Paul!!!).
Just ask in the testing forum. You see people there all the time who are obviously fans of particular engines who would bite your hands off just to be allowed to test your latest 'experimental' version before anyone else!