Setting up a testing framework


Ferdy
Posts: 4851
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Setting up a testing framework

Post by Ferdy »

mcostalba wrote: Initial test results of SF 2.1 are much less than stellar, to say the least. And, more importantly, they are somewhat unexpected, because in our internal testing the gain vs 2.0.1 was about +30 ELO. So I am now rethinking the whole testing framework, because I strongly believe that reliable and consistent testing is critical for engine advancement: you cannot advance a strong engine without a reliable testing framework.

So I would like to start testing against a (small) engine pool instead of against the previous SF version, as we do currently. Just to be clear, I think that self-testing is a good thing and has proven very useful for us: we have gained hundreds of ELO since Glaurung times relying only on this scheme, which I consider proven and effective and IMHO the _best_ way to test features in the 10-15 ELO resolution range.

But today, for a top engine, 10 ELO resolution is not enough; you really want to push to 5 ELO, otherwise you miss a lot of possible small but effective tweaks that, summed up, could make a difference. We have seen with the last release that when dealing with 5 ELO features, simply increasing the number of played games is not enough; we need something different, and so testing vs an engine pool comes into play. Please note that I still don't know if pool testing is better, equal, or even worse; it is just a new road that I would like to try (yes, some people here have been down this road before, but I really don't care ;-) because I want to test it myself).

So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets. But before starting I have to validate the testing framework; here is what I am planning to do:

STEP 1: RELIABILITY
I'll run a gauntlet of 10K games of SF compiled by me against a pool of engines, including Jim's official release SF 2.1. I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. I will repeat the test 3 times; if the results of the 3 runs do not agree closely, then LB is not reliable and I will stop the validation process without further attempts.

STEP 2: SCALABILITY
I'll run the same gauntlet but at a 10"+0.1" TC (it will take a while!), still single thread. If the results are far apart, then we have scalability problems and will need to find a better TC (I really hope not!).

STEP 3: MULTI-TOURNAMENT
If we are lucky and also pass step 2, then I will run the same gauntlet again at 1"+0.1", but this time running 2 tournaments in parallel (I have a QUAD, so I can allocate 1 engine per CPU). In this case too, the results of the 10K-game gauntlet should be consistent with the previous tests.

Unfortunately our main testing framework is under Linux and LittleBlitzer is available for Windows only, but if it proves good I can use it on my QUAD as a useful validator/verification tool.
If I don't get close results after retests, I just increase the time control; once I get stable or near-stable results, I choose that TC for my subsequent testing. If I still don't get stable results, then maybe there is a problem with the interface, or the engines I am testing are doing something I am not aware of, or perhaps the processes on my PC are not constant, especially when using a very fast TC.
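
As a rough illustration of what "stable" can mean here, below is a minimal Python sketch (the W/L/D numbers and helper names are made up, not taken from any run in this thread): it compares two runs of the same match against binomial noise and estimates roughly how many games a ~5 ELO resolution needs.

Code: Select all

import math

def score_stats(wins, losses, draws):
    """Score fraction and its standard error for one W/L/D sample."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    var = (wins * (1 - s) ** 2
           + losses * (0 - s) ** 2
           + draws * (0.5 - s) ** 2) / n
    return s, math.sqrt(var / n)

def elo(s):
    """Logistic Elo equivalent of a score fraction (0 < s < 1)."""
    return -400.0 * math.log10(1.0 / s - 1.0)

# Two hypothetical 10,000-game runs of the same gauntlet.
s1, se1 = score_stats(4800, 4200, 1000)
s2, se2 = score_stats(4900, 4100, 1000)
gap = abs(s1 - s2) / math.sqrt(se1 ** 2 + se2 ** 2)
print(f"run 1: {s1:.4f} ({elo(s1):+.1f} Elo), run 2: {s2:.4f} ({elo(s2):+.1f} Elo)")
print(f"the runs differ by about {gap:.1f} standard errors")

# Games needed before a single run's 2-sigma error bar shrinks to ~5 Elo
# (about a 0.7% score shift near 50%), assuming a per-game standard
# deviation of roughly 0.45 at typical draw rates.
print("games for ~5 Elo at 2 sigma:", round((2 * 0.45 / 0.007) ** 2))

A 1% gap between two 10K-game runs is only about 1.5 standard errors, i.e. still within noise; runs repeatedly drifting apart by several percent would not be.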
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Setting up a testing framework

Post by bob »

mcostalba wrote:
Laskos wrote: Yes, self-testing exaggerates the differences
Just to be clear, exaggerating the differences is a good thing IMHO and is one of the main reasons why we use self-testing. We really want the differences to be exaggerated, so as to require fewer games to separate potentially good stuff from garbage.

Our main testing framework is based on Linux + cutechess + self-test, and I think we will stick to that. What I am wondering is whether this alone is enough. We really need a testing pipeline where the potentially good changes filtered out by self-test are further processed with a LittleBlitzer gauntlet step before being committed.

Our main test framework currently uses 4 threads per engine (it is a QUAD); if the scalability and multi-tournament tests prove successful, I could switch to 1 thread per engine (this should also have the side benefit of reducing noise due to SMP non-determinism) and run 2, 3 or even 4 instances of cutechess in parallel (no pondering), depending on the scalability test results.
A methodology warning, from experience.

You can add one bit of knowledge to program A', which is not in A. That change might well cause A' to beat A. But the value can be poorly tuned and cause A' to do _worse_ against other programs. This has happened to me in the past. You need a variety of opponents to test a change, so that different programs can react to that change differently. That gives you a much better chance of accepting good changes and rejecting bad ones.
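
As a rough illustration of that point, here is a small Python sketch with made-up numbers (not anyone's actual results): the hypothetical change wins the self-play match, yet scores worse than the old version against the same pool of opponents.

Code: Select all

import math

def score(wins, losses, draws):
    return (wins + 0.5 * draws) / (wins + losses + draws)

def elo(s):
    """Logistic Elo equivalent of a score fraction (0 < s < 1)."""
    return -400.0 * math.log10(1.0 / s - 1.0)

# Made-up numbers, for illustration only.
# Self-play: A' vs A, W-L-D from A''s point of view.
sp = score(1080, 980, 1940)
print(f"A' vs A (self-play): {sp:.3f} ({elo(sp):+.1f} Elo)")

# Gauntlets of each version against the same pool, W-L-D per opponent.
pool_A  = {"Opp1": (300, 280, 420), "Opp2": (250, 330, 420), "Opp3": (380, 240, 380)}
pool_Ap = {"Opp1": (285, 300, 415), "Opp2": (235, 345, 420), "Opp3": (372, 255, 373)}

for name, pool in (("A ", pool_A), ("A'", pool_Ap)):
    w = sum(r[0] for r in pool.values())
    l = sum(r[1] for r in pool.values())
    d = sum(r[2] for r in pool.values())
    s = score(w, l, d)
    print(f"{name} vs pool: {s:.3f} ({elo(s):+.1f} Elo)")

With these invented numbers, self-play shows A' about +9 Elo over A, while against the pool A' scores roughly 10 Elo worse than A did; only the gauntlet step would catch that.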
nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

Re: Setting up a testing framework

Post by nthom »

mcostalba wrote: STEP 1: RELIABILITY
I'll run a gauntlet of 10K games of SF compiled by me against a pool of engines, including Jim's official release SF 2.1. I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. I will repeat the test 3 times; if the results of the 3 runs do not agree closely, then LB is not reliable and I will stop the validation process without further attempts.
I would be very interested to see the results of this test. I have run tests like this on my LittleThought engine, albeit usually at around 50K games, and received very stable results.
mcostalba wrote: Unfortunately our main testing framework is under Linux and LittleBlitzer is available for Windows only, but if it proves good I can use it on my QUAD as a useful validator/verification tool.
If the demand is high enough, I could create a Linux version with command line interface only. I just don't have as much time these days as I'd like to work on it.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Setting up a testing framework

Post by mcostalba »

nthom wrote: I would be very interested to see the results of this test. I have run tests like this on my LittleThought engine, albeit usually at around 50K games, and received very stable results.
Well, first of all, thanks a lot for your very nice tool! It is handy to use and quite stable: no losses on time in more than 10K games, and this is _mainly_ a GUI feature, not an engine one ;-)

I will post the results this weekend. For me, apart from the stability test, which is an unavoidable prerequisite, the scalability to multi-tournament is also very interesting: it could cut the testing times in half if proven accurate. Currently a game at 1"+0.1" takes on average 14 secs to finish, so 10K games take a little more than 2 full days: cutting that to a single day would be really great.

The Linux version would be really good of course, but, perhaps to ease the porting task, I think a command-line interface to LB that reads what to do from a configuration file is more than enough; actually it is even more useful than a GUI, because we run our testing framework remotely through an ssh connection. Regarding this point, I think the design chosen by cutechess, which has split the actual tournament manager from the GUI, is a winning one.
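
For what it is worth, below is a minimal Python sketch of the "several single-threaded tournaments in parallel, driven by a small config" idea; the tool name and flags are placeholders of my own, not real LittleBlitzer or cutechess options.

Code: Select all

import subprocess

# Placeholder config: one single-threaded tournament instance per core,
# engines not pondering, results split across per-instance PGN files.
config = {
    "instances": 2,
    "games_per_instance": 5000,
    "tc": "1+0.1",
    "pgn_prefix": "gauntlet",
}

DRY_RUN = True  # set to False once the real command line is filled in

procs = []
for i in range(config["instances"]):
    # "tournament-runner" and its flags are hypothetical stand-ins for
    # whatever the actual tool's command-line syntax turns out to be.
    cmd = ["tournament-runner",
           "--games", str(config["games_per_instance"]),
           "--tc", config["tc"],
           "--pgn", f"{config['pgn_prefix']}_{i}.pgn"]
    if DRY_RUN:
        print("would run:", " ".join(cmd))
    else:
        procs.append(subprocess.Popen(cmd))

# Wait for every instance; the per-instance PGNs can then be merged for rating.
for p in procs:
    p.wait()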
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Setting up a testing framework

Post by mcostalba »

nthom wrote: I would be very interested to see the results of this test. I have run tests like this on my LittleThought engine, albeit usually at around 50K games, and received very stable results.
Ok, I have run 2 gauntlets with no parallel testing and a last one with two parallel tournaments (I have a QUAD, so there was even more room because the engines are not pondering).

Here are the results:

Code: Select all

Games Completed = 7487 of 10000 (Avg game length = 14.409 sec)
Settings = Gauntlet/64MB/1000ms+100ms/M 1000cp for 12 moves, D 150 moves/PGN:swcr.pgn(3395)
Time = 110864 sec elapsed, 37211 sec remaining
 1.  Stockfish BASE	3368.0/7487	1819-2570-3098  	(L: m=507 t=2 i=0 a=2061)	(D: r=1964 i=454 f=367 s=24 a=289)	(tpm=101.7 d=15.4 nps=4893395)
 2. 51.04% Stockfish 2.1 JA 64bit   	1274.0/2496	599-547-1350  	(L: m=34 t=0 i=0 a=513)	(D: r=1034 i=153 f=114 s=6 a=43)	(tpm=102.3 d=15.7 nps=5045090)
 3.  47.75% Critter 1.01 64-bit      	1192.0/2496	694-806-996  	(L: m=216 t=0 i=0 a=590)	(D: r=646 i=144 f=116 s=8 a=82)	(tpm=103.1 d=14.3 nps=6184471)
 4.  66.25% Houdini 1.5a x64         	1653.0/2495	1277-466-752  	(L: m=135 t=0 i=0 a=331)	(D: r=284 i=157 f=137 s=10 a=164)	(tpm=98.0 d=13.9 nps=7502338)

Games Completed = 7986 of 10000 (Avg game length = 14.441 sec)
Settings = Gauntlet/64MB/1000ms+100ms/M 1000cp for 12 moves, D 150 moves/PGN:swcr.pgn(3395)
Time = 118056 sec elapsed, 29773 sec remaining
 1.  Stockfish BASE  64bit	3527.0/7986	1888-2820-3278  	(L: m=568 t=0 i=0 a=2252)	(D: r=2109 i=455 f=349 s=28 a=337)	(tpm=101.6 d=15.4 nps=4861154)
 2.  51.67% Stockfish 2.1 JA 64bit   	1375.5/2662	655-566-1441  	(L: m=31 t=0 i=0 a=535)	(D: r=1152 i=153 f=83 s=4 a=49)	(tpm=102.1 d=15.8 nps=5049361)
 3.  48.91% Critter 1.01 64-bit      	1302.0/2662	776-834-1052  	(L: m=248 t=0 i=0 a=586)	(D: r=664 i=156 f=112 s=11 a=109)	(tpm=102.7 d=14.3 nps=6221863)
 4.  66.92%  Houdini 1.5a x64         	1781.5/2662	1389-488-785  	(L: m=152 t=0 i=0 a=336)	(D: r=293 i=146 f=154 s=13 a=179)	(tpm=97.5 d=13.9 nps=7515900)

SMP threads = 2
Games Completed = 10000 of 10000 (Avg game length = 14.102 sec)
Settings = Gauntlet/64MB/1000ms+100ms/M 1000cp for 12 moves, D 150 moves/PGN:swcr.pgn(3395)
Time = 72228 sec elapsed, 0 sec remaining
 1.  Stockfish BASE  64bit	4104.5/10000	2242-4033-3725  	(L: m=1076 t=0 i=0 a=2957)	(D: r=2386 i=553 f=439 s=23 a=324)	(tpm=102.2 d=13.2 nps=1770573)
 2.  52.57%   Stockfish 2.1 JA 64bit   	1753.0/3334	919-747-1668  	(L: m=74 t=0 i=0 a=673)	(D: r=1292 i=190 f=134 s=5 a=47)	(tpm=102.9 d=13.7 nps=1887995)
 3.  54.15%   Critter 1.01 64-bit      	1805.0/3333	1226-949-1158  	(L: m=297 t=0 i=0 a=652)	(D: r=786 i=163 f=105 s=8 a=96)	(tpm=101.6 d=13.1 nps=2738632)
 4.  70.11%   Houdini 1.5a x64         	2337.5/3333	1888-546-899  	(L: m=191 t=0 i=0 a=355)	(D: r=308 i=200 f=200 s=10 a=181)	(tpm=98.9 d=12.3 nps=2658080)
Reliability is more or less acceptable, although not as good as I would have wished, but the parallel-tournament capability (the last result) is not good: for instance, Critter jumps from 48% to 54%!!!!
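
To put that jump in perspective, here is a small Python check (the helper name is mine) of whether Critter's shift from 48.91% to 54.15%, taken from the W-L-D lines of the second and third runs above, could plausibly be sampling noise.

Code: Select all

import math

def score_and_se(wins, losses, draws):
    """Score fraction and standard error from one W/L/D line."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    var = (wins * (1 - s) ** 2 + losses * s ** 2 + draws * (0.5 - s) ** 2) / n
    return s, math.sqrt(var / n)

# Critter vs Stockfish BASE, W-L-D taken from the second (sequential)
# and third (two parallel tournaments) runs above.
s_seq, se_seq = score_and_se(776, 834, 1052)    # 48.91% over 2662 games
s_par, se_par = score_and_se(1226, 949, 1158)   # 54.15% over 3333 games

z = (s_par - s_seq) / math.sqrt(se_seq ** 2 + se_par ** 2)
print(f"sequential: {s_seq:.4f} +/- {se_seq:.4f}")
print(f"parallel:   {s_par:.4f} +/- {se_par:.4f}")
print(f"the shift is about {z:.1f} standard errors")

At roughly five standard errors, the parallel run really does look like a different population rather than ordinary run-to-run noise.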