mcostalba wrote: Initial test results of SF 2.1 are much less than stellar, to say the least. More importantly, they are somewhat unexpected, because in our internal testing the gain over 2.0.1 was about +30 ELO. So I am now rethinking the whole testing framework, because I strongly believe that reliable and consistent testing is critical for engine advancement: you cannot advance a strong engine without a reliable testing framework.
So I would like to start testing against a (small) engine pool instead of against the previous SF version, as we do currently. Just to be clear, I think that self-testing is a good thing and has proven very useful for us: we have gained hundreds of ELO since Glaurung times relying only on this scheme, which I consider proven and effective and IMHO the _best_ way to test features in the 10-15 ELO resolution range.
But today, for a top engine, 10 ELO resolution is not enough; you really want to push up to 5 ELO, otherwise you miss a lot of possible small but effective tweaks that, summed up, could make a difference. We have experienced with the last release that, when dealing with 5 ELO features, increasing the number of played games is not enough; we need something different, and this is where testing against an engine pool comes into play. Please note that I still don't know whether pool testing is better, equal, or even worse; it is just a new road that I would like to try (yes, some people here have tried this before, but I really don't care, because I want to test it myself).
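As a back-of-the-envelope illustration of why 5 ELO resolution is so demanding, the sketch below estimates how many games are needed before the score gap implied by a given ELO difference exceeds the statistical noise. It assumes the standard logistic ELO model and, for simplicity, ignores draws (which in practice shrink the variance somewhat, so the real numbers are a bit lower); the function names are mine, not part of any testing tool.

```python
import math

def elo_to_score(elo):
    """Expected match score for a given ELO advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def games_for_resolution(elo, sigmas=2.0):
    """Rough number of games needed so that `sigmas` standard errors
    of the match score fit inside the score gap implied by `elo`.
    Ignores draws, which in practice reduce the variance somewhat."""
    gap = elo_to_score(elo) - 0.5
    # per-game score variance is at most 0.25 (pure win/loss games)
    return math.ceil((sigmas ** 2) * 0.25 / gap ** 2)

print(games_for_resolution(10))  # roughly 5K games for a 10 ELO gap
print(games_for_resolution(5))   # roughly 19K games for a 5 ELO gap
```

Because the required game count grows with the inverse square of the score gap, halving the resolution from 10 to 5 ELO roughly quadruples the number of games, which is why "just play more games" stops being practical.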
So I have picked up LittleBlitzer and left the beloved cutechess-cli, which does not easily allow running gauntlets. But before starting I have to validate the testing framework; here is what I am planning to do:
STEP 1: RELIABILITY
I'll run a gauntlet of 10K games of SF compiled by me against a pool of engines, including Jim's official SF 2.1 release. I have chosen a TC of 1"+0.1" (LittleBlitzer's default), single thread. I will repeat the test 3 times; if the results of the 3 runs do not match with high accuracy, then LB is not reliable and I will stop the validation process without further attempts.
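"The same with high accuracy" can be made precise: each 10K-game run's score carries a standard error, and the three runs agree if every pair of scores lies within their combined error bars. The sketch below is one way to do that check; the W/L/D figures and function names are hypothetical, not actual test results.

```python
import math

def score_and_error(wins, losses, draws):
    """Match score and its standard error from a W/L/D record.
    Uses the per-game sample variance, so draws are accounted for."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    return s, math.sqrt(var / n)

def runs_consistent(records, sigmas=2.0):
    """True if every pair of runs' scores lie within `sigmas`
    combined standard errors of each other."""
    stats = [score_and_error(*r) for r in records]
    for i in range(len(stats)):
        for j in range(i + 1, len(stats)):
            (s1, e1), (s2, e2) = stats[i], stats[j]
            if abs(s1 - s2) > sigmas * math.hypot(e1, e2):
                return False
    return True

# hypothetical W/L/D records of three 10K-game gauntlet runs
runs = [(4200, 3800, 2000), (4150, 3830, 2020), (4260, 3760, 1980)]
print(runs_consistent(runs))  # True: scores differ by well under 2 sigma
```

Note that for a 10K-game run the standard error of the score is only around half a percentage point, so runs whose scores differ by a full point or more would already be suspicious.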
STEP 2: SCALABILITY
I'll run the same gauntlet but at a 10"+0.1" TC (it will take a while!), again single thread. If the results are far apart, then we have scalability problems and will need to find a better TC (I really hope not!)
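To judge whether the fast-TC and slow-TC results are "far apart", it helps to convert each gauntlet score into an ELO estimate with an error band, so the two TCs can be compared on the same scale. A minimal sketch, assuming the logistic ELO model; the 53% score and its standard error below are illustrative numbers, not measurements.

```python
import math

def score_to_elo(s):
    """ELO advantage implied by a match score (logistic model)."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def elo_with_margin(s, se, sigmas=2.0):
    """ELO estimate plus an approximate 95% interval, obtained by
    mapping the score interval [s - 2*se, s + 2*se] through the
    (monotonic) score-to-ELO curve."""
    return (score_to_elo(s),
            score_to_elo(s - sigmas * se),
            score_to_elo(s + sigmas * se))

# hypothetical: 53% gauntlet score over 10K games, se about 0.0045
print(elo_with_margin(0.53, 0.0045))  # roughly 21 ELO, give or take 6
```

If the two TCs' intervals overlap for each opponent, the fast TC is probably a usable proxy; systematic non-overlap would be the scalability problem mentioned above.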
STEP 3: MULTI-TOURNAMENT
In case we are lucky and also pass step 2, then I will run the same gauntlet again at 1"+0.1", but this time running 2 tournaments in parallel (I have a QUAD, so I can allocate 1 engine per CPU). In this case too, the results of the 10K-game gauntlet should be consistent with the previous tests.
Unfortunately our main testing framework runs under Linux and LittleBlitzer is available for Windows only, but if it proves good I can use it on my QUAD as a useful validation/verification tool.