Don wrote:
AlvaroBegue wrote:
I have been writing a new engine for about a month. Ideally I would like to use a sparring partner for my engine that is moderately stronger, because it's hard to learn what your biggest weaknesses are if you are matched against an overwhelmingly stronger opponent.
My new engine currently beats Fairy-Max fairly systematically and wins more than half of its games against GNU Chess, but it loses badly to Crafty or Arasan.
Can anyone propose a good freely-available program that is likely to be stronger than my engine but not by too much? It is also important that it be able to handle very fast time controls.
Incidentally, I am using a 10 second + 0.1 second per move Fischer clock for tests. What do others use?
What Larry and I did was to use the strongest available engine and handicap it as much as needed. This has the major advantage of maximizing CPU usage. In the early days we could take Glaurung or later Stockfish and give it just a fraction of the time. We actually used 3 or 4 different but strong programs.
As Komodo gradually got better we would just decrease the handicap, and eventually we had to drop the weaker players, as we did not want to start spending significant time testing against them too.
There was a time when Spike 1.2 was just too strong and we had to handicap it to have a valid match. What is really great about this system is that you can really see the progress over time. It was great to take a program off the list that was at one time too much to handle.
Using time as a handicap is certainly a good idea, because it maximizes CPU usage for your engine, and it behaves better than using the reduced-strength features of engines (those typically introduce random blunders, an orthogonal source of noise that is not accounted for in the rating model).
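As an aside, here is a toy sketch of what a time-handicapped sparring game looks like. It uses python-chess with a fixed time per move rather than a real Fischer clock, and the engine paths are just placeholders, not anyone's actual test setup; a serious test would of course go through a proper match manager.

```python
# Toy sketch of a time-handicapped sparring game: the stronger engine gets
# half the thinking time per move. Engine paths and the fixed-time-per-move
# simplification are placeholders, not the actual setup described above.
import chess
import chess.engine

my_time, opponent_time = 0.2, 0.1   # seconds per move: a 2:1 time handicap

me = chess.engine.SimpleEngine.popen_uci("./my_engine")        # hypothetical path
opponent = chess.engine.SimpleEngine.popen_uci("./stockfish")  # hypothetical path

board = chess.Board()
while not board.is_game_over():
    engine, limit = (me, my_time) if board.turn == chess.WHITE else (opponent, opponent_time)
    result = engine.play(board, chess.engine.Limit(time=limit))
    board.push(result.move)

print(board.result())   # e.g. "0-1" if the handicapped opponent still wins

me.quit()
opponent.quit()
```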
But even with that, you need between 2x and 4x more CPU time than in self-testing. As explained earlier, you need about 4x more games; but if the opponent uses half the time of your engine, for example, each game costs only 3/4 of a self-play game (1 + 0.5 versus 1 + 1), so the total comes to "only" 3x more time.
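Spelled out as a back-of-the-envelope calculation (the 4x games factor and the half-time opponent are just the assumptions above, not universal constants):

```python
# Back-of-the-envelope check of the "3x" figure above.
games_factor = 4.0                      # ~4x as many games needed against other engines
cpu_per_self_play_game = 1.0 + 1.0      # both sides are my engine, time T each
cpu_per_gauntlet_game  = 1.0 + 0.5      # my engine gets T, the opponent T/2

total_cost_ratio = games_factor * cpu_per_gauntlet_game / cpu_per_self_play_game
print(total_cost_ratio)                 # -> 3.0, i.e. "only" 3x the cost of self-testing
```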
On the other hand, the draw ratio is significantly lower against different engines than in self-play, so there is more statistical information in the same number of games (see Kai Laskos' approximated formula for LOS, which is a function of the number of wins and losses only).
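If I remember it correctly, that approximation is the usual normal one, where the draws cancel out; something like:

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority under the usual normal approximation:
    draws drop out, only the decisive games matter."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Example: 250 games with 95 wins, 70 losses (85 draws)
print(round(los(95, 70), 3))   # ~0.974
```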
Self-play, on the other hand, has the advantage of increasing the sensitivity of the measurement, so fewer games need to be played to detect an improvement or a regression. It is not uncommon for +50 elo in self-play to become only +30 elo against a varied population of engines (with the same testing conditions).
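For reference, elo figures like those come from the standard logistic conversion of a match score; nothing specific to my setup:

```python
import math

def elo_from_score(score: float) -> float:
    """Standard logistic Elo model: 0.50 -> 0, 0.57 -> ~+49, 0.64 -> ~+100."""
    return -400.0 * math.log10(1.0 / score - 1.0)

print(round(elo_from_score(0.57), 1))   # ~ +49.0
print(round(elo_from_score(0.54), 1))   # ~ +27.9
```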
I just use self-play (to detect microscopic improvements), but every now and then I play a gauntlet to see whether any significant progress has been made. The latest one shows DiscoCheck on par with Fruit 05/11/03 and Gaviota 0.86, which is encouraging.
