I've just recently started testing larger volumes of very fast games. These results are from a 10,000-game gauntlet, rated with bayeselo with LT 1.03 fixed at 2300.
How would you interpret these results? I'm assuming the change was bad (the rating dropped from 2437 to 2428), but some of the other engines' results varied by more than that.
If, after 10,000 games against a variety of opponents, your results are worse, then you should be able to say with high confidence that your change was not positive.
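For a rough sense of scale (a back-of-envelope sketch, not bayeselo's actual computation): if you assume each game's score has a standard deviation of about 0.4 points, which is typical at a normal draw rate, the 95% error margin on a single 10,000-game measurement works out to only a few Elo.

[code]
import math

def elo_margin(games, per_game_sd=0.4, z=1.96):
    """Approximate 95% error margin, in Elo, for a score measured over
    `games` games near a 50% result. Near 50%, one point of average
    score corresponds to 1600/ln(10) ~ 695 Elo."""
    se_score = per_game_sd / math.sqrt(games)  # standard error of the mean score
    return z * se_score * 1600.0 / math.log(10)

print(f"+/- {elo_margin(10000):.1f} Elo")  # about +/- 5.5 Elo over 10,000 games
[/code]

One caveat: when you compare two separate 10,000-game runs (old version vs. new version), each run contributes its own error, so the margin on the difference is about sqrt(2) times larger, roughly +/- 7.7 Elo here. That makes a 9 Elo drop suggestive but marginal on its own.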
My guess is that the ratings of the other engines vary so much because they are connected only to the one engine being tested, so their individual ratings can swing more than you would expect from small changes in your program, IMO. Why not rate the rest of them with several thousand games each, or would that take too much time?
nthom wrote: I've just recently started testing larger volumes of very fast games. These results are from a 10,000-game gauntlet, rated with bayeselo with LT 1.03 fixed at 2300.
How would you interpret these results? I'm assuming the change was bad (the rating dropped from 2437 to 2428), but some of the other engines' results varied by more than that.
Miguel is right. No accurate comparison can be made unless you know the Elo of each of the other engines under the test conditions. As Remi has pointed out before, combine the two sets of games in order to get a direct comparison of LT 1.05.01 and 1.05.02. Also use the LOS function in Bayeselo to get the likelihood that one version is stronger than the other.
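For what it's worth, a common closed-form approximation to LOS (the usual normal-approximation formula from decisive game counts, not necessarily exactly what Bayeselo computes internally) is easy to evaluate yourself:

[code]
import math

def los(wins, losses):
    """Likelihood of superiority: probability that the true strength
    difference is positive, given the decisive game counts. Normal
    approximation; draws say nothing about the sign of the difference
    and are ignored here."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# Hypothetical counts, just to illustrate:
print(f"LOS = {los(3100, 2950):.3f}")  # ~0.97 for +150 over 6,050 decisive games
[/code]

An LOS above roughly 0.95 is usually taken as reasonable evidence that one version really is stronger; anything near 0.5 means the match told you essentially nothing about the sign of the change.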