In self-testing two versions of Tinker, I noticed the following results and thought they might be worth sharing (for those of us without massive clusters).
I used Bob Hyatt's 3,891 starting positions.
Time control 0:10/0.5, no pondering, no books, yes EGTBs (but these factors should not matter, I think).
Results below, starting with 100 games and then roughly doubling.
Notice how the results start quite far apart, then converge, and then actually drift apart a bit again, sigh.
It takes about 2 days for the 5,000+ games (it could run much longer, of course, up to 2x3,891 games before repeating positions, and n times longer for a gauntlet against n other engines).
So, is the conclusion that only fairly large differences can be detected, and even then only with a large number of games?
If so, this is rather disconcerting: many changes incorporated in the belief that they were improvements may well not have been improvements at all.
I suppose, when in doubt, one could prefer the simpler version, or less code, or a smaller number of nodes to a given depth
(but node counts depend on the positions chosen, and the preferred moves often change, for example in the initial position).
Any suggestions ?
Thanks,
Brian
Code:
Program Elo + - Games Score Av.Op. Draws
1 Tinker 753 x64 : 2414 61 61 100 54.0 % 2386 22.0 %
2 Tinker 752 x64 : 2386 61 61 100 46.0 % 2414 22.0 %
1 Tinker 753 x64 : 2413 42 42 200 53.8 % 2387 26.5 %
2 Tinker 752 x64 : 2387 42 42 200 46.2 % 2413 26.5 %
1 Tinker 753 x64 : 2410 29 28 400 53.0 % 2390 30.5 %
2 Tinker 752 x64 : 2390 28 29 400 47.0 % 2410 30.5 %
1 Tinker 753 x64 : 2401 19 19 800 50.3 % 2399 36.1 %
2 Tinker 752 x64 : 2399 19 19 800 49.7 % 2401 36.1 %
1 Tinker 753 x64 : 2403 13 13 1599 50.9 % 2397 37.4 %
2 Tinker 752 x64 : 2397 13 13 1599 49.1 % 2403 37.4 %
1 Tinker 753 x64 : 2401 10 10 3197 50.2 % 2399 36.3 %
2 Tinker 752 x64 : 2399 10 10 3197 49.8 % 2401 36.3 %
1 Tinker 753 x64 : 2401 7 7 5458 50.4 % 2399 34.8 %
2 Tinker 752 x64 : 2399 7 7 5458 49.6 % 2401 34.8 %
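For what it's worth, the error bars in the table can be roughly reproduced with a back-of-the-envelope calculation. Below is a minimal sketch; the logistic Elo model and the 1.96-sigma (95%) interval are my assumptions about what the rating tool does, and the function names are mine:

```python
import math

def elo_diff(score):
    """Convert a match score fraction to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_error_bar(score, draw_ratio, games, z=1.96):
    """Approximate 95% error bar (in Elo) on a match result.

    Uses the per-game variance of the score x in {0, 0.5, 1},
    derived from the overall score and the draw ratio.
    """
    win = score - draw_ratio / 2.0
    loss = 1.0 - score - draw_ratio / 2.0
    var = (win * (1.0 - score) ** 2
           + draw_ratio * (0.5 - score) ** 2
           + loss * score ** 2)
    se = math.sqrt(var / games)  # standard error of the mean score
    # half-width of the interval, mapped through the Elo curve
    return (elo_diff(score + z * se) - elo_diff(score - z * se)) / 2.0
```

Plugging in the first row (54.0 %, 22.0 % draws, 100 games) gives roughly +/-61 Elo, and the last row (50.4 %, 34.8 % draws, 5458 games) gives roughly +/-7, in line with the table. Since the error bar shrinks only like 1/sqrt(games), resolving a difference of a few Elo really does take thousands of games.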