Self Testing Trends

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Self Testing Trends

Post by brianr »

There have been several posts about self testing in general, and about testing trends and large numbers of games being required in particular.
In self testing two versions of Tinker I noticed the following results, and thought they might be worth sharing (for those of us without massive clusters).

I used Bob Hyatt's 3,891 starting positions.
Time control 0:10/0.5, no pondering, no books, yes EGTBs (but these factors should not matter, I think).

Results below, starting with 100 games and then roughly doubling.
Notice how the results start quite far apart, but then converge, and then actually start to drift apart a bit again, sigh.
It takes about 2 days for the 5,000+ games (it could run much longer, of course, up to 2x3,891 games before repeating, and n times longer for a gauntlet with n other engines).

So, is the conclusion that only fairly large differences can be determined, and even then only with a large number of games?
If so, this is rather disconcerting, and many changes incorporated in the belief that they were improvements may well not have been improvements at all.
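
A rough back-of-the-envelope sketch suggests the answer is yes. The snippet below (plain Python; it assumes the usual logistic Elo model and roughly a third of the games drawn, not Elostat's or Bayeselo's exact calculation) estimates how many games a given Elo edge needs before it clears two standard deviations of noise:

Code: Select all

# Rough sketch: games needed before a given Elo edge clears ~2 standard
# deviations of self-play noise.  Assumes the standard logistic Elo model
# and a fixed draw fraction; Elostat and Bayeselo model things differently.
import math

def expected_score(elo_diff):
    # Logistic Elo model: expected score for the stronger side.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_frac=0.35, sigmas=2.0):
    p = expected_score(elo_diff)
    wins = p - draw_frac / 2.0                 # win rate implied by score and draws
    var = wins + 0.25 * draw_frac - p * p      # per-game score variance
    # Need sigmas * sqrt(var / n) < p - 0.5 to separate the edge from 50%.
    return math.ceil(var * (sigmas / (p - 0.5)) ** 2)

for diff in (5, 10, 20, 30, 50):
    print(f"{diff:3d} Elo edge -> roughly {games_needed(diff):6d} games")

On those assumptions a 10 Elo edge needs on the order of 3,000 games just to reach two sigma, which fits how slowly the table below settles down.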

I suppose, when in doubt, one could go with the simpler approach, or less code, or the smaller number of nodes to a given depth
(but this depends on the positions, and the preferred moves often change, for example in the initial position).

Any suggestions?

Thanks,
Brian

Code: Select all

  Program                          Elo    +   -   Games   Score   Av.Op.  Draws

1 Tinker 753 x64                 : 2414   61  61   100    54.0 %   2386   22.0 %
2 Tinker 752 x64                 : 2386   61  61   100    46.0 %   2414   22.0 %

1 Tinker 753 x64                 : 2413   42  42   200    53.8 %   2387   26.5 %
2 Tinker 752 x64                 : 2387   42  42   200    46.2 %   2413   26.5 %

1 Tinker 753 x64                 : 2410   29  28   400    53.0 %   2390   30.5 %
2 Tinker 752 x64                 : 2390   28  29   400    47.0 %   2410   30.5 %

1 Tinker 753 x64                 : 2401   19  19   800    50.3 %   2399   36.1 %
2 Tinker 752 x64                 : 2399   19  19   800    49.7 %   2401   36.1 %

1 Tinker 753 x64                 : 2403   13  13  1599    50.9 %   2397   37.4 %
2 Tinker 752 x64                 : 2397   13  13  1599    49.1 %   2403   37.4 %

1 Tinker 753 x64                 : 2401   10  10  3197    50.2 %   2399   36.3 %
2 Tinker 752 x64                 : 2399   10  10  3197    49.8 %   2401   36.3 %

1 Tinker 753 x64                 : 2401    7   7  5458    50.4 %   2399   34.8 %
2 Tinker 752 x64                 : 2399    7   7  5458    49.6 %   2401   34.8 %
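
For what it is worth, the +/- columns above shrink roughly like 1/sqrt(games), as sampling error should. A quick sanity check (plain Python; it assumes the margin is about two standard errors of the mean score converted to Elo, which may not be exactly Elostat's formula):

Code: Select all

# Sanity check: recompute ~95% margins from (games, score, draw fraction)
# and compare with the +/- column above.  Assumes the margin is roughly two
# standard errors of the mean score converted to Elo; Elostat's exact
# formula may differ.
import math

rows = [  # (games, score, draw fraction, reported +/-) from the table above
    (100,  0.540, 0.220, 61),
    (400,  0.530, 0.305, 29),
    (1599, 0.509, 0.374, 13),
    (5458, 0.504, 0.348, 7),
]

for games, score, draws, reported in rows:
    wins = score - draws / 2.0
    var = wins + 0.25 * draws - score * score     # per-game score variance
    se = math.sqrt(var / games)                   # standard error of the mean
    elo_per_score = 400.0 / (math.log(10) * score * (1.0 - score))
    margin = 2.0 * se * elo_per_score
    print(f"{games:5d} games: computed ~+/-{margin:4.1f}   reported +/-{reported}")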
krazyken

Re: Self Testing Trends

Post by krazyken »

The trouble with using a particular set of opening positions is that your results are going to be dependent on the order of the positions.

You need to validate the positions. Are the positions truly representative of positions you will see in actual games by Tinker? Are the positions homogeneous?

The other thing to consider is the tools you are using. You appear to be using BayesELO; do you understand the assumptions this tool is built on? There may be better tools for determining what you want to know. Perhaps you should try a t-test?
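
For illustration, a t-test here could be as simple as scoring each game 1, 0.5 or 0 and testing whether the mean differs from 0.5. A minimal sketch in plain Python with made-up game counts; whether that model really fits chess results is exactly the kind of assumption worth checking:

Code: Select all

# Minimal sketch of a t-test on match results: score each game 1 / 0.5 / 0
# for version A and test whether the mean score differs from 0.5.  With
# thousands of games the t statistic is effectively a z statistic.
import math

def t_vs_half(wins, draws, losses):
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Sample variance of the 1 / 0.5 / 0 scores.
    ss = wins * (1 - mean) ** 2 + draws * (0.5 - mean) ** 2 + losses * mean ** 2
    return mean, (mean - 0.5) / math.sqrt(ss / (n - 1) / n)

# Made-up counts, roughly the size of a 5,000+ game self-test:
mean, t = t_vs_half(wins=1790, draws=1900, losses=1768)
print(f"mean score {mean:.3f}, t = {t:.2f}")  # |t| > ~1.96 would be significant at 5%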
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Self Testing Trends

Post by Dann Corbit »

Does your engine have any forms of learning (book learning, position learning, etc)?

If so, and it is enabled, then we would expect to see long contests drift towards equality unless there is a very powerful improvement.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Self Testing Trends

Post by brianr »

The 3,891 positions should be pretty good.
See this rather long topic (starting around page 12)

http://www.talkchess.com/forum/viewtopi ... light=3891
http://www.talkchess.com/forum/viewtopi ... &start=110

Actually, using Elostat, not Bayeselo (since the former is built into Arena), but have used Bayeselo also.
No, I do not fully understand the statistics involved, but do understand enough to clearly be "dangerous".

Hard to say if Tinker would commonly encounter these positions, since until very recently Tinker had no book of its own, and the selections were pretty limited.

So, bottom line, just relying on Hyatt's test set as a better alternative to the 20 Nunn2, 30 Noomen, and 50 Silver positions
(although I have used them all quite a bit).
krazyken

Re: Self Testing Trends

Post by krazyken »

Yeah I know the history behind those positions, and they are probably suitable for testing Crafty since Crafty tends to play with a huge book and Bob can run very many games to even out any anomalies. But even Bob has stated that using those positions, his early results are not usually indicative of the final results.
I don't know much about Elostat, as I have never used it, but apparently there was enough dissatisfaction with it that most use BayesELO now. Maybe you should try comparing the results between the two.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Self Testing Trends

Post by bob »

brianr wrote:The 3,891 positions should be pretty good.
See this rather long topic (starting around page 12)

http://www.talkchess.com/forum/viewtopi ... light=3891
http://www.talkchess.com/forum/viewtopi ... &start=110

Actually, using Elostat, not Bayeselo (since the former is built into Arena), but have used Bayeselo also.
No, I do not fully understand the statistics involved, but do understand enough to clearly be "dangerous".

Hard to say if Tinker would commonly encounter these positions, since until very recently Tinker had no book of its own, and the selections were pretty limited.

So, bottom line, just relying on Hyatt's test set as a better alternative to the 20 Nunn2, 30 Noomen, and 50 Silver positions
(although I have used them all quite a bit).
For the record, I have a new set and will post them soon. There are 4,000. The first 3,891 should not have changed, but the odd number made the results look a bit odd in the number of games played; now each opponent plays 4000 x 2 games, which is a nice round number with zeroes on the end. :)

As to the positions, no claims about anything are made. I took a large collection of PGN games among _good_ players, and extracted the positions after 12 moves by both sides, and then sorted them followed by a "uniq -c". I then took the 4,000 most common positions. That's about all I can say, they cover a significant number of different openings, although I don't think there are any oddball openings (1. f4 and 2. Kf2 for example) included.
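
For anyone who wants to reproduce that extraction, here is a rough equivalent in Python (it assumes the python-chess library and a hypothetical games.pgn input file; the original pipeline was simply sort and uniq -c in the shell):

Code: Select all

# Sketch of the extraction described above: take the position after 12 moves
# by both sides from each game, count duplicates, keep the most common ones.
# Assumes the python-chess library and a hypothetical "games.pgn" file.
from collections import Counter
import chess.pgn

counts = Counter()
with open("games.pgn") as pgn:
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:
            break
        board = game.board()
        plies = 0
        for move in game.mainline_moves():
            if plies == 24:        # 12 moves by both sides = 24 plies
                break
            board.push(move)
            plies += 1
        if plies == 24:            # skip games shorter than 12 full moves
            # Drop the halfmove/fullmove counters so identical positions match.
            counts[" ".join(board.fen().split()[:4])] += 1

for fen, n in counts.most_common(4000):    # keep the 4,000 most common
    print(n, fen)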
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Self Testing Trends

Post by bob »

krazyken wrote:Yeah I know the history behind those positions, and they are probably suitable for testing Crafty since Crafty tends to play with a huge book and Bob can run very many games to even out any anomalies. But even Bob has stated that using those positions, his early results are not usually indicative of the final results.
I don't know much about Elostat, as I have never used it, but apparently there was enough dissatisfaction with it that most use BayesELO now. Maybe you should try comparing the results between the two.
The first N positions are not "odd". I just notice that when I watch the games complete (fast games finish at a rate of maybe 500 games a minute), if I catch the results after 200 games, and then repeat and look again after another 200 games, the results can be wildly different. Not necessarily wildly one-sided in favor of either Crafty or its opponent, just different each time.
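
A toy simulation shows how noisy 200-game snapshots are (plain Python; it assumes two exactly equal engines and a 35% draw rate, which is nothing like the real cluster setup, just an illustration):

Code: Select all

# Toy simulation: even between two *identical* engines, 200-game snapshots
# bounce around a lot.  Assumes a 35% draw rate and otherwise 50/50 results.
import math
import random

random.seed(1)

def snapshot(games=200, draw_frac=0.35):
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_frac:
            score += 0.5                              # draw
        elif r < draw_frac + (1.0 - draw_frac) / 2.0:
            score += 1.0                              # win
    return score / games

for i in range(6):
    s = snapshot()
    elo = -400.0 * math.log10(1.0 / s - 1.0)          # score as an Elo edge
    print(f"run {i + 1}: score {s:5.1%}  ~{elo:+5.1f} Elo")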

As far as comparisons go, I believe Bayeselo is about as good as it gets.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Self Testing Trends

Post by brianr »

Dann Corbit wrote:Does your engine have any forms of learning (book learning, position learning, etc)?

If so, and it is enabled, then we would expect to see long contests drift towards equality unless there is a very powerful improvement.
Yes, Tinker has different types of learning (position learning sort of like permanent hash, book learning, and even temporal difference learning), but they are all turned off.
krazyken wrote:I don't know much about Elostat, as I have never used it, but apparently there was enough dissatisfaction with it that most use BayesELO now. Maybe you should try comparing the results between the two.
Bayeselo results are pretty much the same:

Code: Select all

Rank Name             Elo    +    - games score oppo. draws
   1 Tinker 753 x64     0    7    7  5458   50%     0   35%
   2 Tinker 752 x64     0    7    7  5458   50%     0   35%
ResultSet-EloRating>los
                Ti Ti
Tinker 753 x64     50
Tinker 752 x64  50
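
The los table is the likelihood of superiority: the probability that one engine really is the stronger one, given the results. For two engines it can be approximated from wins and losses alone (a sketch in plain Python using the common normal approximation; Bayeselo's own Bayesian computation is not identical to this):

Code: Select all

# Approximate likelihood of superiority (LOS) from wins and losses.
# Draws say nothing about which engine is stronger, so they drop out.
# This is the usual normal approximation, not Bayeselo's exact output.
import math

def los(wins, losses):
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical example: 55 wins vs 45 losses (draws ignored).
print(f"LOS = {los(55, 45):.1%}")   # about 84%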
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Self Testing Trends

Post by Don »

Funny, I did an auto-test opening book almost identical to this. I did the sort piped into uniq -c, and I don't remember my threshold for how often a position had to be played, but I ended up with 3793 openings.

Actually, I had MORE than 3793 openings, but I did some work to eliminate transpositions at the end of the sequence, and I even found a few that were "likely" to transpose 2 or 3 moves later.
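
Eliminating transpositions at the end of the sequence amounts to keeping only one line per final position. A minimal sketch of the idea (Python, assuming the python-chess library and two hypothetical example lines; this is not the actual procedure used):

Code: Select all

# Minimal sketch: drop opening lines that transpose to the same final position.
# Assumes python-chess and hypothetical 10-ply SAN move sequences.
import chess

def dedupe_by_final_position(lines):
    seen = set()
    unique = []
    for moves in lines:
        board = chess.Board()
        for san in moves:
            board.push_san(san)
        key = board.fen().rsplit(" ", 2)[0]   # ignore the move counters
        if key not in seen:
            seen.add(key)
            unique.append(moves)
    return unique

lines = [
    ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O", "Be7"],
    ["e4", "e5", "Bb5", "Nc6", "Nf3", "a6", "Ba4", "Nf6", "O-O", "Be7"],  # transposes
]
print(len(dedupe_by_final_position(lines)), "unique line(s)")   # 1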

So if anyone wants my 3793 openings (they are all 10 ply long), just let me know and I will be glad to pass them along.

By the way, I have a nice tester if anyone is interested. It's not graphical, but it's written in C and is very fast (I can play several 3-ply games per second on 1 processor, for instance). It has these features (and non-features):

* can handle up to 128 programs - with 3793 openings.
* multi-processor - set how many simultaneous games you want to play
* only supports Fischer time controls or fixed depth testing.
* Linux only, although it could be ported by a windows guru.
* coded in C to be high performance.
* No ponder support (since I don't test that way)
* configured via a configuration file.
* rates games, produces error bars.
* produces an html file as output as well as text output
* produces pgn record of each game.
* UCI only
* command line program - no GUI
* round robin only

It could be slicker, but I wrote it just for me, although I'm willing to share. Several features that are missing could be added easily enough.

I think I described how the openings work in a previous post. Each player plays each side of each opening against each opponent!

So if anyone wants this thing, let me know. It uses libevent for multiplexing the games.
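
The pairing rule described above (each engine plays both colors of every opening against every opponent) is easy to write down as a schedule. A sketch in plain Python (engine names and counts are hypothetical; this is not the C tester itself):

Code: Select all

# Sketch of the round-robin schedule: every pair of engines plays both
# colors of every opening.  Engine names and the opening count are made up.
from itertools import combinations

engines = ["EngineA", "EngineB", "EngineC"]
openings = range(3793)                      # one index per opening line

schedule = []
for white, black in combinations(engines, 2):
    for opening in openings:
        schedule.append((white, black, opening))   # first engine has White
        schedule.append((black, white, opening))   # colors reversed, same opening

print(len(schedule), "games scheduled")     # 3 pairs * 2 * 3793 = 22758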

bob wrote:
brianr wrote:The 3,891 positions should be pretty good.
See this rather long topic (starting around page 12)

http://www.talkchess.com/forum/viewtopi ... light=3891
http://www.talkchess.com/forum/viewtopi ... &start=110

Actually, using Elostat, not Bayeselo (since the former is built into Arena), but have used Bayeselo also.
No, I do not fully understand the statistics involved, but do understand enough to clearly be "dangerous".

Hard to say if Tinker would commonly encounter these positions, since until very recently Tinker had no book of its own, and the selections were pretty limited.

So, bottom line, just relying on Hyatt's test set as a better alternative to the 20 Nunn2, 30 Noomen, and 50 Silver positions
(although I have used them all quite a bit).
For the record, I have a new set and will post them soon. There are 4,000. The first 3,891 should not have changed, but the odd number made the results look a bit odd in the number of games played; now each opponent plays 4000 x 2 games, which is a nice round number with zeroes on the end. :)

As to the positions, no claims about anything are made. I took a large collection of PGN games among _good_ players, and extracted the positions after 12 moves by both sides, and then sorted them followed by a "uniq -c". I then took the 4,000 most common positions. That's about all I can say, they cover a significant number of different openings, although I don't think there are any oddball openings (1. f4 and 2. Kf2 for example) included.