Not again the question :((
Don't you see??
I asked Bernhard Bauer.
testing engines with fixed depth
Moderator: Ras
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: rating list
Paul,
Togga wrote: when one program is Elo 2980 and the other Elo 2805, then the result can be 10:0, like Stockfish vs. Glaurung.
Sorry to bother you, but you have to understand, once and for all, that the outcome of only 10 games is not much different from random. There have been plenty of threads around this topic already, and the whole issue is theoretically well founded; it is simply statistics that you have to accept.
So I propose that you stop any testing that consists of playing matches of only 10 games and then deriving conclusions from that. Play 100 games and I'll still comment on it the same way. Play 200 or 500 games and come back with it; then it may start to make sense, depending on how small an Elo difference you want to measure. That is also what Bob tried to tell you, and he definitely has very deep experience in that area. For testing of his own engine he needs to measure very small improvements of only a few Elo points and therefore has to play several tens of thousands of games before getting a result with the required statistical significance.
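Sven's point can be sketched numerically. The snippet below is a minimal illustration (not from the thread): it uses the standard logistic Elo model for the expected score, and a rough binomial standard error that treats every game as a win/loss trial (draws would only shrink the error further, so this is a conservative bound).

```python
import math

def expected_score(elo_diff):
    """Expected score per game for the stronger engine (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def score_std_error(p, n):
    """Rough standard error of the score fraction over n independent games,
    treating each game as a pure win/loss trial (draws only shrink this)."""
    return math.sqrt(p * (1.0 - p) / n)

p = expected_score(100)                    # engines 100 Elo apart
for n in (10, 100, 1000, 10000):
    print(f"n={n:5d}: expected score {p:.3f} +/- {score_std_error(p, n):.3f}")
```

At 10 games the one-sigma error is about 0.15, i.e. roughly 1.5 game points, which is larger than the 1.4-point edge a 100-Elo-stronger engine is expected to have over 10 games; only at hundreds or thousands of games does the noise fall well below the signal.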
Sven
-
- Posts: 270
- Joined: Thu Jan 15, 2009 12:52 pm
Re: testing engines with fixed depth
*switching to threaded view*
Togga wrote: Not again the question :((
Don't you see??
I asked Bernhard Bauer.
Yes of course. Now I see. Silly me.

-
- Posts: 2851
- Joined: Wed Mar 08, 2006 10:01 pm
- Location: Irvine, CA, USA
Re: glaurung disaster part 2
By using such a short book you are effectively playing longer games than Bob is. I would expect that to make the rating difference larger, but I have no idea by how much.
Togga wrote: 10 games:
1.d4 d5 2.c4
1.e4 e5
1.g3
1.e4 c6 2.d4 d5
1.e4 c5
reversed colors.
This should be a fair short test, isn't it?
Re: rating list
I am testing only for my private usage. I don't have many resources at home: only one computer active, and little time for testing.
So 10 games per match is a good compromise for me.
After 10 games I can say which engine is stronger. Check it out :)
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: rating list
No, you can't, and that is the point you consistently miss.
Togga wrote: After 10 games I can say which engine is stronger. Check it out :)
Play a 10-game match between two engines that are 100 Elo apart. Assume you don't know these Elo ratings in advance, but after you are done, a serious, well-accepted tester tells you he has played 10000 games with both engines against different opponents and got these ratings.
Now repeat your match 10 times. A typical result may be something like 6.5:3.5 for the engine that is 100 Elo stronger, but there may also be 8:2, 9:1, 4:6 and 3:7 results, just to name a few. If your first match result was 3:7, then what do you derive from that? See the problem? You are facing a Gaussian distribution here. NEVER expect that one 10-game match tells you which engine is stronger. NEVER!
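The repeated-match thought experiment above can be simulated directly. A minimal Monte Carlo sketch (not part of the thread; it ignores draws and assumes the logistic Elo model, under which a 100-Elo-stronger engine wins about 64% of decisive games):

```python
import random

def upset_rate(p_win, games=10, trials=10000, seed=1):
    """Monte Carlo sketch: play many short matches in which the genuinely
    stronger engine wins each game with probability p_win (draws ignored
    for simplicity), and count how often it fails to score above 50%."""
    rng = random.Random(seed)
    upsets = 0
    for _ in range(trials):
        wins = sum(rng.random() < p_win for _ in range(games))
        if wins <= games // 2:             # 5 wins or fewer out of 10
            upsets += 1
    return upsets / trials

# ~64% per-game win probability corresponds to a 100 Elo advantage
print(upset_rate(0.64))
```

Under these assumptions the genuinely stronger engine scores 50% or worse in roughly a quarter of all 10-game matches, which is exactly the kind of misleading result Sven is warning about.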

Sven
Re: rating list
Never?? :)) Never say never :)
I played a 10-game match between Stockfish and Glaurung.
Stockfish won 10:0, so I can clearly say that Stockfish is better than Glaurung, and it's true that Stockfish is better :)
But when two engines are nearly equal, then it is really better to run more than 10 games.
I am testing only 5 opening positions, but there are 500 and more different opening positions; look at the ECO codes.
But from these 5 short opening positions that I play I get many, many different opening variations.
For example, after 1.d4 d5 2.c4
one engine plays 2...dc, another 2...e6, another 2...c6, etc.
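Both halves of this argument (a 10:0 sweep is telling, but near-equal engines need far more games) can be put in numbers. A minimal sketch, assuming the logistic Elo model and no draws; the Elo gaps below are illustrative, not figures from the thread:

```python
def sweep_probability(elo_diff, games=10):
    """Chance the stronger engine wins every game of a short match, under
    the logistic Elo model with draws ignored (draws make sweeps rarer)."""
    p_win = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    return p_win ** games

for diff in (100, 175, 400, 700):
    print(f"{diff:3d} Elo gap: P(10-0 sweep) ~ {sweep_probability(diff):.3f}")
```

In this model even an engine 175 Elo stronger (the 2980 vs. 2805 example) sweeps 10 decisive games only about 4% of the time, so an actual 10:0 does point to a much larger gap; but engines within ~100 Elo of each other almost never produce a sweep, so a short match between them says very little either way.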