Not again the question :((
Don't you see??
I asked Bernhard Bauer.
testing engines with fixed depth
Moderator: Ras
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: rating list
Paul,
Togga wrote: when one program is Elo 2980 and the other Elo 2805, then the result can be 10:0, like Stockfish vs. Glaurung.
Sorry to bother you, but you have to understand, once and for all, that the outcome of only 10 games is not much different from random. There have been plenty of threads around this topic already, and the whole issue is theoretically well founded; it is simply statistics that you have to accept.
So I propose that you stop any testing that consists of playing matches of only 10 games and then deriving conclusions from that. Play 100 games and I'll still comment on it the same way. Play 200 or 500 games and come back with it; then it may start to make sense, depending on how small an Elo difference you want to measure. That is also what Bob tried to tell you, and he definitely has very deep experience in that area. For testing of his own engine he needs to measure very small improvements of only a few Elo points and therefore has to play several tens of thousands of games before getting a result with the required statistical significance.
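Sven's point can be sketched numerically. The snippet below is a minimal illustration (not from the thread): it uses the standard logistic Elo model for the expected score, and a rough binomial standard error that treats every game as a win/loss trial (draws would only shrink the error further, so this is a conservative bound).

```python
import math

def expected_score(elo_diff):
    """Expected score per game for the stronger engine (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def score_std_error(p, n):
    """Rough standard error of the score fraction over n independent games,
    treating each game as a pure win/loss trial (draws only shrink this)."""
    return math.sqrt(p * (1.0 - p) / n)

p = expected_score(100)                    # engines 100 Elo apart
for n in (10, 100, 1000, 10000):
    print(f"n={n:5d}: expected score {p:.3f} +/- {score_std_error(p, n):.3f}")
```

At 10 games the one-sigma error is about 0.15, i.e. roughly 1.5 game points, which is larger than the 1.4-point edge a 100-Elo-stronger engine is expected to have over 10 games; only at hundreds or thousands of games does the noise fall well below the signal.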
Sven
-
- Posts: 270
- Joined: Thu Jan 15, 2009 12:52 pm
Re: testing engines with fixed depth
*switching to threaded view*
Togga wrote: Not again the question :((
Don't you see??
I asked Bernhard Bauer.
Yes of course. Now I see. Silly me.

-
- Posts: 2851
- Joined: Wed Mar 08, 2006 10:01 pm
- Location: Irvine, CA, USA
Re: glaurung disaster part 2
By using such a short book you are effectively playing longer games than Bob is. I would expect that to make the rating difference larger, but I have no idea by how much.
Togga wrote: 10 games:
1.d4 d5 2.c4
1.e4 e5
1.g3
1.e4 c6 2.d4 d5
1.e4 c5
reversed colors.
This should be a fair short test, isn't it?
Re: rating list
I am testing only for my private usage. I don't have many resources at home: only one computer active, and little time for testing.
So 10 games per match is a good compromise for me.
After 10 games I can say which engine is stronger. Check it out :)
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: rating list
No, you can't, and that is the point you consistently miss.
Togga wrote: After 10 games I can say which engine is stronger. Check it out :)
Play a 10-game match between two engines that are 100 Elo apart. Assume you don't know these Elo ratings in advance, but after you are done, a serious, well-accepted tester tells you he has played 10000 games with both engines against different opponents and got these ratings.
Now repeat your match 10 times. A typical result may be something like 6.5:3.5 for the engine that is 100 Elo stronger, but there may also be 8:2, 9:1, 4:6 and 3:7 results, just to name a few. If your first match result was 3:7, then what do you derive from that? See the problem? You are facing a Gaussian distribution here. NEVER expect that one 10-game match tells you which engine is stronger. NEVER!
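The repeated-match thought experiment above can be simulated directly. A minimal Monte Carlo sketch (not part of the thread; it ignores draws and assumes the logistic Elo model, under which a 100-Elo-stronger engine wins about 64% of decisive games):

```python
import random

def upset_rate(p_win, games=10, trials=10000, seed=1):
    """Monte Carlo sketch: play many short matches in which the genuinely
    stronger engine wins each game with probability p_win (draws ignored
    for simplicity), and count how often it fails to score above 50%."""
    rng = random.Random(seed)
    upsets = 0
    for _ in range(trials):
        wins = sum(rng.random() < p_win for _ in range(games))
        if wins <= games // 2:             # 5 wins or fewer out of 10
            upsets += 1
    return upsets / trials

# ~64% per-game win probability corresponds to a 100 Elo advantage
print(upset_rate(0.64))
```

Under these assumptions the genuinely stronger engine scores 50% or worse in roughly a quarter of all 10-game matches, which is exactly the kind of misleading result Sven is warning about.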

Sven
Re: rating list
Never?? :)) Never say never :)
I played a 10-game match between Stockfish and Glaurung.
Stockfish won 10:0, so I can clearly say that Stockfish is better than Glaurung, and it's true that Stockfish is better :)
But when two engines are nearly equal, then it is really better to run more than 10 games.
I am testing only 5 opening positions, but there are 500 and more different opening positions; look at the ECO codes.
But from these 5 short opening positions that I play I get many, many different opening variations.
For example, after 1.d4 d5 2.c4
one engine plays 2...dc, another 2...e6, another 2...c6, etc.
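Both halves of this argument (a 10:0 sweep is telling, but near-equal engines need far more games) can be put in numbers. A minimal sketch, assuming the logistic Elo model and no draws; the Elo gaps below are illustrative, not figures from the thread:

```python
def sweep_probability(elo_diff, games=10):
    """Chance the stronger engine wins every game of a short match, under
    the logistic Elo model with draws ignored (draws make sweeps rarer)."""
    p_win = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    return p_win ** games

for diff in (100, 175, 400, 700):
    print(f"{diff:3d} Elo gap: P(10-0 sweep) ~ {sweep_probability(diff):.3f}")
```

In this model even an engine 175 Elo stronger (the 2980 vs. 2805 example) sweeps 10 decisive games only about 4% of the time, so an actual 10:0 does point to a much larger gap; but engines within ~100 Elo of each other almost never produce a sweep, so a short match between them says very little either way.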