CCRL update (4th May 2007)

Graham Banks
Posts: 44729
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: CCRL update (4th May 2007)

Post by Graham Banks »

IWB wrote: So in short:

Different books, different hardware, different time controls, different hash sizes (1 or 2 CPUs), different CPU brands, different sets of tablebases (4 or 5 pieces). The exact conditions under which each game is played are not known, right?

All this combines to one rating list.

Hmm - I have to think about what this is worth!?

Thx Uri and bye
Ingo
Hi Ingo,

if you don't think it's worth anything, then you don't have to look at it.
We do this as a hobby and share our results as a group.

We have always advised that enthusiasts look at the results and rating lists of all testers and testing groups in order to get an accurate picture overall.
If you do this you'll find that results and ratings tend to be consistent.

Regards, Graham.
Shaun
Posts: 323
Joined: Wed Mar 08, 2006 9:55 pm
Location: Brighton - UK

Re: CCRL update (4th May 2007)

Post by Shaun »

IWB wrote:
Uri Blass wrote:
I am sure the lists will not be identical, and one reason is that the relative speed of different programs differs with different hardware.

http://kd.lab.nig.ac.jp/chess/discussio ... php?t=1486

Uri
So in short:

Different books, different hardware, different time controls, different hash sizes (1 or 2 CPUs), different CPU brands, different sets of tablebases (4 or 5 pieces). The exact conditions under which each game is played are not known, right?

All this combines to one rating list.

Hmm - I have to think about what this is worth!?

Thx Uri and bye
Ingo
Hi Ingo,

Ideal testing conditions would require identical hardware - however this would be impractical as even 2 machines with the same processor can bench differently.

Therefore we use a specific Crafty version to bench our machines and adjust the time control based on these results.

Now it is possible that time management issues in a particular engine could cause problems, but in terms of depth, 40/20 on a machine twice as fast as our control will equate to 40/80 on a machine half as fast.
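
To make the arithmetic concrete, here is a minimal sketch of that kind of adjustment. It is not the actual CCRL procedure: the function name, the benchmark figures and the purely linear scaling are assumptions, and the 40-minute reference is only read off the 40/20 and 40/80 figures above.

```python
def scaled_minutes(reference_minutes: float,
                   reference_bench: float,
                   machine_bench: float) -> float:
    """Minutes for 40 moves on a tester's machine.

    Assumes search speed scales linearly with the benchmark figure
    (higher = faster), so a machine twice as fast gets half the time.
    """
    if reference_bench <= 0 or machine_bench <= 0:
        raise ValueError("benchmark figures must be positive")
    return reference_minutes * reference_bench / machine_bench


# Hypothetical benchmark figures; only their ratio matters.
REFERENCE_BENCH = 1_000_000

print(scaled_minutes(40, REFERENCE_BENCH, 2_000_000))  # machine twice as fast
print(scaled_minutes(40, REFERENCE_BENCH, 500_000))    # machine half as fast
```

Under these assumptions the faster machine plays 40 moves in 20 minutes and the slower one 40 moves in 80 minutes, matching the example above. Engines with unusual time management would not necessarily follow the linear model.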

With regard to books: for any pairing both engines use the same book. We also remove games where it appears the book has given a decisive advantage. (My worry here is we may be throwing away too many games when trying to be safe :lol: ).

A variety of books should also avoid a particular bias (although all our books are chosen with balance in mind).

Now in our internal database we know the book, time control, tester etc. and usually the hardware - the AMD/Intel bias is something I have been looking at in blitz - to my surprise there does not seem to be conclusive evidence of the bias in the results, although this is something I will continue to monitor.

One word on hash sizes: we allocate hash so that it is enough for the time control - having run several hundred games to check the effect of hash, I can say that I have seen no evidence that we do not allocate enough hash, and this finding has been backed up by others.

The biggest bias in ratings is opponent selection - our pure lists go a very long way to address this, and you can also easily look at results side by side to compare versions.

I hope this answers your concerns - please note we welcome all feedback, as it has affected and will affect how we do things as well as trigger internal validation/debate.

All the best

Shaun
Marc MP

Re: CCRL update (4th May 2007)

Post by Marc MP »

As far as I am concerned, CCRL ratings are certainly precise enough. Whether the different opening books, CPUs or time controls introduce some kind of (small) bias is absolutely irrelevant for me. More important for the rating list is readability, ease of sorting engines by CPUs, 32-bit, freeware, and finding the actual parameters for setting X of engine Y etc. For all these things CCRL is doing an outstanding job.

I would also add: what would be the additional use of a perfectly controlled rating list? i.e. if all engines used the same cpu, opening book and GUI?

If I don't own the same CPU (and hardware in general, OS etc.), the results I get at home will differ a bit anyway.

Or maybe I will say that my favorite engine is disadvantaged by the generic book, it should never play this type of position etc. I'll use my own books and results will again differ (slightly) from my neighbor's or CCRL's or CEGT's anyway.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: CCRL update (4th May 2007)

Post by Dirt »

Shaun wrote:... we also remove games where it appears the book has given a decisive advantage.
Wow, that sounds like a bad idea. The more subjective decisions (bad book line) have to be made, the more chance there is for inadvertent bias. Can you really force yourself to examine the opening with the same skepticism when Scorpio loses to Rybka as when Rybka loses to Scorpio?
Shaun wrote:(My worry here is we may be throwing away too many games when trying to be safe :lol: ). ...
One is too many, I think.
Shaun
Posts: 323
Joined: Wed Mar 08, 2006 9:55 pm
Location: Brighton - UK

Re: CCRL update (4th May 2007)

Post by Shaun »

Dirt wrote:
Shaun wrote:... we also remove games where it appears the book has given a decisive advantage.
Wow, that sounds like a bad idea. The more subjective decisions (bad book line) have to be made, the more chance there is for inadvertent bias. Can you really force yourself to examine the opening with the same skepticism when Scorpio loses to Rybka as when Rybka loses to Scorpio?
Shaun wrote:(My worry here is we may be throwing away too many games when trying to be safe :lol: ). ...
One is too many, I think.
The removal is about as fair as it could be without manually checking each game*. Basically, only if both engines give a score greater than/less than n and this score is maintained or increased is there a strong risk that the loss is due to a bad opening.

E.g. both engines give a score of +1 on leaving the book, the score never drops back, and White wins.
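
A rough sketch of such a check, to show what the rule amounts to. This is only one reading of the description above, not the script actually used: the function name, the input format (evaluations in pawns from White's point of view, starting at the first move out of book) and the simplification of "maintained or increased" to "never drops back below n" are all assumptions.

```python
from typing import Sequence


def flag_book_decided(
    evals_white_engine: Sequence[float],
    evals_black_engine: Sequence[float],
    result: str,
    n: float = 1.0,
) -> bool:
    """Return True if the game looks decided by the book.

    Both eval lists are in pawns from White's point of view and start
    with the first move after leaving the book.
    """
    if not evals_white_engine or not evals_black_engine:
        return False  # no evals to judge by, so keep the game
    all_evals = list(evals_white_engine) + list(evals_black_engine)
    if result == "1-0":
        # White wins and neither engine ever saw the score drop below +n.
        return all(score >= n for score in all_evals)
    if result == "0-1":
        # Mirror case: Black wins and the score never rose above -n.
        return all(score <= -n for score in all_evals)
    return False  # draws are never flagged by this rule


print(flag_book_decided([1.0, 1.4, 2.5], [1.1, 1.3, 3.0], "1-0"))
print(flag_book_decided([1.0, 0.3, 2.5], [1.1, 1.3, 3.0], "1-0"))
```

The first call is the +1 example above and is flagged; in the second the score drops back to +0.3 at one point, so the game is kept. Note that the rule as sketched only ever removes decisive games, never draws.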

* I have looked at some of these removed games and sometimes the opening is clearly bust - however sometimes it is unclear. Now in the absence of individual analysis by a better player than me, I think it is safer to remove these games.

Shaun

P.S. This is only a very small percentage of games; it is just that when thousands of games are involved the actual number becomes big.

P.P.S. Again our raw internal database does not drop games so if we enhance our detection then games will re-appear.
Graham Banks
Posts: 44729
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: CCRL update (4th May 2007)

Post by Graham Banks »

Dirt wrote:
Shaun wrote:... we also remove games where it appears the book has given a decisive advantage.
Wow, that sounds like a bad idea. The more subjective decisions (bad book line) have to be made, the more chance there is for inadvertent bias. Can you really force yourself to examine the opening with the same skepticism when Scorpio loses to Rybka as when Rybka loses to Scorpio?
Shaun wrote:(My worry here is we may be throwing away too many games when trying to be safe :lol: ). ...
One is too many, I think.
In every generic opening book I've ever used, I've always come across at least one bad line.
Why gift an engine a win from a poor automatically played opening variation?
Eelco de Groot
Posts: 4676
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: CCRL update (4th May 2007)

Post by Eelco de Groot »

Dirt wrote:
Shaun wrote:... we also remove games where it appears the book has given a decisive advantage.
Wow, that sounds like a bad idea. The more subjective decisions (bad book line) have to be made, the more chance there is for inadvertent bias. Can you really force yourself to examine the opening with the same skepticism when Scorpio loses to Rybka as when Rybka loses to Scorpio?
Shaun wrote:(My worry here is we may be throwing away too many games when trying to be safe :lol: ). ...
One is too many, I think.
Hello Shaun, intuitively I would say that you probably do this only with 1:0-1:0 and 0:1-0:1 results, two games with reversed colors where the opening seemed to have a big influence and were won by one color? And I suspect that the ½:½ - ½:½ results that also tie a match, are left alone?

That would leave all the 1:0-0:1, 0:1-1:0, 1:0-½:½, 0:1-½:½, ½:½-1:0, ½:½-0:1 results, where one of the engines wins the match. In other words, doing this will always favour the stronger engines as they win more matches.
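
The counting behind this can be checked with a small enumeration. This is only a sketch: it assumes each mini-match is two games from the same opening with colors reversed, and the names are made up for illustration.

```python
from fractions import Fraction
from itertools import product

HALF = Fraction(1, 2)
# Possible scores for White in a single game: win, draw, loss.
WHITE_SCORES = (Fraction(1), HALF, Fraction(0))


def classify_pairs():
    """Split the nine possible two-game results into tied and decided matches."""
    tied, decided = [], []
    for game1, game2 in product(WHITE_SCORES, repeat=2):
        # Engine A has White in game 1 and Black in game 2.
        a_total = game1 + (1 - game2)
        (tied if a_total == 1 else decided).append((game1, game2))
    return tied, decided


tied, decided = classify_pairs()
print(len(tied), "tied:", tied)
print(len(decided), "decided:", decided)
```

Three of the nine combinations tie the match (White wins both games, Black wins both games, or two draws) and the other six are the decided matches listed above. The sketch only counts combinations; it says nothing about how often each occurs in practice.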

Correction: and also the weakest engines as they lose more matches against the average opposition!

So intuitively I would say this is bad statistical practice unless you could somehow cull the dead draws in equal measure to the almost sure white or black wins...

Is this a correct conclusion, anyone?

At least personally I don't bother very much if a game result seems influenced by the opening; the influence is inevitable anyway, and you would also have to prune the games where the opening leads to almost dead draws (and this probably is harder to judge than sure white and black wins!) and do this in equal measure to compensate, and avoid bias.

Regards, Eelco

-A probably related philosophical question: what is a neutral opening book? What does that mean really? A book with completely random moves, is that neutral? Or a book with theoretically completely equal positions, is that a neutral book? -
Eelco de Groot
Posts: 4676
Joined: Sun Mar 12, 2006 2:40 am
Full name:   Eelco de Groot

Re: CCRL update (4th May 2007)

Post by Eelco de Groot »

Eelco de Groot wrote:
Dirt wrote:
Shaun wrote:... we also remove games where it appears the book has given a decisive advantage.
Wow, that sounds like a bad idea. The more subjective decisions (bad book line) have to be made, the more chance there is for inadvertent bias. Can you really force yourself to examine the opening with the same skepticism when Scorpio loses to Rybka as when Rybka loses to Scorpio?
Shaun wrote:(My worry here is we may be throwing away too many games when trying to be safe :lol: ). ...
One is too many, I think.
Hello Shaun, intuitively I would say that you probably do this only with 1:0-1:0 and 0:1-0:1 results, two games with reversed colors where the opening seemed to have a big influence and were won by one color? And I suspect that the ½:½ - ½:½ results that also tie a match, are left alone?

That would leave all the 1:0-0:1, 0:1-1:0, 1:0-½:½, 0:1-½:½, ½:½-1:0, ½:½-0:1 results, where one of the engines wins the match. In other words, doing this will always favour the stronger engines as they win more matches.

Correction {1}: and also the weakest engines as they lose more matches against the average opposition!

So intuitively I would say this is bad statistical practice, Correction {2}; even more so if you try to cull the dead draws in equal measure to the almost sure white or black wins, because this again favours the strongest and the weakest engines, because they are over-represented in the decided matches...

Is this a correct conclusion, anyone?

At least personally I don't bother very much if a game result seems influenced by the opening; the influence is inevitable anyway, you would also have to prune the games where the opening leads to almost dead draws (this probably is harder to judge than sure white and black wins!) and, correction {3}: It is very difficult to do this without introducing bias if you do it afterwards trying to pick out the matches with bad lines, for one reason because these lines are more obvious in the drawn matches, those with one color wins or two dead draws.

Regards, Eelco

-A probably related philosophical question: what is a neutral opening book? What does that mean really? A book with completely random moves, is that neutral? Or a book with theoretically completely equal positions, is that a neutral book? -
Last edited by Eelco de Groot on Sun May 06, 2007 9:21 am, edited 1 time in total.
Graham Banks
Posts: 44729
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: CCRL update (4th May 2007)

Post by Graham Banks »

Eelco de Groot wrote: -A probably related philosophical question: what is a neutral opening book? What does that mean really? A book with completely random moves, is that neutral? Or a book with theoretically completely equal positions, is that a neutral book? -
I regard a fair neutral opening book as one that sees both sides coming out of the opening with a position that does not lead to an immediate advantage to one of them.
Of course, this can be determined fairly quickly.

Our policy is that any game where the evaluation for one side is always worse than -0.75 for the entire game, and the other engine is in agreement, does not count towards our ratings.
This 0.75 threshold was reached after consultation with opening book authors and there is only a very small number of games affected.

Whereas in matchplay, one can negate the problem by getting each engine to play from a given position as both White and Black, this does not occur in tournaments.

Regards, Graham.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: CCRL update (4th May 2007)

Post by IWB »

Hello Shaun

My "intellectual" problem is not a single difference but the sum of all of them!

The different books just give different starting positions; if they are maintained well, these books are my smallest concern.

Whether the hash size is 128 or 256 makes a difference, but in Elo it might be only "a bit". (Nevertheless I do not see a reason for giving double the amount of hash to a dual-core engine. That it fills the hash faster is not a reason for me: that is the cause of the engine being better, so why then give it the additional advantage of double hash?)
Different CPUs are worse, as one engine might be better tuned for AMD or Intel, and if one tester contributes more games with his AMD or Intel machine for that particular tuned engine, it might be rated better than it is (or the others worse than they are).
Different time controls are my biggest concern, as I KNOW that some engines behave differently at 40 moves in 20 minutes than at 40 moves in 10 minutes. If every tester plays the same fraction of games for each engine as everyone else, then fine, but if one tester plays the majority of games with one particular engine you might have a problem.

Overall I think you assume that all these differences balance each other out, but the assumption that they amplify each other is just as valid!
The worst thing that might happen is that the list is OK but one particular engine is discriminated against by the sum of all the differences.

Again: I do not have any proof or even indication of something like this in your list - it is only a theoretical possibility.

Nevertheless I know what huge amount of work and money is invested in your list as I maintain something similar and you, as a team, have all my respect for that!

Bye
Ingo
Shaun wrote:
Hi Ingo,

Ideal testing conditions would require identical hardware - however this would be impractical as even 2 machines with the same processor can bench differently.

Therefore we use a specific Crafty version to bench our machines and adjust the time control based on these results.

Now it is possible that time management issues in a particular engine could cause problems, but in terms of depth, 40/20 on a machine twice as fast as our control will equate to 40/80 on a machine half as fast.

With regard to books: for any pairing both engines use the same book. We also remove games where it appears the book has given a decisive advantage. (My worry here is we may be throwing away too many games when trying to be safe :lol: ).

A variety of books should also avoid a particular bias (although all our books are chosen with balance in mind).

Now in our internal database we know the book, time control, tester etc. and usually the hardware - the AMD/Intel bias is something I have been looking at in blitz - to my surprise there does not seem to be conclusive evidence of the bias in the results, although this is something I will continue to monitor.

One word on hash sizes: we allocate hash so that it is enough for the time control - having run several hundred games to check the effect of hash, I can say that I have seen no evidence that we do not allocate enough hash, and this finding has been backed up by others.

The biggest bias in ratings is opponent selection - our pure lists go a very long way to address this, and you can also easily look at results side by side to compare versions.

I hope this answers your concerns - please note we welcome all feedback, as it has affected and will affect how we do things as well as trigger internal validation/debate.

All the best

Shaun