CCRL 40/40 and 40/4 lists tidied up

Modern Times
Posts: 3555
Joined: Thu Jun 07, 2012 11:02 pm

Re: CCRL 40/40 and 40/4 lists tidied up

Post by Modern Times »

Norm Pollock wrote:The following link has 82 CCRL games where there is a 500+ difference in Elo ratings.

https://dl.dropboxusercontent.com/u/66249444/cf1.7z

-Norm
This is something we try to avoid, but 82 games out of 458,000+ isn't too big an issue unless those games are concentrated in a small group of engines.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: CCRL 40/40 and 40/4 lists tidied up

Post by Adam Hair »

Number of games in CCRL-4040.pgn = 458023 (without a result = 0)
Number of players = 1324
Date range: 2005.12.19 - 2013.06.22
Elo range: 1790 - 3249
Number of: White Elos = 454269 Black Elos = 454297 Both = 451933
Average: White Elo = 2697.24 Black Elo = 2697.19
Elo distance: maximum = 859 average = 73.924

Number of White wins = 162656 ( 35.51 % )
Number of draws = 174916 ( 38.19 % )
Number of Black wins = 120451 ( 26.3 % )
White score = 54.61 %
Black score = 45.39 %
Number of ECOs = 458023 A: 96243 B: 124147 C: 76136 D: 87585 E: 73912
Number of PlyCounts = 458023 range: 28-702 average = 137.21
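
As a quick sanity check on the figures above, the White and Black scores follow from the win and draw counts if a draw is counted as half a point for each side. The sketch below only re-derives the percentages from the numbers already listed; the variable names are illustrative.

```python
# Counts copied from the statistics above.
games = 458023
white_wins, draws, black_wins = 162656, 174916, 120451

# Every game has a result, so the three counts should add up to the total.
assert white_wins + draws + black_wins == games

# A draw is worth half a point to each side.
white_score = (white_wins + draws / 2) / games
black_score = (black_wins + draws / 2) / games

print(f"White score = {100 * white_score:.2f} %")  # White score = 54.61 %
print(f"Black score = {100 * black_score:.2f} %")  # Black score = 45.39 %
```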

It does not seem that the 40/40 database is overly flawed for the purpose of book making. It just requires some filtering, for which I recommend using Norm's PGN tools. For example, approximately 1/4 of the database consists of games where both engines are rated at least 2800 Elo and are within 100 Elo of each other.
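
For readers who want to try this kind of filter without a dedicated tool, here is a minimal sketch in plain Python. It is not how Norm's PGN tools work; it simply reads the standard `WhiteElo` and `BlackElo` tags and keeps games where both engines are rated at least 2800 and are within 100 Elo of each other. The function names, the thresholds as parameters, and the sample games are all made up for illustration.

```python
import re

# Matches a PGN tag line such as: [WhiteElo "3100"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

def split_games(pgn_text):
    """Yield (tags, raw_text) for each game in a PGN string."""
    tags, lines, in_moves = {}, [], False
    for line in pgn_text.splitlines():
        m = TAG_RE.match(line)
        if m and in_moves:
            # A tag line after movetext means a new game has started.
            yield tags, "\n".join(lines).strip()
            tags, lines, in_moves = {}, [], False
        if m:
            tags[m.group(1)] = m.group(2)
        elif line.strip():
            in_moves = True
        lines.append(line)
    if tags:
        yield tags, "\n".join(lines).strip()

def keep_game(tags, min_elo=2800, max_diff=100):
    """True if both players have Elo tags and pass both thresholds."""
    try:
        white = int(tags["WhiteElo"])
        black = int(tags["BlackElo"])
    except (KeyError, ValueError):
        return False  # games with a missing or unusable Elo tag are dropped
    return min(white, black) >= min_elo and abs(white - black) <= max_diff

# Invented sample data, not taken from the CCRL database.
SAMPLE = '''[White "Engine A"]
[Black "Engine B"]
[WhiteElo "3100"]
[BlackElo "3050"]
[Result "1/2-1/2"]

1. e4 e5 1/2-1/2

[White "Engine C"]
[Black "Engine D"]
[WhiteElo "3100"]
[BlackElo "2400"]
[Result "1-0"]

1. d4 d5 1-0

[White "Engine E"]
[Black "Engine F"]
[WhiteElo "2900"]
[Result "0-1"]

1. c4 e5 0-1
'''

kept = [tags for tags, _ in split_games(SAMPLE) if keep_game(tags)]
print([t["White"] for t in kept])  # ['Engine A']
```

Only the first sample game survives: the second has a 700 Elo gap, and the third has no `BlackElo` tag at all.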
Norm Pollock
Posts: 1057
Joined: Thu Mar 09, 2006 4:15 pm
Location: Long Island, NY, USA

Re: CCRL 40/40 and 40/4 lists tidied up

Post by Norm Pollock »

Adam Hair wrote:Number of games in CCRL-4040.pgn = 458023 (without a result = 0)
Number of players = 1324
Date range: 2005.12.19 - 2013.06.22
Elo range: 1790 - 3249
Number of: White Elos = 454269 Black Elos = 454297 Both = 451933
Average: White Elo = 2697.24 Black Elo = 2697.19
Elo distance: maximum = 859 average = 73.924

Number of White wins = 162656 ( 35.51 % )
Number of draws = 174916 ( 38.19 % )
Number of Black wins = 120451 ( 26.3 % )
White score = 54.61 %
Black score = 45.39 %
Number of ECOs = 458023 A: 96243 B: 124147 C: 76136 D: 87585 E: 73912
Number of PlyCounts = 458023 range: 28-702 average = 137.21

It does not seem that the 40/40 database is overly flawed for the purpose of book making. It just requires some filtering, for which I recommend using Norm's PGN tools. For example, approximately 1/4 of the database consists of games where both engines are rated at least 2800 Elo and are within 100 Elo of each other.
I have a third issue with using the CCRL database for book building: for the most part, you are building a book based on moves from other books, and only slightly based on engine performance. Of course, the results will be original data.

Explanation: each game in the CCRL database uses a book for the first 8-12 moves. So if you make an 8-12 move book from the CCRL database, you are mostly getting the moves that the engines got from their books. If you make a 16-move book, then the later book moves will be based on the engines' own work.
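
To put rough numbers on this, the sketch below splits the opening of one game into plies that came from the original book and plies the engines chose themselves. It assumes the book depth is known (12 moves here, the upper end of the 8-12 range); the function name and the placeholder move list are hypothetical.

```python
def split_by_origin(san_moves, book_moves=12, new_book_moves=16):
    """Split the plies used for a new book into (from_book, from_engine).

    san_moves is a flat list of plies (White, Black, White, ...), so one
    full move is two plies. book_moves is the assumed depth of the book
    the game was played with; new_book_moves is the depth of the book
    being built.
    """
    book_plies = 2 * book_moves
    limit = 2 * new_book_moves
    considered = san_moves[:limit]
    return considered[:book_plies], considered[book_plies:]

# Placeholder plies standing in for a real game's moves.
game = [f"ply{i}" for i in range(1, 81)]

from_book, from_engine = split_by_origin(game, book_moves=12, new_book_moves=16)
print(len(from_book), len(from_engine))  # 24 8

from_book, from_engine = split_by_origin(game, book_moves=12, new_book_moves=12)
print(len(from_book), len(from_engine))  # 24 0
```

Under these assumptions a 12-move book contains no engine-chosen moves at all, while a 16-move book gets only a quarter of its plies (8 of 32) from the engines.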
Kirill Kryukov
Posts: 492
Joined: Sun Mar 19, 2006 4:12 am

Re: CCRL 40/40 and 40/4 lists tidied up

Post by Kirill Kryukov »

Norm Pollock wrote:
Kirill explained as follows:

Engines with too few games don't have an Elo tag in the PGN file. In 40/40 this probably means fewer than 200 games. This is our way of saying that those Elos are not reliable and should not be used for anything. For example, automatic opening book building may use ratings, and you would not want to use any rating that is based on 15 games. The 200-game requirement is arbitrary, but at least it ensures some minimum quality of rating estimates. (Engines with fewer than 200 games also don't appear in the main list.)
Graham and Kirill,

With regard to using the CCRL database for creating an opening book, which Kirill mentions, I want to point out two major issues with that project:

1. Many of the engines that are tested are weak, relatively speaking. This is a wonderful attribute for CCRL testing, but it is not so wonderful for creating an opening book. A quality database for an opening book should be based on the moves and results from the strongest players.

2. Many games in the CCRL database have a wide difference in Elo ratings between the players. This skews the results of the games to the stronger engine even if the weaker engine uses the better opening. Players in each game in the database should be relatively equal in strength, perhaps at most a 100 Elo difference. Otherwise the importance of the opening will be obscured.

The following link has 82 CCRL games where there is a 500+ difference in Elo ratings.

https://dl.dropboxusercontent.com/u/66249444/cf1.7z

-Norm
The issues you mention are actually the main reason why we provide Elo tags in the PGN, and why the tag is only added for engines with a sufficient number of games. The thing is, it's absolutely essential to filter against those two issues when building a book from any kind of game database. Such filtering requires accurate Elo tags, which we are happy to provide.

If we included Elo tags of all engines in the database, the filtering would be less accurate as some ratings would be bogus (based on 10 games).

The reason why we don't filter it for you and only release the complete database (instead of a database of games between strong players or with small rating differences) is that people may have different ideas about the criteria for acceptable games.

So, what you mention is not "two major issues with that project", but two major issues with building a book from any unfiltered database.

Frankly, it's surprising that you, of all people, would miss the point entirely, but maybe it's me who is missing something here.

Best,
Kirill
Modern Times
Posts: 3555
Joined: Thu Jun 07, 2012 11:02 pm

Re: CCRL 40/40 and 40/4 lists tidied up

Post by Modern Times »

Norm Pollock wrote:I have a third issue with using the CCRL database for book building: for the most part, you are building a book based on moves from other books, and only slightly based on engine performance. Of course, the results will be original data.

Explanation: each game in the CCRL database uses a book for the first 8-12 moves. So if you make an 8-12 move book from the CCRL database, you are mostly getting the moves that the engines got from their books. If you make a 16-move book, then the later book moves will be based on the engines' own work.
I agree with this. I don't really see the point of making books from books, unless as you say, the new book goes beyond 12 moves.