Norm Pollock wrote:
The following link has 82 CCRL games where there is a 500+ difference in Elo ratings.
https://dl.dropboxusercontent.com/u/66249444/cf1.7z
-Norm

This is something we try to avoid, but 82 games out of 458,000+ isn't too big an issue unless they are confined to a small group of engines.
CCRL 40/40 and 40/4 lists tidied up
Re: CCRL 40/40 and 40/4 lists tidied up
Number of games in CCRL-4040.pgn = 458023 (without a result = 0)
Number of players = 1324
Date range: 2005.12.19 - 2013.06.22
Elo range: 1790 - 3249
Number of: White Elos = 454269 Black Elos = 454297 Both = 451933
Average: White Elo = 2697.24 Black Elo = 2697.19
Elo distance: maximum = 859 average = 73.924
Number of White wins = 162656 ( 35.51 % )
Number of draws = 174916 ( 38.19 % )
Number of Black wins = 120451 ( 26.3 % )
White score = 54.61 %
Black score = 45.39 %
Number of ECOs = 458023 A: 96243 B: 124147 C: 76136 D: 87585 E: 73912
Number of PlyCounts = 458023 range: 28-702 average = 137.21
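As a quick sanity check on the figures above, the White and Black scores follow from the usual convention of one point per win and half a point per draw. The sketch below only re-derives the percentages from the win/draw/loss counts already listed; the variable names are mine.

```python
# Counts copied from the summary above.
white_wins = 162656
draws = 174916
black_wins = 120451

games = white_wins + draws + black_wins

# A win counts as 1 point, a draw as half a point.
white_score = (white_wins + 0.5 * draws) / games
black_score = (black_wins + 0.5 * draws) / games

print(f"games       = {games}")
print(f"White score = {white_score:.2%}")
print(f"Black score = {black_score:.2%}")
```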
It does not seem that the 40/40 database is overly flawed for the purpose of book making. It just requires some filtering, for which I recommend using Norm's PGN tools. For example, approximately 1/4 of the database consists of games where both engines are rated at least 2800 Elo and are within 100 Elo of each other.
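For anyone who wants to see what that kind of filter amounts to, here is a minimal Python sketch. It is not Norm's PGN tools, just an illustration using the standard library. It assumes each game carries `WhiteElo` and `BlackElo` tags when ratings are available, and that no movetext line looks like a tag pair. The function names and the defaults `min_elo=2800` and `max_diff=100` are my own choices, taken from the example in the paragraph above.

```python
import re

# A PGN tag pair on its own line, e.g. [WhiteElo "2950"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def split_games(pgn_text):
    """Yield (tags, raw_text) for each game in a PGN string."""
    tags, lines, in_moves = {}, [], False
    for line in pgn_text.splitlines():
        m = TAG_RE.match(line)
        if m and in_moves:
            # A tag line after movetext starts the next game.
            yield tags, "\n".join(lines).strip() + "\n"
            tags, lines, in_moves = {}, [], False
        if m:
            tags[m.group(1)] = m.group(2)
        elif line.strip():
            in_moves = True
        lines.append(line)
    if tags:
        yield tags, "\n".join(lines).strip() + "\n"


def keep_game(tags, min_elo=2800, max_diff=100):
    """True if both players are rated, strong enough, and close in Elo."""
    try:
        white = int(tags["WhiteElo"])
        black = int(tags["BlackElo"])
    except (KeyError, ValueError):
        return False  # missing or unusable Elo tag: skip the game
    return min(white, black) >= min_elo and abs(white - black) <= max_diff


def filter_pgn(pgn_text, min_elo=2800, max_diff=100):
    """Return a PGN string containing only the games that pass keep_game."""
    return "\n".join(
        raw for tags, raw in split_games(pgn_text)
        if keep_game(tags, min_elo, max_diff)
    )
```

Games without both Elo tags are dropped rather than guessed at, which matches the idea that a missing tag means the rating is not reliable enough to use.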
Re: CCRL 40/40 and 40/4 lists tidied up
Adam Hair wrote:
Number of games in CCRL-4040.pgn = 458023 (without a result = 0)
Number of players = 1324
Date range: 2005.12.19 - 2013.06.22
Elo range: 1790 - 3249
Number of: White Elos = 454269 Black Elos = 454297 Both = 451933
Average: White Elo = 2697.24 Black Elo = 2697.19
Elo distance: maximum = 859 average = 73.924
Number of White wins = 162656 ( 35.51 % )
Number of draws = 174916 ( 38.19 % )
Number of Black wins = 120451 ( 26.3 % )
White score = 54.61 %
Black score = 45.39 %
Number of ECOs = 458023 A: 96243 B: 124147 C: 76136 D: 87585 E: 73912
Number of PlyCounts = 458023 range: 28-702 average = 137.21
It does not seem that the 40/40 database is overly flawed for the purpose of book making. It just requires some filtering, for which I recommend using Norm's PGN tools. For example, approximately 1/4 of the database consists of games where both engines are rated at least 2800 Elo and are within 100 Elo of each other.

I have a 3rd issue against using the CCRL database for book building. For the most part, you are building a book based on moves from other books, and only slightly based on engine performance. Of course, the results will be original data.
Explanation: Each game in the CCRL uses a book for the first 8-12 moves. So if you make a book from the CCRL database for 8-12 moves, you are mostly getting the moves that the engines got from their books. If you make a book for 16 moves, then the later book moves will be based on the engines' own play.
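To make the distinction concrete, here is a rough Python sketch that separates the plies an engine probably chose itself from the plies that probably came from its opening book. It assumes a fixed book length per game, which is a simplification: the 8-12 move figure above varies from game to game, and the cutoff of 12 and target depth of 16 used as defaults are only taken from the numbers in this post. The function names are mine, and the tokenizer handles only simple, un-nested comments and variations.

```python
import re

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def san_moves(movetext):
    """Return one game's moves as a flat list of SAN plies."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop {comments}
    text = re.sub(r"\([^()]*\)", " ", text)     # drop simple (variations)
    text = re.sub(r"\$\d+", " ", text)          # drop numeric annotations
    moves = []
    for token in text.split():
        token = re.sub(r"^\d+\.+", "", token)   # strip "12." or "12..."
        if token and token not in RESULTS:
            moves.append(token)
    return moves


def engine_plies(movetext, book_moves=12, book_depth=16):
    """Plies after an assumed book cutoff, up to the target book depth.

    book_moves: full moves assumed to come from the engines' books.
    book_depth: full moves the new book is meant to cover.
    """
    plies = san_moves(movetext)
    return plies[2 * book_moves: 2 * book_depth]
```

With the defaults, only moves 13 to 16 of each game would count as engine-chosen, which shows how little of a 16-move book rests on the engines' own play under this assumption.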
Re: CCRL 40/40 and 40/4 lists tidied up
Norm Pollock wrote:
Graham and Kirill,
Kirill explained as follows:
Engines with too few games don't have an Elo tag in the pgn file. In 40/40 this probably means 200 games. This is our way of saying that those Elos are not reliable and should not be used for anything. For example, automatic opening book building may use ratings. You would not want to use any rating that is based on 15 games. The 200-game requirement is arbitrary, but at least it ensures some minimum quality of rating estimates. (Engines with fewer than 200 games also don't appear in the main list).
With regard to using the CCRL database for creating an opening book, which Kirill mentions, I want to point out two major issues with that project:
1. Many of the engines that are tested are weak, relatively speaking. This is a wonderful attribute for CCRL testing, but it is not so wonderful for creating an opening book. A quality database for an opening book should be based on the moves and results from the strongest players.
2. Many games in the CCRL database have a wide difference in Elo ratings between the players. This skews the results of the games to the stronger engine even if the weaker engine uses the better opening. Players in each game in the database should be relatively equal in strength, perhaps at most a 100 Elo difference. Otherwise the importance of the opening will be obscured.
The following link has 82 CCRL games where there is a 500+ difference in Elo ratings.
https://dl.dropboxusercontent.com/u/66249444/cf1.7z
-Norm

The issues you mention are actually the main reason why we provide Elo tags in the PGN, and why the tag is only added for engines with a sufficient number of games. The thing is, it's absolutely essential to filter against those two issues when building a book from any kind of game database. Such filtering requires accurate Elo tags, which we are happy to provide.
If we included Elo tags for all engines in the database, the filtering would be less accurate, as some ratings would be bogus (based on 10 games).
The reason why we don't filter it for you and only release the complete database (instead of a database of games between strong players or with small rating differences) is that people may have different ideas about the criteria for acceptable games.
So, what you mention is not "two major issues with that project", but two major issues with building a book from any unfiltered database.
Frankly, it's surprising that you, of all people, would miss the point entirely, but maybe it's me who is missing something here.
Best,
Kirill
Re: CCRL 40/40 and 40/4 lists tidied up
Norm Pollock wrote:
I have a 3rd issue against using the CCRL database for book building. It is that for the most part, you are building a book based on moves from other books, and only slightly based on engine performance. Of course, the results will be original data.
Explanation: Each game in the CCRL uses a book for the first 8-12 moves. So if you make a book from the CCRL database for 8-12 moves, you are mostly getting the moves that the engines got from their books. Now if you make a book for 16 moves, then the later book moves will be based on the engine's work.

I agree with this. I don't really see the point of making books from books unless, as you say, the new book goes beyond 12 moves.