ChessBase database annoyance: extracting an opening book

KLc
Posts: 140
Joined: Wed Jun 03, 2020 6:46 am
Full name: Kurt Lanc

ChessBase database annoyance: extracting an opening book

Post by KLc »

I have been quite annoyed by the data quality in the ChessBase Big/Mega Database 2021 and have compiled a long list of garbage. Recently, I tried to extract a Polyglot opening book from the database, and that wasn't easy at all. I thought I'd post my experience here in case someone else runs into the same difficulties.

My goal was to create an opening book from games played not by grandmasters but by "amateurs" (<2000 Elo). Whether this is useful or not isn't the point; what matters is what happens when you try to do it. My filter criteria were:

1. Elo 1200-1999 for both players
2. Slow games (i.e. not blitz and rapid)
3. Games with result only (1:0, 1/2:1/2, 0:1)
4. At least 7 moves (to filter out fragments)
5. Start position as move 1 (to filter out fragments whose first move is actually move 35 or something in a game)

Points 4 and especially 5 took me some time to figure out; they are necessary because there's a lot of garbage in the database.
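The Elo criterion can be double-checked on the exported PGN. What follows is only a minimal sketch: `count_elo_outliers` is a hypothetical helper name, and it assumes that every game starts with an `[Event` tag and that the export stores ratings in the standard `WhiteElo` and `BlackElo` tags.

```shell
# Sketch of a post-export sanity check (hypothetical helper, not part of any tool).
# Assumes each game begins with an [Event tag line and ratings live in the
# WhiteElo/BlackElo tags.
# Prints the number of games where either rating is missing or outside [LO, HI].
count_elo_outliers() {   # usage: count_elo_outliers FILE LO HI
    awk -v lo="$2" -v hi="$3" '
        # True if the rating is absent or outside the requested range.
        function out_of_range(v) {
            return v == "" || v + 0 < lo + 0 || v + 0 > hi + 0
        }
        # Evaluate the game collected so far (if any).
        function check() {
            if (seen && (out_of_range(w) || out_of_range(b))) {
                bad++
            }
        }
        /^\[Event /     { check(); seen = 1; w = ""; b = "" }
        /^\[WhiteElo "/ { split($0, p, "\""); w = p[2] }
        /^\[BlackElo "/ { split($0, p, "\""); b = p[2] }
        END             { check(); print bad + 0 }
    ' "$1"
}

# Usage:
#   count_elo_outliers mega-2021-amateurs.pgn 1200 1999
```

A result of 0 would mean that every exported game carries two ratings inside the requested range; anything else points at games the database filter let through.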

This leaves 1,126,331 games with an average Elo of 1713. I've exported them to PGN and got a 1.11GB file. Of course, polyglot won't take it:

Code:

./polyglot make-book -pgn ~/mega-2021-amateurs.pgn -bin ~/mega-2021-amateurs.bin -max-ply 24
PolyGlot 1.4.70b by Fabien Letouzey.
inserting games ...
tellusererror POLYGLOT: lexical error at line 1, column 0, game 1

What happened? A hexdump revealed:

Code:

hexdump -C -n 50 mega-2021-amateurs.pgn
00000000  ef bb bf 5b 45 76 65 6e  74 20 22 42 72 69 74 69  |...[Event "Briti|
00000010  73 68 20 55 6e 69 76 65  72 73 69 74 69 65 73 2d  |sh Universities-|
00000020  63 68 54 20 50 72 65 6c  69 6d 20 42 22 5d 0d 0a  |chT Prelim B"]..|
00000030  5b 53                                             |[S|
00000032

There are 3 bytes of garbage at the beginning: the UTF-8 byte order mark (BOM, bytes EF BB BF), which marks the file as Unicode (some Windows garbage; I'm working under Linux/macOS). But how do I remove the first 3 bytes of a >1GB file? I tried dd:

Code:

dd bs=3 skip=1 if=mega-2021-amateurs.pgn of=mega-2021-amateurs-nobom.pgn

But this had written only a few MB after several minutes: dd works in 3-byte blocks here, and that of course takes a while. Argh. I thought about it for quite a while and found a solution using tail (the -c option counts bytes, so +4 starts the output at the fourth byte):

Code:

tail -c +4 mega-2021-amateurs.pgn > mega-2021-amateurs-nobom.pgn
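
The same idea can be wrapped so that the three bytes are removed only when they really are a BOM. This is a minimal sketch, not a definitive script: the function names are hypothetical, and it assumes a `head` that supports `-c` (GNU and BSD/macOS versions do). The last function shows one way dd could be used without the 3-byte-block slowdown: it consumes the BOM in a single read and leaves the rest of the copy to cat.

```shell
# Sketch: detect and strip a UTF-8 byte order mark (bytes EF BB BF).
# All function names are hypothetical helpers.

# Succeeds (exit status 0) if the file starts with the three BOM bytes.
has_utf8_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Write the file to stdout, without the BOM if there is one.
strip_utf8_bom() {
    if has_utf8_bom "$1"; then
        tail -c +4 "$1"      # start output at byte 4, i.e. skip 3 bytes
    else
        cat "$1"
    fi
}

# dd variant: read the first 3 bytes once and discard them, then let cat
# copy the remainder of the same input with its own, larger buffer.
# Skips 3 bytes unconditionally, so check with has_utf8_bom first.
skip3_with_dd() {
    { dd bs=3 count=1 of=/dev/null 2>/dev/null; cat; } < "$1"
}

# Usage:
#   strip_utf8_bom mega-2021-amateurs.pgn > mega-2021-amateurs-nobom.pgn
```

Checking first makes the step safe to run on files that were exported without a BOM, where blindly cutting three bytes would eat the start of the first `[Event` tag.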

The tail approach works quickly. Can I now get my Polyglot book, please? No, of course not:

Code:

./polyglot make-book -pgn ~/mega-2021-amateurs-nobom.pgn -bin ~/amateurs-2021.bin -max-ply 24
PolyGlot 1.4.70b by Fabien Letouzey.
inserting games ...
allocating 1.25MB ...
allocating 2.5MB ...
allocating 5MB ...
allocating 10MB ...
10000 games ...
allocating 20MB ...
20000 games ...
30000 games ...
allocating 40MB ...
40000 games ...
tellusererror POLYGLOT: book_insert(): illegal move "Z0" at line 1237737, column 47,game 41420

What's that?? Indeed, after searching I found games containing the move Z0! What on earth is this? Googling revealed that this is supposed to mean a "null move". Reading the comments in the respective games, it turned out that these were moves that could no longer be deciphered from the score sheet! (By coincidence, almost all these games were from senior tournaments...) How do I remove these corrupted games? Luckily, I found a tool called pgn-extract (https://www.cs.kent.ac.uk/people/staff/djb/pgn-extract/):

Code:

./pgn-extract --fixresulttags --nobadresults --novars --output ~/mega-2021-amateurs-fixed.pgn -llog.txt -s ~/mega-2021-amateurs-nobom.pgn
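
To see how many games are affected, or to drop them without a full PGN parser, a plain awk filter might be enough. This is a rough sketch and not what was used above: `drop_null_move_games` is a hypothetical helper, it assumes every game starts with an `[Event` tag line, and it discards any game with the token Z0 anywhere outside the tag lines (main line, variations and comments alike), which is stricter than removing variations only.

```shell
# Sketch: copy a PGN file to stdout, leaving out every game whose movetext
# contains the token Z0. Hypothetical helper; assumes each game begins with
# a line starting with "[Event ". The number of dropped games goes to stderr.
drop_null_move_games() {   # usage: drop_null_move_games FILE > CLEAN_FILE
    awk '
        # Print the buffered game unless it was flagged, then reset the buffer.
        function emit() {
            if (n > 0) {
                if (bad) {
                    dropped++
                } else {
                    printf "%s", game
                }
            }
            game = ""; n = 0; bad = 0
        }
        /^\[Event / { emit() }                 # a new game starts here
        { game = game $0 "\n"; n++ }           # buffer every line of the game
        # Flag Z0 as a standalone token on any non-tag line.
        !/^\[/ && /(^|[^A-Za-z0-9])Z0([^A-Za-z0-9]|$)/ { bad = 1 }
        END {
            emit()
            printf "%d game(s) dropped\n", dropped > "/dev/stderr"
        }
    ' "$1"
}

# Usage:
#   drop_null_move_games mega-2021-amateurs-nobom.pgn > mega-2021-amateurs-noz0.pgn
```

Unlike pgn-extract, this does not validate moves or fix result tags, so it is at best a pre-filter or a quick way to count the damage.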

In the pgn-extract call, I had to add the --novars argument because there are also games with variations containing the bloody Z0, which causes polyglot to crash. And now, finally, it worked:

Code:

polyglot info-book -bin mega-2021-amateurs.bin
PolyGlot 1.4.70b by Fabien Letouzey.
Lines for white                :   246580
Lines for black                :   249507
Positions on lines for white   :   199806
Positions on lines for black   :   199305
Isolated positions             :    12680
Bloody hell!
Ozymandias
Posts: 1529
Joined: Sun Oct 25, 2009 2:30 am

Re: ChessBase database annoyance: extracting an opening book

Post by Ozymandias »

Sorry if it sounds like I'm making fun of your predicament, but this brings back memories of the 2007-2009 period, when I started my DB and met all of those problems. It's somewhat entertaining to see things haven't changed that much in nearly 15 years.

In general, CB was by no means the worst of the lot, but it had all the problems you mention, plus some you don't.
KLc
Posts: 140
Joined: Wed Jun 03, 2020 6:46 am
Full name: Kurt Lanc

Re: ChessBase database annoyance: extracting an opening book

Post by KLc »

Ozymandias wrote: Sat Feb 20, 2021 9:39 am Sorry if it sounds like I'm making fun of your predicament, but this brings back memories of the 2007-2009 period, when I started my DB and met all of those problems. It's somewhat entertaining to see things haven't changed that much in nearly 15 years.

In general, CB was by no means the worst of the lot, but it had all the problems you mention, plus some you don't.
I’m not surprised. I reported some of my problems back to CB but doubt that they’ll fix anything. The most important thing seems to be increasing the number of games (or should I say fragments?).

Edit: but yes, the MegaDatabase is still the best. At least they’ve put effort into getting player names mostly correct.
Ozymandias
Posts: 1529
Joined: Sun Oct 25, 2009 2:30 am

Re: ChessBase database annoyance: extracting an opening book

Post by Ozymandias »

"Mostly" being the operative word. I've seen games by people dated before their birth or after their death, which, barring a miracle, means two or more players were consolidated under the same name. Another thing I noticed back in the day was the simple propagation of Elo ratings (among top players) back to a time when they didn't have any yet. Does it still happen?

It's also impossible to fully get rid of rapid, blitz or otherwise undesirable game formats. And in recent years, they've (TWIC) also included some select computer games, TCEC for example.

Maintaining a decent DB is an endless job.
Cornfed
Posts: 511
Joined: Sun Apr 26, 2020 11:40 pm
Full name: Brian D. Smith

Re: ChessBase database annoyance: extracting an opening book

Post by Cornfed »

Mind you, I have never tried to do what you are doing for the reasons you are doing it (Polyglot opening book). But... I have created what I call a 'Quality Base', culled from Megabase, with similar criteria, the biggest difference being that the rating of each player has to be 2200+.

It took a little while... but I met with success. The only real issues I had were when I tried to 'increase the quality' by including A) annotated games from other sources (PGN files of books or Informants, for example) and B) quality 'old games' where the players had no ratings. To take care of that, I created 'test bases' so I did not pollute the 'master base', simply sorted by White and then Black... and manually deleted dupes... merged, and I have what I want... but that last part was a pain.
Ozymandias
Posts: 1529
Joined: Sun Oct 25, 2009 2:30 am

Re: ChessBase database annoyance: extracting an opening book

Post by Ozymandias »

Manual labor is always a pain when it comes to cleaning a DB, but some tasks are best done this way. For example, consolidating names is something you don't want to try doing automatically.
KLc
Posts: 140
Joined: Wed Jun 03, 2020 6:46 am
Full name: Kurt Lanc

Re: ChessBase database annoyance: extracting an opening book

Post by KLc »

Ozymandias wrote: Sat Feb 20, 2021 3:07 pm "Mostly" being the operative word. I've seen games by people dated before their birth or after their death, which, barring a miracle, means two or more players were consolidated under the same name. Another thing I noticed back in the day was the simple propagation of Elo ratings (among top players) back to a time when they didn't have any yet. Does it still happen?
This I don't know. What I did notice about ratings though is that the rating type (classical, blitz, etc.) is not always set correctly. In the current MegaDB 2021 Nakamura has an incredible normal time control (!) rating of 3268! Also, there are ratings before 1970.
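
One way to gauge how common such early ratings are in an exported PGN is to count the games dated before a cutoff year that still carry a rating tag. This is only a sketch under assumptions: `count_early_rated_games` is a hypothetical helper, every game is assumed to start with an `[Event` tag, and dates are assumed to be in the usual `YYYY.MM.DD` form of the `Date` tag (games with an unknown year are ignored).

```shell
# Sketch: count games dated before CUTOFF_YEAR that have a non-zero
# WhiteElo or BlackElo tag. Hypothetical helper; assumes each game begins
# with an [Event tag line and Date tags look like "YYYY.MM.DD".
count_early_rated_games() {   # usage: count_early_rated_games FILE CUTOFF_YEAR
    awk -v cutoff="$2" '
        # Evaluate the game collected so far (if any).
        function check() {
            if (seen && rated && year ~ /^[0-9][0-9][0-9][0-9]$/ && year + 0 < cutoff + 0) {
                n++
            }
        }
        /^\[Event /             { check(); seen = 1; year = ""; rated = 0 }
        /^\[Date "/             { split($0, p, "\""); year = substr(p[2], 1, 4) }
        /^\[(White|Black)Elo "/ { split($0, p, "\""); if (p[2] + 0 > 0) rated = 1 }
        END                     { check(); print n + 0 }
    ' "$1"
}

# Usage (1970 is the cutoff mentioned above):
#   count_early_rated_games mega-2021-amateurs.pgn 1970
```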
Ozymandias
Posts: 1529
Joined: Sun Oct 25, 2009 2:30 am

Re: ChessBase database annoyance: extracting an opening book

Post by Ozymandias »

There are, but they're scarce and may not even be Elo ratings. It's better to recalculate the whole database for that time period, which is one of the first things I did.
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: ChessBase database annoyance: extracting an opening book

Post by Jonathan003 »

I first clean the PGN before making bin books by importing and exporting it in SCID. It's much faster than cleaning the games with pgn-extract and the result is good.
KLc
Posts: 140
Joined: Wed Jun 03, 2020 6:46 am
Full name: Kurt Lanc

Re: ChessBase database annoyance: extracting an opening book

Post by KLc »

Ozymandias wrote: Mon Mar 01, 2021 10:23 pm There are, but they're scarce and may not even be Elo ratings. It's better to recalculate the whole database for that time period, which is one of the first things I did.
How do you do the recalculation? Is there a function in CB or are you using another tool?