10 million chess games

Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

10 million chess games

Post by Dann Corbit »

5243 Files, comprising 1,684,447,833 bytes after bzip2 compression:
http://cap.connx.com/a-openings/
http://cap.connx.com/b-openings/
http://cap.connx.com/c-openings/
http://cap.connx.com/d-openings/
http://cap.connx.com/e-openings/

Sure, it's junk. That's why we call it junkbase. The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying.

If you want real high quality game sets, buy a professional one. But if you are a starving college student and you want to examine a VOG chess game from 1989, then this is the collection for you.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 10 million chess games

Post by Dann Corbit »

You can also find SCID versions here:
http://cap.connx.com/scid/

They have been compressed with bzip2, so to decompress them you will need bzip2, 7-Zip, or some other file manager that knows how to deal with the .bz2 extension.
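
For anyone without a bzip2 tool at hand, here is a minimal sketch of decompressing one of these files with Python's standard `bz2` module. The function name `decompress_bz2` and the file names in the usage note are only illustrative assumptions, not part of the collection. It streams in chunks so a multi-gigabyte file never has to fit in memory.

```python
import bz2


def decompress_bz2(src: str, dst: str, chunk_size: int = 1 << 20) -> int:
    """Stream-decompress the .bz2 file `src` into `dst`.

    Returns the number of decompressed bytes written.
    """
    written = 0
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)  # read at most chunk_size decompressed bytes
            if not chunk:
                break
            fout.write(chunk)
            written += len(chunk)
    return written
```

Usage would look like `decompress_bz2("a.pgn.bz2", "a.pgn")`, where both file names are placeholders for whatever you downloaded.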

Caveat:
The full collection (jbase) is really too large for Scid to manage and so it is unreliable. The subsets (sorted by ECO) a,b,c,d and e are more trustable. If (for instance) you try to save the jbase collection as PGN, the GUI will churn for a long time, write out 6 GB of PGN and then crash.
LucenaTheLucid
Posts: 197
Joined: Mon Jul 13, 2009 2:16 am

Re: 10 million chess games

Post by LucenaTheLucid »

Holy...wow Dann that's crazy.

Where did you compile all the games from?
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 10 million chess games

Post by Dann Corbit »

LucenaTheLucid wrote:Holy...wow Dann that's crazy.

Where did you compile all the games from?
I have been collecting since the late 80's.
My first location was the famous University of Pittsburgh site (not sure if it is even still open). I get games from TWIC and from computer contests and correspondence sites and especially from the giant jumble of links you can find here:
http://www.chessgameslinks.lars-balzer.info/

P.S.
If you collect the FICS rated games in the scid folder, there are well over 100 million games (not exactly Kasparov versus Anand stuff, but it may be useful to the criminally insane like myself and a few others).
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 10 million chess games

Post by Dann Corbit »

P.P.S.
I can't take credit for the FICS games. They are collected by Marcel van Kervinck from FICS using a rated games collection script. My only addition here is another hosting site and also making them available in SCID format.
Roger Brown
Posts: 782
Joined: Wed Mar 08, 2006 9:22 pm

Re: 10 million chess games

Post by Roger Brown »

Dann Corbit wrote: Caveat:
The full collection (jbase) is really too large for Scid to manage and so it is unreliable. The subsets (sorted by ECO) a,b,c,d and e are more trustable. If (for instance) you try to save the jbase collection as PGN, the GUI will churn for a long time, write out 6 GB of PGN and then crash.


Hello Dann,

Can you do SCID searches (by player etc.) on the full junkbase (jbase) without any issues or is any type of operation out of the question? I would not want to have to search A database then B down to E to get all the games of x player....

I certainly would not be converting the compressed Scid format to pgn on my machine!

Do you have any modified Scid versions that you tweaked to comfortably handle such massive databases?

Just asking, as I know that you are a data freak.

I meant that last in a good, awestruck kind of way!

:-)

Later.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: 10 million chess games

Post by Edmund »

Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The real question is what the data should be used for.

If you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. E.g., I find a couple of games of the following type:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like general instructions for opening books to me and has no value in a games database.

However, if you want to use the database as a foundation for answering questions like "what have players played in this position before?", I would rather suggest transferring the database into another type of structure, either tree-based or position-based. The first is probably the most compact way of storing the database (and without any loss of data), while the position-based version catches transpositions and can also find positions similar to the current one, but in exchange it requires more space and loses some information about the games.
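
To make the tree-based idea concrete, here is a minimal sketch of a move tree (a trie keyed by SAN moves) that counts games and results at every node. The names `Node`, `add_game` and `continuations` are made up for illustration. As noted above, a tree like this does not catch transpositions; a position-based index would need a key derived from the board itself (a FEN string or hash), which requires a move generator and is left out here.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    count: int = 0  # number of games passing through this node
    results: Dict[str, int] = field(default_factory=lambda: defaultdict(int))
    children: Dict[str, "Node"] = field(default_factory=dict)


def add_game(root: Node, moves: List[str], result: str) -> None:
    """Insert one game, given as a list of SAN moves plus its result."""
    node = root
    node.count += 1
    node.results[result] += 1
    for san in moves:
        node = node.children.setdefault(san, Node())
        node.count += 1
        node.results[result] += 1


def continuations(root: Node, prefix: List[str]) -> Dict[str, int]:
    """Return {move: game count} for moves played after the given move sequence."""
    node = root
    for san in prefix:
        child = node.children.get(san)
        if child is None:
            return {}  # this line never occurred in the collection
        node = child
    return {san: child.count for san, child in node.children.items()}
```

For example, after adding a few games, `continuations(root, ["e4"])` lists every reply to 1. e4 seen in the collection together with how often it was played. Games sharing an opening share nodes, which is where the compactness comes from.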
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 10 million chess games

Post by Dann Corbit »

Roger Brown wrote:
Dann Corbit wrote: Caveat:
The full collection (jbase) is really too large for Scid to manage and so it is unreliable. The subsets (sorted by ECO) a,b,c,d and e are more trustable. If (for instance) you try to save the jbase collection as PGN, the GUI will churn for a long time, write out 6 GB of PGN and then crash.


Hello Dann,

Can you do SCID searches (by player etc.) on the full junkbase (jbase) without any issues or is any type of operation out of the question? I would not want to have to search A database then B down to E to get all the games of x player....

I certainly would not be converting the compressed Scid format to pgn on my machine!

Do you have any modified Scid versions that you tweaked to comfortably handle such massive databases?

Just asking, as I know that you are a data freak.

I meant that last in a good, awestruck kind of way!

:-)

Later.
I would have to rebuild Scid. I already made a special build to tolerate 4 million games. Then I made a special build to tolerate 10 million games. But there are more than 10 million so the operations are not reliable.

Another problem is that there is no reliable 64-bit Tcl/Tk for Windows, so a SCID version that can handle large objects cannot be built.

I am trying to figure out how to handle this issue now. I would like to be able to have all 110M games online at all times. Of course, with current chess data systems you cannot do that.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 10 million chess games

Post by Dann Corbit »

Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The real question is what the data should be used for.

If you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. E.g., I find a couple of games of the following type:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like general instructions for opening books to me and has no value in a games database.

However, if you want to use the database as a foundation for answering questions like "what have players played in this position before?", I would rather suggest transferring the database into another type of structure, either tree-based or position-based. The first is probably the most compact way of storing the database (and without any loss of data), while the position-based version catches transpositions and can also find positions similar to the current one, but in exchange it requires more space and loses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Christopher Conkie
Posts: 6073
Joined: Sat Apr 01, 2006 9:34 pm
Location: Scotland

Re: 10 million chess games

Post by Christopher Conkie »

Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The real question is what the data should be used for.

If you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. E.g., I find a couple of games of the following type:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like general instructions for opening books to me and has no value in a games database.

However, if you want to use the database as a foundation for answering questions like "what have players played in this position before?", I would rather suggest transferring the database into another type of structure, either tree-based or position-based. The first is probably the most compact way of storing the database (and without any loss of data), while the position-based version catches transpositions and can also find positions similar to the current one, but in exchange it requires more space and loses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?