5243 Files, comprising 1,684,447,833 bytes after bzip2 compression:
http://cap.connx.com/a-openings/
http://cap.connx.com/b-openings/
http://cap.connx.com/c-openings/
http://cap.connx.com/d-openings/
http://cap.connx.com/e-openings/
Sure, it's junk. That's why we call it junkbase. The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying.
If you want really high-quality game sets, buy a professional one. But if you are a starving college student and you want to examine a VOG chess game from 1989, then this is the collection for you.
10 million chess games
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: 10 million chess games
You can also find SCID versions here:
http://cap.connx.com/scid/
They have been compressed with bzip2, so to decompress them you will need bzip2, 7-Zip, or some other archive tool that knows how to deal with the .bz2 extension.
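If no bzip2 binary is at hand, the same thing can be done from a script. This is only a minimal sketch using Python's standard-library `bz2` module; the file name in it is a hypothetical placeholder, not an actual name from the download folders.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(src: Path, dst: Path, chunk_size: int = 1 << 20) -> int:
    """Stream-decompress src (.bz2) into dst and return the bytes written.

    Copying in chunks keeps memory use flat even for multi-gigabyte files.
    """
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst.stat().st_size


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    archive = Path("some-file.bz2")
    if archive.exists():
        size = decompress_bz2(archive, archive.with_suffix(""))
        print(f"wrote {size} bytes")
```

Streaming matters here because the decompressed files are far larger than the downloads.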
Caveat:
The full collection (jbase) is really too large for Scid to manage, so it is unreliable. The subsets (sorted by ECO) a, b, c, d and e are more trustworthy. If (for instance) you try to save the jbase collection as PGN, the GUI will churn for a long time, write out 6 GB of PGN and then crash.
-
- Posts: 197
- Joined: Mon Jul 13, 2009 2:16 am
Re: 10 million chess games
Holy...wow Dann that's crazy.
Where did you compile all the games from?
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: 10 million chess games
LucenaTheLucid wrote: Holy...wow Dann that's crazy. Where did you compile all the games from?
I have been collecting since the late 80's.
My first source was the famous University of Pittsburgh site (not sure if it is even still open). I get games from TWIC, from computer contests and correspondence sites, and especially from the giant jumble of links you can find here:
http://www.chessgameslinks.lars-balzer.info/
P.S.
If you include the FICS rated games in the scid folder, there are well over 100 million games (not exactly Kasparov versus Anand stuff, but it may be useful to the criminally insane like myself and a few others).
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: 10 million chess games
P.P.S.
I can't take credit for the FICS games. They are collected by Marcel van Kervinck from FICS using a rated games collection script. My only addition here is another hosting site and also making them available in SCID format.
-
- Posts: 782
- Joined: Wed Mar 08, 2006 9:22 pm
Re: 10 million chess games
Dann Corbit wrote: Caveat:
The full collection (jbase) is really too large for Scid to manage, so it is unreliable. The subsets (sorted by ECO) a, b, c, d and e are more trustworthy. If (for instance) you try to save the jbase collection as PGN, the GUI will churn for a long time, write out 6 GB of PGN and then crash.
Hello Dann,
Can you do SCID searches (by player etc.) on the full junkbase (jbase) without any issues, or is any type of operation out of the question? I would not want to have to search database A, then B, and so on down to E to get all the games of player x.
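One workaround for the A-to-E chore would be a script that walks all the subsets in a single pass. The following is only a sketch and assumes the subsets have already been exported to PGN files (the thread offers them in Scid format, so that export step is an assumption); the function names are made up for illustration. It reads only the tag pairs and treats a tag line that follows movetext as the start of a new game.

```python
import re
from typing import Dict, Iterable, Iterator, Tuple

# A PGN tag pair occupies a whole line, e.g. [White "Tal, Mikhail"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def iter_headers(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict of tag pairs per game from PGN text lines."""
    tags: Dict[str, str] = {}
    in_tags = False
    for line in lines:
        m = TAG_RE.match(line)
        if m:
            if not in_tags and tags:
                # A tag line after movetext starts the next game.
                yield tags
                tags = {}
            in_tags = True
            tags[m.group(1)] = m.group(2)
        elif line.strip():
            in_tags = False  # a non-blank, non-tag line is movetext
    if tags:
        yield tags


def games_by_player(paths: Iterable[str], player: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (path, tags) for every game where player is White or Black."""
    needle = player.lower()
    for path in paths:
        with open(path, encoding="latin-1", errors="replace") as fh:
            for tags in iter_headers(fh):
                white = tags.get("White", "").lower()
                black = tags.get("Black", "").lower()
                if needle in white or needle in black:
                    yield path, tags
```

Because the files are streamed line by line, memory use does not depend on how many games a subset holds. The substring match is deliberately loose, since player names in a collection like this are unlikely to be spelled consistently.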
I certainly would not be converting the compressed Scid format to pgn on my machine!
Do you have any modified Scid versions that you tweaked to comfortably handle such massive databases?
Just asking, as I know that you are a data freak.
I meant that last in a good, awestruck kind of way!
Later.
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: 10 million chess games
Dann Corbit wrote: ... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying. ...
The question really is what the data should be used for.
If you want to query games of a certain player or of a certain tournament, then the Scid format is great. But in that case the database (jbase) could be cleaned up a lot. E.g., I find a couple of games of the following type:
Code: Select all
[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]
1. Na3 g5 2. Nc4 0-1
This looks more like general instructions for opening books to me and has no value in a games database.
However, if you want to use the database as a foundation for answering questions like "what have players played in this position before?", I would rather suggest transferring the database into another type of structure, either tree based or position based. The first is probably the most compact way of storing the database (and without any loss of data), while the position-based version catches transpositions and is also able to find positions similar to the current one, but in exchange requires more space and loses some information about the games.
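To make the tree-based idea concrete, here is a minimal sketch of a move tree (a trie keyed on SAN move strings). It is an illustration, not a storage format anyone in this thread uses: the class and method names are invented, and it keeps only game counts per node, so a lossless version would also have to attach the game headers somewhere.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    count: int = 0  # number of games that pass through this node
    children: Dict[str, "Node"] = field(default_factory=dict)


class MoveTree:
    """Games sharing an opening share a path, so common prefixes are stored once."""

    def __init__(self) -> None:
        self.root = Node()

    def add_game(self, moves: List[str]) -> None:
        node = self.root
        node.count += 1
        for mv in moves:
            node = node.children.setdefault(mv, Node())
            node.count += 1

    def continuations(self, prefix: List[str]) -> Dict[str, int]:
        """Return {next move: game count} after the given move sequence."""
        node = self.root
        for mv in prefix:
            nxt = node.children.get(mv)
            if nxt is None:
                return {}
            node = nxt
        return {mv: child.count for mv, child in node.children.items()}


if __name__ == "__main__":
    tree = MoveTree()
    tree.add_game(["e4", "e5", "Nf3"])
    tree.add_game(["e4", "c5"])
    tree.add_game(["d4", "d5", "Nf3"])
    tree.add_game(["Nf3", "d5", "d4"])  # same position as the game above
    print(tree.continuations(["e4"]))
```

The last two sample games reach the same position by different move orders yet live in different branches. That is the transposition problem a position-based index (keyed on a hash of the board) would solve, at the cost of needing a move generator to produce the positions.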
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: 10 million chess games
Roger Brown wrote: Can you do SCID searches (by player etc.) on the full junkbase (jbase) without any issues or is any type of operation out of the question? ... Do you have any modified Scid versions that you tweaked to comfortably handle such massive databases?
I would have to rebuild Scid. I already made a special build to tolerate 4 million games. Then I made a special build to tolerate 10 million games. But there are more than 10 million games, so the operations are not reliable.
Another problem is that there is no reliable 64-bit Tcl/Tk for Windows, so a Scid version that will handle large objects cannot be built.
I am trying to figure out how to handle this issue now. I would like to be able to have all 110M games online at all times. Of course, with current chess data systems you cannot do that.
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: 10 million chess games
Edmund wrote: ... But for this case the database (jbase) could be cleaned out a lot. ... This looks more like some general instructions for opening books to me and have no value in a games database. ...
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
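For anyone who does want to attempt the cleanup, a first pass could flag entries like the anonymous three-ply "Durkin" game quoted in this thread. The sketch below is one possible heuristic, not a tested cleaning tool: the function names and the ten-ply threshold are assumptions chosen for illustration, and the ply counter expects plain movetext without comments or variations.

```python
import re
from typing import Dict

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}
MOVE_NUMBER_RE = re.compile(r"^\d+\.+$")  # matches "1." and "1..."


def count_plies(movetext: str) -> int:
    """Count half-moves, skipping move numbers and the result token."""
    return sum(
        1
        for tok in movetext.split()
        if tok not in RESULTS and not MOVE_NUMBER_RE.match(tok)
    )


def looks_like_placeholder(tags: Dict[str, str], movetext: str, min_plies: int = 10) -> bool:
    """Flag games with no player names and very few moves.

    min_plies is an arbitrary illustrative cutoff, not a recommended value.
    """
    unknown = {"", "?"}
    anonymous = (
        tags.get("White", "?") in unknown and tags.get("Black", "?") in unknown
    )
    return anonymous and count_plies(movetext) < min_plies


if __name__ == "__main__":
    sample_tags = {"White": "?", "Black": "?", "ECO": "A00h"}
    print(looks_like_placeholder(sample_tags, "1. Na3 g5 2. Nc4 0-1"))
```

Requiring both conditions keeps short games between named players, since quick losses and miniatures are legitimate games.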
-
- Posts: 6073
- Joined: Sat Apr 01, 2006 9:34 pm
- Location: Scotland
Re: 10 million chess games
Dann Corbit wrote: With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
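As a rough idea of what the SQL route could look like, here is a small sketch using SQLite, picked only because it ships with Python. It is not Jose's schema; the table layout, column names and function names are all assumptions made for illustration, and nothing here has been tried at the 10-million-game scale.

```python
import sqlite3
from typing import List, Tuple


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) games table plus indexes on player names."""
    con = sqlite3.connect(path)
    con.execute(
        """CREATE TABLE IF NOT EXISTS games (
               id INTEGER PRIMARY KEY,
               white TEXT NOT NULL,
               black TEXT NOT NULL,
               result TEXT,
               eco TEXT,
               movetext TEXT
           )"""
    )
    con.execute("CREATE INDEX IF NOT EXISTS idx_games_white ON games(white)")
    con.execute("CREATE INDEX IF NOT EXISTS idx_games_black ON games(black)")
    return con


def add_game(con: sqlite3.Connection, white: str, black: str,
             result: str, eco: str, movetext: str) -> None:
    con.execute(
        "INSERT INTO games (white, black, result, eco, movetext) "
        "VALUES (?, ?, ?, ?, ?)",
        (white, black, result, eco, movetext),
    )


def games_of(con: sqlite3.Connection, player: str) -> List[Tuple]:
    """All games of one player, whichever colour, in insertion order."""
    cur = con.execute(
        "SELECT id, white, black, result FROM games "
        "WHERE white = ? OR black = ? ORDER BY id",
        (player, player),
    )
    return cur.fetchall()
```

With the two indexes, a by-player lookup is a single query over the whole collection rather than five separate searches, which is the use case raised earlier in the thread. Position searches would need a different table design.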