10 million chess games

Dann Corbit · Post by **Dann Corbit** » Sat Jun 26, 2010 1:30 pm

Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is what the data should be used for.

If you want to query games of a certain player or of a certain tournament, then the Scid format is great. But for this case the database (jbase) could be cleaned out a lot. E.g., I find a couple of games of the following type:
[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like general instructions for an opening book to me, and it has no value in a games database.

However, if you want to use the database as a foundation for answering questions like "what have players played in this position before?", I would rather suggest transferring the database into another type of structure, either tree-based or position-based. The first is probably the most compact way of storing the database (and without any loss of data), while the position-based version catches transpositions and is also able to find positions similar to the current one, but in exchange it requires more space and loses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?

Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
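For what it's worth, the tree-based structure Edmund describes can be sketched in a few lines. This is only a rough illustration, not how Scid or any other tool stores games: the `Node`, `add_game`, and `continuations` names are made up here, moves are plain SAN strings, and the tree keeps counts only (a lossless version would also have to hang the game headers off the nodes).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node per distinct move order (not per distinct position)."""
    games: int = 0                                 # games passing through here
    results: dict = field(default_factory=dict)    # result string -> count
    children: dict = field(default_factory=dict)   # SAN move -> Node


def add_game(root, moves, result):
    """Walk the tree along `moves`, creating nodes as needed."""
    node = root
    node.games += 1
    for san in moves:
        node = node.children.setdefault(san, Node())
        node.games += 1
        node.results[result] = node.results.get(result, 0) + 1


def continuations(root, moves):
    """Return {next_move: games} for the line `moves`, or {} if unseen."""
    node = root
    for san in moves:
        node = node.children.get(san)
        if node is None:
            return {}
    return {san: child.games for san, child in node.children.items()}


root = Node()
add_game(root, ["e4", "e5", "Nf3", "Nc6"], "1-0")
add_game(root, ["e4", "e5", "Nf3", "Nf6"], "1/2-1/2")
add_game(root, ["e4", "c5"], "0-1")
add_game(root, ["Nf3", "e5", "e4"], "1-0")   # transposes into 1. e4 e5 2. Nf3

print(continuations(root, ["e4"]))
print(continuations(root, ["e4", "e5"]))
```

Note that the last game reaches the same position as 1. e4 e5 2. Nf3 but lands in a different branch, which is exactly the transposition problem a position-based structure avoids.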

Christopher Conkie · Post by **Christopher Conkie** » Sat Jun 26, 2010 1:34 pm

I was thinking more in terms of using something robust and capable of handling large data, like SQL Server Express, rather than MySQL as in Jose. Maybe make a package to import, something like that.

Dann Corbit · Post by **Dann Corbit** » Sat Jun 26, 2010 1:37 pm

MonetDB might be worth a go. It holds data in RAM and so it is very fast. Of course, you would need to add proper indexes, etc.

Robert Flesher · Post by **Robert Flesher** » Sat Jun 26, 2010 7:24 pm

Dann Corbit wrote:
LucenaTheLucid wrote:Holy...wow Dann that's crazy.

Where did you compile all the games from?
I have been collecting since the late 80's.
My first location was the famous University of Pittsburgh site (not sure if it is even still open). I get games from TWIC and from computer contests and correspondence sites and especially from the giant jumble of links you can find here:
http://www.chessgameslinks.lars-balzer.info/

P.S.
If you collect the FICS rated games in the Scid folder, there are well over 100 million games (not exactly Kasparov versus Anand stuff, but it may be useful to the criminally insane like myself and a few others).

Dann, you sure brought back old memories for me with the mention of the University of Pittsburgh site. I had totally forgotten about that place. Too bad places like that are gone. Another great site from some time ago was Gambitsoft. It had everything: sales, engines, reviews, links, tutorials, etc. Ahhh, the old days...

Christopher Conkie · Post by **Christopher Conkie** » Sat Jun 26, 2010 7:38 pm

http://www.pitt.edu/~schach/Archives/index2.html

shiv · Post by **shiv** » Sat Jun 26, 2010 9:31 pm

I wanted to add that MySQL actually performs quite well. I have seen it support a billion records with the right indexes. Yahoo, Google, and others use MySQL extensively for large datasets.

When looking at Jose's code, the issue was not MySQL but the general state of the code. I was able to enable position indexes (there is a comment in the source code on how to enable them). However, the buggy nature of the program led to general issues for me, though the DB side of things did seem to work. I must say Peter Schaffer did a fantastic job with the program. It was just unfortunate that people could not spend more time fixing and testing issues with the program and the game loading/editing layer.

If I were to design a DB-based layer for chess from scratch, I might be tempted to use a key-value database such as Apache Cassandra. I used Cassandra at a previous workplace to manage about 750 million records. However, with Cassandra you have to emulate indexes using custom keys. It may not be a good choice if you want to do complex positional queries, such as finding all games where two bishops with a set of doubled pawns played against a bishop and a knight.
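To make "emulate indexes using custom keys" a bit more concrete, here is a rough sketch of the idea using a toy sorted key-value store in plain Python. It is a stand-in only and does not use the Cassandra API; the `SortedKV` class, the `index_game` helper, the key layout, and the position labels such as `after-d4` are all made up for illustration (a real system would use a FEN string or a position hash).

```python
import bisect


class SortedKV:
    """Toy ordered key-value store; a stand-in, not the Cassandra API."""

    def __init__(self):
        self._keys = []    # kept sorted so a prefix scan is a range scan
        self._vals = {}

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) for every key starting with `prefix`."""
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._vals[self._keys[i]]
            i += 1


def index_game(store, game_id, white, position_keys):
    # One entry per (index name, indexed value, game id): the key is the index.
    store.put(f"player:{white}:{game_id:08d}", game_id)
    for pos in position_keys:
        store.put(f"pos:{pos}:{game_id:08d}", game_id)


store = SortedKV()
index_game(store, 1, "Kasparov", ["startpos", "after-e4"])
index_game(store, 2, "Anand", ["startpos", "after-d4"])
index_game(store, 3, "Kasparov", ["startpos", "after-d4"])

games_by_kasparov = [gid for _, gid in store.scan_prefix("player:Kasparov:")]
games_after_d4 = [gid for _, gid in store.scan_prefix("pos:after-d4:")]
print(games_by_kasparov, games_after_d4)
```

Every question you want to ask needs its own key family written at insert time, which is why an ad hoc material or pawn-structure query is awkward in this model unless you precompute a key for it.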

All popular existing chess programs seem to have created their own databases, which I think is not ideal. Aquarium, I think, supports SQL Server in its latest incarnation. I think using a standard database and optimizing indexes and other areas is the way to go.
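As a rough idea of what "a standard database plus the right indexes" could look like, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module, purely as a stand-in for MySQL or SQL Server. The table layout, the column names, and the `pos_key` values are assumptions for illustration and have nothing to do with Jose's or Aquarium's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (
    game_id INTEGER PRIMARY KEY,
    white   TEXT NOT NULL,
    black   TEXT NOT NULL,
    result  TEXT NOT NULL
);
CREATE TABLE positions (
    pos_key   TEXT NOT NULL,     -- hypothetical key, e.g. a FEN or a hash
    game_id   INTEGER NOT NULL REFERENCES games(game_id),
    ply       INTEGER NOT NULL,
    next_move TEXT               -- move played here, NULL at game end
);
-- The index that makes "what was played in this position?" cheap.
CREATE INDEX idx_positions_key ON positions(pos_key, next_move);
CREATE INDEX idx_games_white ON games(white);
""")

conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?)",
    [(1, "Player A", "Player B", "1-0"),
     (2, "Player C", "Player D", "0-1"),
     (3, "Player A", "Player D", "1/2-1/2")],
)
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?, ?)",
    [("startpos", 1, 0, "e4"),
     ("startpos", 2, 0, "e4"),
     ("startpos", 3, 0, "d4")],
)


def moves_played(conn, pos_key):
    """Return [(move, games)] for a position, most popular move first."""
    rows = conn.execute(
        "SELECT next_move, COUNT(*) FROM positions "
        "WHERE pos_key = ? AND next_move IS NOT NULL "
        "GROUP BY next_move ORDER BY COUNT(*) DESC, next_move",
        (pos_key,),
    )
    return rows.fetchall()


print(moves_played(conn, "startpos"))
```

One row per position per game gets big quickly with 10 million games, so in practice the key would more likely be a compact hash than a full FEN string.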

I had not heard of MonetDB before; it looks quite promising as well.

Robert Flesher · Post by **Robert Flesher** » Sun Jun 27, 2010 12:45 am

Thanks Chris, but I did google this the second I saw Dann's post. However, it seems the links are all old or broken, nothing like the past.

Christopher Conkie · Post by **Christopher Conkie** » Sun Jun 27, 2010 12:56 am

They work for me. Maybe you could try a different browser?

Robert Flesher · Post by **Robert Flesher** » Sun Jun 27, 2010 3:33 am

Will do, thanks.

jdart · Post by **jdart** » Sun Jun 27, 2010 11:40 pm

> I think using a standard database and optimizing indexes and other areas is the way to go.

I agree, and it is surprising this has not been done more. Especially for a "read-mostly" DB, it should be easy to use a standard DB (MySQL, Postgres, or SQL Server Express), and it should also be efficient and fast.

But as for this db - as I have commented before, it contains a lot of games with computer opponents that were generated using fixed limited size opening books. So the variety of moves that you get especially in the opening and early middlegame is limited. So it is not much good for opening analysis. And you should also not rely on the game result for analysis very much because you will find losses on time, and games that were truncated for other reasons (disconnection/forfeiture for example on the chess servers). So it really is junk, probably with some buried gems.
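A first pass at separating the junk from the gems could be a simple filter like the sketch below. It is a rough illustration, not a real PGN parser: it ignores comments and variations, the `parse_games` and `is_junk` names and the 10-ply threshold are arbitrary choices made here, and it assumes the optional PGN `Termination` tag is filled in, which many of these games will not have.

```python
import re

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')
MOVE_NUM_RE = re.compile(r"^\d+\.+$")
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}


def parse_games(pgn_text):
    """Yield (tags, moves) per game. Ignores comments and variations."""
    tags, moves = {}, []
    for line in pgn_text.splitlines():
        line = line.strip()
        m = TAG_RE.match(line)
        if m:
            # A tag after movetext, or a second Event tag, starts a new game.
            if moves or (m.group(1) == "Event" and tags):
                yield tags, moves
                tags, moves = {}, []
            tags[m.group(1)] = m.group(2)
        elif line:
            for tok in line.split():
                if tok not in RESULTS and not MOVE_NUM_RE.match(tok):
                    moves.append(tok)
    if tags or moves:
        yield tags, moves


def is_junk(tags, moves, min_plies=10):
    """Flag anonymous stubs, very short games, and abnormal terminations."""
    unknown = ("", "?")
    anonymous = (tags.get("White", "?") in unknown
                 and tags.get("Black", "?") in unknown)
    too_short = len(moves) < min_plies
    abnormal = tags.get("Termination", "normal").lower() != "normal"
    return anonymous or too_short or abnormal


SAMPLE = '''[Event "?"]
[White "?"]
[Black "?"]
[Result "0-1"]

1. Na3 g5 2. Nc4 0-1

[Event "Example"]
[White "Player A"]
[Black "Player B"]
[Result "1-0"]
[Termination "time forfeit"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4 Nf6 5. O-O Be7 6. Re1 b5 1-0

[Event "Example"]
[White "Player C"]
[Black "Player D"]
[Result "1/2-1/2"]

1. d4 d5 2. c4 e6 3. Nc3 Nf6 4. Bg5 Be7 5. e3 O-O 6. Nf3 h6 1/2-1/2
'''

games = list(parse_games(SAMPLE))
kept = [tags["White"] for tags, moves in games if not is_junk(tags, moves)]
print(len(games), kept)
```

This catches the header-less opening stubs and the flagged time forfeits, but it does nothing about the limited opening variety in the computer games, which would need a different kind of check.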
