10 million chess games

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

alpha123
Posts: 660
Joined: Sat Dec 05, 2009 5:13 am
Location: Colorado, USA

Re: 10 million chess games

Post by alpha123 »

Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
Christopher Conkie
Posts: 6073
Joined: Sat Apr 01, 2006 9:34 pm
Location: Scotland

Re: 10 million chess games

Post by Christopher Conkie »

alpha123 wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
It is not like you to rant......

Reality is somewhat different. MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server. Oracle is certainly on a par with SQL Server but MySQL is not even close to either. That is not to say it has its uses. If you don't like SQL Server you could also try Firebird SQL Server. The support is good and it is faster than MySQL. It might even be a fashion statement for you as to which you use.

For the purpose of this experiment you should try one. Download the games database and have a go at importing it. The issue is indexing. It would be interesting to see what you make.
alpha123
Posts: 660
Joined: Sat Dec 05, 2009 5:13 am
Location: Colorado, USA

Re: 10 million chess games

Post by alpha123 »

Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
It is not like you to rant......

Reality is somewhat different. MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server. Oracle is certainly on a par with SQL Server but MySQL is not even close to either. That is not to say it has its uses. If you don't like SQL Server you could also try Firebird SQL Server. The support is good and it is faster than MySQL. It might even be a fashion statement for you as to which you use.

For the purpose of this experiment you should try one. Download the games database and have a go at importing it. The issue is indexing. It would be interesting to see what you make.
Eh, I have to rant sometimes. Especially if people start criticizing certain pieces of software.... :P

I'll try the experiment.

Right, like Google and Amazon don't have to do much indexing.... why don't they use SQL Server if it's so great?

Peter
Christopher Conkie
Posts: 6073
Joined: Sat Apr 01, 2006 9:34 pm
Location: Scotland

Re: 10 million chess games

Post by Christopher Conkie »

alpha123 wrote:
Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "0-1"]
[ECO "A00h"]
[Variation "Durkin"]
[Annotator ""]
[Source ""]
[Remark ""]

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
It is not like you to rant......

Reality is somewhat different. MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server. Oracle is certainly on a par with SQL Server but MySQL is not even close to either. That is not to say it has its uses. If you don't like SQL Server you could also try Firebird SQL Server. The support is good and it is faster than MySQL. It might even be a fashion statement for you as to which you use.

For the purpose of this experiment you should try one. Download the games database and have a go at importing it. The issue is indexing. It would be interesting to see what you make.
Eh, I have to rant sometimes. Especially if people start criticizing certain pieces of software.... :P

I'll try the experiment.

Right, like Google and Amazon don't have to do much indexing.... why don't they use SQL Server if it's so great?

Peter
Criticizing, where? Why don't they use it? Maybe they don't want to pay someone for something that costs money.

PHP and MySQL are great.....in their own environment.

So you will try to make it with MySQL? Good, because I was about to try that too, with MySQL, SQL Server and PostgreSQL.

Like I said, it would be very interesting to see what results you come up with.

Maybe a new format is needed and you could put that need to belong and contribute toward that. PGN is old. I don't think you will find anyone saying otherwise. Go for it.... :-)

Chris
shiv
Posts: 351
Joined: Sat Apr 01, 2006 2:03 am

Re: 10 million chess games

Post by shiv »

Do not want to start a mysql vs sql server vs <insert name here> db war. I have been able to import all of mega base in jose with no problems at all. The bugs I found in import were related to pgn parsing and not to do with the DB.

I think the selling points of oracle, sql server over mysql is that they provide better tools (especially enterprise tools) and support over mysql. I have seen Mysql used at Yahoo, I know they use it at google and amazon (amongst many other firms).

There are no fundamental problems with mysql for even data as large as a billion records. It can handle large load well but does not come with advanced tools or support like the top vendors. As a result, many firms prefer the top vendors if they can afford the cash and see value. However, I know many successful startups and some large firms managing quite well with mysql alone.

On your claim that MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server, I would say that all are debatable.

Robustness implies that mysql loses more data during crashes when compared to other dbs, not sure about that. As
Scalability is an issue with almost all relational DBs, look at the nosql movement, to solve scalability in a radical way. I see scalability and expandability as similar. Mysql probably has less advanced support for partitioning and replication compared to the top vendors. This has not stopped many firms from using it as these extra features did not seem to justify the cost of switching DBs.

In terms of functionality, I agree. SQL server, Oracle have more features to justify their cost. Whether these features are necessary is another question. A feature that I miss in mysql is OLAP, however there is a joint third party plugin wins aims to make up for the loss.
shiv
Posts: 351
Joined: Sat Apr 01, 2006 2:03 am

Re: 10 million chess games

Post by shiv »

I wonder if this was not done because chessbase wanted to make it harder to import their games to a different format. Same logic applies for Chess Assistant.

Glad that newer CA versions support SQL Server. Perhaps its time for open source programs like Scid to embrace open dbs as well.
jdart wrote:> I think using a standard database and optimizing indexes and other areas is the way to go.

I agree and it is surprising this has not been done more. Especially for a "read mostly' DB it should be easy to use a standard db (MySQL or Postgres or SQL Express) and it should also be efficient and fast.

But as for this db - as I have commented before, it contains a lot of games with computer opponents that were generated using fixed limited size opening books. So the variety of moves that you get especially in the opening and early middlegame is limited. So it is not much good for opening analysis. And you should also not rely on the game result for analysis very much because you will find losses on time, and games that were truncated for other reasons (disconnection/forfeiture for example on the chess servers). So it really is junk, probably with some buried gems.
shiv
Posts: 351
Joined: Sat Apr 01, 2006 2:03 am

Re: 10 million chess games

Post by shiv »

Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

&#91;Event "?"&#93;
&#91;Site "?"&#93;
&#91;Date "????.??.??"&#93;
&#91;Round "?"&#93;
&#91;White "?"&#93;
&#91;Black "?"&#93;
&#91;Result "0-1"&#93;
&#91;ECO "A00h"&#93;
&#91;Variation "Durkin"&#93;
&#91;Annotator ""&#93;
&#91;Source ""&#93;
&#91;Remark ""&#93;

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
It is not like you to rant......

Reality is somewhat different. MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server. Oracle is certainly on a par with SQL Server but MySQL is not even close to either. That is not to say it has its uses. If you don't like SQL Server you could also try Firebird SQL Server. The support is good and it is faster than MySQL. It might even be a fashion statement for you as to which you use.

For the purpose of this experiment you should try one. Download the games database and have a go at importing it. The issue is indexing. It would be interesting to see what you make.
Will try out this experiment as well. Looks like a good exercise.
alpha123
Posts: 660
Joined: Sat Dec 05, 2009 5:13 am
Location: Colorado, USA

Re: 10 million chess games

Post by alpha123 »

Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

&#91;Event "?"&#93;
&#91;Site "?"&#93;
&#91;Date "????.??.??"&#93;
&#91;Round "?"&#93;
&#91;White "?"&#93;
&#91;Black "?"&#93;
&#91;Result "0-1"&#93;
&#91;ECO "A00h"&#93;
&#91;Variation "Durkin"&#93;
&#91;Annotator ""&#93;
&#91;Source ""&#93;
&#91;Remark ""&#93;

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
It is not like you to rant......

Reality is somewhat different. MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server. Oracle is certainly on a par with SQL Server but MySQL is not even close to either. That is not to say it has its uses. If you don't like SQL Server you could also try Firebird SQL Server. The support is good and it is faster than MySQL. It might even be a fashion statement for you as to which you use.

For the purpose of this experiment you should try one. Download the games database and have a go at importing it. The issue is indexing. It would be interesting to see what you make.
Eh, I have to rant sometimes. Especially if people start criticizing certain pieces of software.... :P

I'll try the experiment.

Right, like Google and Amazon don't have to do much indexing.... why don't they use SQL Server if it's so great?

Peter
Criticizing, where? Why don't they use it? Maybe they don't want to pay someone for something that costs money.

PHP and MySQL are great.....in their own environment.

So you will try to make it with MySQL? Good, because I was about to try that too, with MySQL, SQL Server and PostgreSQL.

Like I said, it would be very interesting to see what results you come up with.

Maybe a new format is needed and you could put that need to belong and contribute toward that. PGN is old. I don't think you will find anyone saying otherwise. Go for it.... :-)

Chris
Right, Google is dirt poor. :)

I'm going to try importing Dann's 10 million games into MySQL 5.1.42, Oracle 10g Express Edition (would use 11g Enterprise except the license agreement says I cannot "- disclose results of any program benchmark tests without our prior consent." and I don't have a bazillion dollars for the commercial license), and Firebird SQL 2.1.3.18185. I might try SQL Server 2008 Express also, but I don't know if I can write a Perl script to import the games into that (is there an better way to do this?).

Peter
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 10 million chess games

Post by Dann Corbit »

alpha123 wrote:
Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

&#91;Event "?"&#93;
&#91;Site "?"&#93;
&#91;Date "????.??.??"&#93;
&#91;Round "?"&#93;
&#91;White "?"&#93;
&#91;Black "?"&#93;
&#91;Result "0-1"&#93;
&#91;ECO "A00h"&#93;
&#91;Variation "Durkin"&#93;
&#91;Annotator ""&#93;
&#91;Source ""&#93;
&#91;Remark ""&#93;

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
It is not like you to rant......

Reality is somewhat different. MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server. Oracle is certainly on a par with SQL Server but MySQL is not even close to either. That is not to say it has its uses. If you don't like SQL Server you could also try Firebird SQL Server. The support is good and it is faster than MySQL. It might even be a fashion statement for you as to which you use.

For the purpose of this experiment you should try one. Download the games database and have a go at importing it. The issue is indexing. It would be interesting to see what you make.
Eh, I have to rant sometimes. Especially if people start criticizing certain pieces of software.... :P

I'll try the experiment.

Right, like Google and Amazon don't have to do much indexing.... why don't they use SQL Server if it's so great?

Peter
Criticizing, where? Why don't they use it? Maybe they don't want to pay someone for something that costs money.

PHP and MySQL are great.....in their own environment.

So you will try to make it with MySQL? Good, because I was about to try that too, with MySQL, SQL Server and PostgreSQL.

Like I said, it would be very interesting to see what results you come up with.

Maybe a new format is needed and you could put that need to belong and contribute toward that. PGN is old. I don't think you will find anyone saying otherwise. Go for it.... :-)

Chris
Right, Google is dirt poor. :)

I'm going to try importing Dann's 10 million games into MySQL 5.1.42, Oracle 10g Express Edition (would use 11g Enterprise except the license agreement says I cannot "- disclose results of any program benchmark tests without our prior consent." and I don't have a bazillion dollars for the commercial license), and Firebird SQL 2.1.3.18185. I might try SQL Server 2008 Express also, but I don't know if I can write a Perl script to import the games into that (is there an better way to do this?).

Peter
I suggest use of tool kits such as Rémi Coulom's PGN parser, PGN-Extract (I have a modified version that outputs the EPD string and the outcome of the game, for instance), and other tools of that nature. I think that writing your own parser will add a lot of tedium to the project.

P.S.
Here is my current SQL Model:
http://cap.connx.com/chess-engines/new- ... /CHESS.ZIP

it contains a SQL file and an ER/WIN model.
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: 10 million chess games

Post by Dann Corbit »

Dann Corbit wrote:
alpha123 wrote:
Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
alpha123 wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Christopher Conkie wrote:
Dann Corbit wrote:
Edmund wrote:
Dann Corbit wrote:... The collection has actually grown so large now that there are not really any tools that handle it well. ChessAssistant, ChessBase, Scid... All of them die if I feed the whole pile to them and ask the tool to do something useful. So I am not sure how you can fully utilize the data, but have fun trying....
The question really is, what the data should be used for ..

if you want to query games of a certain player or of a certain tournament, then the scid format is great. But for this case the database (jbase) could be cleaned out a lot. Eg I find a couple of games of the following type:

Code: Select all

&#91;Event "?"&#93;
&#91;Site "?"&#93;
&#91;Date "????.??.??"&#93;
&#91;Round "?"&#93;
&#91;White "?"&#93;
&#91;Black "?"&#93;
&#91;Result "0-1"&#93;
&#91;ECO "A00h"&#93;
&#91;Variation "Durkin"&#93;
&#91;Annotator ""&#93;
&#91;Source ""&#93;
&#91;Remark ""&#93;

1. Na3 g5 2. Nc4 0-1
This looks more like some general instructions for opening books to me and have no value in a games database.

However, if you want to use the database as a foundation for answering questions like, what have players played in this position before, I would rather suggest to transfer the database into another type of structure. That is either tree based or position based. The first being probably the most compact way of storing the database (and that without any loss of data), while the position based version catches transpositions and is also able to find positions similar to the current, but in exchange also requires more space and it looses some information about the games.
With 10 M games cobbled together by me, there is really no chance that I will find time to clean it up properly. Perhaps someone else will do it.
Would it be possible to use SQL like Jose?
Jose is slow with 1/2 million games.

I am thinking about writing my own database interface. I can't think of any other way to get what I want.
I was thinking more in terms of using something robust and large data capable like SQL Server Express rather than MySQL as in Jose. Maybe make a package to import, something like that.
[rant]
I'm a MySQL guy, so I just had to add that MySQL is far more robust than SQL Server. I'd say the only thing more scalable than MySQL is possibly Oracle. Seeing as Google, Yahoo!, Amazon, and many others use MySQL, I don't think it has trouble with a lot of data.
[/rant]

Sorry, I couldn't resist.

Peter
It is not like you to rant......

Reality is somewhat different. MySQL is nowhere near as robust, scalable, expandable or as functional as SQL Server. Oracle is certainly on a par with SQL Server but MySQL is not even close to either. That is not to say it has its uses. If you don't like SQL Server you could also try Firebird SQL Server. The support is good and it is faster than MySQL. It might even be a fashion statement for you as to which you use.

For the purpose of this experiment you should try one. Download the games database and have a go at importing it. The issue is indexing. It would be interesting to see what you make.
Eh, I have to rant sometimes. Especially if people start criticizing certain pieces of software.... :P

I'll try the experiment.

Right, like Google and Amazon don't have to do much indexing.... why don't they use SQL Server if it's so great?

Peter
Criticizing, where? Why don't they use it? Maybe they don't want to pay someone for something that costs money.

PHP and MySQL are great.....in their own environment.

So you will try to make it with MySQL? Good, because I was about to try that too, with MySQL, SQL Server and PostgreSQL.

Like I said, it would be very interesting to see what results you come up with.

Maybe a new format is needed and you could put that need to belong and contribute toward that. PGN is old. I don't think you will find anyone saying otherwise. Go for it.... :-)

Chris
Right, Google is dirt poor. :)

I'm going to try importing Dann's 10 million games into MySQL 5.1.42, Oracle 10g Express Edition (would use 11g Enterprise except the license agreement says I cannot "- disclose results of any program benchmark tests without our prior consent." and I don't have a bazillion dollars for the commercial license), and Firebird SQL 2.1.3.18185. I might try SQL Server 2008 Express also, but I don't know if I can write a Perl script to import the games into that (is there an better way to do this?).

Peter
I suggest use of tool kits such as Rémi Coulom's PGN parser, PGN-Extract (I have a modified version that outputs the EPD string and the outcome of the game, for instance), and other tools of that nature. I think that writing your own parser will add a lot of tedium to the project.

P.S.
Here is my current SQL Model:
http://cap.connx.com/chess-engines/new- ... /CHESS.ZIP

it contains a SQL file and an ER/WIN model.
P.S.
In my model, the movelist is encoded as follows:
For each move in the game, have a chess move generator generate the possible move list. Sort the list lexically and output the move number of the move actually played encoded as an unsigned char. Terminate the list with the value '255'.

The value of this approach is that a 300 move game can store the game list in 600 bytes (one byte for each ply). It is the most compact notation that I am aware of.