Open Chess Game Database Standard

phhnguyen · Post by **phhnguyen** » Sun Oct 24, 2021 4:02 am

mvanthoor wrote: ↑Sat Oct 23, 2021 4:07 pm Don't get me wrong, I fully intend to follow this thread (and possibly provide some input or try some things in Rust myself), because I'd rather have a somewhat larger SQL database than a smaller binary database. I'm not trying to store the 100 million games Lichess puts out every day; if I can store my old MegaBase 2017 with TWIC updates (roughly 8 million games), that'd be fine; even if it's bigger than the Chessbase database.

mvanthoor wrote: ↑Sat Oct 23, 2021 11:23 pm
dangi12012 wrote: ↑Sat Oct 23, 2021 10:53 pm I think an SQL standard really is a great idea and I thank OP for opening this thread.
The one reason I think SQL would be a great idea (assuming it is workable with regard to speed and size) is that I don't have to write my own binary format when I start on my own user interface some day. I'm fully capable of designing such a format (did it before, for embedded projects where SQL was totally not an option), but I'd rather not.

I agreed too. After some quick measurements and tryouts, I see the performance of SQL chess game databases is better than I expected even with large data on the simplest designs/structures. The data size is large but still reasonable. The work (to build up, convert to) is so simple and quick (easy) and there is still much room to improve. I love the speed the SQL engine replies, the free, very flexible way I can query, the clear/clean, easy, almost-no-bug environments/tools I work with.

Look like I won’t come back to the world of binary databases (for chess). I have my own one but will probably abandon it soon and support SQL instead. What we could continue here is to try to push all boundaries to see how far we could go, to persuade more developers to join, at least to support by exporting/importing their own databases into/from SQL ones. Now I feel so lazy, lose the motivation to work myself directly with any chess binary databases

Even if they don’t like our work and decide to create their own SQL databases (with different table structures) we and other users still get a lot of benefit since SQL databases themselves are open standard, self-understandable/explainable, much easy to convert than their binary ones and more people can work with.

phhnguyen · Post by **phhnguyen** » Sun Oct 24, 2021 4:23 am

mvanthoor wrote: ↑Sat Oct 23, 2021 4:07 pm One way to quickly find every position in each was described by Daniel: you could store a Zobrist hash and a GUID for each position you encounter during the PGN -> DB conversion. The DB could then have a table in between the game and position table (don't know the english name: Dutch is "koppel-tabel", or "connection table"), where each record stores a reference to the game, to a position, and an order. So you could query a game from that table, order it on the "order" field, and you have all the positions occurring in that game. You could also query the table on a certain position and (distinct) game Id's, and you instantly got any game in which this position occurs.

dangi12012 wrote: ↑Sat Oct 23, 2021 7:27 pm
Position search is enabled by my tool too. The trick is to store a 128bit hash for all positions. A game becomes a list of hashIDs (IDs not the hashes themselves - so around 48 bits per position.

But its not too big. A few GB for 100s of thousands of games.

I have heard many times about storing hash keys into databases for position searching. At a quick glance, it seems to be a good idea: we can match out any given position with very high speed when both additional code and extra needed space are not large.

However, IMHO, hash keys won’t help much, almost redundant. The main reason is that they can help to search exactly only but that search is actually an approximate one.

Imagine, a user starts searching with a given position, typically he wants us to return all similar/closing positions, including that (exact) position! It means he accepts positions with one or some pieces missing or being added, located in different but closed cells. Exact positions are just a tiny, partly set of what users want to find.

Except for positions in the opening period, caused by the huge number of possible positions, in almost all cases, a user can't find out another game with a position matched exactly to the one he is querying. Thus position matching doesn't have much use here.

Below are some examples about position queries for which hash keys almost are useless:

Find all positions with:
- white Knight in the a6, back Rook in e2
- white Knight in the a6, back Rook in e2, white also has a Rook and a Pawn, black has a Bishop (those pieces can be in any cells)
- 3 Queens
- white has only 2 Rooks
- endgame KQvsKR

What we need is a position index system. It is not simple and not small in size.

How to query positions is another problem. At the moment we have Chess Query Language but it is not easy to use and install:

https://www.chessprogramming.org/Chess_Query_Language

Sopel · Post by **Sopel** » Sun Oct 24, 2021 11:42 am

dangi12012 wrote: ↑Sat Oct 23, 2021 7:27 pm
phhnguyen wrote: ↑Sat Oct 23, 2021 1:12 pm
mvanthoor wrote: ↑Sat Oct 23, 2021 8:06 am The problem with chess databases is not finding players, games tournaments, or results in the database. The problem is finding position X across all games. That is one of the points where most people who design a database format get stuck on SQL.
You are right! But… not totally

Position searching is always a huge challenge for any kind of database, from traditional binary to SQL ones.

At the moment for SQL we store a game as two strings, one for the starting FEN, another for the move list (SAN moves). That’s so simple, straightforward, making it looks like a simple copy of a PGN game. How can a binary database store a game? Perhaps, by using a binary array for the starting position and another binary array for the move list. That should be a simple binary version of the PGN game! The cores of binary and SQL ones are almost the same. Can one use those arrays for position searching? Probably not or with terrible performances since the search function has to create all extra information on the fly. I don’t think there is any magic with those arrays only!

To implement the position search ability, we typically have to add a lot of extra information, index a lot of things. However, what we can add to a binary database, we can add to a SQL database too. There is almost no limitation: SQL can store binary data, from bytes, big numbers to huge arrays. There may be some difficulties and differences between those databases such as speeds, data sizes but I believe they can work similarly.

However, with SQL, we can use existing-almost-perfect-query/search engines in the background, work on easy, clear, clean environments to debug, alternate data structures, thus developers can save much more time and energy to focus on the main work.

In the contrast, the traditional database developers must build almost everything from zero, using heavy buggy development environments, data itself is always the magic, puzzle, changing data structures may cause massive code changes and bugs. That’s why position searching became a very hard and nearly impossible task. That’s why almost all database apps don’t have full features (for position searching).
Position search is enabled by my tool too. The trick is to store a 128bit hash for all positions. A game becomes a list of hashIDs (IDs not the hashes themselves - so around 48 bits per position.

But its not too big. A few GB for 100s of thousands of games.

Last time I did such a DB it was ~1TB for 1 billion lichess games (~80 billion positions). 72-80bit hash is sufficient. Hashes are mostly unique so additional "hashIDs" are useless and only complicate things.

The thing is, there are always tradeoffs. With such a DB it's easy to probe for WDL information in near constant time, but information about individual games is lost (because that would require a linear search over all positions, in the worst case the whole DB would have to be returned). The more general the needed queries are the more linear the search needs to be. Some types of queries (as for example specific piece on square) require more information in the DB too (and you won't be able to do these queries efficiently in SQL unless you do a very fine grained table). There's no middle ground, so there won't ever be a "standard" DB software for chess.

phhnguyen · Post by **phhnguyen** » Sun Oct 24, 2021 1:54 pm

I have just pushed the first full source code into GitHub. The code is written in C++17, using SQLite3, and it can do:
- Create a SQL chess game database with all tables
- Convert games from a PGN file and push them into that database
- Do some queries to test speed

https://github.com/nguyenpham/ocgdb

There is a small database in that repo. The bigger one is about 500 MB, over the limit of GitHub thus it could be downloaded from the below link. Be aware it is not a permanent link and may become invalid when we replace it with newer ones.

Please note that both code and databases are just samples, work in progress, and all may be changed in the coming time. It is not a "must have" since you can use other programming languages and tools to build your own SQL one.

phhnguyen · Post by **phhnguyen** » Sat Oct 30, 2021 2:38 am

FYI: I have just tried some libraries to read, query, and display the SQL chess database and have been surprised about how easy to integrate and how fast and rich features they are. Clearly, both SQL engines and support libraries are mature things and we can get many benefits from them!!!

More details: I have tried the SQL module of Qt and it can work smoothly with SQLite databases, run, query, sort, and display almost instantly even SQL statements are completed with huge data. On the below illustration, the query involved 3 tables with over 3.4 million records and I did not feel any lag when using (querying, sorting, scrolling…).

With SQL we can freely and quickly query game information. For example, we can ask easily some questions which are huge challenges for traditional ones:

Find:
- all games won for players with the term “smith”, from 1990-1940, 2000-2022
- all games with long castings
- all games with comments with the term “excellent”
- all games with ?!
- all lost games for player Elo over 2000 and opening ECO xxx
- the total of all draw games with length under 50
- the ratio Win/Draw/Loss of Carlsen before 2019
- all games which under-promotions

Typically the thing we can query easily is information in PGN tags (event, players, result, date, timer...). Some other information involving games’ bodies/moves/fen can be queried too such as game lengths, move names (for example, we can query move “O-O-O”, “e7q”), comments… Some queries are tricky and/or complicated (such as finding games with under-promotions) but doable.

All tables can be edited directly and easily (features of support libraries): double click to an item, edit, and DONE.

I have been considering writing and publishing a new open-source of a simple GUI tool to help users to convert PGN into SQL databases, view and query data (it will work and look similar to the below illustration).

Query an SQLite database with over 3.4 million games

Sopel · Post by **Sopel** » Sat Oct 30, 2021 10:01 am

How do you search for a specific position?

dangi12012 · Post by **dangi12012** » Sun Oct 31, 2021 1:09 am

phhnguyen wrote: ↑Sat Oct 30, 2021 2:38 am FYI: I have just tried some libraries to read, query, and display the SQL chess database and have been surprised about how easy to integrate and how fast and rich features they are. Clearly, both SQL engines and support libraries are mature things and we can get many benefits from them!!!

More details: I have tried the SQL module of Qt and it can work smoothly with SQLite databases, run, query, sort, and display almost instantly even SQL statements are completed with huge data. On the below illustration, the query involved 3 tables with over 3.4 million records and I did not feel any lag when using (querying, sorting, scrolling…).

With SQL we can freely and quickly query game information. For example, we can ask easily some questions which are huge challenges for traditional ones:

Find:
- all games won for players with the term “smith”, from 1990-1940, 2000-2022
- all games with long castings
- all games with comments with the term “excellent”
- all games with ?!
- all lost games for player Elo over 2000 and opening ECO xxx
- the total of all draw games with length under 50
- the ratio Win/Draw/Loss of Carlsen before 2019
- all games which under-promotions

Typically the thing we can query easily is information in PGN tags (event, players, result, date, timer...). Some other information involving games’ bodies/moves/fen can be queried too such as game lengths, move names (for example, we can query move “O-O-O”, “e7q”), comments… Some queries are tricky and/or complicated (such as finding games with under-promotions) but doable.

All tables can be edited directly and easily (features of support libraries): double click to an item, edit, and DONE.

I have been considering writing and publishing a new open-source of a simple GUI tool to help users to convert PGN into SQL databases, view and query data (it will work and look similar to the below illustration).

Query an SQLite database with over 3.4 million games

Any news on the GUI release?
I would like to see an easy to use - modern - alternative to arena.

Also thats very great since many people suggested that sql is inferiour to handwritten binary formats. Which obviously now everyone can see the advantages

The question remains. How do you store each move under a game? This information needs to be accessed by sql as well. And every move should be queried as a PK to enable searches like:

How many times was this position reached by players above elo 2800?
Also then you could generate the win/lose rate over elo graph for any position. I guess there are some opening traps that spectaculary backfire - but only above a certain elo.

For tournament preparation this would be insane. Find the earliest position where this player consistently makes a mistake. ie plays consistenly sligt mistaken opening move.

phhnguyen · Post by **phhnguyen** » Mon Nov 01, 2021 2:14 am

Sopel wrote: ↑Sat Oct 30, 2021 10:01 am How do you search for a specific position?

That is the next step!

As I have mentioned in a previous post, finding exactly (a specific position) is almost meaningless since from the middle games we almost can’t find a position (from a game) in another game.

It is more common sense the search becomes finding closed/similar positions for a specific one. Thus the search should be approximate.

That should be a huge challenge task. There may be not simple, clear solutions.

IMO, there are two main problems building an approximate position search:
1) how the users can query a position: we may try the Chess Query Language (CQL) first. If it does not satisfy our needs, we may need to build another one, say, one is similar to SQL since it is highly flexible
2) build a system to adapt (1).

I imagine it should be a (large) index system. We may build a workable system first then optimize it for speed and size later.

phhnguyen · Post by **phhnguyen** » Mon Nov 01, 2021 2:32 am

dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am
Any news on the GUI release?

Just start working on a free, simple GUI tool for this kind of database. It plays the role of an example, free library thus other developers can integrate it into their tools.

dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am I would like to see an easy to use - modern - alternative to arena.

I am not aware Arena has its own database!

dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am Also thats very great since many people suggested that sql is inferiour to handwritten binary formats. Which obviously now everyone can see the advantages

The question remains. How do you store each move under a game? This information needs to be accessed by sql as well. And every move should be queried as a PK to enable searches like:

How many times was this position reached by players above elo 2800?
Also then you could generate the win/lose rate over elo graph for any position. I guess there are some opening traps that spectaculary backfire - but only above a certain elo.

ATM, a move list stored as a simple text, similar to what we see in a PGN file.

I didn't measure the performance yet, just see that normal queries are very fast, almost instantly. It may take a bit of time for some kinds of queries such as SELECT COUNT since they need to scan the whole data when the database is large but still acceptable (we can avoid them by using some extra information). When we are watching the game list and click on anyone, the system has known the game id and can extract the whole game almost instantly.

dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am
For tournament preparation this would be insane. Find the earliest position where this player consistently makes a mistake. ie plays consistenly sligt mistaken opening move.

Do you mean to use the database as an opening book? The problem is that there is no weight for any move. The non-tree structure is not optimized for the opening tasks either.

dangi12012 · Post by **dangi12012** » Mon Nov 01, 2021 2:00 pm

phhnguyen wrote: ↑Mon Nov 01, 2021 2:32 am
dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am
Any news on the GUI release?
Just start working on a free, simple GUI tool for this kind of database. It plays the role of an example, free library thus other developers can integrate it into their tools.

dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am I would like to see an easy to use - modern - alternative to arena.
I am not aware Arena has its own database!

dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am Also thats very great since many people suggested that sql is inferiour to handwritten binary formats. Which obviously now everyone can see the advantages

The question remains. How do you store each move under a game? This information needs to be accessed by sql as well. And every move should be queried as a PK to enable searches like:

How many times was this position reached by players above elo 2800?
Also then you could generate the win/lose rate over elo graph for any position. I guess there are some opening traps that spectaculary backfire - but only above a certain elo.
ATM, a move list stored as a simple text, similar to what we see in a PGN file.

I didn't measure the performance yet, just see that normal queries are very fast, almost instantly. It may take a bit of time for some kinds of queries such as SELECT COUNT since they need to scan the whole data when the database is large but still acceptable (we can avoid them by using some extra information). When we are watching the game list and click on anyone, the system has known the game id and can extract the whole game almost instantly.

dangi12012 wrote: ↑Sun Oct 31, 2021 1:09 am
For tournament preparation this would be insane. Find the earliest position where this player consistently makes a mistake. ie plays consistenly sligt mistaken opening move.
Do you mean to use the database as an opening book? The problem is that there is no weight for any move. The non-tree structure is not optimized for the opening tasks either.

PGNs may contain the evaluation and it would be great to have a lookup for every move in the db possible.
For example lichess has something like this:
1. e4 { [%eval 0.24] [%clk 0:03:00] } 1... e5 { [%eval 0.12] [%clk 0:03:00] } 2. Nf3 { [%eval 0.17] [%clk 0:02:59] } 2..
This can be used to select all games of a certain player and find the openings where he consistently plays a slight mistaken move.

Do you plan to write a wrapper as well?
So that we have a C++ header that we can use and get some already implemented methods?

For example "Chess_Database.h" - which already countains a Load(char* path) and Count() Methods.
I want to add that sqlites slower count(*) is deliberate and can be done fast by:

Code: Select all

SELECT MAX(_ROWID_) FROM "table" LIMIT 1;

I dont know if there is a shared language that can emit code for python, c++, c# and java with sqlite backends. Then we only would need to write simple query code once - and it will code for all languages and we can have a central github for that.

Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Source code available

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard