Open Chess Game Database Standard

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
phhnguyen
Posts: 1474
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Jonathan003 wrote: Fri Feb 25, 2022 1:01 pm I'm trying out the new options of ocgdb beta 7.
Was anybody successful detecting duplicates when one game is included in another?
I have tried it with these two games
...
ocgdb Beta 7 didn't find the embedded double.
Maybe I do something wrong?
Thanks for the report. There is a bug and it is fixed. I have replied to the report with more details and executing files in GitHub's Issues of the project.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
phhnguyen
Posts: 1474
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Version Beta 8 released

Changes:
- Fixed bugs
- Merge games from PGN files or other databases
- Improve code-readable


So far, the program has main functions/features:
1. create an SQLite database from multi PGN files
2. merge/add games from multi PGN files/databases into an SQLite database
3. export multi SQLite databases to a PGN file
4. get/display PGN games/FEN strings with game IDs from an SQLite database
5. find duplicates/embedded games from multi SQLite databases
6. query games from multi SQLite databases or PGN files, using PQL (Position Query Language)

With this release, all main functions, and features we wished to have, have been implemented. Thus in the coming time, we will focus on bug fixing and then officially release version 1.0.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
phhnguyen
Posts: 1474
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

I have been working on a new version. There are some new features:

1. support a new database format for storing EPD. The new one will have the file extension “.epd.db3”. The input could be EPD or PGN files
2. improve PQL on patterns and pattern controls:
- new pattern statements: users can point out pieces and their locations or use some FEN strings
- pattern operand: control how to match the pattern with a checking position. The matching may be equal (=, simple comparison between the chessboard of the pattern and the chessboard of the checking position), pattern as the sub-board (<, check if its pieces are the subset) or super-board (>), shifted (#, shift the pattern around the board then match them as sub-board)
- pattern tolerance: accept results with some errors/mismatches


EBNF for pattern statements:

Code: Select all

pattern = "{" pattern_term [ "," pattern_operand [ "," pattern_tolerance ] ] "}"
pattern_term = fenstring | piecename "[" square { ", " square } "]" { "," pattern_term }
pattern_operand = "="  | "=="| “<“ | ">" | "#"
pattern_tolerance = "0" | "1", "2" | ...
Example of some pattern statements:

Code: Select all

{ K[g1], P[f2, g2],R[e2],B[e3], <, 1 }
{ 8/2p5/2p5/1p6/1P6/1P6/8/8 w - - 0 154, # } 
The working code has been uploaded already. I am going to test it for a while. Any ideas, suggestions or helps are welcome
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
miked
Posts: 1
Joined: Mon Jan 16, 2023 10:54 pm
Full name: Mike Davies

Re: Open Chess Game Database Standard

Post by miked »

Hi Pham,

I only discovered BanksiaGui yesterday, and when I saw it was using SQLite as a game database, I found this thread and read it all yesterday evening.

I've been noodling around with using SQLite with PGN files, but as a searchable index to go along with the PGN file, not as a replacement, one index for searching by PGN tags, and one index file for somehow supporting position searches (exact and partial). So I was fascinated to read about your progress.

I chose not to store the full game in a database, because I'd concluded that PGN is such a compact format, that I'd have to do some sort of packing of information to get some benefit, at which point a custom file format should be better. I got stuck with how to deal with nested variations -- I wanted every move, in the game or in the analysis, to be treated the same way. And SQL plus arbitrary-depth nested structures didn't mix for me.

I am delighted to see you had success storing bitboards in the position search. I'm surprised by the great performance of it -- I suspect that the callback functions helps enormously, I was going to use plain bitwise expressions in SQL, and dreading the performance of it.

Amazing ideas on encoding the game moves (both variants: 2-byte and 1.5 byte). Does that solve any particular search problem, or is it just storage of moves?

I had a few thoughts reading your journey:


1.) The problem with positions occurring too much. Yes, good solution on ignoring the start position. Similarly: all games will match an ECO code. Most PGN files have it in the header, and pgn-extract can add it if missing. pgn-extract's eco.pgn results in about 4,000 positions total.

I thought having a table that mapped ECO codes to all the positions that occur in the move order -- similar to your table that's position_hash, blob_of_game_ids, but covering a lot of positions that are already encoded in a game as an ECO code. That should take a big chunk of highly-repeated positions.

I noticed one of Dan Corbitt's post he has a PGN fragment that shows an extended ECO code (so ECO code plus a lowercase letter) -- It was the post he talked about multiple move transpositions to reach a Gligoric King's Indian. I think that extended ECO code list is something that comes with SCID?

2.) SQLite locks on writes, only one write at a time. There's a WAL pragma that helps keep the performance of writes up (Write Ahead Log). May start to be an issue when attempting to do async processing. You're solving a lot by using transactions of a decent size.

3.) If you are having a hash to FEN lookup, consider having that in a separate file, and using SQLite's "ATTACH" to join it to the current database. Since FEN->hash is constant/global. Might be better to have one big lookup that's shared across all databases.

4.) does the ocgdb script support pipes (read from STDIN), so someone could do `gzcat big-pgn-file.pgn.gz | ocgdb ...`

5.) There are multiple FENs that match to a hash -- same position, but arrived at in different move order and the last move was a pawn-two-squares-forward. The difference is that FEN records the en-passant square even if there's no capture possible (no opposing pawns adjacent to the moved pawn). Zobrist purposely excludes enpassant when there isn't adjacent pawns. -- Sure, there's a canonicalisation of FEN that's possible.

6.) Seeing the resource and memory limits are being reached, separate the import process into 2 steps: One to get the games and moves into the database. Second to calculate the position lookup bitboards / hashes. It may mean things are recoverable, and position indexing re-runnable.

7.) What is currently a barrier for me switching over to your database schema for my needs:
* Having the move, move2 and move1 blobs in the Game table. Could we move those blob fields to a separate table? I think there might be a performance benefit -- separate table means separate block in the SQLite file, which means smaller Game table, which means more records in that table can be loaded up into RAM, less paging.
* Surprised ECO field isn't in the Game table by default.
* Not storing NAG (!/?/!?/?! and the Informant style annotation characters), and variations seem to be "text annotations" or ignored. That means PGN -> OCGDB -> PGN loses data.
* Having both datetime of the Game, and EventDate for the event (I can see why both are handy together and by themselves). Date in PGN is just a date, but lichess and chess.com have a UTCDate & UTCTime field that can make up a more accurate date/time of the game (take Firouzja's 200 game bullet match against Naroditsky during the Candidates last year -- the order of the games is important, even though they are minutes apart)


In light of your work, I am revalidating my existing conclusion that SQLite isn't suitable as a PGN replacement. Thank you!

MikeD