Open Chess Game Database Standard

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
phhnguyen
Posts: 1504
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Version Beta released

Post by phhnguyen »

Version Beta released

https://github.com/nguyenpham/ocgdb/rel ... ersionBeta

We have been working to improve the program to be a full feature/useful database program. Below is what users can do with the program:
- convert (multi) PGN files into an (SQLite) database
- search positions with PGN files
- search positions with databases
- get games by their IDs

Users can control many factors/options via arguments. For examples:

- create a new database with two PGN files, create column Moves and Moves2 for moves, discard all comments and sites, discard games with players' ELOs below 2000, game lengths (plies) under 30

Code: Select all

ocgdb -pgn big1.png -pgn big2.png -db big.ocgdb.db3 -cpu 4 -o moves;moves1;discardsites;discardcomments -elo 2000 -plycount 30
- search a database for all games having 3 White Queens

Code: Select all

ocgdb -db big.ocgdb.db3 -cpu 4 -q "Q=3"
- search a PGN file for all games having 3 White Queens

Code: Select all

ocgdb -pgn big1.png -cpu 4 -q "Q=3"
- get game with ID = 4432

Code: Select all

ocgdb -db big.ocgdb.db3 -g 4432
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Ferdy
Posts: 4841
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

phhnguyen wrote: Thu Jan 27, 2022 1:57 am Version Beta released

https://github.com/nguyenpham/ocgdb/rel ... ersionBeta

We have been working to improve the program to be a full feature/useful database program. Below is what users can do with the program:
- convert (multi) PGN files into an (SQLite) database
- search positions with PGN files
- search positions with databases
- get games by their IDs

Users can control many factors/options via arguments. For examples:

- create a new database with two PGN files, create column Moves and Moves2 for moves, discard all comments and sites, discard games with players' ELOs below 2000, game lengths (plies) under 30

Code: Select all

ocgdb -pgn big1.png -pgn big2.png -db big.ocgdb.db3 -cpu 4 -o moves;moves1;discardsites;discardcomments -elo 2000 -plycount 30
- search a database for all games having 3 White Queens

Code: Select all

ocgdb -db big.ocgdb.db3 -cpu 4 -q "Q=3"
- search a PGN file for all games having 3 White Queens

Code: Select all

ocgdb -pgn big1.png -cpu 4 -q "Q=3"
- get game with ID = 4432

Code: Select all

ocgdb -db big.ocgdb.db3 -g 4432
I tried it but there is assertion error.

Code: Select all

PS F:\Chess\Tools\pgn_database\ocgdb-win64-beta3> ls

    Directory: F:\Chess\Tools\pgn_database\ocgdb-win64-beta3

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----          2022-02-01    13:23                db3
-a---          2022-01-22    07:20      242466224 clean_master.pgn
-----          2022-02-01    12:44        1220096 ocgdb.exe

PS F:\Chess\Tools\pgn_database\ocgdb-win64-beta3> .\ocgdb.exe -pgn clean_master.pgn -db clean_master.ocgdb.db3 -cpu 1 -o moves
Open Chess Game Database Standard (OCGDB), (C) 2022 - version: Beta 3

Convert PGN files into a database...
Assertion failed: !paraRecord.dbPaths.empty(), file C:\Users\nguyenpham\ocgdb\src\builder.cpp, line 423
User avatar
phhnguyen
Posts: 1504
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Version Beta released

Post by phhnguyen »

Ferdy wrote: Tue Feb 01, 2022 6:34 am I tried it but there is assertion error.
Thanks for the report. Could you try with the new binary/release?
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Ferdy
Posts: 4841
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

phhnguyen wrote: Tue Feb 01, 2022 10:19 am
Ferdy wrote: Tue Feb 01, 2022 6:34 am I tried it but there is assertion error.
Thanks for the report. Could you try with the new binary/release?
Beta 4 works, thanks.
Ferdy
Posts: 4841
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

I tried to check its duplicate detection capability.

I run with -dup and get this one from some.

Code: Select all

Duplicate games detected between IDs 139844 and 340722
Then tried to find these games with query.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 139844;
I get this.

Code: Select all

139844|7|1990-11-11|Karaklajic, Nikola|2375|Ivkov, Borislav|2480|1/2-1/2|1. e4 c5 2. c3 e6 3. d4 d5 4. e5 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nge7 8.
Nc3 Nf5 9. Na4 1/2-1/2
Then the other game.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 340722;
and I get this.

Code: Select all

340722|7|2011-08-18|Ponkratov,P|2593|Zvjaginsev,V|2659|1/2-1/2|1. e4 e6 2. d4 d5 3. e5 c5 4. c3 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nh6 8.
Nc3 Nf5 9. Na4 Qa5+ 10. Nc3 Qb6 11. Na4 Qa5+ 12. Nc3 Qb6 13. Na4 1/2-1/2
Something is not right these two games are not the same.
User avatar
phhnguyen
Posts: 1504
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Version Beta released

Post by phhnguyen »

Ferdy wrote: Tue Feb 01, 2022 1:19 pm I tried to check its duplicate detection capability.

I run with -dup and get this one from some.

Code: Select all

Duplicate games detected between IDs 139844 and 340722
Then tried to find these games with query.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 139844;
I get this.

Code: Select all

139844|7|1990-11-11|Karaklajic, Nikola|2375|Ivkov, Borislav|2480|1/2-1/2|1. e4 c5 2. c3 e6 3. d4 d5 4. e5 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nge7 8.
Nc3 Nf5 9. Na4 1/2-1/2
Then the other game.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 340722;
and I get this.

Code: Select all

340722|7|2011-08-18|Ponkratov,P|2593|Zvjaginsev,V|2659|1/2-1/2|1. e4 e6 2. d4 d5 3. e5 c5 4. c3 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nh6 8.
Nc3 Nf5 9. Na4 Qa5+ 10. Nc3 Qb6 11. Na4 Qa5+ 12. Nc3 Qb6 13. Na4 1/2-1/2
Something is not right these two games are not the same.

To get those games in a more convenient way, you may run:

Code: Select all

ocgdb.exe -db clean_master.ocgdb.db3 -g 139844 -g 340722 -o printpgn
The issue is interesting. At the moment, OCGDB detects duplicates by simply comparing the last positions of games. If two games have a similar last position as your above games, OCGDB concludes they are duplicates, regardless of other factors. Thus, OCGDB works correctly as the design.

I have been considering improving that logic/algorithm. Perhaps, by adding the game lengths to the comparison.

Or simply ignore???

Any idea?
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Ferdy
Posts: 4841
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

phhnguyen wrote: Tue Feb 01, 2022 10:50 pm
Ferdy wrote: Tue Feb 01, 2022 1:19 pm I tried to check its duplicate detection capability.

I run with -dup and get this one from some.

Code: Select all

Duplicate games detected between IDs 139844 and 340722
Then tried to find these games with query.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 139844;
I get this.

Code: Select all

139844|7|1990-11-11|Karaklajic, Nikola|2375|Ivkov, Borislav|2480|1/2-1/2|1. e4 c5 2. c3 e6 3. d4 d5 4. e5 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nge7 8.
Nc3 Nf5 9. Na4 1/2-1/2
Then the other game.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 340722;
and I get this.

Code: Select all

340722|7|2011-08-18|Ponkratov,P|2593|Zvjaginsev,V|2659|1/2-1/2|1. e4 e6 2. d4 d5 3. e5 c5 4. c3 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nh6 8.
Nc3 Nf5 9. Na4 Qa5+ 10. Nc3 Qb6 11. Na4 Qa5+ 12. Nc3 Qb6 13. Na4 1/2-1/2
Something is not right these two games are not the same.

To get those games in a more convenient way, you may run:

Code: Select all

ocgdb.exe -db clean_master.ocgdb.db3 -g 139844 -g 340722 -o printpgn
The issue is interesting. At the moment, OCGDB detects duplicates by simply comparing the last positions of games. If two games have a similar last position as your above games, OCGDB concludes they are duplicates, regardless of other factors. Thus, OCGDB works correctly as the design.

I have been considering improving that logic/algorithm. Perhaps, by adding the game lengths to the comparison.

Or simply ignore???

Any idea?
Nice command!

All right so in pgn-extract equivalent it is doing positional match at the end of position.

In pgn-extract there is an option called --duplicates with the following detection algorithm.
Duplicates are identified by comparing a hash value for the board of the end positions of extracted games and an additional cumulative hash value generated from the move sequence. If these both values match then games are considered to be duplicates.
Matching the date, white/black names and result header values can also be a good additional method to detect duplicate games.
User avatar
phhnguyen
Posts: 1504
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Version Beta released

Post by phhnguyen »

Ferdy wrote: Wed Feb 02, 2022 3:15 am All right so in pgn-extract equivalent it is doing positional match at the end of position.

In pgn-extract there is an option called --duplicates with the following detection algorithm.
Duplicates are identified by comparing a hash value for the board of the end positions of extracted games and an additional cumulative hash value generated from the move sequence. If these both values match then games are considered to be duplicates.
Matching the date, white/black names and result header values can also be a good additional method to detect duplicate games.
Thanks for the info!

I have been studying/finding solutions for detecting game duplicates. Look like the method based on hash keys of the last positions is not good enough for me. As usual, I am going to write a report and post it here as a document for referencing later.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
phhnguyen
Posts: 1504
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Duplicates

Typically a game is considered as a duplicate of another game if it has an identical move list with that game. The straightforward way is to compare move lists of games together. However that way is too slow because of loading, retrieving the whole game moves multi times with the complexity of O(n!). We have read that some applications are probably implemented that way and run painfully slow.

Detecting duplicate games in a database is a new feature of OCGDB. It is a “dangerous” one since it comes with the ability to remove duplicate games. Any mistake may lead to losing data.

In this experiment, we study some methods/solutions for detecting duplicates. It should work correctly and in a reasonable time. We focus on matching move lists and ignore other factors such as matching game results, names of events, players... We won’t talk about the correctness of games either.

As usual, we use MillionBase 3.45 database (3.45 million games) as a test set. It has been converted already into an SQLite database with column Moves2 (2 bytes encoding for a move). Because of using encoded moves, it can run tests very fast, saving a lot of time for our experiment.


1. Hash keys of last positions (last hash keys)

For each game, we retrieve and store the hash key of the last position to compare with the ones of other games. If two games have the same last hash keys, they are considerably duplicated.

This way requires a small amount of memory because each game needs only a hash key (and probably some extra info). Computing with hash keys (searching, storing) could be very fast too.

This method is quite popular. People have mentioned it many times. Some programs used it already.

We have implemented and the result brief is as below:

Code: Select all

#games: 3456399, elapsed: 306071ms 05:06, speed: 11292 games/s, #duplicates: 56626
Wow, there are too many games reported as duplicates: over 56 thousand of a 3.45 million games database. It is a big surprise! Frankly speaking, we expected Zero for such a good database. We have read that the author of that database (@Rebel) checked and verified it carefully before releasing it. Perhaps we did something wrong, didn’t we?

We verify each pair of duplicate games. OCGDB reported them with their IDs and it has a useful feature to extract and display game PGNs by their IDs:

Code: Select all

ocgdb -db 345.ocgdb.db3 -o printpgn -g 33 -g 35
Below is a reportedly duplicate pair:

ID: 93794


ID: 3453793


The two above games have identical move lists. They are true duplicates. However, typically we ignore those games as duplicates since they are too short.


2. Last hash keys, ignore short games
We implement the parameter -plycount to ignore any game with the move list shorter than plycount.

The result with PlyCount 20:

Code: Select all

#games: 3456399, elapsed: 356036ms 05:56, speed: 9708 games/s, #duplicates: 46774
The number of duplicates reduced significantly but was still too large IMHO.

We found some reportedly duplicate games as below:

ID: 759


ID: 67163



The above games have similar last hash keys/positions. However, they are completely different. If we delete any one of them, it is lost unreasonably.

Perhaps, if we count game lengths, the results may be better?


3. Last hash keys + game lengths
We encode a game length (using Zobrist random numbers) then combine (using XOR operator) it with the last hash key to create a new key and use it to compare. The plycount is still set as 20.

Code: Select all

#games: 3456399, elapsed: 409704ms 06:49, speed: 8436 games/s, #duplicates: 29524
The program reports the number of duplicates reduced significantly.

However, we found some pairs of games as below:

ID: 2953847


ID: 3423761


Those games have the same lengths, same last positions, same result too. But they are clearly not a double.

We should count all middle moves/positions.

4. Combine all hash keys
For each game, we combine all hash keys of all moves/positions together (using XOR) to be a new key to compare.

Code: Select all

#games: 3456399, elapsed: 376612ms 06:16, speed: 9177 games/s, #duplicates: 33706
The problem is that the reported number of duplicates increased. Below is a pair:

ID: 1780


ID: 3426054



Those games have some similar moves but in a different order. They are not double.

In one of the previous posts, someone quoted that the program PGN-extract has been using the method of the last hash key, plus another one from move lists. Let us re-quote again:

Code: Select all

Duplicates are identified by comparing a hash value for the board of the end positions of extracted games and an additional cumulative hash value generated from the move sequence. If both values match then games are considered to be duplicates.
We believe it is somewhat similar to this step.

Perhaps, using hash keys from all positions may cause a kind of overflow bits. We are not really sure but try to reduce the involving number of hash keys.


5. Combination of hash keys from one of each 5 moves

We create a new key from the last hash key and one of every 5 moves. The result is as the below:

Code: Select all

#games: 3456399, elapsed: 361131ms 06:01, speed: 9571 games/s, #duplicates: 21772
The number of reported duplicates was reduced. However, we still find some of those pairs are not duplicates.


6. Compare move lists
Hash keys are good for matching between positions or their combinations. However, we can’t encode orders of positions into hash keys. In other words, using only the hash key can’t solve the problem of having different move orders. Thus, in the end, we decide to compare whole lists of moves to make sure a pair of games is a double.

The algorithm now works as follows:
- Compare the combinations of hash keys as above part (5) to find all potential duplicates
- Verifying each reported pair of games by comparing their whole move lists

When checking a given game if it is a double, we extract its move list already. If it is reportedly a duplicate with some other games, we load all those games to verify further. It is slowed down a bit but because the number of checking is not large thus the whole process is still fast in general.

Basically, we need only 1 time to be read and retrieve all games' move lists. The complexity is O(n) linear. Only a small number of games become suspected and reloaded to verify but they are just small to affect the complexity in general.

Code: Select all

#games: 3456399, elapsed: 122816ms 02:02, speed: 28142 games/s, #duplicates: 17344
The program now needs only 2 minutes to check the whole database.

After verifying, we think 17344 is the correct number of duplicates with plyCount of 20.

Looking back into part 2 (used only last hash keys) with the result of 46774, there are 29430 games (or 63%) that were reported wrongly as duplicates.
Part 4 has a smaller number of 33706, which means 16362 games (49%) may be deleted for nothing.

Below is the result when we increase the plycount from 20 to 40:

Code: Select all

#games: 3456399, elapsed: 130847ms 02:10, speed: 26415 games/s, #duplicates: 15162
Below is a detectedly duplicate pair:

ID: 19103


ID: 1505936


They are real duplicates and we could guess which one is the wrong clone (but it is another story ;) )!


6. Source code
All source code has been pushed already into the main branch with a new release (Beta 5)


7. Quick conclusion
Using only the last hash keys is not good enough to detect duplicate games since it could conclude wrongly for a large proportion (could be wrong of over 63% in our test). We should compare whole move lists to be correct.

In this experiment we developed a method to reduce the number of loading, retrieving move lists to the complexity of O(n). The implementation runs very fast in practice.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
hgm
Posts: 28270
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Open Chess Game Database Standard

Post by hgm »

When I was writing assemblers, the method I used for hashing the identifiers in the symbol table was to calculate the hash key for a symbol of N characters as A * character_N + B * Hash_of_first_N-1_characters, for suitable constants A and B. I dont see how the current problem is any different from hashing character strings. The internal representation of the game can always be thought of as a string.