Open Chess Game Database Standard

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
phhnguyen
Posts: 1475
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Version Beta released

Post by phhnguyen »

Version Beta released

https://github.com/nguyenpham/ocgdb/rel ... ersionBeta

We have been working to improve the program to be a full feature/useful database program. Below is what users can do with the program:
- convert (multi) PGN files into an (SQLite) database
- search positions with PGN files
- search positions with databases
- get games by their IDs

Users can control many factors/options via arguments. For examples:

- create a new database with two PGN files, create column Moves and Moves2 for moves, discard all comments and sites, discard games with players' ELOs below 2000, game lengths (plies) under 30

Code: Select all

ocgdb -pgn big1.png -pgn big2.png -db big.ocgdb.db3 -cpu 4 -o moves;moves1;discardsites;discardcomments -elo 2000 -plycount 30
- search a database for all games having 3 White Queens

Code: Select all

ocgdb -db big.ocgdb.db3 -cpu 4 -q "Q=3"
- search a PGN file for all games having 3 White Queens

Code: Select all

ocgdb -pgn big1.png -cpu 4 -q "Q=3"
- get game with ID = 4432

Code: Select all

ocgdb -db big.ocgdb.db3 -g 4432
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

phhnguyen wrote: Thu Jan 27, 2022 1:57 am Version Beta released

https://github.com/nguyenpham/ocgdb/rel ... ersionBeta

We have been working to improve the program to be a full feature/useful database program. Below is what users can do with the program:
- convert (multi) PGN files into an (SQLite) database
- search positions with PGN files
- search positions with databases
- get games by their IDs

Users can control many factors/options via arguments. For examples:

- create a new database with two PGN files, create column Moves and Moves2 for moves, discard all comments and sites, discard games with players' ELOs below 2000, game lengths (plies) under 30

Code: Select all

ocgdb -pgn big1.png -pgn big2.png -db big.ocgdb.db3 -cpu 4 -o moves;moves1;discardsites;discardcomments -elo 2000 -plycount 30
- search a database for all games having 3 White Queens

Code: Select all

ocgdb -db big.ocgdb.db3 -cpu 4 -q "Q=3"
- search a PGN file for all games having 3 White Queens

Code: Select all

ocgdb -pgn big1.png -cpu 4 -q "Q=3"
- get game with ID = 4432

Code: Select all

ocgdb -db big.ocgdb.db3 -g 4432
I tried it but there is assertion error.

Code: Select all

PS F:\Chess\Tools\pgn_database\ocgdb-win64-beta3> ls

    Directory: F:\Chess\Tools\pgn_database\ocgdb-win64-beta3

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
d----          2022-02-01    13:23                db3
-a---          2022-01-22    07:20      242466224 clean_master.pgn
-----          2022-02-01    12:44        1220096 ocgdb.exe

PS F:\Chess\Tools\pgn_database\ocgdb-win64-beta3> .\ocgdb.exe -pgn clean_master.pgn -db clean_master.ocgdb.db3 -cpu 1 -o moves
Open Chess Game Database Standard (OCGDB), (C) 2022 - version: Beta 3

Convert PGN files into a database...
Assertion failed: !paraRecord.dbPaths.empty(), file C:\Users\nguyenpham\ocgdb\src\builder.cpp, line 423
User avatar
phhnguyen
Posts: 1475
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Version Beta released

Post by phhnguyen »

Ferdy wrote: Tue Feb 01, 2022 6:34 am I tried it but there is assertion error.
Thanks for the report. Could you try with the new binary/release?
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

phhnguyen wrote: Tue Feb 01, 2022 10:19 am
Ferdy wrote: Tue Feb 01, 2022 6:34 am I tried it but there is assertion error.
Thanks for the report. Could you try with the new binary/release?
Beta 4 works, thanks.
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

I tried to check its duplicate detection capability.

I run with -dup and get this one from some.

Code: Select all

Duplicate games detected between IDs 139844 and 340722
Then tried to find these games with query.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 139844;
I get this.

Code: Select all

139844|7|1990-11-11|Karaklajic, Nikola|2375|Ivkov, Borislav|2480|1/2-1/2|1. e4 c5 2. c3 e6 3. d4 d5 4. e5 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nge7 8.
Nc3 Nf5 9. Na4 1/2-1/2
Then the other game.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 340722;
and I get this.

Code: Select all

340722|7|2011-08-18|Ponkratov,P|2593|Zvjaginsev,V|2659|1/2-1/2|1. e4 e6 2. d4 d5 3. e5 c5 4. c3 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nh6 8.
Nc3 Nf5 9. Na4 Qa5+ 10. Nc3 Qb6 11. Na4 Qa5+ 12. Nc3 Qb6 13. Na4 1/2-1/2
Something is not right these two games are not the same.
User avatar
phhnguyen
Posts: 1475
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Version Beta released

Post by phhnguyen »

Ferdy wrote: Tue Feb 01, 2022 1:19 pm I tried to check its duplicate detection capability.

I run with -dup and get this one from some.

Code: Select all

Duplicate games detected between IDs 139844 and 340722
Then tried to find these games with query.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 139844;
I get this.

Code: Select all

139844|7|1990-11-11|Karaklajic, Nikola|2375|Ivkov, Borislav|2480|1/2-1/2|1. e4 c5 2. c3 e6 3. d4 d5 4. e5 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nge7 8.
Nc3 Nf5 9. Na4 1/2-1/2
Then the other game.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 340722;
and I get this.

Code: Select all

340722|7|2011-08-18|Ponkratov,P|2593|Zvjaginsev,V|2659|1/2-1/2|1. e4 e6 2. d4 d5 3. e5 c5 4. c3 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nh6 8.
Nc3 Nf5 9. Na4 Qa5+ 10. Nc3 Qb6 11. Na4 Qa5+ 12. Nc3 Qb6 13. Na4 1/2-1/2
Something is not right these two games are not the same.

To get those games in a more convenient way, you may run:

Code: Select all

ocgdb.exe -db clean_master.ocgdb.db3 -g 139844 -g 340722 -o printpgn
The issue is interesting. At the moment, OCGDB detects duplicates by simply comparing the last positions of games. If two games have a similar last position as your above games, OCGDB concludes they are duplicates, regardless of other factors. Thus, OCGDB works correctly as the design.

I have been considering improving that logic/algorithm. Perhaps, by adding the game lengths to the comparison.

Or simply ignore???

Any idea?
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Version Beta released

Post by Ferdy »

phhnguyen wrote: Tue Feb 01, 2022 10:50 pm
Ferdy wrote: Tue Feb 01, 2022 1:19 pm I tried to check its duplicate detection capability.

I run with -dup and get this one from some.

Code: Select all

Duplicate games detected between IDs 139844 and 340722
Then tried to find these games with query.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 139844;
I get this.

Code: Select all

139844|7|1990-11-11|Karaklajic, Nikola|2375|Ivkov, Borislav|2480|1/2-1/2|1. e4 c5 2. c3 e6 3. d4 d5 4. e5 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nge7 8.
Nc3 Nf5 9. Na4 1/2-1/2
Then the other game.

Code: Select all

sqlite> SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Moves
   ...> FROM Games g
   ...> INNER JOIN Players w ON WhiteID = w.ID
   ...> INNER JOIN Players b ON BlackID = b.ID
   ...> WHERE g.ID = 340722;
and I get this.

Code: Select all

340722|7|2011-08-18|Ponkratov,P|2593|Zvjaginsev,V|2659|1/2-1/2|1. e4 e6 2. d4 d5 3. e5 c5 4. c3 Nc6 5. Nf3 Qb6 6. Be2 cxd4 7. cxd4 Nh6 8.
Nc3 Nf5 9. Na4 Qa5+ 10. Nc3 Qb6 11. Na4 Qa5+ 12. Nc3 Qb6 13. Na4 1/2-1/2
Something is not right these two games are not the same.

To get those games in a more convenient way, you may run:

Code: Select all

ocgdb.exe -db clean_master.ocgdb.db3 -g 139844 -g 340722 -o printpgn
The issue is interesting. At the moment, OCGDB detects duplicates by simply comparing the last positions of games. If two games have a similar last position as your above games, OCGDB concludes they are duplicates, regardless of other factors. Thus, OCGDB works correctly as the design.

I have been considering improving that logic/algorithm. Perhaps, by adding the game lengths to the comparison.

Or simply ignore???

Any idea?
Nice command!

All right so in pgn-extract equivalent it is doing positional match at the end of position.

In pgn-extract there is an option called --duplicates with the following detection algorithm.
Duplicates are identified by comparing a hash value for the board of the end positions of extracted games and an additional cumulative hash value generated from the move sequence. If these both values match then games are considered to be duplicates.
Matching the date, white/black names and result header values can also be a good additional method to detect duplicate games.
User avatar
phhnguyen
Posts: 1475
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Version Beta released

Post by phhnguyen »

Ferdy wrote: Wed Feb 02, 2022 3:15 am All right so in pgn-extract equivalent it is doing positional match at the end of position.

In pgn-extract there is an option called --duplicates with the following detection algorithm.
Duplicates are identified by comparing a hash value for the board of the end positions of extracted games and an additional cumulative hash value generated from the move sequence. If these both values match then games are considered to be duplicates.
Matching the date, white/black names and result header values can also be a good additional method to detect duplicate games.
Thanks for the info!

I have been studying/finding solutions for detecting game duplicates. Look like the method based on hash keys of the last positions is not good enough for me. As usual, I am going to write a report and post it here as a document for referencing later.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
phhnguyen
Posts: 1475
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Duplicates

Typically a game is considered as a duplicate of another game if it has an identical move list with that game. The straightforward way is to compare move lists of games together. However that way is too slow because of loading, retrieving the whole game moves multi times with the complexity of O(n!). We have read that some applications are probably implemented that way and run painfully slow.

Detecting duplicate games in a database is a new feature of OCGDB. It is a “dangerous” one since it comes with the ability to remove duplicate games. Any mistake may lead to losing data.

In this experiment, we study some methods/solutions for detecting duplicates. It should work correctly and in a reasonable time. We focus on matching move lists and ignore other factors such as matching game results, names of events, players... We won’t talk about the correctness of games either.

As usual, we use MillionBase 3.45 database (3.45 million games) as a test set. It has been converted already into an SQLite database with column Moves2 (2 bytes encoding for a move). Because of using encoded moves, it can run tests very fast, saving a lot of time for our experiment.


1. Hash keys of last positions (last hash keys)

For each game, we retrieve and store the hash key of the last position to compare with the ones of other games. If two games have the same last hash keys, they are considerably duplicated.

This way requires a small amount of memory because each game needs only a hash key (and probably some extra info). Computing with hash keys (searching, storing) could be very fast too.

This method is quite popular. People have mentioned it many times. Some programs used it already.

We have implemented and the result brief is as below:

Code: Select all

#games: 3456399, elapsed: 306071ms 05:06, speed: 11292 games/s, #duplicates: 56626
Wow, there are too many games reported as duplicates: over 56 thousand of a 3.45 million games database. It is a big surprise! Frankly speaking, we expected Zero for such a good database. We have read that the author of that database (@Rebel) checked and verified it carefully before releasing it. Perhaps we did something wrong, didn’t we?

We verify each pair of duplicate games. OCGDB reported them with their IDs and it has a useful feature to extract and display game PGNs by their IDs:

Code: Select all

ocgdb -db 345.ocgdb.db3 -o printpgn -g 33 -g 35
Below is a reportedly duplicate pair:

ID: 93794
[pgn]
[Event "Elista olm"]
[PlyCount "1"]
[Result "1-0"]
[Round "1"]
[Site "Elista olm"]
[Date "1998-10-02"]
[Black "Acevedo, S."]
[White "Boipuso, S."]

1. e4 1-0
[/pgn]

ID: 3453793
[pgn]
[Event "ch-Paranaense 2019"]
[PlyCount "1"]
[ECO "B00"]
[Result "1-0"]
[Round "7"]
[Site "Maringa BRA"]
[Date "2019-12-15"]
[Black "Coppula, Pablo De Oliveira"]
[White "Zarelli, Joao Pedro"]

1. e4 1-0
[/pgn]

The two above games have identical move lists. They are true duplicates. However, typically we ignore those games as duplicates since they are too short.


2. Last hash keys, ignore short games
We implement the parameter -plycount to ignore any game with the move list shorter than plycount.

The result with PlyCount 20:

Code: Select all

#games: 3456399, elapsed: 356036ms 05:56, speed: 9708 games/s, #duplicates: 46774
The number of duplicates reduced significantly but was still too large IMHO.

We found some reportedly duplicate games as below:

ID: 759
[pgn]
[Event "ch-Wereld Teams"]
[PlyCount "133"]
[Result "1/2-1/2"]
[WhiteElo "2370"]
[BlackElo "2515"]
[Round "1"]
[Site "Luzern"]
[Date "1989-10-01"]
[Black "Van der Sterren, P."]
[White "Abdel Megid, M."]

1. e4 e5 2. Nf3 Nc6 3. Nc3 g6 4. d4 exd4
5. Nxd4 Bg7 6. Be3 Nf6 7. Qd2 O-O 8. O-O-O Re8
9. Nxc6 bxc6 10. Bg5 Qe7 11. Bc4 Qe5 12. Rde1 d6
13. f4 Qa5 14. e5 dxe5 15. Rxe5 Rxe5 16. fxe5 Qxe5
17. Qd8+ Ne8 18. Rf1 Bb7 19. Bxf7+ Kh8 20. Qe7 Nd6
21. Bb3 Re8 22. Qxe5 Bxe5 23. Na4 c5 24. Nxc5 Bxg2
25. Rf2 Bc6 26. Nd3 Bd4 27. Bf6+ Bxf6 28. Rxf6 Kg7
29. Rf4 g5 30. Rf2 h5 31. Nc5 Re1+ 32. Kd2 Re5
33. Ne6+ Rxe6 34. Bxe6 Ne4+ 35. Ke3 Nxf2 36. Kxf2 Kf6
37. Bc8 Ke5 38. Ke3 Bd5 39. a3 a5 40. b4 axb4
41. axb4 g4 42. c3 Bf7 43. Bd7 Bd5 44. Be8 h4
45. Bd7 g3 46. hxg3 hxg3 47. Bh3 Be4 48. b5 g2
49. Kf2 Kd5 50. Bd7 Kc4 51. Be8 Kxc3 52. Bd7 Kd4
53. Be8 Kc5 54. Kg1 Kd6 55. Kf2 Ke7 56. Bh5 Kf6
57. Be8 Ke5 58. Bd7 Kf4 59. Be8 Kg4 60. Bd7+ Kh4
61. Bc6 Bxc6 62. bxc6 Kg4 63. Kxg2 Kf4 64. Kf2 Ke4
65. Ke2 Kd5 66. Kd3 Kxc6 67. Kc4 1/2-1/2
[/pgn]

ID: 67163
[pgn]
[Event "Stockholm Rilton Cup"]
[PlyCount "131"]
[Result "1/2-1/2"]
[Round "1"]
[Site "Stockholm Rilton Cup"]
[Date "1996-01-01"]
[Black "Maidla, V."]
[White "Belokon, A."]

1. e4 e5 2. Nf3 Nc6 3. Bb5 d6 4. d4 exd4
5. Nxd4 Bd7 6. Bxc6 bxc6 7. O-O g6 8. c4 Bg7
9. Nc3 Nf6 10. Bg5 h6 11. Bf4 O-O 12. h3 Re8
13. Re1 Nh5 14. Be3 c5 15. Nde2 Qh4 16. Qd2 Kh7
17. Rac1 Rab8 18. b3 a5 19. f3 a4 20. Bf2 Qd8
21. Rb1 axb3 22. axb3 Bc6 23. Rec1 Qd7 24. Rd1 f5
25. exf5 Qxf5 26. Qd3 Qxd3 27. Rxd3 Rb4 28. g4 Nf6
29. Nc1 Reb8 30. Ra1 Nd7 31. Bg3 g5 32. Ra6 Bd4+
33. Kf1 R4b6 34. Ra7 R8b7 35. Rxb7 Rxb7 36. N3e2 Ne5
37. Bxe5 Bxe5 38. Kf2 Ra7 39. Rd2 Kg6 40. Kg2 Kf6
41. Ra2 Rxa2 42. Nxa2 Bb2 43. Nac3 Bxc3 44. Nxc3 Ke5
45. Ne2 d5 46. Kg3 dxc4 47. bxc4 Be8 48. h4 Bf7
49. hxg5 hxg5 50. f4+ gxf4+ 51. Nxf4 Bxc4 52. Kf3 Kd4
53. g5 Bg8 54. Ke2 Ke4 55. Nh5 Bf7 56. Nf6+ Kd4
57. Kd2 c4 58. Ng4 Ke4 59. Kc3 Kf4 60. Nf2 Kxg5
61. Ne4+ Kf4 62. Nd2 Ke5 63. Nxc4+ Bxc4 64. Kxc4 Kd6
65. Kd4 Kc6 66. Kc4 1/2-1/2
[/pgn]


The above games have similar last hash keys/positions. However, they are completely different. If we delete any one of them, it is lost unreasonably.

Perhaps, if we count game lengths, the results may be better?


3. Last hash keys + game lengths
We encode a game length (using Zobrist random numbers) then combine (using XOR operator) it with the last hash key to create a new key and use it to compare. The plycount is still set as 20.

Code: Select all

#games: 3456399, elapsed: 409704ms 06:49, speed: 8436 games/s, #duplicates: 29524
The program reports the number of duplicates reduced significantly.

However, we found some pairs of games as below:

ID: 2953847
[pgn]
[Event "Titled Tuesday 12th May"]
[PlyCount "135"]
[ECO "B30"]
[Result "1/2-1/2"]
[WhiteElo "2634"]
[BlackElo "2477"]
[Round "3"]
[Site "Chess.com INT"]
[Date "2020-05-12"]
[Black "Arjun, Kalyan"]
[White "Tari, A"]

1. e4 c5 2. Nf3 Nc6 3. Bb5 e6 4. O-O Nge7
5. Re1 Nd4 6. Nxd4 cxd4 7. c3 a6 8. Bf1 Nc6
9. d3 Bc5 10. Qg4 O-O 11. e5 f5 12. Qg3 b5
13. Bf4 Bb7 14. Nd2 Rc8 15. h4 Qe8 16. Nf3 Qg6
17. Qh2 h6 18. h5 Qf7 19. Rac1 Bb6 20. Be2 dxc3
21. bxc3 Ne7 22. d4 Ba5 23. Bd2 Nd5 24. c4 Bxd2
25. Nxd2 Nc7 26. Rc3 bxc4 27. Nxc4 Nb5 28. Rb3 Bd5
29. Rxb5 Bxc4 30. Rc5 Rxc5 31. dxc5 Bxe2 32. Rxe2 Rb8
33. Qh4 Qf8 34. Rc2 Rb1+ 35. Kh2 f4 36. Qg4 Rb4
37. c6 dxc6 38. Qxe6+ Qf7 39. Rxc6 Qxe6 40. Rxe6 Ra4
41. Kh3 Rxa2 42. f3 Kf7 43. Rb6 Ra5 44. e6+ Kf6
45. Kg4 Rg5+ 46. Kxf4 Rxg2 47. Rxa6 Re2 48. e7+ Kxe7
49. Kf5 Kf7 50. Ra7+ Re7 51. Rxe7+ Kxe7 52. Kg6 Kf8
53. Kh7 Kf7 54. f4 Kf8 55. Kg6 Kg8 56. f5 Kf8
57. f6 gxf6 58. Kxf6 Kg8 59. Kg6 Kh8 60. Kf7 Kh7
61. Kf8 Kh8 62. Kf7 Kh7 63. Kf6 Kh8 64. Kg6 Kg8
65. Kxh6 Kh8 66. Kg6 Kg8 67. h6 Kh8 68. h7 1/2-1/2
[/pgn]

ID: 3423761
[pgn]
[Event "Lichtenberger Sommer 2019"]
[PlyCount "135"]
[ECO "E60"]
[Result "1/2-1/2"]
[WhiteElo "1500"]
[BlackElo "1304"]
[Round "4.88"]
[Site "Berlin GER"]
[Date "2019-08-20"]
[Black "Morczynski, Ryszard"]
[White "Nimptsch, Nikolas"]

1. d4 Nf6 2. c4 g6 3. Nf3 Bg7 4. Nc3 O-O
5. Bf4 d6 6. h3 c5 7. e3 cxd4 8. exd4 Nh5
9. Bh2 f5 10. Be2 Nd7 11. Ng5 Ndf6 12. O-O Bh6
13. Nf3 Nf4 14. Re1 Nxe2+ 15. Qxe2 Re8 16. Rad1 Bd7
17. d5 Rc8 18. Nd4 Qc7 19. b3 f4 20. Ne6 Bxe6
21. Qxe6+ Kg7 22. Ne2 Qd7 23. Nxf4 Bxf4 24. Bxf4 Qxe6
25. Rxe6 Rc7 26. Bg5 Kf7 27. Rde1 Ng8 28. h4 b5
29. Bf4 bxc4 30. Bxd6 Rd7 31. Bc5 Rxd5 32. Bxa7 cxb3
33. axb3 Ra8 34. Ra6 Rd7 35. Rea1 Rd6 36. Rxd6 exd6
37. Bd4 Rxa1+ 38. Bxa1 Ke6 39. Kf1 Kd5 40. Ke2 Ne7
41. Kd3 Nc6 42. Bc3 Kc5 43. g4 d5 44. f4 Kd6
45. f5 gxf5 46. gxf5 Ne5+ 47. Bxe5+ Kxe5 48. b4 Kxf5
49. Kd4 Ke6 50. b5 Kd6 51. h5 h6 52. b6 Kc6
53. b7 Kxb7 54. Kxd5 Kc7 55. Ke6 Kd8 56. Kd6 Ke8
57. Ke6 Kf8 58. Kf6 Kg8 59. Kg6 Kh8 60. Kxh6 Kg8
61. Kg5 Kh7 62. h6 Kg8 63. Kg6 Kh8 64. Kh5 Kg8
65. Kg5 Kh7 66. Kh5 Kg8 67. Kg6 Kh8 68. h7 1/2-1/2
[/pgn]

Those games have the same lengths, same last positions, same result too. But they are clearly not a double.

We should count all middle moves/positions.

4. Combine all hash keys
For each game, we combine all hash keys of all moves/positions together (using XOR) to be a new key to compare.

Code: Select all

#games: 3456399, elapsed: 376612ms 06:16, speed: 9177 games/s, #duplicates: 33706
The problem is that the reported number of duplicates increased. Below is a pair:

ID: 1780
[pgn]
[Event "Kladovo zt"]
[PlyCount "22"]
[Result "1/2-1/2"]
[WhiteElo "2475"]
[BlackElo "2440"]
[Round "1"]
[Site "Kladovo zt"]
[Date "1993-01-01"]
[Black "Tosic, M."]
[White "Abramovic, B."]

1. d4 d5 2. c4 c6 3. Nf3 Nf6 4. cxd5 cxd5
5. Bf4 Bf5 6. e3 e6 7. Nc3 Nc6 8. Bd3 Bxd3
9. Qxd3 Bd6 10. Bxd6 Qxd6 11. O-O O-O 1/2-1/2
[/pgn]

ID: 3426054
[pgn]
[Event "ch-LAT 2018"]
[PlyCount "22"]
[ECO "D10"]
[Result "1/2-1/2"]
[WhiteElo "1946"]
[BlackElo "1958"]
[Round "4"]
[Site "Riga LAT"]
[Date "2018-05-02"]
[Black "Brikers, Aleksandrs"]
[White "Isakovs, Janis"]

1. d4 d5 2. c4 c6 3. cxd5 cxd5 4. Nc3 Nf6
5. Bf4 Bf5 6. e3 e6 7. Bd3 Bxd3 8. Qxd3 Nc6
9. Nf3 Bd6 10. Bxd6 Qxd6 11. O-O O-O 1/2-1/2
[/pgn]


Those games have some similar moves but in a different order. They are not double.

In one of the previous posts, someone quoted that the program PGN-extract has been using the method of the last hash key, plus another one from move lists. Let us re-quote again:

Code: Select all

Duplicates are identified by comparing a hash value for the board of the end positions of extracted games and an additional cumulative hash value generated from the move sequence. If both values match then games are considered to be duplicates.
We believe it is somewhat similar to this step.

Perhaps, using hash keys from all positions may cause a kind of overflow bits. We are not really sure but try to reduce the involving number of hash keys.


5. Combination of hash keys from one of each 5 moves

We create a new key from the last hash key and one of every 5 moves. The result is as the below:

Code: Select all

#games: 3456399, elapsed: 361131ms 06:01, speed: 9571 games/s, #duplicates: 21772
The number of reported duplicates was reduced. However, we still find some of those pairs are not duplicates.


6. Compare move lists
Hash keys are good for matching between positions or their combinations. However, we can’t encode orders of positions into hash keys. In other words, using only the hash key can’t solve the problem of having different move orders. Thus, in the end, we decide to compare whole lists of moves to make sure a pair of games is a double.

The algorithm now works as follows:
- Compare the combinations of hash keys as above part (5) to find all potential duplicates
- Verifying each reported pair of games by comparing their whole move lists

When checking a given game if it is a double, we extract its move list already. If it is reportedly a duplicate with some other games, we load all those games to verify further. It is slowed down a bit but because the number of checking is not large thus the whole process is still fast in general.

Basically, we need only 1 time to be read and retrieve all games' move lists. The complexity is O(n) linear. Only a small number of games become suspected and reloaded to verify but they are just small to affect the complexity in general.

Code: Select all

#games: 3456399, elapsed: 122816ms 02:02, speed: 28142 games/s, #duplicates: 17344
The program now needs only 2 minutes to check the whole database.

After verifying, we think 17344 is the correct number of duplicates with plyCount of 20.

Looking back into part 2 (used only last hash keys) with the result of 46774, there are 29430 games (or 63%) that were reported wrongly as duplicates.
Part 4 has a smaller number of 33706, which means 16362 games (49%) may be deleted for nothing.

Below is the result when we increase the plycount from 20 to 40:

Code: Select all

#games: 3456399, elapsed: 130847ms 02:10, speed: 26415 games/s, #duplicates: 15162
Below is a detectedly duplicate pair:

ID: 19103
[pgn]
[Event "St Petersburg"]
[PlyCount "57"]
[Result "1-0"]
[WhiteElo "2285"]
[BlackElo "2105"]
[Round "1"]
[Site "St Petersburg"]
[Date "1998-11-04"]
[Black "Gassij, O."]
[White "Ananchenko, I."]

1. d4 Nf6 2. c4 d5 3. cxd5 Nxd5 4. Nc3 e6
5. Nf3 Bb4 6. Bd2 c6 7. e3 O-O 8. Bd3 Nd7
9. O-O Bd6 10. a3 Re8 11. Rc1 N7b6 12. Ne4 Nf6
13. Nxd6 Qxd6 14. e4 e5 15. Bb4 Qd8 16. dxe5 Ng4
17. Bd6 f6 18. Rc5 fxe5 19. Bc2 Nd5 20. Bxe5 Nxe5
21. Nxe5 Qd6 22. b4 Rxe5 23. exd5 cxd5 24. f4 Re3
25. Rxd5 Qb6 26. Kh1 Be6 27. Rd6 Qc7 28. Rxe6 Rxe6
29. Bb3 1-0
[/pgn]

ID: 1505936
[pgn]
[Event "Chigorin Mem"]
[PlyCount "57"]
[Result "1-0"]
[WhiteElo "2285"]
[BlackElo "2105"]
[Round "1"]
[Site "St Petersburg RUS"]
[Date "1998-11-04"]
[Black "Gassij, O."]
[White "Ananchenko, I."]

1. d4 Nf6 2. c4 d5 3. cxd5 Nxd5 4. Nc3 e6
5. Nf3 Bb4 6. Bd2 c6 7. e3 O-O 8. Bd3 Nd7
9. O-O Bd6 10. a3 Re8 11. Rc1 N7b6 12. Ne4 Nf6
13. Nxd6 Qxd6 14. e4 e5 15. Bb4 Qd8 16. dxe5 Ng4
17. Bd6 f6 18. Rc5 fxe5 19. Bc2 Nd5 20. Bxe5 Nxe5
21. Nxe5 Qd6 22. b4 Rxe5 23. exd5 cxd5 24. f4 Re3
25. Rxd5 Qb6 26. Kh1 Be6 27. Rd6 Qc7 28. Rxe6 Rxe6
29. Bb3 1-0
[/pgn]

They are real duplicates and we could guess which one is the wrong clone (but it is another story ;) )!


6. Source code
All source code has been pushed already into the main branch with a new release (Beta 5)


7. Quick conclusion
Using only the last hash keys is not good enough to detect duplicate games since it could conclude wrongly for a large proportion (could be wrong of over 63% in our test). We should compare whole move lists to be correct.

In this experiment we developed a method to reduce the number of loading, retrieving move lists to the complexity of O(n). The implementation runs very fast in practice.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
hgm
Posts: 27976
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Open Chess Game Database Standard

Post by hgm »

When I was writing assemblers, the method I used for hashing the identifiers in the symbol table was to calculate the hash key for a symbol of N characters as A * character_N + B * Hash_of_first_N-1_characters, for suitable constants A and B. I dont see how the current problem is any different from hashing character strings. The internal representation of the game can always be thought of as a string.