How to find original games in big database

Jonathan003 · Post by **Jonathan003** » Sun Apr 19, 2020 7:40 pm

I have a small database (let's say 300,000 games) of games truncated to 40 ply, with player names and game statistics. The games come from an obk book that was made from a big database of about 800,000 games. I converted the book to PGN with obk2bin, so the player names and game statistics were lost in the process. I was able to replace the games with the original games (with player names and game statistics) by first truncating the original games to 40 ply, then adding the databases together and searching for doubles in Chessbase 15. But the games are still only 40 ply. I want to find the complete games.
Is there some way to achieve this? I have tried the find-duplicates options in Chess Assistant, but I find them very confusing. I also tried finding twins with SCID, but without any luck.

Deberger · Post by **Deberger** » Sun Apr 19, 2020 9:11 pm

For each game, I would fast-forward to the end of the truncated game (ply 40, i.e. move 20) and search for all games matching the position at that truncation point.

This would require a very short script using something like pgn-extract or scid.

Jonathan003 · Post by **Jonathan003** » Sun Apr 19, 2020 10:08 pm

I don't know how to write scripts. I'm not a programmer. I can only do some very basic stuff with command line apps. There are also some shorter games included that I would like to be able to search for. Most games are 40 ply, but some are shorter, I think because the original games were also shorter.

Deberger · Post by **Deberger** » Sun Apr 19, 2020 10:57 pm

> very basic stuff with command line apps

This is all we need.

pgn-extract truncated-games.pgn -C -F > note-final-positions.pgn

Then extract the final positions using your most comfortable method.

Ferdy · Post by **Ferdy** » Mon Apr 20, 2020 2:46 am

> Jonathan003 wrote: ↑Sun Apr 19, 2020 7:40 pm
> I have a small database (let's say 300,000 games) of games truncated to 40 ply, with player names and game statistics. The games come from an obk book that was made from a big database of about 800,000 games. I converted the book to PGN with obk2bin, so the player names and game statistics were lost in the process. I was able to replace the games with the original games (with player names and game statistics) by first truncating the original games to 40 ply, then adding the databases together and searching for doubles in Chessbase 15. But the games are still only 40 ply. I want to find the complete games.
> Is there some way to achieve this? I have tried the find-duplicates options in Chess Assistant, but I find them very confusing. I also tried finding twins with SCID, but without any luck.

Try these command lines using pgn-extract.

1. To extract from big.pgn the games that match games in obk.pgn with fewer than 40 plies, and save them in extract1.pgn:

pgn-extract -U --fuzzydepth 0 -oextract1.pgn obk.pgn big.pgn

2. To extract from big.pgn the games that match games in obk.pgn with 40 plies, and save them in extract2.pgn:

pgn-extract -U --fuzzydepth 40 -oextract2.pgn obk.pgn big.pgn

You can then combine the extract1.pgn and extract2.pgn after examining those files.

When I tested this method on a small number of games, it worked.
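If the fuzzy matching pulls in more games than wanted, a stricter alternative is to compare move sequences directly. The following is only a rough Python sketch of that idea, not what pgn-extract does internally: it keeps a game from the big file only when its first 40 plies are move-for-move identical to one of the truncated games. The function names (`split_games`, `san_moves`, `find_originals`) are made up for this illustration, and the sketch assumes that both files use the same move notation and that the movetext contains no variations in parentheses.

```python
import re

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}

def split_games(pgn_text):
    """Split PGN text into (tag_section, movetext) pairs.
    Assumption: every game starts with [Tag "value"] lines."""
    games, tags, moves = [], [], []
    for line in pgn_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            if moves:  # a tag line after movetext starts the next game
                games.append(("\n".join(tags), " ".join(moves)))
                tags, moves = [], []
            tags.append(line)
        elif line:
            moves.append(line)
    if tags or moves:
        games.append(("\n".join(tags), " ".join(moves)))
    return games

def san_moves(movetext):
    """Reduce movetext to a plain list of moves (no numbers, comments, results)."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop { comments }
    text = re.sub(r"\$\d+", " ", text)          # drop numeric annotation glyphs
    moves = []
    for token in text.split():
        token = re.sub(r"^\d+\.+", "", token)   # "12." or "12...Nf6" -> "" / "Nf6"
        if token and token not in RESULTS:
            moves.append(token.rstrip("!?"))
    return moves

def find_originals(truncated_pgn, big_pgn, max_ply=40):
    """Return full games whose first max_ply plies equal a truncated game."""
    wanted = {tuple(san_moves(m)) for _, m in split_games(truncated_pgn)}
    hits = []
    for tags, movetext in split_games(big_pgn):
        if tuple(san_moves(movetext)[:max_ply]) in wanted:
            hits.append(tags + "\n\n" + movetext)
    return hits

# Hypothetical usage with the file names from this thread:
# hits = find_originals(open("obk.pgn").read(), open("big.pgn").read())
# open("originals.pgn", "w").write("\n\n".join(hits) + "\n")
```

A truncated game shorter than 40 plies only matches a full game of exactly the same length, which fits the case where the original game was itself short. If several full games share the same first 40 plies, all of them are returned.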

Jonathan003 · Post by **Jonathan003** » Wed Apr 22, 2020 12:23 am

Thanks for the recommendations. I tried the method Ferdy described.
The result is not perfect. A lot of sidelines are also included in the results. I use obk2bin to convert an obk book to PGN. Then I convert it to cbh in Chessbase 15, search for games with a ? and delete those games, so that only the main lines remain in the database. So I don't like it if sidelines are added again.

Jonathan003 · Post by **Jonathan003** » Thu Apr 23, 2020 12:57 pm

> Deberger wrote: ↑Sun Apr 19, 2020 10:57 pm
>> very basic stuff with command line apps
>
> This is all we need.
>
> pgn-extract truncated-games.pgn -C -F > note-final-positions.pgn
>
> Then extract the final positions using your most comfortable method.

What to do next?
Here is an example of what 'note-final-positions' looks like:

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "1/2-1/2"]
[PlyCount "40"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qc2 d5 5. a3 Bxc3+ 6. Qxc3 Ne4 7. Qc2 c5
8. dxc5 Nc6 9. cxd5 exd5 10. e3 Qa5+ 11. b4 Nxb4 12. axb4 Qxa1 13. Bb5+ Kf8
14. Ne2 a6 15. Bd3 Bd7 16. f3 Ba4 17. Qb2 Qxb2 18. Bxb2 Ng5 19. Nd4 Bd7 20.
Kf2 f6 { "r4k1r/1p1b2pp/p4p2/2Pp2n1/1P1N4/3BPP2/1B3KPP/7R w - - 0 21" }
1/2-1/2

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "1/2-1/2"]
[PlyCount "40"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qc2 d5 5. a3 Bxc3+ 6. Qxc3 Ne4 7. Qc2 c5
8. dxc5 Nc6 9. cxd5 exd5 10. e3 Qf6 11. f3 Qh4+ 12. g3 Nxg3 13. Qf2 Nf5 14.
Qxh4 Nxh4 15. b4 a6 16. Kf2 Ne5 17. Bb2 f6 18. Rd1 Be6 19. Ne2 Bf7 20. Rg1
Nc4 { "r3k2r/1p3bpp/p4p2/2Pp4/1Pn4n/P3PP2/1B2NK1P/3R1BR1 w kq - 6 21" }
1/2-1/2

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "1/2-1/2"]
[PlyCount "40"]

1. d4 Nf6 2. c4 e6 3. Nc3 Bb4 4. Qc2 d5 5. a3 Bxc3+ 6. Qxc3 Ne4 7. Qc2 c5
8. dxc5 Nc6 9. cxd5 exd5 10. Nf3 Qf6 11. e3 Bg4 12. Be2 O-O 13. O-O Rfe8
14. Bd2 d4 15. Rad1 Nxd2 16. Rxd2 dxe3 17. Rd6 Re6 18. fxe3 Rxd6 19. cxd6
Bxf3 20. Bxf3 Qxd6 { "r5k1/pp3ppp/2nq4/8/8/P3PB2/1PQ3PP/5RK1 w - - 0 21" }
1/2-1/2

[Event "?"]
[Site "?"]
[Date "????.??.??"]
[Round "?"]
[White "?"]
[Black "?"]
[Result "1/2-1/2"]
[PlyCount "40"]

I got messages like this: 'String length 76 is too long for the line length of 75'
What does it mean? Is there something wrong with the PGN output from obk2bin?

I know that if I convert a bin book to PGN with polyglot-tolerant, the output has many transpositional errors: threefold repetitions that were not in the original games, such as knights and bishops going back to their starting squares, rooks moving back and forth, etc.
I haven't found these errors in the output of obk2bin so far.

Henk · Post by **Henk** » Thu Apr 23, 2020 1:03 pm

Ras · Post by **Ras** » Thu Apr 23, 2020 1:14 pm

> Jonathan003 wrote: ↑Thu Apr 23, 2020 12:57 pm
> I got messages like this: 'String length 76 is too long for the line length of 75'
> What does it mean?

Exactly what it says. The string is too long to fit on one output line, so you need to allow a longer line length.

Check the -w argument for pgn-extract:

Output line length (-w or --linelength)
The -w flag allows an approximate line length to be set for output. Normally games are output with lines up to a maximum of 75 characters. Use the -w flag if you want longer output lines. For instance, you might want all the moves of a game to appear on a single line. You would get this effect by specifying -w1000 (say):

pgn-extract -w1000 file.pgn
If some games are more than 1000 characters long then just increase the value.

Source: https://www.cs.kent.ac.uk/people/staff/ ... lp.html#-w

Deberger · Post by **Deberger** » Thu Apr 23, 2020 4:42 pm

> Jonathan003 wrote: ↑Thu Apr 23, 2020 12:57 pm
>> Deberger wrote: ↑Sun Apr 19, 2020 10:57 pm
>>> very basic stuff with command line apps
>>
>> This is all we need.
>>
>> pgn-extract truncated-games.pgn -C -F > note-final-positions.pgn
>>
>> Then extract the final positions using your most comfortable method. <---
>
> What to do next?
> Here is an example of what 'note-final-positions' looks like:

I like the Linux command line, so something like this:

pgn-extract truncated-games.pgn -C -F | awk -F \" '/{/ {print $2}' > fenlist.txt
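For anyone without awk at hand, the same extraction step could be sketched in Python. This is only an illustration, under the assumption that the final position always appears as a comment of the form { "FEN" }, as in the sample output quoted earlier in the thread. The name `final_fens` is invented here. Unlike the awk one-liner, the sketch also drops repeated positions, since different truncated games can end in the same position.

```python
import re

# Assumed shape of the trailing comment in the -F output: { "FEN string" }
FEN_COMMENT = re.compile(r'\{\s*"([^"]+)"\s*\}')

def final_fens(pgn_text):
    """Return the FEN strings found in { "..." } comments,
    without duplicates, in the order they first appear."""
    return list(dict.fromkeys(FEN_COMMENT.findall(pgn_text)))

# Hypothetical usage with the file names from this thread:
# fens = final_fens(open("note-final-positions.pgn").read())
# open("fenlist.txt", "w").write("\n".join(fens) + "\n")
```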

> What to do next?

Loop through the fenlist and extract the complete games from some MegaComplete pgn file:

pgn-extract truncated-games.pgn -C -F | awk -F \" '/{/ {print $2}' | while read -r FEN; do pgn-extract -Tf"$FEN" MegaComplete.pgn; done
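The same loop could be driven from Python, which avoids shell quoting problems with the spaces inside a FEN. This is a sketch only: the -Tf option is copied from the command above and has not been checked against any particular pgn-extract version, and `build_commands` and `run_all` are names invented for this example.

```python
import subprocess

def build_commands(fens, big_pgn="MegaComplete.pgn"):
    """Build one pgn-extract call per FEN, mirroring -Tf"$FEN" from the shell loop.
    Each command is an argument list, so the FEN stays a single argument."""
    return [["pgn-extract", "-Tf" + fen, big_pgn] for fen in fens]

def run_all(fens, big_pgn, out_path):
    """Run every command and collect the matching games in one output file.
    Requires pgn-extract to be installed and on the PATH."""
    with open(out_path, "w") as out:
        for cmd in build_commands(fens, big_pgn):
            subprocess.run(cmd, stdout=out, check=True)

# Hypothetical usage:
# fens = open("fenlist.txt").read().splitlines()
# run_all(fens, "MegaComplete.pgn", "complete-games.pgn")
```

Like the shell loop, this scans the whole big file once per position, so it is likely to be slow for hundreds of thousands of positions. Removing duplicate FENs first reduces the number of passes.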
