How to find original games in big database

Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: How to find original games in big database

Post by Jonathan003 »

Deberger wrote: Thu Apr 23, 2020 4:42 pm
Jonathan003 wrote: Thu Apr 23, 2020 12:57 pm
Deberger wrote: Sun Apr 19, 2020 10:57 pm
> very basic stuff with command line apps

This is all we need.

pgn-extract truncated-games.pgn -C -F > note-final-positions.pgn

Then extract the final positions using your most comfortable method. <---
What to do next?
Here is an example of what 'note-final-positions' looks like:
I like the Linux command line, so something like this:

pgn-extract truncated-games.pgn -C -F | awk -F \" '/{/ {print $2}' > fenlist.txt
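
To see what the awk step does without running pgn-extract, here is a minimal sketch. The sample line is made up: it assumes the final-position comment is written as { "FEN" }, which is what the awk command above presupposes. The variable names sample and fen are only for illustration.

```shell
# Hypothetical sample of one line of output; the exact layout is an
# assumption inferred from the awk command, not copied from a real run.
sample='1. e4 e5 { "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2" } *'

# -F \"    : split each line into fields at every double-quote character
# /{/      : only act on lines that contain an opening brace (the comment)
# print $2 : print field 2, i.e. the text between the first pair of quotes
fen=$(printf '%s\n' "$sample" | awk -F \" '/{/ {print $2}')

printf '%s\n' "$fen"
```

Tag lines such as [Event "..."] also contain quotes, but they have no brace, so the /{/ test skips them.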

>What to do next?

Loop through the fenlist and extract the complete games from some MegaComplete pgn file:

pgn-extract truncated-games.pgn -C -F | awk -F \" '/{/ {print $2}' | while read FEN; do pgn-extract -Tf"$FEN" MegaComplete.pgn; done
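
The loop part of the command above can be tried on its own. This is only a sketch of the shell mechanics: the two FEN strings are invented, and printf stands in for the real pgn-extract call, so nothing here depends on how -Tf behaves.

```shell
# Two hypothetical FEN strings standing in for the contents of fenlist.txt.
fens='8/8/8/8/8/8/8/K6k w - - 0 1
8/8/8/8/8/8/8/k6K b - - 0 1'

# `while read FEN` takes one line at a time from the pipe and stores it
# in the variable FEN. The double quotes in -Tf"$FEN" keep the FEN, which
# contains spaces, together as ONE argument with -Tf glued to the front.
out=$(printf '%s\n' "$fens" | while read -r FEN; do
    # Stand-in for: pgn-extract -Tf"$FEN" MegaComplete.pgn
    printf 'pgn-extract -Tf"%s" MegaComplete.pgn\n' "$FEN"
done)

printf '%s\n' "$out"
```

So the loop builds one command per final position, each with the FEN filled in where $FEN stands.
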
I don't have Linux; I use Windows 10 64-bit.
I don't know what '| awk -F \" '/{/ {print $2}' |' and '-Tf"$FEN"' mean, and I can't find them in the online manual here: https://www.cs.kent.ac.uk/people/staff/ ... lp.html#-w
I think you can do a lot with pgn-extract, but it's complicated for me to figure out how to work with it.
Jonathan003
Posts: 239
Joined: Fri Jul 06, 2018 4:23 pm
Full name: Jonathan Cremers

Re: How to find original games in big database

Post by Jonathan003 »

I found a way to find the original games without pgn-extract. In the 'big' database, I first delete doubles with the standard settings in ChessBase and remove all games shorter than 40 ply. Then I remove all tags from the 'big' pgn database and import it into a 'temp' SCID database. There I search for twins with the standard settings, compact the database, and export it to a pgn database, 'big_no_statics'. In ChessBase I create a new database 'temp' and import both the 'big' database and the 'big_no_statics' database. I set ECO codes and create a search booster. Then I search for doubles: I set everything to 'ignore' except the moves, which I set to 'exact', choose to keep the better game, and choose to clip the doubles. Then I copy the games on the clipboard to a new database, 'temp2', and remove the doubles. I repeat this step until I have only the original games, without the doubles found in SCID.

I create an obk book from this pgn file, convert it to pgn with obk2bin, and remove the games with sidelines. Then I repeat the previous steps to get the original games with only the main lines.