PGN Extract

Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: PGN Extract

Post by Ferdy »

JManion wrote:
Ferdy wrote:
JManion wrote: OK, I have 4 PGN files that are all about 2 GB. If I run the program, it says 1 has no duplicates, 2 has no duplicates, 3 has no duplicates, and 4 has none.

However, I want to find some way to compare files 1 and 2 to see whether they have any duplicates between them (and 1 and 3, 1 and 4, etc.).

Say each of your 4 PGN files contains only unique games:
1. 1.pgn
2. 2.pgn
3. 3.pgn
4. 4.pgn

First Phase:
A. To check whether 1.pgn and 2.pgn have any games in common, use the following command:

Code:

pgn-extract -U -ddup.pgn 1.pgn 2.pgn

Check the output file dup.pgn. Any game in it is common to 1.pgn and 2.pgn. Locate that game (or games) in 2.pgn and delete it; the common game is retained in 1.pgn. Rename the revised 2.pgn to R1-2.pgn.

B. To check whether 1.pgn and 3.pgn have any games in common, use the following command:

Code:

pgn-extract -U -ddup.pgn 1.pgn 3.pgn

Check the output file dup.pgn. Any game in it is common to 1.pgn and 3.pgn. Locate that game (or games) in 3.pgn and delete it; the common game is retained in 1.pgn. Rename the revised 3.pgn to R1-3.pgn.

C. Do the same for 1.pgn vs 4.pgn.

Second Phase:
D. Next, do the same with R1-2.pgn vs R1-3.pgn.
R1-2.pgn refers to the 1st revision of 2.pgn, if 2.pgn was revised in the 1st phase.
R1-3.pgn refers to the 1st revision of 3.pgn, if 3.pgn was revised in the 1st phase.

E. Next, do the same with R1-2.pgn vs R1-4.pgn.

Third Phase:
F. Do the same with R2-3.pgn vs R2-4.pgn.
R2-3.pgn refers to the 2nd revision of 3.pgn, if it was revised; otherwise, always use the latest version of the file.

So what we have done is:
1. 1.pgn vs 2.pgn
2. 1.pgn vs 3.pgn
3. 1.pgn vs 4.pgn

4. 2.pgn vs 3.pgn (use revised files)
5. 2.pgn vs 4.pgn (use revised files)

6. 3.pgn vs 4.pgn (use revised files)

*If there are common games, always delete them from the higher-numbered PGN file. For example, when comparing 1.pgn vs 2.pgn, delete the common games from 2.pgn.

Thank you, Ferdinand.

After I ran 1.pgn vs 2.pgn, I had a dupe file containing 1344 dupes. Is there an easy command that can delete those 1344 games from 2.pgn?

Thanks again.
I checked the documentation of pgn-extract and found that, by combining options, there is a way to output the games in 2.pgn that are not in 1.pgn. There is no need to delete the common games from 2.pgn by hand. Use this command:

Code:

pgn-extract -c1.pgn -dCommon.pgn -oR1-2.pgn 2.pgn
Options:
-c (names the check file)
-d (names the file that receives the common or duplicate games)
-o (names the file that receives the unique games, i.e. those not found in 1.pgn)

Files:
1.pgn (the master file)
Common.pgn (the dupes are written here for inspection)
R1-2.pgn (the revised 2.pgn without the common games; this is the file you are interested in)
2.pgn (the file you want to check against the master file 1.pgn)
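
As a rough illustration of what the -c / -d / -o combination above achieves, here is a minimal Python sketch that splits a candidate list of games into "unique" and "common" relative to a master list. It is only a conceptual model: the function name `split_against_master` is hypothetical, and comparing whitespace-normalised move text is an assumption made for the example, not a description of how pgn-extract itself detects duplicates.

```python
def split_against_master(master_games, candidate_games):
    """Split candidate games into (unique, common) relative to the master list.

    Assumption for this sketch: two games are "the same" when their move
    text matches after collapsing whitespace.
    """
    # Build the set of keys for every game in the master (check) file.
    seen = {" ".join(game.split()) for game in master_games}
    unique, common = [], []
    for game in candidate_games:
        key = " ".join(game.split())
        # Games already in the master go to "common"; the rest are kept.
        (common if key in seen else unique).append(game)
    return unique, common


master = ["1. e4 e5 2. Nf3 Nc6 1-0", "1. d4 d5 0-1"]
candidate = ["1. d4  d5 0-1", "1. c4 e5 1/2-1/2"]
unique, common = split_against_master(master, candidate)
print(unique)  # plays the role of R1-2.pgn
print(common)  # plays the role of Common.pgn
```

The master list is never modified, which mirrors the idea that 1.pgn keeps the common games while the revised second file drops them.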

I tested this command only on small files and it works.

By combining options I guess there are more things this pgn-extract can do that we have not yet discovered :D
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: PGN Extract

Post by Ferdy »

Just create a batch file with the following contents, then save and run it.

Code:

:: 1st phase
pgn-extract -c1.pgn -dCommon1.pgn -oR1-2.pgn 2.pgn
pgn-extract -c1.pgn -dCommon2.pgn -oR1-3.pgn 3.pgn
pgn-extract -c1.pgn -dCommon3.pgn -oR1-4.pgn 4.pgn

:: 2nd phase
pgn-extract -cR1-2.pgn -dCommon4.pgn -oR2-3.pgn R1-3.pgn
pgn-extract -cR1-2.pgn -dCommon5.pgn -oR2-4.pgn R1-4.pgn

:: 3rd phase
pgn-extract -cR2-3.pgn -dCommon6.pgn -oR3-4.pgn R2-4.pgn
Final files without dupes:
1.pgn, R1-2.pgn, R2-3.pgn, R3-4.pgn
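
For more than four files, writing the phases by hand gets tedious. The following Python sketch generates the same kind of command list for any number of input files. It only builds the command strings and does not run pgn-extract; the function name `dedup_commands` is hypothetical, and the `R<phase>-<n>.pgn` and `Common<k>.pgn` names simply follow the naming used in the batch file above.

```python
def dedup_commands(files):
    """Return (commands, final_files) for cross-file duplicate removal.

    In phase p, the latest revision of file p is used as the check file
    (-c) against the latest revision of every higher-numbered file.
    """
    current = list(files)  # latest revision of each file so far
    commands = []
    common_no = 1
    for phase in range(1, len(files)):
        master = current[phase - 1]
        for j in range(phase, len(files)):
            revised = f"R{phase}-{j + 1}.pgn"
            commands.append(
                f"pgn-extract -c{master} -dCommon{common_no}.pgn "
                f"-o{revised} {current[j]}"
            )
            current[j] = revised  # later phases use the revised file
            common_no += 1
    return commands, current


commands, final_files = dedup_commands(["1.pgn", "2.pgn", "3.pgn", "4.pgn"])
for line in commands:
    print(line)
print("Final files without dupes:", ", ".join(final_files))
```

The printed lines can be pasted into a batch file. Each phase rewrites every higher-numbered file, so the number of passes grows quickly with the number of files.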
Guerrero
Posts: 40
Joined: Sun Jul 08, 2007 2:05 am

Re: PGN Extract

Post by Guerrero »

I made a Windows interface for pgn-extract in 2004.
It was bundled with my old WinboardTournamentManager.

I have rebuilt it as a separate program, and it is available here:

PGNExtract_GUI

Direct link:
http://www.mediafire.com/?6f08853mf8pdt6q

Site:
http://chessprograms.260mb.com/
JManion
Posts: 205
Joined: Wed Dec 23, 2009 8:53 am

Re: PGN Extract

Post by JManion »

Thanks for all the help, my database is looking good!
JManion
Posts: 205
Joined: Wed Dec 23, 2009 8:53 am

Re: PGN Extract

Post by JManion »

I am having a strange problem with one of my other databases. I ran it through the program with the same .bat I had been using:

pgn-extract -dDuplicateGames.pgn -oUniqueGames.pgn orig_file.pgn

The file I am using it on has 2,100,000 games, and the PGN is roughly 1.91 GB.

I have run Find doubles in CB many times, so I assumed this would not find many doubles.

I run the program, and I keep getting the same result. It stops when the unique-games file reaches 650k games and 564 MB. The duplicate-games file holds only 300 games and is a few KB in size.

I tried running the .bat again from the start but got the same result.
Ferdy
Posts: 4840
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: PGN Extract

Post by Ferdy »

You ran CB, so why do you assume it did not find many doubles?

If you want to run pgn-extract anyway, try the -Z option (for large files), as in:

Code:

pgn-extract -Z -dDuplicateGames.pgn -oUniqueGames.pgn orig_file.pgn

Be prepared to set aside around 50 MB of extra disk space (for your 2.1 million games) for the temporary file that pgn-extract creates.
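
To check whether a run actually processed the whole database, one option is to count the games in the input and output files and compare the totals. The Python sketch below counts games by counting lines that start with an `[Event ` tag. This assumes every game begins with an Event tag, as is usual for exported PGN; files that omit the tag would be miscounted. The function name `count_games` is hypothetical.

```python
import io


def count_games(stream):
    """Count games in a PGN text stream by counting [Event tag lines."""
    # The trailing space avoids matching other tags such as [EventDate.
    return sum(1 for line in stream if line.startswith("[Event "))


sample = io.StringIO(
    '[Event "A"]\n[Site "?"]\n\n1. e4 e5 1-0\n\n'
    '[Event "B"]\n[EventDate "2010.01.01"]\n\n1. d4 d5 0-1\n'
)
print(count_games(sample))
```

With real files, open each one as text (for example `open("orig_file.pgn", encoding="latin-1", errors="replace")`) and pass the file object to `count_games`. If the counts for UniqueGames.pgn and DuplicateGames.pgn add up to far fewer games than orig_file.pgn contains, the run stopped early rather than finishing.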