scid - pgn database size limitation?

stevenaaus
Posts: 602
Joined: Wed Oct 13, 2010 7:44 am
Location: Australia

Re: scid - pgn database size limitation?

Post by stevenaaus » Fri Feb 23, 2018 8:41 pm

Guenther wrote: I am still extracting one of those gigantic lichess databases out of curiosity.
The one I have downloaded (2018/01) probably will be around 35GB! decompressed.
Yeah, the problem is that modern PGN files can be so big.

I'll just mention my new command line utility sc_filter_pgn.tcl (in 'scripts' in the source tree).
# sc_filter_pgn
# Using several PGN files, copy games matching position <fen> to a database
#
# Usage: sc_filter_pgn <database> <fen> <pgn-files....>
It is helpful for people who want to filter games containing a certain position out of numerous large PGN files. It does not address the problem of a single PGN file exceeding the game limit, but some PGN sources are published in increments.

styx
Posts: 338
Joined: Tue Mar 13, 2012 8:59 pm
Location: Germany

Re: scid - pgn database size limitation?

Post by styx » Fri Feb 23, 2018 8:42 pm

I tested it with tagRemove, as Juan suggested (side note: it only works in Windows, not in Linux via Wine).

I successfully imported a 5.8 GB PGN file (6.22 million games). It only works with the latest SCID sources. It also works with "SCID vs. PC" (tested 4.16 and 4.18).

8 GB of RAM on this machine.

Dann Corbit
Posts: 9847
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: scid - pgn database size limitation?

Post by Dann Corbit » Fri Feb 23, 2018 9:20 pm

Ozymandias wrote:
Guenther wrote: I am trying now to clean up the site tags, but even UltraEdit needs quite some time to open a 5GB pgn file.
You can try Norman Pollock's tagRemove, to completely get rid of the tag. It should be faster.
Why not fix the code?
It's open source.

I imagine that there are probably a couple of macros in the header with something like:

#define MAX_SIZE_FOO 0x1000000
that just need to be made bigger.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

brianr
Posts: 347
Joined: Thu Mar 09, 2006 2:01 pm

Re: scid - pgn database size limitation?

Post by brianr » Fri Feb 23, 2018 9:46 pm

FWIW, a free tool I find useful for very large files under Windows is PilotEdit Lite

Adam Hair
Posts: 3201
Joined: Wed May 06, 2009 8:31 pm
Location: Fuquay-Varina, North Carolina

Re: scid - pgn database size limitation?

Post by Adam Hair » Fri Feb 23, 2018 10:06 pm

styx wrote: I tested it with tagRemove, as Juan suggested (side note: it only works in Windows, not in Linux via Wine).
In Linux I use the Java class files that Norm has available instead of the Windows binaries.

Dann Corbit
Posts: 9847
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: scid - pgn database size limitation?

Post by Dann Corbit » Fri Feb 23, 2018 10:44 pm

I have looked at the code for SCID and ScidVsPC and have discovered the 16 million limit (16777215) is cast in stone because they encode sizes in 3 bytes so nothing can be bigger than 2^24 - 1.
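To make the arithmetic concrete, here is a minimal Python sketch of a 3-byte unsigned field. It is not taken from the SCID sources: the helper names and the big-endian byte order are assumptions chosen for illustration. It only shows why a value stored in 3 bytes cannot exceed 2^24 - 1.

```python
def encode_u24(value: int) -> bytes:
    """Pack a non-negative integer into exactly 3 bytes.

    Big-endian order is an assumption made for this sketch.
    int.to_bytes raises OverflowError when the value does not fit.
    """
    return value.to_bytes(3, byteorder="big")


def decode_u24(raw: bytes) -> int:
    """Inverse of encode_u24."""
    return int.from_bytes(raw, byteorder="big")


# Largest value representable in 3 bytes (24 bits).
MAX_U24 = 2**24 - 1

if __name__ == "__main__":
    print(MAX_U24)                    # the "16 million" limit
    print(encode_u24(MAX_U24).hex())  # all 24 bits set
    try:
        encode_u24(MAX_U24 + 1)       # one more does not fit
    except OverflowError as exc:
        print("does not fit in 3 bytes:", exc)
```

Widening such a field changes the on-disk layout, which is why a bigger limit cannot be had by editing a single constant.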

Fulvio
Posts: 142
Joined: Fri Aug 12, 2016 6:43 pm

Re: scid - pgn database size limitation?

Post by Fulvio » Sat Feb 24, 2018 12:08 am

Dann Corbit wrote: the 16 million limit (16777215) is cast in stone because they encode sizes in 3 bytes so nothing can be bigger than 2^24 - 1.
The current git version of SCID limits PGN databases to a maximum file size of 64TB, 4 billion games (2^32 - 2), and 256 million unique names for each tag:
https://sourceforge.net/p/scid/code/ci/ ... mory.h#l37
PGN databases are parsed and converted into memory databases: in most cases the available memory (RAM + swap) will be the real limit.

The limits for SCID4 databases are max 4GB file size, 16 million games, 1 million unique player names, 524287 unique "Event" names, 524287 unique "Site" names, 262143 unique "Round" names:
https://sourceforge.net/p/scid/code/ci/ ... cid4.h#l40
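For anyone who wants to check a PGN file against these limits before importing, a rough Python sketch follows. It is not part of SCID. The function names are made up, the round figures "16 million" and "1 million" are assumed to mean 2^24 - 1 and 2^20 - 1, and a game is assumed to start with an Event tag.

```python
import re
import sys
from typing import Dict, Iterable, List

# Limits as quoted in the post. 2**24 - 1 and 2**20 - 1 are assumptions
# standing in for the round "16 million" and "1 million" figures.
SCID4_LIMITS = {
    "games": 2**24 - 1,
    "Player": 2**20 - 1,
    "Event": 2**19 - 1,  # 524287
    "Site": 2**19 - 1,   # 524287
    "Round": 2**18 - 1,  # 262143
}

# Matches a PGN tag pair such as: [Site "Berlin"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def count_pgn_usage(lines: Iterable[str]) -> Dict[str, int]:
    """Count games and unique Player/Event/Site/Round names in PGN lines."""
    games = 0
    names = {"Player": set(), "Event": set(), "Site": set(), "Round": set()}
    for line in lines:
        match = TAG_RE.match(line)
        if not match:
            continue
        tag, value = match.group(1), match.group(2)
        if tag == "Event":
            games += 1  # assumes every game begins with an Event tag
        if tag in ("White", "Black"):
            names["Player"].add(value)
        elif tag in ("Event", "Site", "Round"):
            names[tag].add(value)
    usage = {kind: len(values) for kind, values in names.items()}
    usage["games"] = games
    return usage


def exceeded(usage: Dict[str, int]) -> List[str]:
    """Return the names of the limits that the counted usage goes over."""
    return [k for k, limit in SCID4_LIMITS.items() if usage.get(k, 0) > limit]


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        result = count_pgn_usage(handle)
    print(result, "over limit:", exceeded(result))
```

Note that the sets of unique names are held in memory, so a file in which every game has its own Site value needs a lot of RAM for the check itself.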

Although extending the limits of SCID4 databases is not technically hard, the result would not be compatible with older SCID versions (a new SCID5 would be necessary).

Fulvio
Posts: 142
Joined: Fri Aug 12, 2016 6:43 pm

Re: scid - pgn database size limitation?

Post by Fulvio » Sat Feb 24, 2018 12:30 am

Guenther wrote:I have no clue though, if it will run/work on a file described above with
my ram limitation of 4GB? (cannot be extended due to bios)
This is under WIN7-64 Ultimate, NTFS of course, thus OS file size limitation
is no problem.
I have tested lichess databases with the current git version of SCID on Windows 10 with 6GB of RAM. The memory allocated would be larger than that, but it works because Windows uses on-the-fly memory compression and the size of virtual memory can be enlarged.

However, as others have already correctly pointed out, it is not possible to create a single SCID4 database with all the games due to the limits of max 524287 unique "Site" names and max 16 million games.

Norm Pollock
Posts: 1017
Joined: Thu Mar 09, 2006 3:15 pm
Location: Long Island, NY, USA

Re: scid - pgn database size limitation?

Post by Norm Pollock » Sat Feb 24, 2018 1:34 am

A roundabout way to process extremely large pgn files (over 16M games for example) is to split the file into files of an "equal" number of games, and then process each of the smaller new files. "gameSplit" from 40H-PGN tools:

"gameSplit" separates the input "pgn" file into a user-specified
number (2-10000) of files. All games are kept intact within one of
the output files. The number of games in each output file are
"equal", plus or minus 1 game.
It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change. -- Charles Darwin

Guenther
Posts: 2935
Joined: Wed Oct 01, 2008 4:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: scid - pgn database size limitation?

Post by Guenther » Sat Feb 24, 2018 11:46 am

styx wrote: I tested it with tagRemove, as Juan suggested (side note: it only works in Windows, not in Linux via Wine).

I successfully imported a 5.8 GB PGN file (6.22 million games). It only works with the latest SCID sources. It also works with "SCID vs. PC" (tested 4.16 and 4.18).

8 GB of RAM on this machine.
1. I created smaller chunks out of the 35GB PGN file, each limited to 3GB, with PGNSplit as recommended by lichess (quite fast!)

2. With the help of Norm's tools tagRemove and tagNull I removed 5 tags I did not need and set Event + Site to Null

3. The first chunk now contains around 1.5M games and can be imported both ways: a) from cmd with pgnscid and b) via the GUI (Scid vs. PC 4.18)

4. Every time, after importing more games into the database, it crashes at exactly 2GB size for the sg4 file (same problem as Juan already reported).

5. A fresh install of Scid 4.64 even crashes before the first chunk is completely imported (this was with the GUI; no cmd tool is available after install).
IIRC the size of the sg4 file was the same as for the first import in Scid vs. PC, so it probably crashed while creating the si4 or sn4 file.
It would have been around 3 times faster than Scid vs. PC via the GUI, though.

This means I am stuck here with max 2GB sg4 files...
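For readers without access to Norm's tools, the tag clean-up in step 2 can be approximated with a short streaming filter in Python. This is only a sketch and not how tagRemove or tagNull are implemented. The function name is invented, the tag names in the usage example are hypothetical, and writing "?" as the null value is an assumption based on the PGN convention for unknown values.

```python
import re
from typing import Iterable, Iterator, Set

# Matches a PGN tag pair such as: [Site "Berlin"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def strip_tags(lines: Iterable[str], remove: Set[str], null: Set[str]) -> Iterator[str]:
    """Yield PGN lines with some tags dropped and others reset to "?".

    Lines are processed one at a time, so memory use stays constant
    regardless of file size.
    """
    for line in lines:
        match = TAG_RE.match(line)
        if match:
            tag = match.group(1)
            if tag in remove:
                continue  # drop the whole tag line
            if tag in null:
                yield f'[{tag} "?"]\n'  # "?" as null value is an assumption
                continue
        yield line


if __name__ == "__main__":
    # Hypothetical example input; the tag choices are for illustration only.
    game = [
        '[Event "Rated Blitz game"]\n',
        '[Site "https://example.org/abc"]\n',
        '[UTCTime "12:00:00"]\n',
        '[White "a"]\n',
        '\n',
        '1. e4 *\n',
    ]
    for out_line in strip_tags(game, remove={"UTCTime"}, null={"Event", "Site"}):
        print(out_line, end="")
```

Setting Event and Site to a single shared value is what keeps the number of unique names far below the SCID4 name limits.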

Thanks to all who tried to help anyway.
Current foe list count : [92 - still rising]
http://rwbc-chess.de/chronology.htm
