Re: scid - pgn database size limitation?

Posted: Fri Feb 23, 2018 8:41 pm
by stevenaaus
Guenther wrote:I am still extracting one of those gigantic lichess databases out of curiosity.
The one I have downloaded (2018/01) probably will be around 35GB! decompressed.
Yeah.. the problem is modern PGN files can be so big.

I'll just mention my new command line utility sc_filter_pgn.tcl (in 'scripts' in the source tree).
# sc_filter_pgn
# Using several PGN files, copy games matching position <fen> to a database
#
# Usage: sc_filter_pgn <database> <fen> <pgn-files....>
It is helpful for people wanting to filter out a certain position from numerous large pgn files. It does not address the problem with a single pgn file maxing out the game limit, but some pgn sources come in increments.

Re: scid - pgn database size limitation?

Posted: Fri Feb 23, 2018 8:42 pm
by styx
I tested it with tagRemove as Juan suggested (side note: it only works in windows and not in linux via WINE)

I successfully imported a 5,8 GB PGN file (6,22 million games). It only works with the latest sources of SCID. It also works with "SCID vs. PC" (tested 4.16 and 4.18).

8 GB of RAM on this machine.

Re: scid - pgn database size limitation?

Posted: Fri Feb 23, 2018 9:20 pm
by Dann Corbit
Ozymandias wrote:
Guenther wrote:I am trying now to clean up the site tags, but even UltraEdit needs quite
some time to open a 5GB pgn file.
You can try Norman Pollock's tagRemove, to completely get rid of the tag. It should be faster.
Why not fix the code?
It's open source.

I imagine that there are probably a couple macros in the header with something like:

#define MAX_SIZE_FOO 0x1000000
that just need to be made bigger.

Re: scid - pgn database size limitation?

Posted: Fri Feb 23, 2018 9:46 pm
by brianr
FWIW, a free tool I find useful for very large files under Windows is PilotEdit Lite

Re: scid - pgn database size limitation?

Posted: Fri Feb 23, 2018 10:06 pm
by Adam Hair
styx wrote:I tested it with tagRemove as Juan suggested (side note: it only works in windows and not in linux via WINE)
In Linux I use the Java class files that Norm has available instead of the Windows binaries.

Re: scid - pgn database size limitation?

Posted: Fri Feb 23, 2018 10:44 pm
by Dann Corbit
I have looked at the code for SCID and ScidVsPC and have discovered the 16 million limit (16777215) is cast in stone because they encode sizes in 3 bytes so nothing can be bigger than 2^24 - 1.
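
To make the arithmetic concrete, here is a minimal Python sketch of a 3-byte unsigned field. It is not SCID's code: the helper names are hypothetical and the big-endian byte order is an assumption chosen only for illustration.

```python
# Largest unsigned value that fits in 3 bytes (24 bits).
MAX_3_BYTE = 2**24 - 1  # 16777215


def encode_3_bytes(value: int) -> bytes:
    """Pack an unsigned integer into exactly 3 bytes (big-endian, assumed)."""
    if not 0 <= value <= MAX_3_BYTE:
        raise OverflowError(f"{value} does not fit in 3 bytes")
    return value.to_bytes(3, "big")


def decode_3_bytes(data: bytes) -> int:
    """Inverse of encode_3_bytes."""
    return int.from_bytes(data, "big")
```

Any counter or offset stored this way stops at 16777215; one more and the value no longer fits, which is why the limit cannot be raised without changing the on-disk format.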

Re: scid - pgn database size limitation?

Posted: Sat Feb 24, 2018 12:08 am
by Fulvio
Dann Corbit wrote: the 16 million limit (16777215) is cast in stone because they encode sizes in 3 bytes so nothing can be bigger than 2^24 - 1.
The current git version of SCID limits PGN databases to max 64TB file size, 4 billion games (2^32 -2), 256 million unique names for each tag:
https://sourceforge.net/p/scid/code/ci/ ... mory.h#l37
PGN databases are parsed and converted into memory databases: in most cases the available memory (RAM + swap) will be the real limit.

The limits for SCID4 databases are max 4GB file size, 16 million games, 1 million unique player names, 524287 unique "Event" names, 524287 unique "Site" names, 262143 unique "Round" names:
https://sourceforge.net/p/scid/code/ci/ ... cid4.h#l40

Even if technically it's not hard to extend the limits of SCID4 databases, it would not be compatible with older SCID versions (a new SCID5 is necessary).
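
As a rough illustration, the SCID4 limits quoted above can be written down as a small pre-import check. This is only a sketch in Python: the dictionary keys and function name are hypothetical, "16 million" is taken as 2^24 - 1 (the figure quoted earlier in the thread), and "1 million" player names is assumed to mean 2^20 - 1.

```python
# SCID4 limits as quoted in this thread; values marked "assumed" are not
# stated exactly in the post.
SCID4_LIMITS = {
    "games": 2**24 - 1,    # "16 million games" (16777215)
    "players": 2**20 - 1,  # "1 million unique player names" (assumed exact value)
    "events": 524287,      # unique "Event" names
    "sites": 524287,       # unique "Site" names
    "rounds": 262143,      # unique "Round" names
}


def exceeded_limits(counts: dict) -> list:
    """Return the names of all limits that the given counts exceed."""
    return [
        name
        for name, limit in SCID4_LIMITS.items()
        if counts.get(name, 0) > limit
    ]
```

For example, a PGN file with 6.22 million games but 600000 distinct "Site" values would fail only on `sites`, which matches the reason for nulling or removing that tag before import.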

Re: scid - pgn database size limitation?

Posted: Sat Feb 24, 2018 12:30 am
by Fulvio
Guenther wrote:I have no clue though, if it will run/work on a file described above with
my ram limitation of 4GB? (cannot be extended due to bios)
This is under WIN7-64 Ultimate, NTFS of course, thus OS file size limitation
is no problem.
I have tested lichess databases with the current git version of SCID, Windows 10 and 6GB of RAM. The memory allocated would be larger than the physical RAM, but it works because Windows uses on-the-fly memory compression and the size of virtual memory can be enlarged.

However, as others have already correctly pointed out, it is not possible to create a single SCID4 database with all the games due to the limits of max 524287 unique "Site" names and max 16 million games.

Re: scid - pgn database size limitation?

Posted: Sat Feb 24, 2018 1:34 am
by Norm Pollock
A roundabout way to process extremely large pgn files (over 16M games for example) is to split the file into files of an "equal" number of games, and then process each of the smaller new files. "gameSplit" from 40H-PGN tools:

"gameSplit" separates the input "pgn" file into a user-specified
number (2-10000) of files. All games are kept intact within one of
the output files. The number of games in each output file are
"equal", plus or minus 1 game.
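
The same idea can be sketched in a few lines of Python. This is not gameSplit's implementation: the function names and output file naming are hypothetical, and it assumes every game starts with an `[Event ` tag line.

```python
def is_game_start(line: bytes) -> bool:
    # Assumption: each game begins with an [Event "..."] tag line.
    return line.startswith(b"[Event ")


def split_pgn(src: str, parts: int, out_prefix: str) -> list:
    """Split src into `parts` files whose game counts differ by at most 1.

    Returns the number of games written to each output file.
    """
    # First pass: count the games.
    with open(src, "rb") as f:
        total = sum(1 for line in f if is_game_start(line))
    if parts < 1 or total < parts:
        raise ValueError("need at least one game per output file")

    # The first `extra` files receive one additional game.
    base, extra = divmod(total, parts)
    quotas = [base + (1 if i < extra else 0) for i in range(parts)]

    # Second pass: copy lines, switching files only on a game boundary.
    part, written = 0, 0
    out = open(f"{out_prefix}_{part + 1}.pgn", "wb")
    try:
        with open(src, "rb") as f:
            for line in f:
                if is_game_start(line):
                    if written == quotas[part] and part < parts - 1:
                        out.close()
                        part += 1
                        out = open(f"{out_prefix}_{part + 1}.pgn", "wb")
                        written = 0
                    written += 1
                out.write(line)
    finally:
        out.close()
    return quotas
```

Reading the file twice and working in binary mode keeps memory use flat and leaves every game byte-for-byte intact, which matters for inputs in the tens of gigabytes.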

Re: scid - pgn database size limitation?

Posted: Sat Feb 24, 2018 11:46 am
by Guenther
styx wrote:I tested it with tagRemove as Juan suggested (side note: it only works in windows and not in linux via WINE)

I successfully imported a 5,8 GB PGN file (6,22 million games). It only works with the latest sources of SCID. It also works with "SCID vs. PC" (tested 4.16 and 4.18).

8 GB of RAM on this machine.
1. I created smaller chunks out of the 35GB pgn file, now limited to 3GB each with PGNSplit recommended by lichess (quite fast!)

2. With the help of Norm's tools tagRemove and tagNull I removed 5 tags I did not need and set Event + Site to Null

3. The first chunk now contains around 1.5M games and it can be imported both ways: a) from cmd with pgnscid and b) via GUI (Scid vs. PC 4.18)

4. Each and every time, after importing more games into the database, it crashes at exactly 2GB size for the sg4 file. (same problem as Juan already reported)

5. A fresh install of Scid 4.64 even crashes before! the first chunk is completely imported (this was with GUI - no cmd tool available after install)
IIRC the size of the sg4 file was the same as for the first import in Scid vs. PC, thus it crashed probably due to creating the si4 or sn4 file.
It would have been around 3 times faster than Scid vs. PC via GUI though.

This means I am stuck here with max 2GB sg4 files...

Thanks to all, who tried to help anyways