scid - pgn database size limitation?

stevenaaus
Posts: 608
Joined: Wed Oct 13, 2010 9:44 am
Location: Australia

Re: scid - pgn database size limitation?

Post by stevenaaus »

Guenther wrote: I am still extracting one of those gigantic lichess databases out of curiosity.
The one I have downloaded (2018/01) will probably be around 35GB decompressed!

Yeah, the problem is that modern PGN files can be so big.

I'll just mention my new command-line utility sc_filter_pgn.tcl (in 'scripts' in the source tree).

# sc_filter_pgn
# Using several PGN files, copy games matching position <fen> to a database
#
# Usage: sc_filter_pgn <database> <fen> <pgn-files....>

It is helpful for people who want to extract the games that reach a certain position from numerous large PGN files. It does not address the problem of a single PGN file exceeding the game limit, but some PGN sources come in increments.
styx
Posts: 338
Joined: Tue Mar 13, 2012 9:59 pm
Location: Germany

Post by styx »

I tested it with tagRemove as Juan suggested (side note: it only works in Windows, not in Linux via Wine).

I successfully imported a 5.8 GB PGN file (6.22 million games). It only works with the latest sources of SCID. It also works with "SCID vs. PC" (tested 4.16 and 4.18).

This machine has 8 GB of RAM.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Post by Dann Corbit »

Ozymandias wrote:
Guenther wrote: I am now trying to clean up the Site tags, but even UltraEdit needs quite some time to open a 5GB PGN file.
You can try Norman Pollock's tagRemove to get rid of the tag completely. It should be faster.

Why not fix the code?
It's open source.

I imagine that there are probably a couple macros in the header with something like:

#define MAX_SIZE_FOO 0x1000000
that just need to be made bigger.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Post by brianr »

FWIW, a free tool I find useful for very large files under Windows is PilotEdit Lite.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Post by Adam Hair »

styx wrote: I tested it with tagRemove as Juan suggested (side note: it only works in Windows, not in Linux via Wine)

In Linux I use the Java class files that Norm has available instead of the Windows binaries.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Post by Dann Corbit »

I have looked at the code for SCID and ScidVsPC and have discovered that the 16 million limit (16777215) is cast in stone: they encode sizes in 3 bytes, so nothing can be bigger than 2^24 - 1.
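
To illustrate the arithmetic only (this is not SCID's code): a value stored in three bytes cannot exceed 2^24 - 1 = 16777215. A minimal Python sketch follows; the helper names and the big-endian byte order are assumptions chosen for the example.

```python
def encode_3_bytes(value: int) -> bytes:
    """Pack a non-negative integer into exactly 3 bytes (big-endian assumed)."""
    # int.to_bytes raises OverflowError if the value does not fit in 3 bytes.
    return value.to_bytes(3, byteorder="big")


def decode_3_bytes(raw: bytes) -> int:
    """Inverse of encode_3_bytes."""
    return int.from_bytes(raw, byteorder="big")


MAX_3_BYTE_VALUE = 2**24 - 1  # 16777215, the limit quoted above

# The largest value survives a round trip through 3 bytes...
assert decode_3_bytes(encode_3_bytes(MAX_3_BYTE_VALUE)) == MAX_3_BYTE_VALUE

# ...but one more no longer fits.
try:
    encode_3_bytes(MAX_3_BYTE_VALUE + 1)
    fits = True
except OverflowError:
    fits = False

print(MAX_3_BYTE_VALUE, fits)
```

Raising the limit therefore means widening the stored field, which changes the on-disk layout rather than a single constant.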
Fulvio
Posts: 395
Joined: Fri Aug 12, 2016 8:43 pm

Post by Fulvio »

Dann Corbit wrote: the 16 million limit (16777215) is cast in stone because they encode sizes in 3 bytes so nothing can be bigger than 2^24 - 1.

The current git version of SCID limits PGN databases to a maximum file size of 64TB, 4 billion games (2^32 - 2), and 256 million unique names for each tag:
https://sourceforge.net/p/scid/code/ci/ ... mory.h#l37
PGN databases are parsed and converted into memory databases: in most cases the available memory (RAM + swap) will be the real limit.

The limits for SCID4 databases are a maximum file size of 4GB, 16 million games, 1 million unique player names, 524287 unique "Event" names, 524287 unique "Site" names, and 262143 unique "Round" names:
https://sourceforge.net/p/scid/code/ci/ ... cid4.h#l40

Although it is technically not hard to extend the limits of SCID4 databases, the result would not be compatible with older SCID versions (a new SCID5 would be necessary).
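
As a rough way to see in advance whether a PGN file would run into the SCID4 limits listed above, one can count the games and the unique tag values. The Python sketch below is only an illustration: the names `SCID4_LIMITS`, `count_pgn_usage` and `exceeded_limits` are made up for this example, "16 million" and "1 million" are assumed to mean 2^24 - 1 and 2^20 - 1, and the parsing is line-based (it assumes one tag per line and that every game starts with an Event tag).

```python
import re
from collections import defaultdict

# Limits as quoted in the post above.
SCID4_LIMITS = {
    "games": 2**24 - 1,   # "16 million" (16777215)
    "Player": 2**20 - 1,  # "1 million"; the exact value is an assumption
    "Event": 524_287,
    "Site": 524_287,
    "Round": 262_143,
}

# Matches a PGN tag pair such as: [Site "https://lichess.org/abc"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def count_pgn_usage(lines):
    """Count games and unique tag values in an iterable of PGN lines."""
    names = defaultdict(set)
    games = 0
    for line in lines:
        match = TAG_RE.match(line)
        if not match:
            continue
        tag, value = match.groups()
        if tag == "Event":
            games += 1  # assumes every game starts with an Event tag
        if tag in ("White", "Black"):
            names["Player"].add(value)
        elif tag in ("Event", "Site", "Round"):
            names[tag].add(value)
    usage = {"games": games}
    for key in ("Player", "Event", "Site", "Round"):
        usage[key] = len(names[key])
    return usage


def exceeded_limits(usage, limits=SCID4_LIMITS):
    """Return the names of all limits that the counted usage exceeds."""
    return [key for key, limit in limits.items() if usage[key] > limit]


# Tiny in-memory demo; for a real file, pass the open file object instead.
sample = [
    '[Event "Rated Blitz game"]\n',
    '[Site "https://lichess.org/abc"]\n',
    '[White "alice"]\n',
    '[Black "bob"]\n',
    '\n',
    '1. e4 e5 1-0\n',
]
usage = count_pgn_usage(sample)
print(usage, exceeded_limits(usage))
```

Because the function only iterates over lines, a multi-gigabyte file can be streamed through it, although the sets of unique names still have to fit in memory.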
Fulvio
Posts: 395
Joined: Fri Aug 12, 2016 8:43 pm

Post by Fulvio »

Guenther wrote: I have no clue, though, whether it will work on a file like the one described above with my RAM limitation of 4GB (cannot be extended due to the BIOS).
This is under WIN7-64 Ultimate, NTFS of course, so the OS file size limitation is no problem.

I have tested lichess databases with the current git version of SCID, Windows 10 and 6GB of RAM. The memory allocated would be larger than that, but it works because Windows uses on-the-fly memory compression and the size of virtual memory can be enlarged.

However, as others have already correctly pointed out, it is not possible to create a single SCID4 database with all the games, because of the limits of 524287 unique "Site" names and 16 million games.
Norm Pollock
Posts: 1056
Joined: Thu Mar 09, 2006 4:15 pm
Location: Long Island, NY, USA

Post by Norm Pollock »

A roundabout way to process extremely large PGN files (over 16M games, for example) is to split the file into files with an "equal" number of games, and then process each of the smaller new files. "gameSplit" from the 40H-PGN tools does this:

"gameSplit" separates the input "pgn" file into a user-specified
number (2-10000) of files. All games are kept intact within one of
the output files. The number of games in each output file is
"equal", plus or minus 1 game.
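
The idea described above can be sketched in a few lines of Python. This is not how gameSplit itself is implemented; it is a minimal two-pass illustration with made-up function names, and it assumes that every game starts with an [Event tag at the beginning of a line.

```python
def split_games(lines):
    """Yield each game as a list of lines; a game starts at an [Event tag."""
    game = []
    for line in lines:
        if line.startswith("[Event "):
            if game:
                yield game
            game = []
        elif not game:
            continue  # skip anything before the first game
        game.append(line)
    if game:
        yield game


def chunk_sizes(total_games, parts):
    """Sizes of `parts` chunks whose game counts differ by at most one."""
    base, extra = divmod(total_games, parts)
    return [base + 1 if i < extra else base for i in range(parts)]


def split_pgn(path, parts):
    """Write path.part1 ... path.partN, keeping every game intact."""
    # latin-1 maps every byte to a character and newline="" disables newline
    # translation, so the content is copied through unchanged.
    options = {"encoding": "latin-1", "newline": ""}

    # First pass: count the games without loading the file into memory.
    with open(path, **options) as src:
        total = sum(1 for line in src if line.startswith("[Event "))

    # Second pass: stream the games into the output files.
    with open(path, **options) as src:
        games = split_games(src)
        for index, size in enumerate(chunk_sizes(total, parts), start=1):
            with open(f"{path}.part{index}", "w", **options) as out:
                for _ in range(size):
                    out.writelines(next(games))


# Example call (hypothetical file name):
# split_pgn("big_file.pgn", 10)
```

Reading the file twice costs time, but it avoids holding a multi-gigabyte file in RAM, which matters on machines with only a few GB of memory.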
Updated links for 40H Tools and Databases
http://40Hchess.epizy.com
http://nk-qy.info/40h
Guenther
Posts: 4605
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Post by Guenther »

styx wrote: I tested it with tagRemove as Juan suggested (side note: it only works in Windows, not in Linux via Wine)

I successfully imported a 5.8 GB PGN file (6.22 million games). It only works with the latest sources of SCID. It also works with "SCID vs. PC" (tested 4.16 and 4.18).

8 GB of RAM on this machine.

1. I split the 35GB PGN file into smaller chunks of at most 3GB each with PGNSplit, which lichess recommends (quite fast!).

2. With the help of Norm's tools tagRemove and tagNull, I removed 5 tags I did not need and set Event and Site to null.

3. The first chunk now contains around 1.5M games, and it can be imported both ways: a) from the command line with pgnscid and b) via the GUI (Scid vs. PC 4.18).

4. Every time I import more games into the database afterwards, it crashes when the sg4 file reaches exactly 2GB (the same problem Juan already reported).

5. A fresh install of Scid 4.64 crashes even before the first chunk is completely imported (this was with the GUI; no command-line tool is available after install).
IIRC the size of the sg4 file was the same as for the first import in Scid vs. PC, so it probably crashed while creating the si4 or sn4 file.
It would have been around 3 times faster than Scid vs. PC via the GUI, though.

This means I am stuck here with max 2GB sg4 files...

Thanks to all who tried to help anyway.
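
For readers without the Windows tools, step 2 above (dropping unneeded tags and nulling Event and Site) can be approximated with a short Python filter. This is only a sketch and not what tagRemove or tagNull do internally: the function name `clean_tags` is made up, "null" is assumed to mean the PGN placeholder value "?", and WhiteElo is just an arbitrary stand-in, since the post does not say which 5 tags were removed.

```python
import re

# Matches a PGN tag pair such as: [Site "https://lichess.org/abc"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def clean_tags(lines, remove=(), null=()):
    """Yield PGN lines with some tags dropped and others set to "?"."""
    for line in lines:
        match = TAG_RE.match(line)
        if match:
            tag = match.group(1)
            if tag in remove:
                continue  # drop the whole tag line
            if tag in null:
                line = f'[{tag} "?"]\n'  # keep the tag, blank out its value
        yield line


# Tiny in-memory demo; for a real file, stream from one open file to another.
sample = [
    '[Event "Rated Blitz game"]\n',
    '[Site "https://lichess.org/abc"]\n',
    '[WhiteElo "1500"]\n',
    '\n',
    '1. e4 e5 1-0\n',
]
cleaned = list(clean_tags(sample, remove={"WhiteElo"}, null={"Event", "Site"}))
print("".join(cleaned), end="")
```

With every Site value set to the same placeholder, the number of unique "Site" names drops to one, which takes that particular name limit out of the picture.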
https://rwbc-chess.de
