phhnguyen wrote: ↑Sat Oct 23, 2021 1:12 pm
You are right! But… not totally

...
At the moment for SQL we store a game as two strings, one for the starting FEN, another for the move list (SAN moves).
Don't get me wrong, I fully intend to follow this thread (and possibly provide some input or try some things in Rust myself), because I'd rather have a somewhat larger SQL database than a smaller binary database. I'm not trying to store the 100 million games Lichess puts out every day; if I can store my old MegaBase 2017 with TWIC updates (roughly 8 million games), that'd be fine; even if it's bigger than the Chessbase database.
The database could be compressed with zip or 7-zip; if the uncompressed DB is under 24 GB in size, I'd probably be able to load it right into the computer's main memory if need be. But we'd get to that later.
One way to quickly find every position in each was described by Daniel: you could store a Zobrist hash and a GUID for each position you encounter during the PGN -> DB conversion. The DB could then have a table in between the game and position table (don't know the english name: Dutch is "koppel-tabel", or "connection table"), where each record stores a reference to the game, to a position, and an order. So you could query a game from that table, order it on the "order" field, and you have all the positions occurring in that game. You could also query the table on a certain position and (distinct) game Id's, and you instantly got any game in which this position occurs.
I'm not too concerned with regard to performance (is SQLite multi-threaded? Probably depends on the programming library you use), but it would probably take significantly more storage than a binary format. Even if it takes 1.5 or 2x the storage, I'd probably prefer this over yet another binary format.