BCGN - an alternative to the PGN format for mass chess game storage.

Sopel · Post by **Sopel** » Mon Apr 20, 2020 4:33 pm

I was fed up with the complexity, verbosity, terrible scheme for parsing by computers, and an overcomplicated and underspecified "standard" of PGN. I also didn't like that 6 years of stripped lichess PGN files is taking 800GB space on my hard drive. (Don't let me get started on how ridiculously slow python-chess pgn parser is. That's where it all begun.) I have implemented a relatively conforming (can parse lichess files without problems, any bigger diversion from their format I consider PGN abominations) lazy PGN parser that is at least as fast as pgn-reader for rust (sadly I was unable to test because pgn-reader didn't work on windows when I tried to run it. I only have estimates based on processor speed differences). It was a long endeavour and I was still dissatisfied with the parsing speed. It is still a bottleneck in the process of assembling a database in my application.

That's why I went on to create a format specification that allows both more compact storage and faster parsing speeds than PGN while preserving its tag system. It's also arguably simpler to implement - it doesn't require a move generator.
I abolished the idea of including comments and annotations within the file. While this is a killer feature for many it's also completely unnecessary for the other many in a format with the goals of compactness and speed it would be just an inconvenience to support. Tradeoffs!

A rough sketch of the format:

It is created specifically for chess. It uses assumptions based on chess rules. It may not work if a position that's not a legal Chess960 position arises.

Every file starts with a 32 byte file header with format identifier and flags.
Currently there are 2 compression levels.

0 - 2 bytes per move. packed from square, to square, type, promoted piece. Cannot get simpler than that.
1 - Almost always 1 byte per move, 2 bytes in very unusual positions. It uses an idea similar to the basis for https://lichess.org/blog/Wqa7GiAAAOIpBL ... ompression but doesn't require legal move generation and is much much faster. Cannot get smaller than that with (well, *almost*) fixed width encodings

Every game consists of a header and movetext.
The header contains:

necessary information such as entry_size, header_size, ply_count, result, flags
a custom start position if necessary
common tags like date, white_elo, black_elo, white_player, black_player, round, site, event, eco
may contain additional tags as key value pairs

The movetext consists of encoded moves in (currently) one of two schemes (as described above).
In the future I may specify optional lz4 compression on top of everything else.

There is also a file scope flag "headerless" that allows reducing the header size to minimum, preserving only result and ply_count information, so the file is basilly just movetext.

Details can be found in the specification here: https://github.com/Sopel97/chess_pos_db ... /docs/bcgn along with pseudo-code.
A reference implementation of a file writer and reader can be found here https://github.com/Sopel97/chess_pos_db ... ess/Bcgn.h https://github.com/Sopel97/chess_pos_db ... s/Bcgn.cpp . With other relevant bits in https://github.com/Sopel97/chess_pos_db ... ss/Chess.h (CompressedMove), https://github.com/Sopel97/chess_pos_db ... Position.h (CompressedPosition) and https://github.com/Sopel97/chess_pos_db ... eIndex.cpp (compression level 1 move encoding/decoding)

I also performed some benchmarks on my oldish pc: https://github.com/Sopel97/chess_pos_db ... vs_bcgn.md
tldr; between 1.5x to 3x faster parsing than a corner-cutting pgn parser implementation

I may do some standalone tools if it gains traction.

Would you want this format to become more widely used?
Would you support it in your software?
What would you change (better do this early on!)?