Database storage methods

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Database storage methods

Post by diep »

Note that to store FENs, you can store chess positions pretty compactly.
A simple method is, say, the following:

64 bits : a bitmap that says for each square whether a piece is on it or not.

Now we have at most 32 pieces, each of which can take 12 values, so with arithmetic encoding that's:

log 12 / log 2 = 3.58 bits per piece.
So for 32 pieces that's 114.7 bits, rounded up to 115.
So we have 147 bits for the position.
Now let's also store its properties.
Side to move is 1 bit, so we are at 148 now.
We know we have at most 5 pawns we can capture en passant,
so that's 2.32 bits.

The bad luck so far is that, added to the 114.7 bits for the pieces, that's 117.04 bits, so it doesn't fit in 117 bits.

So we are at 151 now. Then for each side we must store at most 2 bits for castling rights.

So we end at 155 bits for a position.

Now Bob will probably post something about repetitions and stuff like that, but he isn't hashing that into Crafty either, and he is hashing en passant and castling rights, so 155 bits it is :)
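The bit budget above can be sanity-checked in a few lines (a sketch only; it uses the full 64 bits for the occupancy bitmap, per the correction in the next post, so it lands at 187 rather than 155):

```python
import math

occupancy = 64                          # 1 bit per square: occupied or not
pieces = math.ceil(32 * math.log2(12))  # 32 pieces x 3.58 bits -> 115
side_to_move = 1
en_passant = math.ceil(math.log2(5))    # 5 cases -> 2.32 bits, rounded to 3
castling = 4                            # 2 bits per side

total = occupancy + pieces + side_to_move + en_passant + castling
print(total)  # 187
```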
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Database storage methods

Post by diep »

Small calculation mistake in my scribbling:
I added 32 bits for the occupancy bitmap rather than 64, so add 32 to this.
155 + 32 = 187 bits.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Database storage methods

Post by Edmund »

diep wrote: [full post quoted above - snipped]
Some comments:
You are missing the 7 bits needed for the 50-move counter; this counter can be combined with the ep square because it has to be 0 whenever an ep capture is possible.

You are missing out on some gains from restrictions due to chess rules. Pawns may occupy only 3/4 of the board, there is exactly one king per side, the kings may not be on neighboring squares and side to move may not give check.
There are some more, but the gains would probably not justify the efforts.
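Edmund's packing trick can be sketched like this (a hypothetical layout, not from the thread: values 0-99 for the halfmove clock, 100-107 for the ep file when the clock is 0 - all 108 values fit in the same 7 bits):

```python
def pack_clock_ep(halfmove_clock, ep_file=None):
    # ep_file: None, or 0-7; the clock is always 0 when an ep capture is possible
    if ep_file is not None:
        assert halfmove_clock == 0
        return 100 + ep_file          # values 100..107
    return halfmove_clock             # values 0..99

def unpack_clock_ep(v):
    if v >= 100:
        return 0, v - 100             # ep possible, clock is 0
    return v, None                    # no ep square
```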
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Database storage methods

Post by diep »

Edmund wrote:
diep wrote: [quoted post snipped - see above]
Some comments:
You are missing the 7 bits needed for the 50-move counter; this counter can be combined with the ep square because it has to be 0 whenever an ep capture is possible.
No, we aren't missing that at all - is Crafty hashing it?
You are missing out on some gains from restrictions due to chess rules. Pawns may occupy only 3/4 of the board, there is exactly one king per side, the kings may not be on neighboring squares and side to move may not give check.
There are some more, but the gains would probably not justify the efforts.
Yes, it is possible to store the position in a far more compact way.

See some postings I did on the pawn calculations some months/years ago in this forum.

You really can get the search-space calculation down to 10^40 or less quite easily.

Using that scheme to store positions would be far more efficient, yet that's not really his question, if I understood well.

In fact, if I look at what Diep sees within its search space, knowing it'll search all nonsense as well, I notice that it never manages to promote more than 1 pawn simultaneously, so there are never more than 2 queens of one side on the board.

Sure it is possible, and sure, I bet some player one day tried to get 3 queens just to tease his opponent (something I'm never doing in chess - but I know there are some sadists out there).

Yet even the wildest search of Diep isn't showing this in game play.

So how many positions are practically reachable within game play?

I don't know. Maybe 10^30? Maybe 10^35? You tell me.

There are more practical constraints too. For example, all 8 pieces usually aren't on the other side of the board.

So with some sophisticated Huffman + arithmetic encoding that you tune to the average case, measured over the entire search during an entire game, I bet you won't get over 116 bits or so.

Vincent
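To put a rough number on the "tune to the average case" idea: weight the 12 piece codes by their frequencies in the starting material and the entropy drops from log2(12) ≈ 3.58 to 3.125 bits per piece, shaving the piece list from 115 to 100 bits. A back-of-envelope sketch only - real game positions have fewer pieces and a different mix:

```python
import math

# per colour: 8 pawns, 2 knights, 2 bishops, 2 rooks, 1 queen, 1 king
counts = [8, 2, 2, 2, 1, 1] * 2        # 12 piece types, 32 pieces total
total = sum(counts)                     # 32

# Shannon entropy of the piece distribution, in bits per piece
entropy = -sum(c / total * math.log2(c / total) for c in counts)
print(entropy)                          # 3.125, vs log2(12) ~= 3.585
print(math.ceil(32 * entropy))          # 100 bits for the piece list, vs 115
```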
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Database storage methods

Post by Dave_N »

diep wrote:
Dave_N wrote:I am thinking about storing a compressed move list as a tree in a hash table ...

The hash entry obviously has the following

int64 hash
int addrNext; // if needing a mainline
int addrVariation;


then for each move the game is merged by putting all variations found in a linked list of hash entries, so for a game list stored in sparse format

Code: Select all

1.e4 e5 2.Nf3 Nc6 3.Bc4  // game 1
                          3.Bb5  // game 2
                          3.Nc3  // game 3
the main line is written first, then the variable addrVariation for Bc4 holds the array index of the hash bucket for Bb5, and the Bb5 bucket stores the hash bucket index for Nc3 etc ...

I don't know what to do about terminal nodes or nodes with only 1 game passing through - perhaps the index of the game can be written and merged iff another game scores a hit?

I also tried to create a DAWG (directed acyclic word graph) for the FENs to try to create perfect hashing ... the FEN can already be compressed to 5-bit chars, and with that I got a 15 MB to 10 MB compression from the FEN list to the DAWG file sizes.

If anyone has tried this disk tree concept I'd be interested to hear how to compress the tree, and whether it's worth storing sparse move lists, or whether to just use SQL and a compressed move list, with maybe a GUI button to call CQL for position searches.
You soon will find out this eats lots of space. What I did in the old Diep game format, dating back to 1994 or so, is always generate moves in the same manner, by keeping a sorted list of pieces (so identical pieces get sorted by the square they're on).

Then you can easily store each move in 1 byte, as any position P has at most 220 moves or so. I know of a position with 218 moves off the top of my head, but for sure not 220.

Now this all goes pretty quickly, actually.

If you want to put more system time into it, then a simple run-length encoding trick can reduce the size of the move list even further.
I am already indexing a legal move list and decoding back to PGN - my legal moves are always generated in the same way, KRNBQP (or some similar order, depending on check too). I can radix sort the move lists (similar to sorting by ECO; of course 1.a3 might turn up first in the legal moves list anyway), but I am thinking about ways of clustering the transpositions - I generate FENs during compression, so hashing is natural. Building "merge trees" in a file on disk seems less stable, however (harder to debug and very time consuming); perhaps hash values could first be clustered and then written, to make the task easier.
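The 1-byte-per-move scheme diep describes hinges only on both sides reproducing the legal-move list in the same order. A minimal sketch (moves shown as plain strings; a real implementation would use the engine's own deterministic move generator):

```python
def encode_move(legal_moves, move):
    # the order must be reproduced exactly on decode; sorting makes it
    # independent of the generator's internal order
    ordered = sorted(legal_moves)
    return ordered.index(move)      # max known legal moves is 218, so 1 byte

def decode_move(legal_moves, index):
    return sorted(legal_moves)[index]

moves = ["e2e4", "d2d4", "g1f3", "b1c3"]
code = encode_move(moves, "g1f3")
assert decode_move(moves, code) == "g1f3"
```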
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Database storage methods

Post by diep »

Dave_N wrote:
[earlier exchange quoted in the previous post - snipped] ... I can radix sort the move lists, but I am thinking about ways of clustering the transpositions ...
There is no need to sort the move list at all. Only the generation of the move list needs to be deterministic, that's all.
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Database storage methods

Post by Dave_N »

An idea I got from the other thread about database methods: after I make a hash set for the lines that merge the most (e4 c5 etc.), most of the games can store fewer moves plus the index of their root game, so they can save some bytes. Perhaps then look at the positions generated and write in the material imbalances ... how many material imbalances are worth searching for? The alphabet of good material imbalances would determine the amount that can be stored.

Atm the basic compression system with bytes showed a best case of compression to 10% of the original size when zipped. I think fixing a hash size as the missing 90% would be insufficient; however, I don't care if the DB is bigger than the original PGN if it is much more functional.
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Database storage methods

Post by Dave_N »

Well the legal moves seem ok at this stage, I found some bugs while compressing large files.

I think I am failing to appreciate some of the finer qualities of Zobrist hashing. Perhaps a hash table / tree lookup system saying "white pawn on e4 and black queen on a8", using a tree of hash elements (an index table that resides in a separate table); this tree would work with the current memory structure and would not affect the normal processing.
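For reference, the core of Zobrist hashing is just one random 64-bit key per (piece, square) pair, XORed together - which is what makes incremental updates and "piece X on square Y" reasoning cheap. A generic sketch, not tied to any particular engine:

```python
import random

random.seed(2024)  # fixed seed: the same keys every run
# 12 piece types x 64 squares, one random 64-bit key each
ZOBRIST = [[random.getrandbits(64) for _ in range(64)] for _ in range(12)]

def zobrist_hash(pieces):
    """pieces: iterable of (piece_index 0-11, square 0-63) pairs."""
    h = 0
    for piece, square in pieces:
        h ^= ZOBRIST[piece][square]
    return h

# incremental update: removing a piece is the same XOR as adding it
h = zobrist_hash([(0, 28), (10, 56)])   # e.g. white pawn on e4, black queen on a8
assert h ^ ZOBRIST[10][56] == zobrist_hash([(0, 28)])
```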

Also a second hash key for material imbalances - I was reading about material imbalances on the Chess Programming Wiki.
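A second hash key for material can be as simple as packing the twelve piece counts into one integer, so any two positions with identical material get identical keys. A hypothetical layout (4 bits per count), sketched here for illustration:

```python
PIECES = "PNBRQKpnbrqk"

def material_key(board_fen):
    # board_fen: the piece-placement field of a FEN, e.g. "rnbqkbnr/pppppppp/..."
    key = 0
    for piece in PIECES:
        key = (key << 4) | min(board_fen.count(piece), 15)  # 4 bits per count
    return key
```

Positions that transpose into the same material balance then share a key, so an imbalance search reduces to a single equality test.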