Database storage methods

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

JuLieN
Posts: 2949
Joined: Mon May 05, 2008 12:16 pm
Location: Bordeaux (France)
Full name: Julien Marcel

Re: Database storage methods

Post by JuLieN »

bob wrote: Couple of points.

First, you need the full 8 bits for the index. There are positions with > 128 moves, which means 7 bits is not enough. The record is something like 230 legal moves.
I disagree with that: the huge majority of positions have a move list of fewer than 127 moves, so 7 bits is enough in the vast majority of situations. And handling the rare positions where you need more than 7 bits is easy: just use 127 (or 255 in an 8-bit scheme) as an escape code: the next byte will then contain the remaining count (nb_of_moves - 127). And if this byte is, again, the escape value, repeat. A count of exactly 127 would thus get coded as 127 (or 255) followed by 0.

Ten years ago, I programmed a compression program à la winzip, and I used escape codes a lot: they are very handy :)
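A minimal sketch of the escape-code scheme described above (an illustrative toy, not the compressor's actual code; the function names are made up):

```python
ESCAPE = 127  # 7-bit scheme: 0-126 are literal counts, 127 means "add 127 and read on"

def encode_count(n):
    """Encode a non-negative move count with the escape-code scheme:
    emit the escape value while the remainder is >= 127, then the rest.
    A count of exactly 127 comes out as (127, 0), as described above."""
    out = []
    while n >= ESCAPE:
        out.append(ESCAPE)
        n -= ESCAPE
    out.append(n)
    return bytes(out)

def decode_count(data):
    """Inverse of encode_count: sum bytes until a non-escape byte appears."""
    n = 0
    for b in data:
        n += b
        if b != ESCAPE:
            break
    return n
```

Almost every position fits in a single byte; only the rare monster positions pay the extra-byte tax.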

That's what my compressor looked like, btw ^^

[screenshot of the compressor's user interface]
bob wrote: Second, for chess engines, I don't think anyone stores a book in the way you describe, nor have they for many years. Problem is that transpositions are difficult to detect. Most just store a huge file of zobrist hash signatures. If you make a move in any position, and can find the zobrist signature in the book file, you know that was a book move.
Bob, did you read this thread, or do you now jump over all of Ed's post? :D
Rebel wrote:
JuLieN wrote:

Code: Select all

(e4(e5(Nf3(Nc6(Bb5 a6)Bc4)Nf6)f4)(c5 Nf3)(d6 d4)Nc6)e6)(d4(Nf6 c4(g6 Nc3)e6(Nc3 Bb4)Nf3)(d5(c4(c6 Nf3)(e6 Nc3)dxc4)Nf3)f5) etc.
Funny, mine is the same, still is, but added transpositions to it later :wink:
Anyway, I was just answering David's request, and this answer obviously helped him, so no need to jump on me. ^^
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Database storage methods

Post by bob »

JuLieN wrote:
bob wrote: Couple of points.

First, you need the full 8 bits for the index. There are positions with > 128 moves, which means 7 bits is not enough. The record is something like 230 legal moves.
I disagree with that: the huge majority of positions have a move list of fewer than 127 moves, so 7 bits is enough in the vast majority of situations. And handling the rare positions where you need more than 7 bits is easy: just use 127 (or 255 in an 8-bit scheme) as an escape code: the next byte will then contain the remaining count (nb_of_moves - 127). And if this byte is, again, the escape value, repeat. A count of exactly 127 would thus get coded as 127 (or 255) followed by 0.

Ten years ago, I programmed a compression program à la winzip, and I used escape codes a lot: they are very handy :)

That's what my compressor looked like, btw ^^

[screenshot of the compressor's user interface]
bob wrote: Second, for chess engines, I don't think anyone stores a book in the way you describe, nor have they for many years. Problem is that transpositions are difficult to detect. Most just store a huge file of zobrist hash signatures. If you make a move in any position, and can find the zobrist signature in the book file, you know that was a book move.
Bob, did you read this thread, or do you now jump over all of Ed's post? :D

I generally read and respond in the order posts appear. This forum is a cumbersome facility for reading an entire thread while marking particular posts you want to come back to. Quite unlike usenet news with a good news reader, which lets you do exactly that...


Rebel wrote:
JuLieN wrote:

Code: Select all

(e4(e5(Nf3(Nc6(Bb5 a6)Bc4)Nf6)f4)(c5 Nf3)(d6 d4)Nc6)e6)(d4(Nf6 c4(g6 Nc3)e6(Nc3 Bb4)Nf3)(d5(c4(c6 Nf3)(e6 Nc3)dxc4)Nf3)f5) etc.
Funny, mine is the same, still is, but added transpositions to it later :wink:
Anyway, I was just answering David's request, and this answer obviously helped him, so no need to jump on me. ^^
Wasn't intending to "jump on" anything at all. I was first just pointing out that there are lots of positions with > 128 moves, and also that transpositions are an integral part of an opening book that can't be ignored without some unwanted consequences. My book in the 70's was stored as a tree, somewhat similar to what you propose but not quite the same. I abandoned that around 1978 or so and went to the zobrist-key approach I use today, because I badly wanted transpositions handled correctly, so that my opponent could not trick me into an opening I wanted to avoid by some transposition trick.
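The zobrist-signature book described above can be sketched as follows (a toy model with made-up piece/square indices, not Crafty's actual code):

```python
import random

random.seed(2012)
# One random 64-bit number per (piece, square): 12 piece types x 64 squares.
ZOBRIST = [[random.getrandbits(64) for _ in range(64)] for _ in range(12)]

def signature(pieces):
    """pieces: iterable of (piece_index, square) pairs.  XOR is
    order-independent, which is exactly why transpositions hash identically."""
    key = 0
    for piece, square in pieces:
        key ^= ZOBRIST[piece][square]
    return key

# The book is just a big file (here: a set) of signatures of positions
# reachable by book lines; a move is "in book" if the resulting position is.
book = {signature([(0, 28), (6, 36)])}  # one toy position

def move_is_in_book(resulting_position):
    return signature(resulting_position) in book
```

Because the key depends only on the arrangement of pieces, any move order reaching the same position finds the same book entry.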
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Database storage methods

Post by Edmund »

JuLieN wrote:
bob wrote: Couple of points.

First, you need the full 8 bits for the index. There are positions with > 128 moves, which means 7 bits is not enough. The record is something like 230 legal moves.
I disagree with that: the huge majority of positions have a move list of fewer than 127 moves, so 7 bits is enough in the vast majority of situations. And handling the rare positions where you need more than 7 bits is easy: just use 127 (or 255 in an 8-bit scheme) as an escape code: the next byte will then contain the remaining count (nb_of_moves - 127). And if this byte is, again, the escape value, repeat. A count of exactly 127 would thus get coded as 127 (or 255) followed by 0.

Ten years ago, I programmed a compression program à la winzip, and I used escape codes a lot: they are very handy :)

That's what my compressor looked like, btw ^^

[...]
bob wrote: Second, for chess engines, I don't think anyone stores a book in the way you describe, nor have they for many years. Problem is that transpositions are difficult to detect. Most just store a huge file of zobrist hash signatures. If you make a move in any position, and can find the zobrist signature in the book file, you know that was a book move.
Bob, did you read this thread, or do you now jump over all of Ed's post? :D
Rebel wrote:
JuLieN wrote:

Code: Select all

(e4(e5(Nf3(Nc6(Bb5 a6)Bc4)Nf6)f4)(c5 Nf3)(d6 d4)Nc6)e6)(d4(Nf6 c4(g6 Nc3)e6(Nc3 Bb4)Nf3)(d5(c4(c6 Nf3)(e6 Nc3)dxc4)Nf3)f5) etc.
Funny, mine is the same, still is, but added transpositions to it later :wink:
Anyway, I was just answering David's request, and this answer obviously helped him, so no need to jump on me. ^^
I agree with Bob that it is imperfect to limit the number of moves to 127. I also think the solution with escape codes is not optimal here. I would suggest the following solution:
Generate all the moves, apply some static evaluation (SEE and piece-square values) and order the moves according to that evaluation. Then apply Huffman coding before you store the moves in the database.

Furthermore, I disagree with Bob regarding "Most just store a huge file of zobrist hash signatures." When talking about engines here, I suppose you are describing opening books and not databases in general. The problem with this method is that it invites unwanted transpositions.
e.g. after 1. e4 e5 2. Nf3 a6 3. Bb5 the opening book would suggest the transposition into the well-known line with Nc6.
The most straightforward countermeasure is to not only calculate hash values of positions but also to add the move into the equation.
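Keying on (position, move) rather than on the position alone might look like this (a hypothetical sketch using a generic hash in place of zobrist keys; the names are made up):

```python
import hashlib

def book_key(position, move):
    """Hash the predecessor position together with the move actually played
    in book, so a position can only be entered via a stored book move."""
    digest = hashlib.sha256(f"{position}|{move}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Store (position, move) pairs seen in book lines, not bare positions.
book = {book_key("startpos", "e2e4")}

def is_book_move(position, move):
    return book_key(position, move) in book
```

With this scheme, reaching a book position via a move that was never stored (like 3. Bb5 in the example above) no longer produces a book hit.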
hgm
Posts: 27870
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Database storage methods

Post by hgm »

Well, escapes can be viewed as a byte-quantized form of Huffman coding. If compactness is not the only criterion, but you also want to have some efficiency left on decoding, using SEE etc. seems a bad idea.

Decoding can be much faster, with an acceptable loss of compactness, by reserving codes for all possible move steps of the given piece type. You can then decode the move by just looking at the piece list section for that piece type to get the from square, and use a lookup table to get the step-vector. For orthodox Chess you would need 8x3 codes for Pawns, 3x8 for K+N, 4x14 for B+R and 28 for Q, for a total of 132. You would still need 8x3 codes for under-promotions, some of which could be also used for double-pushes, as Pawns that can double-push can never promote. (Castlings could be encoded as off-board King moves, as the King has to be on the edge to castle.) That leaves 100 codes for supernumerary pieces, of which you could set apart 8 for an extra N and 28 for an extra Q,R or B. That still would leave 64 codes with variable decoding (where you first would have to figure out what material is on the board to know what the codes mean) for the never-occurring situations with 3 and 4 Queens, 7 Knights or whatever, and some escape codes for positions so ridiculous that the 1-byte system cannot handle them.
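A much-reduced sketch of this decoding idea (knight and king steps only, with a made-up code layout; the full scheme above covers every piece type):

```python
# Step vectors, indexed by code; files and ranks run 0-7.
KNIGHT_STEPS = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]
KING_STEPS = [(1, 0), (1, 1), (0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1)]

def decode(code, piece_list):
    """Codes 0-7 are knight steps, 8-15 king steps (a hypothetical layout).
    The from-square comes straight from the piece list and the step from a
    lookup table -- no move generation is needed to decode."""
    if code < 8:
        piece, (df, dr) = "N", KNIGHT_STEPS[code]
    else:
        piece, (df, dr) = "K", KING_STEPS[code - 8]
    f, r = piece_list[piece]
    return piece, (f, r), (f + df, r + dr)
```

Decoding is a table lookup plus one piece-list read, which is what buys the speed over a Huffman-style bit stream.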
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Database storage methods

Post by Don »

hgm wrote:Well, escapes can be viewed as a byte-quantized form of Huffman coding. If compactness is not the only criterion, but you also want to have some efficiency left on decoding, using SEE etc. seems a bad idea.

Decoding can be much faster, with an acceptable loss of compactness, by reserving codes for all possible move steps of the given piece type. You can then decode the move by just looking at the piece list section for that piece type to get the from square, and use a lookup table to get the step-vector. For orthodox Chess you would need 8x3 codes for Pawns, 3x8 for K+N, 4x14 for B+R and 28 for Q, for a total of 132. You would still need 8x3 codes for under-promotions, some of which could be also used for double-pushes, as Pawns that can double-push can never promote. (Castlings could be encoded as off-board King moves, as the King has to be on the edge to castle.) That leaves 100 codes for supernumerary pieces, of which you could set apart 8 for an extra N and 28 for an extra Q,R or B. That still would leave 64 codes with variable decoding (where you first would have to figure out what material is on the board to know what the codes mean) for the never-occurring situations with 3 and 4 Queens, 7 Knights or whatever, and some escape codes for positions so ridiculous that the 1-byte system cannot handle them.
Has anyone ever given consideration to how a compact opening book might be built using bloom filters or Golomb-coded sets?

It seems like a natural fit, except for the pesky false-positive hit rate - a problem that I think makes this unworkable. With Golomb-coded sets we could (in theory) get a book down to something around 8 bits per position and still have transpositions detected.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Database storage methods

Post by Don »

Edmund wrote: Furthermore, I disagree with Bob regarding "Most just store a huge file of zobrist hash signatures." When talking about engines here, I suppose you are describing opening books and not databases in general. The problem with this method is that it invites unwanted transpositions.
e.g. after 1. e4 e5 2. Nf3 a6 3. Bb5 the opening book would suggest the transposition into the well known line with Nc6.
The most straightforward countermeasure is to not only calculate hash values of positions but also to add the move into the equation.
I think Bob didn't mean that literally. He was probably talking about the polyglot format. That is a pretty sensible way to store a book.
syzygy
Posts: 5569
Joined: Tue Feb 28, 2012 11:56 pm

Re: Database storage methods

Post by syzygy »

Edmund wrote:Furthermore, I disagree with Bob regarding "Most just store a huge file of zobrist hash signatures." When talking about engines here, I suppose you are describing opening books and not databases in general. The problem with this method is that it invites unwanted transpositions.
e.g. after 1. e4 e5 2. Nf3 a6 3. Bb5 the opening book would suggest the transposition into the well known line with Nc6.
The most straightforward countermeasure is to not only calculate hash values of positions but also to add the move into the equation.
From rgcc (10/5/96):
hyatt wrote:To solve your problem above, in tournament mode Crafty uses a short search
of all non-book positions to look for this. It happens lots of times, and
has been the subject of discussion for years. Example:

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4. Ba4

game goes:

1. e4 e5 2. Bb5 a6 3. Nf3 and black promptly plays Nc6 rather than
ripping the bishop, because Nc6 goes back into book. The short search
solves this completely because it would notice that there's a move that
seems to win material at a fairly quick depth, so crafty ignores the
transposition back into book and does a normal search. If the move wins
the piece after a long search, it rips it...
Whether this is still accurate after more than 15 years, I do not know ;-)
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Database storage methods

Post by Edmund »

Don wrote:
hgm wrote:Well, escapes can be viewed as a byte-quantized form of Huffman coding. If compactness is not the only criterion, but you also want to have some efficiency left on decoding, using SEE etc. seems a bad idea.

Decoding can be much faster, with an acceptable loss of compactness, by reserving codes for all possible move steps of the given piece type. You can then decode the move by just looking at the piece list section for that piece type to get the from square, and use a lookup table to get the step-vector. For orthodox Chess you would need 8x3 codes for Pawns, 3x8 for K+N, 4x14 for B+R and 28 for Q, for a total of 132. You would still need 8x3 codes for under-promotions, some of which could be also used for double-pushes, as Pawns that can double-push can never promote. (Castlings could be encoded as off-board King moves, as the King has to be on the edge to castle.) That leaves 100 codes for supernumerary pieces, of which you could set apart 8 for an extra N and 28 for an extra Q,R or B. That still would leave 64 codes with variable decoding (where you first would have to figure out what material is on the board to know what the codes mean) for the never-occurring situations with 3 and 4 Queens, 7 Knights or whatever, and some escape codes for positions so ridiculous that the 1-byte system cannot handle them.
Has anyone ever given consideration to how a compact opening book might be built using bloom filters or Golomb-coded sets?

It seems like a natural fit, except for the pesky false positive hit rate - a problem that I think makes this unworkable. With Golmob-coded sets we could (in theory) get a book down to something around 8 bits per position and still have transpositions detected.
Some Bloomfilter calculations:
n = total number of positions
m = size of the bloomfilter in bits
k = number of bits ORed to the filter for each position

According to http://pages.cs.wisc.edu/~cao/papers/su ... node8.html

for the m/n ratio of 8 (= 8 bits per position) the optimal value for k is 6 and this leaves us with a false positive rate of 2.16%.

Assuming that in a single game 10 moves are played out of book and for each position on average 30 moves are considered, the chance of a false move being played equals:
1 - ( 1 - 0.0216 ) ^ (10 * 30) = 99.857%

For 2 bytes per position I am getting 12.867% and for 4 bytes it is 0.006%

I couldn't find any similar statistics for Golomb coding. Can you provide any figures for false-positive rates for an m/n ratio of 8?
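The figures above can be reproduced with the standard Bloom-filter approximation (a quick sanity check of the same numbers, not new data):

```python
import math

def fp_rate(bits_per_key, k):
    """Classic Bloom-filter approximation: p = (1 - e^(-k/(m/n)))^k."""
    return (1.0 - math.exp(-k / bits_per_key)) ** k

def optimal_k(bits_per_key):
    """The optimum number of hash functions is (m/n) * ln 2, rounded."""
    return round(bits_per_key * math.log(2))

def false_hit_per_game(p, probes=10 * 30):
    """Chance that at least one of the probes (10 out-of-book moves,
    ~30 candidate moves each) hits the filter falsely."""
    return 1.0 - (1.0 - p) ** probes

for bits in (8, 16, 32):
    p = fp_rate(bits, optimal_k(bits))  # roughly 2.16% per probe at 8 bits
    print(bits, round(100 * p, 3), round(100 * false_hit_per_game(p), 3))
```

At 8 bits per position the per-probe rate is small, but across 300 probes a false book hit becomes nearly certain, which matches the 99.857% figure above.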
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Database storage methods

Post by Don »

Edmund wrote:
Don wrote:
hgm wrote:Well, escapes can be viewed as a byte-quantized form of Huffman coding. If compactness is not the only criterion, but you also want to have some efficiency left on decoding, using SEE etc. seems a bad idea.

Decoding can be much faster, with an acceptable loss of compactness, by reserving codes for all possible move steps of the given piece type. You can then decode the move by just looking at the piece list section for that piece type to get the from square, and use a lookup table to get the step-vector. For orthodox Chess you would need 8x3 codes for Pawns, 3x8 for K+N, 4x14 for B+R and 28 for Q, for a total of 132. You would still need 8x3 codes for under-promotions, some of which could be also used for double-pushes, as Pawns that can double-push can never promote. (Castlings could be encoded as off-board King moves, as the King has to be on the edge to castle.) That leaves 100 codes for supernumerary pieces, of which you could set apart 8 for an extra N and 28 for an extra Q,R or B. That still would leave 64 codes with variable decoding (where you first would have to figure out what material is on the board to know what the codes mean) for the never-occurring situations with 3 and 4 Queens, 7 Knights or whatever, and some escape codes for positions so ridiculous that the 1-byte system cannot handle them.
Has anyone ever given consideration to how a compact opening book might be built using bloom filters or Golomb-coded sets?

It seems like a natural fit, except for the pesky false-positive hit rate - a problem that I think makes this unworkable. With Golomb-coded sets we could (in theory) get a book down to something around 8 bits per position and still have transpositions detected.
Some Bloomfilter calculations:
n = total number of positions
m = size of the bloomfilter in bits
k = number of bits ORed to the filter for each position

According to http://pages.cs.wisc.edu/~cao/papers/su ... node8.html

for the m/n ratio of 8 (= 8 bits per position) the optimal value for k is 6 and this leaves us with a false positive rate of 2.16%.

Assuming that in a single game 10 moves are played out of book and for each position on average 30 moves are considered, the chance of a false move being played equals:
1 - ( 1 - 0.0216 ) ^ (10 * 30) = 99.857%

For 2 bytes per position I am getting 12.867% and for 4 bytes it is 0.006%

I couldn't find any similar statistics for Golomb coding. Can you provide any figures for false-positive rates for an m/n ratio of 8?
Golomb encoding basically gets you closer to the theoretical minimum number of bits required for a given probability of false positives; it's probably only 20% or 30% less space, but nothing dramatic.

I did a similar calculation and it seems completely unworkable. The "set" of positions to identify is much larger than the number of moves in the book, which is a big problem.

There are a couple of ways to significantly constrain this. One of them is to consider only positions that are connected to a previous book move, so once out of book we are done. Another idea is to use a deterministic chess program to generate candidate moves. For example, a 5-ply search done by your chess program under the same conditions would produce a move deterministically. The bloom filter then identifies those positions where the search does not produce the desired book move. It's ok to get a false positive in this case; it would only mean the program rejects the computer's choice once in a while when it shouldn't. Of course this has serious limitations, as you must now deal with the case where the computer does not pick the book move, and it doesn't handle more than 1 possible book move.

Nothing based on bloom filters that I can think of really works well for a proper book system with high compression.
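The exception-list idea sketched above might be wired up like this (all names hypothetical; a plain set stands in for the Bloom filter, which in reality would add false positives):

```python
def choose_move(position, shallow_search, exception_filter, exception_moves):
    """A deterministic shallow search proposes a move for every position;
    the filter flags the rare positions where book theory disagrees with
    that search, and only those positions need an explicitly stored move."""
    move = shallow_search(position)       # e.g. a fixed 5-ply search
    if position in exception_filter:      # Bloom-filter membership test
        return exception_moves[position]
    return move

# Toy usage: the search always proposes "e2e4", but one position is flagged
# as an exception with a stored replacement move.
search = lambda pos: "e2e4"
exceptions = {"pos-after-1...c5"}
stored = {"pos-after-1...c5": "g1f3"}
```

Only the exceptions need storage, which is where the compression would come from; a false positive merely swaps in the stored handling where it wasn't needed.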
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Database storage methods

Post by diep »

Dave_N wrote:I am thinking about storing a compressed move list as a tree in a hash table ...

The hash entry obviously has the following

int64 hash
int addrNext; // if needing a mainline
int addrVariation;


then for each move the game is merged by putting all variations found in a linked list of hash entries, so for a game list stored in sparse format

Code: Select all

1.e4 e5 2.Nf3 Nc6 3.Bc4  // game 1
                          3.Bb5  // game 2
                          3.Nc3  // game 3
the main line is written first, then the variable addrVariation for Bc4 holds the array index of the hash bucket for Bb5, and the Bb5 bucket stores the hash bucket index for Nc3 etc ...

I don't know what to do about terminal nodes or nodes with only 1 game passing through - perhaps the index of the game can be written and merged iff another game scores a hit ?

I also tried to create a DAWG (directed acyclic word graph) for the FENs to try to create perfect hashing ... the FEN can already be compressed to 5-bit chars, and then I got a 15 MB to 10 MB compression from the FEN list to the DAWG file sizes.

If anyone has tried this disk-tree concept I'd be interested to hear about how to compress the tree, and whether it's worth storing sparse move lists, or whether to just use SQL and a compressed move list, maybe with a GUI button to call CQL for position searches.
You will soon find out this eats lots of space. What I did in the old Diep game format, dating back to 1994 or so, is always generate moves in the same manner, by keeping a sorted list of pieces (so identical pieces get sorted by the square they're on).

Then you can easily store each move in 1 byte, as each position P has at most 220 moves or so. I know of a position with 218 moves off the top of my head, but for sure not 220.

This all goes pretty quickly, actually.

If you want to put more system time into it, then a simple run-length encoding trick can reduce the size of the move list even further.
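The canonical-ordering trick can be sketched like so (a toy with knights only and hypothetical helper names; a real generator would cover all pieces, and the square convention here is file*8+rank):

```python
KNIGHT = [(1, 2), (2, 1), (2, -1), (1, -2), (-1, -2), (-2, -1), (-2, 1), (-1, 2)]

def destinations(position, square):
    # position: {square: piece}; a toy knight-only move generator.
    f, r = divmod(square, 8)
    return [(f + df) * 8 + (r + dr) for df, dr in KNIGHT
            if 0 <= f + df < 8 and 0 <= r + dr < 8
            and (f + df) * 8 + (r + dr) not in position]

def canonical_moves(position):
    """Always generate in the same order: pieces sorted by square, each
    piece's moves sorted by destination, so a move's list index is stable
    across program runs and can serve as its encoding."""
    return [(sq, to) for sq in sorted(position)
            for to in sorted(destinations(position, sq))]

def encode_move(position, move):
    return canonical_moves(position).index(move)   # always < 256: one byte

def decode_move(position, index):
    return canonical_moves(position)[index]
```

Since no legal position exceeds a few hundred moves (218 is the known extreme), the index always fits in a single byte, and the run-length trick mentioned above can squeeze the resulting byte stream further.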