Re-Pair compression questions

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re-Pair compression questions

Post by Rein Halbersma »

A few questions to Ronald about the Re-Pair compression algorithm (as preparation for draughts database compression). From my understanding of other posts here, your compression does a) a re-pair until either no pair has count > 1, or 4096 symbols have been generated, b) you follow this with a Huffman encoding. Is this understanding correct?

1) By how much does the re-pair step (without Huffman post-processing) typically reduce the raw files of WDL and DTZ values?
2) Why do you stop at a fixed number of symbols? (please disregard if I misunderstood)
3) How much more reduction do you get from Huffman?
4) These lecture notes ( http://www2.compute.dtu.dk/courses/0228 ... ess1x1.pdf ) suggest an O(log N) method (with N the number of positions in the database) for random access to the compressed db (without Huffman). In particular, see the Interval-based search section. Is this similar to how you probe your positions?

Also note that this probing method continues the re-pairing until one "root" symbol remains. This allows them to do a binary search on the full tree of the re-pairing "grammar". I wonder if it could be made competitive.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Re-Pair compression questions

Post by syzygy »

Rein Halbersma wrote: Fri Aug 17, 2018 2:04 pm A few questions to Ronald about the Re-Pair compression algorithm (as preparation for draughts database compression). From my understanding of other posts here, your compression does a) a re-pair until either no pair has count > 1, or 4096 symbols have been generated, b) you follow this with a Huffman encoding. Is this understanding correct?
Yes, except that I stop a bit earlier than count > 1 (minfreq = 8 in the code).
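
For illustration, here is a minimal sketch (in C++) of a Re-Pair loop with exactly these two stop conditions, the 4096-symbol limit and a minimum pair frequency. It is not the actual generator code: a real implementation maintains the pair counts incrementally instead of rescanning, and would also enforce the 256-position cap on symbol expansion mentioned further down.

Code: Select all

// Naive Re-Pair sketch: repeatedly replace the most frequent adjacent pair
// with a new non-terminal symbol.  Illustration only: a real implementation
// keeps incremental pair counts instead of rescanning the whole sequence,
// and would also enforce the 256-position cap on symbol expansion.
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

constexpr int MAX_SYMBOLS = 4096;  // 12-bit symbol ids
constexpr int MIN_FREQ    = 8;     // stop when the best pair is rarer than this

struct Rule { uint16_t left, right; };  // new symbol -> (left child, right child)

std::vector<Rule> repair(std::vector<uint16_t>& seq, int num_terminals) {
    std::vector<Rule> rules;
    int next_symbol = num_terminals;
    while (next_symbol < MAX_SYMBOLS) {
        // Count all adjacent pairs.
        std::map<std::pair<uint16_t, uint16_t>, int> count;
        for (size_t i = 0; i + 1 < seq.size(); i++)
            count[{seq[i], seq[i + 1]}]++;
        // Pick the most frequent pair.
        std::pair<uint16_t, uint16_t> best{};
        int best_count = 0;
        for (const auto& [p, c] : count)
            if (c > best_count) { best = p; best_count = c; }
        if (best_count < MIN_FREQ)
            break;
        // Replace every (non-overlapping) occurrence by the new symbol.
        std::vector<uint16_t> out;
        for (size_t i = 0; i < seq.size(); ) {
            if (i + 1 < seq.size() && seq[i] == best.first && seq[i + 1] == best.second) {
                out.push_back((uint16_t)next_symbol);
                i += 2;
            } else {
                out.push_back(seq[i]);
                i += 1;
            }
        }
        seq = std::move(out);
        rules.push_back({best.first, best.second});
        next_symbol++;
    }
    return rules;  // the symbol stream is then entropy-coded (Huffman)
}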

Not unimportant is that I fill in don't cares and partial don't cares in the re-pair phase.
The WDL and DTZ code differ significantly here.

You will probably only have full don't cares (for broken/illegal positions and capture positions).

It is possible to do more by encoding a position where you can force a losing capture by the other side as a full don't care and a drawn position where you can force a drawing capture as a partial don't care (meaning that you can store either a loss or a draw, since the probing code can figure out that the result is at least a draw by discovering the move forcing a drawing capture). But this will slow down probing quite a bit, so if you want to use it at all, then probably only for the tables with the maximum number of pieces.
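
For illustration, a small self-contained sketch of this don't-care logic: the probing side can always take the maximum of the stored value and the best result forcible through captures, which is what gives the compressor its freedom. The enum and helper names are made up for this sketch, and cursed values are ignored.

Code: Select all

// Sketch of the (partial) don't-care idea: the probing code can always
// recover the true value as the maximum of the stored value and the best
// result provable through captures, which gives the compressor freedom in
// what it stores.  The enum and helpers are illustrative.
#include <algorithm>

enum Wdl { WDL_LOSS = -1, WDL_DRAW = 0, WDL_WIN = 1 };

// Probe-time correction: 'stored' may be an under-estimate for (partial)
// don't cares; 'capture_bound' is the best result forcible via captures.
inline Wdl resolve_stored_wdl(Wdl stored, Wdl capture_bound) {
    return std::max(stored, capture_bound);
}

// Compression-time freedom: which values the compressor may store for a
// position whose real value is 'real'.  A drawn position with a forcible
// drawing capture (capture_bound == WDL_DRAW) may store LOSS or DRAW; a
// position whose captures already force a win (capture_bound == WDL_WIN)
// may store anything, i.e. it is a full don't care.
inline bool storable(Wdl real, Wdl capture_bound, Wdl candidate) {
    return std::max(candidate, capture_bound) == real;
}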

You will probably only have W/D/L and no cursed wins/losses? That simplifies things too.
1) By how much does the re-pair step (without Huffman post-processing) typically reduce the raw files of WDL and DTZ values?
I don't know.
2) Why do you stop at a fixed number of symbols? (please disregard if I misunderstood)
Because I store the symbol tree with 3 bytes per symbol (12 bits left-hand child, 12 bits right-hand child). The maximum size of this tree is now 12 KB, and not letting it grow too big should be good for cache behaviour during probing.

Another limit is the maximum expansion of a Re-Pair symbol, which is 256. This is so I can store the length of each symbol (minus 1) in one byte (adding another 4 KB). It also limits the number of iterations of the branchy loop when extracting the wanted value from a Huffman-decoded Re-Pair symbol. (It seems possible to remove those branches, but I haven't tried yet.)
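
For illustration, one possible packing of such a symbol table. The split into a 3-byte child entry and a one-byte length follows the description above; the exact bit layout is an assumption, not necessarily the real one.

Code: Select all

// One possible packing of the symbol table described above: 12 bits per
// child packed into 3 bytes per symbol (12 KB for 4096 symbols), plus one
// byte per symbol holding the expansion length minus 1 (another 4 KB).
// The exact bit layout is an assumption made for illustration.
#include <cstdint>

struct SymbolTable {
    uint8_t pairs[4096][3];  // packed (left, right) children, 3 bytes each
    uint8_t len_m1[4096];    // expansion length - 1, so 1..256 fits in a byte

    int left(int s)   const { return pairs[s][0] | ((pairs[s][1] & 0x0f) << 8); }
    int right(int s)  const { return (pairs[s][1] >> 4) | (pairs[s][2] << 4); }
    int length(int s) const { return len_m1[s] + 1; }  // positions covered
};
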
3) How much more reduction do you get from Huffman?
I don't know this either.
4) These lecture notes ( http://www2.compute.dtu.dk/courses/0228 ... ess1x1.pdf ) suggest an O(log N) method (with N the number of positions in the database) for random access to the compressed db (without Huffman). In particular, see the Interval-based search section. Is this similar to how you probe your positions?
My probing code first finds the 64-byte block that contains the position being probed. This block is Huffman-decompressed into Re-Pair symbols. Since we know the number of positions (from 1 to 256) represented by each Re-Pair symbol, we know at which symbol to stop. The code then walks down the tree (with the branchy loop mentioned above) to find the terminal value corresponding to the probed position. Here we again use the known lengths of each symbol to decide whether to follow the left-hand or the right-hand symbol on each expansion step. This seems to be the "Top Down Solution" mentioned in the lecture notes.
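
For illustration, a sketch of this top-down walk, reusing the packed symbol table sketched earlier. Detecting terminal symbols by comparing against a terminal count is a simplification; the real probing code differs in its details.

Code: Select all

// Sketch of the top-down walk, reusing the SymbolTable sketch above.
// 'symbols' is the Huffman-decoded block, 'litidx' the index of the probed
// position within the block.  Detecting terminals by comparing against a
// terminal count is a simplification for this illustration.
#include <cstdint>

int probe_value(const SymbolTable& tb, const uint16_t* symbols,
                int litidx, int num_terminals) {
    // Skip whole symbols until the wanted position falls inside one.
    int i = 0;
    while (litidx >= tb.length(symbols[i])) {
        litidx -= tb.length(symbols[i]);
        i++;
    }
    // Walk down the grammar, choosing the left or right child by comparing
    // the remaining offset against the left child's expansion length.
    int s = symbols[i];
    while (s >= num_terminals) {
        int l = tb.left(s);
        if (litidx < tb.length(l)) {
            s = l;
        } else {
            litidx -= tb.length(l);
            s = tb.right(s);
        }
    }
    return s;  // terminal symbol = the stored WDL/DTZ value
}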

I haven't yet tried to understand the other proposed solutions. I think what has to be kept in mind is that the engine will be doing normal search work and probing many different tables. The data structure that is optimal for a program that is continuously probing one and the same table need not be optimal for a real engine's probing code.

For the same reason I have some doubt that replacing Huffman with any of the ANS entropy encoders is a good idea. It would make the tables a bit smaller, but that is just one part of the story. (For DTZ tables probed only at the root it likely does make a lot of sense.)
Also note that this probing method continues the re-pairing until one "root" symbol remains. This allows them to do a binary search on the full tree of the re-pairing "grammar". I wonder if it could be made competitive.
So the whole table becomes a single symbol together with a huge grammar? I suspect the number of random accesses needed for each probe would be excessive.

Anyway, there are many choices to be made, and the choices I made are unlikely to be optimal (not that there exists a set of choices that is optimal for all hardware configurations).
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Re-Pair compression questions

Post by Rein Halbersma »

syzygy wrote: Fri Aug 17, 2018 8:49 pm
Rein Halbersma wrote: Fri Aug 17, 2018 2:04 pm A few questions to Ronald about the Re-Pair compression algorithm (as preparation for draughts database compression). From my understanding of other posts here, your compression does a) a re-pair until either no pair has count > 1, or 4096 symbols have been generated, b) you follow this with a Huffman encoding. Is this understanding correct?
Yes, except that I stop a bit earlier than count > 1 (minfreq = 8 in the code).

Not unimportant is that I fill in don't cares and partial don't cares in the re-pair phase.
The WDL and DTZ code differ significantly here.

You will probably only have full don't cares (for broken/illegal positions and capture positions).
Yes, in checkers/draughts, captures are mandatory and can be eliminated. The Kingsrow databases, for example, set all positions with either pending captures for the side to move or pending captures for the other side to "don't care". This requires a small capture-resolving search into smaller tables during lookup.
You will probably only have W/D/L and no cursed wins/losses? That simplifies things too.
Not necessarily, I am thinking of doing either DTM or DTZ_N+ for some value of N (in 10x10 international draughts it's 50 ply, in 8x8 checkers it's 80 ply IIRC).
3) How much more reduction do you get from Huffman?
I don't know this either.
OK, I thought perhaps you know the (weighted) average symbol length of your Huffman coding since you store the Huffman encoding in the db, right?
Anyway, there are many choices to be made, and the choices I made are unlikely to be optimal (not that there exists a set of choices that is optimal for all hardware configurations).
The design space of endgame database generation is enormous! Not just the compression, but the metric, subdivisions, method of generation, indexing etc.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Re-Pair compression questions

Post by syzygy »

3) How much more reduction do you get from Huffman?
I don't know this either.
OK, I thought perhaps you know the (weighted) average symbol length of your Huffman coding since you store the Huffman encoding in the db, right?
For each symbol length I store the number of symbols with that length. This is enough for decoding canonical Huffman codes but not enough to calculate the weighted average symbol length. Of course it is possible to get all such statistics by decompressing the whole table.
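
For illustration, a sketch of why the per-length counts are enough: the canonical code can be rebuilt from them alone. This is the textbook construction, not necessarily the exact convention used in the tables.

Code: Select all

// Sketch of canonical Huffman reconstruction from only the number of codes
// of each length (the textbook construction; the exact convention in the
// tables may differ).  A code of length l then decodes to symbol index
// offset[l] + (code_value - base[l]).
#include <cstdint>
#include <vector>

struct CanonicalHuffman {
    std::vector<uint32_t> base;    // smallest code value of each length
    std::vector<uint32_t> offset;  // index of the first symbol of each length

    explicit CanonicalHuffman(const std::vector<int>& num_codes_of_len) {
        int max_len = (int)num_codes_of_len.size() - 1;
        base.assign(max_len + 1, 0);
        offset.assign(max_len + 1, 0);
        uint32_t code = 0, sym = 0;
        for (int len = 1; len <= max_len; len++) {
            base[len]   = code;
            offset[len] = sym;
            code = (code + num_codes_of_len[len]) << 1;
            sym += num_codes_of_len[len];
        }
    }
};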

Anyway, the compression ratio varies a lot from table to table. Tables that are almost completely won or almost completely drawn compress to 32-byte blocks of 0s, each 0 bit representing 256 draw or win values. This is the best you can get in my scheme since each block can represent at most 65536 positions (so I can store the number of positions in a block minus 1 in 2 bytes). In this case Huffman gains a factor of close to 12 (12 bits -> 1 bit for almost all symbols) and Re-Pair a factor of 256*8/12 = 170.67 if you count an uncompressed WDL value as 1 byte. Since a WDL value can instead be stored in 8/5 bits (if you don't need to accommodate cursed wins/losses), 256*8/5/12 = 34.13 might be a more honest estimate. With more balanced tables compression will obviously be much worse.
Sergei S. Markoff
Posts: 227
Joined: Mon Sep 12, 2011 11:27 pm
Location: Moscow, Russia

Re: Re-Pair compression questions

Post by Sergei S. Markoff »

syzygy wrote: Fri Aug 17, 2018 8:49 pm For the same reason I have some doubt that replacing Huffman with any of the ANS entropy encoders is a good idea. It would make the tables a bit smaller, but that is just one part of the story. (For DTZ tables probed only at the root it likely does make a lot of sense.)
I think it's worth trying ANS with probabilities delivered by a convolutional neural network fed with the position. I think a well-trained CNN can deliver an almost exact evaluation, which could dramatically reduce memory usage...

Moreover, during the induction phase it's possible to update the CNN weights based on newly obtained scores and then perform re-packing "on the fly".
The Force Be With You!
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Re-Pair compression questions

Post by syzygy »

Sergei S. Markoff wrote: Sat Apr 20, 2019 11:55 pm
syzygy wrote: Fri Aug 17, 2018 8:49 pm For the same reason I have some doubt that replacing Huffman with any of the ANS entropy encoders is a good idea. It would make the tables a bit smaller, but that is just one part of the story. (For DTZ tables probed only at the root it likely does make a lot of sense.)
I think it's worth trying ANS with probabilities delivered by a convolutional neural network fed with the position. I think a well-trained CNN can deliver an almost exact evaluation, which could dramatically reduce memory usage...
How are you going to run an NN on all the positions in a 7-piece table during generation?
How much will having to run an NN slow down the probing?

The current compression scheme is essentially based on a predictor that tries out all captures, if any, and otherwise just counts the material that is left on the board. NNs won't do much better than that, except in some tables where the pawn structure predicts the outcome. On such tables the current scheme could be improved by compressing pawn slices separately.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Re-Pair compression questions

Post by petero2 »

syzygy wrote: Sun Apr 21, 2019 4:15 pm
Sergei S. Markoff wrote: Sat Apr 20, 2019 11:55 pm I think it's worth trying ANS with probabilities delivered by a convolutional neural network fed with the position. I think a well-trained CNN can deliver an almost exact evaluation, which could dramatically reduce memory usage...
How are you going to run an NN on all the positions in a 7-piece table during generation?
How much will having to run an NN slow down the probing?

The current compression scheme is essentially based on a predictor that tries out all captures, if any, and otherwise just counts the material that is left on the board. NNs won't do much better than that, except in some tables where the pawn structure predicts the outcome. On such tables the current scheme could be improved by compressing pawn slices separately.
A while ago I tried to improve the syzygy WDL+ compression by using decision trees to predict the outcome based on the position. The decision tree computes a permutation of the 5 possible WDL+ values, where the first value corresponds to the most likely result, the second value corresponds to the second most likely result, and so on. The table then stores the index in the permutation instead of storing the WDL+ value directly. This means that if the decision tree always predicts the correct WDL+ value, the table will contain only 0 values and compresses to almost nothing.
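
For illustration, a small sketch of this remapping; with a perfect predictor every stored index is 0, which is what makes the table compress to almost nothing.

Code: Select all

// Sketch of the permutation remapping described above.  'perm' is the
// predictor's ranking of the five WDL+ values, most likely first; the table
// stores the index of the real value within that ranking, so a perfect
// prediction stores 0 for every position.
#include <array>
#include <cstdint>

using Perm = std::array<uint8_t, 5>;  // e.g. {4, 2, 0, 1, 3}

// Encoding: real WDL+ value (0..4) -> index in the predicted ranking.
inline uint8_t encode(const Perm& perm, uint8_t wdl) {
    for (uint8_t i = 0; i < 5; i++)
        if (perm[i] == wdl) return i;
    return 0;  // unreachable for valid input
}

// Decoding: stored index -> real WDL+ value.
inline uint8_t decode(const Perm& perm, uint8_t stored) {
    return perm[stored];
}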

This works really well in some simple cases that the decision tree is able to handle efficiently. For example, KBBBvK is predicted almost perfectly, because I have a predicate that tells if there is a bishop on white squares and another predicate that tells if there is a bishop on black squares. Therefore the KBBBvK table can be compressed from 723K to about 1K.

The problem though is that the simple cases are mostly irrelevant to the overall quality of the compression scheme. The simple cases are taking up an insignificant part of the total disk space in the current syzygy implementation, so compressing them even better is mostly useless.

For the complicated cases, such as KRNPvKQ which is the largest 6 man WDL table, the decision tree is a lot less efficient at reducing the compressed size. The KRNPvKQ table is reduced by about 30% by the decision tree prediction. Other "complicated" tables are reduced by around 30% to 50%.

I did not finish my implementation because I currently do not think it is likely that I will be able to come up with significantly better predictors, which would be required to reduce the size further.

Another problem is that building the decision tree is quite slow. I think it took something like 10 hours for the KRNPvKQ table using 1 core. Also parallelizing the algorithm is complicated without using a lot of extra memory.

Also, the probing will be slower since the decision tree has to be evaluated before a table value can be interpreted. How much slower is unknown since I did not finish my implementation.

It is possible that a neural network is better than a decision tree at predicting the WDL+ values, but I currently have no reason to believe this.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Re-Pair compression questions

Post by syzygy »

petero2 wrote: Sun Apr 21, 2019 11:03 pm This works really well in some simple cases that the decision tree is able to handle efficiently. For example, KBBBvK is predicted almost perfectly, because I have a predicate that tells if there is a bishop on white squares and another predicate that tells if there is a bishop on black squares. Therefore the KBBBvK table can be compressed from 723K to about 1K.
The syzygy tables indeed do not deal intelligently with bishops of the same color.
The problem though is that the simple cases are mostly irrelevant to the overall quality of the compression scheme. The simple cases are taking up an insignificant part of the total disk space in the current syzygy implementation, so compressing them even better is mostly useless.
Indeed. It's a lot of programming work for a small gain.
For the complicated cases, such as KRNPvKQ which is the largest 6 man WDL table, the decision tree is a lot less efficient at reducing the compressed size. The KRNPvKQ table is reduced by about 30% by the decision tree prediction. Other "complicated" tables are reduced by around 30% to 50%.
Do you predict the outcome of KRNPvKQ for positions without captures other than by looking at the position of the pawn?

For 6-piece tables with a single pawn it is probably worth it to compress by pawn square rather than by pawn file. That would effectively predict the outcome of a position by looking at the position of the pawn. (Compressing by pawn rank would already be better than by file as is done now.)
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Re-Pair compression questions

Post by petero2 »

syzygy wrote: Sun Apr 28, 2019 5:58 pm
petero2 wrote: Sun Apr 21, 2019 11:03 pm For the complicated cases, such as KRNPvKQ which is the largest 6 man WDL table, the decision tree is a lot less efficient at reducing the compressed size. The KRNPvKQ table is reduced by around 30% by the decision tree prediction. Other "complicated" tables are reduced by around 30% to 50%.
Do you predict the outcome of KRNPvKQ for positions without captures other than by looking at the position of the pawn?
Yes, I use a number of different predicates. I use a binary tree to represent the decision tree. Non-leaf nodes in the tree contain a binary predicate. The predicate takes the position as input and produces a Boolean result. During prediction, you start at the root and evaluate the predicate. If it is false, go to the left child, otherwise go to the right child. Continue until a leaf node is reached. A leaf node does not contain a predicate. Instead it contains a permutation of 01234, which describes how to remap the value stored in the tablebase.
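
For illustration, a small sketch of this prediction walk. Predicate evaluation is abstracted as a callback so the example stays self-contained; the node layout is made up for the sketch.

Code: Select all

// Sketch of the prediction walk described above.  Non-leaf nodes hold a
// predicate number, leaves hold a permutation of the five WDL+ values.
// Predicate evaluation is abstracted as a callback so the sketch stays
// self-contained; the node layout is illustrative.
#include <array>
#include <cstdint>
#include <functional>
#include <vector>

struct Node {
    int predicate = -1;             // -1 marks a leaf
    int left = -1, right = -1;      // children for predicate false / true
    std::array<uint8_t, 5> perm{};  // leaf only: most likely value first
};

std::array<uint8_t, 5> predict(const std::vector<Node>& tree,
                               const std::function<bool(int)>& eval_predicate) {
    int n = 0;  // start at the root
    while (tree[n].predicate != -1)
        n = eval_predicate(tree[n].predicate) ? tree[n].right : tree[n].left;
    return tree[n].perm;
}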

A predicate in the tree is chosen from a predefined set of possible predicates. Some examples of predicates I use:

Code: Select all

* The rank of piece I is <= Y, for I=0..(npieces-1), Y=0..6.     6*7=42 predicates for a 6-men TB
* Similar for file
* The file distance between piece I and J is <= X                6*5/2*7=105 predicates for a 6-men TB
* Similar for rank distance
* It is white's turn to move
* The side to move is in check
* Piece I attacks piece J                                        6*5=30 predicates for a 6-men TB
* White/black has a bishop pair                                  2 predicates
* The L1/Linf distance between piece I and J is <= D
The decision tree is constructed by a greedy algorithm. First all possible predicates are evaluated for all positions in the tablebase, and the predicate that causes the largest reduction in entropy is chosen as the root of the decision tree. The procedure is repeated for a leaf node in the decision tree, by evaluating all possible predicates for all positions in the tablebase that correspond to the leaf node. The procedure stops when the tree has reached a given depth in all leaf nodes, or when no leaf node is worth splitting further because the entropy reduction would be too low.
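
For illustration, a sketch of the greedy split selection for a single leaf. The expensive part, computing the value histograms of both halves for every candidate predicate over all positions of the leaf, is abstracted away here.

Code: Select all

// Sketch of the greedy split selection described above: for one leaf node,
// pick the predicate giving the largest reduction in entropy of the WDL+
// value distribution, or none if the gain is below a threshold.  Counting
// is abstracted: the caller supplies, per candidate predicate, the value
// histograms of the 'false' and 'true' halves.  Illustrative only.
#include <array>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

using Hist = std::array<uint64_t, 5>;  // counts of the five WDL+ values

// Total entropy of a histogram in bits (entropy per position times count).
static double entropy_bits(const Hist& h) {
    uint64_t total = 0;
    for (uint64_t c : h) total += c;
    double e = 0.0;
    for (uint64_t c : h)
        if (c) {
            double p = (double)c / (double)total;
            e -= p * std::log2(p);
        }
    return e * (double)total;
}

// Returns the index of the best predicate, or -1 if no split saves at
// least min_gain bits.
int best_split(const Hist& parent,
               const std::vector<std::pair<Hist, Hist>>& splits,
               double min_gain) {
    double parent_bits = entropy_bits(parent);
    int best = -1;
    double best_gain = min_gain;
    for (size_t i = 0; i < splits.size(); i++) {
        double gain = parent_bits
                    - entropy_bits(splits[i].first)
                    - entropy_bits(splits[i].second);
        if (gain > best_gain) { best_gain = gain; best = (int)i; }
    }
    return best;
}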

As an example, consider the KQvKN tablebase. Running the algorithm with a max allowed tree depth of 5 produces the following decision tree:

Code: Select all

fork02            # True if piece 0 (WK) and 2 (WQ) can be forked by an opponent knight
  attack30        # True if piece 3 (BN) attacks piece 0 (WK)
    42013         # Most likely result is 4 (white win), second most likely is 2 (draw), etc
    attack32      # Piece 3 (BN) attacks piece 2 (WQ)
      42013
      20134
  wtm             # White's turn to move
    incheck       # Side to move (black) is in check
      24013
      40123
    40123
Using this decision tree gives a prediction accuracy of around 99.903%, which makes it possible to compress the TB to around 2.8KB.

If the tree depth is allowed to be at most 10, the computed tree contains 23 leaf nodes, gives a prediction accuracy of around 99.9986%, and the TB can be compressed to around 0.3KB.

For the KRNPvKQ TB, the algorithm produced a tree with 264660 leaf nodes when the maximum allowed tree depth was 20. This would give a compressed size of around 1.43GB.

In principle it is possible to create an arbitrarily good predicate. In the extreme case the predicate could be WDLScore <= V, and the predicate could be computed by doing a full retrograde analysis. This is just another way of stating that a tablebase can always be compressed to the size of the tablebase generator, if you don't care how much time and memory is required to access the tablebase. In practice, such a predicate would obviously be totally useless.

The challenge then becomes how to choose predicates that can be evaluated reasonably fast using a reasonably small amount of RAM, and that can be represented in the stored decision tree using a small amount of memory.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Re-Pair compression questions

Post by syzygy »

syzygy wrote: Sun Apr 21, 2019 4:15 pm
Sergei S. Markoff wrote: Sat Apr 20, 2019 11:55 pm
syzygy wrote: Fri Aug 17, 2018 8:49 pm For the same reason I have some doubt that replacing Huffman with any of the ANS entropy encoders is a good idea. It would make the tables a bit smaller, but that is just one part of the story. (For DTZ tables probed only at the root it likely does make a lot of sense.)
I think it's worth trying ANS with probabilities delivered by a convolutional neural network fed with the position. I think a well-trained CNN can deliver an almost exact evaluation, which could dramatically reduce memory usage...
How are you going to run an NN on all the positions in a 7-piece table during generation?
How much will having to run an NN slow down the probing?
An even more serious problem: when decompressing a block, you would need to know the probabilities for all positions in the block preceding the position you are probing. If a block contains up to 65536 positions, that would mean tens of thousands of NN evaluations just to get to the value of a single position.

Another problem is that your approach is incompatible with my Re-Pair-based approach that first strings together sequences of values into single symbols and then uses a static encoder to compress those symbols. You can't really run an NN on a string of positions.