Compression of chess databases

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Compression of chess databases

Post by rvida »

rvida wrote: I think a 50% reduction can be easily achieved using a domain-specific predictor (i.e. a chess engine). Right now I am running some tests on a sample PGN database and the prediction rates are very compression friendly. Looks like 3-3.5 bits per move is easily achievable while still maintaining the ability to do random accesses with game granularity.

More details later...
Here is my experiment on a PGN of 50861 games (from Olivier Deville's collection).

I used Critter 1.6a as a predictor doing 1 ply full width search + qsearch. Prediction is quite good even though these are human games.

This is the resulting symbol frequency distribution:

Code: Select all

# 1: 35.90540% (1587274)
# 2: 16.77592% (741615)
# 3: 10.12348% (447530)
# 4: 6.53990% (289110)
# 5: 4.82253% (213190)
# 6: 3.69800% (163478)
# 7: 3.16578% (139950)
# 8: 2.64557% (116953)
# 9: 2.23507% (98806)
#10: 1.83142% (80962)
#11: 1.56484% (69177)
#12: 1.64580% (72756)
#13: 1.20171% (53124)
#14: 1.07487% (47517)
#15: 0.92266% (40788)
#16: 0.82613% (36521)
#17: 0.82412% (36432)
#18: 0.75637% (33437)
#19: 0.69821% (30866)
#20: 0.62128% (27465)
#21: 0.40441% (17878)
#22: 0.38331% (16945)
#23: 0.30782% (13608)
#24: 0.23928% (10578)
#25: 0.19614% (8671)
#26: 0.17318% (7656)
#27: 0.10458% (4623)
#28: 0.08406% (3716)
#29: 0.05757% (2545)
#30: 0.04440% (1963)
#31: 0.03561% (1574)
#32: 0.02531% (1119)
rest: 0.06524% (2884)

total games: 50861
total halfmoves: 4420711
halfmoves/game: avg = 86.92, min = 39, max = 391
After the transformation I stored the raw stream of moves in a simple 8-bits-per-move format, then applied the standard zip "deflate" method to it. The size went down from 4420711 raw bytes to 1907969 compressed bytes.
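For anyone who wants to reproduce the general idea, here is a minimal sketch of the move-index transform described above. It assumes the python-chess library and substitutes a toy material-count predictor for Critter's 1-ply search + qsearch, so the resulting index distribution will be noticeably worse than the table above.

Code: Select all

import zlib
import chess
import chess.pgn

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def score_after(board, move):
    # Toy predictor: material balance from the mover's point of view after
    # the move; a real implementation would run a 1-ply search + qsearch.
    board.push(move)
    mover = not board.turn                      # the side that just moved
    score = sum(PIECE_VALUES[p.piece_type] * (1 if p.color == mover else -1)
                for p in board.piece_map().values())
    board.pop()
    return score

def encode_game(game):
    # Replace every played move by its index in the predictor's move ranking;
    # a good predictor makes small indices very frequent (see the table above).
    out = bytearray()
    board = game.board()
    for move in game.mainline_moves():
        ranked = sorted(board.legal_moves,
                        key=lambda m: score_after(board, m), reverse=True)
        out.append(min(ranked.index(move), 255))   # 8 bits per halfmove
        board.push(move)
    return bytes(out)

# Usage: transform all games, then let a generic "deflate" pass exploit the skew.
# with open("games.pgn") as f:
#     games = iter(lambda: chess.pgn.read_game(f), None)
#     raw = b"".join(encode_game(g) for g in games)
# compressed = zlib.compress(raw, 9)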
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

And now you have to distribute the Critter 1.6a July 27, 2012 version with your decompressor... savvy? :) Otherwise it won't decompress properly. Try it on a bigger database with a 4+ ply search and it won't ever finish. At least I was not able to get through the number of positions in the 5-men tables with that predictor. The OP's 10% figure was totally wrong; any compressor would have compressed a PGN much better.
syzygy
Posts: 5569
Joined: Tue Feb 28, 2012 11:56 pm

Re: Compression of chess databases

Post by syzygy »

Daniel Shawul wrote: And now you have to distribute the Critter 1.6a July 27, 2012 version with your decompressor... savvy? :) Otherwise it won't decompress properly. Try it on a bigger database with a 4+ ply search and it won't ever finish.
So a good trick is not to do a 4+ ply search but to keep it cheap. Of course what you can do fully depends on the application. If time is not an issue, you can compress all 7-piece tablebases to a single bit (plus a generator).

For bitbases, doing a full 3 or 4 ply search seems to defeat the purpose of having them. Plus, I doubt that an evaluation function will be able to predict the outcome of a "quiet" tablebase position noticeably better than simply assigning a fixed prediction to all quiet positions depending on the table (e.g. win for KBNvK, draw for KQvKQ). What remains are the positions that you can actually partially or fully resolve in 3 or 4 ply: prove it a win, a loss, at-least-draw or at-most-draw.

To improve compression in my tables, I do one round of captures (if there are any) and look up their outcomes (the tables being two-sided).
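A hedged sketch of that capture-resolution idea, with a hypothetical probe_wdl() callback standing in for the real two-sided table lookup (the prediction is only a bias for the compressor, not an exact value):

Code: Select all

import chess

def predict_wdl(board, probe_wdl, default_prediction):
    # Resolve one round of captures through the (two-sided) tables and take
    # the best outcome for the side to move; this is only a lower bound on
    # the true value, but it is good enough as a compression prediction.
    best = None
    for move in board.legal_moves:
        if not board.is_capture(move):
            continue
        board.push(move)
        result = -probe_wdl(board)   # WDL seen from the mover's point of view
        board.pop()
        if best is None or result > best:
            best = result
    # Quiet positions fall back to a fixed per-table guess,
    # e.g. "win" for KBNvK, "draw" for KQvKQ.
    return default_prediction if best is None else best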
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote: And now you have to distribute the Critter 1.6a July 27, 2012 version with your decompressor... savvy? :) Otherwise it won't decompress properly. Try it on a bigger database with a 4+ ply search and it won't ever finish.
So a good trick is not to do a 4+ ply search but to keep it cheap. Of course what you can do fully depends on the application. If time is not an issue, you can compress all 7-piece tablebases to a single bit (plus a generator).
Exactly, it is not about getting the best compression ratio. For instance, I could decide to use many chess engines and select from the PGN headers which two engines to use. I am sure that would improve the compression ratio a lot, but it is just not worth it. PGNs are not going to be stored in RAM, so there is really no need to go that far with compression. I don't even use the 1-ply capture prediction, mainly because it didn't decrease the bitbase sizes after I dropped one of the sides, but it gave reductions on the two-sided bitbases, so it may be worthwhile.
For bitbases, doing a full 3 or 4 ply search seems to defeat the purpose of having them. Plus, I doubt that an evaluation function will be able to predict the outcome of a "quiet" tablebase position noticeably better than simply assigning a fixed prediction to all quiet positions depending on the table (e.g. win for KBNvK, draw for KQvKQ). What remains are the positions that you can actually partially or fully resolve in 3 or 4 ply: prove it a win, a loss, at-least-draw or at-most-draw.

To improve compression in my tables, I do one round of captures (if there are any) and look up their outcomes (the tables being two-sided).
Yes, I am sure many would not want to use bitbases if I were doing a 3-4 ply search on their behalf. I do a 1-ply search since only one-sided bitbases are stored, but of course that overhead can be removed too by downloading the full two-sided bitbases. My experience with prediction is bad. Most of what you can predict with heuristics can be taken care of by general compressors. In bitbases where you really need some domain-dependent prediction, as in KRPkr etc., the predictions won't work. Doing shallow searches of 4+ plies over 5-piece positions is really fast but still won't help. If it did, it would mean there is no need for bitbases anyway, since they are meant to cover knowledge the engine lacks. I like to think of bitbases as a database of 'exceptions' that can be probed in O(1) time. This is important to exclude methods that store the exceptions in a simple list and probe them in O(n).
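To make the O(1) vs O(n) point concrete, a tiny sketch (the mapping from a position to its bit index is assumed to be computed elsewhere; real bitbases use a more elaborate indexing scheme):

Code: Select all

def probe_bitbase(bitbase: bytes, pos_index: int) -> bool:
    # O(1): read a single bit at the position's precomputed index.
    return bool(bitbase[pos_index >> 3] & (1 << (pos_index & 7)))

def probe_exception_list(exceptions: list, pos_index: int) -> bool:
    # O(n): scan a plain list of stored "exception" positions.
    return pos_index in exceptions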
phhnguyen
Posts: 1449
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

rvida wrote:
rvida wrote: I think a 50% reduction can be easily achieved using a domain-specific predictor (i.e. a chess engine). Right now I am running some tests on a sample PGN database and the prediction rates are very compression friendly. Looks like 3-3.5 bits per move is easily achievable while still maintaining the ability to do random accesses with game granularity.

More details later...
Here is my experiment on a PGN of 50861 games (from Olivier Deville's collection).

I used Critter 1.6a as a predictor doing 1 ply full width search + qsearch. Prediction is quite good even though these are human games.

total games: 50861
total halfmoves: 4420711
halfmoves/game: avg = 86.92, min = 39, max = 391
After the transformation I stored the raw stream of moves in a simple 8-bits-per-move format, then applied the standard zip "deflate" method to it. The size went down from 4420711 raw bytes to 1907969 compressed bytes.
Very impressive result!!! Thanks for that!

I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files

I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).

Now I wonder: if you use only 4 bits/move (already saving 50% of the size), how much more can you save if you then compress it with your zip?
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files
Don't play dumb here. If you did that you would get far more than 10%, but I guess we won't see data from you.
I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).
It doesn't matter whether you use 8 bits/move or 4 bits/move; you will probably get the same result, as long as you use the same number of bits per move and don't mess up the entropy.
phhnguyen
Posts: 1449
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

Daniel Shawul wrote:
I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files
Don't play dumb here. If you did that you would get far more than 10%, but I guess we won't see data from you.
I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).
It doesn't matter whether you use 8 bits/move or 4 bits/move; you will probably get the same result, as long as you use the same number of bits per move and don't mess up the entropy.
I guess you didn't understand my post.

I store 1 move in only 4 bits, which means I can pack 2 moves into 1 byte. That way I need only half the file size without using any compressor.
Uri Blass
Posts: 10418
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Compression of chess databases

Post by Uri Blass »

phhnguyen wrote:
Daniel Shawul wrote:
I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files
Don't play dumb here. If you did that you would get far more than 10%, but I guess we won't see data from you.
I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).
It doesn't matter whether you use 8 bits/move or 4 bits/move; you will probably get the same result, as long as you use the same number of bits per move and don't mess up the entropy.
I guess you didn't understand my post.

I store 1 move in only 4 bits, which means I can pack 2 moves into 1 byte. That way I need only half the file size without using any compressor.
If you use a constant number of bits per move then I do not see how you can store every move in 4 bits.

Humans have already played every legal move from the opening position, and with 4 bits you can represent only 16 different moves.
phhnguyen
Posts: 1449
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

Uri Blass wrote: If you use a constant number of bits per move then I do not see how you can store every move in 4 bits.

Humans have already played every legal move from the opening position, and with 4 bits you can represent only 16 different moves.
Of course not every move, but almost :)

If you read the post from Richard Vida on the first page, you will understand the technique: we use the last value, say 0xF (of the 4 bits), as an "escape" value and store the bigger value in the next whole byte.

Thus it is actually not a constant number of bits.
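A minimal sketch of that escape scheme, assuming a stream of per-halfmove predictor indices in the range 0-255 (the exact layout here is a guess for illustration, not Richard's or phhnguyen's actual format):

Code: Select all

def pack_indices(indices):
    # Indices 0-14 take one nibble; anything >= 15 is written as the escape
    # nibble 0xF followed by a full byte holding the index itself.
    nibbles = []
    for idx in indices:
        if idx < 0x0F:
            nibbles.append(idx)
        else:
            nibbles.append(0x0F)        # escape marker
            nibbles.append(idx >> 4)    # high nibble of the full byte
            nibbles.append(idx & 0x0F)  # low nibble of the full byte
    if len(nibbles) % 2:
        nibbles.append(0x00)            # pad to a whole number of bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

def unpack_indices(data, count):
    # Inverse of pack_indices: recover `count` move indices.
    nibbles = [n for b in data for n in ((b >> 4), b & 0x0F)]
    out, i = [], 0
    while len(out) < count:
        if nibbles[i] != 0x0F:
            out.append(nibbles[i])
            i += 1
        else:
            out.append((nibbles[i + 1] << 4) | nibbles[i + 2])
            i += 3
    return out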
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

phhnguyen wrote:
Uri Blass wrote: If you use a constant number of bits per move then I do not see how you can store every move in 4 bits.

Humans have already played every legal move from the opening position, and with 4 bits you can represent only 16 different moves.
Of course not every move, but almost :)

If you read the post from Richard Vida on the first page, you will understand the technique: we use the last value, say 0xF (of the 4 bits), as an "escape" value and store the bigger value in the next whole byte.

Thus it is actually not a constant number of bits.
Like has been said already, variable-length encoding should be the last step of compression. It just messes up the patterns, so feeding that data to a general compressor is bad, which is why you are reporting a 10% compression ratio when in reality it should be more than 300%. In fact, I tried it on the PGN Richard used and got something like 700% compression with 7z... You can squeeze out a few extra bytes by using domain-specific prediction, but storing a variable number of bits per move will just hurt the compressor's performance. Note that Richard used a fixed 8 bits/move, which is good, as that matches the byte granularity that compressors work with.
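A quick way to check this claim on one's own data, reusing the encode_game() and pack_indices() sketches from earlier in the thread (both are illustrative helpers, not anyone's actual implementation):

Code: Select all

import zlib

def compare(indices):
    # indices: one predictor index per halfmove, e.g. the concatenated output
    # of encode_game() over a whole PGN collection.
    fixed  = bytes(min(i, 255) for i in indices)   # fixed 8 bits per move
    packed = pack_indices(indices)                 # 4 bits + escape byte
    print("fixed  + deflate:", len(zlib.compress(fixed, 9)), "bytes")
    print("packed + deflate:", len(zlib.compress(packed, 9)), "bytes")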