Compression of chess databases

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Compression of chess databases

Post by rvida »

rvida wrote: I think a 50% reduction can be easily achieved using a domain-specific predictor (i.e. a chess engine). Right now I am running some tests on a sample PGN database and the prediction rates are very compression friendly. Looks like 3-3.5 bits per move is easily achievable while still maintaining the ability to do random accesses with game granularity.

More details later...
Here is my experiment on a PGN of 50861 games (from Olivier Deville's collection).

I used Critter 1.6a as a predictor doing 1 ply full width search + qsearch. Prediction is quite good even though these are human games.

This is the resulting symbol frequency distribution:

Code: Select all

# 1: 35.90540% (1587274)
# 2: 16.77592% (741615)
# 3: 10.12348% (447530)
# 4: 6.53990% (289110)
# 5: 4.82253% (213190)
# 6: 3.69800% (163478)
# 7: 3.16578% (139950)
# 8: 2.64557% (116953)
# 9: 2.23507% (98806)
#10: 1.83142% (80962)
#11: 1.56484% (69177)
#12: 1.64580% (72756)
#13: 1.20171% (53124)
#14: 1.07487% (47517)
#15: 0.92266% (40788)
#16: 0.82613% (36521)
#17: 0.82412% (36432)
#18: 0.75637% (33437)
#19: 0.69821% (30866)
#20: 0.62128% (27465)
#21: 0.40441% (17878)
#22: 0.38331% (16945)
#23: 0.30782% (13608)
#24: 0.23928% (10578)
#25: 0.19614% (8671)
#26: 0.17318% (7656)
#27: 0.10458% (4623)
#28: 0.08406% (3716)
#29: 0.05757% (2545)
#30: 0.04440% (1963)
#31: 0.03561% (1574)
#32: 0.02531% (1119)
rest: 0.06524% (2884)

total games: 50861
total halfmoves: 4420711
halfmoves/game: avg = 86.92, min = 39, max = 391
After the transformation I stored the raw stream of moves in a simple 8-bits-per-move format, then applied the standard zip "deflate" method to it. The size went down from 4420711 raw bytes to 1907969 compressed bytes.
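For anyone who wants to reproduce the general idea, here is a minimal sketch of the move-index transform described above. It assumes the python-chess library and substitutes a toy material-count predictor for Critter's 1-ply search + qsearch, so the resulting index distribution will be noticeably worse than the table above.

Code: Select all

import zlib
import chess
import chess.pgn

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def score_after(board, move):
    # Toy predictor: material balance from the mover's point of view after
    # the move; a real implementation would run a 1-ply search + qsearch.
    board.push(move)
    mover = not board.turn                      # the side that just moved
    score = sum(PIECE_VALUES[p.piece_type] * (1 if p.color == mover else -1)
                for p in board.piece_map().values())
    board.pop()
    return score

def encode_game(game):
    # Replace every played move by its index in the predictor's move ranking;
    # a good predictor makes small indices very frequent (see the table above).
    out = bytearray()
    board = game.board()
    for move in game.mainline_moves():
        ranked = sorted(board.legal_moves,
                        key=lambda m: score_after(board, m), reverse=True)
        out.append(min(ranked.index(move), 255))   # 8 bits per halfmove
        board.push(move)
    return bytes(out)

# Usage: transform all games, then let a generic "deflate" pass exploit the skew.
# with open("games.pgn") as f:
#     games = iter(lambda: chess.pgn.read_game(f), None)
#     raw = b"".join(encode_game(g) for g in games)
# compressed = zlib.compress(raw, 9)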
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

And now you have to distribute the Critter 1.6a July 27, 2012 version with your decompressor... savvy? :) Otherwise it won't decompress properly. Try it on a bigger database with a 4+ ply search and it won't ever finish. At least I was not able to get through the number of positions in the 5-men tables with that predictor. The OP's 10% figure was totally wrong; any compressor would have compressed a PGN much better.
syzygy
Posts: 5569
Joined: Tue Feb 28, 2012 11:56 pm

Re: Compression of chess databases

Post by syzygy »

Daniel Shawul wrote: And now you have to distribute the Critter 1.6a July 27, 2012 version with your decompressor... savvy? :) Otherwise it won't decompress properly. Try it on a bigger database with a 4+ ply search and it won't ever finish.
So a good trick is not to do a 4+ ply search but to keep it cheap. Of course what you can do fully depends on the application. If time is not an issue, you can compress all 7-piece tablebases to a single bit (plus a generator).

For bitbases, doing a full 3 or 4 ply search seems to defeat the purpose of having them. Plus, I doubt that an evaluation function will be able to predict the outcome of a "quiet" tablebase position noticeably better than simply assigning a fixed prediction to all quiet positions depending on the table (e.g. win for KBNvK, draw for KQvKQ). What remains are the positions that you can actually partially or fully resolve in 3 or 4 ply: prove it a win, a loss, at-least-draw or at-most-draw.

To improve compression in my tables, I do one round of captures (if there are any) and look up their outcomes (the tables being two-sided).
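A hedged sketch of that capture-resolution idea, with a hypothetical probe_wdl() callback standing in for the real two-sided table lookup (the prediction is only a bias for the compressor, not an exact value):

Code: Select all

import chess

def predict_wdl(board, probe_wdl, default_prediction):
    # Resolve one round of captures through the (two-sided) tables and take
    # the best outcome for the side to move; this is only a lower bound on
    # the true value, but it is good enough as a compression prediction.
    best = None
    for move in board.legal_moves:
        if not board.is_capture(move):
            continue
        board.push(move)
        result = -probe_wdl(board)   # WDL seen from the mover's point of view
        board.pop()
        if best is None or result > best:
            best = result
    # Quiet positions fall back to a fixed per-table guess,
    # e.g. "win" for KBNvK, "draw" for KQvKQ.
    return default_prediction if best is None else best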
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

syzygy wrote:
Daniel Shawul wrote: And now you have to distribute the Critter 1.6a July 27, 2012 version with your decompressor... savvy? :) Otherwise it won't decompress properly. Try it on a bigger database with a 4+ ply search and it won't ever finish.
So a good trick is not to do a 4+ ply search but to keep it cheap. Of course what you can do fully depends on the application. If time is not an issue, you can compress all 7-piece tablebases to a single bit (plus a generator).
Exactly, it is not about getting the best compression ratio. For instance, I could decide to use many chess engines and select from the PGN headers which two engines to use. I am sure that would improve the compression ratio a lot, but it is just not worth it. PGNs are not going to be stored in RAM, so there is really no need to go that far with compression. I don't even use the 1-ply capture prediction, mainly because it didn't decrease the bitbase sizes after I dropped one of the sides, but it gave reductions on the two-sided bitbases, so it may be worthwhile.
For bitbases, doing a full 3 or 4 ply search seems to defeat the purpose of having them. Plus, I doubt that an evaluation function will be able to predict the outcome of a "quiet" tablebase position noticeably better than simply assigning a fixed prediction to all quiet positions depending on the table (e.g. win for KBNvK, draw for KQvKQ). What remains are the positions that you can actually partially or fully resolve in 3 or 4 ply: prove it a win, a loss, at-least-draw or at-most-draw.

To improve compression in my tables, I do one round of captures (if there are any) and look up their outcomes (the tables being two-sided).
Yes, I am sure many would not want to use bitbases if I were doing a 3-4 ply search on their behalf. I do a 1-ply search since only one-sided bitbases are stored, but of course that overhead can be removed too by downloading the full two-sided bitbases. My experience with prediction is bad. Most of what you can predict with heuristics can be taken care of by general compressors. In bitbases where you really need some domain-dependent prediction, as in KRPkr etc., the predictions won't work. Doing shallow searches of 4+ plies over 5-piece positions is really fast but still won't help. If it did, it would mean there is no need for bitbases anyway, since they are meant to cover knowledge the engine lacks. I like to think of bitbases as a database of 'exceptions' that can be probed in O(1) time. This is important to exclude methods that store the exceptions in a simple list and probe them in O(n).
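To make the O(1) vs O(n) point concrete, a tiny sketch (the mapping from a position to its bit index is assumed to be computed elsewhere; real bitbases use a more elaborate indexing scheme):

Code: Select all

def probe_bitbase(bitbase: bytes, pos_index: int) -> bool:
    # O(1): read a single bit at the position's precomputed index.
    return bool(bitbase[pos_index >> 3] & (1 << (pos_index & 7)))

def probe_exception_list(exceptions: list, pos_index: int) -> bool:
    # O(n): scan a plain list of stored "exception" positions.
    return pos_index in exceptions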
phhnguyen
Posts: 1449
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

rvida wrote:
rvida wrote: I think a 50% reduction can be easily achieved using a domain-specific predictor (i.e. a chess engine). Right now I am running some tests on a sample PGN database and the prediction rates are very compression friendly. Looks like 3-3.5 bits per move is easily achievable while still maintaining the ability to do random accesses with game granularity.

More details later...
Here is my experiment on a PGN of 50861 games (from Olivier Deville's collection).

I used Critter 1.6a as a predictor doing 1 ply full width search + qsearch. Prediction is quite good even though these are human games.

total games: 50861
total halfmoves: 4420711
halfmoves/game: avg = 86.92, min = 39, max = 391
After the transformation I stored the raw stream of moves in a simple 8-bits-per-move format, then applied the standard zip "deflate" method to it. The size went down from 4420711 raw bytes to 1907969 compressed bytes.
Very impressive result!!! Thanks for that!

I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files

I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).

Now I wonder: if you use only 4 bits/move (already saving 50% of the size), how much more can you save if you then compress it with your zip?
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files
Don't play dumb here. If you did that you would get far more than 10%, but I guess we won't see data from you.
I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).
It doesn't matter whether you use 8 bits/move or 4 bits/move; you will probably get the same result, as long as you use the same number of bits per move and don't mess up the entropy.
phhnguyen
Posts: 1449
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

Daniel Shawul wrote:
I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files
Don't play dumb here. If you did that you would get far more than 10%, but I guess we won't see data from you.
I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).
It doesn't matter whether you use 8 bits/move or 4 bits/move; you will probably get the same result, as long as you use the same number of bits per move and don't mess up the entropy.
I guess you didn't understand my post.

I store 1 move in only 4 bits, which means I can pack 2 moves into 1 byte. That way I need only half the file size without using any compressor.
Uri Blass
Posts: 10418
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Compression of chess databases

Post by Uri Blass »

phhnguyen wrote:
Daniel Shawul wrote:
I divide compression of a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes to 1 byte, or to 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as zip, tar, bzip2... (library or software) to the data files
Don't play dumb here. If you did that you would get far more than 10%, but I guess we won't see data from you.
I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits only, you immediately save half of the size without using any extra compressor. That approach is also faster and needs less code (you don't need to integrate any compression libs).
It doesn't matter whether you use 8 bits/move or 4 bits/move; you will probably get the same result, as long as you use the same number of bits per move and don't mess up the entropy.
I guess you didn't understand my post.

I store 1 move in only 4 bits, which means I can pack 2 moves into 1 byte. That way I need only half the file size without using any compressor.
If you use a constant number of bits per move then I do not see how you can store every move in 4 bits.

Humans have already played every legal move from the opening position, and with 4 bits you can represent only 16 different moves.
phhnguyen
Posts: 1449
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

Uri Blass wrote: If you use a constant number of bits per move then I do not see how you can store every move in 4 bits.

Humans have already played every legal move from the opening position, and with 4 bits you can represent only 16 different moves.
Of course not every move, but almost :)

If you read the post from Richard Vida on the first page, you will understand the technique: we use the last value, say 0xF (of the 4 bits), as an "escape" value and store the bigger value in the next whole byte.

Thus it is actually not a constant number of bits.
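A minimal sketch of that escape scheme, assuming a stream of per-halfmove predictor indices in the range 0-255 (the exact layout here is a guess for illustration, not Richard's or phhnguyen's actual format):

Code: Select all

def pack_indices(indices):
    # Indices 0-14 take one nibble; anything >= 15 is written as the escape
    # nibble 0xF followed by a full byte holding the index itself.
    nibbles = []
    for idx in indices:
        if idx < 0x0F:
            nibbles.append(idx)
        else:
            nibbles.append(0x0F)        # escape marker
            nibbles.append(idx >> 4)    # high nibble of the full byte
            nibbles.append(idx & 0x0F)  # low nibble of the full byte
    if len(nibbles) % 2:
        nibbles.append(0x00)            # pad to a whole number of bytes
    return bytes((nibbles[i] << 4) | nibbles[i + 1]
                 for i in range(0, len(nibbles), 2))

def unpack_indices(data, count):
    # Inverse of pack_indices: recover `count` move indices.
    nibbles = [n for b in data for n in ((b >> 4), b & 0x0F)]
    out, i = [], 0
    while len(out) < count:
        if nibbles[i] != 0x0F:
            out.append(nibbles[i])
            i += 1
        else:
            out.append((nibbles[i + 1] << 4) | nibbles[i + 2])
            i += 3
    return out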
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

phhnguyen wrote:
Uri Blass wrote: If you use a constant number of bits per move then I do not see how you can store every move in 4 bits.

Humans have already played every legal move from the opening position, and with 4 bits you can represent only 16 different moves.
Of course not every move, but almost :)

If you read the post from Richard Vida on the first page, you will understand the technique: we use the last value, say 0xF (of the 4 bits), as an "escape" value and store the bigger value in the next whole byte.

Thus it is actually not a constant number of bits.
Like has been said already, variable-length encoding should be the last step of compression. It just messes up the patterns, so feeding that data to a general compressor is bad, which is why you are reporting a 10% compression ratio when in reality it should be more than 300%. In fact, I tried it on the PGN Richard used and got something like 700% compression with 7z... You can squeeze out a few extra bytes by using domain-specific prediction, but storing a variable number of bits per move will just hurt the compressor's performance. Note that Richard used a fixed 8 bits/move, which is good, as that matches the byte granularity that compressors work with.
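A quick way to check this claim on one's own data, reusing the encode_game() and pack_indices() sketches from earlier in the thread (both are illustrative helpers, not anyone's actual implementation):

Code: Select all

import zlib

def compare(indices):
    # indices: one predictor index per halfmove, e.g. the concatenated output
    # of encode_game() over a whole PGN collection.
    fixed  = bytes(min(i, 255) for i in indices)   # fixed 8 bits per move
    packed = pack_indices(indices)                 # 4 bits + escape byte
    print("fixed  + deflate:", len(zlib.compress(fixed, 9)), "bytes")
    print("packed + deflate:", len(zlib.compress(packed, 9)), "bytes")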