elcabesa wrote: Sat Jun 09, 2018 1:16 am
thank you for the papers, I'll try to study them.
Have you ever thought about using arithmetic coding instead of Huffman? And in case you rejected it, why? Is it more complex/slower to decompress? Does it not give enough of a compression boost?
It is a lot slower than Huffman, and the Re-Pair step ensures that the symbol frequencies are such that Huffman will do OK.
I have recently had another look at variations of arithmetic coding such as "Finite State Entropy" coding. The problem is that those approaches tend to need rather large tables, which might not perform well here. The probing code only needs to decompress about 32 bytes at a time, so those tables cannot be assumed to be in the L1 cache (and may not be in the CPU cache at all).
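To make the cache concern concrete, here is a rough back-of-the-envelope sketch (the entry layout and field names are assumptions, not any real FSE implementation) of how large a tANS/FSE decode table gets once the symbol field has to hold the roughly 4k Re-Pair symbols:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical tANS/FSE decode-table entry.  The symbol field needs
 * 16 bits because Re-Pair leaves roughly 4k distinct symbols, so it
 * does not fit in a byte the way it does in byte-oriented coders. */
struct fse_entry {
    uint16_t symbol;     /* decoded (Re-Pair) symbol */
    uint8_t  nb_bits;    /* bits to read from the stream for the next state */
    uint16_t next_state; /* base of the next state */
};

int main(void)
{
    /* With ~4k symbols, every symbol needs at least one table slot,
     * so the table log can hardly be below 12. */
    for (int table_log = 12; table_log <= 14; table_log++) {
        size_t bytes = sizeof(struct fse_entry) << table_log;
        printf("table_log %d: %zu entries, %zu KiB\n",
               table_log, (size_t)1 << table_log, bytes >> 10);
    }
    return 0;
}
```

Already at the minimum table log of 12 that is about 24 KiB per table, comparable to a typical 32 KiB L1 data cache that the engine's own search data is also competing for.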
Huffman decoding basically just needs a small table with one integer per codeword length. That is also its weakness: each symbol is encoded with an integral number of bits. If you want to allow a fractional number of bits per symbol, you cannot avoid a table that somehow encodes that fractional length for each symbol. Decoding will then either have to loop through that table (with about 4k elements) or do a lookup in a large table that maps a relatively large integer to the right symbol (16k elements if you want to be able to assign fractional codeword lengths to the 4k symbols with a resolution of 1/16th of a bit).
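For comparison, here is a minimal sketch (made-up names and layout, not the actual probing code) of how a canonical Huffman decoder gets by with just that per-length information plus a symbol table:

```c
#include <stdint.h>

/* Minimal canonical-Huffman decoder state: per codeword length l we
 * only keep base[l], the first (left-aligned) code of that length,
 * and offset[l], the index of its symbol in symtab[].  Illustrative
 * only; field names and sizes are assumptions. */
struct huff {
    int min_len;              /* shortest codeword length in use */
    uint64_t base[32];        /* first code of each length, left-aligned in 64 bits */
    uint32_t offset[32];      /* symbol-table index of that first code */
    uint16_t symtab[4096];    /* symbols in canonical order */
};

/* Decode one symbol from 'code', the next bits of the stream
 * left-aligned in a 64-bit word; returns the symbol and stores the
 * codeword length in *len so the caller can advance the stream. */
static uint16_t huff_decode(const struct huff *h, uint64_t code, int *len)
{
    int l = h->min_len;
    /* Canonical codes: base[] decreases as l grows and is 0 for the
     * longest length, so this scan finds the length whose left-aligned
     * range contains 'code' and always terminates. */
    while (code < h->base[l])
        l++;
    *len = l;
    /* Rank of this codeword among the codes of length l gives the
     * position of its symbol in the canonical symbol table. */
    return h->symtab[h->offset[l] + (uint32_t)((code - h->base[l]) >> (64 - l))];
}
```

The per-length arrays are tiny and the scan touches at most a handful of base[] entries, so the only sizeable structure is the symbol table itself, which is read once per decoded symbol.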
Having said that, it would be interesting to know how much could theoretically be saved by using an arithmetic coding variant.
Maybe I am too pessimistic about the performance of modern arithmetic coding variants. One would really need to measure it in a realistic setting (so in a chess engine doing a deep search in the late middlegame and getting lots of TB hits in a large variety of TBs).