Daniel Shawul wrote: Martin,
I wrote a custom LZ + Huffman (Deflate) lib for my bitbases with the limited knowledge I had at the time:
https://github.com/dshawul/compress. Do you think the LZ part of it could be improved for compression size or decompression speed with your library?
Daniel
Hi Daniel,
Regarding compression size, you already do lazy matching, so my implementation can't beat yours there - I do the same thing.
Since your compressor uses entropy coding, it should be superior to my plain LZ in terms of ratio.
I only do a plain hash chain (or hash list), and my compressor is VERY slow (in part because I use a 64k "sliding dictionary").
However, since it's block-based, I plan to parallelize the compressor.
The next step could be optimal parsing (using suffix trees - slow & probably patented?), or you may want to look at this:
http://fastcompression.blogspot.cz/2011 ... egies.html
My deep lazy matching does about 1% worse than that (comparing my LZ4 implementation against the original LZ4 compressor).
If you want something better than Huffman (while staying static), you can look at ANS (Jarek Duda) or FSE - two terms for basically the same thing. It's very interesting, to say the least (note: I haven't implemented it, so I can't really compare it to Huffman).
Unless you already do them, there are two tricks that improve compression speed (which is not important at all for offline compression anyway):
- when looking for a better (= longer) match, first check the single byte src[-cur_match_offset + best_len] == src[best_len]; a candidate that fails this can never beat the current best, so skipping it early gives a nice gain (see the sketch after this list)
- when looking for a lazy match, first check whether the tail (= [greedy] best_dist + best_len + 2) is present for MIN_MATCH_LEN bytes before running the full lazy-match search
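A minimal sketch of the first trick, assuming a hash-chain match finder (window, chain_next, MAX_CHAIN, MIN_MATCH_LEN and the function itself are made-up names, not taken from either library):

```c
#define MIN_MATCH_LEN 3
#define MAX_CHAIN     64

/* Hypothetical hash-chain scan: window = sliding dictionary,
   cur = current position, avail = bytes readable at cur,
   chain_next[pos] = previous position with the same hash (-1 = end). */
static int find_best_match(const unsigned char *window, int cur, int avail,
                           const int *chain_next, int *best_dist)
{
    int best_len = MIN_MATCH_LEN - 1;

    for (int match = chain_next[cur], depth = 0;
         match >= 0 && depth < MAX_CHAIN;
         match = chain_next[match], depth++) {
        if (best_len >= avail)
            break;
        /* The cheap filter: this candidate can only beat best_len if
           its byte at offset best_len already matches, so test that
           single byte before doing the full comparison. */
        if (window[match + best_len] != window[cur + best_len])
            continue;

        int len = 0;
        while (len < avail && window[match + len] == window[cur + len])
            len++;
        if (len > best_len) {
            best_len = len;
            *best_dist = cur - match;
        }
    }
    return best_len >= MIN_MATCH_LEN ? best_len : 0;
}
```

The lazy-match pre-check from the second bullet works in the same spirit: before running the full search at cur + 1, probe only the MIN_MATCH_LEN tail bytes and bail out if they can't be matched.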
Speaking of decompression performance, your Huffman decoding should be pretty fast already (from what I've seen in your code, you slice it).
In my inflate I use a direct lookup table (11 or 10 bits, I don't remember) and then return the extra bits. Another possibility is this (I haven't implemented it, but Ronald did in the Syzygy bases):
http://www.eecs.harvard.edu/~michaelm/E210/huffman.pdf
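For reference, a sketch of the direct-lookup idea (LOOKUP_BITS, the entry layout and the peek_bits/drop_bits reader are my assumptions, not your API): build a table of 2^LOOKUP_BITS entries indexed by the next LOOKUP_BITS input bits; every slot whose low bits match a code stores that code's symbol and true length, so decoding is one table hit plus consuming only `length` bits.

```c
#define LOOKUP_BITS 11

struct bitreader;                            /* opaque; stands in for */
unsigned peek_bits(struct bitreader *, int); /* the real bit reader   */
void     drop_bits(struct bitreader *, int);

struct lut_entry {
    unsigned short symbol;  /* decoded symbol           */
    unsigned char  length;  /* true code length in bits */
};

static struct lut_entry lut[1 << LOOKUP_BITS];

/* Register one canonical code (lsb-first bit order assumed): fill
   every table slot whose low `len` bits equal `code`. */
static void lut_add(unsigned code, int len, unsigned symbol)
{
    for (unsigned i = code; i < (1u << LOOKUP_BITS); i += 1u << len) {
        lut[i].symbol = (unsigned short)symbol;
        lut[i].length = (unsigned char)len;
    }
}

/* Decode one symbol: peek LOOKUP_BITS bits, one lookup, then drop
   only as many bits as the code really used. */
static unsigned decode_symbol(struct bitreader *br)
{
    struct lut_entry e = lut[peek_bits(br, LOOKUP_BITS)];
    drop_bits(br, e.length);
    return e.symbol;
}
```

Codes longer than LOOKUP_BITS then need a secondary table or a slow path; zlib's inflate handles this with sub-tables hanging off the root table.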
Your bitstream (or bitset in your code) uses an accumulator + a length (number of bits).
What I do in my "bit accumulator" is use a special flag bit while pulling bits.
Since my accumulator goes lsb => msb, I use a "guard" bit: the accumulator holds 24 bits, so I read 3 bytes and then OR in 1 << 24, shifting right as I pull bits.
That way I can always use a simple bit mask to check how many bits are available (= can be read without reloading the accumulator).
Maybe a similar approach could be used in your decoder (but reversed)?
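To make that concrete, here is roughly what the guard-bit accumulator does, rebuilt from memory (names are made up, there's no end-of-buffer handling, and pulls are assumed to be at most 17 bits):

```c
#include <stdint.h>

struct bitacc {
    uint32_t       acc;  /* payload bits with the guard bit just above; */
    const uint8_t *src;  /* start with acc = 1 (empty, guard at bit 0)  */
};

/* Top up the accumulator: locate the guard (its position equals the
   number of payload bits left), strip it, shift whole bytes in above
   the payload while they fit under the 24-bit ceiling, then plant the
   guard on top again. From an empty state this reads exactly 3 bytes
   and ORs in 1 << 24, as described above. */
static void refill(struct bitacc *b)
{
    int g = 0;
    while ((b->acc >> (g + 1)) != 0)
        g++;                               /* guard position        */
    b->acc &= (1u << g) - 1;               /* drop the old guard    */
    while (g <= 16) {                      /* room for a full byte  */
        b->acc |= (uint32_t)*b->src++ << g;
        g += 8;
    }
    b->acc |= 1u << g;                     /* new guard, at most bit 24 */
}

/* At least n bits are available iff some bit at position >= n is set:
   that's the whole availability test - one shift, no bit counter. */
static uint32_t pull_bits(struct bitacc *b, int n)
{
    if ((b->acc >> n) == 0)                /* fewer than n bits left */
        refill(b);
    uint32_t bits = b->acc & ((1u << n) - 1);
    b->acc >>= n;                          /* guard slides down too  */
    return bits;
}
```

The guard replaces a separate bit-count field entirely: consuming bits shifts it down, so its position always equals the number of bits left.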
PS: It's a bit late, so I hope what I wrote makes sense - just sharing some thoughts.
Oh, and speaking of CRC-32, you may want to try slicing, which is significantly faster:
http://create.stephan-brumme.com/crc32/
For example, just computing Adler-32 on the uncompressed data eats about 17% of total decompression time in my library (and Adler-32 is much faster than CRC-32, but alas also much worse in terms of error-detection quality).
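If you want to try it, here is a minimal slicing-by-4 sketch for the usual reflected CRC-32 polynomial (0xEDB88320) - the same family of technique as on that page, though the names and layout here are mine:

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_tab[4][256];

static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)        /* classic bit-at-a-time table */
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1)));
        crc_tab[0][i] = c;
    }
    for (uint32_t i = 0; i < 256; i++)     /* derived tables for slicing  */
        for (int t = 1; t < 4; t++)
            crc_tab[t][i] = (crc_tab[t - 1][i] >> 8)
                          ^ crc_tab[0][crc_tab[t - 1][i] & 0xff];
}

uint32_t crc32_slice4(const uint8_t *p, size_t n, uint32_t crc)
{
    crc = ~crc;
    while (n >= 4) {                       /* 4 bytes per table round */
        crc ^= (uint32_t)p[0]
             | ((uint32_t)p[1] << 8)
             | ((uint32_t)p[2] << 16)
             | ((uint32_t)p[3] << 24);
        crc = crc_tab[3][crc & 0xff]
            ^ crc_tab[2][(crc >> 8) & 0xff]
            ^ crc_tab[1][(crc >> 16) & 0xff]
            ^ crc_tab[0][crc >> 24];
        p += 4;
        n -= 4;
    }
    while (n--)                            /* byte-at-a-time tail     */
        crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
    return ~crc;
}
```

Slicing-by-8 is the same idea with eight tables and 8 bytes per round; the page above benchmarks the variants against each other.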