New dictionary compression technique

Discussion of chess software programming and technical issues.

Moderator: Ras

bitwiseshiftleft
Posts: 2
Joined: Wed Dec 08, 2021 9:48 pm
Full name: Mike Hamburg

New dictionary compression technique

Post by bitwiseshiftleft »

Hello TalkChess,

I developed a compression technique for public-key certificate revocation lists (i.e. web security) but I noticed that the interface is similar to something that occurs in chess programming, so I wonder if anyone is interested.

The idea is to compress a dictionary of type K -> V, where K is an arbitrary set of keys. V can be either uniform power-of-2 data (in which case the technique achieves essentially optimal compression) or a small-ish domain (say, < 1000 items) whose values are not necessarily uniformly distributed (in which case the size is at most the entropy of the V distribution + 11%). Compression is somewhat slow, taking maybe 1µs per key-value pair, but lookups are relatively fast, taking only a few probes into a table and as little as 13 ns (using a weak hash on an M1 processor). In general it takes 1-2 probes for the uniform case, and potentially more for the non-uniform case.

The compression technique allows you to query the dictionary on any one of its keys, but not to determine what the keys are (i.e. it doesn't encode the key set, to save space). If you query it on a key that's not in the dictionary, you get an arbitrary answer. So for example, if you have a database of positions, and it excludes illegal positions or those which would be excluded by an earlier search step, then querying it on one of those positions would give a nonsensical answer -- but it doesn't matter because you won't do that.
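
To make the interface concrete, here is a rough C++ sketch of its shape (the names and types below are purely illustrative, not the actual frayed_cascade API):

Code: Select all

// Illustrative interface sketch only -- not the actual frayed_cascade API.
#include <cstdint>
#include <utility>
#include <vector>

struct CompressedMap {
    std::vector<uint8_t> data;  // compressed table; the key set itself is not stored
};

// Offline (slow) path: build the map from (key, value) pairs,
// on the order of 1 microsecond per pair.
CompressedMap build(const std::vector<std::pair<uint64_t, uint32_t>>& kv_pairs,
                    unsigned value_bits);

// Online (fast) path: a few table probes, tens of nanoseconds when the
// table is in cache. For a key that was in the input set, returns its
// stored value; for any other key the result is arbitrary.
uint32_t query(const CompressedMap& map, uint64_t key);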

I did some initial study to compare this to Syzygy endgame tables, which I admittedly don't entirely understand. It seems to me that Syzygy can get better compression on WDL tables, for two reasons:
  • It breaks the data into shards that are similar, and therefore have reduced entropy; this could easily be replicated (perhaps even better) with my new technique.
  • It can encode the key in a way that's dense, whereas my compression does not assume that the key is dense; this may avoid paying the <= 11% overhead for the nonuniform case, but otherwise it doesn't help.
It is reasonably likely that my technique has faster lookups, and it might also work better for the distance-to-zero (DTZ) tables because the keys there are not dense.

Anyway, I'm not experienced in chess programming, but if this technique would be useful in a chess engine (perhaps with improvements), please let me know so that I can add code to support this.

The code is online, in prototype form, at https://github.com/bitwiseshiftleft/frayed_cascade.
jorose
Posts: 373
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: New dictionary compression technique

Post by jorose »

So I will preface this by mentioning that, while I find it an interesting domain, this might be the subproblem within computer chess that I understand least well.

I haven't (and won't for at least another 10 days) looked into the actual algorithm, but here are my thoughts.

Based on what you wrote:
  • Table generation is too slow for online use, meaning the data must be known a priori.
  • Either the full key set must be known, or alternatively, receiving garbage for unknown keys must be an acceptable cost.
  • For EGTBs, you are hoping that it is faster to access than Syzygy, but expecting that its compression will likely be less effective.
Based on this I am assuming the primary current usage would be endgame tablebases for 6 pieces or fewer.

TB size is a major bottleneck in the adoption of EGTB formats: it is a limiting factor for most users, and larger tables may carry a performance penalty, since they may need to be read from disk. This does not mean your method is necessarily worse than Syzygy (which is a tough horse to beat), but it definitely adds to the challenge.

Lookup speed is the primary objective, under the assumption that size ends up being within certain bounds. So if the size difference is small enough that you don't suffer a performance penalty (i.e. if a user who can load Syzygy into RAM could also load your TBs into RAM comfortably) and your method is indeed faster, then it could become an interesting competitor.

If you do end up trying to implement an EGTB based on your method, my recommendation would be to try to have an interface which is consistent with https://github.com/AndyGrant/Pyrrhic/. Pyrrhic is the most accessible library for accessing Syzygy tablebases and to my knowledge the most widely used for that purpose. So if you support the interface, developers can try out your tablebases with minimal effort and you are much more likely to get some adoption.

In terms of non-EGTB use cases, I can't come up with many off the top of my head. I could see this being useful for custom-format opening books; there are a few ways you could encode a book, and this might appeal to someone who wants that feature. The bad news is that this is a pretty niche feature, since in most competitions explicit opening books are banned, and opening books are viewed as being in the domain of the GUI rather than the engine. I also don't know anything about current opening book formats.

In terms of access for smaller tables, how fast is it? I.e. I am thinking of tables that might already be fast to access, but if the compression is good enough that they fit in a lower level of the memory hierarchy, it might still be worth it?

Finally, some words as you seem to be new here :).

My impression is you made an honest effort to inform yourself, which is not easy and not something every new user here does before trying to post their ideas, so kudos to you on that. I don't know where you gathered your information, but in case you missed it, the CPW is a treasure trove: https://www.chessprogramming.org/Syzygy_Bases.

For better or worse, this forum is filled with a lot of people with strong opinions. It's filled with a lot of smart people who have developed things like Syzygy and Pyrrhic, but it also gets bombarded by a stream of people who perhaps overestimate themselves and their ideas a little. I feel that at times this leads to some short fuses and a low tolerance for mistakes. Improving on Syzygy is a challenge, so I would categorize your idea as ambitious, which is great! But it is also the reason I am warning you about this reality.
-Jonathan
bitwiseshiftleft
Posts: 2
Joined: Wed Dec 08, 2021 9:48 pm
Full name: Mike Hamburg

Re: New dictionary compression technique

Post by bitwiseshiftleft »

Thanks for the info and for the advice. I'll take a look at Pyrrhic, though I am also busy so I'm unlikely to have a Pyrrhic adapter ready particularly soon.

For your question about speed: roughly tens of nanoseconds up to microseconds, depending on the machine, where the filter is in the cache hierarchy, and how many possible outcomes there are (it's not currently performance-optimized for the case where there are many possible outcomes, but I'm pretty sure that can be significantly improved).

Is there a whitepaper anywhere for how Syzygy's compression works? I haven't seen anything other than sparsely-commented code, and it would be nice to avoid reverse engineering it. I suspect that in some cases frayed_cascade will do better (particularly for DTZ50 tables instead of WDL) but I'm not sure, and I would probably need to either generate or reverse-engineer some tablebases in order to make a proper test.
jorose
Posts: 373
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: New dictionary compression technique

Post by jorose »

I unfortunately don't think there is a proper whitepaper, though I'd love to read one. My impression is that a lot of people went the route of trying to understand it based on the code, with differing results.

I haven't had time to go through many resources, but you may find the thread in the following link useful. It is the Syzygy release thread from this forum. http://www.talkchess.com/forum3/viewtop ... 1&start=50

You might also want to write Syzygy a message directly. It is likely he knows some more descriptive resources and, for obvious reasons, he can answer any related questions better than I can.

AndrewGrant is the Pyrrhic dev. If you do end up trying to support a similar interface, there is a good chance he would be willing to help if you ask nicely.
-Jonathan
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: New dictionary compression technique

Post by Fulvio »

In chess, many positions are symmetrical:
https://kirill-kryukov.com/chess/nulp/method.html
so you start by normalizing the searched position and then map that position to its unique ID.

For example, there are 50226 unique positions in the KRK ending:
https://kirill-kryukov.com/chess/nulp/r ... ndgame.txt
In code it would be something like this:

Code: Select all

// Map a position to the canonical member of its symmetry class,
// so that equivalent (mirrored/rotated) positions share one table entry.
Position normalize(Position const& pos);
// Index the normalized position into 0..50225 (one slot per unique KRK position).
unsigned map_to_uniqueID(Position const& pos);

enum WDLScore {
    WDLLoss        = -2, // Loss
    WDLBlessedLoss = -1, // Loss, but draw under 50-move rule
    WDLDraw        =  0, // Draw
    WDLCursedWin   =  1, // Win, but draw under 50-move rule
    WDLWin         =  2, // Win
};

// One WDL entry per unique KRK position (uncompressed in this sketch).
WDLScore tablebase_krk[50226];

WDLScore probe_krk_wdl(Position const& pos) {
    Position transformed_pos = normalize(pos);
    unsigned uniqueID = map_to_uniqueID(transformed_pos);
    return tablebase_krk[uniqueID];
}
In Syzygy the tablebase_krk array would be compressed, first using "Recursive Pairing":
https://en.wikipedia.org/wiki/Re-Pair
and then Huffman coding:
https://en.wikipedia.org/wiki/Huffman_coding

It uses a few other tricks ("I don't care" positions) and it achieves an amazing 0.35 bits per position:
https://lichess.org/blog/W3WeMyQAACQAdf ... e-complete
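
To illustrate the Re-Pair step mentioned above, here is a toy sketch (illustration only, not Syzygy's actual implementation): each round replaces the most frequent adjacent pair of symbols with a fresh symbol, and the shrunken symbol stream is then entropy-coded (Syzygy uses Huffman-style codes).

Code: Select all

// Toy Re-Pair round (illustration only, not Syzygy's implementation):
// replace the most frequent adjacent symbol pair with a fresh symbol.
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Symbol = uint16_t;
using Pair = std::pair<Symbol, Symbol>;

// Performs one round; records the rule next_symbol -> best pair so a
// decompressor can expand it again. Returns false once no pair repeats.
bool repair_round(std::vector<Symbol>& seq, Symbol next_symbol,
                  std::map<Symbol, Pair>& rules) {
    // Count how often each adjacent pair occurs.
    std::map<Pair, size_t> freq;
    for (size_t i = 0; i + 1 < seq.size(); ++i)
        ++freq[{seq[i], seq[i + 1]}];

    // Pick the most frequent pair.
    Pair best{};
    size_t best_count = 1;
    for (const auto& [p, count] : freq)
        if (count > best_count) { best = p; best_count = count; }
    if (best_count < 2) return false;  // nothing repeats: stop

    rules[next_symbol] = best;

    // Greedy left-to-right, non-overlapping replacement.
    std::vector<Symbol> out;
    out.reserve(seq.size());
    for (size_t i = 0; i < seq.size();) {
        if (i + 1 < seq.size() && seq[i] == best.first && seq[i + 1] == best.second) {
            out.push_back(next_symbol);
            i += 2;
        } else {
            out.push_back(seq[i++]);
        }
    }
    seq.swap(out);
    return true;
}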