Compression of Underdetermined Data in a 7-Piece Chess Table

Discussion of chess software programming and technical issues.

Moderator: Ras

Dave Gomboc
Posts: 21
Joined: Sun Aug 15, 2021 12:22 am
Full name: Dave Gomboc

Compression of Underdetermined Data in a 7-Piece Chess Table

Post by Dave Gomboc »

It seems surprising to me that the paper "Compression of Underdetermined Data in a 7-Piece Chess Table" (https://link.springer.com/article/10.31 ... 1916010076), while citing the authors of the RE-PAIR compression method (https://ieeexplore.ieee.org/document/755679), doesn't also cite Syzygy's use of the RE-PAIR compression method. The commit history of https://github.com/syzygy1/tb shows that by April 2013, Syzygy was using the RE-PAIR compression method for compressing its six-piece endgame tables, which precedes the paper by at least a couple of years.

Can anyone clearly explain which portions of this paper represent novel contribution(s) by its authors, and which portions of this paper are reproducing or making a minor update to prior work?
BeyondCritics
Posts: 410
Joined: Sat May 05, 2012 2:48 pm
Full name: Oliver Roese

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by BeyondCritics »

As far as I know, the author never produced anything citable. That in turn means he also never gave anybody credit. He should explain what he has done there; it would help science and himself.
Last edited by BeyondCritics on Mon Apr 07, 2025 3:54 am, edited 1 time in total.
phhnguyen
Posts: 1517
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by phhnguyen »

This is not a new issue. Ronald de Man expressed his anger about that paper, and his reasons, in an old post in this forum a long time ago. I'm in the office so I can't help find it, but you may search for that post.
https://banksiagui.com
A feature-rich chess GUI, based on the open-source Banksia chess tournament manager
Dave Gomboc
Posts: 21
Joined: Sun Aug 15, 2021 12:22 am
Full name: Dave Gomboc

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by Dave Gomboc »

phhnguyen wrote: Mon Apr 07, 2025 3:52 am This is not a new issue. Ronald de Man expressed his anger about that paper, and his reasons, in an old post in this forum a long time ago. I'm in the office so I can't help find it, but you may search for that post.
Thanks. It seems that viewtopic.php?t=60222 may be the post to which you're referring. Reviewing it, I found reference to an older thread that is still online today at the CCRL Discussion Board, which has an area for discussing Endgame Tablebases. Ronald de Man references RE-PAIR in this 2008 posting: http://kirill-kryukov.com/chess/discuss ... 300#p37300.
syzygy
Posts: 5672
Joined: Tue Feb 28, 2012 11:56 pm

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by syzygy »

Dave Gomboc wrote: Sun Apr 06, 2025 10:50 pm It seems surprising to me that the paper "Compression of Underdetermined Data in a 7-Piece Chess Table" (https://link.springer.com/article/10.31 ... 1916010076), while citing the authors of the RE-PAIR compression method (https://ieeexplore.ieee.org/document/755679), doesn't also cite Syzygy's use of the RE-PAIR compression method. The commit history of https://github.com/syzygy1/tb shows that by April 2013, Syzygy was using the RE-PAIR compression method for compressing its six-piece endgame tables, which precedes the paper by at least a couple of years.

Can anyone clearly explain which portions of this paper represent novel contribution(s) by its authors, and which portions of this paper are reproducing or making a minor update to prior work?
Indeed this paper is a near-total rip off. I once posted some of my compression ideas to their google plus page, and their response was entirely dismissive. Then they ended up implementing all the ideas (which is fine) and wrote them down in a paper without any attribution (which is not fine).

I considered writing to the editors of the journal, but in the end I could not be bothered enough :).
Also, I can blame myself for never properly writing stuff up.
abulmo2
Posts: 460
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by abulmo2 »

Dave Gomboc wrote: Sun Apr 06, 2025 10:50 pm It seems surprising to me that the paper "Compression of Underdetermined Data in a 7-Piece Chess Table" (https://link.springer.com/article/10.31 ... 1916010076), while citing the authors of the RE-PAIR compression method (https://ieeexplore.ieee.org/document/755679), doesn't also cite Syzygy's use of the RE-PAIR compression method. The commit history of https://github.com/syzygy1/tb shows that by April 2013, Syzygy was using the RE-PAIR compression method for compressing its six-piece endgame tables, which precedes the paper by at least a couple of years.

Can anyone clearly explain which portions of this paper represent novel contribution(s) by its authors, and which portions of this paper are reproducing or making a minor update to prior work?
As I understand it, Zakharov et al. are behind the Lomonosov 7-piece endgame tables, published in 2012. It is possible that their RE-PAIR compression method predates Syzygy's, although their papers were published later.
Richard Delorme
syzygy
Posts: 5672
Joined: Tue Feb 28, 2012 11:56 pm

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by syzygy »

abulmo2 wrote: Sat Apr 19, 2025 3:03 am
Dave Gomboc wrote: Sun Apr 06, 2025 10:50 pm It seems surprising to me that the paper "Compression of Underdetermined Data in a 7-Piece Chess Table" (https://link.springer.com/article/10.31 ... 1916010076), while citing the authors of the RE-PAIR compression method (https://ieeexplore.ieee.org/document/755679), doesn't also cite Syzygy's use of the RE-PAIR compression method. The commit history of https://github.com/syzygy1/tb shows that by April 2013, Syzygy was using the RE-PAIR compression method for compressing its six-piece endgame tables, which precedes the paper by at least a couple of years.

Can anyone clearly explain which portions of this paper represent novel contribution(s) by its authors, and which portions of this paper are reproducing or making a minor update to prior work?
As I understand it, Zakharov et al. are behind the Lomonosov 7-piece endgame tables, published in 2012. It is possible that their RE-PAIR compression method predates Syzygy's, although their papers were published later.
Nope, they took it from my generator and/or my posts on various forums. This is of course fine, but they committed severe academic fraud by then publishing it without attribution.
https://web.archive.org/web/20180713195 ... C252LggQQS
7-man TB on April 12, 2013 wrote:Found some interesting research.

http://kirill-kryukov.com/chess/discuss ... 3e52f02879

Although a 6-man generator is not something exciting, there are two very interesting ideas in the project.

1) Using the RE-PAIR algorithm for tablebase compression. Compared to LZMA (which is used for LTB), the advantage of RE-PAIR is having a common dictionary for the whole file (LZMA builds a separate dictionary for each block of data). This provides better compression for small blocks, allowing them to be made as small as you need. So two important problems can be solved: (a) better compression, (b) smaller blocks and accordingly faster access time.

2) The probing code keeps compressed blocks in memory, so memory usage can be several times lower. But you pay in access time for this, so it is still unclear whether or not to use this method.

Possibly some other findings from the project should be checked, but we definitely need to check how RE-PAIR compression will work for 7-man DTM tables.

It would be good if we were able to migrate to 4K or 8K blocks instead of the current 16K blocks, but many things are still unclear: compression ratio, reasonable dictionary size, compression time, ...
And the similarities between what is in the paper and what is in my generator go way beyond the use of Re-Pair compression.
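
For readers not familiar with Re-Pair, the "common dictionary for the whole file" point in the quoted post can be illustrated with a rough sketch (illustrative C, invented names, not the actual probing code): every block in the file is just a short run of symbols, and all of them expand through one file-wide rule table, so blocks can be made very small without hurting compression.

Code: Select all

#include <stdint.h>
#include <stddef.h>

/* Hypothetical file-wide Re-Pair dictionary: symbols 0..255 are literal
   bytes, symbols >= 256 expand into a pair of earlier symbols.  One such
   table serves every block in the file, unlike per-block LZMA state. */
struct repair_dict {
    uint16_t left[4096];   /* left child of symbol s (used for s >= 256)  */
    uint16_t right[4096];  /* right child of symbol s (used for s >= 256) */
};

/* Expand one symbol recursively into the output buffer.
   Returns the number of bytes written. */
static size_t expand(const struct repair_dict *d, uint16_t sym, uint8_t *out)
{
    if (sym < 256) {           /* literal byte */
        *out = (uint8_t)sym;
        return 1;
    }
    size_t n = expand(d, d->left[sym], out);
    return n + expand(d, d->right[sym], out + n);
}

/* Decompress one small block: expand each stored top-level symbol.
   Because the dictionary is shared, a block can consist of only a handful
   of symbols and still compress well. */
size_t decompress_block(const struct repair_dict *d,
                        const uint16_t *block, size_t nsyms, uint8_t *out)
{
    size_t total = 0;
    for (size_t i = 0; i < nsyms; i++)
        total += expand(d, block[i], out + total);
    return total;
}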
Dave Gomboc
Posts: 21
Joined: Sun Aug 15, 2021 12:22 am
Full name: Dave Gomboc

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by Dave Gomboc »

abulmo2 wrote: Sat Apr 19, 2025 3:03 am As I understand it, Zakharov et al. are behind the Lomonosov 7-piece endgame tables, published in 2012. It is possible that their RE-PAIR compression method predates Syzygy's, although their papers were published later.
They do have several publications. "On Solving the Problem of 7-Piece Chess Endgames" (June 2019, https://link.springer.com/article/10.11 ... 881903006X) indicates that using LZMA compression, they fit all the tables into 140TB. According to the 2019 article, their subsequent work (which included using RE-PAIR) apparently got it down to 96TB; that is the work that I understand was being reported on in the 2016 paper that is the subject of this thread. Yet we have already seen a reference to RE-PAIR made by Ronald de Man as far back as 2008 (http://kirill-kryukov.com/chess/discuss ... 300#p37300).
syzygy wrote: Fri Apr 18, 2025 3:49 pm Indeed this paper is a near-total rip off. I once posted some of my compression ideas to their google plus page, and their response was entirely dismissive. Then they ended up implementing all the ideas (which is fine) and wrote them down in a paper without any attribution (which is not fine).
The text at https://web.archive.org/web/20180713195 ... C252LggQQS doesn't read as "entirely dismissive" to me, taking into account not only the Apr. 16, 2013 remarks but also the original Apr. 12, 2013 posting. The syzygy1/tb repo on github shows that you open-sourced Syzygy's code circa April 1, 2013, just before these postings were made.
syzygy wrote: Fri Apr 18, 2025 3:49 pm I considered writing to the editors of the journal, but in the end I could not be bothered enough :).
Also, I can blame myself for never properly writing stuff up.
syzygy wrote: Sat Apr 19, 2025 4:21 am And the similarities between what is in the paper and what is in my generator go way beyond the use of Re-Pair compression.
Using LZMA doesn't strike me as particularly surprising. Such a basic point did not originate with them, but it didn't originate with Syzygy either. However, the sentiment you express of the entire paper in question being a "near-total rip off" is certainly concerning. I was wondering whether or not you had contacted the publishing journal about it, and what any response provided might have been.

You have definitely been helpful in the past when people have asked about how Syzygy's code works, but it's also true that a proper freestanding write-up would be useful in its own right. Syzygy's code being open-sourced is helpful, but it is rather difficult to understand as-is. I would even go so far as to say that Syzygy's C code looks much more like C generated by compiling some higher-level language down to C than like C code authored directly by a human. Then again, when the human writing the code is not a software developer by profession, just about anything is possible.

Anyway, back to my original question:
Dave Gomboc wrote: Sun Apr 06, 2025 10:50 pm Can anyone clearly explain which portions of this paper represent novel contribution(s) by its authors, and which portions of this paper are reproducing or making a minor update to prior work?
Why do I ask this?

- It would be useful to understand what material in the article applies to the Syzygy table generation code as of April 2013, and also as of today, should there be a difference between then and now. This is particularly important because, so far as I know, the Lomonosov table-generation code is not itself publicly available.

- I intend to write up my own doctoral dissertation in the not-so-distant future. As I do that, I want to be able to make appropriate attributions whenever I discuss the related work that preceded my own recent research. I anticipate that detailed remarks on which material discussed in the academic paper is or isn't already present in the Syzygy code as of April 2013 would be useful when I do that writing.
syzygy
Posts: 5672
Joined: Tue Feb 28, 2012 11:56 pm

Re: Compression of Underdetermined Data in a 7-Piece Chess Table

Post by syzygy »

Dave Gomboc wrote: Thu Apr 24, 2025 11:58 pm The text at https://web.archive.org/web/20180713195 ... C252LggQQS doesn't read as "entirely dismissive" to me, taking into account not only the Apr. 16, 2013 remarks but also the original Apr. 12, 2013 posting. The syzygy1/tb repo on github shows that you open-sourced Syzygy's code circa April 1, 2013, just before these postings were made.
That one is not, but I made a post on their Google+ page earlier, either in 2012 or early 2013. It has probably been archived, but I don't know of a way to find it again. It is not very important. The Google+ post we do have leaves no doubt where they got the idea from.
Using LZMA doesn't strike me as particularly surprising. Such a basic point did not originate with them, but it didn't originate with Syzygy either. However, the sentiment you express of the entire paper in question being a "near-total rip off" is certainly concerning. I was wondering whether or not you had contacted the publishing journal about it, and what any response provided might have been.
LZMA/straightforward compression of large blocks is indeed not used by me but it is what they used before adopting my approach. Nalimov already did the same (not with LZMA, but it can use any general compression algorithm).

What they did take, as far as I can tell:
- changing the order of pieces in the indexing formula to improve compression.
- splitting the tables into ternary and "full" tables, with the ternary information "removed" from the full tables.
- the use of "underdetermined" data for broken positions and positions with captures etc. (I call it don't care/partial don't care.)
- the use of Re-Pair, which happens to work well together with "underdetermined" data.
- the use of Huffman coding to encode the final Re-Pair symbols.
- I think the initial replacement rules are nearly identical to what I use.

Of course I am not claiming these were my original ideas.
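
To illustrate the ternary/"full" split, here is a rough sketch (invented names, not the real probing code): the small ternary table answers win/draw/loss on its own, so the full table only has to be exact for decisive positions, which gives the compressor extra don't-care freedom there.

Code: Select all

#include <stdint.h>

enum wdl { LOSS = -1, DRAW = 0, WIN = 1 };

/* Assumed helpers, not a real API. */
extern enum wdl probe_ternary(uint64_t idx);  /* small ternary table        */
extern int      probe_full(uint64_t idx);     /* large table, e.g. DTZ/DTM  */

/* Distance value for a position: drawn positions never consult the full
   table, so whatever the full table stores for them is a don't care and
   can be chosen freely to help compression. */
int probe_distance(uint64_t idx)
{
    enum wdl w = probe_ternary(idx);
    if (w == DRAW)
        return 0;                 /* no distance needed for draws */
    return probe_full(idx);       /* only has to be exact for decisive positions */
}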

The idea of re-ordering the pieces was mentioned in this very nice master's thesis:
https://jespertk.dk/master_thesis/index.html
https://jespertk.dk/master_thesis/report/report.pdf
This thesis also contains the idea of using don't care values for broken positions.

What Lomonosov may also have taken from my generator is simplifying the indexing formula to something less efficient but more flexible in terms of the number of permutations of piece orders that can be tried. The paper is not entirely clear on this, but it seems they initially spent a lot of effort on eliminating illegal and duplicate positions. I also did this in earlier generators (with many thousands of lines of code to "optimally" index all 5-piece suicide tables), but it is simply not worth it. Illegal positions are don't cares and can anyway be compressed away. Duplicate positions (i.e. incomplete removal of symmetric positions) are a bit of a waste, but if the right permutation improves compression by 30%, then who cares about 1% redundant positions.
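
As a toy illustration of the "less efficient but more flexible" indexing (a simplification, not the generator's actual scheme): give every piece a full 0..63 factor and let a per-table permutation decide the order of the factors; illegal and duplicate positions simply stay in the index space as don't cares.

Code: Select all

#include <stdint.h>

/* Toy index for an n-piece table: each piece contributes a full 0..63
   factor, ordered by a per-table permutation.  Illegal and duplicate
   positions remain in the index space as don't cares. */
uint64_t position_index(const int squares[], const int perm[], int n)
{
    uint64_t idx = 0;
    for (int i = 0; i < n; i++)
        idx = idx * 64 + (uint64_t)squares[perm[i]];
    return idx;
}

/* Trying a different permutation costs nothing in code: the generator can
   build the table under several piece orders and keep whichever one
   compresses best. */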

I think some form of compression of (partial?) "don't care" values was used in the Chinook tablebases.
Indeed see section 3.3:
https://staff.ru.is/yngvi/pdf/SchaefferBBLLS03.pdf
I'm not sure if Chinook also used "partial" don't cares. In checkers, the value of a position that allows one or more captures can be determined by evaluating the captures. In chess, having a winning capture solves the position (at least for WDL and DTZ) but if the best capture only draws, then you only know that the position is not a loss. It could be a draw but it could also be a win by a non-capture move. So for WDL a win has to be stored as a win, but a draw can be stored either as a draw or as a loss.
The Chinook tables may or may not use partial don't cares to deal with "threatened captures". If a side has a move that forces the opponent to make a losing capture, then the position is a win (-> don't care). If a side has a move that forces the opponent to make a drawing capture, then the position is either a draw or a win, not a loss (-> partial don't care). But do they use partial don't cares? Probably not.
Even if not, this section might nevertheless have been what gave me the idea.
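
A minimal sketch of the WDL probing logic this implies (invented names, ignoring 50-move subtleties): the stored value only has to be a lower bound that is tight whenever the captures do not already decide the position, which is exactly what makes full and partial don't cares possible.

Code: Select all

#include <stdint.h>

enum wdl { LOSS = -1, DRAW = 0, WIN = 1 };

/* Assumed helpers, not the real API. */
extern enum wdl stored_wdl(uint64_t idx);          /* value from the table   */
extern enum wdl best_capture_wdl(const void *pos); /* evaluate captures only */

/* A position with a winning capture is a full don't care (anything may be
   stored); one whose best capture draws is a partial don't care (a draw may
   be stored as a draw or as a loss).  A win reached only by a quiet move
   must be stored as a win. */
enum wdl probe_wdl(const void *pos, uint64_t idx)
{
    enum wdl c = best_capture_wdl(pos);
    enum wdl s = stored_wdl(idx);
    return c > s ? c : s;   /* max of capture result and stored value */
}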

When using Re-Pair, it turns out that don't cares should not always be set to the most frequent symbol (or "dominant value" as the Chinook paper calls it) but that you can be smarter (see the "initial replacement rules" of the Lomonosov paper). Finding the best rules took a lot of experimentation, with some rules working well for certain tables but not at all for others. I think the 5th rule mentioned in the Lomonosov paper in the end made things work well for most/all tables.
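
A crude sketch of this kind of rule (handling only full don't cares, and not the actual replacement rules of either generator): instead of always writing the dominant value into a don't-care slot, first try to copy the symbol to its left so that Re-Pair sees more repeated pairs.

Code: Select all

#include <stddef.h>
#include <stdint.h>

#define DONT_CARE 0xFF   /* sentinel for an unconstrained position */

/* Fill don't cares before Re-Pair.  Naive rule: use the globally most
   frequent value.  Better rule (per the discussion above): copy the left
   neighbour so that adjacent pairs repeat more often. */
void fill_dont_cares(uint8_t *v, size_t n, uint8_t dominant)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i] != DONT_CARE)
            continue;
        if (i > 0 && v[i - 1] != DONT_CARE)
            v[i] = v[i - 1];      /* extend the run to the left */
        else
            v[i] = dominant;      /* fall back to the dominant value */
    }
}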

Re-Pair is obviously not by me (but by Larsson and Moffat: https://ieeexplore.ieee.org/document/755679 ).
My generator does not use Re-Pair as described in the paper but rather an approximation. It stops once there are 4095 or so symbols (and then uses Huffman) instead of re-pairing until the whole table is reduced to 1 top-level symbol, and it does not use the time-efficient (but complicated and memory-intensive) algorithm of Larsson and Moffat. Also it is not guaranteed to find the "best" pairings because it tries to do multiple pairings in each pass, but I found that changing that to 1 pairing per pass surprisingly gave worse (and slower) compression.
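
For concreteness, here is a very small and deliberately slow sketch of Re-Pair with a symbol budget (one pairing per pass and quadratic pair counting; not the Larsson-Moffat algorithm, and not the generator's actual approximation, which pairs multiple times per pass and stops around 4095 symbols).

Code: Select all

#include <stdint.h>
#include <string.h>

#define MAX_SYM 512          /* toy symbol budget; a real one would be larger */

/* One Re-Pair rule: new symbol -> (left, right). */
struct rule { uint16_t left, right; };

/* Replace the single most frequent adjacent pair per pass until no pair
   occurs at least twice or the symbol budget is exhausted.  Returns the
   new sequence length; rules[] (size MAX_SYM) receives one entry per new
   symbol; *nsyms starts at 256 (the literal byte symbols). */
size_t repair_compress(uint16_t *seq, size_t n,
                       struct rule *rules, int *nsyms)
{
    static uint32_t count[MAX_SYM][MAX_SYM];

    while (*nsyms < MAX_SYM) {
        memset(count, 0, sizeof count);
        for (size_t i = 0; i + 1 < n; i++)
            count[seq[i]][seq[i + 1]]++;

        /* find the most frequent adjacent pair */
        uint32_t best = 1;
        uint16_t bl = 0, br = 0;
        for (int a = 0; a < *nsyms; a++)
            for (int b = 0; b < *nsyms; b++)
                if (count[a][b] > best) {
                    best = count[a][b];
                    bl = (uint16_t)a;
                    br = (uint16_t)b;
                }
        if (best < 2)
            break;                            /* nothing worth pairing */

        /* introduce a new symbol and rewrite the sequence in place */
        uint16_t s = (uint16_t)(*nsyms)++;
        rules[s].left = bl;
        rules[s].right = br;
        size_t j = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 1 < n && seq[i] == bl && seq[i + 1] == br) {
                seq[j++] = s;
                i++;                          /* skip the absorbed symbol */
            } else {
                seq[j++] = seq[i];
            }
        }
        n = j;
    }
    return n;
}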

Huffman coding was invented by Huffman. Moffat and Turpin described a very nice implementation:
https://citeseerx.ist.psu.edu/document? ... c4f56a9226
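
And a compact sketch of that last step (quadratic "merge the two smallest", not the Moffat-Turpin implementation): compute Huffman code lengths for the surviving Re-Pair symbols from their frequencies; canonical codes can then be assigned from the lengths alone.

Code: Select all

#include <stdint.h>

/* Compute Huffman code lengths for n symbols from their frequencies.
   Quadratic sketch only; assumes n <= 512 and every frequency > 0. */
void huffman_lengths(const uint64_t freq[], int n, uint8_t len[])
{
    uint64_t w[2 * 512];       /* node weights: n leaves + n-1 internal nodes */
    int parent[2 * 512];
    int alive[2 * 512];
    int total = n;

    for (int i = 0; i < n; i++) {
        w[i] = freq[i];
        alive[i] = 1;
        parent[i] = -1;
    }

    for (int merges = 0; merges < n - 1; merges++) {
        int a = -1, b = -1;    /* indices of the two smallest live nodes */
        for (int i = 0; i < total; i++) {
            if (!alive[i]) continue;
            if (a < 0 || w[i] < w[a]) { b = a; a = i; }
            else if (b < 0 || w[i] < w[b]) { b = i; }
        }
        int p = total++;       /* new internal node */
        w[p] = w[a] + w[b];
        alive[p] = 1;
        parent[p] = -1;
        alive[a] = alive[b] = 0;
        parent[a] = parent[b] = p;
    }

    /* code length of a leaf = its depth in the tree */
    for (int i = 0; i < n; i++) {
        int d = 0;
        for (int j = i; parent[j] >= 0; j = parent[j]) d++;
        len[i] = (uint8_t)d;
    }
}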
You have definitely been helpful in the past when people have asked about how Syzygy's code works, but it's also true that a proper freestanding write-up would be useful in its own right. Syzygy's code being open-sourced is helpful, but it is rather difficult to understand as-is. I would even go so far as to say that Syzygy's C code looks much more like C generated by compiling some higher-level language down to C than like C code authored directly by a human. Then again, when the human writing the code is not a software developer by profession, just about anything is possible.
Well, it is true that a professional programmer could simply not have written this.
I'm not sure which part you think looks like it was generated from another language.

In my view the worst are the parts of tbgen.c, reduce.c and reduce_tmpl.c, and iterate() in xtbgen.c, which implement the same set of hand-coded table-generation rules. It would be nice if these rules could be specified in just one place. Modifying the generator to generate Shatranj tables is almost trivial, except that Shatranj has a 70-move rule, which means that the generator has to do a "reduction" step before dealing with the 70-move rule, which means making many error-prone one-off changes in 4 places.
Anyway, back to my original question:
Dave Gomboc wrote: Sun Apr 06, 2025 10:50 pm Can anyone clearly explain which portions of this paper represent novel contribution(s) by its authors, and which portions of this paper are reproducing or making a minor update to prior work?
I don't see any novel contributions in the paper (apart from the write up itself). A minor update might be how they applied partial don't cares to DTM, but this is totally straightforward. If a capture leads to a winning mate in 10, then you can store any value >= 10 because the probing code will know (from trying the captures) that the real value must be <=10. Then "Algorithm 1" is "replace with the most frequent value", "Algorithm 2" is "try to form a frequent pair with the symbol on the left", and it turns out that Algorithm 2 works better (as we already knew). [It is not immediately clear to me why this would reduce the speed of access.]
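
A sketch of the probing step implied here (invented names; DTM as distance to mate for the winning side): the table value is only trusted up to the best capture's mate distance, so any stored value >= that distance decodes to the correct answer.

Code: Select all

#include <stdint.h>

/* Assumed helpers, not a real API. */
extern int stored_dtm(uint64_t idx);          /* value from the table         */
extern int best_capture_dtm(const void *pos); /* mate distance via the best
                                                 winning capture, or a large
                                                 sentinel if there is none    */

/* For a won position, the true DTM is the minimum of the best capture line
   and the best non-capture line.  If the best capture mates in c, the
   generator may store any value >= c, because the probe clamps to c. */
int probe_dtm(const void *pos, uint64_t idx)
{
    int c = best_capture_dtm(pos);
    int s = stored_dtm(idx);
    return s < c ? s : c;            /* min(stored, capture) */
}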

There might be some difference in the replacement rules. I remember that after reading the paper for the first time I tried a small change based on the paper, but it did not seem to improve anything.
I wonder about "we change each block with length n from symbols * to E" in the "3. Final replacement" step. I don't think there can be any blocks of partial don't cares that can be replaced by another symbol, since that would have happened already in step 2? Maybe I am missing something.