Tablebase suggestion

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

syzygy
Posts: 5891
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

hgm wrote: Tue Feb 03, 2026 10:03 pm Suppose we have enough memory to 'cache' 3.5 slices (well, 3.5625). Then we can use the order (in parentheses the number of loaded slices):

in: a1 (0.5), b1 (1.5), b2 (2), out: a1 (1.5), in: c1, c2 (3.5), out: b1 (2.5), in: c3 (3), out: b2 (2.5), in: d1 (3.5), out: c1 (2.5), in: d2 (3.5), out: d1 (2.5), in: d3 (3.5), out: c2 (2.5), in: d4 (3+), out: c3, d4, d3 (1), in: c1 (2), out: c1, d2.
So the number of slices is 10?
I suppose one could do the same with 462 (white king, black king) slices?
hgm
Posts: 28461
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tablebase suggestion

Post by hgm »

Indeed, the symmetric case has 10 possible white-King locations, but for 4 of those the slices are only half size.

Whether you can involve both Kings depends on what algorithm is used for the generation. Alternating white and black iterations would make King proximity useless during the white retromoves, as the black King would not move anyway. But then, during the black iterations, in 6 of the white-King slices there are 64 black-King slices, as the white King already uniquely determines the symmetry. So you could cache 10 out of 64, as I described above for the asymmetric case.

If you do both white and black moves in one iteration, I suppose you could do something similar with 462 slices, but it would be complex and hard to visualize. Basically you would be dealing with a 4-dimensional set of K-slices, the direct product of 8x8 boards for the two Kings, where proximity is defined as either the first two or the last two coordinates changing by +/-1 (or staying the same). You should then find a path through that 4d space that visits all slices, and that visits all neighbors of earlier-visited slices as quickly as possible, so that those earlier slices can be flushed out.
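
One way to read that proximity relation (neighboring slices are those where one King makes a step, or stays put, while the other King stays where it is; the symmetry reduction is ignored and the names are only illustrative):

Code: Select all

# Neighbors of a (wK, bK) K-slice in the 4d slice space, squares numbered 0..63.
def king_steps(sq):
    """Squares a King can step to from sq, plus sq itself."""
    f, r = sq % 8, sq // 8
    return [nf + 8 * nr
            for nf in range(max(0, f - 1), min(8, f + 2))
            for nr in range(max(0, r - 1), min(8, r + 2))]

def neighbor_slices(wk, bk):
    """Slices connected to (wk, bk) by a single white or black King step."""
    return {(w, bk) for w in king_steps(wk)} | {(wk, b) for b in king_steps(bk)}

print(len(neighbor_slices(27, 44)))   # wK d4, bK e6 -> 17 slices, (27, 44) included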
Ender8pieces
Posts: 14
Joined: Wed Jan 21, 2026 9:16 am
Full name: JS

Re: Tablebase suggestion

Post by Ender8pieces »

hgm wrote: Tue Feb 03, 2026 10:03 pm Note that a Pawn on the board breaks all symmetry. So there are 64 white-King slices. And by making a rank-wise raster scan of the board you only have to keep 10 in memory: load a1-a8, b1 and b2, and then you can evict a1, load b3, evict a2 etc.

What I had in mind in the symmetric case is this:

The white King is in the 10-square triangle a1-d1-d4. The black King (or a more conveniently chosen tie-breaker piece) only affects the symmetry when the white King is on the diagonal a1-d4. These 4 squares therefore have white-King slices that are only half size. (Well, slightly larger: 36/64 = 9/16.) Note that when a black King move flips the symmetry, the white King never moves because of it; only all the other pieces do.

Suppose we have enough memory to 'cache' 3.5 slices (well, 3.5625). Then we can use the order (in parentheses the number of loaded slices):

in: a1 (0.5), b1 (1.5), b2 (2), out: a1 (1.5), in: c1, c2 (3.5), out: b1 (2.5), in: c3 (3), out: b2 (2.5), in: d1 (3.5), out: c1 (2.5), in: d2 (3.5), out: d1 (2.5), in: d3 (3.5), out: c2 (2.5), in: d4 (3+), out: c3, d4, d3 (1), in: c1 (2), out: c1, d2.

The slice c1 had to be ousted to make room for d2 before the connecting moves between them could be treated, so it has to be loaded a second time to complete the iteration. That means 1 extra K-slice load out of 8.25, in order to keep the memory requirement at 3.5625 (out of 8.25). I suppose this could be done even smarter, by streaming in d2 more or less simultaneously with streaming out c1, and then treating the connecting moves just before you write out the set of positions of c1 corresponding to those you just loaded from d2.

Also note that we could use some simple compression scheme for data that has to be stored in memory while waiting to be acted on, e.g. run-length encoding for sparse bit sets. Only two slices at a time have to remain unpacked in memory: those between which we are treating the King moves.

All this of course causes slowdowns compared to having unpacked complete EGTs in memory all the time. But I look upon this from the economic POV: what is the sense of spending a fortune on RAM only to be done with it as quickly as you can, and then never needing to use it again?
A few comments:
- I completely agree with your economic POV. There is no sense in over-investing in massive RAM for a one-time computation. The significant cost savings of a memory-efficient approach far outweigh the minor slowdowns caused by I/O overhead or compression schemes. Making this feasible on standard hardware is a key goal.

- While pawns break symmetry and allow for simpler partitioning, I believe the true challenge, and the most interesting one, lies in pawnless endgames. The 8x symmetry factor in these cases is a powerful tool for disk space reduction, though it requires a more sophisticated data-flow model to handle the cross-dependencies in RAM.

- I believe the most efficient use of memory occurs when we model the problem in 5 dimensions: four dimensions for the kings' positions and a fifth for the side to move (whose turn it is). This efficiency could potentially enable home-scale computers to participate in calculating 8-piece tables, perhaps through a distributed computing project similar to those successfully used in other scientific fields.
syzygy
Posts: 5891
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

I'm thinking out loud: what if we split an 8-piece table into 2*462 (wtm/btm x (wk,bk)) K-slices of 62*61*60*59*58*57 positions (in the worst case of no duplicate pieces)? A bitmap of one such slice takes up about 5.15 GB.

Before splitting into K-slices, the Wu-Beal algorithm for calculating loss-in-n btm from win-in-n wtm looks like this:

Code: Select all

S1 = Load(W, WIN_IN_N(n));
R1 = Predecessor(S1);
S2 = Load(B, UNKNOWN);
S2 = S2 & R1;
----
R2 = Load(W, WIN_IN<=N(n));
S3 = ProveSuccessor(S2, R2);
B = Add(S3, LOSE_IN_N(n));
Here W, B are intermediate stages of the final 8-piece table (wtm, btm), accessed sequentially and stored on disk.
R1/R2 are temporary bitmaps in RAM of size 462*62*61*60*59*58*57/8 bytes, accessed randomly.
S1/S2/S3 are temporary bitmaps that are accessed sequentially (so can be stored on disk, if not generated on the fly).
S1 is a subset of wtm positions, S2,S3 are subsets of btm positions.
S1 and S3 do not need to be stored separately. S1 is read from W, and S3 is immediately written to B.
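
To be explicit about what Predecessor computes, here is a minimal sketch on plain sets of position indices; unmoves is an assumed helper that yields, for a position, all positions from which a single move by the side that just moved leads to it. A real generator would of course work on bitmaps instead of Python sets:

Code: Select all

# Predecessor(S1): every position with at least one move into S1.
def predecessor(s1, unmoves):
    r1 = set()
    for pos in s1:
        r1.update(unmoves(pos))      # mark all retro-move targets of pos
    return r1

# Toy smoke test with integer "positions" whose predecessors are p-1 and p-2:
print(sorted(predecessor({10, 20}, lambda p: (p - 1, p - 2))))   # [8, 9, 18, 19]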

Let's rewrite the first part a bit:

Code: Select all

S2 = 0;
S1 = Load(W, WIN_IN_N(n));
R1 = Predecessor(S1);
S2 |= Load(B, UNKNOWN) & R1;
It is not necessary to do this in one go. We can take a K-slice S1_k of W, calculate Predecessor(S1_k), which will fall within up to 9 btm K-slices (*), AND the result with Load(B, UNKNOWN) and then OR with S2. We only need to hold 9 btm K-slices R1_k in RAM.
Ideally we hold more than 9 btm K-slices in RAM and delay the "AND" and "OR" operations for a btm K-slice R1_k until we need to drop it from RAM to make room for another btm K-slice.
We can choose the order in which we run through K-slices of W, and we can choose which btm K-slices to drop. This gives an interesting optimization problem.

(*) an unmove from a position with one king not on the diagonal to a position with both kings on the diagonal also needs to generate the diagonal mirror-image of the position, but this position is still in the same K-slice.
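
A small sketch of which btm K-slices the Predecessor of a wtm K-slice (wk, bk) can touch, on the assumption that only the black king can retro-step here (the white king does not move when black moves are retracted) and that bk is kept in the rank <= file half whenever wk is on the a1-h8 diagonal; the names are only for illustration:

Code: Select all

# Which btm K-slices can Predecessor of the wtm K-slice (wk, bk) touch?
# Squares are numbered 0..63; wk is assumed canonical.
def predecessor_slices(wk, bk):
    slices = set()
    bf, br = bk % 8, bk // 8
    for nf in range(max(0, bf - 1), min(8, bf + 2)):
        for nr in range(max(0, br - 1), min(8, br + 2)):
            if max(abs(wk % 8 - nf), abs(wk // 8 - nr)) < 2:
                continue                          # kings may not touch
            bk2 = 8 * nr + nf                     # bK stayed here or retro-stepped from here
            if wk % 8 == wk // 8 and nr > nf:     # wK on the diagonal:
                bk2 = 8 * nf + nr                 #   mirror bK into the canonical half
            slices.add((wk, bk2))
    return slices                                 # at most 9 btm K-slices, wK unchanged

print(len(predecessor_slices(10, 38)))   # wK c2, bK g5 -> 9 slices
print(len(predecessor_slices(0, 18)))    # wK a1, bK c3 -> 5 (adjacency and mirroring collapse some)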

The same can be done with the second part:

Code: Select all

R2 = Load(W, WIN_IN<=N(n));
S3 = ProveSuccessor(S2, R2);
B = Add(S3, LOSE_IN_N(n));
ProveSuccessor(S2,R2) calculates the subset of btm positions p in S2 for which all wtm successor positions (obtained by applying black moves) are in R2.
To calculate ProveSuccessor(S2_k, R2) for a K-slice S2_k, we need to have up to 9 wtm K-slices of R2 in RAM.
So we just cache as many wtm K-slices in RAM as we can and try to run through K-slices of S2 in a clever order to try to reduce the number of times each wtm K-slice has to be loaded into RAM. This is a second optimization problem.
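
And a minimal sketch of ProveSuccessor on plain sets; black_moves is an assumed helper yielding the wtm successors of a btm position, and checkmate/stalemate handling is ignored here (a real generator treats positions without legal moves separately):

Code: Select all

# ProveSuccessor(S2, R2): positions in S2 all of whose successors lie in R2.
def prove_successor(s2, r2, black_moves):
    return {p for p in s2 if all(q in r2 for q in black_moves(p))}

# Toy smoke test: the successors of p are 10*p and 10*p + 1.
print(prove_successor({1, 2}, {10, 11, 20}, lambda p: (10 * p, 10 * p + 1)))   # {1}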

Then there is also the part that calculates win-in-(n+1) wtm from loss-in-n btm.
This needs S3 (= loss-in-n btm), so either we do store S3 separately or we extract it on the fly from B.

Code: Select all

R3 = Predecessor(S3);
S4 = Load(W, UNKNOWN);
S4 = S4 & R3;
W = Add(S4, WIN_IN_N(n+1));
Let's rewrite a bit:

Code: Select all

S4 = 0;
R3 = Predecessor(S3);
S4 |= Load(W, UNKNOWN) & R3;
W = Add(S4, WIN_IN_N(n+1));
So this is very similar to the first part above. We go through S3 K-slice by K-slice; Predecessor(S3_k) is a subset of up to 9 wtm K-slices. We keep as many of them in RAM for as long as we can, and if we need to drop one we AND it with Load(W, UNKNOWN) and OR with W.

The same then needs to be done with wtm and btm interchanged. And repeated many times until no new UNKNOWN positions are resolved. The remaining UNKNOWN positions are draws.
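
Structurally, the outer iteration might then look roughly like this; resolve_losses and resolve_wins are placeholders standing for the slice-wise passes sketched above, so this is only a sketch of the control flow:

Code: Select all

# Structural sketch of the outer iteration; the two resolve_* functions are
# placeholders for the slice-wise passes described above and are assumed to
# return the number of newly resolved positions.
WHITE, BLACK = 0, 1

def resolve_losses(n, side):     # parts 1+2: loss-in-n for `side` to move
    return 0                     # placeholder

def resolve_wins(n, side):       # part 3: win-in-n for `side` to move
    return 0                     # placeholder

n = 1
while True:
    newly = 0
    for loser in (BLACK, WHITE):                 # second pass with colors swapped
        newly += resolve_losses(n, loser)
        newly += resolve_wins(n + 1, 1 - loser)
    if newly == 0:
        break        # no new UNKNOWN positions resolved: the rest are draws
    n += 1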

47 GB of RAM (for 9 slices) would be enough, but the more the better (until you can hold 462 slices at once in 2.38 TB).

I guess this should work, but maybe I am overlooking something.

It would be interesting to have an idea of the overhead incurred by being able to hold only N out of 462 slices in RAM at a time.

The same can be done with 7 pieces. A K-slice would be 92.56 MB. So 833 MB would be the minimum, and 42 GB could hold all 462 K-slices.
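
For what it's worth, these sizes are easy to re-derive; the figures quoted above appear to be in binary units:

Code: Select all

# Re-deriving the slice/RAM sizes quoted above (1 bit per position,
# worst case of no duplicate pieces; printed in binary units).
GiB, MiB = 2 ** 30, 2 ** 20

slice8 = 62 * 61 * 60 * 59 * 58 * 57 / 8      # bytes per 8-piece K-slice
slice7 = 62 * 61 * 60 * 59 * 58 / 8           # bytes per 7-piece K-slice

print(slice8 / GiB)          # ~5.15  ("about 5.15 GB" per slice)
print(9 * slice8 / GiB)      # ~46.4  (the "47 GB" minimum for 9 slices)
print(462 * slice8 / GiB)    # ~2381  (the "2.38 TB" for all 462 slices)
print(slice7 / MiB)          # ~92.6  (the "92.56 MB" per 7-piece slice)
print(9 * slice7 / MiB)      # ~833   (the "833 MB" minimum)
print(462 * slice7 / GiB)    # ~41.8  (the "42 GB" for all 462 slices)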

For pawn tables we can split in pawn slices and those can probably be split in K-slices again. It might be sufficient to split into K-slices with one king fixed.
syzygy
Posts: 5891
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

I guess W, B and S2 would each be 462 separate compressed files. Updating one means decompressing, applying the relevant operation, deleting the old file and writing back the compressed new file. I don't think SSD is suitable for this.

When the generation is finished, W and B are used to generate WDL and DTZ files.
To generate them in Syzygy format is not entirely trivial but should be doable. The format would probably have to be extended because some tables will need more than 2^32 compressed blocks. Or I guess simply compress them K-slice by K-slice.
Ender8pieces
Posts: 14
Joined: Wed Jan 21, 2026 9:16 am
Full name: JS

Re: Tablebase suggestion

Post by Ender8pieces »

syzygy wrote: Wed Feb 04, 2026 3:23 am I guess W, B and S2 would each be 462 separate compressed files. Updating one means decompressing, applying the relevant operation, deleting the old file and writing back the compressed new file. I don't think SSD is suitable for this.
I'm not sure why the working files would be stored compressed on the disk. An efficient implementation relies on writing them as uncompressed bitsets during the generation phase. Compression is only relevant as a final post-processing step for the distribution format.

Regarding the number of writes: The described process writes each K-slice at most once per DTZ iteration. These are sequential writes. I estimate that after the first ~200 plies (for tables intended beyond just practical 50-move play), the number of actual disk writes will drop significantly as the data becomes increasingly sparse and many slices "freeze."

I don't see any reason why a modern SSD wouldn't handle a few hundred full-drive sequential writes. While this would require a clear "disclosure" for a distributed computing project (to inform participants about disk wear), for a dedicated generation machine, it's well within the specs of even high-end consumer NVMe drives, let alone enterprise ones.
syzygy
Posts: 5891
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

syzygy wrote: Wed Feb 04, 2026 2:55 am Let's rewrite the first part a bit:

Code: Select all

S2 = 0;
S1 = Load(W, WIN_IN_N(n));
R1 = Predecessor(S1);
S2 |= Load(B, UNKNOWN) & R1;
It is not necessary to do this in one go. We can take a K-slice S1_k of W, calculate Predecessor(S1_k), which will fall within up to 9 btm K-slices (*), AND the result with Load(B, UNKNOWN) and then OR with S2. We only need to hold 9 btm K-slices R1_k in RAM.
Ideally we hold more than 9 btm K-slices in RAM and delay the "AND" and "OR" operations for a btm K-slice R1_k until we need to drop it from RAM to make room for another btm K-slice.
We can choose the order in which we run through K-slices of W, and we can choose which btm K-slices to drop. This gives an interesting optimization problem.
Since only the black king is moving here, we can split this problem into 10 separate problems, each with the white king fixed on one of a1-d1-d4. If the black king crosses the diagonal, nothing happens to the white king.

With wtm and btm reversed, white king moves that leave the a1-d1-d4 triangle will move the black king to one of 3 or 7 other positions, depending on whether the black king is on a diagonal or not. So the black king is in one of 4 (diagonal) plus 6 (off-diagonal) = 10 orbits under the group of symmetries of the chessboard. So again these are 10 separate problems, now with the black king "fixed" in one of those orbits.

Number of (wK,bK) positions with wK fixed:
b1: 58
c1: 58
d1: 58
c2: 55
d2: 55
d3: 55
a1: 33
b2: 30
c3: 30
d4: 30
Total: 462

The numbers with bK "fixed" should be the same.
E.g. bK in {a1,h1,h8,a8}:
a1: 7
h1: 10
h8: 10
a8: 6 (because if wK is on the diagonal, bK can be mirrored to h1).
So indeed 33 (same as wK on a1).
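
These counts are easy to verify by brute force under the convention that seems to be implied here (wK confined to the a1-d1-d4 triangle, bK mirrored across the a1-h8 diagonal so that its rank does not exceed its file whenever wK is on that diagonal, and kings never adjacent); the helper names are only for illustration:

Code: Select all

# Brute-force check of the (wK, bK) counts quoted above.
def sq(name):
    return ord(name[0]) - ord('a'), int(name[1]) - 1     # (file, rank), 0-based

triangle = ["a1", "b1", "c1", "d1", "b2", "c2", "d2", "c3", "d3", "d4"]

def touching(a, b):
    return max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= 1

def count_bk(wk_name):
    wf, wr = sq(wk_name)
    total = 0
    for bf in range(8):
        for br in range(8):
            if touching((wf, wr), (bf, br)):
                continue                  # kings may not touch
            if wf == wr and br > bf:
                continue                  # wK on the diagonal: mirror bK away
            total += 1
    return total

for wk in triangle:
    print(wk, count_bk(wk))               # 33 58 58 58 30 55 55 30 55 30
print(sum(count_bk(wk) for wk in triangle))            # 462

for bk in ["a1", "h1", "h8", "a8"]:       # bK "fixed" on a corner
    bf, br = sq(bk)
    print(bk, sum(1 for wk in triangle
                  if not touching(sq(wk), (bf, br))
                  and not (sq(wk)[0] == sq(wk)[1] and br > bf)))   # 7 10 10 6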

So it seems that if 58 K-slices can fit in RAM, which corresponds to about 300 GB, the K-slice approach should be as efficient as the regular approach that requires 2.38 TB of RAM. With less RAM you have to be a bit more careful to do things in a clever order.

If the table has identical pieces, the size of each slice halves, so the number of slices that fit in RAM doubles.

Again, if I don't overlook anything stupid.
syzygy
Posts: 5891
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

Ender8pieces wrote: Wed Feb 04, 2026 2:23 pm
syzygy wrote: Wed Feb 04, 2026 3:23 am I guess W, B and S2 would each be 462 separate compressed files. Updating one means decompressing, applying the relevant operation, deleting the old file and writing back the compressed new file. I don't think SSD is suitable for this.
I'm not sure why the working files would be stored compressed on the disk. An efficient implementation relies on writing them as uncompressed bitsets during the generation phase. Compression is only relevant as a final post-processing step for the distribution format.
Disk I/O will be the limiting factor, so compression of sparse bitmaps will speed things up.
I don't see any reason why a modern SSD wouldn't handle a few hundred full-drive sequential writes. While this would require a clear "disclosure" for a distributed computing project (to inform participants about disk wear), for a dedicated generation machine, it's well within the specs of even high-end consumer NVMe drives, let alone enterprise ones.
A few hundred full-drive writes will finish off the SSD.
syzygy
Posts: 5891
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

jp wrote: Tue Feb 03, 2026 10:29 am
syzygy wrote: Sun Feb 01, 2026 5:13 pm...
A better argument "against" the utility of 8-piece tablebases might be that 8-piece positions occurring in games tend to be either so balanced that they are an easy draw, or so unbalanced that they are an easy win for the stronger side. Positions where the material differs by exactly one pawn will always have an odd number of pieces. But of course there are lots of ways in which a 4v4 can be unbalanced, and 5v3 can still be nearly balanced.
Yes, since I read the Op1-TB news in this thread, a number of times I've seen an interesting position... and counted that it has 9 pieces.
If your 9-piece position has two pairs of opposing pawns, then it would be in "Op2-TB", which I don't think exists, but it could be generated from 8-piece Op1-TB, just as 8-piece Op1-TB is generated from 7-piece tables.
Ender8pieces
Posts: 14
Joined: Wed Jan 21, 2026 9:16 am
Full name: JS

Re: Tablebase suggestion

Post by Ender8pieces »

syzygy wrote: Wed Feb 04, 2026 3:58 pm
Disk I/O will be the limiting factor, so compression of sparse bitmaps will speed things up.
...
A few hundred full-drive writes will finish the SSD.
It is difficult to be certain without the raw numbers, but I’d like to address the points as candidly as possible:

While Disk I/O is indeed the primary bottleneck, compression has a double-edged effect. It might save bandwidth, but as you noted, it significantly impacts drive endurance and can cause slowdowns due to redundant writes (Write Amplification). A potential middle ground could be running compressed during the initial, data-heavy iterations and switching to raw/uncompressed once the updates become sparse.
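
As a toy illustration of how cheaply a sparse bitmap compresses (plain byte-level run-length encoding here; a real generator would presumably use something block-based or a standard compressor):

Code: Select all

# Toy run-length encoding of a sparse bitmap as (run_length, byte_value) pairs.
import itertools

def rle_encode(bitmap):
    return [(len(list(group)), value) for value, group in itertools.groupby(bitmap)]

def rle_decode(runs):
    return b"".join(bytes([value]) * length for length, value in runs)

fragment = bytes(10_000) + b"\x01" + bytes(5_000)    # mostly-zero slice fragment
runs = rle_encode(fragment)
assert rle_decode(runs) == fragment
print(len(fragment), "bytes ->", len(runs), "runs")  # 15001 bytes -> 3 runs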

To be clear, an efficient implementation must ensure that each K-slice is written to disk at most once per DTZ iteration.

If we assume 1 byte per position in a raw file, writing to a single byte within a 4KB sector necessitates writing the entire sector. In the early iterations, this is frequent, making the difference between compressed and raw formats negligible in terms of disk wear. However, looking at endgames like KQQQkqq, we see a rapid convergence where after ~40 plies, less than 1 in 4,000 positions remain unresolved. While 8-piece endgames will be more complex, 100 plies seems like a cautious estimate for when the data becomes sparse enough that we can optimize how we flush to disk.

While RAM is a reusable asset and SSDs are consumables, modern drives are increasingly durable, with many standard enterprise models handling thousands of full-drive writes. Ultimately, it’s a trade-off between the high upfront cost of massive RAM versus the operational cost of "consuming" high-end storage.

My hope is that by exploring these paths, whether through 5D king-slice transitions, smarter I/O management, or other methods, we can finally break the barrier for 8-piece or even 9-piece tables.