OK, if it is a quad/dual-core CPU it could have a larger combined L2 (2 MB), but you are not using multiple CPUs for the backward step, so that is a waste of resources. Assume 512 KB per core, and that with 20x the latency of L1. The program itself takes away some of that, so we are not talking about very big gains. Unless it is an extremely easy process that maps accesses superbly, we probably won't get more than 2x from cache-optimizing EGTB generation.
Pawn slices can be very small, but I doubt the K-slices are troublesome. I understand that moves of the last placed piece will benefit from caching whether we place the most mobile piece there or not. So even if I don't optimize for the big cache, the access pattern will benefit the generation process anyway; the question is only whether a Queen or a Bishop is placed there.
L2 can be as large as 2MB, which at one bit per position would be able to contain a 4-men slice. (Not recommended, because it would not leave room for anything else, and code uses L2 too, but you could easily handle 3.5-men slices. E.g. confining Rooks to one file or rank, etc.)
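The arithmetic behind those slice sizes can be sketched as below, assuming a naive 64^k indexing with no symmetry reduction (the function name is illustrative, not from any actual generator):

```c
#include <stdint.h>

/* Bitmap size in bytes for a slice in which 'free_men' pieces roam all
 * 64 squares and one extra piece is confined to 'confined_squares'
 * squares (pass 1 for no confined piece). Naive 64^k indexing,
 * 1 bit per position, no symmetry reduction. */
uint64_t slice_bitmap_bytes(int free_men, int confined_squares)
{
    uint64_t positions = 1;
    for (int i = 0; i < free_men; i++)
        positions *= 64;
    positions *= (uint64_t)confined_squares;
    return positions / 8;              /* 1 bit per position */
}
```

A full 4-men slice is 64^4 bits = 2 MiB, exactly filling such an L2; a "3.5-men" slice with one Rook confined to a single file (8 squares) is only 256 KiB.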
I understand from your leapfrog webpage that you try to exploit cache locality by generating moves of one piece only, and also by streaming through the squares in a cache-friendly way. My generator is a naive implementation that passes through all moves and positions of both sides and checks whether each is a win or loss, in which case it does the 2-ply search with retro moves and verification. I don't see this process benefiting much from cache, at least the way I do it. I can see good cache locality for the first pass, where I go forward and almost stream through the positions sequentially, but I have a hard time seeing other benefits. The tablebase array will have holes, with unknown, illegal, win, loss and draw entries interspersed throughout.
Note there are several ways to store the tablebase. I like to use 1 byte per position, using 1 bit for indicating won (wtm) positions, and the other 7 either for DTx (when resolved) or 'escape counts' (when not yet resolved) for the corresponding btm position. For 5 men that is fine, but for larger EGTs it becomes too bulky. Then I use representations with just 1 bit per position (to squeeze the most out of the memory), indicating whether the position is won (wtm). To distinguish the other states I then use separate tables, which often can be much smaller or easily compressed (because they are sparse). E.g. many algorithms need to know if a position is 'newly won' (i.e. became won in the last iteration). But that information is gained slice by slice through random accesses (requiring a bitmap only as large as the slice that is the current working set), and can then be compressed (e.g. by run-length encoding), as during most iterations it is very sparse, and stored in that form until it is needed again.
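The 1-byte-per-position layout described above could be packed as in this sketch (the names and bit assignment are my own illustration of the scheme, not taken from any real generator):

```c
#include <stdint.h>

/* One byte per position: the top bit flags the wtm position as won;
 * the low 7 bits hold either the DTx (once the corresponding btm
 * position is resolved) or its remaining escape count (while not). */
#define WON_WTM 0x80

static inline uint8_t make_entry(int won_wtm, int count7)
{
    return (uint8_t)((won_wtm ? WON_WTM : 0) | (count7 & 0x7F));
}
static inline int is_won_wtm(uint8_t e) { return (e & WON_WTM) != 0; }
static inline int count7(uint8_t e)     { return e & 0x7F; }
```

The 7-bit field caps DTx at 127, which is enough for nearly all 5-men endings.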
My goal is to generate bitbases, so I use 2 bits/pos plus another 1 bit for marking a position as visited. The generation algorithm is a bit different there. For instance, when someone told me it is possible to generate one-sided tables and reduce RAM usage, I got confused. What I didn't know is that you would still need to store the wins of the other side in a bitmap, plus at least a visited indicator, so at least 2 bits/pos for that side too. So I decided: why not 3 bits/pos for the other side as well, and do passes for both sides at the same time, since the saving in RAM is small anyway. If I dedicate a full byte to storing counters instead of a single bit, then I don't need to do verification at all, since all I do is backward passes, decreasing counters.
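The counter-based backward pass I mean is roughly this (a minimal sketch, assuming one byte per btm position holding its remaining escape count; the function name is hypothetical):

```c
#include <stdint.h>

/* When a successor of btm position 'pos' turns 'won (wtm)', the retro
 * pass decrements that position's escape count. Hitting zero means
 * every move now leads to a won position, so the position is lost with
 * no verification search ever needed. Returns 1 iff it just became
 * lost (so it can seed the next iteration). */
int plug_escape(uint8_t *escape_count, uint64_t pos)
{
    if (escape_count[pos] == 0)
        return 0;                   /* already resolved as lost */
    if (--escape_count[pos] == 0)
        return 1;                   /* newly lost */
    return 0;
}
```

The counts must of course be initialized to each position's legal-move count before the first pass, which is where the up-front work goes.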
I see that you still do verification anyway in leapfrog. What is the reason for that?
For storing the lost (btm) positions you can similarly compress the bitmap of newly lost positions of the current slice once you are done with it. These are again rather sparse, and all have the same DTx, so there is no reason to encode the DTx. Once a position is determined to be lost it will never be revisited, other than acting as a starting point for the next iteration (where it is accessed sequentially). So run-length encoding the lost-in-(N+1) bitmaps for the slice after the N->(N+1) iteration, and keeping them all stored separately (on disk, after you have used each as the base for the next iteration), allows you to construct the full DTx tablebase afterwards.
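A minimal run-length scheme for such a sparse bitmap could look like this (my own illustrative encoding: a 32-bit gap count before each set bit, which is all the next iteration needs to recover the positions):

```c
#include <stdint.h>
#include <stddef.h>

/* Encode a sparse bitmap as gaps: runs[k] = number of clear bits
 * before the k-th set bit. Returns the number of set bits found;
 * 'runs' must be large enough to hold them. */
size_t rle_encode(const uint8_t *bitmap, size_t nbits, uint32_t *runs)
{
    size_t n = 0;
    uint32_t gap = 0;
    for (size_t i = 0; i < nbits; i++) {
        if (bitmap[i >> 3] & (1u << (i & 7))) {
            runs[n++] = gap;        /* set bit: emit gap before it */
            gap = 0;
        } else {
            gap++;
        }
    }
    return n;
}

/* Decode back into position indices, ready to be walked sequentially
 * as the starting points of the next iteration. */
void rle_decode(const uint32_t *runs, size_t n, uint64_t *positions)
{
    uint64_t pos = 0;
    for (size_t i = 0; i < n; i++) {
        pos += runs[i];
        positions[i] = pos;
        pos++;                      /* step past the set bit itself */
    }
}
```

With one lost position per thousand, this is roughly a 250x size reduction over the raw bitmap.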
Compression can help to reduce bandwidth, especially if you generate bitbases and the sizes are too big to fit in RAM. I have never tried that, but I can see how a simple compression like RLE could be used even during generation, since decoding is pretty fast. I gather there are also more advanced compressors that let you decode the value of a single position without decoding the whole block.
During most of the building, verification is efficient because, as you say, the first move to a non-lost position produces a beta cutoff. When you are building EGTs that are generally lost, however, the iteration that finally declares a position lost is a fail low. So it will have to verify all escapes, and won't get a beta cutoff on any of them. The previous time that position was visited it had only one escape amongst its moves, and you would on average have to search half the moves; before that 1/3, etc. Now 1 + 1/2 + 1/3 + 1/4 + ... is a diverging series, and although it diverges only logarithmically, it can add up to significantly more than 1. It is then more efficient to do verification for all positions in advance (even those that might turn out not to need verification in the end), storing the escape counts. But if the EGT turns out to be a general draw, most of the advance verification work will have been completely wasted.
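To put a number on that diverging series: summed over the iterations that whittle the escape count down from m to 1, the expected probes per position follow the harmonic numbers, which grow like ln(m):

```c
/* Harmonic number H(m) = 1 + 1/2 + ... + 1/m, the expected total
 * verification probes (in units of full move-list scans) for a
 * position that starts with m escapes and loses them one per
 * iteration, with random move ordering. */
double harmonic(int m)
{
    double h = 0.0;
    for (int k = 1; k <= m; k++)
        h += 1.0 / k;
    return h;
}
```

For a typical 20-move position this is already H(20) ≈ 3.6, i.e. several times the single scan that the up-front count initialization would cost.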
Exactly. Verification is very fast because in my case even an unvisited position causes a cutoff. The overhead is mostly the move generation there, since you don't check the values of many of the resulting positions. The worst-case scenario is when a position is flagged as a potential loss and it turns out to be a real loss anyway: all the moves have to be marked as lost before it is finally declared really lost.
Now with the byte counter I don't do verification at all, since when the counter of a potential loss reaches zero it is flagged as really lost anyway.
There is an alternative method that avoids that, which is not storing counts, but storing in unresolved positions a hint of up to what point verification was performed on the previous visit. So you verify until you find an escape, but then you store the move number of that escape, or, if that takes too many bits, the number of the piece that caused the beta cutoff, or the number of the direction. So the next time you need to verify that node, you can skip all the moves that did not produce the cutoff (i.e. do lead to 'won' positions), and immediately start with the move that caused it last time. (Or with the piece that did it, or the piece in the right direction.) Most of the time that will not be the escape whose plugging led you to revisit the node, so you are done immediately. If it happened to be the plugged escape, you simply continue the verification where you left off. So there is no investment in verification until you really need it, and very little duplicate verification. (In fact, when you can specify the exact move that produced the cutoff last time, you can check whether that move is the reverse of the unmove through which you now revisit the position. If it is not, the position to which it led must still be undecided (or you would have revisited the position when that successor turned 'won'), and there is no need to access the EGT to actually verify.)
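The cutoff-hint scan could be sketched as follows (an illustrative sketch, assuming a precomputed flag per move saying whether it leads to a 'won' position; in a real generator that flag would be an EGT probe):

```c
/* Resume verification at the move that refuted 'lost' on the previous
 * visit. leads_to_won[m] is 1 if move m leads to a position already
 * won for the opponent. Returns -1 if every move loses (position is
 * really lost), else the index of the escape found, which is stored
 * as the new hint for the next revisit. */
int verify_with_hint(int nmoves, int hint, const int *leads_to_won)
{
    for (int i = 0; i < nmoves; i++) {
        int m = (hint + i) % nmoves;   /* start at last cutoff move */
        if (!leads_to_won[m])
            return m;                  /* escape found: still undecided */
    }
    return -1;                         /* no escape left: lost */
}
```

If the hinted move still escapes, the loop exits on its first probe, which is the common case the scheme is designed around.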
That makes a lot of sense, because I have two methods to be used on systems with small and with large RAM. If you are willing to dedicate a byte, or less as you suggested by storing the piece index, then you can reduce the impact of the verification overhead, or even completely avoid it if you store move counts. At 1 bit/pos for verification I am already at a 4.4 GB RAM requirement for the largest 6-men bitbases, so I will not be spending any more bits on counters there; for 5 men I avoid verification altogether.