pawn enumeration

syzygy

Re: price of complex indexing

Post by syzygy »

Daniel Shawul wrote:How can I retrograde from captures when I have those stored in separate tables and I access them through an egbbdll?
Say we are generating KRBvKQ and doing the black (un)captures. First you remove the R, which leaves KBvKQ. You loop through all possible KBvKQ positions with white to move. Now for each position, you look up the score and you generate all black K moves and insert the white R at the square where the black K was. Those you mark based on the score. You do the same with the black Q. Once you have finished looping through KBvKQ positions, you do (un)captures of the white B by looping through KRvKQ, looking up the score, undo captures with black K, undo captures with black Q.

Before doing these real captures, you can do captures of the white K and black K to find illegal positions. This is slightly more tricky if you use the white K for mirroring, but still not very complicated.
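
As a rough illustration of this (un)capture pass (not syzygy's actual code), a sketch in C, assuming hypothetical helpers probe_KBvKQ(), mark_KRBvKQ() and gen_king_retro():

/* Rough illustration of the black rook-(un)capture pass: loop over KBvKQ with
   white to move, probe its score, and for every black-king retro move re-insert
   the captured white rook on the square the black king came from.
   probe_KBvKQ(), mark_KRBvKQ() and gen_king_retro() are hypothetical helpers. */

extern int  probe_KBvKQ(int wk, int wb, int bk, int bq);    /* wtm score of child */
extern void mark_KRBvKQ(int wk, int wr, int wb, int bk, int bq, int score);
extern int  gen_king_retro(int from, int to[8]);            /* retro king moves   */

static int distinct(int a, int b, int c, int d)
{
    return a != b && a != c && a != d && b != c && b != d && c != d;
}

static void black_uncaptures_of_rook(void)
{
    int to[8];
    for (int wk = 0; wk < 64; wk++)
      for (int wb = 0; wb < 64; wb++)
        for (int bk = 0; bk < 64; bk++)
          for (int bq = 0; bq < 64; bq++) {
              if (!distinct(wk, wb, bk, bq))      /* overlapping squares; further */
                  continue;                       /* legality checks omitted here */
              int score = probe_KBvKQ(wk, wb, bk, bq);
              /* the black king just captured the rook on bk, so in the parent
                 the rook stands on bk and the black king on the retro square   */
              int n = gen_king_retro(bk, to);
              for (int i = 0; i < n; i++)
                  mark_KRBvKQ(wk, /*wr=*/bk, wb, /*bk=*/to[i], bq, score);
              /* a second loop would handle black-queen (un)captures the same way */
          }
}
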
This does many more verifications than my scheme needs. I basically delay the verification so that a position will often have been found to be potentially losing multiple times but only needs one verification.
Well, then I don't understand your method. To delay the verification of potential losses you need a counter or something that triggers the verification search upon reaching a threshold value.
No, I only delay until the next pass. During a white pass, I mark black positions as potloss or newwin. After the white pass has finished, I treat the black positions that have been marked as potloss or newwin. My point is that during this white pass, a black position can be marked as potloss more than once. In that case, only one verification is done in my scheme. This is just an explanation for why my scheme seems to be relatively fast, e.g. faster than what I could get using out-counting.
My guess is that doing the verification will help convergence, since it finds wins if there are any. You wait (through some means) for a retrograde move to change your potential loss into a win; I do it right away. So clearly mine wins if the potential loss is actually a win. If it is not, then any unvisited position will break the loop. The only problem is when most are visited and are lost, leading to a real loss.
My verification will never find a win, since if it is a win (that could be found), it will already have been found before the verification is performed. Of course my verification also returns once it has found a move not known to lose.
Also you don't seem to be distinguishing between newly resolved positions and old positions? That would mean you do a lot of unnecessary double work. But I suppose you use that 1 bit/pos here?
Of course, I forgot about that here, but I clearly mentioned it in my previous post. Any win/loss that is generated is marked as visited and never generates retro moves again. So each child position visits its parents only once, and at the same time asks its parent for verification as well.
Ok, so visited means finished. My newwin is like your win/unvisited. You don't seem to have an equivalent for potloss (and you don't need it since you verify immediately). You don't seem to have an equivalent for draw, which gives me the impression that you need to verify capture moves, which I do not have to.
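
For reference, a minimal sketch of this delayed-verification scheme, with hypothetical table accessors; only the potloss half is shown, newwin marking being analogous:

#include <stddef.h>

enum { UNKNOWN, NEWWIN, POTLOSS, LOSS };      /* states of a btm position        */

extern size_t         table_size;
extern unsigned char *black_tbl;              /* one state byte per btm position */
extern int  has_nonlosing_move(size_t idx);   /* the verification probe          */
extern void for_each_new_white_win(void (*visit)(size_t parent_idx));

static void mark_black_parent(size_t idx)
{
    if (black_tbl[idx] == UNKNOWN)
        black_tbl[idx] = POTLOSS;             /* may be marked many times ...    */
}

static void black_verification_pass(void)
{
    for (size_t i = 0; i < table_size; i++)
        if (black_tbl[i] == POTLOSS)          /* ... but verified only once      */
            black_tbl[i] = has_nonlosing_move(i) ? UNKNOWN : LOSS;
}

static void one_iteration(void)
{
    for_each_new_white_win(mark_black_parent);   /* the "white pass"             */
    black_verification_pass();                   /* potlosses handled afterwards */
    /* newwin marking (retro moves from newly lost white positions) is analogous */
}
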
hgm

Re: price of complex indexing

Post by hgm »

Daniel Shawul wrote:I think this can benefit when you are trying to access the disk in a systematic way. The L1 cache can hold so few positions (one cache line holds 64 bytes = 256 positions).
There is more in a CPU than just L1. In fact L2 access is nearly as good: even though it has a longer latency (e.g. 20 cycles instead of 3), the latency can usually be completely hidden by running other instructions in parallel through the out-of-order mechanism. (Even with 3 instructions per clock, that would only be 60 instructions, which fits in the reorder buffer.) With the much longer RAM access the reorder buffer will completely fill up, and the inability to retire the load instruction will then stall the CPU.

L2 can be as large as 2MB, which at one bit per position would be able to contain a 4-men slice. (Not recommended, because it would not leave room for anything else, and code uses L2 too, but you could easily handle 3.5-men slices. E.g. confining Rooks to one file or rank, etc.)

Note there are several ways to store the tablebase. I like to use 1 byte per position, using 1 bit for indicating won (wtm) positions, and the other 7 either for DTx (when resolved) or 'escape counts' (when not yet resolved) for the corresponding btm position. For 5 men that is fine, but for larger it becomes too bulky. Then I use representations with just 1 bit per position (to squeeze the most out of the memory), indicating whether the position is won (wtm). To distinguish the other states I then use separate tables, which often can be much smaller or easily compressed (because they are sparse). E.g. many algorithms need to know if a position is 'newly won' (i.e. became won in the last iteration). But that information is gained slice by slice through random accesses (requiring a bitmap only as large as the slice that is the current working set), and can then be compressed (e.g. by run-length encoding) as during most iterations it is very sparse, and stored in that form until it is needed again.
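
One possible packing of such a 1-byte-per-position cell (the exact layout is an assumption, not necessarily hgm's):

#include <stdint.h>

/* Bit 7 flags the wtm position as won; bits 0-6 hold either the DTx of the
   corresponding btm position (once resolved) or its remaining escape count. */
#define WTM_WON   0x80
#define BTM_FIELD 0x7F

static inline int  wtm_won(uint8_t c)         { return (c & WTM_WON) != 0; }
static inline void set_wtm_won(uint8_t *c)    { *c |= WTM_WON; }
static inline int  btm_field(uint8_t c)       { return c & BTM_FIELD; }
static inline void set_btm_field(uint8_t *c, int v)
{
    *c = (uint8_t)((*c & WTM_WON) | (v & BTM_FIELD));
}
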

For storing the lost (btm) positions you can similarly compress the bitmap for the newly lost positions of the current slice once you are done with it; these are again rather sparse, and all have the same DTx, so there is no reason to encode the DTx. Once a position is determined to be lost it will never be revisited other than acting as starting point for the next iteration (where it is accessed sequentially). So run-length encoding of the lost-in-(N+1) bitmaps for the slice after the N->(N+1) iteration, and keeping them all stored separately (on disk, after you have used it as base for the next iteration), allows you to construct the full DTx tablebase afterwards.
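
A minimal run-length encoder for such a sparse newly-lost bitmap might look as follows; it stores the distance to the previous set bit as variable-length bytes, and is just one possible scheme:

#include <stdint.h>
#include <stddef.h>

/* Emit 'gap' as a variable-length quantity: 7 bits per byte, high bit = more. */
static size_t put_varint(uint8_t *out, uint64_t gap)
{
    size_t n = 0;
    while (gap >= 0x80) { out[n++] = (uint8_t)(gap | 0x80); gap >>= 7; }
    out[n++] = (uint8_t)gap;
    return n;
}

/* bits[] holds npos positions, 1 bit each; for every set bit the distance to
   the previous set bit is appended to out[] (which must be large enough).    */
static size_t rle_encode(const uint8_t *bits, size_t npos, uint8_t *out)
{
    size_t   written = 0;
    uint64_t gap     = 0;
    for (size_t i = 0; i < npos; i++, gap++)
        if (bits[i >> 3] & (1u << (i & 7))) {
            written += put_varint(out + written, gap);
            gap = 0;                        /* next distance counts from here  */
        }
    return written;
}
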

A note on storing escape counts:

During most of the building, verification is efficient because, as you say, the first move to a non-lost position produces a beta cutoff. When you are building EGTs that are generally lost, however, the iteration that finally declares a position lost is a fail low. So it will have to verify all escapes, and won't get the beta cutoff on any of them. The previous time that position was visited it had only one escape amongst its moves, and you would on average have to search half the moves; the visit before that, about a third, etc. Now 1 + 1/2 + 1/3 + 1/4 + ... is a diverging series, and although it diverges only logarithmically, it can add up to significantly more than 1. It is then more efficient to do verification for all positions in advance (even those that might turn out not to need verification in the end), storing the escape counts. But if the EGT turns out to be a general draw, most of the advance verification work will have been completely wasted.

There is an alternative method that avoids that, which is not storing counts, but storing a hint in unresolved positions up to what point verification was performed on the previous visit. So you verify until you find an escape. But then you store the move number of that escape, or if that takes too many bits, the number of the piece that caused the beta cutoff, or the number of the direction. So the next time you need to verify that node, you can skip all the moves that did not produce the cutoff (i.e. do lead to 'won' positions), and immediately start with the move that caused it last time. (Or with the piece that did it, or the piece in the right direction.) Most of the time that will not be the escape whose plugging led you to revisit the node, so you are done immediately. If it happened to be the plugged escape, you then simply continue the verification where you left off. So there is no investment in verification until you really need it, and very little duplicate verification. (In fact, when you can specify the exact move that produced the cutoff last time, you can check if that move is the reverse of the unmove through which you now revisit the position, because the successor turned into 'won'; if it was not, the position to which it led must still be undecided (or you would have revisited the position when that successor turned 'won'), and there is no need to access the EGT to actually verify.)
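
A sketch of this cutoff-hint idea with an 8-bit hint per unresolved position; the helper names are assumptions, and move numbering is assumed stable between visits:

#include <stdint.h>

#define NO_HINT 0xFF

extern int num_moves(uint64_t pos);
extern int move_leads_to_won(uint64_t pos, int move);   /* EGT probe          */

/* Returns 1 if the position still has an escape, 0 if it is now lost.
   *hint is the stored index of the move that escaped on the previous visit. */
static int verify_with_hint(uint64_t pos, uint8_t *hint)
{
    int n     = num_moves(pos);
    int start = (*hint == NO_HINT || *hint >= n) ? 0 : *hint;

    /* moves before the hint led to 'won' successors last time, and won
       positions stay won, so they need not be probed again                   */
    for (int m = start; m < n; m++)
        if (!move_leads_to_won(pos, m)) {
            *hint = (uint8_t)m;             /* remember the new escape         */
            return 1;
        }
    return 0;                               /* all escapes are gone            */
}
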
syzygy

Re: price of complex indexing

Post by syzygy »

syzygy wrote:
hgm wrote:
syzygy wrote:(There might be a problem though with such cache lines mapping to the same cache entry, since they are at distances (a multiple of) a power of two.)
This can be avoided by using 64-byte spacer regions in between the higher slices. If you don't want to do that in the index (e.g. to keep it easy to mirror the board by bit manipulation on the index), you can use an index->address translation just before every access, like

address = index*01000101 >> 18;

This is still much faster than the access itself, so it will not cause a noticeable slowdown. (As it will be computed in parallel.)
Yes, I will certainly be trying something like this if generation speed really drops for 6-piece tables.
I just experimented with the following mapping (for 5-piece tables):

address = index ^ ( (index >> 12) & (4095 << 6) )

Unfortunately this slows down my generator by about 25%.
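
For reference, the two index-to-address translations under discussion, written out with the constants as quoted:

#include <stdint.h>

/* hgm's suggestion: multiply by a sparse constant to insert spacer regions
   (the constant is copied as quoted; note a leading 0 makes it octal in C). */
static inline uint64_t spaced_address(uint64_t index)
{
    return index * 01000101 >> 18;
}

/* syzygy's experiment: XOR a shifted slice of the index into the line bits.  */
static inline uint64_t xored_address(uint64_t index)
{
    return index ^ ((index >> 12) & (uint64_t)(4095 << 6));
}

Read as octal, the multiply-and-shift scales the index by a factor just above one, so an extra unit of spacing is slipped in every few thousand entries; as the measurements in this thread show, whether any such remapping pays off depends entirely on the access pattern.
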
hgm

Re: price of complex indexing

Post by hgm »

That is a lot, and seems much more than could be explained from the execution time of the few extra instructions. This suggests it actually decreases cache efficiency.

Without knowing the details of the access pattern of your generator I can only speculate what causes that. But it is not inconceivable that under circumstances where the working set is about twice as large as the cache, focussing half the accesses on the same cache entry would actually be beneficial. Because then at least the other half of the working set could remain in cache and cause hits, while spreading the accesses out over the cache entries would cause severe thrashing, flushing every item out before it could be used a second time.
Daniel Shawul

Re: price of complex indexing

Post by Daniel Shawul »

Ok, if it is a quad/dual core CPU it could have a larger combined L2 (2 MB), but you are not using multiple CPUs for the backward step, so that is a waste of resources. Assume 512k per core, and that with the 20x latency compared to L1. The program takes away some of that, so we are not talking about very big gains. Unless it is an extremely easy process that maps accesses superbly, we probably won't get more than 2x from optimizing EGTB generation for the cache. Pawn slices can be very small, but I doubt the K-slices are troublesome. I understand moves of the last placed piece will benefit from caching whether we place the most mobile piece there or not. So even if I don't optimize, the big cache and the access pattern will benefit the generation process anyway; the improvement then comes down to whether a Queen or a Bishop is placed there.
L2 can be as large as 2MB, which at one bit per position would be able to contain a 4-men slice. (Not recommended, because it would not leave room for anything else, and code uses L2 too, but you could easily handle 3.5-men slices. E.g. confining Rooks to one file or rank, etc.)
I understand from your leapfrog webpage that you try to exploit cache locality by generating moves of one piece only and also by streaming through the squares in a cache-friendly way. My generator is a naive implementation that passes through all moves and positions of both sides and checks if it is a win or loss, in which case it does the 2-ply search with retro moves and verification. I don't see this process benefiting much from the cache, at least the way I do it. I can see good cache locality for the first pass, where I go forward and almost stream through the positions sequentially, but I have a hard time realizing other benefits. The tablebase array will have holes, with unknown, illegal, win, loss and draw positions interspersed throughout.
Note there are several ways to store the tablebase. I like to use 1 byte per position, using 1 bit for indicating won (wtm) positions, and the other 7 either for DTx (when resolved) or 'escape counts' (when not yet resolved) for the corresponding btm position. For 5 men that is fine, but for larger it becomes too bulky. Then I use representations with just 1 bit per position (to squeeze the most out of the memory), indicating whether the position is won (wtm). To distinguish the other states I then use separate tables, which often can be much smaller or easily compressed (because they are sparse). E.g. many algorithms need to know if a position is 'newly won' (i.e. became won in the last iteration). But that information is gained slice by slice through random accesses (requiring a bitmap only as large as the slice that is the current working set), and can then be compressed (e.g. by run-length encoding) as during most iterations it is very sparse, and stored in that form until it is needed again.
My goal is to generate bitbases, so I use 2 bits/pos and another 1 bit for marking a position as visited. The generation algorithm is a bit different there. For instance, when someone told me it is possible to generate one-sided tables and reduce RAM usage, I got confused. What I didn't know is that you would need to store the wins of the other side in a bitmap and also at least a visited indicator. So at least 2 bits/pos for that side too, so I decided why not 3 bits/pos for the other side too and do passes for both sides at the same time, since the saving in RAM is small anyway. If I dedicate a full byte for storing counters instead of a single bit, then I don't need to do verification, since all I do is backward passes forever while decreasing counters.
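
An illustrative packing of the 2-bit result plus visited flag per side described here; storing both sides in one byte per position is just one interpretation:

#include <stdint.h>
#include <stddef.h>

enum { UNKNOWN = 0, WIN = 1, LOSS = 2, DRAW = 3 };   /* 2-bit result codes     */

/* side 0 (white) uses bits 0-2 of the byte, side 1 (black) uses bits 4-6:
   two result bits plus one visited bit per side.                             */
static inline int get_result(const uint8_t *tbl, size_t idx, int side)
{
    return (tbl[idx] >> (side * 4)) & 3;
}

static inline int is_visited(const uint8_t *tbl, size_t idx, int side)
{
    return (tbl[idx] >> (side * 4 + 2)) & 1;
}

static inline void set_result(uint8_t *tbl, size_t idx, int side, int r)
{
    tbl[idx] = (uint8_t)((tbl[idx] & ~(3u << (side * 4))) | ((unsigned)r << (side * 4)));
}

static inline void set_visited(uint8_t *tbl, size_t idx, int side)
{
    tbl[idx] |= (uint8_t)(1u << (side * 4 + 2));
}
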
I see that you still do verification anyway in leapfrog. What is the reason for that?
For storing the lost (btm) positions you can similarly compress the bitmap for the newly lost positions of the current slice once you are done with it; these are again rather sparse, and all have the same DTx, so there is no reason to encode the DTx. Once a position is determined to be lost it will never be revisited other than acting as starting point for the next iteration (where it is accessed sequentially). So run-length encoding of the lost-in-(N+1) bitmaps for the slice after the N->(N+1) iteration, and keeping them all stored separately (on disk, after you have used it as base for the next iteration), allows you to construct the full DTx tablebase afterwards.
Compression can help to reduce bandwidth, especially if you generate bitbases and the sizes are too big to stay in RAM. I have never tried that, but I can see how a simple compression like RLE can be used even during generation, since decoding is pretty fast. I gather there are other, more advanced compressors where you can decode the value of a single position without actually decoding the whole block.
During most of the building, verification is efficient because, as you say, the first move to a non-lost position produces a beta cutoff. When you are building EGTs that are generally lost, however, the iteration that finally declares a position lost is a fail low. So it will have to verify all escapes, and won't get the beta cutoff on any of them. The previous time that position was visited it had only one escape amongst its moves, and you would on average have to search half the moves; the visit before that, about a third, etc. Now 1 + 1/2 + 1/3 + 1/4 + ... is a diverging series, and although it diverges only logarithmically, it can add up to significantly more than 1. It is then more efficient to do verification for all positions in advance (even those that might turn out not to need verification in the end), storing the escape counts. But if the EGT turns out to be a general draw, most of the advance verification work will have been completely wasted.
Exactly. Verification is very fast because in my case even an unvisited position causes a cutoff. The overhead is usually the move generation there, but you don't check the values of many of the positions resulting from those moves. The worst-case scenario is when the position is flagged as a potential loss and it turns out to be a real loss anyway: all the moves have to have been marked as lost for it to be finally declared really lost. Now with the byte counter I don't do verification at all, since when the counter for the potential loss becomes zero it is flagged as really lost anyway.
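
A bare-bones version of that byte-counter (out-counting) idea, with hypothetical helpers:

#include <stdint.h>
#include <stddef.h>

extern uint8_t *count;                     /* per btm position, initialised to
                                              its number of legal moves         */
extern void     declare_lost(size_t idx);  /* queue as seed for next retro pass */

/* Called once for every retro move from a newly won wtm position.              */
static void retro_hit(size_t parent_idx)
{
    if (count[parent_idx] == 0)
        return;                            /* already resolved                  */
    if (--count[parent_idx] == 0)
        declare_lost(parent_idx);          /* every move now leads to a win     */
}
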
There is an alternative method that avoids that, which is not storing counts, but storing a hint in unresolved positions up to what point verification was performed on the previous visit. So you verify until you find an escape. But then you store the move number of that escape, or if that takes too many bits, the number of the piece that caused the beta cutoff, or the number of the direction. So the next time you need to verify that node, you can skip all the moves that did not produce the cutoff (i.e. do lead to 'won' positions), and immediately start with the move that caused it last time. (Or with the piece that did it, or the piece in the right direction.) Most of the time that will not be the escape whose plugging led you to revisit the node, so you are done immediately. If it happened to be the plugged escape, you then simply continue the verification where you left off. So there is no investment in verification until you really need it, and very little duplicate verification. (In fact, when you can specify the exact move that produced the cutoff last time, you can check if that move is the reverse of the unmove through which you now revisit the position, because the successor turned into 'won'; if it was not, the position to which it led must still be undecided (or you would have revisited the position when that successor turned 'won'), and there is no need to access the EGT to actually verify.)
That makes a lot of sense, because I have two methods to be used for systems with small and large RAM. If you are willing to dedicate a byte or less as you suggested, by storing the piece index, then you can reduce the impact of the verification overhead, or even completely avoid it if you store move counts. At 1 bit/pos for verification I am at a 4.4 GB RAM requirement for the largest 6-men bitbases, so I will not be using any more bits for counters there, but for 5 men I avoid verification altogether.
syzygy

Re: price of complex indexing

Post by syzygy »

hgm wrote:That is a lot, and seems much more than could be explained from the execution time of the few extra instructions. This suggests it actually decreases cache efficiency.
Well, one problem is that I did the transformation also in the main loop when looping over the table looking for work to do. When I loop through the table linearly, and do the reverse transformation before doing the work, things get a bit faster. Still, it is 10% slower.

A second problem is that my transformation might not sufficiently solve the problem, so I replaced (idx ^ (idx >> 12)) by (idx + (idx >> 12)) (with appropriate masking to make things work correctly). However, this only slows down things further.

I've measured cache-misses and they do indeed increase.
Without knowing the details of the access pattern of your generator I can only speculate what causes that. But it is not inconceivable that under circumstances where the working set is about twice as large as the cache, focussing half the accesses on the same cache entry would actually be beneficial. Because then at least the other half of the working set could remain in cache and cause hits, while spreading the accesses out over the cache entries would cause severe thrashing, flushing every item out before it could be used a second time.
It's a possible explanation. Hard to say.
hgm

Re: price of complex indexing

Post by hgm »

Daniel Shawul wrote: I see that you still do verification anyway in leapfrog. What is the reason for that?
Whether I store counts or do verification on the spot is really independent from whether I do leapfrog. The problem is that storing counts on a really slow medium such as a hard disk is not competitive. It is then faster to just store 1 bit/position, compressed by run-length encoding, and redo the verification on the next pass. That reduces the amount of data to be written to and read back from the disk so much that you save more on disk I/O than redoing the verification costs you in RAM access.
Daniel Shawul

Re: price of complex indexing

Post by Daniel Shawul »

syzygy wrote:
hgm wrote:That is a lot, and seems much more than could be explained from the execution time of the few extra instructions. This suggests it actually decreases cache efficiency.
Well, one problem is that I did the transformation also in the main loop when looping over the table looking for work to do. When I loop through the table linearly, and do the reverse transformation before doing the work, things get a bit faster. Still, it is 10% slower.

A second problem is that my transformation might not sufficiently solve the problem, so I replaced (idx ^ (idx >> 12)) by (idx + (idx >> 12)) (with appropriate masking to make things work correctly). However, this only slows down things further.

I've measured cache-misses and they do indeed increase.
Without knowing the details of the access pattern of your generator I can only speculate what causes that. But it is not inconceivable that under circumstances where the working set is about twice as large as the cache, focussing half the accesses on the same cache entry would actually be beneficial. Because then at least the other half of the working set could remain in cache and cause hits, while spreading the accesses out over the cache entries would cause severe thrashing, flushing every item out before it could be used a second time.
It's a possible explanation. Hard to say.
In my case the index is constructed from numbers that are not multiples of 64, so I don't need to cache-align it nor pad it, since there is no definite pattern. I am curious what your miss rate for L1 and L2 is for 3-men and 4-men generation, to see if there is hope for cache optimizations anyway. I don't think we can compare numbers of different programs, but just the effect of the exponential increase of tablebase sizes would be interesting to know.
Daniel Shawul

Re: price of complex indexing

Post by Daniel Shawul »

hgm wrote:
Daniel Shawul wrote: I see that you still do verification anyway in leapfrog. What is the reason for that?
Whether I store counts or do verification on the spot is really independent from whether I do leapfrog. The problem is that storing counts on a really slow medium such as a hard disk is not competitive. It is then faster to just store 1 bit/position, compressed by run-length encoding, and redo the verification on the next pass. That reduces the amount of data to be written to and read back from the disk so much that you save more on disk I/O than redoing the verification costs you in RAM access.
Oh, so we do the same thing. I wrongly thought the byte you used also had counters in it. For 6 men, where I don't use counters, I use a 1-bit visited flag + verification to avoid going to disk.
Daniel Shawul

Re: price of complex indexing

Post by Daniel Shawul »

Say we are generating KRBvKQ and doing the black (un)captures. First you remove the R, which leaves KBvKQ. You loop through all possible KBvKQ positions with white to move. Now for each position, you look up the score and you generate all black K moves and insert the white R at the square where the black K was. Those you mark based on the score. You do the same with the black Q. Once you have finished looping through KBvKQ positions, you do (un)captures of the white B by looping through KRvKQ, looking up the score, undo captures with black K, undo captures with black Q.
This sounds complicated to me. The savings you can get from retro-moving even in the first iteration are minimal. First, I need to count moves for up-to-5-men generation since I use the byte method anyway. Second, the first pass is done fast in parallel; I still can't figure out how to do parallel backpasses, so the time taken by iteration 0 is small, iterations 1 & 2 are the biggest, and it decreases from then onwards.
Before doing these real captures, you can do captures of the white K and black K to find illegal positions. This is slightly more tricky if you use the white K for mirroring, but still not very complicated.
Yes, unmoves for captures are complicated. Illegal positions are filtered last in my case, so they remain untouched during generation.
No, I only delay until the next pass. During a white pass, I mark black positions as potloss or newwin. After the white pass has finished, I treat the black positions that have been marked as potloss or newwin. My point is that during this white pass, a black position can be marked as potloss more than once. In that case, only one verification is done in my scheme. This is just an explanation for why my scheme seems to be relatively fast, e.g. faster than what I could get using out-counting.
I see a difference here. You do passes for one side at a time, whereas I do passes for WINs or for LOSSes, one at a time and not both together, since as I mentioned it didn't terminate properly when I mixed both with the 1-bit flagging method.
1st pass = LOSSes for both -> WINs for both
2nd pass = WINs for both -> PotLOSSes for both
Now if I break this last step into:
2.1 WIN for white -> PotLoss for black
2.2 WIN for black -> PotLoss for white
this gains nothing, so I have to do the verification right away (roughly the loop sketched below). Maybe there is an optimal combination that does generation for both sides at the same time with white/black/loss/win back passes interspersed. I just used the one that converged at that time.
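
Roughly the alternation described above, as a sketch with placeholder routines:

extern int retro_from_losses_both_sides(void);   /* LOSSes  -> new WINs         */
extern int retro_from_wins_both_sides(void);     /* WINs    -> PotLOSSes        */
extern int verify_potential_losses(void);        /* PotLOSS -> LOSS, or cleared */

static void iterate_until_converged(void)
{
    int changed;
    do {
        changed  = retro_from_losses_both_sides();
        changed |= retro_from_wins_both_sides();
        changed |= verify_potential_losses();    /* done right away, as stated  */
    } while (changed);
}
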
My verification will never find a win, since if it is a win (that could be found), it will already have been found before the verification is performed. Of course my verification also returns once it has found a move not known to lose.
I guess this difference comes from the way we do passes as I mentioned above.
Ok, so visited means finished. My newwin is like your win/unvisited. You don't seem to have an equivalent for potloss (and you don't need it since you verify immediately). You don't seem to have an equivalent for draw, which gives me the impression that you need to verify capture moves, which I do not have to.
Draws can be obtained in iteration 0 from captures and promotions, but they are never backpassed after that. When I do verification it of course updates with draws too. But I don't include captures in the verification (it does non-captures only, as does the retro-move generation). The captures make their contribution in the first pass and are never done again. Maybe this is wrong somehow, but all I know is that it gets the same numbers, at least for 4 men.