pawn enumeration

Discussion of chess software programming and technical issues.

Moderator: Ras

wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: pawn enumeration

Post by wgarvin »

6-man with 1 pawn : 24 slices, 2*64*64*64*3612 = 1806 MB. You'd need at most two other slices loaded to initialize any one slice (and since they are indexed in the same order, you don't really need them in RAM at the same time). So 4GB of RAM would be enough, and 6GB would definitely be enough (even at one byte per position).

6-man with 2 opposite pawns: 24*47 slices, 2*64*64*3612 = less than 30 MB per slice, so that's easy enough. With 2 same pawns it gets a little harder to give a sensible numbering to the slices. I'd be inclined to just use 48*47/2 and 1806 king constellations and apply the mirroring to the white king instead of the pawns (since you need that capability for non-pawn endings anyway).

6-man with no pawns can get pretty big. At one byte per position, with no men of same color and type, you get 2*64*64*64*64*462 = ~14.4 GB or 2*62*61*60*59*462 = ~11.6 GB. I guess you can do one side at a time and store only 1 bit/position for the other side, or something, which would let even these be generated entirely in RAM on an 8GB machine.

This seems a lot simpler to me than any of the techniques that involve disk access on every pass, but it's true that it won't scale up to 7-man dbs anytime soon!
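As a sanity check, the size arithmetic above spelled out in code (one byte per position; the 3612 and 462 king-constellation counts are taken as given, and the variable names are just illustrative):

Code: Select all

// Size arithmetic for the layouts discussed above, one byte per position.
// The king-constellation counts (3612 with a pawn, 462 pawnless) are taken
// as given; everything else is just the multiplication written out.
#include <cstdio>
#include <cstdint>

int main() {
    const double MB = 1024.0 * 1024.0, GB = MB * 1024.0;

    // 6 men, 1 pawn: pawn fixed per slice (24 slices after a-d mirroring),
    // 3 free pieces, 3612 king constellations, 2 sides to move.
    uint64_t one_pawn = 2ULL * 64 * 64 * 64 * 3612;
    printf("1-pawn slice:      %7.0f MB\n", one_pawn / MB);        // 1806 MB

    // 6 men, 2 opposite pawns: both pawns fixed per slice (24*47 slices).
    uint64_t two_pawn = 2ULL * 64 * 64 * 3612;
    printf("2-pawn slice:      %7.1f MB\n", two_pawn / MB);        // ~28 MB

    // 6 men, no pawns: 462 king constellations, 4 free pieces,
    // with and without excluding squares that are already occupied.
    uint64_t pawnless_loose = 2ULL * 64 * 64 * 64 * 64 * 462;
    uint64_t pawnless_tight = 2ULL * 62 * 61 * 60 * 59 * 462;
    printf("pawnless, 64^4:    %7.1f GB\n", pawnless_loose / GB);  // ~14.4 GB
    printf("pawnless, packed:  %7.1f GB\n", pawnless_tight / GB);  // ~11.5 GB
    return 0;
}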
syzygy
Posts: 5911
Joined: Tue Feb 28, 2012 11:56 pm

Re: pawn enumeration

Post by syzygy »

diep wrote:
syzygy wrote:
Daniel Shawul wrote:This is much more difficult than you think. Even enumerating n alike pieces requires some weird math. Look in my other post for a code snippet. Now merge this for two groups of white and black pawns and you see it is a nightmare. Even though I think it may be possible to find an enumeration function, it will be difficult.
What makes it difficult are the symmetries if none of the sides have exactly 1 pawn.
If one side has exactly 1 pawn, all symmetry is eliminated by restricting this pawn to files a-d. The rest is a matter of placing k alike pieces in n squares, which you are already doing for pawnless tables anyway. First the pawns of the other side, then groups of pieces. Of course this doesn't yet solve the problem of preserving the partial ordering on pawn structures imposed by pawn moves, which you need for easy slicing.

To have a perfect index (i.e. no symmetry remaining and no broken positions with two pieces on the same square) with (say) white having two pawns and black either 0 or at least 2 pawns, you first have to distinguish between the case that the white pawns are in symmetric positions and the case that the white pawns are not in symmetric positions. The latter case is relatively easy, since indexing the white pawns breaks all symmetry. In the symmetric case, there is symmetry left that you can exploit when placing the remaining pawns or pieces. If black has no pawns you can use the white king for this. If black has pawns, you have to analyze a bit more. With 3 white pawns things actually get simpler, since the 3 pawns always break symmetry. So only KPPvKPP needs a bit more analysis. (Looking at my old code for KPPvPP in suicide chess, I distinguish 5 cases. The code for KPPvKPP would be similar. Not too bad.)
You know, this is high-school-type math. Not sure what courses you had there, but in my high school we had some math regarding: "how many ways can I put n blue balls in there". Never had that?
I did, and I learned to read already in elementary school. :D
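As an aside, a minimal sketch of the "placing k alike pieces in n squares" step using the combinatorial number system; the ranking convention and names here are illustrative assumptions, not taken from any particular generator:

Code: Select all

// Rank a set of k identical pieces on squares s[0] < s[1] < ... < s[k-1]
// (numbered 0..n-1) into an index 0 .. C(n,k)-1, and unrank it again.
// This is the standard combinatorial number system; the exact convention
// here is an assumption, not anyone's actual generator code.
#include <cstdio>
#include <cstdint>

static uint64_t binom(int n, int k) {
    if (k < 0 || k > n) return 0;
    uint64_t r = 1;
    for (int i = 1; i <= k; i++) r = r * (n - k + i) / i;  // exact at each step
    return r;
}

// squares must be strictly increasing
static uint64_t rank_combination(const int *sq, int k) {
    uint64_t idx = 0;
    for (int i = 0; i < k; i++) idx += binom(sq[i], i + 1);
    return idx;
}

static void unrank_combination(uint64_t idx, int n, int k, int *sq) {
    for (int i = k; i >= 1; i--) {
        int s = n - 1;
        while (binom(s, i) > idx) s--;   // largest s with C(s,i) <= idx
        idx -= binom(s, i);
        sq[i - 1] = s;
        n = s;
    }
}

int main() {
    int pawns[2] = {10, 35};              // two like pawns on squares 10 and 35
    uint64_t idx = rank_combination(pawns, 2);
    int back[2];
    unrank_combination(idx, 64, 2, back); // recover the squares from the index
    printf("index %llu -> squares %d %d\n",
           (unsigned long long)idx, back[0], back[1]);
    return 0;
}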
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: pawn enumeration

Post by diep »

wgarvin wrote:6-man with 1 pawn : 24 slices, 2*64*64*64*3612 = 1806 MB. You'd need at most two other slices loaded to initialize any one slice (and since they are indexed in the same order, you don't really need them in RAM at the same time). So 4GB of RAM would be enough, and 6GB would definitely be enough (even at one byte per position).

6-man with 2 opposite pawns: 24*47 slices, 2*64*64*3612 = less than 30 MB per slice, so that's easy enough. With 2 same pawns it gets a little harder to give a sensible numbering to the slices. I'd be inclined to just use 48*47/2 and 1806 king constellations and apply the mirroring to the white king instead of the pawns (since you need that capability for non-pawn endings anyway).

6-man with no pawns can get pretty big. At one byte per position, with no men of same color and type, you get 2*64*64*64*64*462 = ~14.4 GB or 2*62*61*60*59*462 = ~11.6 GB. I guess you can do one side at a time and store only 1 bit/position for the other side, or something, which would let even these be generated entirely in RAM on an 8GB machine.

This seems a lot simpler to me than any of the techniques that involve disk access on every pass, but it's true that it won't scale up to 7-man dbs anytime soon!
The disk-access method always wins, in *all* cases, because in total it needs less bandwidth. If you have enough RAM, you can simply make a RAM disk and put some of the files that you stream there.

The whole generation of EGTBs is about using the bandwidth you have available effectively and minimizing the CPU work.

Koistinen's method comes in handy there.

What I didn't investigate is whether the actual GENERATION can be done fast on a GPU. The motherboards that can take the Teslas do not have PCIe 2.0, let alone 3.0; the cluster does (but that's PCIe x8, not x16).

So the bandwidth I have is about 2.2 GB/s (bidirectional), basically the bandwidth of the DDR2 ECC registered RAM; if you add it all up, including overhead, it probably comes to about 8 GB/s aggregated if you measured it.

Note that x8 PCIe 2.0 also has, not surprisingly, the same 8 GB/s aggregated bandwidth.

Now you can do whatever you want, but that's the bandwidth you've got, and that's it.

So if you store stuff in 2 bits and look up 2 bits rather than one, then your lookups are going to use a factor of 2 more bandwidth.

That's the theory I have for you.

There is, however, another big problem with big RAM buffers.
Around 2003 I wrote a test for that.

The TLB-thrashing latency of random lookups into a big RAM buffer is a hell of a lot slower than into a small RAM buffer. That's the kind of random lookups I'm referring to.

If you do that inside a caching system, then of course you'll only need a limited number of cache buffers from disk (or RAM disk).

We can show that if we keep using the same indexing scheme during our loop (so not a randomized indexing scheme that changes with every lookup we do), then for the coming n positions we only need a subset of the lookup buffer. The logical deduction is that each time we are looking up within a small RAM buffer, which has lower latency.

So it's not the perfect situation, of course, but it's really efficient in bandwidth usage.

Everything I've seen suggested so far is wasting bandwidth.
syzygy
Posts: 5911
Joined: Tue Feb 28, 2012 11:56 pm

Re: pawn enumeration

Post by syzygy »

wgarvin wrote:6-man with 1 pawn : 24 slices, 2*64*64*64*3612 = 1806 MB. You'd need at most two other slices loaded to initialize any one slice (and since they are indexed in the same order, you don't really need them in RAM at the same time). So 4GB of RAM would be enough, and 6GB would definitely be enough (even at one byte per position).
Actually, you only need half of the other two slices. For example, if the pawn is white and on a2, you only need the btm (black-to-move) info for the pawn on a3 and a4. All black moves stay within the slice anyway. It would also be relatively easy to first use the a3-slice, then the a4-slice (or the other way around). In fact, you don't even need to store the a3- and a4-slices in RAM, since you need to access the information only once and sequentially.

RAM is not a problem for pawnful tables. The pawnless tables are harder in this respect.
syzygy
Posts: 5911
Joined: Tue Feb 28, 2012 11:56 pm

Re: pawn enumeration

Post by syzygy »

diep wrote:The disk-access method always wins, in *all* cases, because in total it needs less bandwidth.
Ok, so how many seconds for KQRvKQ?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: pawn enumeration

Post by diep »

syzygy wrote:
diep wrote:The disk-access method always wins, in *all* cases, because in total it needs less bandwidth.
Ok, so how many seconds for KQRvKQ?
Why don't you give away some pieces? That makes it faster :)
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: pawn enumeration

Post by wgarvin »

diep wrote:The disk-access method always wins, in *all* cases, because in total it needs less bandwidth. [...]
I still don't see how generating entirely in RAM is going to be slower than anything involving disk access. Even with streaming, you're going to be I/O bound because it's hundreds of times slower than RAM. Any kind of non-sequential I/O will take dozens of milliseconds.

I think why people recommend the "out-counting" method (which can only be efficiently done with a byte per position) is that you only have to generate moves for each position about two times during the entire generation process: one forward generation to count up how many successors the position has within the slice, and one reverse generation to decrement the counters of unresolved predecessors when the position is resolved. Yes, there's still going to be a lot of random accesses throughout the slice (which you can mitigate a little bit by putting the piece with the highest mobility in the lowest bits of the index, so that all 64 placements of that piece lie together in the same cache line).
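For concreteness, a minimal sketch of that out-counting pass structure, abstracted away from chess: one byte per position holds either a counter of unresolved successors or a resolved score, and move generation is passed in as callbacks. The encoding and all names are illustrative assumptions, not anyone's actual generator:

Code: Select all

// "Out-counting": forward pass stores the successor count of every position,
// retro pass decrements the counters of predecessors as positions resolve.
// Counters must stay below 0xFE; scores share the same byte.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <functional>

enum : uint8_t { LOSS = 0xFF, WIN = 0xFE };

void out_count_generate(std::vector<uint8_t>& tb,
                        const std::function<std::vector<uint32_t>(uint32_t)>& successors,
                        const std::function<std::vector<uint32_t>(uint32_t)>& predecessors,
                        const std::function<bool(uint32_t)>& is_terminal_loss)
{
    std::vector<uint32_t> frontier;
    // Forward pass: every undecided position stores the number of its
    // still-unresolved successors; immediate losses go into the frontier.
    for (uint32_t p = 0; p < tb.size(); ++p) {
        if (is_terminal_loss(p)) { tb[p] = LOSS; frontier.push_back(p); }
        else tb[p] = (uint8_t)successors(p).size();
    }
    // Retro pass: when a position resolves, update its predecessors.
    // A predecessor of a loss is a win; a predecessor of a win has its
    // counter decremented and becomes a loss once the counter hits zero.
    while (!frontier.empty()) {
        std::vector<uint32_t> next;
        for (uint32_t p : frontier) {
            for (uint32_t q : predecessors(p)) {
                if (tb[q] == WIN || tb[q] == LOSS) continue;   // already resolved
                if (tb[p] == LOSS) { tb[q] = WIN; next.push_back(q); }
                else if (--tb[q] == 0) { tb[q] = LOSS; next.push_back(q); }
            }
        }
        frontier.swap(next);
    }
    // Whatever still holds a counter was never resolved: a draw.
}

int main() {
    // Tiny toy graph (not chess) just to exercise the routine.
    std::vector<std::vector<uint32_t>> succ = {{1}, {2}, {3}, {}};
    std::vector<std::vector<uint32_t>> pred = {{}, {0}, {1}, {2}};
    std::vector<uint8_t> tb(4);
    out_count_generate(tb,
        [&](uint32_t p) { return succ[p]; },
        [&](uint32_t p) { return pred[p]; },
        [&](uint32_t p) { return succ[p].empty(); });
    for (int p = 0; p < 4; ++p)
        printf("pos %d: %s\n", p,
               tb[p] == WIN ? "win" : tb[p] == LOSS ? "loss" : "draw");
    return 0;
}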

Anyway, my wild guess is that out-counting will be faster than any technique that involves generating moves for a position more than a handful of times. With those techniques you don't store a full byte per position, so you don't know when all of the successors have been resolved: any time one of the successors of an unresolved position is newly resolved, you have to generate its moves again to see whether that was actually enough. And that means more memory accesses overall. Out-counting processes all pieces of a side at the same time, so you can keep WLD and a distance metric in the same byte and compute them at the same time.

The only way a disk-based generation technique could possibly be faster is if it actually incurs fewer L2/L3 cache misses overall, and never has to stop and wait for disk I/O to catch up. Since each move generated (except for moves of the piece mapped to the lowest index bits) costs approximately one cache miss, generating as few moves as possible helps put an upper limit on the total number of cache misses.

But hey, I'd say do whatever makes sense to you. I'm mostly interested in 5-man and less, which can all be done comfortably in RAM. I'm not ambitious enough to even think about trying to generate 7-man databases, and for those it would certainly be important to stream through your working sets in a very efficient way since they'd be too big to fit in RAM.
syzygy
Posts: 5911
Joined: Tue Feb 28, 2012 11:56 pm

Re: pawn enumeration

Post by syzygy »

wgarvin wrote:I think why people recommend the "out-counting" method (which can only be efficiently done with a byte per position) is that you only have to generate moves for each position about two times during the entire generation process: one forward generation to count up how many successors the position has within the slice, and one reverse generation to decrement the counters of unresolved predecessors when the position is resolved.
I've tried out-counting, but it was slower than my "grandfather" implementation (and is more complicated). I think the reason is that you have to do the forward generation for all positions, including those that are not lost at all. Maybe it is possible to do the out-counting only for a side that is generally losing, but that's going to add a lot of complexity.
Yes, there's still going to be a lot of random accesses throughout the slice (which you can mitigate a little bit by putting the piece with the highest mobility in the lowest bits of the index, so that all 64 placements of that piece lie together in the same cache line).
Indeed that helps.
My current generator does 5-piece pawnless tables in about 10-12 seconds (i7-2670QM, using 8 hyperthreads). It will be interesting to see whether 6-piece tables are going to take 10-12 minutes (in which case my compression will be the bottleneck anyway), or whether they will completely thrash the CPU cache.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: pawn enumeration

Post by diep »

wgarvin wrote:I still don't see how generating entirely in RAM is going to be slower than anything involving disk access. Even with streaming, you're going to be I/O bound because it's hundreds of times slower than RAM. Any kind of non-sequential I/O will take dozens of milliseconds. [...]
First of all, it's unlikely he'll write a second generator in his life, if any generator at all.

Secondly, you should scroll back. I said you don't have the bandwidth for it.
Third, my method hardly does any random lookups. If you look up 64 KB+ at a time, then you get the full bandwidth out of it.

The most important thing is what I wrote in the first messages.

The problem is not keeping the entire slice in RAM. The problem is limiting the CPU time.

I/O is the smallest problem you have with the default way of generating. Your bottleneck is that of every bit in a register, each instruction is basically using only 1 bit effectively.

So you're wasting 63 bits, or, if you look at it as SSE2, you're wasting even more.

Do you get that problem?

The CPU can simply execute only so many instructions per cycle.

Now the improved Koistinen method, which is a trivial thing to find once you understand Koistinen's idea well, does use all 64 bits effectively. So each instruction is dealing with 64 positions rather than 1.

*That's* why it's a lot faster in the first place.

Note there is overhead for that method, so it's not exactly 64 times faster. Let's say effectively a factor of 60.
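To make the 64-positions-per-word point concrete, a minimal sketch (not Koistinen's actual code): one 64-bit word holds a bit per square of one piece, and a retrograde king step over all 64 of those positions costs only a handful of shifts and masks. The square-to-bit convention and names are assumptions:

Code: Select all

// Bit-parallel retro step in the spirit of the argument above.
// Convention assumed: bit 0 = a1, bit 7 = h1, bit 63 = h8.
#include <cstdint>
#include <cstdio>
#include <bitset>

static const uint64_t FILE_A = 0x0101010101010101ULL;
static const uint64_t FILE_H = 0x8080808080808080ULL;

// Input: bitboard over the square of one piece (the defending king, say),
// where bit s = 1 means "the position with that king on s is already lost".
// Output: bit s = 1 means "from s that king has a move into a lost position",
// i.e. one retrograde king step applied to 64 positions in ~10 instructions.
static uint64_t king_step(uint64_t lost) {
    uint64_t east = (lost & ~FILE_H) << 1;
    uint64_t west = (lost & ~FILE_A) >> 1;
    uint64_t row  = lost | east | west;
    return east | west | (row << 8) | (row >> 8);
}

int main() {
    uint64_t lost = 1ULL << 28;        // only "king on e4" is lost so far
    uint64_t from = king_step(lost);   // the 8 squares adjacent to e4
    printf("%d positions reach a lost position in one king move\n",
           (int)std::bitset<64>(from).count());   // prints 8
    return 0;
}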

The second thing is that the method I proposed basically works out of L1 and L2. Cache misses are going to be minimal. So the method of streaming bitmaps is going to work within L1 and L2 most of the time, whereas with the naive method you're TLB thrashing, of course.

Realize that for the bitmap method it doesn't matter whether the data is in RAM or on disk. You prefetch it anyway.

If you do naive lookups from 2 EGTBs using 2 bits per position, then doing random lookups into huge RAM buffers is always slower than doing random lookups that stay within L1/L2. Do you get why it's faster to do things within L1/L2?

Because L1/L2 have a bigger bandwidth than your RAM.

One experiment you can do is run a benchmark program. I wrote one that does random 8-byte lookups in memory buffers. A bigger buffer is considerably slower than a small one, and the reason has nothing to do with L1/L2/L3; simply fetching data from RAM across a bigger address range eats more time.
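A minimal sketch of that kind of benchmark (random 8-byte reads over buffers of various sizes); the buffer sizes, counts and timing are arbitrary choices, and because the loads are independent it measures random-access throughput rather than pure dependent-load latency:

Code: Select all

// Random 8-byte reads over buffers of increasing size; larger buffers show
// the TLB/DRAM effect discussed above. All sizes and counts are arbitrary.
#include <cstdint>
#include <cstdio>
#include <vector>
#include <random>
#include <chrono>

int main() {
    std::mt19937_64 rng(12345);
    for (size_t mb : {1, 16, 256, 1024}) {           // buffer sizes in MiB
        size_t n = mb * 1024 * 1024 / sizeof(uint64_t);
        std::vector<uint64_t> buf(n, 1);
        // Precompute random indices so we time the loads, not the RNG.
        std::vector<uint32_t> idx(1 << 22);
        for (auto& i : idx) i = (uint32_t)(rng() % n);

        uint64_t sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (uint32_t i : idx) sum += buf[i];         // the random lookups
        auto t1 = std::chrono::steady_clock::now();
        double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
        printf("%5zu MiB buffer: %6.1f ns per lookup (checksum %llu)\n",
               mb, ns / idx.size(), (unsigned long long)sum);
    }
    return 0;
}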
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

price of complex indexing

Post by Daniel Shawul »

Code: Select all

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.96      5.08     5.08 80073496     0.00     0.00  ENUMERATOR::get_index(unsigned long&, bool)
 21.67      9.87     4.79        8     0.60     1.94  ENUMERATOR::backward_pass(unsigned char*, unsigned char*, unsigned char*, unsigned char*, unsigned long const&, unsigned long const&, bool)
 10.04     12.09     2.22 12024640     0.00     0.00  ENUMERATOR::get_retro_score(unsigned char*, unsigned char*, unsigned char*, unsigned char*, int, bool)
  9.64     14.22     2.13 195361391     0.00     0.00  SEARCHER::do_move(int const&)
  7.80     15.94     1.73 19882129     0.00     0.00  ENUMERATOR::get_pos(unsigned long)
  7.01     17.49     1.55 195361391     0.00     0.00  SEARCHER::undo_move(int const&)
  5.56     18.72     1.23 14874174     0.00     0.00  ENUMERATOR::get_init_score(unsigned char&, int, int, bool)
  5.29     19.89     1.17 210660367     0.00     0.00  SEARCHER::attacks(int, int) const
  3.44     20.65     0.76        8     0.10     0.77  ENUMERATOR::initial_pass(unsigned char*, unsigned char*, unsigned char*, unsigned char*, unsigned long const&, unsigned long const&, bool)
  2.96     21.31     0.66 14874174     0.00     0.00  SEARCHER::gen_all()
  1.36     21.61     0.30 12024640     0.00     0.00  SEARCHER::gen_retro()
  1.18     21.87     0.26        2     0.13    10.96  generate_slice(ENUMERATOR*, _IO_FILE*, _IO_FILE*)
  0.41     21.96     0.09                             print_sq(int const&)
  0.36     22.04     0.08                             SEARCHER::SEARCHER()
  0.27     22.10     0.06  4439741     0.00     0.00  SEARCHER::probe_bitbases(int&)
  0.05     22.11     0.01        7     0.00     0.00  ENUMERATOR::init()
  0.02     22.11     0.01                             SEARCHER::gen_noncaps()
I did a profile of the generation of 4-man bitbases, and guess what comes out at the top. Yes, it is the complex indexing, which is the price I pay for elegance. After making non-capture moves (forward or retro), I calculate the index of the position from _scratch_. This is because I used all kinds of tricks for indexing. If it were a simple 64x64x... etc. scheme, it would have been possible to update the index incrementally by just taking out the contribution of the from square and then adding the contribution of the to square. I guess I will have to try naive indexing methods for speedy generation. I think it may even be possible to update incrementally with the indexing I have, which could give me some speedup. OTOH move generation is so fast despite what someone here might claim ;) If I did generation of 5- or 6-man tables involving disk access, then move generation speed would be pretty irrelevant.
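For illustration, a minimal sketch of that incremental update for a plain 64x64x... index; the piece order, layout and names are assumptions, not the ENUMERATOR code profiled above:

Code: Select all

// Naive 64*64*... indexing with incremental update after a non-capture move:
// each piece contributes square * weight, so moving one piece only changes
// that piece's term. Names and layout are illustrative only.
#include <cstdint>
#include <cstdio>

struct NaiveIndex {
    int      npieces;
    uint64_t weight[8];                   // weight[i] = 64^(npieces-1-i)

    void init(int n) {
        npieces = n;
        uint64_t w = 1;
        for (int i = n - 1; i >= 0; --i) { weight[i] = w; w *= 64; }
    }
    // Full recomputation from scratch (the kind of work the profile shows).
    uint64_t full(const int* sq) const {
        uint64_t idx = 0;
        for (int i = 0; i < npieces; ++i) idx += (uint64_t)sq[i] * weight[i];
        return idx;
    }
    // Incremental update: piece i moved from 'from' to 'to'.
    uint64_t update(uint64_t idx, int i, int from, int to) const {
        return idx - (uint64_t)from * weight[i] + (uint64_t)to * weight[i];
    }
};

int main() {
    int sq[4] = {4, 60, 12, 52};          // four piece squares, order is fixed
    NaiveIndex ni; ni.init(4);
    uint64_t idx = ni.full(sq);

    sq[2] = 28;                           // piece 2 moves from square 12 to 28
    uint64_t inc = ni.update(idx, 2, 12, 28);
    printf("from scratch: %llu, incremental: %llu\n",
           (unsigned long long)ni.full(sq), (unsigned long long)inc);  // equal
    return 0;
}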