I don't think this is theoretically possible: the pawn advance is a function in a multi-dimensional space, and it cannot be mapped by a continuous function on the one-dimensional space of numbers.
But I don't understand why you would need this. Why not simply make a pawn hash table which lists all possible positions, and the name of the file on the HD where you stored the slice, if it is already computed? You can use a hashed perft on positions with Pawns only (and counts of the other pieces, so you know when lane-changing is still possible), starting with a placement step on 2nd rank. Then you can just run through the list to search for not-yet-computed slices that have all successors already computed, and start working on those. If it is not a pure Pawn ending, I don't think you would ever get in a situation where a linear search through the list of Pawn structures would take significant time compared to calculating a single slice.
This also makes it much easier to fully exploit the exchange symmetry of a large number of Pawns.
With pure Pawn endings slices would only be 64 positions, so for the sake of efficiency you would probably combine them to super-slices, e.g. only slice based on white Pawn locations.
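The bookkeeping hgm describes above can be sketched as a table of pawn structures with their slice files and successor links, plus the linear scan for ready work; all names and sizes here are illustrative, not from any real generator:

```c
#include <stdbool.h>

#define MAX_SLICES 4096
#define MAX_SUCC   16

/* One pawn structure: hash key, the slice file on disk (empty if not
   yet computed), and links to the successor structures it depends on. */
struct slice_entry {
    unsigned long long pawn_key;   /* hash of the pawn structure */
    char filename[64];             /* slice file on the HD, "" if absent */
    bool computed;
    int  succ[MAX_SUCC];           /* indices of successor structures */
    int  nsucc;
};

static struct slice_entry table[MAX_SLICES];
static int ntable;

/* Linear scan for a structure whose successors are all computed; hgm's
   point is that this scan is negligible next to computing one slice. */
static int find_ready_slice(void) {
    for (int i = 0; i < ntable; i++) {
        if (table[i].computed) continue;
        bool ready = true;
        for (int j = 0; j < table[i].nsucc; j++)
            if (!table[table[i].succ[j]].computed) { ready = false; break; }
        if (ready) return i;
    }
    return -1;   /* nothing ready, or everything done */
}
```

The generator would repeatedly call `find_ready_slice`, compute that slice, record its filename, and mark it computed.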
pawn enumeration
Moderator: Ras
-
hgm
- Posts: 28458
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: pawn enumeration
Well, it seems slicing them too much is of no use. The largest pawnless bitbases usually take 4x less than those with single pawns, so a reduction of 8x using file slicing is enough. It is also easier to implement, because now I don't need to load/unload slices, since a slice can be generated by itself (except for capture/promotion transitions). And the number of slices is always 8 if I slice based on the first pawn position. I would need to slice further only if the largest pawnless bitbases somehow get reduced.
hgm wrote: I don't think this is theoretically possible: the pawn advance is a function in a multi-dimensional space, and it cannot be mapped by a continuous function on the one-dimensional space of numbers.
But I don't understand why you would need this. Why not simply make a pawn hash table which lists all possible positions, and the name of the file on the HD where you stored the slice, if it is already computed? You can use a hashed perft on positions with Pawns only (and counts of the other pieces, so you know when lane-changing is still possible), starting with a placement step on 2nd rank. Then you can just run through the list to search for not-yet-computed slices that have all successors already computed, and start working on those. If it is not a pure Pawn ending, I don't think you would ever get in a situation where a linear search through the list of Pawn structures would take significant time compared to calculating a single slice.
This also makes it much easier to fully exploit the exchange symmetry of a large number of Pawns.
With pure Pawn endings slices would only be 64 positions, so for the sake of efficiency you would probably combine them to super-slices, e.g. only slice based on white Pawn locations.
-
jdart
- Posts: 4420
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: Related: malloc virtual
Shared memory basically does this. See mmap (Linux) or CreateFileMapping/MapViewOfFile (Windows).
--Jon
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: pawn enumeration
hgm wrote: I don't think this is theoretically possible: the pawn advance is a function in a multi-dimensional space, and it cannot be mapped by a continuous function on the one-dimensional space of numbers.
BS statement.
I solved that in Diep already in 1994 or so. I was 20 years old and had just started Diep.
Note that the best way to order moves is just using a picksort - I'm using that of course.
But solving just the pawns is so simple...
Note that with today's bitboards you can solve it in a different manner.
Just enumerate, for White, the squares with a8 as least significant, followed by b8, up to h1 as most significant.
Now with a simple bitboard trick you can get out the first pawn.
Then you fiddle with that and you've got the pawn move.
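A sketch of this, assuming diep's numbering (bit 0 = a8 ... bit 63 = h1, so the least significant set bit of the white-pawn bitboard is its most advanced pawn); `__builtin_ctzll` is a GCC/Clang intrinsic, and blockers, double pushes, and captures are ignored:

```c
#include <stdint.h>

/* With a8 = bit 0 and h1 = bit 63, low bit indices are closest to the
   promotion rank, so the least significant set bit is the "first"
   (most advanced) white pawn. */
static int first_pawn(uint64_t pawns) {
    return __builtin_ctzll(pawns);        /* index of lowest set bit */
}

/* A one-square white push moves the pawn one rank toward rank 8,
   which in this numbering subtracts 8 from the square index,
   i.e. shifts the isolated bit right by 8. */
static uint64_t push_first_pawn(uint64_t pawns) {
    uint64_t lsb = pawns & (0ULL - pawns);   /* isolate lowest set bit */
    return (pawns ^ lsb) | (lsb >> 8);
}
```

In this numbering a2 is square 48 and a3 is square 40, so pushing a lone pawn on a2 yields the a3 bit.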
You know, this is high-school math...
It is one-dimensional math...
On a GPU it makes sense, but he's making a mistake.
He should talk to Joost Buijs about how he did his ordering for Nightmare in the 90s...
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: pawn enumeration
Why not just build the generator and have a look.
Daniel Shawul wrote: I am trying to generate 6 men bitbases with pawns. The pawnless need about 4G of RAM but those with a single pawn need as much as 13G without pawn slicing.
Even with a few megabytes of RAM in your machine you have zero problems, you know, provided your terabyte harddrive is fast enough.
The bottleneck is the speed of your move generator and nothing else, provided you don't have gibberish C++ code that slows you down, nor a caching system that's O(n) (Nalimov looks up in O(n)).
Everyone has known this since the mid 90s.
If you use 1 drive at the end of the platters it's fastest, so the speed at which you can fill the EGTB cache is 133MB/s, or in short say a billion positions per second.
Nearly all motherboards have SATA RAID now.
If you have a RAID10 array of say 8 drives, you can easily deliver 500MB/s (better controllers, over a gigabyte per second of i/o).
That's 8 billion positions per second from the i/o, against which in your L1/L2 you're gonna do 100+ billion lookups or so, depending upon which EGTB you generate.
No move generator can keep up with that, not even if you run it in parallel on 8 cores.
Your lookup speed to your L2 cache is gonna be slower than your harddrive speed, you know.
How many cycles for all those address calculations?
The last thing you worry about is the i/o speed.
If you had just experimented, you would've known this as well.
A few days to generate all 6 men.
I don't know your format. In Diep format it's 5T positions. So to generate, that's altogether 625+ gigabytes.
Then you can multiply that by the number of passes (writes and reads) that you need in total for the total i/o. The i/o is not the problem. Writing a bug-free generator is gonna take more time for sure than that...
-
wgarvin
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: pawn enumeration
I can see two good reasons to slice fully by all the pawns that are on the board. One, it reduces the working set size to the size of one slice (or two if you confine WQ to one side of the board, in which case some WQ moves can cross into the "mirror slice". A few slices mirror onto themselves, and the working set size is the same as if you confined a pawn, but WQ is unique, which simplifies things). Two, you can compute DTZ or DTZ50 metrics very easily, because the pawn moves are all inter-slice moves and only reversible moves occur within a slice.
As far as numbering the slices, you have two choices: Either number them in an ordering so that if slice X depends on other slice(s), they have a lower slice number than X, and then compute them in that order (the ranks of the pawns are the key for that). Or, number them in whatever order you want and just recursively explore the dependencies of each slice and make sure you generate them first.
Another thing: since pawns can't change files, there are effectively several "groups" of slices with absolutely no dependencies between them. These could be computed separately on different cores or machines. At the very least, there's no need to keep any slices from one group around in RAM when you move on to the next group.
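The first numbering option can be sketched with an order key: since pawns only advance, the total number of squares the pawns still have to travel strictly decreases on every pawn move, so a slice never depends on a slice with a larger key. A sketch (names and rank convention are illustrative):

```c
/* Order key for a slice, given the ranks of the white and black pawns
   (rank 0 = White's first rank, rank 7 = White's eighth). Every pawn
   move (push, double push, capture, promotion) strictly decreases the
   key, so any slice X depends only on slices with a lower key, and
   computing slices in increasing key order generates dependencies
   first - the scheme wgarvin describes. */
static int slice_order_key(const int wp_rank[], int n_wp,
                           const int bp_rank[], int n_bp) {
    int key = 0;
    for (int i = 0; i < n_wp; i++) key += 7 - wp_rank[i]; /* to rank 8 */
    for (int i = 0; i < n_bp; i++) key += bp_rank[i];     /* to rank 1 */
    return key;
}
```

Ties between slices with equal keys are harmless, since a pawn move always changes the key.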
-
diep
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: pawn enumeration
For a GPU it's a relevant question. For EGTBs, he'll soon withdraw this question, as the most simple format to generate is best - as the only thing that matters is generator speed.
wgarvin wrote: I can see two good reasons to slice fully by all the pawns that are on the board. One, it reduces working set size to the size of one slice (or two if you confine WQ to one side of the board, in which case some WQ moves can cross into the "mirror slice". A few slices mirror onto themselves, and working set size is the same as if you confined a pawn, but WQ is unique which simplifies things). Benefit Two is that you can compute DTZ or DTZ50 metrics very easily because the pawn moves are all inter-slice moves and only reversible moves occur within a slice.
As far as numbering the slices, you have two choices: Either number them in an ordering so that if slice X depends on other slice(s), they have a lower slice number than X, and then compute them in that order (the ranks of the pawns are the key for that). Or, number them in whatever order you want and just recursively explore the dependencies of each slice and make sure you generate them first.
Another thing: since pawns can't change files, there are effectively several "groups" of slices with absolutely no dependencies between them. These could be computed separately on different cores or machines. At the very least, there's no need to keep any slices from one group around in RAM when you move on to the next group.
A clever format = stupid for the generator.
In Diep I've had 5 formats or so (most generators I threw away).
Those then convert to the Diep EGTB format, where I filter out more illegals and which stores 5 positions per byte.
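Storing 5 positions per byte works because a win/draw/loss value is ternary and 3^5 = 243 combinations fit in one byte. A minimal sketch of such packing (function names are mine, not Diep's):

```c
#include <stdint.h>

/* Pack five ternary WDL values (0 = loss, 1 = draw, 2 = win) into one
   byte: 3^5 = 243 possible combinations fit in the 256 byte values. */
static uint8_t pack5(const uint8_t v[5]) {
    uint8_t b = 0;
    for (int i = 4; i >= 0; i--)
        b = (uint8_t)(b * 3 + v[i]);   /* base-3 encoding, v[0] lowest */
    return b;
}

/* Recover the five values by repeated division by 3. */
static void unpack5(uint8_t b, uint8_t v[5]) {
    for (int i = 0; i < 5; i++) {
        v[i] = b % 3;
        b /= 3;
    }
}
```

This buys a 5/8 size factor over one byte per position, at the cost of a little arithmetic per probe.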
As only the first pass is the exchange pass, I do that the slow way, just using the default EGTB lookup.
For the 7 men the problem is having enough i/o storage. To generate I'd like a RAID array of say 16 disks. Depending upon SATA, that means I need another few disks in reserve as replacements.
Right now that's 16 drives in RAID10, and in my closet a few spare drives so I can directly replace one if it breaks - say you need 20 drives.
The price is 138 euro for a 3TB drive here, but I'm not sure those are good drives for such an array, as Maxtor got bought by Seagate.
Those Maxtors, when I ran EGTBs on them, 50% died after streaming some terabytes.
I simply do not have sufficient cash for the harddrives for 7 men. That is the only problem.
Storage of the finished 7 men is not a big problem. I store those in Diep format.
Those compress well. For generation you need only a few EGTBs at a time uncompressed on that array anyway.
Note, with compression here I mean SIMPLE generic compression of the entire EGTB, with for example 7-zip.
A smart compression is of course not usable for generation; looking it up would take too much system time. So during generation you need the EGTBs in uncompressed format on the array.
That's why you need a big array.
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Related: malloc virtual
Indeed, that is what I needed. Now I can think of available physical RAM as a cache of a much bigger address space. Here I could not allocate 13G with malloc/new even though it is a 64-bit machine. I read somewhere there is a special version of malloc that can do that, but anyway mmap() is a direct solution. With this method, I can generate bitbases of any size (5, 6, 7... etc) without the fear of exceeding physical RAM limits. The performance of generation depends on the efficiency of paging, which is better done by the OS.
jdart wrote: Shared memory basically does this. See mmap (Linux) or CreateFileMapping/MapViewOfFile (Windows).
--Jon
Thanks
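A minimal sketch of the mmap() approach under discussion, assuming Linux/POSIX; the path and helper name are illustrative. A file backs the mapping, so the OS pages data in and out on demand and physical RAM acts as a cache for a table larger than RAM:

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file of `size` bytes into the address space. Pages are loaded
   and evicted by the OS as the generator touches them, so the table
   may exceed physical RAM. Returns NULL on failure. */
static unsigned char *map_table(const char *path, size_t size) {
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }
    unsigned char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : p;
}
```

With MAP_SHARED, stores to the mapped memory are written back to the file by the kernel, so the finished table is on disk without explicit write() calls (msync() can force a flush).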
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: pawn enumeration
I have a 6 men bitbases generator, but not a machine big enough to generate those with a single pawn. I doubt you can generate say KBNKBP fast on disk. Last time I checked, disk access (SSD or HD) is still orders of magnitude slower, so I don't understand why you say EGTB generation is computation bound. Please explain.
diep wrote: Why not just build the generator and have a look.
Even with a few megabytes of RAM in your machine you have zero problems, you know, provided your terabyte harddrive is fast enough.
How can caching be a factor when you are generating it in RAM? Now I can do 6 men in 4G doing everything in RAM. Every iteration except the initialization does not touch the disk. If the EGTB that is being generated is in RAM, the cache has no effect at all. Also, Nalimov's LRU cache is meant to be used for probes done by the engine, in addition to what the OS provides you. So use of an O(n) algorithm to find the block means nothing compared to the decompression phase that follows, which takes much more time. 16/32MB is usually what is allocated as EGTB cache.
diep wrote: The bottleneck is speed of your move generator and nothing else, provided you don't have gibberish C++ code that slows you down nor a caching system that's O(n) (Nalimov looks up in O(n)).
Everyone has known this since the mid 90s.
Think again my friend, it is bitbases.
diep wrote: If you use 1 drive at the end of the platters it's fastest, so the speed at which you can fill the EGTB cache is 133MB/s, or in short say a billion positions per second.
Nearly all motherboards have SATA RAID now.
If you have a RAID10 array of say 8 drives, you can easily deliver 500MB/s (better controllers, over a gigabyte per second of i/o).
That's 8 billion positions per second from the i/o, against which in your L1/L2 you're gonna do 100+ billion lookups or so, depending upon which EGTB you generate.
No move generator can keep up with that, not even if you run it in parallel on 8 cores.
Your lookup speed to your L2 cache is gonna be slower than your harddrive speed, you know.
How many cycles for all those address calculations?
The last thing you worry about is the i/o speed.
IIRC 5 bits/pos didn't help me at all. Compression worked better with 4 bits/pos, probably because the latter is a power of 2. But it is impressive that you can generate 6 men in a few days. How much RAM do you need for KBNKNP, for example? And how many hours for that?
diep wrote: If you had just experimented, you would've known this as well.
A few days to generate all 6 men.
I don't know your format. In Diep format it's 5T positions. So to generate, that's altogether 625+ gigabytes.
If the problem was not i/o bound, I doubt it would be difficult to generate 7 men.
diep wrote: Then you can multiply that by the number of passes (writes and reads) that you need in total for the total i/o. The i/o is not the problem. Writing a bug-free generator is gonna take more time for sure than that...
Last edited by Daniel Shawul on Tue Jul 24, 2012 9:02 pm, edited 1 time in total.
-
syzygy
- Posts: 5880
- Joined: Tue Feb 28, 2012 11:56 pm
Re: pawn enumeration
In that case dividing the pawn table into 4 files (based on the leading pawn) would reduce 13G to 3.25G < 4G, which should be sufficient.
Daniel Shawul wrote: I am trying to generate 6 men bitbases with pawns. The pawnless need about 4G of RAM but those with a single pawn need as much as 13G without pawn slicing.
My simple indexing scheme when using 3 bits per position would require 4.5G per file (2 x 6 x 64^5 x 3 / 8 bytes), assuming you need 3 bits for white to move and 3 bits for black to move. When eliminating positions with two pieces on the same square, you still need 2 x 6 x 63 x 62 x 61 x 60 x 59 x 3/8 = 3.54 G in the general case.
You can go even lower by eliminating positions with neighbouring kings, but that will give complicated indexing functions. Another option is keeping only part of the file in memory. For generating the slice with the leading (white) pawn on a2, you only need to access the slices with the leading pawn on a3 and a4. This cuts RAM requirements by a factor of 2, giving 2.25G with the simple indexing scheme.
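The elimination of positions with two pieces on the same square comes down to renumbering each piece's square to skip the squares already taken. A sketch under that assumption (not syzygy's actual indexing function, and ignoring the already-placed pawn and the neighbouring-kings refinement):

```c
#include <stdint.h>

/* Map n piece squares (all distinct, 0..63) to a single index in
   [0, 64*63*...*(64-n+1)): the i-th piece gets a value in [0, 64-i)
   by renumbering its square to skip the squares used by earlier
   pieces. This realizes the 63*62*61*... style counting above. */
static uint64_t index_distinct(const int sq[], int n) {
    uint64_t idx = 0;
    for (int i = 0; i < n; i++) {
        int adj = sq[i];
        for (int j = 0; j < i; j++)
            if (sq[j] < sq[i]) adj--;   /* skip earlier, smaller squares */
        idx = idx * (uint64_t)(64 - i) + (uint64_t)adj;
    }
    return idx;
}
```

The inverse (index back to squares) walks the same renumbering in reverse, which is what makes such schemes slightly more expensive to probe than the naive 64^n layout.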
Yes, I remember.
Daniel Shawul wrote: I am not sure if you remember, but you explained to me some years ago how to do enumeration of like pieces. Now I want the same thing for enumerating white and black pawns together given an index.
My old generator had "perfect" indexing functions for almost all combinations of 5 pawns (since in suicide chess it is possible to have PPPPvP and PPPvPP), but there was no requirement for letting promotions come before the positions that could lead to those promotions (since I did not use it during generation). I'm now using less-than-perfect but somewhat simpler indexing that gives more possibilities to permute the pieces.
I'm generating what I call DTZ50+ tables. For positions that can be won under the 50-move rule, I calculate the number of plies to a zeroing move. The value for positions that cannot be won under the 50-move rule but can be won without it is either the number of plies to a zeroing move into a position won under the 50-move rule, or 100 plus the number of plies to a zeroing move into a "won" position that cannot be won under the 50-move rule, whichever is smaller. For use during search I generate WDL50+ tables, which have 5 values: -2, -1, 0, 1, 2, where -1 and 1 indicate that the position is lost or won, but drawn under the 50-move rule.
Daniel Shawul wrote: Yes, naive indexing schemes may go faster but they could require larger RAM. You are probably generating DTM tables, so that already brings the requirement higher anyway. You would need at least 2 bytes for the counter and storing the DTM score. I am using 2 bits for the score and 1 bit for visits.
During generation I only need 1 byte per position. When the values get too big, I save to disk and reduce all values in RAM.
For generation and compression of pawnless 6-piece tables I need a machine with 32GB. It's a lot, but I don't want to sacrifice speed since my plan is to eventually generate the 6-piece suicide tables and there are a LOT of them.