max number of pseudo legal move

MikeB · Post by **MikeB** » Thu Jun 13, 2019 6:50 pm

Uri Blass wrote: ↑Wed Jun 12, 2019 1:47 am
JohnWoe wrote: ↑Tue Jun 11, 2019 6:50 pm Logged Sapeli's maximum moves from standard games against Fairy:
https://pastebin.com/HU2CQvZm

79 moves is the maximum currently.
I think 128 is plenty for gameplay. For analyzing 256 is bullet proof.
I will be surprised if 79 moves is the maximum for human games.

There are many games between children when the weaker side does not resign and the stronger side prefer to promote many pawns to queens instead of making mate fast and I am sure that some of them have more than 79 moves.

Hahah Obviously you are correct - but the reference is made at a higher level of play - not child play.

Ajedrecista · Post by **Ajedrecista** » Thu Jun 13, 2019 7:37 pm

Hello Roland:

Roland Chastain wrote:[...]

By the way, the official JetChess web page no longer exists. Would you know another place where one can download it?

The web is still reachable with the help of Wayback Machine:

https://web.archive.org/web/20171119065 ... e/jetchess

The direct link of the download works in the same way (it is a ZIP of 480.2 KB):

https://web.archive.org/web/20150402020 ... .0.0.0.zip

Regards from Spain.

Ajedrecista.

Roland Chastain · Post by **Roland Chastain** » Thu Jun 13, 2019 8:48 pm

Ajedrecista wrote: ↑Thu Jun 13, 2019 7:37 pm The web is still reachable with the help of Wayback Machine:

https://web.archive.org/web/20171119065 ... e/jetchess

The direct link of the download works in the same way (it is a ZIP of 480.2 KB):

https://web.archive.org/web/20150402020 ... .0.0.0.zip

Hello Ajedrecista. Thank you very much.

Best regards.

dragontamer5788 · Post by **dragontamer5788** » Thu Jul 11, 2019 7:32 pm

It seems like memory is a very finite resource in the GPU-world, especially when you have 16,000+ threads of execution. And since my model is going to have hundreds-of-thousands, maybe millions, of tasks... saving 64-bytes or 128-bytes here and there can add up to hundreds-of-MB of memory being saved.

I don't know if anyone else has come across this problem yet... but here's my planned solution for saving a bit of memory between the "average case" 35x children and "worst case" ~200+ children per node.

Code: Select all

struct MoveListChunk{
	uint16_t moves[32];
};

struct MoveList{
	uint8_t capacity;
	uint8_t curMove;
	MoveListChunk* chunks[8];
	MoveListChunk firstChunk; // First chunk is stored in the MoveList, because all moveList have at least 1 move
};

// Example initialization

MoveList ml = malloc(sizeof(MoveList));
ml.capacity = CountMoves(position);
ml.curMove = 0;
ml.chunks[0] = &ml.firstChunk;
uint32_t allocedCapacity = 32;
while(allocedCapacity < ml.capacity){
	ml.chunks[allocedCapacity / 32] = malloc(sizeof(MoveListChunk));
	allocedCapacity += 32;
};
// No need to initialize the later pointers.

for(int i=0; i<numMovesGenerated; i++){
	ml.chunks[i/32].moves[i%32] = GenerateMove(position, i);
}

// To consume
for(int i=0; i<ml.capacity; i++){
	Position nextPos = MakeMove(ml.chunk[i/32].moves[i%32]);
	foobar(nextPos);
	UnmakeMove();
}

// To free
for(int i=1; i*32<ml.capacity; i++){
	free(ml.chunks[i]);
}

The minimum size movelist (holding 32-or-less moves) uses 144 bytes. Each additional chunk uses +64 bytes. Most nodes in practice will therefore be 130 bytes or 194 bytes used. But the chunk-pointers can grow all the way to 256-moves being used (where the structure will use 578 bytes).

In contrast, a typical uint16_t moves[256] array (+uint8_t capacity + uint8_t curIndex) will always use 514 bytes. We basically add 64-bytes to the structure for the (limited) ability to dynamically resize for the workload. division by 32 and % by 32 are both efficient (will be compiled into bitshifts), so I don't expect much overhead either.

This code has not been debugged or tested. But I think you'd get the gist from reading it. Unlike a linked list, all positions are available from just a single pointer-dereference. All you guys in the CPU world probably have more than enough RAM to do your job. This is more of an issue when you're storing hundreds-of-thousands of movelists for parallel exectuion... saving those 64-bytes here and there really does add up to a lot of RAM usage.

1-Million tasks x 194 bytes (64 moves) == 194MB for movelists alone. If I used uint16_t moves[256] arrays instead, it would be 514 MB on movelists... the ~300MB difference could go to the TT instead. I don't think that other people expect to be holding ~1-million movelists in memory like my code, but that's how I plan to solve this particular problem.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 11, 2019 11:25 pm

These two positions are interesting because they both have 205 checkmates in 2 (every single legal move is a mate in 2):
QKn1kb2/Pnr4q/4q3/1q4q1/3q4/q4q2/7r/2q1q3 b - -
2b2nKQ/q4rnP/1k1q4/1q4q1/4q3/2q4q/r4q2/3q4 b - -

And this one has 218 legal moves:
3Q4/1Q4Q1/4Q3/2Q4R/Q4Q2/3Q4/1Q4Rp/1K1BBNNk w - -

Nobody has ever invented a position that has more than 218 legal moves {so far as I know}. That does not mean it is impossible, but it seems that 256 is probably overkill.

I guess the question is, would you want your program to handle these three positions?

hgm · Post by **hgm** » Thu Jul 11, 2019 11:44 pm

I don't know in which order the threads will visit the tree nodes, but if it is anything like a recursive walk of a sub-tree, it would be faster and more compact to put the moves on a stack. Then the list for each node can directly follow that of its parent without wasting any space. And it will be discarded by the time you get back to the level of the parent, so that you are free to expand the list there as much as you like.

If you divide up memory in chuncks of, say, 512 moves, you could grow a stack in each of those, and only allocate a new chunck when there is 'serious danger' that the remaining empty space in the chunck is not enough to hold the list. The number of moves will never incrase from 32 to 200 in a single move; it must be easy to guess an upper limit for the number of moves in a (grand-)daughter with high confidence, e.g. by adding 16. So if you currently have 48 moves, you allocate a new 512-move chunck if the remaining space was less than 64, and otherwise just grow the stack in the current chunck. If in 0.1% of the cases you would overrun the remaining free space you could abort, allocate a new chunck, and redo the generation there.

dragontamer5788 · Post by **dragontamer5788** » Fri Jul 12, 2019 12:04 am

hgm wrote: ↑Thu Jul 11, 2019 11:44 pm I don't know in which order the threads will visit the tree nodes, but if it is anything like a recursive walk of a sub-tree, it would be faster and more compact to put the moves on a stack.

Bingo. There will absolutely be a local stack for this reason. For each physical thread, there will be a thread-private stack. So from a single-threaded perspective, everything is going to be pushing / popping from this local stack.

----------

However: private stacks per-thread means that Thread#1 might get all the work, while Thread#16384 will get no work at all. Ex: Thread#1 gets the work AlphaBeta(root-node). Without sharing data, threads #2 through #16384 won't ever be able to "help out". These stacks must communicate through some kind of global data-structure.

My current plan is to have a linked list as the global data-structure. There will be 8-linked lists (with spinlocks for synchronization), for DepthRemaining0, DepthRemaining1, DepthRemaining2... DepthRemaining7+ nodes. I was considering a concurrent GPU queue but alas, that looks like way more effort and 8+ linked lists would effectively perform the same task.

So when Thread#1 "runs out of space" (which should be quickly: the local stack will be purposefully small), Thread#1 will offload some of its tasks to the global array[8] of linked list. Other tasks will prefer to "help out" by choosing the lowest depth (ex: Depth 0) tasks first, but they'll grab Depth7+ if work runs out.

By keeping the local-stack small, I will ensure good balancing of work across all 16384 physical SIMD-threads of my GPU. However, making the local-stack bigger will reduce the amount of global-synchronization (spinlocks are very, very slow on a GPU). I'll likely have to tune the size for performance.

If you divide up memory in chuncks of, say, 512 moves, you could grow a stack in each of those, and only allocate a new chunck when there is 'serious danger' that the remaining empty space in the chunck is not enough to hold the list. The number of moves will never incrase from 32 to 200 in a single move; it must be easy to guess an upper limit for the number of moves in a (grand-)daughter with high confidence, e.g. by adding 16. So if you currently have 48 moves, you allocate a new 512-move chunck if the remaining space was less than 64, and otherwise just grow the stack in the current chunck. If in 0.1% of the cases you would overrun the remaining free space you could abort, allocate a new chunck, and redo the generation there.

I've got a fixed-size SIMD-malloc routine figured out for GPUs based off of bitmasks. It turns out that when you have 16384 threads, you can perform a parallel malloc very, very quickly by scanning the bitmask array.

Well, I'm kidding a little bit. But not really. The wavefront size on AMD GPUs is 64 threads (at a minimum, all my code has 64x SIMD threads at any given time). These 64-threads can read/write 32-bits at a time. We're talking 2048-bits scanned per clock-tick per wavefront (with 256 wavefronts working in parallel on my particular GPU). Synchronization is slow, but Atomic operations (such as AtomicAnd and AtomicOr) are very fast on a GPU. Protect the malloc/free routines with a memory-barrier (for proper synchronization) and bam, parallel GPU-efficient global fixed size mallocs.

Mallocs are made slightly more efficient by giving the wavefronts a random starting location every time they call malloc(), so there's less synchronization issues (You don't want all 256x parallel wavefronts to be hitting the starting memory location at the same time, every time).

1-million chunks x 64-bytes per chunk (32-moves per chunk) == 64MB heap. The bitmask array is only 128kB (aka: 1-million bits). The wavefront steps through the entire bitmask in just 512 clock ticks (worst-case). Because its a "wavefront malloc", I'm allocating 64x chunks per scan. So the code benefits from the efficiency of scale.

dragontamer5788 · Post by **dragontamer5788** » Fri Jul 12, 2019 5:47 pm

The wavefront steps through the entire bitmask in just 512 clock ticks (worst-case).

Uggghhhh... I meant 512 loop-iterations. The loop is 512-iterations worst case. The real cost is the memory barriers at the beginning and end of the routine, because memory-barriers on AMD Vega are implemented by flushing your entire L1 cache. So any memory-synchronization just eradicates your performance.

Obviously, each loop iteration takes more than 1-clock tick to execute. But scanning a bitmask 2048 bits at a time is roughly popcount(current_element), and prefix_sum(current_element).

The scanning of the bits is very simple:

Code: Select all

val = __popc(bitmask[iter + threadIdx.x]);

// Prefix sum across the wavefront is super-efficient
val += __shfl_up (val, 1)
val += __shfl_up (val, 2)
val += __shfl_up (val, 4)
val += __shfl_up (val, 8)
val += __shfl_up (val, 16) // last step on NVidia: 32th threads per wavefront
val += __shfl_up (val, 32) // last step on AMD: 64th thread has sum of all val

// "val" now contains the number of bits set in the bitmask your thread has + all of the bits the threads to your left have.

So it is quite efficient to see how many bits in your 2048-bit window (AMD) has. The __shfl_up instruction compiles down into a single GPU-assembly instruction and is quite efficient in practice. Adds and Popcount both execute in one clock.

For more details: https://devblogs.nvidia.com/faster-para ... ns-kepler/

-------

To make it parallel / atomic, local_bitmask = AtomicExch(bitmask[iter + threadIdx.x], -1) allows for multiple-wavefronts to interact with the global bitmask. Memory-barriers are needed to ensure proper ordering. The last portion of the algorithm is transforming bits into pointers, but LS1B + Count_leading_Zeros gets you there.

alessandro · Post by **alessandro** » Thu Aug 08, 2019 10:26 pm

Dann Corbit wrote: ↑Tue Jun 11, 2019 11:10 pm Here is a fun one for you, a position with 205 best moves:
[d]3Q4/R6Q/2Q4Q/4Q3/1Q4Q1/3Q4/Q4RNp/2BK1Nkq w - -
The best moves are:
Ba3 Bb2 Bd2 Be3 Bf4 Bg5 Kc2 Kd2 Ke1 Ke2 Ne1+ Nf4+ Nfe3 Ng3 Nge3+ Nh4+ Nxh2 Q3d4 Q3d5 Q3d6 Q3d7 Q6g7 Q7g6 Q7g7 Q8d4 Q8d5 Q8d6 Q8d7 Qaa1 Qaa3 Qaa4 Qaa5 Qaa6 Qab1 Qab2 Qab3 Qac2 Qac4 Qad2 Qad5 Qae2 Qae6 Qaf7 Qag8 Qba3 Qba4 Qba5 Qbb1 Qbb2 Qbb3 Qbb5 Qbb6 Qbb7 Qbb8 Qbc3 Qbc4 Qbc5 Qbd2 Qbd4 Qbd6 Qbe1 Qbe4 Qbe7 Qbf4 Qbf8 Qca4 Qca6 Qca8 Qcb5 Qcb6 Qcb7 Qcc2 Qcc3 Qcc4 Qcc5 Qcc7 Qcc8 Qcd5 Qcd6 Qcd7 Qce4 Qce6 Qce8 Qcf3 Qcf6 Qcg6 Qda3 Qda5 Qda6 Qda8 Qdb1 Qdb3 Qdb5 Qdb6 Qdb8 Qdc2 Qdc3 Qdc4 Qdc7 Qdc8 Qdd2 Qde2 Qde3 Qde4 Qde7 Qde8 Qdf3 Qdf5 Qdf6 Qdf8 Qdg3 Qdg5 Qdg6 Qdg8 Qdh3 Qdh4 Qdh8 Qea1 Qea5 Qeb2 Qeb5 Qeb8 Qec3 Qec5 Qec7 Qed4 Qed5 Qed6 Qee1 Qee2 Qee3 Qee4 Qee6 Qee7 Qee8 Qef4 Qef5 Qef6 Qeg3 Qeg5 Qeg7 Qeh5 Qeh8 Qexh2+ Qgc4 Qgc8 Qgd4 Qgd7 Qge2 Qge4 Qge6 Qgf3 Qgf4 Qgf5 Qgg3 Qgg5 Qgg6 Qgg7 Qgg8 Qgh3 Qgh4 Qgh5 Qh6g6 Qhb7 Qhc7 Qhd2 Qhd6 Qhd7 Qhe3 Qhe4 Qhe6 Qhe7 Qhf4 Qhf5 Qhf6 Qhf7 Qhf8 Qhg5 Qhg8 Qhh3 Qhh4 Qhh5 Qhh8 Qhxh2+ Ra3 Ra4 Ra5 Ra6 Ra8 Raf7 Rb2 Rb7 Rc2 Rc7 Rd2 Rd7 Re2 Re7 Rf3 Rf4 Rf5 Rf6 Rf8 Rff7 Rg7
and all of them result in mate in 2.
One example:
Nge3+ Qg2 Qhxh2#

Hi guys sorry if I am out of topic but this FEN has 206 legal moves and not 205 as reported.
The missing one is Nd2

Please verify

Sven · Post by **Sven** » Fri Aug 09, 2019 12:31 am

alessandro wrote: ↑Thu Aug 08, 2019 10:26 pm
Dann Corbit wrote: ↑Tue Jun 11, 2019 11:10 pm Here is a fun one for you, a position with 205 best moves:
[d]3Q4/R6Q/2Q4Q/4Q3/1Q4Q1/3Q4/Q4RNp/2BK1Nkq w - -
The best moves are:
Ba3 Bb2 Bd2 Be3 Bf4 Bg5 Kc2 Kd2 Ke1 Ke2 Ne1+ Nf4+ Nfe3 Ng3 Nge3+ Nh4+ Nxh2 Q3d4 Q3d5 Q3d6 Q3d7 Q6g7 Q7g6 Q7g7 Q8d4 Q8d5 Q8d6 Q8d7 Qaa1 Qaa3 Qaa4 Qaa5 Qaa6 Qab1 Qab2 Qab3 Qac2 Qac4 Qad2 Qad5 Qae2 Qae6 Qaf7 Qag8 Qba3 Qba4 Qba5 Qbb1 Qbb2 Qbb3 Qbb5 Qbb6 Qbb7 Qbb8 Qbc3 Qbc4 Qbc5 Qbd2 Qbd4 Qbd6 Qbe1 Qbe4 Qbe7 Qbf4 Qbf8 Qca4 Qca6 Qca8 Qcb5 Qcb6 Qcb7 Qcc2 Qcc3 Qcc4 Qcc5 Qcc7 Qcc8 Qcd5 Qcd6 Qcd7 Qce4 Qce6 Qce8 Qcf3 Qcf6 Qcg6 Qda3 Qda5 Qda6 Qda8 Qdb1 Qdb3 Qdb5 Qdb6 Qdb8 Qdc2 Qdc3 Qdc4 Qdc7 Qdc8 Qdd2 Qde2 Qde3 Qde4 Qde7 Qde8 Qdf3 Qdf5 Qdf6 Qdf8 Qdg3 Qdg5 Qdg6 Qdg8 Qdh3 Qdh4 Qdh8 Qea1 Qea5 Qeb2 Qeb5 Qeb8 Qec3 Qec5 Qec7 Qed4 Qed5 Qed6 Qee1 Qee2 Qee3 Qee4 Qee6 Qee7 Qee8 Qef4 Qef5 Qef6 Qeg3 Qeg5 Qeg7 Qeh5 Qeh8 Qexh2+ Qgc4 Qgc8 Qgd4 Qgd7 Qge2 Qge4 Qge6 Qgf3 Qgf4 Qgf5 Qgg3 Qgg5 Qgg6 Qgg7 Qgg8 Qgh3 Qgh4 Qgh5 Qh6g6 Qhb7 Qhc7 Qhd2 Qhd6 Qhd7 Qhe3 Qhe4 Qhe6 Qhe7 Qhf4 Qhf5 Qhf6 Qhf7 Qhf8 Qhg5 Qhg8 Qhh3 Qhh4 Qhh5 Qhh8 Qhxh2+ Ra3 Ra4 Ra5 Ra6 Ra8 Raf7 Rb2 Rb7 Rc2 Rc7 Rd2 Rd7 Re2 Re7 Rf3 Rf4 Rf5 Rf6 Rf8 Rff7 Rg7
and all of them result in mate in 2.
One example:
Nge3+ Qg2 Qhxh2#
Hi guys sorry if I am out of topic but this FEN has 206 legal moves and not 205 as reported.
The missing one is Nd2

Please verify

The number of 206 legal moves is correct. However, the position was not announced as having 205 legal moves but as having 205 best moves, where each single one results in a mate in 2. Nd2 is an exception since it "only" results in a mate in 3.

max number of pseudo legal move

Re: max number of pseudo legal move

Re: Maximum number of pseudo legal moves.

Re: Maximum number of pseudo legal moves.

Re: max number of pseudo legal move

Re: max number of pseudo legal move

Re: max number of pseudo legal move

Re: max number of pseudo legal move

Re: max number of pseudo legal move

Re: max number of pseudo legal move

Re: max number of pseudo legal move