max number of pseudo legal move

Discussion of chess software programming and technical issues.

MikeB
Posts: 3158
Joined: Thu Mar 09, 2006 5:34 am
Location: Pen Argyl, Pennsylvania

Re: max number of pseudo legal move

Post by MikeB » Thu Jun 13, 2019 4:50 pm

Uri Blass wrote:
Tue Jun 11, 2019 11:47 pm
JohnWoe wrote:
Tue Jun 11, 2019 4:50 pm
Logged Sapeli's maximum moves from standard games against Fairy:
https://pastebin.com/HU2CQvZm

79 moves is the maximum currently.
I think 128 is plenty for gameplay. For analyzing 256 is bullet proof.
I will be surprised if 79 moves is the maximum for human games.

There are many games between children where the weaker side does not resign and the stronger side prefers to promote many pawns to queens instead of delivering mate quickly, and I am sure that some of them have more than 79 moves.
Hahah, obviously you are correct - but the reference was made to a higher level of play - not children's games.

User avatar
Ajedrecista
Posts: 1385
Joined: Wed Jul 13, 2011 7:04 pm
Location: Madrid, Spain.
Contact:

Re: Maximum number of pseudo legal moves.

Post by Ajedrecista » Thu Jun 13, 2019 5:37 pm

Hello Roland:
Roland Chastain wrote:[...]

By the way, the official JetChess web page no longer exists. Would you know another place where one can download it?
The web is still reachable with the help of Wayback Machine:

https://web.archive.org/web/20171119065 ... e/jetchess

The direct link of the download works in the same way (it is a ZIP of 480.2 KB):

https://web.archive.org/web/20150402020 ... .0.0.0.zip

Regards from Spain.

Ajedrecista.

User avatar
Roland Chastain
Posts: 177
Joined: Sat Jun 08, 2013 8:07 am
Location: Dakar (Senegal)
Full name: Roland Chastain
Contact:

Re: Maximum number of pseudo legal moves.

Post by Roland Chastain » Thu Jun 13, 2019 6:48 pm

Ajedrecista wrote:
Thu Jun 13, 2019 5:37 pm
The web is still reachable with the help of Wayback Machine:

https://web.archive.org/web/20171119065 ... e/jetchess

The direct link of the download works in the same way (it is a ZIP of 480.2 KB):

https://web.archive.org/web/20150402020 ... .0.0.0.zip
Hello Ajedrecista. Thank you very much.

Best regards.

dragontamer5788
Posts: 43
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

Re: max number of pseudo legal move

Post by dragontamer5788 » Thu Jul 11, 2019 5:32 pm

It seems like memory is a very finite resource in the GPU-world, especially when you have 16,000+ threads of execution. And since my model is going to have hundreds-of-thousands, maybe millions, of tasks... saving 64-bytes or 128-bytes here and there can add up to hundreds-of-MB of memory being saved.

I don't know if anyone else has come across this problem yet, but here's my planned solution for bridging the gap between the "average case" ~35 children and the "worst case" ~200+ children per node.

Code: Select all

#include <stdint.h>
#include <stdlib.h>

typedef struct {
	uint16_t moves[32];
} MoveListChunk;

typedef struct {
	uint8_t capacity;         // note: a full 256 moves would overflow uint8_t
	uint8_t curMove;
	MoveListChunk* chunks[8];
	MoveListChunk firstChunk; // First chunk is stored inline, because every move list has at least 1 move
} MoveList;

// Example initialization

MoveList ml;
ml.capacity = CountMoves(position);
ml.curMove = 0;
ml.chunks[0] = &ml.firstChunk;
uint32_t allocedCapacity = 32;
while(allocedCapacity < ml.capacity){
	ml.chunks[allocedCapacity / 32] = malloc(sizeof(MoveListChunk));
	allocedCapacity += 32;
}
// No need to initialize the later pointers.

for(int i=0; i<ml.capacity; i++){
	ml.chunks[i/32]->moves[i%32] = GenerateMove(position, i);
}

// To consume
for(int i=0; i<ml.capacity; i++){
	Position nextPos = MakeMove(ml.chunks[i/32]->moves[i%32]);
	foobar(nextPos);
	UnmakeMove();
}

// To free (the inline first chunk is skipped)
for(int i=1; i*32<ml.capacity; i++){
	free(ml.chunks[i]);
}
The minimum-size move list (holding 32 or fewer moves) uses 130 bytes of fields (a bit more with struct padding). Each additional chunk adds 64 bytes. Most nodes in practice will therefore use 130 or 194 bytes, but the chunk pointers allow growth all the way to 256 moves (where the structure uses 578 bytes).

In contrast, a typical uint16_t moves[256] array (plus uint8_t capacity and uint8_t curIndex) always uses 514 bytes. We basically add 64 bytes to the structure for the (limited) ability to resize dynamically for the workload. Division by 32 and modulo 32 are both cheap (they compile to a bit shift and a bit mask), so I don't expect much overhead either.
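The byte counts above can be sanity-checked with a tiny helper (the function is mine, not part of the engine; it counts raw field bytes and ignores struct padding):

```c
#include <assert.h>
#include <stdint.h>

// Bytes of fields in the chunked MoveList for n moves:
// 1 (capacity) + 1 (curMove) + 8 pointers of 8 bytes each + one
// 64-byte chunk per 32 moves (the first chunk is stored inline).
static uint32_t movelist_bytes(uint32_t nMoves)
{
    uint32_t chunks = (nMoves + 31) / 32;
    return 1 + 1 + 8 * 8 + 64 * chunks;
}
```

For example, `movelist_bytes(32)` gives 130, `movelist_bytes(64)` gives 194, `movelist_bytes(256)` gives 578, while the flat alternative is a constant `2 + 256 * sizeof(uint16_t)` = 514 bytes.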

This code has not been debugged or tested, but I think you'd get the gist from reading it. Unlike a linked list, all moves are available from just a single pointer dereference. All you guys in the CPU world probably have more than enough RAM to do your job. This is more of an issue when you're storing hundreds of thousands of move lists for parallel execution... saving those 64 bytes here and there really does add up to a lot of RAM usage.

1-Million tasks x 194 bytes (64 moves) == 194MB for movelists alone. If I used uint16_t moves[256] arrays instead, it would be 514 MB on movelists... the ~300MB difference could go to the TT instead. I don't think that other people expect to be holding ~1-million movelists in memory like my code, but that's how I plan to solve this particular problem.

Dann Corbit
Posts: 9771
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: max number of pseudo legal move

Post by Dann Corbit » Thu Jul 11, 2019 9:25 pm

These two positions are interesting because they both have 205 checkmates in 2 (every single legal move is a mate in 2):
QKn1kb2/Pnr4q/4q3/1q4q1/3q4/q4q2/7r/2q1q3 b - -
2b2nKQ/q4rnP/1k1q4/1q4q1/4q3/2q4q/r4q2/3q4 b - -

And this one has 218 legal moves:
3Q4/1Q4Q1/4Q3/2Q4R/Q4Q2/3Q4/1Q4Rp/1K1BBNNk w - -

Nobody has ever invented a position that has more than 218 legal moves {so far as I know}. That does not mean it is impossible, but it seems that 256 is probably overkill.

I guess the question is, would you want your program to handle these three positions?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

User avatar
hgm
Posts: 23296
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: max number of pseudo legal move

Post by hgm » Thu Jul 11, 2019 9:44 pm

I don't know in which order the threads will visit the tree nodes, but if it is anything like a recursive walk of a sub-tree, it would be faster and more compact to put the moves on a stack. Then the list for each node can directly follow that of its parent without wasting any space. And it will be discarded by the time you get back to the level of the parent, so that you are free to expand the list there as much as you like.

If you divide memory up into chunks of, say, 512 moves, you could grow a stack in each of those, and only allocate a new chunk when there is 'serious danger' that the remaining empty space in the chunk is not enough to hold the list. The number of moves will never increase from 32 to 200 in a single move; it must be easy to guess an upper limit for the number of moves in a (grand-)daughter with high confidence, e.g. by adding 16. So if you currently have 48 moves, you allocate a new 512-move chunk if the remaining space is less than 64, and otherwise just grow the stack in the current chunk. If in the 0.1% of cases you would overrun the remaining free space, you could abort, allocate a new chunk, and redo the generation there.
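A minimal CPU-side sketch of this chunked move stack (the 512-move chunks and the +16 safety margin are from the post above; the struct layout and names are my own guesses, and the rare overrun-and-redo path is left out):

```c
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_MOVES 512  /* moves per stack chunk                  */
#define MARGIN       16  /* assumed upper bound on growth per ply  */

typedef uint16_t Move;

typedef struct Chunk {
    Move moves[CHUNK_MOVES];
    struct Chunk *prev;  /* previous chunk, popped when unwinding  */
} Chunk;

typedef struct {
    Chunk *cur;          /* chunk holding the top of the stack     */
    int    top;          /* first free slot in cur->moves          */
} MoveStack;

/* Reserve space for a daughter's move list. If the current node has
 * curMoves moves, the daughter is assumed to have at most
 * curMoves + MARGIN; start a fresh chunk when that might not fit. */
static Move *reserve_list(MoveStack *s, int curMoves)
{
    if (s->top + curMoves + MARGIN > CHUNK_MOVES) {
        Chunk *c = malloc(sizeof(Chunk));
        c->prev = s->cur;
        s->cur  = c;
        s->top  = 0;
    }
    return &s->cur->moves[s->top];
}
```

On the way back up, the search would restore `top` (and `cur`, freeing or recycling any newer chunks) to the values saved at the parent, so sibling lists reuse the same space.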

dragontamer5788
Posts: 43
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

Re: max number of pseudo legal move

Post by dragontamer5788 » Thu Jul 11, 2019 10:04 pm

hgm wrote:
Thu Jul 11, 2019 9:44 pm
I don't know in which order the threads will visit the tree nodes, but if it is anything like a recursive walk of a sub-tree, it would be faster and more compact to put the moves on a stack.
Bingo. There will absolutely be a local stack for this reason. For each physical thread, there will be a thread-private stack. So from a single-threaded perspective, everything is going to be pushing / popping from this local stack.

----------

However: private stacks per-thread means that Thread#1 might get all the work, while Thread#16384 will get no work at all. Ex: Thread#1 gets the work AlphaBeta(root-node). Without sharing data, threads #2 through #16384 won't ever be able to "help out". These stacks must communicate through some kind of global data-structure.

My current plan is to use linked lists as the global data structure: 8 of them (with spinlocks for synchronization), for DepthRemaining0, DepthRemaining1, DepthRemaining2... DepthRemaining7+ nodes. I was considering a concurrent GPU queue, but alas, that looks like way more effort, and 8 linked lists would effectively perform the same task.

So when Thread#1 "runs out of space" (which should happen quickly: the local stack will be purposefully small), Thread#1 will offload some of its tasks to the global array[8] of linked lists. Other threads will prefer to "help out" by choosing the lowest-depth (ex: Depth 0) tasks first, but they'll grab Depth7+ if work runs out.

By keeping the local-stack small, I will ensure good balancing of work across all 16384 physical SIMD-threads of my GPU. However, making the local-stack bigger will reduce the amount of global-synchronization (spinlocks are very, very slow on a GPU). I'll likely have to tune the size for performance.
hgm wrote:
Thu Jul 11, 2019 9:44 pm
If you divide memory up into chunks of, say, 512 moves, you could grow a stack in each of those, and only allocate a new chunk when there is 'serious danger' that the remaining empty space in the chunk is not enough to hold the list. The number of moves will never increase from 32 to 200 in a single move; it must be easy to guess an upper limit for the number of moves in a (grand-)daughter with high confidence, e.g. by adding 16. So if you currently have 48 moves, you allocate a new 512-move chunk if the remaining space is less than 64, and otherwise just grow the stack in the current chunk. If in the 0.1% of cases you would overrun the remaining free space, you could abort, allocate a new chunk, and redo the generation there.
I've got a fixed-size SIMD-malloc routine figured out for GPUs based off of bitmasks. It turns out that when you have 16384 threads, you can perform a parallel malloc very, very quickly by scanning the bitmask array. :D :D

Well, I'm kidding a little bit. But not really. The wavefront size on AMD GPUs is 64 threads (at a minimum, all my code has 64x SIMD threads at any given time). These 64-threads can read/write 32-bits at a time. We're talking 2048-bits scanned per clock-tick per wavefront (with 256 wavefronts working in parallel on my particular GPU). Synchronization is slow, but Atomic operations (such as AtomicAnd and AtomicOr) are very fast on a GPU. Protect the malloc/free routines with a memory-barrier (for proper synchronization) and bam, parallel GPU-efficient global fixed size mallocs.

Mallocs are made slightly more efficient by giving the wavefronts a random starting location every time they call malloc(), so there are fewer synchronization issues (you don't want all 256 parallel wavefronts hitting the same starting memory location every time).

1-million chunks x 64-bytes per chunk (32 moves per chunk) == 64MB heap. The bitmask array is only 128kB (aka: 1-million bits). The wavefront steps through the entire bitmask in just 512 clock ticks (worst-case). Because it's a "wavefront malloc", I'm allocating 64 chunks per scan, so the code benefits from economies of scale.
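Stripped of the wavefront parallelism and atomics, the core of such a bitmask allocator might look like this single-threaded analogue (the 1M x 64-byte sizing matches the post; the sequential scan and the names are mine):

```c
#include <stdint.h>

#define NCHUNKS 1048576u             /* 1M chunks x 64 bytes = 64MB heap */
#define NWORDS  (NCHUNKS / 64)       /* bitmask: 1M bits = 128kB         */

static uint64_t freeMask[NWORDS];    /* bit set = chunk is free          */

static void heap_init(void)
{
    for (uint32_t i = 0; i < NWORDS; i++)
        freeMask[i] = ~(uint64_t)0;
}

/* Scan for a set bit, clear it, and return the chunk index (-1 if full).
 * `start` plays the role of the random starting location that spreads
 * wavefronts across the mask; the chunk's address would then be
 * heapBase + 64 * index.  __builtin_ctzll is the GCC/Clang LS1B count. */
static int64_t chunk_alloc(uint32_t start)
{
    for (uint32_t n = 0; n < NWORDS; n++) {
        uint32_t w = (start + n) % NWORDS;
        if (freeMask[w]) {
            int bit = __builtin_ctzll(freeMask[w]);
            freeMask[w] &= freeMask[w] - 1;   /* clear lowest set bit */
            return (int64_t)w * 64 + bit;
        }
    }
    return -1;
}

static void chunk_free(int64_t idx)
{
    freeMask[idx / 64] |= (uint64_t)1 << (idx % 64);
}
```

The GPU version replaces the per-word test with the popcount/prefix-sum scan described in the next post, and the read-modify-write with atomics.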

dragontamer5788
Posts: 43
Joined: Thu Jun 06, 2019 6:05 pm
Full name: Percival Tiglao

Re: max number of pseudo legal move

Post by dragontamer5788 » Fri Jul 12, 2019 3:47 pm

The wavefront steps through the entire bitmask in just 512 clock ticks (worst-case).
Uggghhhh... I meant 512 loop-iterations. The loop is 512-iterations worst case. The real cost is the memory barriers at the beginning and end of the routine, because memory-barriers on AMD Vega are implemented by flushing your entire L1 cache. So any memory-synchronization just eradicates your performance.

Obviously, each loop iteration takes more than 1 clock tick to execute. But scanning the bitmask 2048 bits at a time is roughly a popcount(current_element) followed by a prefix_sum across the wavefront.

The scanning of the bits is very simple:

Code: Select all

val = __popc(bitmask[iter + threadIdx.x]);

// Inclusive prefix sum across the wavefront via shuffles. The lane
// check keeps low lanes from adding their own value back in (an
// out-of-range __shfl_up returns the caller's own value).
int lane = threadIdx.x & 63;  // lane index in a 64-wide AMD wavefront
for (int delta = 1; delta <= 32; delta *= 2) {  // NVidia's 32-wide warps stop at delta = 16
    int other = __shfl_up(val, delta);
    if (lane >= delta)
        val += other;
}

// "val" now contains the number of bits set in this thread's word of
// the bitmask, plus all of the bits in the threads to its left.
So it is quite efficient to count how many bits are set in your 2048-bit window (on AMD). The __shfl_up instruction compiles down to a single GPU assembly instruction and is quite efficient in practice; adds and popcounts both execute in one clock.

For more details: https://devblogs.nvidia.com/faster-para ... ns-kepler/

-------

To make it parallel / atomic, local_bitmask = AtomicExch(bitmask[iter + threadIdx.x], -1) allows multiple wavefronts to interact with the global bitmask. Memory barriers are needed to ensure proper ordering. The last portion of the algorithm is transforming bits into pointers, but LS1B + Count_leading_Zeros gets you there.
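That last bits-to-pointers step, done sequentially for one grabbed word, could look like this (CHUNK_BYTES, the helper name, and the use of ctz for the LS1B index are my choices for illustration):

```c
#include <stdint.h>
#include <stddef.h>

#define CHUNK_BYTES 64

/* Map the k-th set bit of a bitmask word grabbed via AtomicExch to a
 * heap address: chunk index = 64 * wordIdx + bit position of the LS1B,
 * extracted with __builtin_ctzll (GCC/Clang).  Returns NULL when the
 * word holds fewer than k+1 set bits. */
static void *kth_free_chunk(uint8_t *heap, uint32_t wordIdx,
                            uint64_t grabbed, int k)
{
    while (grabbed) {
        int bit = __builtin_ctzll(grabbed);
        if (k-- == 0)
            return heap + ((size_t)wordIdx * 64 + bit) * CHUNK_BYTES;
        grabbed &= grabbed - 1;   /* drop the lowest set bit */
    }
    return NULL;
}
```

In the wavefront version, each thread's `k` would come from the prefix sum computed above, so the 64 threads peel off distinct chunks from the same grabbed word.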
