Possible Board Presentation and Move Generation for GPUs?

smatovic · Post by **smatovic** » Sat Mar 19, 2011 7:16 pm

Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja

Karlo Bala · Post by **Karlo Bala** » Sat Mar 19, 2011 8:56 pm

smatovic wrote:Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja

I do not know much about GPU programming, but my intuition says that kogge-stone could be right approach for move generation

NaltaP312 · Post by **NaltaP312** » Sat Mar 19, 2011 9:47 pm

I think you can use Kindergarten bitboards too.
you have a lot of information here
http://chessprogramming.wikispaces.com/Bitboards

Evert · Post by **Evert** » Sat Mar 19, 2011 10:33 pm

smatovic wrote: I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

On the other hand, you then have a somewhat large amount of data that needs to be accessed efficiently by the different threads.
I suspect that is not ideal either. Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.

Having said that, the sort of parallelism that OpenCL (or GPUs in general) is good for is probably not the sort of parallelism you need for chess... so while it's probably an interesting concept and proof-of-concept, I suspect a GPU is not the idea hardware platform for chess.

smatovic · Post by **smatovic** » Sun Mar 20, 2011 2:51 am

On the other hand, you then have a somewhat large amount of data that needs to be accessed efficiently by the different threads.
I suspect that is not ideal either.

Right, i have to use slow global memory with the magic bitboard approach.

Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.

This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).

I suspect a GPU is not the idea hardware platform for chess.

At least there are a lot of restrictions which make the implementation difficult.

--
Srdja

Evert · Post by **Evert** » Sun Mar 20, 2011 3:12 am

smatovic wrote:
Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).

In that case, partitioning the question may be the best way to go: ask for half the board first, then for the second half. Or, alternatively, it may be possible to be a bit more clever and only ask for squares where you know there's a piece to squares that are a possible target (the squares that can be reached by a "superpiece" that moves as a queen or a knight). That should probably be about 16x32 at most (less in practice), which might still require partitioning the search space in two queries.
Of course, whether that pays off or not depends on how efficient you can do this pre-processing.

jwes · Post by **jwes** » Sun Mar 20, 2011 7:46 am

smatovic wrote:
Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).

Perhaps some more explicitly parallel algorithms could work, say allocating 8 threads for each piece, one for each possible direction, and generating all moves in parallel with a mailbox approach, or using rotated bitmaps and updating them in parallel.

smatovic · Post by **smatovic** » Sun Mar 20, 2011 11:43 am

Perhaps some more explicitly parallel algorithms could work, say allocating 8 threads for each piece, one for each possible direction, and generating all moves in parallel with a mailbox approach, or using rotated bitmaps and updating them in parallel.

Yes, this is some kind of parallel move generation i think about -> direction-wise threading.

--
Srdja

steffan · Post by **steffan** » Sun Mar 20, 2011 11:59 am

smatovic wrote:Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja

I'd strongly recommend you take a look a look at my method inspired by Kogge-Stone parallel prefix networks, as I explicitly had hardware implementation in mind when I devised it. The software implementation I wrote was actually a by-product of this work.

http://chessprogramming.wikispaces.com/ ... +Algorithm

With appropriate instantiation of networks for all pieces and all directions, you should have a 1-1 mapping from specific output bits to moves for rooks, bishops and queens. The aim of the Kogge-Stone network is to reduce logic depth and gate count. The logic functions for the other piece types are very much simpler and have low logic depth. All of these functions operate in parallel, giving very good move generation speed when measured in clock cycles.

A special mention should go to Gerd Isenberg for kindly writing up this page on the Chess Prgramming Wiki - Thank you Gerd

Cheers,
Steffan

Gerd Isenberg · Post by **Gerd Isenberg** » Sun Mar 20, 2011 12:44 pm

steffan wrote:
smatovic wrote:Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja
I'd strongly recommend you take a look a look at my method inspired by Kogge-Stone parallel prefix networks, as I explicitly had hardware implementation in mind when I devised it. The software implementation I wrote was actually a by-product of this work.

http://chessprogramming.wikispaces.com/ ... +Algorithm

With appropriate instantiation of networks for all pieces and all directions, you should have a 1-1 mapping from specific output bits to moves for rooks, bishops and queens. The aim of the Kogge-Stone network is to reduce logic depth and gate count. The logic functions for the other piece types are very much simpler and have low logic depth. All of these functions operate in parallel, giving very good move generation speed when measured in clock cycles.

A special mention should go to Gerd Isenberg for kindly writing up this page on the Chess Prgramming Wiki - Thank you Gerd

Cheers,
Steffan

Thank you for your kind words, Steffan. Almost your stuff. Some more background on Kogge-Stone is mentioned on Parallel Prefix Algorithms, also the comparison with byte-wise SWAR-add and sub. I actually perform Kogge-Stone with quad-bitboards, hopefully soon using 256-bit ymm-registers. Which is like performing up to 15 disjoint piece bitboards in one run. No idea with GPUs. Do they have 64-bit shifts with 128-bit SIMD vectors? If yes, I guess Kogge-Stone is the way to go on these architectures. No lookups and branches, each direction might be processed independently.

Cheers,
Gerd

Possible Board Presentation and Move Generation for GPUs?

Possible Board Presentation and Move Generation for GPUs?

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs

Re: Possible Board Presentation and Move Generation for GPUs