Possible Board Presentation and Move Generation for GPUs?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2644
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Possible Board Presentation and Move Generation for GPUs?

Post by smatovic »

Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design :-(

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Possible Board Presentation and Move Generation for GPUs

Post by Karlo Bala »

smatovic wrote:Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design :-(

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja
I do not know much about GPU programming, but my intuition says that kogge-stone could be right approach for move generation
Best Regards,
Karlo Balla Jr.
NaltaP312
Posts: 56
Joined: Wed Oct 29, 2008 1:06 pm
Full name: Marc Paule

Re: Possible Board Presentation and Move Generation for GPUs

Post by NaltaP312 »

I think you can use Kindergarten bitboards too.
you have a lot of information here
http://chessprogramming.wikispaces.com/Bitboards
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Possible Board Presentation and Move Generation for GPUs

Post by Evert »

smatovic wrote: I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.
On the other hand, you then have a somewhat large amount of data that needs to be accessed efficiently by the different threads.
I suspect that is not ideal either. Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.

Having said that, the sort of parallelism that OpenCL (or GPUs in general) is good for is probably not the sort of parallelism you need for chess... so while it's probably an interesting concept and proof-of-concept, I suspect a GPU is not the idea hardware platform for chess.
smatovic
Posts: 2644
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Possible Board Presentation and Move Generation for GPUs

Post by smatovic »

On the other hand, you then have a somewhat large amount of data that needs to be accessed efficiently by the different threads.
I suspect that is not ideal either.
Right, i have to use slow global memory with the magic bitboard approach.
Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).
I suspect a GPU is not the idea hardware platform for chess.
At least there are a lot of restrictions which make the implementation difficult.

--
Srdja
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Possible Board Presentation and Move Generation for GPUs

Post by Evert »

smatovic wrote:
Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).
In that case, partitioning the question may be the best way to go: ask for half the board first, then for the second half. Or, alternatively, it may be possible to be a bit more clever and only ask for squares where you know there's a piece to squares that are a possible target (the squares that can be reached by a "superpiece" that moves as a queen or a knight). That should probably be about 16x32 at most (less in practice), which might still require partitioning the search space in two queries.
Of course, whether that pays off or not depends on how efficient you can do this pre-processing.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Possible Board Presentation and Move Generation for GPUs

Post by jwes »

smatovic wrote:
Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).
Perhaps some more explicitly parallel algorithms could work, say allocating 8 threads for each piece, one for each possible direction, and generating all moves in parallel with a mailbox approach, or using rotated bitmaps and updating them in parallel.
smatovic
Posts: 2644
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Possible Board Presentation and Move Generation for GPUs

Post by smatovic »

Perhaps some more explicitly parallel algorithms could work, say allocating 8 threads for each piece, one for each possible direction, and generating all moves in parallel with a mailbox approach, or using rotated bitmaps and updating them in parallel.
Yes, this is some kind of parallel move generation i think about -> direction-wise threading.

--
Srdja
steffan
Posts: 28
Joined: Sun Mar 12, 2006 12:08 am
Location: Midlands, England

Re: Possible Board Presentation and Move Generation for GPUs

Post by steffan »

smatovic wrote:Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design :-(

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja
I'd strongly recommend you take a look a look at my method inspired by Kogge-Stone parallel prefix networks, as I explicitly had hardware implementation in mind when I devised it. The software implementation I wrote was actually a by-product of this work.

http://chessprogramming.wikispaces.com/ ... +Algorithm

With appropriate instantiation of networks for all pieces and all directions, you should have a 1-1 mapping from specific output bits to moves for rooks, bishops and queens. The aim of the Kogge-Stone network is to reduce logic depth and gate count. The logic functions for the other piece types are very much simpler and have low logic depth. All of these functions operate in parallel, giving very good move generation speed when measured in clock cycles.

A special mention should go to Gerd Isenberg for kindly writing up this page on the Chess Prgramming Wiki - Thank you Gerd :)

Cheers,
Steffan
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Possible Board Presentation and Move Generation for GPUs

Post by Gerd Isenberg »

steffan wrote:
smatovic wrote:Dear Chess Programmers,

i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?

Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.

Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.

What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.

OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design :-(

Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?

--
Srdja
I'd strongly recommend you take a look a look at my method inspired by Kogge-Stone parallel prefix networks, as I explicitly had hardware implementation in mind when I devised it. The software implementation I wrote was actually a by-product of this work.

http://chessprogramming.wikispaces.com/ ... +Algorithm

With appropriate instantiation of networks for all pieces and all directions, you should have a 1-1 mapping from specific output bits to moves for rooks, bishops and queens. The aim of the Kogge-Stone network is to reduce logic depth and gate count. The logic functions for the other piece types are very much simpler and have low logic depth. All of these functions operate in parallel, giving very good move generation speed when measured in clock cycles.

A special mention should go to Gerd Isenberg for kindly writing up this page on the Chess Prgramming Wiki - Thank you Gerd :)

Cheers,
Steffan
Thank you for your kind words, Steffan. Almost your stuff. Some more background on Kogge-Stone is mentioned on Parallel Prefix Algorithms, also the comparison with byte-wise SWAR-add and sub. I actually perform Kogge-Stone with quad-bitboards, hopefully soon using 256-bit ymm-registers. Which is like performing up to 15 disjoint piece bitboards in one run. No idea with GPUs. Do they have 64-bit shifts with 128-bit SIMD vectors? If yes, I guess Kogge-Stone is the way to go on these architectures. No lookups and branches, each direction might be processed independently.

Cheers,
Gerd