Dear Chess Programmers,
i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?
Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.
Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.
What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.
OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design
Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?
--
Srdja
Possible Board Presentation and Move Generation for GPUs?
Moderators: hgm, Rebel, chrisw
-
- Posts: 2644
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
-
- Posts: 373
- Joined: Wed Mar 22, 2006 10:17 am
- Location: Novi Sad, Serbia
- Full name: Karlo Balla
Re: Possible Board Presentation and Move Generation for GPUs
I do not know much about GPU programming, but my intuition says that kogge-stone could be right approach for move generationsmatovic wrote:Dear Chess Programmers,
i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?
Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.
Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.
What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.
OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design
Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?
--
Srdja
Best Regards,
Karlo Balla Jr.
Karlo Balla Jr.
-
- Posts: 56
- Joined: Wed Oct 29, 2008 1:06 pm
- Full name: Marc Paule
Re: Possible Board Presentation and Move Generation for GPUs
I think you can use Kindergarten bitboards too.
you have a lot of information here
http://chessprogramming.wikispaces.com/Bitboards
you have a lot of information here
http://chessprogramming.wikispaces.com/Bitboards
-
- Posts: 2929
- Joined: Sat Jan 22, 2011 12:42 am
- Location: NL
Re: Possible Board Presentation and Move Generation for GPUs
On the other hand, you then have a somewhat large amount of data that needs to be accessed efficiently by the different threads.smatovic wrote: I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.
I suspect that is not ideal either. Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
Having said that, the sort of parallelism that OpenCL (or GPUs in general) is good for is probably not the sort of parallelism you need for chess... so while it's probably an interesting concept and proof-of-concept, I suspect a GPU is not the idea hardware platform for chess.
-
- Posts: 2644
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Possible Board Presentation and Move Generation for GPUs
Right, i have to use slow global memory with the magic bitboard approach.On the other hand, you then have a somewhat large amount of data that needs to be accessed efficiently by the different threads.
I suspect that is not ideal either.
This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
At least there are a lot of restrictions which make the implementation difficult.I suspect a GPU is not the idea hardware platform for chess.
--
Srdja
-
- Posts: 2929
- Joined: Sat Jan 22, 2011 12:42 am
- Location: NL
Re: Possible Board Presentation and Move Generation for GPUs
In that case, partitioning the question may be the best way to go: ask for half the board first, then for the second half. Or, alternatively, it may be possible to be a bit more clever and only ask for squares where you know there's a piece to squares that are a possible target (the squares that can be reached by a "superpiece" that moves as a queen or a knight). That should probably be about 16x32 at most (less in practice), which might still require partitioning the search space in two queries.smatovic wrote:This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
Of course, whether that pays off or not depends on how efficient you can do this pre-processing.
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: Possible Board Presentation and Move Generation for GPUs
Perhaps some more explicitly parallel algorithms could work, say allocating 8 threads for each piece, one for each possible direction, and generating all moves in parallel with a mailbox approach, or using rotated bitmaps and updating them in parallel.smatovic wrote:This could be true, unfortunately the max amount of threads working on the same, fast memory space is limited to 256 or 1024 (depends on the OpenCL device).Perhaps the most efficient algorithm is simply to ask 64x64 times whether a piece from square x can move to square y and forget about loops and large table lookups.
-
- Posts: 2644
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Possible Board Presentation and Move Generation for GPUs
Yes, this is some kind of parallel move generation i think about -> direction-wise threading.Perhaps some more explicitly parallel algorithms could work, say allocating 8 threads for each piece, one for each possible direction, and generating all moves in parallel with a mailbox approach, or using rotated bitmaps and updating them in parallel.
--
Srdja
-
- Posts: 28
- Joined: Sun Mar 12, 2006 12:08 am
- Location: Midlands, England
Re: Possible Board Presentation and Move Generation for GPUs
I'd strongly recommend you take a look a look at my method inspired by Kogge-Stone parallel prefix networks, as I explicitly had hardware implementation in mind when I devised it. The software implementation I wrote was actually a by-product of this work.smatovic wrote:Dear Chess Programmers,
i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?
Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.
Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.
What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.
OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design
Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?
--
Srdja
http://chessprogramming.wikispaces.com/ ... +Algorithm
With appropriate instantiation of networks for all pieces and all directions, you should have a 1-1 mapping from specific output bits to moves for rooks, bishops and queens. The aim of the Kogge-Stone network is to reduce logic depth and gate count. The logic functions for the other piece types are very much simpler and have low logic depth. All of these functions operate in parallel, giving very good move generation speed when measured in clock cycles.
A special mention should go to Gerd Isenberg for kindly writing up this page on the Chess Prgramming Wiki - Thank you Gerd
Cheers,
Steffan
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Possible Board Presentation and Move Generation for GPUs
Thank you for your kind words, Steffan. Almost your stuff. Some more background on Kogge-Stone is mentioned on Parallel Prefix Algorithms, also the comparison with byte-wise SWAR-add and sub. I actually perform Kogge-Stone with quad-bitboards, hopefully soon using 256-bit ymm-registers. Which is like performing up to 15 disjoint piece bitboards in one run. No idea with GPUs. Do they have 64-bit shifts with 128-bit SIMD vectors? If yes, I guess Kogge-Stone is the way to go on these architectures. No lookups and branches, each direction might be processed independently.steffan wrote:I'd strongly recommend you take a look a look at my method inspired by Kogge-Stone parallel prefix networks, as I explicitly had hardware implementation in mind when I devised it. The software implementation I wrote was actually a by-product of this work.smatovic wrote:Dear Chess Programmers,
i still try to get a prototype of a chess engine running on a GPU and i am stuck with the first things like Board Presentation and Move Generation. Maybe you got an idea to help me out?
Technic background:
GPUs act as a SIMD device with many threads/processors and compute data in the most efficient matter if it is organized in 16 * 4 * 32 bit. Read/Writes are optimized for 32 bit. 64 bit operations take up to 4x-8x more cycles than 32bit. So a 4*32 bit board-structure would perform best.
Project Background:
I already tried to port a 0x88 movegenerator (micromax) to OpenCL, it worked, but the nested loops are definetly not SIMD friendly. Every thread/process has to wait until the others finish. So i want to use a Magic Bitbord approach, which seems (made some test with dummy data) to be SIMD friendlier because the move generation for the pieces are more similiar.
What i am now looking for is a Board Presentation which is optimized for a 4*32 or 16*4*32 bit structure.
OpenCL also supports VectorDatatypes like int4....the best idea i can come up with is a QuadBitBoard design organized as long4 Vectordatatype...but 64 bit means that the computation will be up to 8x slower than a clear 32 bit design
Any suggestions for a 16*4*32bit board-design with magic bitboards as prefered move generator?
--
Srdja
http://chessprogramming.wikispaces.com/ ... +Algorithm
With appropriate instantiation of networks for all pieces and all directions, you should have a 1-1 mapping from specific output bits to moves for rooks, bishops and queens. The aim of the Kogge-Stone network is to reduce logic depth and gate count. The logic functions for the other piece types are very much simpler and have low logic depth. All of these functions operate in parallel, giving very good move generation speed when measured in clock cycles.
A special mention should go to Gerd Isenberg for kindly writing up this page on the Chess Prgramming Wiki - Thank you Gerd
Cheers,
Steffan
Cheers,
Gerd