I am still trying to get a prototype of a chess engine running on a GPU, and I am stuck on the basics: board representation and move generation. Maybe you have an idea that could help me out?
Technical background:
GPUs act as SIMD devices with many threads/processors and process data most efficiently when it is organized as 16 * 4 * 32 bits. Reads and writes are optimized for 32 bits. 64-bit operations take up to 4x-8x more cycles than 32-bit ones. So a 4 * 32-bit board structure would perform best.
Project background:
I already tried porting a 0x88 move generator (micromax) to OpenCL. It worked, but the nested loops are definitely not SIMD friendly: every thread has to wait until the others finish. So I want to use a magic bitboard approach, which seems (I ran some tests with dummy data) to be more SIMD friendly, because move generation is more similar across the pieces.
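To make the 32-bit constraint concrete, here is a minimal sketch in plain C (not OpenCL) of a magic index computed with 32-bit operations only, by folding the two halves of the occupancy together. The struct `Magic32`, its fields, and the function `magic_index32` are hypothetical names for illustration. The magic factors would have to be searched specifically for this folding scheme; factors found for a 64-bit multiply will not work here.

```c
#include <stdint.h>

/* Hypothetical per-square entry, with everything split into 32-bit halves.
   lo = squares 0..31, hi = squares 32..63. */
typedef struct {
    uint32_t mask_lo, mask_hi;   /* relevant occupancy mask for the slider */
    uint32_t magic_lo, magic_hi; /* magic factors (assumed to be found by search) */
    unsigned shift;              /* 32 minus the number of index bits */
} Magic32;

/* Compute the attack-table index using only 32-bit AND, MUL, XOR and SHIFT.
   Unsigned 32-bit multiplication wraps modulo 2^32, which is intended. */
static uint32_t magic_index32(const Magic32 *m, uint32_t occ_lo, uint32_t occ_hi)
{
    uint32_t lo = (occ_lo & m->mask_lo) * m->magic_lo;
    uint32_t hi = (occ_hi & m->mask_hi) * m->magic_hi;
    return (lo ^ hi) >> m->shift;
}
```

Every thread would run the same instruction sequence regardless of piece type or square, with no data-dependent loop, which is the property that looked SIMD friendly in my dummy-data tests.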
What I am looking for now is a board representation that is optimized for a 4 * 32-bit or 16 * 4 * 32-bit structure.
OpenCL also supports vector data types like int4. The best idea I can come up with is a quad-bitboard design organized as a long4 vector data type, but 64 bits means the computation will be up to 8x slower than a pure 32-bit design.
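One possible way around the 64-bit penalty, sketched here in plain C under my own assumptions, would be to split each of the four bit planes into a low and a high 32-bit half, so the board becomes two groups of four 32-bit words (in OpenCL these could perhaps map to two uint4 values). The type `QuadBoard32` and the helpers `qb_set`, `qb_get` and `qb_occupied` are hypothetical names, and the encoding (4-bit piece code per square, 0 meaning empty) is only an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical layout: four bit planes, each split into 32-bit halves.
   lo[i] holds squares 0..31 of plane i, hi[i] holds squares 32..63. */
typedef struct {
    uint32_t lo[4];
    uint32_t hi[4];
} QuadBoard32;

/* Store a 4-bit piece code (0 = empty) on square sq (0..63). */
static void qb_set(QuadBoard32 *b, int sq, unsigned code)
{
    uint32_t *half = (sq < 32) ? b->lo : b->hi;
    uint32_t bit = (uint32_t)1 << (sq & 31);
    for (int i = 0; i < 4; i++) {
        if ((code >> i) & 1u)
            half[i] |= bit;   /* plane i contributes bit i of the code */
        else
            half[i] &= ~bit;
    }
}

/* Read back the 4-bit piece code on square sq. */
static unsigned qb_get(const QuadBoard32 *b, int sq)
{
    const uint32_t *half = (sq < 32) ? b->lo : b->hi;
    unsigned shift = (unsigned)sq & 31u;
    unsigned code = 0;
    for (int i = 0; i < 4; i++)
        code |= ((half[i] >> shift) & 1u) << i;
    return code;
}

/* Occupancy as two 32-bit words: a square is occupied if any plane has its bit set. */
static void qb_occupied(const QuadBoard32 *b, uint32_t *occ_lo, uint32_t *occ_hi)
{
    *occ_lo = b->lo[0] | b->lo[1] | b->lo[2] | b->lo[3];
    *occ_hi = b->hi[0] | b->hi[1] | b->hi[2] | b->hi[3];
}
```

The occupancy halves produced this way could feed a 32-bit magic lookup directly, but I have not measured whether this layout is actually faster on real hardware.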

Any suggestions for a 16 * 4 * 32-bit board design with magic bitboards as the preferred move generator?
--
Srdja