Back to the basics, generating moves on gpu in parallel...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Back to the basics, generating moves on gpu in parallel...

Post by smatovic »

Already back in 2008, 1 Mnps was reported for the Nvidia 8800 GPU,
with 64 threads running in parallel for move generation,
sorting and evaluation.

http://gpuchess.blogspot.de/

With my current 64-bit Kogge-Stone vector-based design,
I achieve at most 150 Knps on an Nvidia 8800 GT.

https://github.com/smatovic/Zeta/
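For readers unfamiliar with the approach: Kogge-Stone computes sliding-piece attacks with a branch-free parallel-prefix fill over a 64-bit bitboard. This is not code from Zeta, just a minimal scalar C sketch of the north-ray case (the other seven directions follow the same pattern with different shift amounts and file wrap masks):

```c
#include <stdint.h>

typedef uint64_t U64;

/* Kogge-Stone occluded fill to the north: sliders are propagated
   through empty squares (pro) in log2(8) = 3 doubling steps,
   with no branches and no per-square loop. */
U64 north_occluded_fill(U64 sliders, U64 pro) {
    sliders |= pro & (sliders << 8);
    pro     &= pro << 8;
    sliders |= pro & (sliders << 16);
    pro     &= pro << 16;
    sliders |= pro & (sliders << 32);
    return sliders;
}

/* Attack set = fill shifted one more step along the ray,
   so the first blocker square is included as a capture target. */
U64 north_attacks(U64 sliders, U64 empty) {
    return north_occluded_fill(sliders, empty) << 8;
}
```

The north direction needs no wrap mask; the east/west and diagonal variants additionally mask off the A- or H-file before every shift step.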

It is clear that these older GPUs prefer fp32 or int24 computation
and that bitboards carry some major penalties,
but now, with mixed-precision support on modern Nvidia and AMD GPUs,
these would also profit from half- or quarter-precision computation.

So here is the question:
how does one implement an efficient float- or int8-based move generator
with 64 threads in parallel, running on a SIMD unit?

I have a vector-based 0x88 approach, for 8 directions, in mind,
but my dummy tests cannot outperform the Kogge-Stone design.
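To make the 0x88 idea concrete: on an 0x88 board every off-board square has a bit of mask 0x88 set, so a ray walk needs no lookup tables at all. The names and layout below are hypothetical, a scalar C sketch of what a single lane (one of the eight directions per piece) would execute:

```c
#include <stdint.h>

/* 0x88 board: 128 bytes, a square is on the board iff (sq & 0x88) == 0.
   Eight ray offsets, one per direction; on a GPU each direction could
   map to one vector lane or thread of a 64-wide SIMD unit. */
static const int8_t DIR[8] = { 1, -1, 16, -16, 17, -17, 15, -15 };

enum { EMPTY = 0, BLOCKER = 1 };

/* Walk one ray from 'from' and count reachable target squares
   (empty squares plus a first blocker), emulating one lane's work. */
int ray_targets(const uint8_t board[128], int from, int dir) {
    int count = 0;
    for (int sq = from + dir; (sq & 0x88) == 0; sq += dir) {
        count++;
        if (board[sq] != EMPTY) break;  /* stop at first occupied square */
    }
    return count;
}
```

The per-square loop is the catch: its trip count is data-dependent, so lanes diverge, which may be exactly why such dummy tests lose to the branch-free Kogge-Stone fill.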

I have read that Belle used 64 "threads" in parallel;
could such a design be usefully realized in software?

Any ideas and suggestions are welcome....

--
Srdja
Ed Trice
Posts: 100
Joined: Fri Sep 19, 2014 5:03 am

Re: Back to the basics, generating moves on gpu in parallel.

Post by Ed Trice »

Did you really call this post "Back to the basics..." ??

:)

I sat next to Dr. Hans Berliner at the 1990 Pennsylvania State Chess Championship, the last tournament for his Hi Tech program. I was operating my program, The Sniper, which finished without a loss but tied for 10th. Hi Tech won 5-0 and took clear first place since some of the higher rated players "signed the list" not to be paired against a program.

He told me Hi Tech used 64 processors, and the attacks from each square were all generated in parallel. I'm not sure about Ken Thompson's program, Belle. I remember Belle cost $160,000 to build back in the 1980s, and Bell Labs footed the bill.
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Back to the basics, generating moves on gpu in parallel.

Post by smatovic »

Ed Trice wrote:He told me Hi Tech used 64 processors, and the attacks from each square were all generated in parallel. I'm not sure about Ken Thompson's program, Belle. I remember Belle cost $160,000 to build back in the 1980s, and Bell Labs footed the bill.
Funny, then Belle delivered about 1 nps per dollar...
Thanks for the info.

--
Srdja
bhamadicharef
Posts: 31
Joined: Fri Nov 25, 2016 10:14 am
Location: Singapore

Re: Back to the basics, generating moves on gpu in parallel.

Post by bhamadicharef »

GPU programming, to achieve a high number of moves per second, needs a different approach than conventional CPU programming. It must feed data efficiently to kernels running on lots and lots of threads. One must think carefully about memory access patterns to get high performance. It is sometimes easier to have a clear view of what the CPU version does, make a naive GPU version, and then reorganize the data to feed the beast in the most efficient way it can process it. I will try to find time to study the code from the two links provided and comment later, to try to help if I can.

As for Belle and 64 moves in parallel, there are a few papers in the literature one can read. I for one like the FPGA approach better than the GPU one, because one can design the right function in hardware rather than wrap one's brain around the gymnastics of the GPU's constrained programming model. Both have their strengths and issues!
Brahim HAMADICHAREF
Singapore
DustyMonkey
Posts: 61
Joined: Wed Feb 19, 2014 10:11 pm

Re: Back to the basics, generating moves on gpu in parallel.

Post by DustyMonkey »

Unfortunately, neighborhood sampling operations larger than ~5x5 still perform poorly on the latest GPUs, bottlenecking on cache/memory speeds.

They are much better at this now than they were in the 8800 GT era (even AMD APUs), but the bottleneck remains the same.
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Back to the basics, generating moves on gpu in parallel.

Post by smatovic »

So switching from 64-bit bitboards to a 32-bit board representation and move generator
could lower memory pressure in general... maybe I will give it a try. Thanks.
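A minimal sketch of what such a 32-bit representation might look like (hypothetical names, not from Zeta): one 64-bit bitboard split into two 32-bit halves, with the one-rank north shift emulated in pure int32 operations:

```c
#include <stdint.h>

/* One bitboard as two 32-bit halves: lo holds ranks 1-4, hi ranks 5-8.
   Older GPUs without fast 64-bit integer ALUs can then stay in int32. */
typedef struct { uint32_t lo, hi; } BB32;

/* Equivalent of a 64-bit "<< 8" (shift one rank north). */
BB32 north_one(BB32 b) {
    BB32 r;
    r.hi = (b.hi << 8) | (b.lo >> 24);  /* rank 4 carries into rank 5,
                                           rank 8 falls off the board */
    r.lo =  b.lo << 8;
    return r;
}
```

Whether this pays off would depend on how many extra carry/mask instructions the cross-half shifts cost versus the saved 64-bit ALU and memory traffic.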

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Back to the basics, generating moves on gpu in parallel.

Post by smatovic »

Don't get me wrong:
I run 14*2*64 threads in total on the 8800 GT;
one work-group resp. block consists of 64 threads,
and these are used to do move generation, move picking and eval
for one chess position in parallel.

My benchmarks with the current kernel showed that running 2 to 4
waves of work-groups per SIMD unit can be enough to utilize the unit,
provided that enough fast private and local memory is present.

I achieve 50% VALU utilization for a single work-group on AMD GCN 1.0,
so there is room for branch and memory optimization.

But what bothers me is that the 50% VALU could be used more efficiently
by switching from 64-bit bitboards to an fp32- or int24-based design,
for example.

Considering the doubled memory transfer (e.g. 64-bit attack tables)
and the penalty for 64-bit computation,
there are now two reasons to skip the bitboards.

--
Srdja

ps: I don't know if GPUs will ever be competitive with classic CPU engines,
but this project has become some kind of sporting challenge for me...
jhaglund2
Posts: 65
Joined: Mon Jan 16, 2017 6:28 pm

Re: Back to the basics, generating moves on gpu in parallel.

Post by jhaglund2 »

I like this kind of research. What are your thoughts about this: http://www.talkchess.com/forum/viewtopi ... ht=#413344

With today's GPUs, like the GTX 1080 Ti, I don't see why you can't use a pixel for each node.

Maximum resolution: 7680x4320x60Hz

= 33,177,600 Nps @ 1fps
= 331,776,000 Nps @ 10fps
= 995,328,000 Nps @ 30fps
= 1,990,656,000 Nps @ 60fps


smatovic wrote:already back in 2008 was 1Mnps reported for the Nvidia 8800 gpu, with 64 threads running in parallel for move generation, sorting and evaluation. [...]
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Back to the basics, generating moves on gpu in parallel.

Post by smatovic »

jhaglund2 wrote:With today's GPUs, like the GTX 1080 Ti, I don't see why you can't use a pixel for each node.
I guess the magic lies in how to compute the frames...

I have only little clue about OpenGL and graphics programming;
maybe CUDA and OpenCL offer more familiar features and structures
for a C programmer than OpenGL and DirectX,
so a port of a classic CPU engine to GPU is more likely to happen in
these languages...

--
Srdja
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Back to the basics, generating moves on gpu in parallel.

Post by Dann Corbit »

I think at some point the Jetson approach will win out (unified CPU/GPU memory):

https://developer.nvidia.com/embedded/b ... oifQ%3D%3D
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.