Daniel Shawul wrote:
Daniel, I disagree with you on the occupancy. You just need to be a good programmer to get to a high IPC, and you're also using a dead old GPU. Less talented programmers already get 70% IPC on Fermi and newer; same thing for AMD, by the way.
It is not really about occupancy, it's about performance. I can get to 70% occupancy by limiting register usage with --maxrregcount during nvcc compilation. The compiler does its best to reduce register usage, at the cost of putting some variables, those it judges least often used, in local memory. There are other tricks, like using 'volatile', which boil down to the same thing: sacrificing speed for occupancy.
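For reference, a minimal sketch of those two knobs (the kernel name and the numbers are illustrative, not my actual code):

// compile with e.g.:  nvcc -maxrregcount=32 search.cu
// or cap registers per kernel, so nvcc spills the least-used
// values to local memory by itself:
__global__ void __launch_bounds__(192, 2)   // <=192 threads/block, >=2 blocks/SM
search_kernel(int *out)
{
    // 'volatile' keeps this value out of a register: the same
    // speed-for-occupancy trade described above
    volatile int scratch = threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch;
}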
I very much doubt it is my coding that is the problem, because the nvcc compiler really does optimize register usage even if you write bad code. I don't write code that has lots of branches or very long dependency chains, unless I have to.
Well, you are obviously doing something wrong. What you should do is have generic code that executes while making almost no references to the local shared memory. You evaluate the entire position, execute code, execute code, you do a generic backtrack (and in cores that don't need to backtrack, you don't backtrack); by then you have prefetched the local hashtable entry, and you backtrack again (so those cycles are wasted for the cores that do not backtrack).
Everything except the hashtable is then local, the hashtable being in local shared memory. So we are speaking about thousands of cycles of pure local code that make *no* reference at all to the local shared memory, let alone the global shared memory, let alone the device RAM.
This is what happens in 99.9% of the cases, so you really can get very close to that IPC = 1.0 rate there. Note that the other 'official measures' are not so interesting here, as we are not busy with multiply-add, which accounts for 50% of the paper performance; CPUs or GPUs does not matter there, but it changes the percentages big time.
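A rough skeleton of that pattern (all names invented here; a sketch of the idea, not anyone's real engine code):

__global__ void search_step(unsigned *out)
{
    __shared__ unsigned hash_table[1024];          // the "local shared" hashtable
    hash_table[threadIdx.x & 1023] = threadIdx.x;  // dummy entries for the sketch
    __syncthreads();

    unsigned key   = threadIdx.x * 2654435761u;    // stand-in position hash
    unsigned entry = hash_table[key & 1023];       // probe issued early

    /* ... thousands of cycles of purely local work sit here:
       evaluate the position, execute code, generic backtrack ... */

    if (threadIdx.x & 1)               // stand-in backtrack condition; cores
        out[threadIdx.x] = entry;      // that don't backtrack waste these cycles
}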
Maybe I will save something if I work hard on it, but I doubt it would be a significant improvement.
You should understand that what makes my occupancy so low is register pressure caused by design choices I made: that each thread does its own search,
Well, of course it should, and you do a local shared-memory check only once per node there. Programming that in an efficient manner isn't easy, of course; it's big work.
and that it stores everything it needs in registers and shared memory. You talk about alpha-beta here, but it is virtually impossible to have the move stacks
for each thread on chip. I could implement YBW or something similar if that is what I wanted, but what is the point if it will be slow because everything is stored
in global memory?
YBW is the only algorithm you should consider, of course; the other algorithms are all junk. It took me months of puzzling on paper to solve that problem on Nvidia GPUs. I'm sure there are more solutions, but yeah, it's hard work. Yet saying it's virtually impossible is not correct; I solved it on paper already around 2007-2008.
Saying it's HUGE work is correct, however. Chess programming is like that.
To put it in perspective, I can't even have one array of int[246] for each thread to generate its moves and remove the illegal ones (even on Fermi).
So it really baffles me to think about doing any form of alpha-beta fast without incurring the ~500-cycle latency every time you pick up a move from uncached (barely cached) global memory.
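The arithmetic behind that (assuming Fermi's published per-SM limits of 48 KB shared memory and 32K registers): int[246] is 246 * 4 = 984 bytes per thread. In shared memory that allows 49152 / 984 = about 49 threads per SM, and registers cannot be dynamically indexed as an array at all, so the move stack gets forced out to local/global memory. 49 threads is far below even the ~192 needed to hide pipeline latency, never mind memory latency.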
I understand your problem, but oh boy, you are far off on how to solve things.
Realize you want to preserve the branching factor that YBW gives; anywhere else you can sacrifice. And once again, I did my calculations for the GPU generations before Fermi, as you can see from the date; I DID count on losing up to a factor of 5 in total overhead when implementing a huge evaluation function. What you're busy with here, however, is the 1970s basics of how to write an efficient move generator.
The move generator isn't the biggest problem to solve though. You are ALLOWED to lose something there.
A = x + y; // it now takes 8 cycles for A to be available for further processing
b = A + c;
We have a different concept of latency here. On GPUs, simple instructions like add, mad and others typically take 24 cycles.
This pipeline latency is hidden by having at least 192 threads (6 warps on Tesla, probably twice that on Fermi), so once you have that, you need not worry about this dependency issue. It requires at least 25% occupancy. To hide global memory latency, you would need far more than that...
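Those numbers follow directly (using the 24-cycle figure above and Tesla's issue rate of one warp instruction per 4 cycles per SM): 24 / 4 = 6 warps in flight, and 6 * 32 = 192 threads; against the 768 resident threads a Tesla SM supports, that is 192 / 768 = 25% occupancy.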
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Look at the example: he builds instruction-level parallelism of 4 independent instructions prior to the dependent reference, and that gets the optimal throughput.
He had a somewhat older write-up showing this more clearly; check around on his homepage there.
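In that spirit, a minimal toy kernel (my own sketch, not Volkov's code) with four independent add chains in flight before any dependent use:

__global__ void ilp4(const float *x, float *out, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (i + 3 * stride < n) {     // tail handling omitted in this sketch
        // four independent adds: none depends on another, so all
        // four can be in flight during the pipeline latency
        float a0 = x[i]              + 1.0f;
        float a1 = x[i +     stride] + 1.0f;
        float a2 = x[i + 2 * stride] + 1.0f;
        float a3 = x[i + 3 * stride] + 1.0f;
        out[i]              = a0;
        out[i +     stride] = a1;
        out[i + 2 * stride] = a2;
        out[i + 3 * stride] = a3;
    }
}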