Daniel Shawul wrote:
Daniel, I disagree with you on the occupancy. You just need to be a good programmer to get to a high IPC, and you're also using a dead old GPU. Less talented programmers already get 70% IPC on Fermi and newer; same thing for AMD, by the way.
It is not really about occupancy, it's about performance. I can get to 70% occupancy by limiting register usage with --maxrregcount during nvcc compilation. The compiler does its best to reduce register usage, at the cost of putting some variables, those it judges least often used, in local memory. There are other tricks, like using 'volatile', which boil down to the same thing: sacrificing speed for occupancy.
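For reference, a minimal sketch of those two knobs (the kernel name and the numbers are illustrative, not my actual code):

// compile with e.g.:  nvcc -maxrregcount=32 search.cu
// or cap registers per kernel, so nvcc spills the least-used
// values to local memory by itself:
__global__ void __launch_bounds__(192, 2)   // <=192 threads/block, >=2 blocks/SM
search_kernel(int *out)
{
    // 'volatile' keeps this value out of a register: the same
    // speed-for-occupancy trade described above
    volatile int scratch = threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = scratch;
}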
I very much doubt it is my coding that is the problem, because the nvcc compiler really does optimize register usage even if you write bad code. I don't write code that has lots of branches or very long dependency chains, unless I have to.
Well, you are obviously doing something wrong. What you should do is have generic code that executes while making almost no references to the local shared memory. You evaluate the entire position, execute code, execute code, you do a generic backtrack (and in cores that don't need to backtrack, you don't backtrack); by then you have prefetched the local hashtable entry, and you backtrack again (so those cycles are wasted for the cores that do not backtrack).
Everything except the hashtable is then local, the hashtable being in local shared memory. So we are speaking about thousands of cycles of pure local code that make *no* reference at all to the local shared memory, let alone the global shared memory, let alone the device RAM.
This is what happens in 99.9% of the cases, so you really can get very close to that IPC = 1.0 rate there. Note that the other 'official measures' are not so interesting here, as we are not busy with multiply-add, which accounts for 50% of the paper performance; CPUs or GPUs does not matter there, but it changes the percentages big time.
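A rough skeleton of that pattern (all names invented here; a sketch of the idea, not anyone's real engine code):

__global__ void search_step(unsigned *out)
{
    __shared__ unsigned hash_table[1024];          // the "local shared" hashtable
    hash_table[threadIdx.x & 1023] = threadIdx.x;  // dummy entries for the sketch
    __syncthreads();

    unsigned key   = threadIdx.x * 2654435761u;    // stand-in position hash
    unsigned entry = hash_table[key & 1023];       // probe issued early

    /* ... thousands of cycles of purely local work sit here:
       evaluate the position, execute code, generic backtrack ... */

    if (threadIdx.x & 1)               // stand-in backtrack condition; cores
        out[threadIdx.x] = entry;      // that don't backtrack waste these cycles
}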
Maybe I will save something if I work hard on it, but I doubt it would be a significant improvement.
You should understand that what makes my occupancy so low is register pressure caused by design choices I made: that each thread does its own search,
Well, of course it should, and you do a local shared-memory check only once per node there. Programming that in an efficient manner isn't easy, of course; it's big work.
and that it stores everything it needs in registers and shared memory. You talk about alpha-beta here, but it is virtually impossible to have the move stacks
for each thread on chip. I could implement YBW or something similar if that is what I wanted, but what is the point if it will be slow because everything is stored
in global memory?
YBW is the only algorithm you should consider, of course; the other algorithms are all junk. It took me months of puzzling on paper to solve that problem on Nvidia GPUs. I'm sure there are more solutions, but yeah, it's hard work. Yet saying it's virtually impossible is not correct; I solved it on paper already around 2007-2008.
Saying it's HUGE work is correct, however. Chess programming is like that.
To put it in perspective, I can't even have one array of int[246] for each thread to generate its moves and remove the illegal ones (even on Fermi).
So it really baffles me to think about doing any form of alpha-beta fast without incurring the ~500-cycle latency every time you pick up a move from uncached (barely cached) global memory.
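The arithmetic behind that (assuming Fermi's published per-SM limits of 48 KB shared memory and 32K registers): int[246] is 246 * 4 = 984 bytes per thread. In shared memory that allows 49152 / 984 = about 49 threads per SM, and registers cannot be dynamically indexed as an array at all, so the move stack gets forced out to local/global memory. 49 threads is far below even the ~192 needed to hide pipeline latency, never mind memory latency.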
I understand your problem, but oh boy, you are far off on how to solve things.
Realize you want to preserve the branching factor that YBW gives; anywhere else you can sacrifice. And once again, I did my calculations for the GPU generations before Fermi, as you can see from the date; I DID count on losing up to a factor of 5 in total overhead when implementing a huge evaluation function. What you're busy with here, however, is the 1970s basics of how to write an efficient move generator.
The move generator isn't the biggest problem to solve though. You are ALLOWED to lose something there.
A = x + y; // it now takes 8 cycles for A to be available for further processing
b = A + c;
We have a different concept of latency here. On GPUs, simple instructions like add, mad and others typically take 24 cycles.
This pipeline latency is hidden by having at least 192 threads (6 warps on Tesla, probably twice that on Fermi), so once you have that, you need not worry about this dependency issue. It requires at least 25% occupancy. To hide global memory latency, you would need far more than that...
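Those numbers follow directly (using the 24-cycle figure above and Tesla's issue rate of one warp instruction per 4 cycles per SM): 24 / 4 = 6 warps in flight, and 6 * 32 = 192 threads; against the 768 resident threads a Tesla SM supports, that is 192 / 768 = 25% occupancy.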
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Look at the example: he builds instruction-level parallelism of 4 independent instructions prior to the dependent reference, and that gets the optimal throughput.
He had a somewhat older write-up showing this more clearly; check around on his homepage there.
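In that spirit, a minimal toy kernel (my own sketch, not Volkov's code) with four independent add chains in flight before any dependent use:

__global__ void ilp4(const float *x, float *out, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (i + 3 * stride < n) {     // tail handling omitted in this sketch
        // four independent adds: none depends on another, so all
        // four can be in flight during the pipeline latency
        float a0 = x[i]              + 1.0f;
        float a1 = x[i +     stride] + 1.0f;
        float a2 = x[i + 2 * stride] + 1.0f;
        float a3 = x[i + 3 * stride] + 1.0f;
        out[i]              = a0;
        out[i +     stride] = a1;
        out[i + 2 * stride] = a2;
        out[i + 3 * stride] = a3;
    }
}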