| View previous topic :: View next topic |
| Author |
Message |
Vincent Diepeveen
Joined: 09 Mar 2006 Posts: 1738 Location: The Netherlands
|
Post subject: Re: uct for chess - move gen speedup by vector datatypes Posted: Tue Mar 20, 2012 1:07 pm |
|
|
| Daniel Shawul wrote: |
| Quote: |
Daniel i disagree on the occupancy with you. You just need to be a good programmer to get to a high IPC and also you're using a dead old GPU. Less talented programmers already get 70% IPC at Fermi and newer, same thing for AMD by the way.
|
It is not about occupancy really rather performance. I can get to 70% occupancy by limiting register usage --maxrregcount during nvcc compilation. It does its best to reduce register usage at the cost of putting some variables , that it judges as least often used, in local memory. There are other tricks like using 'volatile' which boils down to the same thing: sacrificing speed for occupancy.
I very much doubt it is my coding that is the problem because nvcc compiler does really optimize register usage even if you write a bad code. I don't write code that has lots of branches or with very big dependency chains, unless I have to.
|
Well you do something wrong obviously; what you should do is have a generic code that executes which is basically doing nearly nothing towards the local shared memory. You evaluate entire position, execute code, execute code, you generic backtrack (and in cores that don't need to backtrack you don't backtrack), then you have prefetched by then the local hashtable and backtrack again (so those cycles are wasted for cores that do not backtrack).
Everything except hashtable is then local, the hashtable being local shared memory. So we speak about thousands of cycles that you execute that are all local code and have *nothing* to do with any reference to the local shared memory, let alone the global shared memory, let alone the device RAM.
This is what happens in 99.9% of the cases, so you can really get a very close to that IPC=1.0 rate there. Note other 'official measures' are not so interesting there as we're not busy with multiply-add, which explains 50% of the paper performance, cpu's or gpu's does not matter there, but it changes the percentages bigtime.
| Quote: |
Maybe I will save some some if I work hard on it but I doubt it would be a significant improvement.
You should understand that what makes my occupancy so low is register pressure caused by design choices I made: that each thread does its own search,
|
Well it should of course and only once per node you do a local shared check there. Programming that in an efficient manner isn't easy of course, it's big work.
| Quote: |
and that it stores everything it needs on registers and shared mem. You talk about alpha-beta here but it is virtually impossible to have the move stacks
for each thread be on chip. I can implement a YBW or something similar if that is what I wanted but what is the point if it will be slow while everything is stored
on global memory.
|
YBW is the only algorithm that you should consider of course, the other algorithms are all junk. Took me months of puzzling on paper to solve that problem on Nvidia GPU's. I'm sure there is more solutions, but yeah it's hard work. Yet saying it's virtual impossible is not correct - i solved it on paper already around 2007-2008.
Saying it's a HUGE work is correct however. Chessprogramming is like that
| Quote: |
To put it in perspective, I can't even have one array of int[246] for each thread to generate its moves and remove the illegal ones (even on fermi).
So it really baffles me to think about doing any form of alpha-beta fast without incuring the 500 cycles latency everytime you pick up a move from uncached (barely cached) global memory..
|
I understand your problem but oh boy - are you far off how to solve things.
Realize you want to preserve the branching factor that YBW gives - anywhere else you can sacrafice. And once again - i did do my calculations for the generation GPU's before Fermi, as you can see from the date - i DID count at losing in total up to factor 5 in overhead when implementing a huge evaluation function. What you're busy with here is however the years 70s basic of how to write an efficient move generator.
The move generator isn't the biggest problem to solve though. You are ALLOWED to lose something there.
| Quote: |
| Quote: |
A = x + y; // it takes now 8 cycles for A to be available for further processing
b = A + c;
|
We have a different concept of latency here. On gpus simple instruction like add, mad and othesr typically take 24 cycles.
This pipeline latency is hidden by at least having 192 threads (6 warps on tesla probably twice of that on fermi), so once you have that you need not worry about it this dependency issue. It requires atleast 25% occupancy. To hide global memory latency , you would need far more than that... |
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Look at the example, he's building an instruction level parallellism of 4 instructions prior to referencing, that gets the optimal throughput.
He had a tad older writing showing this more clearly, check around at his homepage there. |
|
| Back to top |
|
 |
|
| Subject |
Author |
Date/Time |
uct on gpu |
Daniel Shawul |
Fri Feb 24, 2012 5:52 am |
Re: uct on gpu |
Srdja Matovic |
Fri Feb 24, 2012 8:17 am |
Re: uct on gpu |
Srdja Matovic |
Fri Feb 24, 2012 8:45 am |
Re: uct on gpu |
Daniel Shawul |
Fri Feb 24, 2012 1:00 pm |
Re: uct on gpu |
Srdja Matovic |
Fri Feb 24, 2012 1:44 pm |
Re: uct on gpu |
Daniel Shawul |
Fri Feb 24, 2012 2:28 pm |
Re: uct on gpu |
Srdja Matovic |
Fri Feb 24, 2012 3:04 pm |
Re: uct on gpu |
Daniel Shawul |
Fri Feb 24, 2012 3:53 pm |
Re: uct on gpu |
david nash |
Sun Feb 26, 2012 12:42 am |
Re: uct on gpu |
Daniel Shawul |
Thu Mar 08, 2012 1:26 pm |
Re: uct on gpu |
Daniel Shawul |
Sat Feb 25, 2012 8:30 pm |
100x speed up |
Daniel Shawul |
Mon Feb 27, 2012 8:02 pm |
Re: 100x speed up |
Robert Hyatt |
Thu Mar 15, 2012 2:13 pm |
Re: 100x speed up |
Daniel Shawul |
Thu Mar 15, 2012 3:24 pm |
Re: 100x speed up |
Robert Hyatt |
Thu Mar 15, 2012 4:35 pm |
Re: 100x speed up |
Daniel Shawul |
Thu Mar 15, 2012 5:11 pm |
Table |
Daniel Shawul |
Thu Mar 15, 2012 5:51 pm |
Re: 100x speed up |
Robert Hyatt |
Thu Mar 15, 2012 7:36 pm |
Re: 100x speed up |
Daniel Shawul |
Thu Mar 15, 2012 8:21 pm |
Re: uct on gpu |
Daniel Shawul |
Thu Mar 08, 2012 1:08 pm |
uct for chess |
Daniel Shawul |
Mon Mar 12, 2012 10:30 pm |
Re: uct for chess |
Karlo Bala Jr. |
Mon Mar 12, 2012 11:14 pm |
Re: uct for chess |
Daniel Shawul |
Tue Mar 13, 2012 12:13 am |
Re: uct for chess |
Karlo Bala Jr. |
Tue Mar 13, 2012 12:52 pm |
Re: uct for chess |
Srdja Matovic |
Tue Mar 13, 2012 8:08 pm |
Re: uct for chess |
Daniel Shawul |
Tue Mar 13, 2012 9:43 pm |
Re: uct for chess |
Daniel Shawul |
Wed Mar 14, 2012 2:21 am |
Re: uct for chess |
Srdja Matovic |
Wed Mar 14, 2012 11:56 am |
Re: uct for chess |
Daniel Shawul |
Wed Mar 14, 2012 12:46 pm |
Re: uct for chess |
Srdja Matovic |
Wed Mar 14, 2012 1:00 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Srdja Matovic |
Mon Mar 19, 2012 3:04 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Daniel Shawul |
Mon Mar 19, 2012 8:01 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Mon Mar 19, 2012 8:43 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Mon Mar 19, 2012 9:01 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Daniel Shawul |
Mon Mar 19, 2012 10:01 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Tue Mar 20, 2012 12:59 am |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Tue Mar 20, 2012 1:04 am |
Re: uct for chess - move gen speedup by vector datatypes |
Daniel Shawul |
Tue Mar 20, 2012 2:40 am |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Tue Mar 20, 2012 1:07 pm |
Re: uct for chess - MCS, YBW and 32 bit move gen |
Srdja Matovic |
Tue Mar 20, 2012 2:37 pm |
Re: uct for chess - MCS, YBW and 32 bit move gen |
Vincent Diepeveen |
Wed Mar 21, 2012 4:39 pm |
Re: uct for chess - MCS, YBW and 32 bit move gen |
Srdja Matovic |
Wed Mar 21, 2012 5:53 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Daniel Shawul |
Tue Mar 20, 2012 3:18 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Wed Mar 21, 2012 2:13 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Daniel Shawul |
Wed Mar 21, 2012 4:00 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Mon Mar 19, 2012 8:33 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Srdja Matovic |
Mon Mar 19, 2012 9:30 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Tue Mar 20, 2012 12:54 am |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Tue Mar 20, 2012 12:45 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Srdja Matovic |
Tue Mar 20, 2012 2:38 am |
Re: uct for chess - move gen speedup by vector datatypes |
Vincent Diepeveen |
Tue Mar 20, 2012 1:13 pm |
Re: uct for chess - move gen speedup by vector datatypes |
Srdja Matovic |
Tue Mar 20, 2012 1:43 pm |
Re: uct for chess - move gen performance killers |
Srdja Matovic |
Tue Mar 20, 2012 4:45 pm |
intrinsic popcnt |
Daniel Shawul |
Wed Mar 14, 2012 5:21 am |
Re: intrinsic popcnt |
Daniel Shawul |
Wed Mar 14, 2012 5:50 am |
Re: intrinsic popcnt |
Robert Hyatt |
Thu Mar 15, 2012 5:12 pm |
Re: uct on gpu |
Vincent Diepeveen |
Thu Mar 15, 2012 8:14 pm |
Re: uct on gpu |
Daniel Shawul |
Thu Mar 15, 2012 8:27 pm |
Re: uct on gpu |
Vincent Diepeveen |
Sat Mar 17, 2012 1:17 pm |
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
|