uct on gpu

diep · Post by **diep** » Tue Mar 20, 2012 2:13 pm

smatovic wrote:my measurement was wrong on this,

currently i get only 166 Knps per SIMD Unit using a loop over the starting position.

--
Srdja

Is that entire movelists or total nodes generated? Of course as there is no branch prediction in the GPU you can easily benchmark the same position over and over again, unlike CPU's, yet all effort you do for the pieces at a1..h1, they generate in total only 4 moves for the knights in openingsposition.

smatovic · Post by **smatovic** » Tue Mar 20, 2012 2:43 pm

Daniel was right concerning global memory usage,
i just have to delete this line and double the throughput......

Code: Select all

                        
// copy move to global
global_moves&#91;pid*256+n&#93; = move;

running 16*2*32 threads:

with saving moves in global memory, wo legality check:

nodes: 64000000 ,tnode: 0 ,ab-nodes: 0 ,movecount: 1280000000, bestmove: 0 ,sec: 7.000000

without saving moves in global memory, wo legality check:

nodes: 64000000 ,tnode: 0 ,ab-nodes: 0 ,movecount: 1280000000, bestmove: 0 ,sec: 3.000000

Of course this is a synthetic "benchmark".

--
Srdja

smatovic · Post by **smatovic** » Tue Mar 20, 2012 3:37 pm

@Vincent:

Daniel is right with his MCS approach, it performs best on GPUs. If a MCS solution can play good chess is another question.

YBW
You really dont want to implement Master/Slave relations on a GPU, synching threads is a performance killer.

32 Bit Move Generator
Con: board presentation needs more memory,
remind how many registers / thread you have
Con: the solutions i know need nested loops, not good for GPUs,
Pro: GPUs are 32 bit devices

Looking forward to see your 32 bit GPU move generator solution.

--
Srdja

Daniel Shawul · Post by **Daniel Shawul** » Tue Mar 20, 2012 4:18 pm

I am sorry Vincent, but you talk too much for someone who got everything worked out ... sigh ... on paper! I doubt you understand gpu architecture as much as you want us to believe you. GPUs don't work like CPU but you keep on making beginner mistakes repeatedly, like the one you pulled about the piepeline latency. It is clear you mix up cpu with gpu again and again.

If you want to implement YBW, you would require say atleast a lot of memory to store move list and other stuff. Like the 256 move list I mentioned already. That requires 4 x 256 x 64 plies = 64kb for each thread! If you put 1024 threads in an SM , what does that leave you 1 MB. Now do you think you can effectively cache that by whatever thingy you claim you got going in there??

YBW is the only algorithm that you should consider of course, the other algorithms are all junk. Took me months of puzzling on paper to solve that problem on Nvidia GPU's. I'm sure there is more solutions, but yeah it's hard work. Yet saying it's virtual impossible is not correct - i solved it on paper already around 2007-2008.

Saying it's a HUGE work is correct however. Chessprogramming is like that

Lets just say I believe Srdja more than the crap you post here , sorry

At least he tried many algorithms that are more suited to gpu computation.
I now understand why it is really difficult to have a civil conversation without you extending some sort of "superiroty" over other people.

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

Look at the example, he's building an instruction level parallellism of 4 instructions prior to referencing, that gets the optimal throughput.

He had a tad older writing showing this more clearly, check around at his homepage there.

Yes yes, if you back up a few posts I even gave a link to his presentation. So I don't understand the point you are trying to make. They basically suggested keep your occupancy lower while doing everything on registers which reinforces my stand on the issue so far.

smatovic · Post by **smatovic** » Tue Mar 20, 2012 5:45 pm

Simple Move Generation is fast, but legality check, storing in global and sorting is expensive:

Simple loop over starting position, wo legality check, wo storing, wo sorting:

1.3 Mnps per SIMD Unit

Simple loop over starting position, w legality check, wo storing, wo sorting:

500 Knps per SIMD Unit

Simple loop over starting position, w legality check, w storing, wo sorting:

166 Knps per SIMD Unit

Simple loop over starting position, w legality check, w storing, w sorting:

50 Knps per SIMD Unit

--
Srdja

diep · Post by **diep** » Wed Mar 21, 2012 3:13 pm

[snip]

Shawul,

Your messages are very impolite for someone who just takes a look in computerchess.

To me it seems that you and Srdja have not much understanding on what a vector processor is and that you have to do everything with just a few simple operations as anything else is slow on it.

Now Srdja, if i google for what he's doing in reality, he's excused. Seems browsing a tad on the internet is his core job.

Your ideas about storage are far away from reality.
YBW is an algorithm, it doesn't dictate your datastructure.

At vector processing you need creativity, away from the CPU datastructures, away from the cut'n pasting, to achieve what you want to achieve.

FYI the fast assembler engines from the 90s and before didn't store much either, nor did deep blue, nor did hydra.

It seems you don't have the IQ nor practice to achieve simplistic vectorprocessing in fact.

Sure it's a lot of work, but not even the move generator you manage to get going.

Vincent

p.s. Maybe google for Diepeveen + Wagstaff to enlarge your understanding that there are things beyond computerchess where GPU's are everything

p.s.2 for the non-beginners - the only problem to solve on GPU's is not the move generator, it's how to get an efficient SMP going. Efficient in SPEEDUP of course.

p.s.3 but most importantly you seem to fail to understand that at vectorprocessors you need to STREAM to get things done quickly

Daniel Shawul · Post by **Daniel Shawul** » Wed Mar 21, 2012 5:00 pm

Shawul,

Your messages are very impolite for someone who just takes a look in computerchess.

To me it seems that you and Srdja have not much understanding on what a vector processor is and that you have to do everything with just a few simple operations as anything else is slow on it.

I don't like discussing with people with a condescending tone even if they are right (I am not saying you are). I should have stoped discussion with you earlier
if I haven't decided to show courtesy. So if I come off as too impolite I apologize but you should look in the mirror too.

Now Srdja, if i google for what he's doing in reality, he's excused. Seems browsing a tad on the internet is his core job.

See above...

Your ideas about storage are far away from reality.
YBW is an algorithm, it doesn't dictate your datastructure.

At vector processing you need creativity, away from the CPU datastructures, away from the cut'n pasting, to achieve what you want to achieve.

FYI the fast assembler engines from the 90s and before didn't store much either, nor did deep blue, nor did hydra.

It seems you don't have the IQ nor practice to achieve simplistic vectorprocessing in fact.

Shhh...I belive I have done enough vector processing aside from chess to quench my curisoity. But I don't
feel the need to explain myself to you..

Sure it's a lot of work, but not even the move generator you manage to get going.

Haha.good one.. my approach works as intended. And I know why I want to avoid alpha-beta stuff when I started..
If you have a solution for it, explain so we can benefit. But so far nothing has come out.

Vincent

p.s. Maybe google for Diepeveen + Wagstaff to enlarge your understanding that there are things beyond computerchess where GPU's are everything

Vincent, intimidation is not gonna work. I for one hope these weird math guys stop kepping the cluster busy computing prime numbers.
I know there are uses for it for security and whatever but still..

p.s.2 for the non-beginners - the only problem to solve on GPU's is not the move generator, it's how to get an efficient SMP going. Efficient in SPEEDUP of course.

p.s.3 but most importantly you seem to fail to understand that at vectorprocessors you need to STREAM to get things done quickly

Ok vincent please explain your methodology. I have always enjoyed your posts in the past so I would like to keep it that way.

diep · Post by **diep** » Wed Mar 21, 2012 5:39 pm

Srdja,

Diep already since 2002 implements YBW in a total SMP way, there is no master and there are no slaves. Everywhere it can split, in fact it can split simultaneously at the ATOMIC same time at several spots in the tree.

You can prove of course that otherwise it wouldn't run on a 512 processor supercomputer very well if you do master-slave relationships inside the search.

In a well implemented YBW search the search is only for a very short time frame at just 1 core until you divide it over the rest, after that the same thing happens in parallel everywhere. If you do *that* also as a master slave, like Crafty/SF and most YBW implementations do, that won't work well at a supercomputer of course.

Crafty avoids some problems for more than 4 cores there by splitting very little. Bad for speedup however. In diep i quickly split and happily keep doing that.

You can also mathematically prove that the concept of YBW, not necessarily to the full implemented, but at least the starting concept, is a manner to preserve the branching factor *a little* (not fully) and so far no other algorithm it's easy to prove that for and no other algorithm so far exists that's doing exactly that.

What we do know however is that, take my cluster as an example, aborting cpu's there takes microseconds as a minimum. The switching inside a GPU goes way faster than that. You can abort it in nanoseconds in theory.

In practice of course you need to do that very little at cpu's, so effectively it's milliseconds at each cpu, yet it can be microseconds at GPU's.

At gpu you just accept the overhead of cores that searched without having a job and each X nodes redivide search tree within 1 SIMD.

Another major advantage at gpu's, is that it's nonstop alternating each few cycles different threads, so you can do at gpu's a few things you can't do at CPU's which avoid some of the damage done by running that many threads

CPU's really suffer everywhere from the runqueue latency which fires each 10 milliseconds. In reality it's slower than that of course.

GPU's do not have this problem at all. Within 1 SIMD in fact you have complete deterministic search in fact. That's a HUGE advantage when debugging for performance and bugs.

You shouldn't ask yourself at a GPU: "can Monte Carlo with UCT play chess a tad?" as it will be 800 elo worse of course (i picked an arbitrary elo difference that it's worse. Could be 1000, could be 500, it's too much in short). The question is simply: "How do i get YBW to work in a fast manner without too much of an overhead?"

Same thing for Go of course - UCT there is a joke as well of course. Just because they don't know how to search there and some of the 'top engines' at the time were forward pruning even in the root, the UCT type engines won suddenly.

UCT is overrated in that sense. UCT is the most trivial form of selective search that isn't brute force, yet requires huge overhead.

At GPU's already in advance going for something total inferior is not a rather good idea.

My advice at the gpu's would be: whatever the hell you do on it, even if it is a tiny thing, try to get good performance out of that tiny thing.

The fact that you can very quickly run different kernels there is a huge advantage, which is total impossible at CPU's. Use that.

By accepting overhead you reduce the inefficiency.

But now i really have posted enough on this subject.

GPU programming is not for beginners, that's the whole problem.

smatovic wrote:@Vincent:

Daniel is right with his MCS approach, it performs best on GPUs. If a MCS solution can play good chess is another question.

YBW
You really dont want to implement Master/Slave relations on a GPU, synching threads is a performance killer.

32 Bit Move Generator
Con: board presentation needs more memory,
remind how many registers / thread you have
Con: the solutions i know need nested loops, not good for GPUs,
Pro: GPUs are 32 bit devices

Looking forward to see your 32 bit GPU move generator solution.

--
Srdja

smatovic · Post by **smatovic** » Wed Mar 21, 2012 6:53 pm

Looking forward to see a Diep GPU Version....i would even buy one.

Now it is up to you to show your skills Vincent. Good luck.

--
Srdja

uct on gpu

Re: uct for chess - move gen speedup by vector datatypes

Re: uct for chess - move gen speedup by vector datatypes

Re: uct for chess - MCS, YBW and 32 bit move gen

Re: uct for chess - move gen speedup by vector datatypes

Re: uct for chess - move gen performance killers

Re: uct for chess - move gen speedup by vector datatypes

Re: uct for chess - move gen speedup by vector datatypes

Re: uct for chess - MCS, YBW and 32 bit move gen

Re: uct for chess - MCS, YBW and 32 bit move gen