uct on gpu

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 08, 2012 2:08 pm

A ton of very serious bugs were fixed. For starters the uct select was completely broken producing rather strange shallow and uniform tree. But after the many bug fixes, it starts producing good result as shown below. Also the occupancy of my last test was at 50% so I have a chance to increase that to 100% by simply changing configurations. Infact the occupancy has been slightly increased on my current gpu to 67% after the register usage reduced when the bugs are fixed. So that directly translated to performance increase. On capability >1.3 block size of 256 gives 100% occupancy so that will be doubled performance if everything goes well.
One other thing is that now there is absolutely no synchronization (__syncthreads) except for the inherent warp sync. A warp is the smallest working unit to fetch a node from DRAM; previously it was a block. The threads in a warp can diverge since each does its own monte carlo game on the same initial board. That has speeded up the code and simplified it as well.
On the initial position e4! (this is HEX btw!) is chosen as the best move and it was expanded to depth 7 after 45 mil simulation. Time is decreased to 5.375sec compared to previous ~5.5. Tree is also very selective.

Code: Select all

0.h8     22323755     45989376     0.485411
        1.a1       148364       291840     0.508374
        1.b1        88821       177152     0.501383
        1.c1        10350        20992     0.493045
        1.d1         7756        15872     0.488659
        1.e1         8021        16384     0.489563
        1.f1        36382        75264     0.483392
        1.g1        54957       116224     0.472854
        1.h1          947         2048     0.462402
        1.a2        58208       116224     0.500826
        1.b2       254948       501248     0.508626
        1.c2       174206       344064     0.506319
        1.d2        59118       116736     0.506425
        1.e2        58413       116224     0.502590
        1.f2        57629       116736     0.493669
        1.g2         1467         3072     0.477539
        1.h2          680         1536     0.442708
        1.a3        58347       117760     0.495474
        1.b3       239661       470528     0.509345
        1.c3       185939       364032     0.510777
        1.d3       238775       470528     0.507462
        1.e3       174534       344576     0.506518
        1.f3        53076       105472     0.503224
        1.g3        57767       117248     0.492691
        1.h3         1972         4096     0.481445
        1.a4        54223       111616     0.485799
        1.b4       156572       313344     0.499681
        1.c4       259251       508928     0.509406
        1.d4       402122       789504     0.509335
        1.e4      1890305      3686400     0.512778
        1.f4       120209       241152     0.498478
        1.g4        54437       108544     0.501520
        1.h4        57571       119296     0.482590
        1.a5        37182        76800     0.484141
        1.b5        58508       116736     0.501199
        1.c5       172019       337408     0.509825
        1.d5       149083       292352     0.509943
        1.e5     15977301     30803968     0.518677
        1.f5       256793       503296     0.510223
        1.g5        59556       117248     0.507949
        1.h5        57704       118784     0.485789
        1.a6         1699         3584     0.474051
        1.b6        58560       118784     0.492996
        1.c6        59423       117760     0.504611
        1.d6       225991       443392     0.509687
        1.e6       177571       348160     0.510027
        1.f6       211449       414720     0.509860
        1.g6       115356       231424     0.498462
        1.h6        57993       117760     0.492468
        1.a7         2211         4608     0.479818
        1.b7         1935         4096     0.472412
        1.c7        57972       118272     0.490158
        1.d7        58431       117760     0.496187
        1.e7        60376       119296     0.506102
        1.f7       225979       445952     0.506734
        1.g7       130554       256000     0.509977
        1.h7        59264       118784     0.498922
        1.a8         1896         4096     0.462891
        1.b8         1966         4096     0.479980
        1.c8         2192         4608     0.475694
        1.d8        57759       119296     0.484165
        1.e8         1983         4096     0.484131
        1.f8        58658       119296     0.491701
        1.g8        58889       117760     0.500076
        1.h8       127233       249856     0.509225
Total nodes   &#58; 31499
Leaf  nodes   &#58; 30985
Maximum depth &#58; 7
Average depth &#58; 3.72
Errors&#58; no error
time 5375

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 08, 2012 2:26 pm

Nope texture is constant memory so i can't use it for computation. However I have used the constant memory to store bitboard tables for move generation in chess. I use minimal kindergarten and all other bitboard tables summed up to ~3kb. Since constant/texture memory is cached and infact each mp has 8kb (64kb total) , that made first_one and all attacks (rooks & bishops) calculation as fast what can be achieved if they were all loaded in shared memory or registers. I measured it infact. When those tables are put in global memory I get 10x slow down.
Now the obstacle is where to store the generated moves. I have to use global memory if I am to do any kind of alpha beta. But for monte carlo I need only to generate a random legal move which simplifies that requirement. The problem is I can't even store one move generation step (256 moves max) on shared memory to check which moves are legal. I am left with something that can stor 16 moves in shared mem. May be I will try a sliding test with that available memory. Or finish up the random legal move generation without using any. If that is done I will probably have world's fastest chess perft estimator with a uct_perft. Stay tuned..

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 11:30 pm

I have finally finished a uct for chess on gpu. It is not that it is difficult but that move generation and other stuff consume space. Go would be ideal for UCT but requirement of linked list of stones for capture detection is a performance killer. Checkers would be another game ideally suited for gpu if i manage to generate those captures ...
It turned out to be a lot more slower than Hex since move generation and legality testing is a pain in the neck. The most annoying thing is that
there is not enough space to declare an int[256] for each thread to hold on the moves generated.
Bear in mind it is not a stack but just one entry. Here are the callenges:

a) The move generation step I mentioned above. I have gone through great length to generate one legal random move:
1 - Count pseudo legal moves
2 - generate a random number and pick which move to test
3 - generate only that move
4 - make the move and test for legality
5 - keep a bitset of 256 bits to mark moves that are already tested. This added 7 more registers (total = 59 now)
6 - If all bits are set as is the case when checkmated, break out.
I use kindergarten bitboards (8 bitboards total) which are light weight (512bytes for bishop/rook attacks).
But legal move generation is a pain due to either big table requirement in_between, direction[64][64] etc..
or something else. If I can generate a random legal move, it would save me from going through all those trouble.

b) Due to high register pressure (almost 60 right now), the occupancy is very low. Only 17% at my GPU right now.
But it is still 2-3x faster. I can offload some of those to shared memory but it is debatable whether trying to increase
occupancy that way is better. Some guys who did FFT using cuda claim that lower occupancy with high register usage actually
runs faster. Also the nvcc optimizer is very good so trying to squeeze out a register or two takes a lot of effort!

b) I use a copy / make instead of do / undo basically because there is n't space for a stack. But to test for legality
of pseudo legal moves I needed to save enpassant,castle,fifty,player flags just for one step. So undo is used there.
For the rest of MC playout a copy/make is used. BTW the 8 bitboards for positions representation simplifies do/undo and
my undo_move is basically do_move itself. A little problem I encounted with that board representation is that it is difficult
/slower to know what kind of piece is being captured. I delay that detection until I actually make the move ..

c) 32 bit firstone and bitcounts are ok but still very slow. I need to count psedudo legal moves before generating them.
The bitboards are very sparse and maybe a simpler version of population count would help.

e) Can't use arrays if you want your variable to be stored on register. Could be a pain for coding with many switch staments needed.

Probably some more. Ideas are welcome especially on how to generate legal random moves for chess preferable without using
large tables.

I guess that if I modify the uct code a bit it would be a fast UCT_perft approximator.

---
cheers

Karlo Bala · Post by **Karlo Bala** » Tue Mar 13, 2012 12:14 am

Daniel Shawul wrote:I have finally finished a uct for chess on gpu. It is not that it is difficult but that move generation and other stuff consume space. Go would be ideal for UCT but requirement of linked list of stones for capture detection is a performance killer. Checkers would be another game ideally suited for gpu if i manage to generate those captures ...
It turned out to be a lot more slower than Hex since move generation and legality testing is a pain in the neck. The most annoying thing is that
there is not enough space to declare an int[256] for each thread to hold on the moves generated.
Bear in mind it is not a stack but just one entry. Here are the callenges:

a) The move generation step I mentioned above. I have gone through great length to generate one legal random move:
1 - Count pseudo legal moves
2 - generate a random number and pick which move to test
3 - generate only that move
4 - make the move and test for legality
5 - keep a bitset of 256 bits to mark moves that are already tested. This added 7 more registers (total = 59 now)
6 - If all bits are set as is the case when checkmated, break out.
I use kindergarten bitboards (8 bitboards total) which are light weight (512bytes for bishop/rook attacks).
But legal move generation is a pain due to either big table requirement in_between, direction[64][64] etc..
or something else. If I can generate a random legal move, it would save me from going through all those trouble.

b) Due to high register pressure (almost 60 right now), the occupancy is very low. Only 17% at my GPU right now.
But it is still 2-3x faster. I can offload some of those to shared memory but it is debatable whether trying to increase
occupancy that way is better. Some guys who did FFT using cuda claim that lower occupancy with high register usage actually
runs faster. Also the nvcc optimizer is very good so trying to squeeze out a register or two takes a lot of effort!

b) I use a copy / make instead of do / undo basically because there is n't space for a stack. But to test for legality
of pseudo legal moves I needed to save enpassant,castle,fifty,player flags just for one step. So undo is used there.
For the rest of MC playout a copy/make is used. BTW the 8 bitboards for positions representation simplifies do/undo and
my undo_move is basically do_move itself. A little problem I encounted with that board representation is that it is difficult
/slower to know what kind of piece is being captured. I delay that detection until I actually make the move ..

c) 32 bit firstone and bitcounts are ok but still very slow. I need to count psedudo legal moves before generating them.
The bitboards are very sparse and maybe a simpler version of population count would help.

e) Can't use arrays if you want your variable to be stored on register. Could be a pain for coding with many switch staments needed.

Probably some more. Ideas are welcome especially on how to generate legal random moves for chess preferable without using
large tables.

I guess that if I modify the uct code a bit it would be a fast UCT_perft approximator.

---
cheers

Perhaps you should simply allow illegal moves. End of a game will be when the king is captured. I suppose that you'll get tactically stronger engine because UCT will explore more across checks due to more king captures.

Daniel Shawul · Post by **Daniel Shawul** » Tue Mar 13, 2012 1:13 am

Perhaps you should simply allow illegal moves. End of a game will be when the king is captured. I suppose that you'll get tactically stronger engine because UCT will explore more across checks due to more king captures.

That is infact what I do for CPU nebiyu in general game playing mode and when using alpha-beta. I have an option "allow_king_capture" that is set to on by default. The reason is that I get more than 100% slow down if I do legality testing (1Mnps vs 0.5Mnps). I was surprized about that at first. A very slow attacksTo() is cause of that problem. But soon I discovered it is not so straight forward when using UCT (maybe I have not thought it through). The problem is if I allow any move when being in check, then on the next ply the opponent _must_ capture my king and not make a random move. I got all sorts of crashes before I realized that... So if that is required, testing for legality (attacks(opp,king)) right after making the move becomes equivalent to letting it go one more ply.
For alpha-beta I test if the side to play has a king otherwise it is considered as checkmated. This doesn't need attacks() however in monte-carlo play a checkmate could be simply skipped since both basically make a random move without checking if their king is attacked.

Karlo Bala · Post by **Karlo Bala** » Tue Mar 13, 2012 1:52 pm

Daniel Shawul wrote: For alpha-beta I test if the side to play has a king otherwise it is considered as checkmated. This doesn't need attacks() however in monte-carlo play a checkmate could be simply skipped since both basically make a random move without checking if their king is attacked.

I had similar problem when I played with the BPIP-DFISA. If one side is checkmated and the other side fails to capture the king it will do it in the next few moves, because most of the time checkmated side will not find a way out (since both side play random moves). If the rules of chess have been somewhat different, and if the goal was to capture the king, the algorithm that treat check (and out of check) like every other move would be perfectly sound.

However, it is just a theory, and theories are cheap

smatovic · Post by **smatovic** » Tue Mar 13, 2012 9:08 pm

Yep, to achieve a full occupancy of the device is nearly impossible with such an complex chess move generation and such little registers.

With Zeta i use something above 60 registers and can run on my Device 16*2*32 threads, full occupancy would mean running 16*24*32 therads.

I ordered an AMD HD 7750, the new GCN architecture has 32 KB registers for each SIMD Unit, each SIMD can run 10*16 threads. So you got 204,8 Bytes per Thread, great! But unfortunately there are still no drivers for Linux.

Maybe you could switch to an QuadBitboard Board presentation, uses only 32 Bytes.

In Cuda there should be somekind of shared memory, in OpenCL called local memory. It is fast and big enough to hold for each thread the Board. So you could save some registers for computation.

--
Srdja

Daniel Shawul · Post by **Daniel Shawul** » Tue Mar 13, 2012 10:43 pm

Yep, to achieve a full occupancy of the device is nearly impossible with such an complex chess move generation and such little registers.

With Zeta i use something above 60 registers and can run on my Device 16*2*32 threads, full occupancy would mean running 16*24*32 therads.

My GPU is an old one so I can't get full occupancy even for HEX (50% only). Even though I use 60 registers for chess, I do not read from global memory any variables so I am not all that unhappy about it. I can control latency by simply doing more monte-carlo simulations which would be impossible if that too uses global memory.

I ordered an AMD HD 7750, the new GCN architecture has 32 KB registers for each SIMD Unit, each SIMD can run 10*16 threads. So you got 204,8 Bytes per Thread, great! But unfortunately there are still no drivers for Linux.

That is great. Hopefully someday I will test it on a fermi which has 3x more registers.

Maybe you could switch to an QuadBitboard Board presentation, uses only 32 Bytes.

I have in total 9 bitboards (18 registers) so it is not a lot but the calculations to generate one random legal move are intensive and before you know it you hit 60 registers. Also quad bitboards require SIMD operations for decoding if I am not mistaken. GPUs are SIMT and as such do not have SSE instructions.

In Cuda there should be somekind of shared memory, in OpenCL called local memory. It is fast and big enough to hold for each thread the Board. So you could save some registers for computation.

I have infact ample space from shared memory that I can spend. But so far no improvement. I put the Board struct on shared memory , neither the register usage decreased nor it run equally faster. But theoretically that is the only option left. Btw what cuda calls "local memory" is a disguised slow as a tortoise global memory (much like thread local storage in cpu). I can spill some registers to local memory during compilation but it runs slower. Also as I mentioned before manually moving some variables to shared mem didn't improve performance for some people working on fast libraries. But I will keep on digging.
-----
cheers

Daniel Shawul · Post by **Daniel Shawul** » Wed Mar 14, 2012 3:21 am

Here is the interesting article on occupancy which has put me in dilemma. They argue the programming guides of nvidia are misleading by showing an example that achieves 84% of the peak performance at a mere 4% occupancy. They also say shared memory is not as fast as advertised and the gap between that and arithmetic throughput is ever increasing with the newer gpus. And much more ...

Daniel Shawul · Post by **Daniel Shawul** » Wed Mar 14, 2012 6:21 am

c) 32 bit firstone and bitcounts are ok but still very slow. I need to count psedudo legal moves before generating them.
The bitboards are very sparse and maybe a simpler version of population count would help.

Good lord. I was dead wrong on this. There are all sorts of integer instructions even for 64 bitboards __popcll() , __clzll() and what have you. If your integers are 24 bit then there are faster multiplications __mul24() as well. I got some speed up with the popcnt already. I assumed that they wouldn't bother implementing integer intrinsics since floats are faster and that is what gpu are intended for. Anyway I hope Gerd sees this and do his magic

Edit: Wait. That was a fluke, intrinsics are slower especially the __ffsll(). But I shaved of 5 registers and removed some tables by using intrinsic. I will do unit-testing to find if they are actually slower.

The intrinsics are hardware implemented according to this page so they should be faster.

Btw the bitmagic library has some SSE popcnt routines that can be used when there is no hardware support for it. Here is a straight forward translation of the well known 32 bits pop count.

uct on gpu

Re: uct on gpu

Re: uct on gpu

uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess

intrinsic popcnt