Go questions

Daniel Shawul · Post by **Daniel Shawul** » Tue Feb 14, 2012 3:45 pm

I am implementing an SMP UCT now. It seems to me that locks are barely needed. Right now I use them only when creating child nodes at create_children() and when incrementing number of wins and visits atomically. I don't think UCT_Select or RAVE updates need a lock since errors from there can be tolerated I think. So it is somewhat surprising that the performance at 8xprocessors degraded so severely when using locks. Maybe I missed a place that needs synchronization. Note that I don't use hash tables but a vector of nodes, so it is not exactly the same as your implementation. More later.

Daniel Shawul · Post by **Daniel Shawul** » Wed Feb 15, 2012 4:37 am

I have implemented it as I described above and it seems that it scales very well up to 4 processors. You had also good results there even with locking. I don't have access to 8 processor node now so I will have to wait. Hopefully tomorrow I will do a test maybe up to 16 processors.
Can you describe your implementation briefly ? I wonder if using hash tables rather than list of nodes makes a difference. For example, you mentioned that you use memory fence operations that i do not use.

Edit:
I think your numbers with the locked hash table are bad compared to mine. Here are my results up to 4 processors for the first run I made.
My results are even better than the lockless scheme but I may have missed something...

Code: Select all

Your results&#58; 
   PPS                      Scaling    
   Lock           No_Lock   Lock       No_lock 
1   22,477.90   22,447.90   1.0000   1.0000 
2   37,769.80   43,076.90   1.6803   1.9190 
3   55,888.20   60,825.50   2.4864   2.7096 
4   61,448.40   79,470.20   2.7337   3.5402 
5   64,665.10   95,346.20   2.8768   4.2474 
6   62,407.10   110,092.00   2.7764   4.9043 
7   66,508.30   130,638.00   2.9588   5.8196 
8   59,196.60   146,341.00   2.6335   6.5191 
          
My results&#58;    
     PPS           Scaling 
     Lock          Lock    
1    5,148.00      1.0000    
2   10,010.00      1.9444    
3   14,664.00      2.8485    
4   18,979.00      3.6867

Edit2
And here is the result with the old distributed algorithm that don't share anything. It scales perfectly as expected.

Code: Select all

1    4,989.00      1.0000 
2   10,090.00      2.0224 
3   15,090.00      3.0247 
4   19,982.00      4.0052

Also I think that the behaviour of the parallel search could be significantly different if no tree is shared as above. I would guess there would be more exploration of alternative moves since each processor has to establish its own best moves and tree. That is another reason that kept me from trying other alternatives.

Rémi Coulom · Post by **Rémi Coulom** » Wed Feb 15, 2012 9:26 am

Your playouts are 4 times slower, so that explains why your results are better.

1 spinlock/node is simple and should perform close to optimality. Maybe I avoid a spinlock here and there by using a compare-and-swap to assign a pointer to a new node, but that should perform almost the same as a lock.

What's really important is to avoid a global lock over the whole stree, and have a local lock for each node. My lock data was with one lock for the whole tree.

Rémi

Laskos · Post by **Laskos** » Wed Feb 15, 2012 12:28 pm

Rémi Coulom wrote:
340 is obviously very wrong.

In handicap games vs pros, the current handicap is about 5 stones. That is more than 1000 Elo. Top programs have really no chance at all without handicap.

Also, there are diminishing returns. Here is an old 9x9 scalability study:
http://cgos.boardspace.net/study/index.html

But that study was computer vs computer. I guess that computer vs human has even bigger diminishing returns, although I have very little data. My experience is that I sometimes measure a 100 Elo improvement in Crazy Stone in comp vs comp, but get almost no improvement when connecting the new version to KGS.

An interesting aspect of computer go is that we have programs based on completely different algorithms. I found that a N-point Elo improvement against another Monte-Carlo program is usually worth N/2 points against a classical program such as GNU Go, and maybe even less against humans. So Elo ratings are a bit meaningless. It is important to understand how wrong the Elo model is.

I expect that if I could make Crazy Stone 1 million times faster, it would still not be enough to beat pros. We need better algorithms.

Rémi

Thanks for the plots. Basically they say that the rating in Elo in bot-bot games (related unclearly to stones or amateur dan in human games) is close to logarithmic with the number of doublings. The factor in front of the logarithm varies, depending on the Monte Carlo program, but not by much. Against humans it could be worse than logarithmic.

I see now the total mess with Elo terminology, I should have been more careful asking questions. Now I hope to make more sense:

What number of doublings you think are necessary for Crazy Stone to go up 1 dan on KGS (against humans) starting with these conditions: 24 cores, 10 sec/move (or close to that)?

The same: 24 cores, 300 sec/move (or close to that)?

How Crazy Stone fares with the doublings as compared to other Monte Carlo bots? Is it pretty representative? Are there several scaling patterns in non-Monte Carlo bots?

Thanks,

Kai

Daniel Shawul · Post by **Daniel Shawul** » Wed Feb 15, 2012 1:14 pm

What's really important is to avoid a global lock over the whole stree, and have a local lock for each node. My lock data was with one lock for the whole tree.

That is such a lazy scheme to compare to IMO . But I should have seen that is what you meant; I just didn't expect it. A better comparison to evaluate the effect of the lockless strategy would be to use a lock for each TT entry (node in my case). That is what is done in chess. Locking the _whole_ TT every time a thread writes/reads an entry is a severe synchronization requirement which introduces so much idle time. The nps scaling would be perfect for any number of processors, obviously not with a global lock. So there is a difference there too.

Your playouts are 4 times slower, so that explains why your results are better.

Yes I understood that and I was not really concerned about comparison with the lockless method. For that I would have to make many runs anyway.

1 spinlock/node is simple and should perform close to optimality. Maybe I avoid a spinlock here and there by using a compare-and-swap to assign a pointer to a new node, but that should perform almost the same as a lock.

Yep. The locks are barely needed so the gain from going lock less is very small.

Daniel

Laskos · Post by **Laskos** » Wed Feb 15, 2012 7:08 pm

Laskos wrote:
Rémi Coulom wrote:
340 is obviously very wrong.

In handicap games vs pros, the current handicap is about 5 stones. That is more than 1000 Elo. Top programs have really no chance at all without handicap.

Also, there are diminishing returns. Here is an old 9x9 scalability study:
http://cgos.boardspace.net/study/index.html

But that study was computer vs computer. I guess that computer vs human has even bigger diminishing returns, although I have very little data. My experience is that I sometimes measure a 100 Elo improvement in Crazy Stone in comp vs comp, but get almost no improvement when connecting the new version to KGS.

An interesting aspect of computer go is that we have programs based on completely different algorithms. I found that a N-point Elo improvement against another Monte-Carlo program is usually worth N/2 points against a classical program such as GNU Go, and maybe even less against humans. So Elo ratings are a bit meaningless. It is important to understand how wrong the Elo model is.

I expect that if I could make Crazy Stone 1 million times faster, it would still not be enough to beat pros. We need better algorithms.

Rémi

Thanks for the plots. Basically they say that the rating in Elo in bot-bot games (related unclearly to stones or amateur dan in human games) is close to logarithmic with the number of doublings. The factor in front of the logarithm varies, depending on the Monte Carlo program, but not by much. Against humans it could be worse than logarithmic.

I see now the total mess with Elo terminology, I should have been more careful asking questions. Now I hope to make more sense:

What number of doublings you think are necessary for Crazy Stone to go up 1 dan on KGS (against humans) starting with these conditions: 24 cores, 10 sec/move (or close to that)?

The same: 24 cores, 300 sec/move (or close to that)?

How Crazy Stone fares with the doublings as compared to other Monte Carlo bots? Is it pretty representative? Are there several scaling patterns in non-Monte Carlo bots?

Thanks,

Kai

I thought to re-formulate one question in a probably easier to answer question:

How many dan on KGS (against humans) Crazy Stone gains going from 1 core to 24 cores in the above mentioned time/move conditions?

Thanks,
Kai

Rémi Coulom · Post by **Rémi Coulom** » Wed Feb 15, 2012 7:48 pm

I don't know what would be the KGS rating on one core.

Rémi

Laskos · Post by **Laskos** » Wed Feb 15, 2012 11:21 pm

Rémi Coulom wrote:I don't know what would be the KGS rating on one core.

Rémi

Sorry for pretty silly questions, I just want to have a picture of Crazy Stone (one the best Go programs), some other Monte Carlo programs and non-Monte Carlo programs scaling, given your new to me 5 stone difference between 24 core Crazy Stone and a pro Go human player. You stated that not even 1,000,000 cores would be enough to close the 5 stone gap. In this case a factor of 24 could be estimated at a little less than 1 stone on KGS, but could be more for the first 24 cores and less for the following factors if remembering the highly visible "diminishing return", as pointed out by you and Uri.

Anyway, thanks and good luck with your algorithms,

Kai

Daniel Shawul · Post by **Daniel Shawul** » Sun Feb 19, 2012 3:41 pm

I still haven't found free time on a big node but I have good news for the lockless scheme on a GPU. I implemented a fixed size array to build a shallow tree in parallell using atomic operations: atomicCAS, atomicAdd atomicExch() and __threadfence() . The lockless scheme seems to be much better. Even locking & then immediately unlocking a global spinlock without doing any other work seems to affect performance a lot.

Code: Select all

atomicAdd&#40;&size,-1&#41;; 
__threadfence&#40;); 
atomicAdd&#40;&#40;int*&#41;head,1&#41;;

And with lock

Code: Select all

while&#40;atomicCAS&#40;&lock,0,1&#41; != 0&#41;; 
size++; 
head--; 
atomicExch&#40;&lock,0&#41;;

I used 256 threads per multiprocessor. All active blocks will call onto the locks to get free nodes, so that will be possibly thousands of threads calling the same lock. Perfromance of hashtable on gpu is generally very bad but that is besides the point. The lock-less scheme maybe useful when thousands of threads are involved.

Edit: There is an issue with the locked test. Since warps have to execute the same instruction at any time ,the spinlock mechanism using atomicCAS doesn't work. I will fix that and see if I get the same slow down as before.

Daniel

Go questions

Re: Go questions

Re: Go questions

Re: Go questions

Re: Go questions

Re: Go questions

Re: Go questions

Re: Go questions

Re: Go questions

Re: Go questions