Intel 48-core CPU scales to 1k cores

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Intel 48-core CPU scales to 1k cores

Post by rbarreira »

Gian-Carlo Pascutto wrote:
rbarreira wrote: Obviously it couldn't be dramatic, unless the elo-per-doubling was much higher for Go than for Chess (in which case computers would be improving really fast at Go).
IIRC about 120 ELO per doubling, about twice as much as for chess. (And yes, computers are improving rapidly, but the difference between human ranks is also big)
Show me a chess searcher which gets near 100 elo per doubling (or even 50?) with 32 threads.
Impossible regardless of scaling because of the argument you already identified above.
It would help if you were a bit more explicit.

You keep saying stuff without directly refuting anything or even saying what you think is the scaling of Chess and Go for a high number of threads. This is confusion, not debate.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Intel 48-core CPU scales to 1k cores

Post by bob »

rbarreira wrote:
Gian-Carlo Pascutto wrote:
rbarreira wrote: But the average branching factor is higher in Go, correct? Around 200 instead of 35 or so I've heard. So it should be easier to parallelize a Go tree.
This is only true if you would not prune anything (minimax). See previous post.
Why is it only true without pruning? Even with pruning, having more moves to distribute among CPUs can't hurt, and probably helps. In chess, if you're at an ALL node and you have a 100-core CPU, you don't have enough moves for all the CPUs. In Go you would.

In any case, the Monte Carlo part can be parallelized very easily, which makes me wonder how you were so quick to dismiss the value of more cores for Go.
Because we have those pesky fail-highs. A much larger effective branching factor simply reduces the total number of split operations required, but overhead does not go down at all.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Intel 48-core CPU scales to 1k cores

Post by bob »

rbarreira wrote:
Gian-Carlo Pascutto wrote:
rbarreira wrote: Obviously it couldn't be dramatic, unless the elo-per-doubling was much higher for Go than for Chess (in which case computers would be improving really fast at Go).
IIRC about 120 ELO per doubling, about twice as much as for chess. (And yes, computers are improving rapidly, but the difference between human ranks is also big)
Show me a chess searcher which gets near 100 elo per doubling (or even 50?) with 32 threads.
Impossible regardless of scaling because of the argument you already identified above.
It would help if you were a bit more explicit.

You keep saying stuff without directly refuting anything or even saying what you think is the scaling of Chess and Go for a high number of threads. This is confusion, not debate.
I don't follow the semantics. If you mean 32 cores at 2.0 GHz, and then going to 32 cores at 4.0 GHz, that is a real doubling, and would produce something like +70 Elo for chess.

If, on the other hand, you mean 32 cores at 2.0 GHz, and doubling to get 64 cores at 2.0 GHz, then that is not a "doubling" with respect to alpha/beta because of search overhead, so you would not get +70 Elo.

I find it impossible to figure out which was meant, which makes it tough to choose a side.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Intel 48-core CPU scales to 1k cores

Post by bob »

rbarreira wrote:
Gian-Carlo Pascutto wrote: I'm not saying many cores are not valuable for Go. I'm saying they're exactly as valuable as for a chess searcher, and that it's a completely wrong illusion to think that Go is somehow fundamentally easier than chess for that.
Show me a chess searcher which gets near 100 elo per doubling (or even 50?) with 32 threads.
Again, define "doubling". If you mean 32 threads at 2.0 GHz and doubling that to 32 threads at 4.0 GHz, I can show you at least _one_ program that will gain +70 Elo...
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Intel 48-core CPU scales to 1k cores

Post by rbarreira »

bob wrote:
rbarreira wrote:
Gian-Carlo Pascutto wrote: I'm not saying many cores are not valuable for Go. I'm saying they're exactly as valuable as for a chess searcher, and that it's a completely wrong illusion to think that Go is somehow fundamentally easier than chess for that.
Show me a chess searcher which gets near 100 elo per doubling (or even 50?) with 32 threads.
Again, define "doubling". If you mean 32 threads at 2.0 GHz and doubling that to 32 threads at 4.0 GHz, I can show you at least _one_ program that will gain +70 Elo...
This discussion is only about multi-threading, not pure speed improvements. I was referring to a specific 32-threads vs 1-thread benchmark that I posted earlier in the thread.

The only time I mentioned a pure speed improvement was in the context that a pure speed improvement is an upper bound on the improvement from increasing the number of threads.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Intel 48-core CPU scales to 1k cores

Post by Gian-Carlo Pascutto »

rbarreira wrote:
Gian-Carlo Pascutto wrote: Impossible regardless of scaling because of the argument you already identified above.
It would help if you were a bit more explicit.
You asked for an example of a chess program that gets 100 ELO per doubling of cores. Given that a doubling is only worth 50-70 ELO (source: SSDF), this doesn't exist, because it would require superlinear speedups.

A Go doubling is worth about 120-150 ELO (source: scalability study from Don Dailey), so even less-than-linear speedups can reach it.

This means that comparing parallel scalability between chess and Go by comparing the relative ELO gains does not work, as worse parallelization of Go can still produce superior ELO gains.
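A quick back-of-the-envelope sketch of that argument (the per-doubling Elo values are the ones quoted in this thread; the code itself is just illustrative arithmetic):

```python
import math

def elo_gain(effective_speedup, elo_per_doubling):
    """Elo gained from an effective speedup, given what one
    full doubling is worth in this game."""
    return elo_per_doubling * math.log2(effective_speedup)

# Chess at ~70 Elo per doubling: +100 Elo from doubling the cores
# would require a superlinear effective speedup of ~2.69x.
print(round(2 ** (100 / 70), 2))

# Go at ~120 Elo per doubling: a sublinear ~1.78x effective speedup
# from the extra cores is already enough for +100 Elo.
print(round(2 ** (100 / 120), 2))
```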
You keep saying stuff without directly refuting anything or even saying what you think is the scaling of Chess and Go for a high number of threads. This is confusion, not debate.
I can say the same thing about the people claiming Go is easier to parallelize than chess!

There is little reliable data once you go higher than 8 threads, and I don't claim to have it. So I would strongly agree: there is confusion, which is why I think bold statements such as that Go scales better than chess are misguided!

What I do know is:

a) Most good results on many cores (including the one you quoted) aren't comparing the same algorithm in the serial vs the parallel case. This is flawed. For example, the paper from Chaslot I quoted earlier claims a speedup of 14 on 16 threads (an excellent result) for root parallelization. But if you look at what that parallelization is actually doing and what it is outperforming, you should grow suspicious: it's outperforming the serial case on 4 threads!

b) It's easy to be misguided about parallel speedup if the serial algorithm sucks. Scaling is easier with minimax than with alpha-beta. It's easier with alpha-beta than with PVS. It's easier without nullmove than with it. It's easier if you're slow than if you're fast. There are many examples of good speedups reported for chess that can now no longer be reproduced, or are understood to be bogus, because the serial implementation was crippled. Zugzwang was the best-known example, but most papers about parallelizing chess programs suffer from it in some way or another. I claim that, relatively speaking, current Go algorithms still suck. Which is no wonder, given how recent the current paradigm is. Point (a) is an illustration of that.

c) Scaling results for Go on 9x9 are not impressive (and no good results are reported by anyone), even though the branching factor is higher than in chess and it's also using a "Monte Carlo" algorithm.

d) The effective tree searched by an alpha-beta searcher using heavy LMR and a Go program with a small exploration constant are very similar: they heavily deepen the mainline and severely reduce the depth of any subtrees with lower ordering, until the score of the mainline drops or a subtree jumps up heavily. I don't know of any public results here, sorry. At some point it was proven that some classes of best-first searches are equivalent to alpha-beta with transposition tables, so it's not such a far out idea.

e) Simply running Monte Carlo simulations in parallel doesn't scale at all. The reasons are obvious if you study how a UCT engine works, and I cited a paper with results above. If you cannot parallelize the Monte Carlo itself, you must parallelize the tree search, at which point (c) limits the amount of difference you can get between chess and Go.

f) Theoretical branching factor is completely irrelevant to the parallel scaling, because the program does not have to investigate all moves. And in Go, you can in effect prune much more safely than in chess. I know that in Leela the average number of moves per position is 11 for a 19x19 board (instead of 350+). It would be interesting to see how many of those are really searched, but I think I'll wait with instrumenting that until someone produces a decent argument as to why I would be wrong.

So, in short, I see many similarities between chess and Go search trees and top programs, which raises the question of why there would be a significant difference.

People claiming Go is easy to parallelize seem to speak from the beginner's misconception that you can "just" run simulations in parallel, or that the number of moves is of any relevance, but unfortunately neither is true. So if they claim Go is magically more scalable, they should find better arguments.
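Point (e) above is easy to see in a minimal UCT sketch (hypothetical toy code, not taken from any actual engine): every playout result is backpropagated into the statistics that the very next selection step reads, so the simulations are sequentially dependent rather than independent jobs. Note also that the "evaluation" of a node is nothing but wins/visits, a by-product of the statistics:

```python
import math
import random

class Node:
    def __init__(self):
        self.wins = 0.0
        self.visits = 0
        self.children = []          # populated on expansion

def ucb1(child, parent_visits, c=1.4):
    """UCB1 score: win-rate (exploitation) plus exploration bonus."""
    if child.visits == 0:
        return float("inf")         # always try unvisited children first
    return (child.wins / child.visits
            + c * math.sqrt(math.log(parent_visits) / child.visits))

def uct_iteration(root):
    """One UCT iteration: select, simulate, backpropagate.
    The selection in the NEXT iteration depends on the statistics
    updated here, so the playouts are not independent."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ucb1(ch, node.visits))
        path.append(node)
    result = random.random() < 0.5  # placeholder for a random playout
    for n in path:                  # backpropagate the playout result
        n.visits += 1
        n.wins += 1.0 if result else 0.0
```

Running playouts blindly in parallel means many of them select from stale statistics, which is exactly why "just run the simulations in parallel" breaks down.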
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Intel 48-core CPU scales to 1k cores

Post by Gian-Carlo Pascutto »

bob wrote: Again, define "doubling". If you mean 32 threads at 2.0 GHz and doubling that to 32 threads at 4.0 GHz, I can show you at least _one_ program that will gain +70 Elo...
It depends on the game under consideration, which is why directly comparing ELO or %-score differences between games and drawing conclusions about parallelization is erroneous.

A tic-tac-toe program wouldn't "scale" in terms of ELO strength when given more cores, even if there is a real and effective parallelization of its tree search.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Intel 48-core CPU scales to 1k cores

Post by bob »

rbarreira wrote:
bob wrote:
rbarreira wrote:
Gian-Carlo Pascutto wrote: I'm not saying many cores are not valuable for Go. I'm saying they're exactly as valuable as for a chess searcher, and that it's a completely wrong illusion to think that Go is somehow fundamentally easier than chess for that.
Show me a chess searcher which gets near 100 elo per doubling (or even 50?) with 32 threads.
Again, define "doubling". If you mean 32 threads at 2.0 GHz and doubling that to 32 threads at 4.0 GHz, I can show you at least _one_ program that will gain +70 Elo...
This discussion is only about multi-threading, not pure speed improvements. I was referring to a specific 32-threads vs 1-thread benchmark that I posted earlier in the thread.

The only time I mentioned a pure speed improvement was in the context that a pure speed improvement is an upper bound on the improvement from increasing the number of threads.
Then what is the "doubling" measure? 16 to 32 threads? Of course that is not going to produce 70 Elo where it would if the 16 threads simply ran twice as fast per thread.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Intel 48-core CPU scales to 1k cores

Post by Don »

Dann Corbit wrote:
rbarreira wrote:
Gian-Carlo Pascutto wrote:
rbarreira wrote: But the average branching factor is higher in Go, correct? Around 200 instead of 35 or so I've heard. So it should be easier to parallelize a Go tree.
This is only true if you would not prune anything (minimax). See previous post.
Why is it only true without pruning? Even with pruning, having more moves to distribute among CPUs can't hurt, and probably helps. In chess, if you're at an ALL node and you have a 100-core CPU, you don't have enough moves for all the CPUs. In Go you would.

In any case, the Monte Carlo part can be parallelized very easily, which makes me wonder how you were so quick to dismiss the value of more cores for Go.
I suspect that the Monte Carlo approach for Go is a desperation move because the branching factor is so bad. Since the tree is too big to search effectively, lots of samples are taken to make an educated guess {I am extrapolating here, because I have not spent much time studying Go programs}.
The Monte Carlo approach in Go is an amazing breakthrough, nothing short of fantastic as it has quickly added the Go equivalent of hundreds of ELO to Go engines.

It's probably also a desperation move because the tree is pretty big for searching with alpha/beta.

This is only my opinion, but I believe the primary discovery with MCTS is not the tree part, but the fact that there is no explicit evaluation function - it is a by-product of the statistics that come out of Monte Carlo Tree Search. This seems to be a far better way to evaluate Go positions, since it's incredibly difficult to create a reasonable "chess style" evaluation function for Go.


On the other hand, if the Monte Carlo approach scales better with massive thread/process count, where are the ultra-strong (compared to single threaded) Go engines?

I guess that if it worked, we would see Beowulf cluster Go engines pounding the stuffings out of their few-threaded brethren.
There have been a number of exhibition matches against strong Go players, and they are almost always played with big clusters, or at least multi-core machines, and it does help substantially. It's just that Go is not there yet compared to chess. For example, if a 1980s program were run on a big cluster, it would still be very weak compared to a Rybka running on a single core today.

Nobody cares when a super-weak program improves to fairly weak. I'm roughly estimating that the best Go programs are comparable to the first weak chess expert machines and programs. However, the exhibition matches are always played against players who are hundreds of ELO stronger but with big handicaps, so they are just not very interesting. It's more of a curiosity than a real test.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Intel 48-core CPU scales to 1k cores

Post by diep »

Don wrote: The Monte Carlo approach in Go is an amazing breakthrough, nothing short of fantastic as it has quickly added the Go equivalent of hundreds of ELO to Go engines.
With all respect, but they were completely ignorant about how to produce a search in Go. All Go engines used to hard forward prune in the root, for example.

The discussion of Elo points and doubling in Go is a very difficult discussion, because Go is a totally different game from chess from a search viewpoint.

In chess, very unexpected moves can be the best move in a position; in Go this is nearly impossible. So at nearly every position you can hard forward prune, say, about 90% of all moves quite instantly.

So actually, a few years ago, giving more system time to Go programs made them play WORSE, not better.

I'm not joking.

They simply didn't search at all.

Then there is evaluation speed. I had designed an incremental evaluation function for Go, which did board control and kept track of groups incrementally for life & death (basically the most important consideration in Go); it ran at a speed of 3000+ nps. This for one week of programming.

Similar evaluations in Go programs ran at a speed of 100 evaluations a second on the same hardware.

The trick the 'world top' was using, just before Monte Carlo arrived, was a fast 'local tactical search'. So again, very hard pruning. Your program can't learn from the search in such a case.

So none of the top programs showed any effect that would let them play better with more system time, not to mention more cores.

If you move from that to a brute-force approach that, be it semi-randomly, considers all moves, needless to say you win 1000 Elo from a search viewpoint.

Also, please realize that this strength was reached before they figured out how to search; basically this means the evaluation functions some had written do a far better job than chess evaluations do.

Do you know any program that, at standard time controls, searching 100 nps with hard forward pruning in the root, gets a claimed 'first dan' Elo strength?

OK, that would of course also mean I'm first dan, as I can easily beat the Go software, whereas experienced Go players easily beat me.

I only play for life & death against the computer; against humans you stand no chance with that strategy, as they then try to win tactically against you instead of strategically, so they win all the fights everywhere.

So I would argue that software was really weak: it suffered badly from not having a decent (parallel) search, and had an even bigger problem with programming for speed.

You get what you pay for, of course; there is no full-time salary to be made with a Go engine other than by selling a few GUIs, and that basically sells to Western Europe and the USA, not to Asia at all. That market is basically completely closed.

Obviously, if a few chess programmers made some effort, they would show up with very sophisticated selective forms of search, at which point you can completely throw away anything that has to do with Monte Carlo.

If I wanted, I could write down a method here that gives you a completely superior form of search in Go, one that would beat Monte Carlo hands down.

Yet an important condition for such a search is that you search at least 20-30 plies deep, so that your evaluation function realizes what is happening.

Your search is then of course a lot more selective (way, way more) than in chess, but realize this is POSSIBLE in Go and IMPOSSIBLE to do in the same manner in chess. There is no way you can prune in the root in chess like you can in Go.

Nullmove also works far better in Go than in chess, of course. So it gives a huge branching-factor improvement there that it won't give in chess in the same manner.

Already in 1998 my Go program was reaching 10 plies deep, just using nullmove and no other pruning at all.

Yes, in the opening position.

10 plies in chess simply took longer.
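For readers unfamiliar with it, nullmove pruning is simple to sketch (toy code with a hypothetical position interface; legal_moves(), make(), pass_move(), and evaluate() are all assumptions, not from any actual engine): give the opponent a free move, search at reduced depth, and if the score still exceeds beta, cut off the whole node:

```python
def null_move_search(pos, depth, alpha, beta, R=3):
    """Toy negamax with null-move pruning. The position interface
    (legal_moves, make, pass_move, evaluate) is hypothetical.
    R is the null-move depth reduction."""
    if depth <= 0:
        return pos.evaluate()
    # Null move: pass, then search the opponent's reply at reduced
    # depth with a zero-width window around beta.
    score = -null_move_search(pos.pass_move(), depth - 1 - R,
                              -beta, -beta + 1)
    if score >= beta:
        return beta                 # fail-high: prune the whole node
    for move in pos.legal_moves():
        score = -null_move_search(pos.make(move), depth - 1,
                                  -beta, -alpha)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha
```

A real engine would skip the null move when in check or near zugzwang; the point of the sketch is only that the cheap reduced-depth "pass" search prunes huge parts of the tree, which works especially well when passing rarely changes the verdict, as in Go.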

So with some selectivity, the branching factor in Go isn't worse than in chess at all, as in Go you can really reduce most moves except for a handful, which is totally impossible in chess.

In chess, reductions of more than 1 ply get done by some programs, and I've extensively experimented with it, but it is really difficult to get to work.

AFAIK no top engine in chess today is doing this, except maybe Junior.

So a move costs either 1 ply or 2 plies in nearly all programs (and 0 plies when extended, of course).

In Go you can easily say that 90% of the moves have a cost of 8 plies. So a total reduction of 7 plies is peanuts there, as some moves are such crap that they are NEVER going to make it to best move.

Then, on top of that, with nullmove in Go and other forms of selectivity, you can easily search 30 plies in Go right from the start, and beat the very best Go players on a 19x19 board.
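The asymmetry described above can be sketched with a toy alpha-beta (again a hypothetical interface, not any real engine; the 90% / 8-ply numbers are taken from the post): the first few moves get full depth, everything else is reduced by 7 plies, so the effective branching factor collapses:

```python
def reduced_search(pos, depth, alpha, beta):
    """Toy negamax with heavy reductions: moves past the first ~10%
    are searched 7 plies shallower, i.e. they 'cost' 8 plies each.
    The position interface (legal_moves, make, evaluate) is
    hypothetical, and legal_moves() is assumed ordered best-first."""
    if depth <= 0:
        return pos.evaluate()
    moves = pos.legal_moves()
    for i, move in enumerate(moves):
        # full depth for the top ~10% of moves, depth - 8 for the rest
        reduction = 0 if i < max(1, len(moves) // 10) else 7
        score = -reduced_search(pos.make(move),
                                depth - 1 - reduction, -beta, -alpha)
        if score >= beta:
            return beta             # fail-high cutoff
        alpha = max(alpha, score)
    return alpha
```

With reduced moves terminating almost immediately, the tree below a 30-ply nominal depth is dominated by the narrow full-depth mainline, which is the sense in which such a search becomes affordable.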

I see no problem in doing that; however, you would have to invest a part of your life to produce such software and get no salary at all. Not from sales either.

You get what you pay for.

Go is not more complicated than chess. In Go the best go players try to beat the best go players. That's just as hard as the best chess players trying to beat the best chess players.

Computer Go is simpler than computer chess, however: beating the best chess programs is nearly impossible for someone who doesn't come from within that field, whereas in computer Go this is a lot easier.