LC0 on 43 cores had a ~2700 CCRL ELO performance.


Ras
Posts: 2487
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by Ras »

Daniel Shawul wrote: OK, let's assume A0 has to run on 4 TPUs for some reason unknown to me, then to be fair (based on flops)
Flops are irrelevant, and not only because Stockfish runs integer math. During the match it was 4 TPUs, not 40. Just like Stockfish was developed with MUCH more computing power than it ran on during the match.
Both the GPU and TPU numbers I used for my calculations are theoretical flops, so I don't see why that matters.
Because GPU flops are not the same as CPU flops. Actually, that is why GPUs exist at all. To make use of GPU flops, you need an algorithm that performs the same operation on a lot of data; GPU flops cannot be applied to arbitrary, branchy work the way CPU flops can.

The technology and the algorithms are so different that the only thing you can actually measure and compare is how much power the system draws from the wall, and in that regard, they were in the same ballpark.
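
As a toy illustration of that difference (my own hypothetical example, not anyone's engine code), the first snippet below is the kind of uniform, batchable arithmetic an accelerator is built for, and the second is search-style work whose control flow depends on data it has just computed:

```python
# Toy illustration only: accelerators want the same operation applied to a big
# batch of data; a search's next step depends on the result of the previous one.
import numpy as np

# Accelerator-friendly: one uniform multiply-accumulate over a whole batch.
weights = np.random.rand(256, 256).astype(np.float32)
batch = np.random.rand(1024, 256).astype(np.float32)
activations = batch @ weights      # a single streamable matrix product

# Search-style work: control flow decided by data computed a moment earlier.
def toy_minimax(values, depth):
    if depth == 0 or len(values) <= 1:
        return values[0] if values else 0.0
    best = float("-inf")
    for v in values:
        score = v - toy_minimax(values[1:], depth - 1)  # data-dependent recursion
        if score > best:                                # branch on the result
            best = score
    return best

print(activations.shape, toy_minimax([0.3, -0.1, 0.7], 3))
```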
mirek
Posts: 52
Joined: Sat Mar 24, 2018 4:18 pm

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by mirek »

Daniel Shawul wrote: c) Hardware differences. It has become very clear to me that this result was achieved via massive hardware acceleration of a very slow eval. Theoretically, Deep Blue could also have achieved this result with their FPGA. Admittedly, their approach is cost-effective given that the future is cheap manycore architectures like GPUs.
Have you not read my post on the 1st page? If a better eval could be performed in equal time, that would obviously help, but even more helpful to the overall strength of A0 is reducing the branching factor during search, as this provides an exponential "speed-up" when done well. That's the main reason why alpha-beta engines typically have relatively simple eval functions: searching deeper (thanks to a fast eval) is much more useful than having a super-duper precise but slow eval function. Depth of search is where the "true" strength comes from, and by reducing the branching factor you get to much greater depths. So even if you could, say, do a super-duper eval for Deep Blue in FPGA without slowing its nodes-per-second metric, I am almost sure that with its brute-force approach to pruning it would still be much weaker than e.g. Stockfish 8.
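
A back-of-the-envelope sketch of that exponential effect (the branching factors below are illustrative round numbers I picked, not measurements of any particular engine):

```python
# Nodes needed to reach depth d grow roughly like b**d, so shrinking the
# effective branching factor b pays off exponentially with depth.
for depth in (10, 20, 30):
    full_width = 35 ** depth     # ~35 legal moves, no pruning at all
    pruned = 2 ** depth          # EBF ~2, in the ballpark of modern alpha-beta engines
    print(f"depth {depth}: {full_width / pruned:.1e}x fewer nodes with the lower EBF")
```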
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by Daniel Shawul »

mirek wrote:
Daniel Shawul wrote: c) Hardware differences. It has become very clear to me that this result was achieved via massive hardware acceleration of a very slow eval. Theoretically, Deep Blue could also have achieved this result with their FPGA. Admittedly, their approach is cost-effective given that the future is cheap manycore architectures like GPUs.
Have you not read my post on the 1st page? If a better eval could be performed in equal time, that would obviously help, but even more helpful to the overall strength of A0 is reducing the branching factor during search, as this provides an exponential "speed-up" when done well. That's the main reason why alpha-beta engines typically have relatively simple eval functions: searching deeper (thanks to a fast eval) is much more useful than having a super-duper precise but slow eval function. Depth of search is where the "true" strength comes from, and by reducing the branching factor you get to much greater depths. So even if you could, say, do a super-duper eval for Deep Blue in FPGA without slowing its nodes-per-second metric, I am almost sure that with its brute-force approach to pruning it would still be much weaker than e.g. Stockfish 8.
They had no option but to use MCTS, not because it is better.
It is because they were getting only 80,000 nodes/s even on 4 TPUs. With that nps, a full-width alpha-beta search is stuck at the search depths engines reached in the 90s. That brings up tactical problems, which they minimized with massive hardware; it annoys me that there is no mention of this in the arxiv paper. Saying "yes, there is a problem that could be exploited by a tactical engine, but we solved it with massive hardware" would have been enough. The kind of tactical mistakes Leela Zero made on a 48-core TCEC machine speaks loudly about this problem.
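
A quick, hypothetical calculation of that nps-versus-depth point (80,000 nodes/s is the figure quoted above; the branching factors are illustrative, not measured):

```python
import math

nps, seconds_per_move = 80_000, 60
budget = nps * seconds_per_move      # ~4.8 million nodes for one move

# Depth reachable when b**d is about the node budget: d ~ log(budget) / log(b)
for label, b in (("full width, ~35 moves", 35),
                 ("1990s-style pruning, EBF ~6", 6),
                 ("modern alpha-beta, EBF ~2", 2)):
    print(f"{label}: about {int(math.log(budget) / math.log(b))} plies")
```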

I have a combined alpha-beta + MCTS search in Scorpio, in which the latter is used for strategy (looking at specific lines very deep), while the former is used to increase tactical awareness at shallow depth. If you use MCTS alone, you will suffer from tactical problems; even 100x more time won't solve a 7-ply trap. Once you ensure there are no tactical problems, of course MCTS might be better than LMR + null move. In fact, scorpio-mcts-min, which uses MCTS, actually beat an alpha-beta rollouts search algorithm with null move and LMR, so I know how effective it can be.
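
A minimal, hypothetical sketch of that general idea on a toy game (this is not Scorpio's code; the game, names and constants are all mine): an MCTS driver that runs a shallow alpha-beta search whenever it expands a leaf, so short tactics feed into the node statistics instead of having to be rediscovered one rollout at a time.

```python
import math

class Nim:
    """Toy game: players alternately remove 1-3 stones; taking the last stone wins."""
    def __init__(self, stones=15):
        self.stones = stones
    def moves(self):
        return [m for m in (1, 2, 3) if m <= self.stones]
    def play(self, m):
        return Nim(self.stones - m)
    def terminal(self):
        return self.stones == 0
    def result(self):
        return -1.0   # side to move has no stones left: the previous player won

def ab_search(pos, depth, alpha=-1.0, beta=1.0):
    """Shallow negamax alpha-beta used as a tactical check at MCTS leaves."""
    if pos.terminal():
        return pos.result()
    if depth == 0:
        return 0.0    # "quiet" heuristic value in this toy game
    best = -1.0
    for m in pos.moves():
        best = max(best, -ab_search(pos.play(m), depth - 1, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:
            break
    return best

class Node:
    def __init__(self, pos):
        self.pos, self.children, self.n, self.w = pos, {}, 0, 0.0

def uct_child(node, c=1.4):
    # Child values are stored from the child's side to move, hence the minus sign.
    return max(node.children.values(),
               key=lambda ch: (-ch.w / ch.n if ch.n else 0.0)
                              + c * math.sqrt(math.log(node.n + 1) / (ch.n + 1)))

def simulate(root, ab_depth=3):
    """One MCTS iteration: select, expand, evaluate the leaf with alpha-beta, back up."""
    node, path = root, [root]
    while node.children and not node.pos.terminal():
        node = uct_child(node)
        path.append(node)
    if node.pos.terminal():
        value = node.pos.result()
    else:
        for m in node.pos.moves():
            node.children[m] = Node(node.pos.play(m))
        value = ab_search(node.pos, ab_depth)      # tactical check at the new leaf
    for nd in reversed(path):                      # back up, flipping sign each ply
        nd.n += 1
        nd.w += value
        value = -value

root = Node(Nim(15))
for _ in range(2000):
    simulate(root)
print("most visited move:", max(root.children, key=lambda m: root.children[m].n))
# (For Nim(15) the winning move is to take 3, leaving a multiple of 4.)
```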
mirek
Posts: 52
Joined: Sat Mar 24, 2018 4:18 pm

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by mirek »

Daniel Shawul wrote:
They had no option but to use MCTS, not because it is better.
It is because they were getting only 80,000 nodes/s even on 4 TPUs. With that nps, a full-width alpha-beta search is stuck at the search depths engines reached in the 90s. That brings up tactical problems, which they minimized with massive hardware; it annoys me that there is no mention of this in the arxiv paper. Saying "yes, there is a problem that could be exploited by a tactical engine, but we solved it with massive hardware" would have been enough. The kind of tactical mistakes Leela Zero made on a 48-core TCEC machine speaks loudly about this problem.
The tactical problem of A0 is there only if we are speaking of very short time controls. And anyone can see the strength-scaling graphs in the paper; from them it is fairly clear how the strength of SF8 and A0 scales with thinking time and with total nodes searched. So it's not as if they kept it a secret that the strength of A0 drops rapidly as the total node count approaches zero, or that A0 on a 1080 Ti at 1 sec/move would be much weaker than SF8 on 64 cores, while at 1 min/move they would be similar in strength. The only thing not explicitly mentioned is that the reason for such scaling at low node counts is "tactical vulnerability", but that should be more or less a given.

Also, speaking of details that are "not explicitly mentioned", it seems to me you are overly concerned with tactical vulnerabilities that are present only at short time controls, while ignoring the fact that A0 + 1080 Ti at 1 min/move or above doesn't suffer from tactical vulnerabilities (unless you also want to call 64-core SF8 at 1 min/move tactically vulnerable, or assume that A0 can match SF8's strength and still somehow be tactically "inferior").

And you can't compare the level LC0 is at right now with A0. A month ago LC0 was making far worse tactical blunders; extrapolate that into the future and I think it should be clear what the correct conclusion is.

So if 1 sec/move or engine bullet games are your thing, then sure, A0 will suck there on consumer hardware for quite some time. If, on the other hand, you are more inclined towards LTC, then clearly the A0 approach is the way to go. I mean, if they had made it a regular 120 min / 40 moves + 30 sec increment match with proper time management, SF8 would probably have lost by even more than 100 Elo.
mirek
Posts: 52
Joined: Sat Mar 24, 2018 4:18 pm

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by mirek »

Daniel Shawul wrote: If you use MCTS alone, you will suffer from tactical problems; even 100x more time won't solve a 7-ply trap.
If we are speaking about A0, that is only true if the search-guiding NN fails to recognize the pattern and realize that such a trap may be there. And obviously the NN can fail to recognize such a pattern, just as e.g. the null-move heuristic can fail at zugzwang detection, but the idea is that most of the time, when a 7-ply trap is there, a properly trained NN will recognize it. And this must be the case with A0, otherwise it couldn't be nearly as strong at such a low nps.
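
For reference, this is mechanically how the prior can do that under the PUCT selection rule described in the AlphaZero paper: at low visit counts the prior P(s,a) from the NN dominates the choice, so a move whose motif the network recognizes gets explored immediately. A small sketch with made-up names and numbers (nothing here is taken from any actual engine):

```python
import math

def puct_score(q, prior, visits, parent_visits, c_puct=1.5):
    """Q(s,a) + c_puct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))"""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)

# move -> (Q so far, NN prior, visit count); the parent has 40 visits in total
children = {
    "quiet_move": (0.05, 0.10, 40),
    "trap_motif": (0.00, 0.55, 0),    # network pattern-matches the tactical motif
}
best = max(children, key=lambda m: puct_score(*children[m], parent_visits=40))
print(best)   # -> trap_motif: the high prior wins despite zero visits
```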
duncan
Posts: 12038
Joined: Mon Jul 07, 2008 10:50 pm

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by duncan »

Daniel Shawul wrote:
OK, let's assume A0 has to run on 4 TPUs for some reason unknown to me, then to be fair (based on flops) they would have to give Stockfish 180x64 = 11520 cores, not just 64 ...
What about running A0 on 4 TPUs and adding to Stockfish the extra Elo it would have gained if it had had 11,520 cores?

Would that be fair? And how many extra Elo would it have gained?
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by Milos »

Pio wrote: It would have been interesting if they had swapped hardware. I wonder how good Stockfish would have been on TPUs ;)
Exactly as it was on its "own" hardware, because on top of those 4 TPUs A0 also used the same 64-core machine to run those 70k MCTS searches per second (which they somehow "forgot" to even mention in the paper), because TPUs can only do dot products and absolutely nothing else.
David Xu
Posts: 47
Joined: Mon Oct 31, 2016 9:45 pm

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by David Xu »

Milos wrote:
Pio wrote: It would have been interesting if they had swapped hardware. I wonder how good Stockfish would have been on TPUs ;)
Exactly as it was on its "own" hardware, because on top of those 4 TPUs A0 also used the same 64-core machine to run those 70k MCTS searches per second (which they somehow "forgot" to even mention in the paper), because TPUs can only do dot products and absolutely nothing else.
Source for this claim?
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by noobpwnftw »

David Xu wrote:
Milos wrote:
Pio wrote: It would have been interesting if they had swapped hardware. I wonder how good Stockfish would have been on TPUs ;)
Exactly as it was on its "own" hardware, because on top of those 4 TPUs A0 also used the same 64-core machine to run those 70k MCTS searches per second (which they somehow "forgot" to even mention in the paper), because TPUs can only do dot products and absolutely nothing else.
Source for this claim?
I think it used somewhere around 19 CPU cores to drive those 4 TPUs in AG0, no? At least that's what Wikipedia says, which might be inaccurate, but it is still better than the papers, which give hardly any indication of how much processing power the input pre-processing requires.

To make a rough guess from Leela training on a 1080 Ti card: it takes about 2 CPU cores to keep the GPU at full usage, yielding some 5k rollouts per second, so to get 80k you probably need some 32 cores, not counting tree-traversal overhead and I/O throughput at that scale.
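
Spelled out (using the rough figures above, which are an estimate, not a benchmark):

```python
# 2 CPU cores keep one GPU fed at about 5k rollouts/s (rough estimate above).
rollouts_per_2_cores = 5_000
target = 80_000
print(2 * target / rollouts_per_2_cores, "cores, before tree and I/O overhead")  # 32.0
```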

You might just think it can hyperscale to infinity with only more GPUs, much like those supercomputers could do anything useful with their coprocessors peaking at a couple hundred TFLOPS while being coordinated over a telephone line.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: LC0 on 43 cores had a ~2700 CCRL ELO performance.

Post by corres »

[quote="George Tsavdaris"]

Only the small hash was a stupid decision to use but how much it could affect its strength? 5 ELO?

[/quote]

During the discussion you are forgetting an important thing:
The NN is not only an instrument to replace the static evaluation of standard chess engines. The NN behaves like a dynamic opening AND middlegame book. The quality and extent of this dynamic book depend on the training of the NN. The chess strength of this MACHINE (and not engine!) greatly depends on the computing power and time spent teaching the NN, that is, on building this dynamic book.
So making a really well-founded comparison between A0 and Stockfish is hopeless. Only practical viewpoints (e.g. price per Elo point, in the far (or near?) future) may be a good standpoint.