Bad scaling of AlphaZero to long time control (LTC)?


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Post by Laskos »

Browsing the paper and the additional material again for 10 minutes, I stumbled upon this picture:

Image

Initially, it didn't attract much attention from me as the description begins with:

(B) Scalability of AlphaZero with thinking time compared with Stockfish.

So, at first quick glance, I inferred that A0 scales better than SF8 at all time controls.

But now, heck no, this is not what it shows. They vary only the TC used by A0 (1/3, 1/10, 1/30, 1/100 of TCEC time), keeping SF8 at the fixed TCEC time control. As I read these diagrams and tables fast (I am a very bright person, you know :lol: ), it instantly struck me that the scaling of A0 is bad, at least starting from 1/10 TCEC time, if not earlier. I don't have the results of these matches (maybe I missed them somewhere in the materials posted), only those of the 1:1 match of 1000 games from the single Initial Board position (pretty crazy and bad practice, I will show that later), but from the picture given:

1/10 TCEC time: A0 is ~14 Elo points stronger than SF8 TCEC time with ~86% draw rate
1/3 TCEC time: A0 is ~35 Elo points stronger than SF8 TCEC time with ~85% draw rate
1/1 TCEC time (here I saw raw numbers): A0 is 52 Elo points stronger than SF8 TCEC time with 84% draw rate.

From 1/10 TCEC TC to 1 TCEC TC, A0 improved by only some 40 Elo points, or some 12 Elo points per doubling.
This is bad. In TCEC conditions SF and Komodo show at least 25, more likely 30 Elo points per doubling.
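
For reference, the per-doubling arithmetic can be sketched as below. This is only a minimal sketch: the function name `elo_per_doubling` is made up for illustration, and the Elo values are the approximate ones read off the plot, not published numbers.

```python
import math

def elo_per_doubling(elo_short, elo_long, time_short, time_long):
    """Average Elo gained per doubling of thinking time between two time controls."""
    doublings = math.log2(time_long / time_short)
    return (elo_long - elo_short) / doublings

# Approximate values read off the plot, keyed by A0 time as a fraction of TCEC time.
# These are eyeballed estimates (assumptions), not published data.
points = {1 / 10: 14, 1 / 3: 35, 1.0: 52}

if __name__ == "__main__":
    # Gain per doubling from 1/10 to 1/1 of TCEC time
    print(round(elo_per_doubling(points[1 / 10], points[1.0], 1 / 10, 1.0), 1))
    # Gain per doubling from 1/3 to 1/1 of TCEC time
    print(round(elo_per_doubling(points[1 / 3], points[1.0], 1 / 3, 1.0), 1))
```

Going from 1/10 to 1/1 of TCEC time is log2(10), about 3.3 doublings, so a gain of 38 Elo points works out to a bit under 12 Elo per doubling.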

Also note that A0 conclusively beats SF8 at 1/10 TC, i.e. at a "Leela Ratio" of 0.1. Lc0 needs a "Leela Ratio" of about 0.3 or higher to beat SF8 in the same conditions. So A0 is significantly stronger than Lc0 with good nets, but both similarly scale badly to LTC (see http://talkchess.com/forum3/viewtopic.php?f=2&t=69068 about Lc0).

One thing to consider: all the games in this "scaling graph" start from a single position, the Initial Board position. I suppose each match consisted of 1000 games at those long time controls from the same initial position. Pretty crazy. It's easy to verify that the Initial Board position is one of the starting positions most favorable to Lc0 (and probably A0). This set-up can distort the outcome pretty heavily in many ways, and should have been avoided. For example, using this one position, one can get the impression that A0 is almost unbeatable when it is stronger. The draw rate also seems inflated. Example for Lc0:

Leela Ratio is about 2, short time control:

Initial Board position:
Score of lc0_v19_11261 vs SF8: 18 - 0 - 22 [0.725] 40
Elo difference: 168.40 +/- 67.96
Finished match

Lc0 seems unbeatable here. But from

Adam Hair's opening 4-mover PGN:
Score of lc0_v19_11261 vs SF8: 16 - 7 - 17 [0.613] 40
Elo difference: 79.53 +/- 84.63
Finished match

Lc0 performs almost 90 Elo points weaker (I showed that long ago in another thread), and has a significant number of losses, even though it is significantly stronger overall.
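
The Elo differences in the two match outputs follow from the scores via the usual logistic Elo formula. A minimal sketch (the helper name `elo_diff` is hypothetical; the error margins printed by the match tool are not reproduced here):

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference implied by a match score under the logistic Elo model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)

if __name__ == "__main__":
    initial = elo_diff(18, 0, 22)     # match from the Initial Board position
    four_mover = elo_diff(16, 7, 17)  # match from the 4-mover openings
    print(f"{initial:.2f} {four_mover:.2f} gap={initial - four_mover:.2f}")
```

With only 40 games per match the error margins are large, so the gap between the two results is indicative rather than precise.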

Their scaling results were obtained in these skewed conditions, so it's possible that the picture of bad scaling of A0 to LTC is inaccurate.
duncan
Posts: 12038
Joined: Mon Jul 07, 2008 10:50 pm

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by duncan »

Laskos wrote: Sun Dec 09, 2018 1:42 pm Their scaling results were obtained in these skewed conditions, so it's possible that the picture of bad scaling of A0 to LTC is inaccurate.

Has lc0 scaling been tested in Adam Hair's opening 4-mover PGN?
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by hgm »

This is the reverse of the conclusion that was reported initially, namely that the AlphaZero advantage over Stockfish would shrink to nothing at faster TC. So it is a bit suspect. Could it be that Stockfish scales even less than 12 Elo per doubling at this level of play? Unless this new AlphaZero has such a much improved NN that it is now much better at tactics, and no longer depends on a very long TC to avoid most possible blunders.

I am not sure if Elo is still a meaningful measure between engines that think so unlike each other. It would have been interesting to see a similar graph for AlphaZero self-play (with time odds). Perhaps then the result would drop much more spectacularly. AlphaZero, with its small search tree, probably can never compete tactically with Stockfish at any of these TCs. So it doesn't matter very much how large its tree is; if it stumbles on tactics it is screwed. It gets its wins by exploiting Stockfish's strategic mistakes, and it probably recognizes these as easily at fast TC.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by jp »

Laskos wrote: Sun Dec 09, 2018 1:42 pm But now, heck no, this is not what it shows. They vary only the TC used by A0 (1/3, 1/10, 1/30, 1/100 of TCEC time), keeping SF8 at the fixed TCEC time control. As I read these diagrams and tables fast (I am a very bright person, you know :lol: ), it instantly struck me that the scaling of A0 is bad, at least starting from 1/10 TCEC time, if not earlier. I don't have the results of these matches (maybe I missed them somewhere in the materials posted), only those of the 1:1 match of 1000 games from the single Initial Board position (pretty crazy and bad practice, I will show that later), but from the picture given:

1/10 TCEC time: A0 is ~14 Elo points stronger than SF8 TCEC time with ~86% draw rate
1/3 TCEC time: A0 is ~35 Elo points stronger than SF8 TCEC time with ~85% draw rate
1/1 TCEC time (here I saw raw numbers): A0 is 52 Elo points stronger than SF8 TCEC time with 84% draw rate.

From 1/10 TCEC TC to 1 TCEC TC, A0 improved by only some 40 Elo points, or some 12 Elo points per doubling.
This is bad. In TCEC conditions SF and Komodo show at least 25, more likely 30 Elo points per doubling.
Well spotted, Kai.

No, I don't think you missed something. They don't give numbers for anything plotted there. We can only try to read off numbers from the plots.

The preprint figure (from a year ago) claiming A0 scales better than SF has disappeared from the final paper.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by Milos »

hgm wrote: Sun Dec 09, 2018 2:15 pm This is the reverse of the conclusion that was reported initially, namely that the AlphaZero advantage over Stockfish would shrink to nothing at faster TC. So it is a bit suspect. Could it be that Stockfish scales even less than 12 Elo per doubling at this level of play? Unless this new AlphaZero has such a much improved NN that it is now much better at tactics, and no longer depends on a very long TC to avoid most possible blunders.
No, ofc not. It only means that the scaling of SF they "showed" in the preprint was totally bogus, as demonstrated multiple times.
SF even gains 30 Elo per core doubling from 16 cores onwards in STC, and certainly more than 30 Elo per time doubling in LTC.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by Milos »

Laskos wrote: Sun Dec 09, 2018 1:42 pm But now, heck no, this is not what it shows. They vary only the TC used by A0 (1/3, 1/10, 1/30, 1/100 of TCEC time), keeping SF8 at the fixed TCEC time control. As I read these diagrams and tables fast (I am a very bright person, you know :lol: ), it instantly struck me that the scaling of A0 is bad, at least starting from 1/10 TCEC time, if not earlier. I don't have the results of these matches (maybe I missed them somewhere in the materials posted), only those of the 1:1 match of 1000 games from the single Initial Board position (pretty crazy and bad practice, I will show that later), but from the picture given:

1/10 TCEC time: A0 is ~14 Elo points stronger than SF8 TCEC time with ~86% draw rate
1/3 TCEC time: A0 is ~35 Elo points stronger than SF8 TCEC time with ~85% draw rate
1/1 TCEC time (here I saw raw numbers): A0 is 52 Elo points stronger than SF8 TCEC time with 84% draw rate.

From 1/10 TCEC TC to 1 TCEC TC, A0 improved by only some 40 Elo points, or some 12 Elo points per doubling.
This is bad. In TCEC conditions SF and Komodo show at least 25, more likely 30 Elo points per doubling.

Also note that A0 conclusively beats SF8 at 1/10 TC, i.e. at a "Leela Ratio" of 0.1. Lc0 needs a "Leela Ratio" of about 0.3 or higher to beat SF8 in the same conditions. So A0 is significantly stronger than Lc0 with good nets, but both similarly scale badly to LTC (see http://talkchess.com/forum3/viewtopic.php?f=2&t=69068 about Lc0).
Ofc it does scale terribly badly. It was already clear from the preprint, but ppl failed to notice it coz Google showed a totally bogus SF scaling figure in the same graph.
Anyone expecting that MCTS scaling (especially using an averaging backup operator instead of minimax) will be superior to A/B once A/B reaches high depths just doesn't have much clue about computer chess search (which seems to be the case with DM ppl).
Using a small net like 64x6 and A/B with RTX cards such as dual 2080 Ti would yield a much stronger engine than the current Lc0. One would need to minimize the parallelization loss of batching in a smart way (which might be difficult though), but the rest should be quite straightforward.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by Laskos »

hgm wrote: Sun Dec 09, 2018 2:15 pm This is the reverse of the conclusion that was reported initially, namely that the AlphaZero advantage over Stockfish would shrink to nothing at faster TC. So it is a bit suspect. Could it be that Stockfish scales even less than 12 Elo per doubling at this level of play? Unless this new AlphaZero has such a much improved NN that it is now much better at tactics, and no longer depends on a very long TC to avoid most possible blunders.

I am not sure if Elo is still a meaningful measure between engines that think so unlike each other. It would have been interesting to see a similar graph for AlphaZero self-play (with time odds). Perhaps then the result would drop much more spectacularly. AlphaZero, with its small search tree, probably can never compete tactically with Stockfish at any of these TCs. So it doesn't matter very much how large its tree is; if it stumbles on tactics it is screwed. It gets its wins by exploiting Stockfish's strategic mistakes, and it probably recognizes these as easily at fast TC.
Yes, I had the same concern about the validity of the Elo model. Self-play scaling would be interesting to see too, but it also does not tell the whole story.
I have now improvised something quick and fairly extreme: in 80% of the games Lc0 (or A0, doesn't matter) performs at the 3500 Elo level, and in 20% of the games it plays like a bit of a patzer due to blunders, performing at the 2000 Elo level. Although the score against a given opponent is easy to get, the Elo as a function of the opponent's Elo is not that trivial. Anyway, here is the plot for this model:

Image

Lc0 reaches a 50% score against a regular engine of 3411 Elo points. A significantly weaker regular engine (weaker against other regular engines) gets its weakness compressed by a factor of 2 or so when matched against Lc0; the ratings are compressed. So a factor of 2 is possible in that scaling calculation. But this example is pretty extreme, as hardly 20% of A0's games are patzer-like. And even if it were possible, another problem is that a factor of 2 is probably not enough. For regular engines like SF and Komodo, 25-30 Elo points per doubling is the figure in TCEC conditions; going from 1/10 TCEC conditions to TCEC conditions, it would be 35 or more Elo points per doubling. One has to assume pretty wild models to get A0 scaling to LTC similarly to SF or Komodo. Probably A0 does scale worse to LTC, for the reasons you also stated.

Edit:
For Lc0 I am using "Elo" not as a rating in a pool of regular engines, but just as a performance derived from the score percentage. For regular engines, "Elo" obeys the "Elo model" in a pool of regular engines.
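
The two-mode model described in this post can be sketched roughly as follows. This is a sketch under stated assumptions: the logistic Elo expectation is used for each mode, all function names are made up for illustration, and the 400 Elo gap used for the compression check is an arbitrary choice.

```python
import math

def expected_score(own_elo, opp_elo):
    """Logistic Elo expected score of a player rated own_elo against opp_elo."""
    return 1.0 / (1.0 + 10.0 ** ((opp_elo - own_elo) / 400.0))

def mixture_score(opp_elo, modes=((0.8, 3500.0), (0.2, 2000.0))):
    """Expected score of an engine that plays at each (weight, elo) mode."""
    return sum(w * expected_score(elo, opp_elo) for w, elo in modes)

def performance_elo(opp_elo):
    """Elo difference implied by the mixture's score against opp_elo."""
    s = mixture_score(opp_elo)
    return -400.0 * math.log10(1.0 / s - 1.0)

def break_even_elo(lo=2000.0, hi=3500.0, tol=1e-6):
    """Bisect for the opponent rating at which the mixture scores 50%."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # The mixture's score falls as the opponent gets stronger.
        if mixture_score(mid) > 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

if __name__ == "__main__":
    even = break_even_elo()
    weaker = even - 400.0  # a regular engine 400 Elo below the break-even one
    print(f"break-even opponent: {even:.0f}")
    print(f"compression: {400.0 / performance_elo(weaker):.2f}")
```

With these assumed modes the break-even opponent comes out near 3411 Elo, and a regular engine 400 Elo weaker than that ends up only about 200 Elo behind in performance terms, i.e. a compression factor of about 2.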
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by Laskos »

Also, for Lc0 I found almost unequivocally that it scales worse to LTC than SF10. No such deviations from the Elo model can account for an inversion of Elo differences (from stronger to weaker). I guess A0 can exhibit the same. From that picture they posted, I would be very curious how A0 at 1/3 TCEC time performs against SF8 also at 1/3 TCEC time. I would almost bet that A0 performs better at 1/3 : 1/3 than at 1:1, which would almost surely denote worse scaling.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by jp »

Milos wrote: Sun Dec 09, 2018 3:53 pm
Laskos wrote: Sun Dec 09, 2018 1:42 pm
From 1/10 TCEC TC to 1 TCEC TC, A0 improved by only some 40 Elo points, or some 12 Elo points per doubling.
This is bad. In TCEC conditions SF and Komodo show at least 25, more likely 30 Elo points per doubling.
Ofc it does scale terribly badly. It was already clear from the preprint, but ppl failed to notice it coz Google showed a totally bogus SF scaling figure in the same graph.
Maybe that's why the preprint graph vanished from the final paper.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Bad scaling of AlphaZero to long time control (LTC)?

Post by Laskos »

Milos wrote: Sun Dec 09, 2018 3:53 pm
Laskos wrote: Sun Dec 09, 2018 1:42 pm But now, heck no, this is not what it shows. They vary only the TC used by A0 (1/3, 1/10, 1/30, 1/100 of TCEC time), keeping SF8 at the fixed TCEC time control. As I read these diagrams and tables fast (I am a very bright person, you know :lol: ), it instantly struck me that the scaling of A0 is bad, at least starting from 1/10 TCEC time, if not earlier. I don't have the results of these matches (maybe I missed them somewhere in the materials posted), only those of the 1:1 match of 1000 games from the single Initial Board position (pretty crazy and bad practice, I will show that later), but from the picture given:

1/10 TCEC time: A0 is ~14 Elo points stronger than SF8 TCEC time with ~86% draw rate
1/3 TCEC time: A0 is ~35 Elo points stronger than SF8 TCEC time with ~85% draw rate
1/1 TCEC time (here I saw raw numbers): A0 is 52 Elo points stronger than SF8 TCEC time with 84% draw rate.

From 1/10 TCEC TC to 1 TCEC TC, A0 improved by only some 40 Elo points, or some 12 Elo points per doubling.
This is bad. In TCEC conditions SF and Komodo show at least 25, more likely 30 Elo points per doubling.

Also note that A0 conclusively beats SF8 at 1/10 TC, i.e. at a "Leela Ratio" of 0.1. Lc0 needs a "Leela Ratio" of about 0.3 or higher to beat SF8 in the same conditions. So A0 is significantly stronger than Lc0 with good nets, but both similarly scale badly to LTC (see http://talkchess.com/forum3/viewtopic.php?f=2&t=69068 about Lc0).
Ofc it does scale terribly badly. It was already clear from the preprint, but ppl failed to notice it coz Google showed a totally bogus SF scaling figure in the same graph.
Anyone expecting that MCTS scaling (especially using an averaging backup operator instead of minimax) will be superior to A/B once A/B reaches high depths just doesn't have much clue about computer chess search (which seems to be the case with DM ppl).
Using a small net like 64x6 and A/B with RTX cards such as dual 2080 Ti would yield a much stronger engine than the current Lc0. One would need to minimize the parallelization loss of batching in a smart way (which might be difficult though), but the rest should be quite straightforward.
By now, all strong engines using MCTS (Komodo MCTS, Lc0 and A0) seem to scale badly to LTC. Maybe there are some tuning problems, let's see, but I guess reverting to AB with better parallelization would be a clearer way forward.