Alphazero news

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Alphazero news

Post by matthewlai »

yanquis1972 wrote: Fri Dec 07, 2018 5:06 am
matthewlai wrote: Fri Dec 07, 2018 2:24 am
jp wrote: Fri Dec 07, 2018 2:20 am
matthewlai wrote: Fri Dec 07, 2018 2:15 am
Daniel Shawul wrote: Fri Dec 07, 2018 12:35 am While I sympathize with that statement, releasing A0 source code and networks for anyone to test sounds better.
Many will not be satisfied with in-house testing with supposedly fair conditions.
That would be good, but it would also be a lot of work for us (AZ is tightly-coupled with DM and Google's systems) for not really much value to the scientific community. We feel that it's our ideas and algorithms that are important, not our implementation. That's why we have published all the algorithms we developed in detail, with almost-runnable pseudo-code, so that they can be replicated easily.
What were the best values/functions for CPUCT used for playing & training?
They are all in the pseudo-code in supplementary materials.

Code: Select all

class AlphaZeroConfig(object):

  def __init__(self):
    ### Self-Play
    self.num_actors = 5000

    self.num_sampling_moves = 30
    self.max_moves = 512  # for chess and shogi, 722 for Go.
    self.num_simulations = 800

    # Root prior exploration noise.
    self.root_dirichlet_alpha = 0.3  # for chess, 0.03 for Go and 0.15 for shogi.
    self.root_exploration_fraction = 0.25

    # UCB formula
    self.pb_c_base = 19652
    self.pb_c_init = 1.25

    ### Training
    self.training_steps = int(700e3)
    self.checkpoint_interval = int(1e3)
    self.window_size = int(1e6)
    self.batch_size = 4096

    self.weight_decay = 1e-4
    self.momentum = 0.9
    # Schedule for chess and shogi, Go starts at 2e-2 immediately.
    self.learning_rate_schedule = {
        0: 2e-1,
        100e3: 2e-2,
        300e3: 2e-3,
        500e3: 2e-4
    }
I just read that as code for training; would the 1.25 value apply to match play, and does it correspond to lc0's search variable?
It applies in both match play and training. If you search for it in the pseudo-code you can see how it's used. I haven't looked at lc0 code so I don't know what it corresponds to.
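For reference, those two constants enter the exploration term of the UCB score roughly like this (a sketch following the structure of the pseudo-code; Node is assumed to expose visit_count, prior and value()):

Code: Select all

import math

def ucb_score(config: AlphaZeroConfig, parent: Node, child: Node) -> float:
  # The exploration coefficient grows logarithmically with the parent's visit
  # count, starting from pb_c_init (1.25) and scaled by pb_c_base (19652).
  pb_c = math.log((parent.visit_count + config.pb_c_base + 1) /
                  config.pb_c_base) + config.pb_c_init
  pb_c *= math.sqrt(parent.visit_count) / (child.visit_count + 1)

  prior_score = pb_c * child.prior  # policy prior steers exploration
  value_score = child.value()       # mean value of the child's subtree
  return prior_score + value_score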
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Alphazero news

Post by matthewlai »

mwyoung wrote: Fri Dec 07, 2018 9:34 am
hgm wrote: Fri Dec 07, 2018 9:16 am
mwyoung wrote: Fri Dec 07, 2018 7:51 am a Core i7 4770K has 45 GFlops per core, a gen 3 TPU has 45 TFlops, which is 1000x the speed for the chosen task (per core)
But AlphaZero was using gen 1 TPUs, right? IIRC these had 0 GFlops, as they could not do floating point at all.
I was told gen 3. But it did not say in the information posted. Here is what was posted on the site.

For the games themselves, Stockfish used 44 CPU (central processing unit) cores and AlphaZero used a single machine with four TPUs and 44 CPU cores. Stockfish had a hash size of 32GB and used syzygy endgame tablebases.

BTW: I need to buy a 2080 Ti, since 2 TPUs are equal to one 2080 Ti.
AlphaZero got 44 cores just because that's the machine they ran on. The games were run on a 44-core + 4 first-gen TPU machine, no pondering.

So I guess you could say Stockfish got 4 TPUs, too, but it would be a bit cheesy to say since SF cannot make use of them.

AlphaZero is not CPU-bound. Most of the cores are idle during play.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
OneTrickPony
Posts: 157
Joined: Tue Apr 30, 2013 1:29 am

Re: Alphazero news

Post by OneTrickPony »

Going through the games, it's clear Alpha Zero exposes a weakness in SF's long-term planning. It gets very promising positions with long-term pressure on a regular basis, which it is sometimes able to convert. It seems to be weaker tactically; maybe a stronger engine would be able to get even more wins.

I am not convinced the newest SF would win against it. Elo is calculated against a pool of similar engines; it's not clear whether 50 or 100 Elo more against this pool equals 50-100 Elo more against an opponent of a different type.
Due to architectural differences and the difficulty of coming up with a definition of fair hardware conditions, I don't think it's very important whether Alpha Zero on 4 TPUs is stronger than the newest SF on 44 cores or whatever. What is important is that the games show there are paths in chess which SF is still unable to understand, and the losses are very different from just playing against another alpha-beta engine on 4x or 10x the hardware.

I also feel those games, even the drawn ones, are more interesting than the average top GM game. It starts to look like very human risk aversion, not the theoretical limitations of chess, is what causes so many quick draws and uneventful games.
crem
Posts: 177
Joined: Wed May 23, 2018 9:29 pm

Re: Alphazero news

Post by crem »

The paper says that during training, moves are selected "in proportion to the root visit count", without mentioning that this happens only for the first 30 plies (so I assume it happens for the entire game).

However, in the training pseudocode it looks like there is a temperature cutoff at ply 30:

Code: Select all

def select_action(config: AlphaZeroConfig, game: Game, root: Node):
  visit_counts = [(child.visit_count, action)
                  for action, child in root.children.iteritems()]
  if len(game.history) < config.num_sampling_moves:
    _, action = softmax_sample(visit_counts)
  else:
    _, action = max(visit_counts)
  return action
matthewlai, would it be possible to clarify whether a temperature cutoff was used during training or not?


There is a temperature cutoff in the description of the play versus Stockfish though (not during training), so maybe that's what was meant in the pseudocode:
"by softmax sampling with a temperature of 10.0 among moves for which the value was no more than 1% away from the best move for the first 30 plies".

That raises some more questions (not as important as the training question though):
- "Softmax sampling with a temperature of 10" is a bit ambiguous; my best guess is that it means "proportional to exp(N / 10)".
- Values not more than 1% away: is that the value of Q or of N? (I guess it's Q?)
- If it's Q, what does "1% away" mean? Is it just 1% of the Q range (i.e. 0.02, as Q goes from -1 to 1; e.g. if Q for the best move is -0.015, then moves with Q >= -0.035 are taken)?
Or is it a relative percentage, e.g. if Q = -0.015, then nodes with Q >= -0.01515 are sampled? (That doesn't look correct.)
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Alphazero news

Post by matthewlai »

crem wrote: Fri Dec 07, 2018 12:34 pm The paper says that during training, moves are selected "in proportion to the root visit count", without mentioning that this happens only for the first 30 plies (so I assume it happens for the entire game).

However, in the training pseudocode it looks like there is a temperature cutoff at ply 30:

Code: Select all

def select_action(config: AlphaZeroConfig, game: Game, root: Node):
  visit_counts = [(child.visit_count, action)
                  for action, child in root.children.iteritems()]
  if len(game.history) < config.num_sampling_moves:
    _, action = softmax_sample(visit_counts)
  else:
    _, action = max(visit_counts)
  return action
matthewlai, would it be possible to clarify whether a temperature cutoff was used during training or not?


There is a temperature cutoff in the description of the play versus Stockfish though (not during training), so maybe that's what was meant in the pseudocode:
"by softmax sampling with a temperature of 10.0 among moves for which the value was no more than 1% away from the best move for the first 30 plies".

That raises some more questions (not as important as the training question though):
- "Softmax sampling with a temperature of 10" is a bit ambiguous; my best guess is that it means "proportional to exp(N / 10)".
- Values not more than 1% away: is that the value of Q or of N? (I guess it's Q?)
- If it's Q, what does "1% away" mean? Is it just 1% of the Q range (i.e. 0.02, as Q goes from -1 to 1; e.g. if Q for the best move is -0.015, then moves with Q >= -0.035 are taken)?
Or is it a relative percentage, e.g. if Q = -0.015, then nodes with Q >= -0.01515 are sampled? (That doesn't look correct.)
During training, we do softmax sampling by visit count up to move 30. There is no value cutoff. Temperature is 1.

In those games against SF, to increase diversity (this is the only place we used softmax sampling in normal gameplay), we did the same but with a higher temperature, and only considered moves within 1% of the best value.

The definition of temperature is the standard softmax definition - exp(N / 10) is correct.

By 1% we mean in absolute value. All our values are between 0 and 1, so if the best move has a value of 0.8, we would sample from all moves with values >= 0.79.
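A rough sketch of that match-play selection rule (a hypothetical helper, not the actual AlphaZero implementation; children is assumed to be a list of (action, visit_count, value) tuples with values in [0, 1], and per the quoted description this is only applied for the first 30 plies):

Code: Select all

import math
import random

def select_match_move(children, temperature=10.0, margin=0.01):
  # children: list of (action, visit_count, value) tuples, values in [0, 1].
  best_value = max(value for _, _, value in children)
  # Keep only moves whose value is within 1% (absolute) of the best move.
  candidates = [(action, visits) for action, visits, value in children
                if value >= best_value - margin]
  # Softmax over visit counts with temperature 10: weight ~ exp(N / 10).
  weights = [math.exp(visits / temperature) for _, visits in candidates]
  actions = [action for action, _ in candidates]
  return random.choices(actions, weights=weights, k=1)[0]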
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Astatos
Posts: 18
Joined: Thu Apr 10, 2014 5:20 pm

Re: Alphazero news

Post by Astatos »

OK, what we know:
1) Stockfish is the best engine in the world
2) The LC0 guys did manage to reverse-engineer A0 successfully
3) LC0 and A0 are roughly at the same strength
4) NNs are not less resource-hungry than alpha-beta
5) Scalability is about the same with both methods
6) Google has shown unacceptable behaviour, hiding data, obfuscating opponents and hyping results
crem
Posts: 177
Joined: Wed May 23, 2018 9:29 pm

Re: Alphazero news

Post by crem »

matthewlai wrote: Fri Dec 07, 2018 12:49 pm All our values are between 0 and 1, so if the best move has a value of 0.8, we would sample from all moves with values >= 0.79.
The paper says: "At the end of the game, the terminal position sT is scored according to the rules of the game to compute the game outcome z: −1 for a loss, 0 for a draw, and +1 for a win.".

So it's not like that? It's 0 for a loss, 1 for a win and 0.5 for a draw?
Also the paper says that the initial Q = 0 (and the pseudocode also says "if self.visit_count == 0: return 0"). Does that mean it's initialized to the "loss" value?


Whether it's -1 to 1 or 0 to 1 is also important for Cpuct scaling (or C(s) in the latest version of the paper). Do the c_base and c_init values assume that the Q range is -1..1 or 0..1?
USGroup1
Posts: 33
Joined: Sun Oct 14, 2018 7:01 pm
Full name: Sina Vaziri

Re: Alphazero news

Post by USGroup1 »

OneTrickPony wrote: Fri Dec 07, 2018 12:13 pm ...What is important is that the games show that there are paths in chess which SF is still unable to understand and the losses are very different than just playing against another alpha-beta engine on 4x or 10x the hardware.
That does not mean we can't improve alpha-beta engines so that they can handle those paths efficiently.
Jouni
Posts: 3281
Joined: Wed Mar 08, 2006 8:15 pm

Re: Alphazero news

Post by Jouni »

So far I have only looked at the TCEC opening games. A0 sometimes seems to play like a patzer and loses in 22 moves to an outdated SF :o .

[pgn] [Event "Computer Match"] [Site "London, UK"] [Date "2018.01.18"] [Round "255"] [White "Stockfish 8"] [Black "AlphaZero"] [Result "1-0"] [PlyCount "43"] [EventDate "2018.??.??"] 1. e4 {book} e6 {book} 2. d4 {book} d5 {book} 3. Nc3 {book} Nf6 {book} 4. Bg5 { book} Be7 {book} 5. e5 {book} Nfd7 {book} 6. h4 {book} Bxg5 {book} 7. hxg5 { book} Qxg5 {book} 8. Nh3 {book} Qe7 {book} 9. Qg4 g6 10. Ng5 h6 11. O-O-O Nc6 12. Nb5 Nb6 13. Rd3 h5 14. Rf3 a6 15. Qg3 Nd8 16. Nc3 Nd7 17. Bd3 Nf8 18. Rh4 Rg8 19. Bc4 Qd7 20. Nce4 dxe4 21. Nxe4 Nh7 22. Rxh5 1-0 [/pgn]
Jouni
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: Alphazero news

Post by noobpwnftw »

Jouni wrote: Fri Dec 07, 2018 2:00 pm So far I have only looked at the TCEC opening games. A0 sometimes seems to play like a patzer and loses in 22 moves to an outdated SF :o .
g4 with a queen, can't defend. :D