Learning time growing exponentially with number of training examples

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

I am currently training a very small endgame (KQvK) net that is "32x4" in terms of the AlphaZero approach with around 1.15 million parameters. The smallest run that I do is with 144,000 games, where each starting position and every move in each game is a sample. There are about 1.7 million samples.

With a GPU batch size of 512 and feeding 20 mini-batches at a time (10,240 samples), a run takes less than 2 hours. This is actually doing 2 epochs per chunk (not full epochs over the entire sample set at once). On my GTX-1070, each epoch over one of these chunks takes about 3 seconds. The vast majority of the time goes into parsing the PGN input, which is fed directly into training. Most other approaches convert the PGN to a binary input-plane format first, but I have limited disk space (yes, I know disks are relatively cheap), and I also want to be able to change the input representation without having to re-process everything. Of course, I end up re-parsing the PGN every run, but that's how I'm currently doing it.
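In outline, the chunked training loop looks something like this (a simplified sketch, not my actual code; parse_pgn_samples() is a placeholder for the real PGN-to-planes parser, which is where most of the wall-clock time goes):

Code: Select all

# Simplified sketch: accumulate 20 mini-batches worth of samples
# (20 * 512 = 10,240) parsed straight from the PGN, then run 2 epochs
# over that chunk before moving on to the next one.
import numpy as np

CHUNK = 20 * 512   # samples per fit() call

def train_from_pgn(model, pgn_path):
    planes, policies, values = [], [], []
    for x, pi, z in parse_pgn_samples(pgn_path):   # one sample per position
        planes.append(x)         # (18, 8, 8) input planes
        policies.append(pi)      # length-1968 move-probability target
        values.append(z)         # game result from the side to move
        if len(planes) == CHUNK:
            model.fit(np.array(planes),
                      {'policy_out': np.array(policies),
                       'value_out': np.array(values)},
                      batch_size=512, epochs=2, shuffle=True, verbose=0)
            planes, policies, values = [], [], []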

I'm not using just MSE; the loss is a weighted combination of value-head MSE and policy-head categorical cross-entropy. After one run (again with 2 epochs per chunk) the accuracy is about 70% and the loss goes down from 8.6 to 1.1. Finally, the actual numbers are less important than the trend (although I would like to get to greater accuracy).
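For reference, the compile step looks roughly like this (a sketch: the optimizer settings and the loss weights shown are illustrative, not values I have settled on):

Code: Select all

# Sketch of the two-headed loss: categorical cross-entropy on the policy head,
# MSE on the value head, combined with per-head weights.  The optimizer and
# the weights here are illustrative.
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
              loss={'policy_out': 'categorical_crossentropy',
                    'value_out': 'mean_squared_error'},
              loss_weights={'policy_out': 1.0, 'value_out': 1.0},
              metrics={'policy_out': 'accuracy'})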

I am still fiddling with the learning-rate parameters and am using 1-cycle LR (along with the 2 epochs per chunk) from here:
https://medium.com/oracledevs/lessons-f ... cfcbe4ca9a
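As a rough illustration, the schedule can be implemented as a small Keras callback along these lines (a sketch with a plain triangular ramp; the real 1-cycle policy also anneals below the base rate at the end, and the bounds and step count here are illustrative):

Code: Select all

# Minimal sketch of a 1-cycle learning-rate schedule: ramp the LR linearly up
# to max_lr over the first half of training, then linearly back down.
import tensorflow as tf

class OneCycleLR(tf.keras.callbacks.Callback):
    def __init__(self, base_lr=1e-4, max_lr=1e-2, total_steps=10000):
        super().__init__()
        self.base_lr, self.max_lr, self.total_steps = base_lr, max_lr, total_steps
        self.step = 0

    def on_train_batch_begin(self, batch, logs=None):
        half = self.total_steps / 2.0
        if self.step <= half:
            frac = self.step / half                  # ramp up
        else:
            frac = max(0.0, 2.0 - self.step / half)  # ramp down
        lr = self.base_lr + frac * (self.max_lr - self.base_lr)
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr)
        self.step += 1

# usage: model.fit(..., callbacks=[OneCycleLR(total_steps=...)])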

For more net info see:
https://github.com/Zeta36/chess-alpha-zero

FYI, the net looks like this:

Code: Select all

>>> model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_1 (InputLayer)            (None, 18, 8, 8)     0
__________________________________________________________________________________________________
input_conv-3-32 (Conv2D)        (None, 32, 8, 8)     5184        input_1[0][0]
__________________________________________________________________________________________________
input_batchnorm (BatchNormaliza (None, 32, 8, 8)     128         input_conv-3-32[0][0]
__________________________________________________________________________________________________
input_relu (Activation)         (None, 32, 8, 8)     0           input_batchnorm[0][0]
__________________________________________________________________________________________________
res1_conv1-3-32 (Conv2D)        (None, 32, 8, 8)     9216        input_relu[0][0]
__________________________________________________________________________________________________
res1_batchnorm1 (BatchNormaliza (None, 32, 8, 8)     128         res1_conv1-3-32[0][0]
__________________________________________________________________________________________________
res1_relu1 (Activation)         (None, 32, 8, 8)     0           res1_batchnorm1[0][0]
__________________________________________________________________________________________________
res1_conv2-3-32 (Conv2D)        (None, 32, 8, 8)     9216        res1_relu1[0][0]
__________________________________________________________________________________________________
res1_batchnorm2 (BatchNormaliza (None, 32, 8, 8)     128         res1_conv2-3-32[0][0]
__________________________________________________________________________________________________
res1_add (Add)                  (None, 32, 8, 8)     0           input_relu[0][0]
                                                                 res1_batchnorm2[0][0]
__________________________________________________________________________________________________
res1_relu2 (Activation)         (None, 32, 8, 8)     0           res1_add[0][0]
__________________________________________________________________________________________________
res2_conv1-3-32 (Conv2D)        (None, 32, 8, 8)     9216        res1_relu2[0][0]
__________________________________________________________________________________________________
res2_batchnorm1 (BatchNormaliza (None, 32, 8, 8)     128         res2_conv1-3-32[0][0]
__________________________________________________________________________________________________
res2_relu1 (Activation)         (None, 32, 8, 8)     0           res2_batchnorm1[0][0]
__________________________________________________________________________________________________
res2_conv2-3-32 (Conv2D)        (None, 32, 8, 8)     9216        res2_relu1[0][0]
__________________________________________________________________________________________________
res2_batchnorm2 (BatchNormaliza (None, 32, 8, 8)     128         res2_conv2-3-32[0][0]
__________________________________________________________________________________________________
res2_add (Add)                  (None, 32, 8, 8)     0           res1_relu2[0][0]
                                                                 res2_batchnorm2[0][0]
__________________________________________________________________________________________________
res2_relu2 (Activation)         (None, 32, 8, 8)     0           res2_add[0][0]
__________________________________________________________________________________________________
res3_conv1-3-32 (Conv2D)        (None, 32, 8, 8)     9216        res2_relu2[0][0]
__________________________________________________________________________________________________
res3_batchnorm1 (BatchNormaliza (None, 32, 8, 8)     128         res3_conv1-3-32[0][0]
__________________________________________________________________________________________________
res3_relu1 (Activation)         (None, 32, 8, 8)     0           res3_batchnorm1[0][0]
__________________________________________________________________________________________________
res3_conv2-3-32 (Conv2D)        (None, 32, 8, 8)     9216        res3_relu1[0][0]
__________________________________________________________________________________________________
res3_batchnorm2 (BatchNormaliza (None, 32, 8, 8)     128         res3_conv2-3-32[0][0]
__________________________________________________________________________________________________
res3_add (Add)                  (None, 32, 8, 8)     0           res2_relu2[0][0]
                                                                 res3_batchnorm2[0][0]
__________________________________________________________________________________________________
res3_relu2 (Activation)         (None, 32, 8, 8)     0           res3_add[0][0]
__________________________________________________________________________________________________
res4_conv1-3-32 (Conv2D)        (None, 32, 8, 8)     9216        res3_relu2[0][0]
__________________________________________________________________________________________________
res4_batchnorm1 (BatchNormaliza (None, 32, 8, 8)     128         res4_conv1-3-32[0][0]
__________________________________________________________________________________________________
res4_relu1 (Activation)         (None, 32, 8, 8)     0           res4_batchnorm1[0][0]
__________________________________________________________________________________________________
res4_conv2-3-32 (Conv2D)        (None, 32, 8, 8)     9216        res4_relu1[0][0]
__________________________________________________________________________________________________
res4_batchnorm2 (BatchNormaliza (None, 32, 8, 8)     128         res4_conv2-3-32[0][0]
__________________________________________________________________________________________________
res4_add (Add)                  (None, 32, 8, 8)     0           res3_relu2[0][0]
                                                                 res4_batchnorm2[0][0]
__________________________________________________________________________________________________
res4_relu2 (Activation)         (None, 32, 8, 8)     0           res4_add[0][0]
__________________________________________________________________________________________________
value_conv-1-1 (Conv2D)         (None, 4, 8, 8)      128         res4_relu2[0][0]
__________________________________________________________________________________________________
policy_conv-1-2 (Conv2D)        (None, 8, 8, 8)      256         res4_relu2[0][0]
__________________________________________________________________________________________________
value_batchnorm (BatchNormaliza (None, 4, 8, 8)      16          value_conv-1-1[0][0]
__________________________________________________________________________________________________
policy_batchnorm (BatchNormaliz (None, 8, 8, 8)      32          policy_conv-1-2[0][0]
__________________________________________________________________________________________________
value_relu (Activation)         (None, 4, 8, 8)      0           value_batchnorm[0][0]
__________________________________________________________________________________________________
policy_relu (Activation)        (None, 8, 8, 8)      0           policy_batchnorm[0][0]
__________________________________________________________________________________________________
value_flatten (Flatten)         (None, 256)          0           value_relu[0][0]
__________________________________________________________________________________________________
policy_flatten (Flatten)        (None, 512)          0           policy_relu[0][0]
__________________________________________________________________________________________________
value_dense (Dense)             (None, 256)          65792       value_flatten[0][0]
__________________________________________________________________________________________________
policy_out (Dense)              (None, 1968)         1009584     policy_flatten[0][0]
__________________________________________________________________________________________________
value_out (Dense)               (None, 1)            257         value_dense[0][0]
==================================================================================================
Total params: 1,156,129
Trainable params: 1,155,529
Non-trainable params: 600
__________________________________________________________________________________________________
>>>
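If it helps, a net of this shape can be put together in Keras roughly as follows (a reconstruction from the summary above, not my actual training code; regularizers and other details are omitted):

Code: Select all

# Reconstruction of the 32x4 net shown in the summary: a 32-filter input conv,
# 4 residual blocks, and 1x1-conv value/policy heads.  Sizes follow the
# summary; this is a sketch, not the original code.
from tensorflow.keras.layers import (Input, Conv2D, BatchNormalization,
                                     Activation, Add, Flatten, Dense)
from tensorflow.keras.models import Model

def conv_bn_relu(x, filters, kernel):
    x = Conv2D(filters, kernel, padding='same', use_bias=False,
               data_format='channels_first')(x)
    x = BatchNormalization(axis=1)(x)
    return Activation('relu')(x)

def res_block(x, filters=32):
    y = conv_bn_relu(x, filters, 3)
    y = Conv2D(filters, 3, padding='same', use_bias=False,
               data_format='channels_first')(y)
    y = BatchNormalization(axis=1)(y)
    y = Add()([x, y])              # the residual (skip) connection
    return Activation('relu')(y)

inp = Input(shape=(18, 8, 8))
x = conv_bn_relu(inp, 32, 3)
for _ in range(4):
    x = res_block(x)

v = conv_bn_relu(x, 4, 1)                                   # value head
v = Dense(256, activation='relu')(Flatten()(v))
value_out = Dense(1, activation='tanh', name='value_out')(v)

p = conv_bn_relu(x, 8, 1)                                   # policy head
policy_out = Dense(1968, activation='softmax',
                   name='policy_out')(Flatten()(p))

model = Model(inp, [policy_out, value_out])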
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

Robert Pope wrote: Thu Aug 30, 2018 7:41 pm It's also questionable whether a network can truly unlearn wrong information. It seems logical on its face, yet DeepMind found that networks started from analysis of expert games topped out lower than a network starting from scratch. If they could really unlearn things completely, then both should have reached the same level of play, and starting with non-zero games would have gotten there faster.
DeepMind only ran what they wanted to, within the context of the "zero" approach. Unlearning wrong info is just climbing out of a local minimum. It can be done, but how much effort is required is certainly an issue, both in terms of the number of samples and the hyper-parameter tuning. As the Leela chess and Lc0 efforts have shown, these details are critical, and unfortunately, many of them were not included in the AZ papers. In fact, I seem to recall that there were some comments that the nets could get better, but DeepMind chose not to bother. IMO, they could have made the supervised learning results better, but perhaps that did not fit their project goals.
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

Slow progress: 70,000 examples and each epoch takes ten minutes.
When progress stops, I change the network and it starts all over again.
Maybe adding more training examples would help when progress stops. I don't know; I haven't tested that.

Or perhaps I'd better use 10,000 examples or fewer to evaluate a network (architecture, hyperparameters) to save time.
The first goal is getting the network architecture right, but there are many to try.

Code: Select all

0.208287827661261  0.183077094335066
0.205673240014877  0.181602108422112
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

Henk wrote: Sat Sep 01, 2018 12:02 pm Slow progress: 70,000 examples and each epoch takes ten minutes.
When progress stops, I change the network and it starts all over again.
Maybe adding more training examples would help when progress stops. I don't know; I haven't tested that.

Or perhaps I'd better use 10,000 examples or fewer to evaluate a network (architecture, hyperparameters) to save time.
The first goal is getting the network architecture right, but there are many to try.

Code: Select all

0.208287827661261  0.183077094335066
0.205673240014877  0.181602108422112
I suggest thinking a bit more about things. I was quite surprised to find that one of the 32x4 networks from here:
http://webphactory.net/lczero/
plays at around 2,800 Elo on a 1070. Note that this was a 32x4 net running under Lc0 (which is NOT the same as the net listed in detail in an earlier post). Moreover, larger nets have proven to play better (when properly trained) than smaller nets, even though there is a proportional reduction in nps.

So, I have started to think that figuring out how to properly tune a net should be one of the first goals. Note that all of the nets mentioned are trying to roughly follow the AlphaZero deep residual net architecture:
https://applied-data.science/static/mai ... _sheet.png
The above is for Go, but the Chess one is quite similar. So the "right" net architecture IS important, but the number of filters and layers seems somewhat less so.

Finally, it is not clear how important the "history" input is at this point, although there are indications that it is quite important (albeit in testing with FEN positions with no history vs even minimal history, not for full games, AFAIK).
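(For concreteness, "history" here means stacking the input planes of the last few positions into one input tensor, along these lines; the 18-plane per-position encoding and the history length below are only illustrative.)

Code: Select all

# Illustrative sketch of a "history" input: stack the plane sets of the last
# few positions, zero-padding when the game is shorter than history_len.
import numpy as np

def encode_with_history(position_planes, history_len=4):
    # position_planes: list of (18, 8, 8) arrays, oldest first, newest last
    recent = position_planes[-history_len:]
    pad = [np.zeros((18, 8, 8), dtype=np.float32)] * (history_len - len(recent))
    return np.concatenate(pad + recent, axis=0)   # (history_len * 18, 8, 8)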
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

There are many network architectures. Why should a residual network be the best?
Perhaps there are a great many variants to try, but unfortunately testing them takes so much time.

So my strategy is to start with the smallest ones first, since they should take less time to test or to find out that they are not working, and debugging might also be easier.
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

Hi hi hi. It might even be true that a network with bad learning properties or with bugs performs best, because it generalizes better.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

Henk wrote: Sat Sep 01, 2018 1:02 pm There are many network architectures. Why should a residual network be the best?
Perhaps there are a great many variants to try, but unfortunately testing them takes so much time.

So my strategy is to start with the smallest ones first, since they should take less time to test or to find out that they are not working, and debugging might also be easier.
I agree that we don't know that residual nets are the best, but we do know they work quite well for chess.
So, I'm starting with that overall architecture. To try to keep things simple, I would like to find a small net that I can tune well enough so that it at least always mates "pretty well" with KQvK. I think it is safe to assume that full 32-piece chess is enormously more complicated, so if I can't figure out how to do just KQvK, then there is little point in my trying the full game by myself.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

Henk wrote: Sat Sep 01, 2018 1:14 pm Hi hi hi. It might even be true that a network with bad learning properties or with bugs performs best, because it generalizes better.
Working on Tinker, I found several times that a version with clear bugs actually played better. But, as conflicted as I was, I finally came around to working only from code without known bugs.

Beyond bugs, it turns out that my testing methodology has had so many flaws over the years that it is hard to know how many "good" changes were missed and "bad" ones made it in.

Working with nets, I wasted the entire month of August trying to tune a larger net than my previous best, so it's back to "square one" for me.

Good luck and please let us know how things progress with your efforts.

Brian
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Learning time growing exponentially with number of training examples

Post by Sesse »

Henk wrote: Sat Sep 01, 2018 1:02 pm There are many network architectures. Why should a residual network be the best?
As I understand it (it's been a while since I worked with this), a residual network is just a trick on top of some other architecture: you let each block add its output onto its own input, so it only learns a correction to the previous layer's output. This is purely a numerical trick for helping learning in large networks; in principle you can rewrite a residual network as a non-residual one just by folding the skip connection into the weights.
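In code the trick is nothing more than adding the block's input back onto its output, something like this minimal sketch:

Code: Select all

# Minimal sketch: the block only has to learn a correction f(x) to its own
# input; the identity skip carries the rest through unchanged.
def residual_block(x, f):
    # f is the block's learned transformation (e.g. conv -> BN -> ReLU -> conv)
    return x + f(x)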
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

Getting from 0.1738 to 0.1722 cost 49 minutes.

[Last week I lost the weights because I accidentally generated a new network, overwriting the old one, so I had to start all over again. I have now spent two busy days getting back to this point.]

Code: Select all

0.173843260482675  0.155715351217672
0.17376476122923  0.155656586818636
0.17366084220189  0.155592418777899
0.173559118662966  0.15553298462641
0.173530084285974  0.15547422579013
0.173353026116885  0.155411660602527
0.17309622524582  0.155351764491435
0.17297541578088  0.155294262807597
0.172839301373288  0.155238627009984
0.172709488173819  0.155184913771754
0.17261382259316  0.155133167039571
0.172526916972935  0.1550816322635
0.172442104214535  0.155031822855693
0.172374160038596  0.154979208902587
0.172296394774209  0.154926758988309