Learning time growing exponentially with number of training examples

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

You want to stop your training when the improvement in MSE (and accuracy, if training for moves too) starts to flatline. The actual value is less important than the trend. Look at over- and under-fitting with validation and/or test sample sets. There are MANY meta- and hyper-parameters involved. In the end, what counts is whether the new net beats the old net when playing games.
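In pseudocode terms, the "stop when the trend flatlines" rule is just a patience check on the validation loss (a minimal sketch; the window and tolerance are arbitrary choices, not values from any particular engine):

```python
def should_stop(val_losses, patience=5, min_delta=1e-3):
    """Stop when the best validation loss of the last `patience`
    epochs is no better than what was already reached before."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# A steadily improving run keeps going; a flatlined one stops.
improving = [0.40, 0.35, 0.31, 0.28, 0.26, 0.25, 0.24]
flat = [0.40, 0.30, 0.25, 0.25, 0.2499, 0.2499, 0.2499, 0.2499]
```

The same check works on accuracy with the comparison flipped; as said above, the trend matters, not the absolute value.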

There will be many setbacks and sometimes they will be painful. Keep an eye on the Leela Discord chats, where there have been dozens of restarts with entire teams of devs and testers much smarter than I am. Incidentally, they may be quite close to AlphaZero chess and SF9 performance on reasonably strong h/w already.

In fact, I just realized that I have wasted an entire month training a new 256x20 net (see Zeta36 chess-alpha-zero on GitHub): 3 weeks of training plus a week of testing. Unfortunately, the new net is at least 100 Elo worse than the old best net. What is worse, I'm not even sure which old net is the best, and I could never duplicate creating it. Maybe I shouldn't be offering suggestions at all, although I do enjoy even just tinkering.

BrianR
(author of Tinker)
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Learning time growing exponentially with number of training examples

Post by Joost Buijs »

Henk wrote: Tue Aug 28, 2018 2:58 pm For the loss function I still use mean squared error. Maybe I should change that. For the activation function I use SELU. I read that if you use SELU you don't need batch normalization, because it is self-normalizing. But computing Exp(x) is one of the slowest operations during training.

Code: Select all

public static double SELU(double x)
{
    return 1.0507 * (x >= 0 ? x : 1.67326 * (Math.Exp(x) - 1));
}
I don't think the accuracy of the Exp() function plays a big role for SELU; maybe you can use an approximation with a Taylor series or something similar. It won't make an order-of-magnitude difference, but every bit of speed you can gain will help, of course.
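One cheap alternative to a Taylor series (an illustrative sketch, not something from Henk's code) is the limit approximation exp(x) ≈ (1 + x/2^k)^(2^k), which needs only multiplications; SELU only ever evaluates the exponential for negative x, where the error stays small:

```python
import math

def fast_exp(x, k=8):
    """Approximate e**x as (1 + x/2**k)**(2**k) using k squarings."""
    y = 1.0 + x / (1 << k)
    for _ in range(k):
        y *= y
    return y

def selu_approx(x):
    # Same constants as the C# version above.
    return 1.0507 * (x if x >= 0 else 1.67326 * (fast_exp(x) - 1.0))
```

With k=8 the error at x = -1 is under 0.2%; whether this actually beats the library Exp() depends entirely on the hardware and runtime, so it is worth benchmarking before committing to it.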
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Learning time growing exponentially with number of training examples

Post by Joost Buijs »

brianr wrote: Tue Aug 28, 2018 3:44 pm There will be many setbacks and sometimes they will be painful. Keep an eye on the Leela Discord chats, where there have been dozens of restarts with entire teams of devs and testers much smarter than I am. Incidentally, they may be quite close to AlphaZero chess and SF9 performance on reasonably strong h/w already.
The problem with Leela was that they were training with bugged versions; recently they found another bug in the 50-move draw recognition. I never understood all these restarts from scratch; a network can also unlearn wrong information.

LC0 v0.17 with the latest network performs at blitz at about 3200 CCRL on my machine with a GTX 1080 Ti, which is not so bad when you consider that the project is just half a year old. With longer thinking times, though, the rating doesn't scale up that much: for A-B engines this is usually ~60 Elo per doubling of time, while for LC0 it seems to be way less. If they want to catch Stockfish 9 at TCEC conditions, they still have a long way to go.
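For reference, Elo differences map to expected scores through the standard logistic formula, so the ~60 Elo gained per doubling corresponds to roughly a 58.5% expected score for the faster side:

```python
def expected_score(elo_diff):
    """Expected score for a player rated elo_diff points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# expected_score(60) is about 0.585
```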
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

Joost Buijs wrote: Tue Aug 28, 2018 5:16 pm The problem with Leela was that they were training with bugged versions; recently they found another bug in the 50-move draw recognition. I never understood all these restarts from scratch; a network can also unlearn wrong information.
Bugs were just one of many issues to address. The AlphaZero papers left out quite a few critical details, all of which had to be determined by experimentation. At the same time, the entire distributed test-game generation and training infrastructure was improved along the way. The combination of all three things has meant a dizzying rate of change, although the improvement, as you noted, has been remarkable, IMHO. Of course, credit is also due to SF/Fishtest and Leela Zero Go.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Learning time growing exponentially with number of training examples

Post by Sesse »

If your activation function is what's holding your network back, you have a strange architecture. Normally, the matrix multiplies are where the bulk of the time goes, even more so with fully connected layers (just because you have so many multiplications).
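Sesse's point is easy to quantify: a fully connected layer costs about 2·n_in·n_out floating-point operations for the matrix part but only n_out activation evaluations, so the activations are a vanishing fraction of the work (a back-of-the-envelope count):

```python
def layer_cost(n_in, n_out):
    """Rough cost of one fully connected layer."""
    matmul_flops = 2 * n_in * n_out   # one multiply and one add per weight
    activation_calls = n_out          # one activation per output neuron
    return matmul_flops, activation_calls

flops, acts = layer_cost(1024, 1024)
# flops // acts == 2048: matrix work outnumbers activations ~2000:1
```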
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

Sesse wrote: Tue Aug 28, 2018 7:59 pm If your activation function is what's holding your network back, you have a strange architecture. Normally, the matrix multiplies are where the bulk of the time goes, even more so with fully connected layers (just because you have so many multiplications).
I saw in a video somewhere that a fully connected layer was not necessary, so I removed it. But I have plans to restore it, because maybe that is why it is learning so slowly.

Also, if I add more filters, matrix multiplication will probably become the bottleneck. I wanted to find out how far I could get with small networks.
Large networks are also slow as hell on my computer, not to mention using a fully connected layer.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Learning time growing exponentially with number of training examples

Post by Sesse »

You really want to use a GPU. :-)
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

brianr wrote: Tue Aug 28, 2018 3:44 pm You want to stop your training when the improvement in MSE (and accuracy, if training for moves too) starts to flatline. The actual value is less important than the trend. Look at over- and under-fitting with validation and/or test sample sets. There are MANY meta- and hyper-parameters involved. In the end, what counts is whether the new net beats the old net when playing games.

There will be many setbacks and sometimes they will be painful. Keep an eye on the Leela Discord chats, where there have been dozens of restarts with entire teams of devs and testers much smarter than I am. Incidentally, they may be quite close to AlphaZero chess and SF9 performance on reasonably strong h/w already.

In fact, I just realized that I have wasted an entire month training a new 256x20 net (see Zeta36 chess-alpha-zero on GitHub): 3 weeks of training plus a week of testing. Unfortunately, the new net is at least 100 Elo worse than the old best net. What is worse, I'm not even sure which old net is the best, and I could never duplicate creating it. Maybe I shouldn't be offering suggestions at all, although I do enjoy even just tinkering.

BrianR
(author of Tinker)
Maybe something like this: 25 epochs. I only used 3000 examples this time.
I can use Excel for creating graphics. It looks like no more than 25 epochs are needed for this network.

It took 4 minutes to produce this. If I go to 70000 examples it may take many hours, I guess.
It might be interesting to know how many examples you need to get to 0.02 accuracy. But then the network needs to change too.

Code: Select all

accuracy                  loss
0.458013710544201  0.367862032822002
0.453077727612043  0.350646540255515
0.449212547080192  0.335723648808655
0.445865818013073  0.322727363529838
0.442433604756379  0.311228538414997
0.438995924054079  0.300913524347082
0.435773768996642  0.291434877972697
0.432861667809229  0.28292564419744
0.430230891868324  0.27509496124421
0.427686040693979  0.26793934442099
0.425365970491599  0.261373392313979
0.423314145714826  0.255471106366534
0.421567529055967  0.249946968524498
0.420119441961668  0.24491328602785
0.418955067525141  0.240276449169904
0.417980999577265  0.23600902351908
0.417224789650568  0.232054312958871
0.416588542222407  0.228320648594056
0.416118417554459  0.224943371131533
0.415736412805752  0.221769135172235
0.415392560786855  0.218876717276215
0.415184835891526  0.216232079801687
0.414975611427226  0.213814402292785
0.414900894763836  0.211559877222756
0.41496182027214  0.209489659598996
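Reading the loss column of the table above as a trend (values rounded to six digits), the per-epoch improvement shrinks from about 0.017 to about 0.002, which is exactly the flatline brianr described:

```python
# Loss column from the 3000-example run above, rounded to six digits.
losses = [0.367862, 0.350647, 0.335724, 0.322727, 0.311229,
          0.300914, 0.291435, 0.282926, 0.275095, 0.267939,
          0.261373, 0.255471, 0.249947, 0.244913, 0.240276,
          0.236009, 0.232054, 0.228321, 0.224943, 0.221769,
          0.218877, 0.216232, 0.213814, 0.211560, 0.209490]

# Improvement per epoch: consecutive differences of the loss.
deltas = [a - b for a, b in zip(losses, losses[1:])]
# deltas[0] is ~0.0172; deltas[-1] only ~0.0021
```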
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: Learning time growing exponentially with number of training examples

Post by Robert Pope »

brianr wrote: Tue Aug 28, 2018 5:34 pm
Joost Buijs wrote: Tue Aug 28, 2018 5:16 pm The problem with Leela was that they were training with bugged versions; recently they found another bug in the 50-move draw recognition. I never understood all these restarts from scratch; a network can also unlearn wrong information.
Bugs were just one of many issues to address. The AlphaZero papers left out quite a few critical details, all of which had to be determined by experimentation. At the same time, the entire distributed test-game generation and training infrastructure was improved along the way. The combination of all three things has meant a dizzying rate of change, although the improvement, as you noted, has been remarkable, IMHO. Of course, credit is also due to SF/Fishtest and Leela Zero Go.
It's also questionable whether a network can truly unlearn wrong information. It seems logical on its face, yet DeepMind found that networks started from analysis of expert games topped out lower than networks started from scratch. If they could really unlearn things completely, both should have reached the same level of play, and starting with non-zero games would have gotten there faster.
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

Henk wrote: Thu Aug 30, 2018 5:49 pm (25-epoch run with 3000 examples, quoted in full above)

30000 examples took 146 minutes, but 25 epochs are not enough:

Code: Select all

accuracy                  loss
0.51334014734026  0.35042039985026
0.464390222849539  0.315072539741007
0.441037883490549  0.29892015998411
0.428137215031953  0.288855649218407
0.419934378348723  0.282072441348667
0.414433265222756  0.277100474454861
0.409873038184546  0.273135870832422
0.40615286288943  0.269766922350763
0.402648138504851  0.26677033369879
0.399390106333783  0.264008944522664
0.396005959140849  0.261470779797726
0.392240139315238  0.259121975584111
0.388769059613872  0.256860116284345
0.384834502730976  0.254575196095213
0.380694149653319  0.252355890794477
0.375440151056492  0.250154296940123
0.369873845835248  0.247963937490551
0.364236024776097  0.245779243438712
0.35828918067003  0.243593313773837
0.351584318735543  0.241410311867053
0.345129748317661  0.239261184620213
0.338871537820065  0.237169053284895
0.332758918517876  0.23514527421077
0.326351954124777  0.233107562070871
0.320626414619341  0.231087649267064
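For what it's worth, the two timings in this thread (4 minutes for 3000 examples, 146 minutes for 30000) point to superlinear but not literally exponential growth; fitting a power law to just these two data points gives roughly n^1.56:

```python
import math

ratio_examples = 30000 / 3000   # 10x the training examples
ratio_time = 146 / 4            # 36.5x the wall-clock time

# If time ~ n**p, then p = log(time ratio) / log(example ratio).
p = math.log(ratio_time) / math.log(ratio_examples)
# p is about 1.56, i.e. polynomial rather than exponential growth
```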