Learning time growing exponentially with number of training examples

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

You want to stop your training when the improvement in MSE (and accuracy, if training for moves too) starts to flatline. The actual value is less important than the trend. Look at over- and under-fitting with validation and/or test sample sets. There are MANY meta- and hyper-parameters involved. In the end, what counts is whether the new net beats the old net when playing games.
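In pseudocode terms, the "stop when the trend flatlines" rule is just a patience check on the validation loss (a minimal sketch; the window and tolerance are arbitrary choices, not values from any particular engine):

```python
def should_stop(val_losses, patience=5, min_delta=1e-3):
    """Stop when the best validation loss of the last `patience`
    epochs is no better than what was already reached before."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# A steadily improving run keeps going; a flatlined one stops.
improving = [0.40, 0.35, 0.31, 0.28, 0.26, 0.25, 0.24]
flat = [0.40, 0.30, 0.25, 0.25, 0.2499, 0.2499, 0.2499, 0.2499]
```

The same check works on accuracy with the comparison flipped; as said above, the trend matters, not the absolute value.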

There will be many setbacks and sometimes they will be painful. Keep an eye on the Leela Discord chats, where there have been dozens of restarts with entire teams of devs and testers much smarter than I am. Incidentally, they may be quite close to AlphaZero chess and SF9 performance on reasonably strong h/w already.

In fact, I just realized that I have wasted an entire month training a new 256x20 net (see Zeta36 chess-alpha-zero on GitHub): 3 weeks of training plus a week of testing. Unfortunately, the new net is at least 100 Elo worse than the old best net. What is worse, I'm not even sure which old net is the best, and I could never duplicate creating it. Maybe I shouldn't be offering suggestions at all, although I do enjoy even just tinkering.

BrianR
(author of Tinker)
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Learning time growing exponentially with number of training examples

Post by Joost Buijs »

Henk wrote: Tue Aug 28, 2018 2:58 pm For the loss function I still use mean squared error. Maybe I should change that. For the activation function I use SELU. I read that if you use SELU you don't need batch normalization, because it is self-normalizing. But computing Exp(x) is one of the slowest operations during training.

Code: Select all

public static double SELU(double x)
{
    return 1.0507 * (x >= 0 ? x : 1.67326 * (Math.Exp(x) - 1));
}
I don't think the accuracy of the Exp() function plays a big role for SELU; maybe you can use an approximation with a Taylor series or something similar. It won't make an order-of-magnitude difference, but every bit of speed you can gain will help, of course.
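One cheap alternative to a Taylor series (an illustrative sketch, not something from Henk's code) is the limit approximation exp(x) ≈ (1 + x/2^k)^(2^k), which needs only multiplications; SELU only ever evaluates the exponential for negative x, where the error stays small:

```python
import math

def fast_exp(x, k=8):
    """Approximate e**x as (1 + x/2**k)**(2**k) using k squarings."""
    y = 1.0 + x / (1 << k)
    for _ in range(k):
        y *= y
    return y

def selu_approx(x):
    # Same constants as the C# version above.
    return 1.0507 * (x if x >= 0 else 1.67326 * (fast_exp(x) - 1.0))
```

With k=8 the error at x = -1 is under 0.2%; whether this actually beats the library Exp() depends entirely on the hardware and runtime, so it is worth benchmarking before committing to it.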
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Learning time growing exponentially with number of training examples

Post by Joost Buijs »

brianr wrote: Tue Aug 28, 2018 3:44 pm There will be many setbacks and sometimes they will be painful. Keep an eye on the Leela Discord chats, where there have been dozens of restarts with entire teams of devs and testers much smarter than I am. Incidentally, they may be quite close to AlphaZero chess and SF9 performance on reasonably strong h/w already.
The problem with Leela was that they were training with bugged versions; recently they found another bug in the 50-move draw recognition. I never understood all these restarts from scratch; a network can also unlearn wrong information.

LC0 v0.17 with the latest network performs at blitz at about 3200 CCRL on my machine with a GTX 1080 Ti, which is not so bad when you consider that the project is just half a year old. With longer thinking times, though, the rating doesn't scale up that much: for A-B engines this is usually ~60 Elo per doubling of time, while for LC0 it seems to be way less. If they want to catch Stockfish 9 at TCEC conditions, they still have a long way to go.
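For reference, Elo differences map to expected scores through the standard logistic formula, so the ~60 Elo gained per doubling corresponds to roughly a 58.5% expected score for the faster side:

```python
def expected_score(elo_diff):
    """Expected score for a player rated elo_diff points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# expected_score(60) is about 0.585
```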
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Learning time growing exponentially with number of training examples

Post by brianr »

Joost Buijs wrote: Tue Aug 28, 2018 5:16 pm The problem with Leela was that they were training with bugged versions; recently they found another bug in the 50-move draw recognition. I never understood all these restarts from scratch; a network can also unlearn wrong information.
Bugs were just one of many issues to address. The AlphaZero papers left out quite a few critical details, all of which had to be determined by experimentation. At the same time, the entire distributed test-game generation and training infrastructure was improved along the way. The combination of all three things has meant a dizzying rate of change, although the improvement, as you noted, has been remarkable, IMHO. Of course, credit is also due to SF/Fishtest and Leela Zero Go.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Learning time growing exponentially with number of training examples

Post by Sesse »

If your activation function is what's holding your network back, you have a strange architecture. Normally, the matrix multiplies are where the bulk of the time goes, even more so with fully connected layers (just because you have so many multiplications).
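Sesse's point is easy to quantify: a fully connected layer costs about 2·n_in·n_out floating-point operations for the matrix part but only n_out activation evaluations, so the activations are a vanishing fraction of the work (a back-of-the-envelope count):

```python
def layer_cost(n_in, n_out):
    """Rough cost of one fully connected layer."""
    matmul_flops = 2 * n_in * n_out   # one multiply and one add per weight
    activation_calls = n_out          # one activation per output neuron
    return matmul_flops, activation_calls

flops, acts = layer_cost(1024, 1024)
# flops // acts == 2048: matrix work outnumbers activations ~2000:1
```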
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

Sesse wrote: Tue Aug 28, 2018 7:59 pm If your activation function is what's holding your network back, you have a strange architecture. Normally, the matrix multiplies are where the bulk of the time goes, even more so with fully connected layers (just because you have so many multiplications).
I saw in a video somewhere that a fully connected layer was not necessary, so I removed it. But I have plans to restore it, because maybe that is why it is learning so slowly.

Also, if I add more filters, matrix multiplication will probably become the bottleneck. I wanted to find out how far I could get with small networks.
Large networks are also slow as hell on my computer, not to mention using a fully connected layer.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: Learning time growing exponentially with number of training examples

Post by Sesse »

You really want to use a GPU. :-)
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

brianr wrote: Tue Aug 28, 2018 3:44 pm You want to stop your training when the improvement in MSE (and accuracy, if training for moves too) starts to flatline. The actual value is less important than the trend. Look at over- and under-fitting with validation and/or test sample sets. There are MANY meta- and hyper-parameters involved. In the end, what counts is whether the new net beats the old net when playing games.

There will be many setbacks and sometimes they will be painful. Keep an eye on the Leela Discord chats, where there have been dozens of restarts with entire teams of devs and testers much smarter than I am. Incidentally, they may be quite close to AlphaZero chess and SF9 performance on reasonably strong h/w already.

In fact, I just realized that I have wasted an entire month training a new 256x20 net (see Zeta36 chess-alpha-zero on GitHub): 3 weeks of training plus a week of testing. Unfortunately, the new net is at least 100 Elo worse than the old best net. What is worse, I'm not even sure which old net is the best, and I could never duplicate creating it. Maybe I shouldn't be offering suggestions at all, although I do enjoy even just tinkering.

BrianR
(author of Tinker)
Maybe something like this: 25 epochs. I only used 3000 examples this time.
I can use Excel for creating graphics. It looks like no more than 25 epochs are needed for this network.

It took 4 minutes to produce this. If I go to 70000 examples it may take many hours, I guess.
It might be interesting to know how many examples you need to get to 0.02 accuracy. But then the network needs to change too.

Code: Select all

accuracy                  loss
0.458013710544201  0.367862032822002
0.453077727612043  0.350646540255515
0.449212547080192  0.335723648808655
0.445865818013073  0.322727363529838
0.442433604756379  0.311228538414997
0.438995924054079  0.300913524347082
0.435773768996642  0.291434877972697
0.432861667809229  0.28292564419744
0.430230891868324  0.27509496124421
0.427686040693979  0.26793934442099
0.425365970491599  0.261373392313979
0.423314145714826  0.255471106366534
0.421567529055967  0.249946968524498
0.420119441961668  0.24491328602785
0.418955067525141  0.240276449169904
0.417980999577265  0.23600902351908
0.417224789650568  0.232054312958871
0.416588542222407  0.228320648594056
0.416118417554459  0.224943371131533
0.415736412805752  0.221769135172235
0.415392560786855  0.218876717276215
0.415184835891526  0.216232079801687
0.414975611427226  0.213814402292785
0.414900894763836  0.211559877222756
0.41496182027214  0.209489659598996
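Reading the loss column of the table above as a trend (values rounded to six digits), the per-epoch improvement shrinks from about 0.017 to about 0.002, which is exactly the flatline brianr described:

```python
# Loss column from the 3000-example run above, rounded to six digits.
losses = [0.367862, 0.350647, 0.335724, 0.322727, 0.311229,
          0.300914, 0.291435, 0.282926, 0.275095, 0.267939,
          0.261373, 0.255471, 0.249947, 0.244913, 0.240276,
          0.236009, 0.232054, 0.228321, 0.224943, 0.221769,
          0.218877, 0.216232, 0.213814, 0.211560, 0.209490]

# Improvement per epoch: consecutive differences of the loss.
deltas = [a - b for a, b in zip(losses, losses[1:])]
# deltas[0] is ~0.0172; deltas[-1] only ~0.0021
```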
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: Learning time growing exponentially with number of training examples

Post by Robert Pope »

brianr wrote: Tue Aug 28, 2018 5:34 pm
Joost Buijs wrote: Tue Aug 28, 2018 5:16 pm The problem with Leela was that they were training with bugged versions; recently they found another bug in the 50-move draw recognition. I never understood all these restarts from scratch; a network can also unlearn wrong information.
Bugs were just one of many issues to address. The AlphaZero papers left out quite a few critical details, all of which had to be determined by experimentation. At the same time, the entire distributed test-game generation and training infrastructure was improved along the way. The combination of all three things has meant a dizzying rate of change, although the improvement, as you noted, has been remarkable, IMHO. Of course, credit is also due to SF/Fishtest and Leela Zero Go.
It's also questionable whether a network can truly unlearn wrong information. It seems logical on its face, yet DeepMind found that networks started from analysis of expert games topped out lower than networks started from scratch. If they could really unlearn things completely, both should have reached the same level of play, and starting with non-zero games would have gotten there faster.
Henk
Posts: 7218
Joined: Mon May 27, 2013 10:31 am

Re: Learning time growing exponentially with number of training examples

Post by Henk »

Henk wrote: Thu Aug 30, 2018 5:49 pm (25-epoch run with 3000 examples, quoted in full above)

30000 examples took 146 minutes, but 25 epochs are not enough:

Code: Select all

accuracy                  loss
0.51334014734026  0.35042039985026
0.464390222849539  0.315072539741007
0.441037883490549  0.29892015998411
0.428137215031953  0.288855649218407
0.419934378348723  0.282072441348667
0.414433265222756  0.277100474454861
0.409873038184546  0.273135870832422
0.40615286288943  0.269766922350763
0.402648138504851  0.26677033369879
0.399390106333783  0.264008944522664
0.396005959140849  0.261470779797726
0.392240139315238  0.259121975584111
0.388769059613872  0.256860116284345
0.384834502730976  0.254575196095213
0.380694149653319  0.252355890794477
0.375440151056492  0.250154296940123
0.369873845835248  0.247963937490551
0.364236024776097  0.245779243438712
0.35828918067003  0.243593313773837
0.351584318735543  0.241410311867053
0.345129748317661  0.239261184620213
0.338871537820065  0.237169053284895
0.332758918517876  0.23514527421077
0.326351954124777  0.233107562070871
0.320626414619341  0.231087649267064
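For what it's worth, the two timings in this thread (4 minutes for 3000 examples, 146 minutes for 30000) point to superlinear but not literally exponential growth; fitting a power law to just these two data points gives roughly n^1.56:

```python
import math

ratio_examples = 30000 / 3000   # 10x the training examples
ratio_time = 146 / 4            # 36.5x the wall-clock time

# If time ~ n**p, then p = log(time ratio) / log(example ratio).
p = math.log(ratio_time) / math.log(ratio_examples)
# p is about 1.56, i.e. polynomial rather than exponential growth
```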