You want to stop your training when the improvement in MSE (and accuracy, if training for moves too) starts to flatline. The actual value is less important than the trend. Look at overfitting and underfitting with validation and/or test sample sets. There are MANY meta- and hyperparameters involved. In the end, what counts is whether the new net beats the old net when playing games.
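The "flatline" test above can be sketched as a simple patience rule. A minimal sketch in plain Python, with hypothetical names (`should_stop`, `val_losses`) and thresholds chosen purely for illustration:

```python
def should_stop(val_losses, patience=3, min_delta=1e-4):
    """Stop when validation loss hasn't improved by at least min_delta
    for `patience` consecutive epochs (the 'flatline' test)."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])   # best loss before the window
    recent_best = min(val_losses[-patience:])   # best loss inside the window
    return recent_best > best_before - min_delta

# A falling loss curve keeps training; a flatlining one triggers the stop
falling = [0.35, 0.31, 0.28, 0.26, 0.245, 0.236]
flat    = [0.35, 0.31, 0.28, 0.2800, 0.2801, 0.2800]
print(should_stop(falling))  # False: still improving
print(should_stop(flat))     # True: improvement below min_delta
```

The same idea is built into most frameworks (e.g. an early-stopping callback watching validation loss); the point is to watch the trend, not the absolute value.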
There will be many setbacks and sometimes they will be painful. Keep an eye on the Leela Discord chats, where there have been dozens of restarts with entire teams of devs and testers much smarter than I am. Incidentally, they may be quite close to AlphaZero chess and SF9 performance on reasonably strong h/w already.
In fact, I just realized that I have wasted an entire month training a new 256x20 net (see Zeta36's chess-alpha-zero on GitHub): 3 weeks of training plus a week of testing. Unfortunately, the new net is at least 100 Elo worse than the old best net. Worse still, I'm not even sure which old net is the best, and I could never reproduce the process that created it. Maybe I shouldn't be offering suggestions at all, although I do enjoy even just tinkering.
BrianR
(author of Tinker)
Learning time growing exponentially with number of training examples
Moderators: hgm, Rebel, chrisw
Re: Learning time growing exponentially with number of training examples
Henk wrote: ↑Tue Aug 28, 2018 2:58 pm
For the loss function I still use mean squared error. Maybe I should change that. For the activation function I use SELU. I read that if you use SELU you don't need batch normalization, since it is self-normalizing. But computing Exp(x) is one of the slowest operations during training.

I don't think the accuracy of the Exp() function plays a big role for SELU; maybe you can use a Taylor-series approximation or something similar. It won't make an order-of-magnitude difference, but every bit of speed you can gain helps, of course.
Code: Select all
// SELU: scale * x for x >= 0, scale * alpha * (exp(x) - 1) for x < 0
public static double SELU(double x) { return 1.0507 * (x >= 0 ? x : 1.67326 * (Math.Exp(x) - 1)); }
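For what it's worth, SELU only calls Exp() on the negative branch, where a truncated Taylor series converges quickly. A sketch of the idea (in Python rather than the C# above, purely for illustration; the constants match the snippet, and `exp_taylor` is a hypothetical helper):

```python
import math

def exp_taylor(x, terms=8):
    # Truncated Taylor series of e^x around 0: sum of x^k / k!
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= x / k
        total += term
    return total

def selu(x, exp_fn=math.exp):
    # SELU with a pluggable exp(), mirroring the C# snippet above
    scale, alpha = 1.0507, 1.67326
    return scale * (x if x >= 0 else alpha * (exp_fn(x) - 1.0))

# exp() is only needed for x < 0, where the alternating series behaves well
for x in (-0.5, -1.0, -2.0):
    print(f"x={x:5.1f}  exact={selu(x):.6f}  approx={selu(x, exp_taylor):.6f}")
```

Whether this actually beats a library Exp() depends on the platform; as noted above, it won't change the order of magnitude either way.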
Re: Learning time growing exponentially with number of training examples
brianr wrote: ↑Tue Aug 28, 2018 3:44 pm
There will be many setbacks and sometimes they will be painful. [...]

The problem with Leela was that they were training with bugged versions; recently they found another bug in the 50-move draw recognition. I never understood all these restarts from scratch: a network can also unlearn wrong information.
LC0 v0.17 with the latest network performs at about 3200 CCRL in blitz on my machine with a GTX 1080 Ti, which is not so bad when you consider that the project is just half a year old. With longer thinking times, though, the rating doesn't scale up that much: for A-B engines this is usually ~60 Elo per doubling of time, while for LC0 it seems to be far less. If they want to catch Stockfish 9 at TCEC conditions, they still have a long way to go.
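Quick arithmetic on that scaling claim: at ~60 Elo per doubling of time, closing a given rating gap requires gap/60 doublings. The 200-Elo gap below is a hypothetical figure, not one from the thread:

```python
elo_per_doubling = 60   # typical alpha-beta gain quoted above
gap = 200               # hypothetical rating gap to close

doublings = gap / elo_per_doubling
print(f"time factor needed: {2 ** doublings:.1f}x")  # about 10x more time
```

If LC0 gains much less than 60 Elo per doubling, the required time factor to close the same gap grows accordingly, which is the point being made.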
Re: Learning time growing exponentially with number of training examples
Joost Buijs wrote: ↑Tue Aug 28, 2018 5:16 pm
The problem with Leela was that they were training with bugged versions [...]

Bugs were just one of many issues to address. The AlphaZero papers left out quite a few critical details, all of which had to be determined by experimentation. At the same time, the entire distributed test-game generation and training infrastructure was improved along the way. The combination of all three things has meant a dizzying rate of change, although the improvement, as you noted, has been remarkable, IMHO. Of course, credit is also due to SF/Fishtest and Leela Zero (Go).
Re: Learning time growing exponentially with number of training examples
If your activation function is what's holding your network back, you have a strange architecture. Normally, the matrix multiplies are where the bulk of the time goes, even more so with fully connected layers (just because you have so many multiplications).
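A back-of-the-envelope count makes the point. The sizes here are illustrative assumptions (an 8x8 board, 64 input and output channels, 3x3 kernels), not numbers from the thread:

```python
# Rough operation counts for one convolutional layer on an 8x8 board
h = w = 8            # board size
c_in = c_out = 64    # channel counts (assumed)
k = 3                # 3x3 kernel

multiplies = h * w * c_out * c_in * k * k   # one multiply per kernel tap per output
activations = h * w * c_out                 # one activation call per output value

print(multiplies)                  # 2359296 multiplies
print(activations)                 # 4096 activation evaluations
print(multiplies // activations)   # 576x more multiplies than activations
```

Even an expensive exp() per activation is amortized over hundreds of multiply-adds, which is why the matrix math normally dominates.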
Re: Learning time growing exponentially with number of training examples
I saw in a video somewhere that the fully connected layer was not necessary, so I removed it. But I plan to restore it, for maybe that is why it is learning so slowly.
Also, if I add more filters, matrix multiplication will probably become the bottleneck. I wanted to find out how far I could get with small networks.
Besides, large networks are slow as hell on my computer, not to mention one with a fully connected layer.
Re: Learning time growing exponentially with number of training examples
You really want to use a GPU.
Re: Learning time growing exponentially with number of training examples
brianr wrote: ↑Tue Aug 28, 2018 3:44 pm
You want to stop your training when the improvement in MSE (and accuracy if training for moves too) starts to flatline. The actual value is less important than the trend. [...]

Maybe something like this. 25 epochs. I only used 3000 examples this time.
I can use Excel for creating graphics. Looks like no more than 25 epochs are needed for this network.
It took 4 minutes to produce this. If I go to 70000 examples, I guess it may take many hours.
Might be interesting to know how many examples you need to get to 0.02 accuracy. But then the network needs to change too.
Code: Select all
accuracy loss
0.458013710544201 0.367862032822002
0.453077727612043 0.350646540255515
0.449212547080192 0.335723648808655
0.445865818013073 0.322727363529838
0.442433604756379 0.311228538414997
0.438995924054079 0.300913524347082
0.435773768996642 0.291434877972697
0.432861667809229 0.28292564419744
0.430230891868324 0.27509496124421
0.427686040693979 0.26793934442099
0.425365970491599 0.261373392313979
0.423314145714826 0.255471106366534
0.421567529055967 0.249946968524498
0.420119441961668 0.24491328602785
0.418955067525141 0.240276449169904
0.417980999577265 0.23600902351908
0.417224789650568 0.232054312958871
0.416588542222407 0.228320648594056
0.416118417554459 0.224943371131533
0.415736412805752 0.221769135172235
0.415392560786855 0.218876717276215
0.415184835891526 0.216232079801687
0.414975611427226 0.213814402292785
0.414900894763836 0.211559877222756
0.41496182027214 0.209489659598996
Re: Learning time growing exponentially with number of training examples
brianr wrote: ↑Tue Aug 28, 2018 5:34 pm
Bugs were just one of many issues to address. The AlphaZero papers left out quite a few critical details, all of which had to be determined by experimentation. [...]

It's also questionable whether a network can truly unlearn wrong information. It seems logical on its face, yet DeepMind found that networks started from analysis of expert games topped out lower than a network starting from scratch. If they could really unlearn things completely, then both should have reached the same level of play, and starting with non-zero games would have gotten there faster.
Re: Learning time growing exponentially with number of training examples
Henk wrote: ↑Thu Aug 30, 2018 5:49 pm
Maybe something like this. 25 epochs. I only used 3000 examples this time. [...]
30000 examples took 146 minutes, but 25 epochs were not enough:
Code: Select all
accuracy loss
0.51334014734026 0.35042039985026
0.464390222849539 0.315072539741007
0.441037883490549 0.29892015998411
0.428137215031953 0.288855649218407
0.419934378348723 0.282072441348667
0.414433265222756 0.277100474454861
0.409873038184546 0.273135870832422
0.40615286288943 0.269766922350763
0.402648138504851 0.26677033369879
0.399390106333783 0.264008944522664
0.396005959140849 0.261470779797726
0.392240139315238 0.259121975584111
0.388769059613872 0.256860116284345
0.384834502730976 0.254575196095213
0.380694149653319 0.252355890794477
0.375440151056492 0.250154296940123
0.369873845835248 0.247963937490551
0.364236024776097 0.245779243438712
0.35828918067003 0.243593313773837
0.351584318735543 0.241410311867053
0.345129748317661 0.239261184620213
0.338871537820065 0.237169053284895
0.332758918517876 0.23514527421077
0.326351954124777 0.233107562070871
0.320626414619341 0.231087649267064
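The thread title says learning time grows exponentially with the number of examples; the two data points actually reported (3000 examples in 4 minutes, 30000 in 146 minutes) are better described by a power law. A sketch fitting t = c * n^p through both points (the 70000-example extrapolation is a rough guess, as is the power-law form itself):

```python
import math

# Two data points from the thread: (training examples, minutes)
(n1, t1), (n2, t2) = (3000, 4.0), (30000, 146.0)

# Fit t = c * n**p exactly through both points
p = math.log(t2 / t1) / math.log(n2 / n1)
c = t1 / n1 ** p

print(f"exponent p ~= {p:.2f}")   # ~1.56: superlinear, but not exponential
print(f"predicted minutes for 70000 examples: {c * 70000 ** p:.0f}")
```

That predicts several hundred minutes for 70000 examples, consistent with the "many hours" guess above, though with only two points the fit is obviously not conclusive.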