I have read in several threads that some people tune their evaluation parameters with so-called "mini-batches",
subsets of the total dataset. What is the idea, and how can it be used?
A link to an easy introduction on that topic would already be interesting.
Thanks in advance.
Using Mini-Batch for tuning
- Posts: 1871
- Joined: Sat Nov 25, 2017 2:28 pm
- Location: France

- Posts: 16
- Joined: Fri Dec 27, 2019 8:47 pm
- Full name: Jacek Dermont
Re: Using Mini-Batch for tuning
Stochastic gradient descent could be seen as mini-batch with a batch size of 1. So you iterate over your data one position at a time and update the weights after each position. In mini-batch you divide your training data into parts of size N (typically in the range 32..1024 or more) and update the weights once per mini-batch. It is usually more effective than plain SGD and more parallelizable, be it via GPU or CPU, so in practice much faster.
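A minimal sketch of what this post describes, assuming a linear model tuned with a mean-squared-error loss (the function name and parameters are hypothetical, not from any particular engine's tuner). With `batch_size=1` it reduces to classic SGD; with `batch_size=len(X)` it becomes full-batch gradient descent:

```python
import numpy as np

def minibatch_gd(X, y, lr=0.01, batch_size=32, epochs=10, rng=None):
    """Mini-batch gradient descent on a linear model with MSE loss.

    batch_size=1   -> classic stochastic gradient descent
    batch_size=len(X) -> full-batch gradient descent
    """
    rng = rng or np.random.default_rng(0)
    w = np.zeros(X.shape[1])
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # gradient of mean squared error over this mini-batch only
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad                        # one update per mini-batch
    return w
```

Note that every epoch still visits the entire dataset; the batch size only controls how often the weights are updated along the way.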
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Using Mini-Batch for tuning
derjack wrote: ↑Tue Jan 12, 2021 9:11 pm
> Stochastic gradient descent could be seen as mini-batch with a batch size of 1. So you iterate over your data one position at a time and update the weights after each position. In mini-batch you divide your training data into parts of size N (typically in the range 32..1024 or more) and update the weights once per mini-batch. It is usually more effective than plain SGD and more parallelizable, be it via GPU or CPU, so in practice much faster.

Splitting the training data into parts doesn't sound like it depends on the tuning algorithm itself, so what is the general idea?
I can imagine that this results in more updates of the parameter vector, so you reach a semi-good solution very fast.
Further, I would guess the challenge is to calm things down to avoid fluctuations or divergence.
In basic algorithms where I don't have something like a learning rate, I would control it with stepcount * stepsize when updating a parameter.
Additionally, one could operate on every parameter in the beginning and later pick only a subset of the parameter vector.
What do you think? Sorry, this is a completely new field for me.
- Posts: 16
- Joined: Fri Dec 27, 2019 8:47 pm
- Full name: Jacek Dermont
Re: Using Mini-Batch for tuning
Desperado wrote: ↑Tue Jan 12, 2021 9:30 pm
> Splitting the training data into parts doesn't sound like it depends on the tuning algorithm itself, so what is the general idea?

Actually it is algorithm specific, or gradient descent specific. With mini-batches you still operate on the entire data in each epoch.
Maybe you confused it with using only a subset of positions and training on them, then another subset, and so on, training separately on each subset; since they are smaller, the training would be faster.
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Using Mini-Batch for tuning
derjack wrote: ↑Tue Jan 12, 2021 10:07 pm
> Actually it is algorithm specific, or gradient descent specific. With mini-batches you still operate on the entire data in each epoch.
>
> Maybe you confused it with using only a subset of positions and training on them, then another subset, and so on, training separately on each subset; since they are smaller, the training would be faster.

What I mean is that you accept changes during an epoch, so you update the reference fitness too (in the hope that you are on the right track).
That would speed things up at the beginning, but you would need to "cool down" the learning rate in later epochs, so you don't begin to fluctuate or diverge because of too many updates of the reference fitness.