Hello everybody,
i am interested in a simple gradient descent implementation. Unfortunately i am not able to put the puzzle pieces together.
Here is what i think that i understand and what i can do for now:
base model:
1. i have a sample set of positions including results
2. i have a parameter list with N elements.
3. i have a cost function (MSE)
a. to minimize the cost function, which is a squared function, i need the derivative which leads to a linear model y=mx+b.
b. solved this, i can tune the parameter the way, that y=mx+b gets close to 0.
example:
1. SAMPLESIZE 10000
2. PARAMETERLIST 5
3. MSE = (sum(resultcomputed_value)^2) / SAMPLESIZE
How do i have to iterate over my parameterlist and the samples to compute m,b ?
Do i have to compute m,b for each single sample ? m,b for the batch ? how do i get m,b for the batch ?
I red some articles on the web, but i am interested in the dialogue and the practice how to handle it in the context chess parameter tuning.
So, i think i got the idea but need to know how to do it.
Thanks a lot in advance...
Gradient Descent Introduction
Moderators: bob, hgm, Harvey Williamson
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

 Posts: 793
 Joined: Sun Aug 03, 2014 2:48 am
 Location: London, UK
 Contact:
Re: Gradient Descent Introduction
Let's start very simple, and say each training example has an input x (let's say material balance  single number) and output y^ (result).Desperado wrote: ↑Sun Dec 09, 2018 2:32 pmHello everybody,
i am interested in a simple gradient descent implementation. Unfortunately i am not able to put the puzzle pieces together.
Here is what i think that i understand and what i can do for now:
base model:
1. i have a sample set of positions including results
2. i have a parameter list with N elements.
3. i have a cost function (MSE)
a. to minimize the cost function, which is a squared function, i need the derivative which leads to a linear model y=mx+b.
b. solved this, i can tune the parameter the way, that y=mx+b gets close to 0.
example:
1. SAMPLESIZE 10000
2. PARAMETERLIST 5
3. MSE = (sum(resultcomputed_value)^2) / SAMPLESIZE
How do i have to iterate over my parameterlist and the samples to compute m,b ?
Do i have to compute m,b for each single sample ? m,b for the batch ? how do i get m,b for the batch ?
I red some articles on the web, but i am interested in the dialogue and the practice how to handle it in the context chess parameter tuning.
So, i think i got the idea but need to know how to do it.
Thanks a lot in advance...
And let's say you use a very simple model y(x) = wx + b. That is, you think y is linear to x. w and b would be your parameters. Note that here y(x) is the output of the model given x, and you want it to predict y^ (the actual result).
Then when you add MSE to the mix, you get L(x) = (1/2)(y(x)  y^)^2.
Now you want to figure out how to update w and b to make your function a better model. To do that, we take partial derivative of L with respect to w and b (remember that they are part of y), because that tells us how changing them will affect L(x).
Code: Select all
dL(x)/dw = dL(x)/dy(x) * dy(x)/dw # chain rule
= (y(x)  y^) * x
dL(x)/db = (y(x)  y^) # dy(x)/db = 1
w' = w  dL(x)/dw = w  (y(x)  y^) * x
b' = b  dL(x)/db = b  (y(x)  y^)
Now this is for two parameters and a single scalar feature. In reality we use a lot more, but the idea remains the same. You just have to compute dL(x)/dw_i for all parameters w_i, and apply w_i' = w_i  dL(x)/dw_i. We usually write them in vector notation for brevity, but that's just notation. You can have your model in any form as long as it's differentiable with respect to all the parameters.
As for how often to apply the update  the theoretically correct way is to compute the update for all your samples, average them, and then apply. In reality no one does that because it's extremely computationally intensive, and usually you can get pretty good updates by just using a small set of randomly selected samples, and this allows you to get more updates given the same amount of compute. This is called stochastic gradient descent (well, technically pure SGD is using only one sample at a time, but in common usage people still call it SGD when applied to small batches).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Re: Gradient Descent Introduction
Thanks both of you so far. Unfortunately my time is very limited, so there might be a delay for further posts.
Now,first, i would like to follow Matthew's description and i would appreciate if you can guide me through.
1.=> e(x) = y^
For me x are my feature parameters (say material p,n,b,r,q >1,3,3,5,9) used by my evaluation function e(), that produces an output y^
2. error = given result in sample  evaluation output = y(x)  y^
That is how i get the error (and finally sum all squared errors of the sample set).
But what you are saying is, that x is based on one feature. (eg material delta, put into a number). That only would go hand in hand
if my evaluation would be based on one feature (eg material balance). I simply do not have an x0,x1,...xn this way.
Based on my explanation above, y(x) is constant as x is constant too (my feature parameter list). That is why i would have only one x.
If i use the way you described, would i need to break the features/paramaters into seperated x0,x1,...,xn input values ?
like..
delta matrerial > 200 = x0
delta mobility > 40 = x1
delta passer > 75 = x2
and so on...
Thanks a lot
PS: i guess it does not make sense to go on with your description as long this is not clarified. Maybe one additional note. The "parameters" m,b
are not the "paramters" i want to tune, but i need them to modify my x (feature parameters) rapidly. At least this is the way i think at the moment.
Now,first, i would like to follow Matthew's description and i would appreciate if you can guide me through.
Already at this point my confusion starts, sounds funny, i know...matthewlai wrote: ↑Sun Dec 09, 2018 2:57 pmLet's start very simple, and say each training example has an input x (let's say material balance  single number) and output y^ (result).
1.=> e(x) = y^
For me x are my feature parameters (say material p,n,b,r,q >1,3,3,5,9) used by my evaluation function e(), that produces an output y^
2. error = given result in sample  evaluation output = y(x)  y^
That is how i get the error (and finally sum all squared errors of the sample set).
But what you are saying is, that x is based on one feature. (eg material delta, put into a number). That only would go hand in hand
if my evaluation would be based on one feature (eg material balance). I simply do not have an x0,x1,...xn this way.
Code: Select all
x = featureparameterlist
e(x) = evaluation using x
y^ = output of e(x)
y(x) = result given by sample
err = y(x)  y^
Based on my explanation above, y(x) is constant as x is constant too (my feature parameter list). That is why i would have only one x.
If i use the way you described, would i need to break the features/paramaters into seperated x0,x1,...,xn input values ?
like..
delta matrerial > 200 = x0
delta mobility > 40 = x1
delta passer > 75 = x2
and so on...
Thanks a lot
PS: i guess it does not make sense to go on with your description as long this is not clarified. Maybe one additional note. The "parameters" m,b
are not the "paramters" i want to tune, but i need them to modify my x (feature parameters) rapidly. At least this is the way i think at the moment.
Maybe you will catch me before i can catch myself...And let's say you use a very simple model y(x) = wx + b. That is, you think y is linear to x. w and b would be your parameters. Note that here y(x) is the output of the model given x, and you want it to predict y^ (the actual result).

 Posts: 793
 Joined: Sun Aug 03, 2014 2:48 am
 Location: London, UK
 Contact:
Re: Gradient Descent Introduction
If you want to work with multiple features already, that's fine too, but single feature is easier and I just thought it would be best to get the single feature case sorted out first.Desperado wrote: ↑Mon Dec 10, 2018 8:21 pmThanks both of you so far. Unfortunately my time is very limited, so there might be a delay for further posts.
Now,first, i would like to follow Matthew's description and i would appreciate if you can guide me through.
Already at this point my confusion starts, sounds funny, i know...matthewlai wrote: ↑Sun Dec 09, 2018 2:57 pmLet's start very simple, and say each training example has an input x (let's say material balance  single number) and output y^ (result).
1.=> e(x) = y^
For me x are my feature parameters (say material p,n,b,r,q >1,3,3,5,9) used by my evaluation function e(), that produces an output y^
2. error = given result in sample  evaluation output = y(x)  y^
That is how i get the error (and finally sum all squared errors of the sample set).
But what you are saying is, that x is based on one feature. (eg material delta, put into a number). That only would go hand in hand
if my evaluation would be based on one feature (eg material balance). I simply do not have an x0,x1,...xn this way.
Code: Select all
x = featureparameterlist e(x) = evaluation using x y^ = output of e(x) y(x) = result given by sample err = y(x)  y^
Based on my explanation above, y(x) is constant as x is constant too (my feature parameter list). That is why i would have only one x.
If i use the way you described, would i need to break the features/paramaters into seperated x0,x1,...,xn input values ?
like..
delta matrerial > 200 = x0
delta mobility > 40 = x1
delta passer > 75 = x2
and so on...
Thanks a lot
PS: i guess it does not make sense to go on with your description as long this is not clarified. Maybe one additional note. The "parameters" m,b
are not the "paramters" i want to tune, but i need them to modify my x (feature parameters) rapidly. At least this is the way i think at the moment.
Maybe you will catch me before i can catch myself...And let's say you use a very simple model y(x) = wx + b. That is, you think y is linear to x. w and b would be your parameters. Note that here y(x) is the output of the model given x, and you want it to predict y^ (the actual result).
If you have multiple features (x0, x1, x2), the evaluation function may look something like this: y = w0*x0 + w1*x1 + w2*x2 + b
In this case you would have 4 parameters  w0, w1, w2, b.
In general the goal of training is not changing your features. You are trying to find a function that given a set of features, will give you a useful output, and you achieve that by changing the weights (w0, w1, w2, b). Features are, in general, not something you control, but something you get from the board.PS: i guess it does not make sense to go on with your description as long this is not clarified. Maybe one additional note. The "parameters" m,b
are not the "paramters" i want to tune, but i need them to modify my x (feature parameters) rapidly. At least this is the way i think at the moment.
Think of a very simple example first. Let's say your training set is:
1 => 3
2 => 5
3 => 7
And you are trying to find a function f(x) = mx + b that gives you that relationship, your goal is to find an m and b that give you a function that models the relationship. In this case the optimal parameters are m=2, b=1. You don't really control x. The user of the function can give you any x.
Now imagine x is your feature list, and f(x) is the evaluation output. You want the set of parameters that, given any feature, will give you a prediction of the outcome of the game.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.