SGD basics

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

SGD basics

Post by Desperado »

Hi,

my tuner is an implementation of a gradient descent (GD) algorithm, as far as I understand it in the form in which SGD is defined.

However, when I read the theoretical basics, I stumble a bit.
So here are two pseudocode variants, and the question is which of them is SGD, and whether my implementation is SGD or a variant of it.

This is what I do:

Code: Select all

for (int epoch = 0; epoch < MAX_EPOCH; epoch++)
{
    for (int e = 0; e < elements; e++)
    {
        compute_error_of_element(e);
        update_coefficients_of_element(e);
    }

    // optional: modify learning rate
    // optional: shuffle data
}
Now, this is how I understand SGD should work:

Code: Select all

for (int epoch = 0; epoch < MAX_EPOCH; epoch++)
{
    int e = pick_one_random_element();
    compute_error_of_element(e);
    update_coefficients_of_element(e);

    // optional: modify learning rate
    // optional: shuffle data
}
At least I can say that the variant I implemented does the job well.
However, out of curiosity I would like to implement other (known) variants. My implementation is intuitive, so it's a homebrew.

If my pseudocode is not a form of SGD, where do you see the advantages and disadvantages?
If the second pseudocode does not represent SGD either, what would SGD look like in pseudocode analogous to the above?

At this point I want to ignore advanced algorithms like Adam, Adagrad, AdaDelta, momentum or whatever.
Maybe it is a stupid question, but I see a significant difference between the two pseudocodes and would like to be able to classify my own implementation somehow.

Thanks a lot.
jdart
Posts: 4398
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: SGD basics

Post by jdart »

ADAM, Adagrad, etc., and also your first method, tune all elements using all training data during each epoch. The "more advanced" methods you mention are good and converge reasonably rapidly, generally within 100-200 epochs in my experience, but it can take a long time to complete an epoch of training because you have to go through all the training data.

SGD can be faster and scales better to very large data sets, which is why it is frequently the method of choice for machine learning.

Note that whether you use a method such as ADAM or plain SGD, you generally do have to decrease the learning rate as the tuning progresses; otherwise you may "overshoot" the minimum and fail to converge.
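
For illustration, a simple step-decay schedule might look like the sketch below. The halving interval and the run_one_epoch() routine are assumptions for the example, not something prescribed here.

Code: Select all

double alpha = 0.1;                         // initial learning rate
for (int epoch = 0; epoch < MAX_EPOCH; epoch++)
{
    run_one_epoch(alpha);                   // assumed: one full training pass with step size alpha
    if ((epoch + 1) % DECAY_EPOCHS == 0)    // e.g. DECAY_EPOCHS = 50
        alpha *= 0.5;                       // smaller late steps avoid overshooting the minimum
}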
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: SGD basics

Post by Desperado »

jdart wrote: Sun Feb 06, 2022 9:02 pm ADAM, Adagrad, etc., and also your first method, tune all elements using all training data during each epoch. The "more advanced" methods you mention are good and converge reasonably rapidly, generally within 100-200 epochs in my experience, but it can take a long time to complete an epoch of training because you have to go through all the training data.

SGD can be faster and scales better to very large data sets, which is why it is frequently the method of choice for machine learning.

Note that whether you use a method such as ADAM or plain SGD, you generally do have to decrease the learning rate as the tuning progresses; otherwise you may "overshoot" the minimum and fail to converge.
Hello Jon,

thanks for your reply. Maybe it's obvious to you or others, but it's still not clear to me.
I think I have read too much on the subject today, and the following terms confuse me.

"online", this term is obviously used in this special context to express that after evaluation of a data element,
an adjustment of the parameters is immediately performed.

"stochastic", in this context means that not all (or maximum one ?) data elements are updated in one epoch.

Do I understand the terms correctly?

Now suppose I divide 10K elements into 1K mini-batches. I calculate the average gradient per mini-batch and then update the coefficients based on that mini-batch. In this scenario it is neither "online", since I am not using the gradient of an individual data element as the basis for an update, nor is it "stochastic", since I ultimately use all the elements in the epoch to adjust the parameters.

Whether I use single data items or mini-batches would then be irrelevant to the definition of "stochastic".
And again, the term "online" would be used when updating parameters based on the gradient of a single element.
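
For concreteness, the mini-batch scenario described above might look like the following sketch; BATCH_SIZE, shuffle(), accumulate_gradient() and coefficients[] are illustrative assumptions, not anyone's actual implementation.

Code: Select all

for (int epoch = 0; epoch < MAX_EPOCH; epoch++)
{
    shuffle(data, elements);                      // assumed: permutes the training data
    for (int start = 0; start < elements; start += BATCH_SIZE)
    {
        int end = start + BATCH_SIZE;
        if (end > elements)
            end = elements;                       // last batch may be smaller

        double grad[NUM_PARAMS] = {0};            // batch accumulator
        for (int e = start; e < end; e++)
            accumulate_gradient(grad, e);         // assumed: adds this element's gradient

        for (int j = 0; j < NUM_PARAMS; j++)      // one step per mini-batch, using the
            coefficients[j] -= alpha * grad[j] / (end - start);   // averaged gradient
    }
}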

Sorry if I am confusing all these basics.
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: SGD basics

Post by j.t. »

Desperado wrote: Sun Feb 06, 2022 10:02 pm Now suppose I divide 10K elements into 1K mini-batches. I calculate the average gradient per mini-batch and then update the coefficients based on that mini-batch. In this scenario it is neither "online", since I am not using the gradient of an individual data element as the basis for an update, nor is it "stochastic", since I ultimately use all the elements in the epoch to adjust the parameters.
It is stochastic, because you don't use all the elements to calculate a gradient. In my experience, stochastic GD is not really necessary for hand-crafted evaluations; non-stochastic gradient descent seems to work fast enough. Using stochastic gradient descent is only really necessary if you would otherwise see very bad performance.

EDIT 1: Here is a reply I wrote without seeing your second comment, maybe it helps anyway:

Gradient descent means that you compute the gradient over all the data you have and then update the parameters according to the calculated gradient. Stochastic gradient descent means that you calculate the gradient using only a number of randomly selected elements (for example 15, 200, or 10000). The number of elements you look at each time you calculate a new gradient is often called the "batch size". What you described in your second example is, I believe, stochastic gradient descent with a batch size of 1. This shouldn't really work well in most cases (including chess evaluations); the batch size should be larger.

When you have a relatively simple evaluation function (i.e. no neural network), then you don't need huge data sets (1 to 10 million positions will very likely be enough; most people already have great success with the ~750,000 positions from the zurichess quiet-labeled set), and then you also don't really need the stochastic part of gradient descent: you can just calculate the gradient over all the data, if you're careful about your general performance. I am not sure, but maybe that's what you are describing as what you already do in your first example.

EDIT 2: Your understanding of "online" makes sense to me. Online algorithms are interesting if you don't have all the input data when you start the algorithm.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: SGD basics

Post by Desperado »

j.t. wrote: Sun Feb 06, 2022 10:12 pm EDIT 2: Your understanding of "online" makes sense to me. Online algorithms are interesting if you don't have all the input data when you start the algorithm.
That implies that "online" approaches are (always) also "stochastic", because the gradient is not based on the complete data set but on a single data element.

As long as the gradient is not computed on the complete data set, we are in the "stochastic" context. I think I have it now.
OK, so it makes sense to use the gradient over all elements (non-stochastic) only for "small" training data.
But why, or when, can this be more efficient for the corresponding data sets?

Related to my pseudocodes:

Both are "online" and "stochastic", as I have now learned. Taking a closer look at the term mini-batch:
one can argue that both use a mini-batch of one element, but there is still a difference, and it is not merely cosmetic.

The first one (which I use) divides the training data into mini-batches of one element and loops over all mini-batches (each element) in one epoch.
The second one only uses 1 out of N mini-batches (one element) in each epoch.
If you think that running more epochs will produce the same solution as the first one, well, that is not the case.
The second approach will visit some data elements more often than the first one, and at the same time it might ignore
some other elements entirely.
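
A common way to get both properties, a randomized order and full coverage in every epoch (i.e. sampling without replacement), is to shuffle an index array once per epoch and then visit each element exactly once. A minimal sketch, reusing the illustrative names from the first post; shuffle() is an assumed Fisher-Yates helper:

Code: Select all

for (int epoch = 0; epoch < MAX_EPOCH; epoch++)
{
    shuffle(order, elements);               // assumed: permutes indices 0..elements-1
    for (int i = 0; i < elements; i++)
    {
        int e = order[i];                   // random order, but every element
        compute_error_of_element(e);        // is visited exactly once per epoch
        update_coefficients_of_element(e);
    }
}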

However, now I know what I am doing :lol:, a basic SGD implementation.
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: SGD basics

Post by j.t. »

Desperado wrote: Sun Feb 06, 2022 10:56 pm OK, so it makes sense to use the gradient over all elements (non-stochastic) only for "small" training data.
But why, or when, can this be more efficient for the corresponding data sets?
Why might stochastic gradient descent be more efficient for large data sets? Because to get to the optimum quickly, you need to take many gradient steps. You can't just calculate one gradient that is perfect for the current parameters and hope it leads directly to the optimum. So instead of slowly calculating each gradient on the whole data set, we only use a small subset. Of course, the gradient won't be very accurate, but in many cases this doesn't matter that much.

Why might non-stochastic gradient descent be better? Well, in some cases it turns out that computing the gradient over the entire data set works better. I can't really tell you why; it simply works like this. It likely depends on the other specifics of how you tune, and of course on the data and the model itself.

Of course, you could now experiment. Maybe it works best if you calculate the gradient stochastically at the beginning of the training and then slowly increase the batch size (to get more and more accurate gradients) until you calculate the gradient over the full set at the end of the training.
You pretty much have to try out what works best. But as I said, if performance is not a big problem, there is little reason to go for stochastic gradient computation instead of non-stochastic.

On this note, I would also add that, in the end, Elo is the ultimate check of whether something improves your engine or not. I have seen it often enough that a new evaluation feature significantly improves the training error without big performance hits, and yet it doesn't improve playing strength at all. So yeah, you will have to experiment.

EDIT 1: Also, I am not sure I understand correctly what you are doing. But usually the gradients of the elements in one batch are first summed up before being applied to the parameters. So for example:

Code: Select all

Gradient gradient;                      // zero-initialized accumulator
for (element in batch) {
    gradient += calculateGradient(parameters, element);
}
parameters.apply(gradient);             // typically scaled by learningRate / batchSize
jdart
Posts: 4398
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: SGD basics

Post by jdart »

Why might stochastic gradient descent be more efficient for large data sets? Because to get to the optimum quickly, you need to take many gradient steps.
It may or may not be more efficient. But think about multi-terabyte or larger training sets. With a non-stochastic method, you grind through something of that size computing the gradient, then you update the params, then you repeat. You don't get any closer to the minimum until you're through the whole set. With SGD you are adjusting the parameters all the time.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: SGD basics

Post by Desperado »

j.t. wrote: Mon Feb 07, 2022 12:25 am
EDIT 1: Also, I am not sure I understand correctly what you are doing. But usually the gradients of the elements in one batch are first summed up before being applied to the parameters. So for example:

Code: Select all

Gradient gradient;                      // zero-initialized accumulator
for (element in batch) {
    gradient += calculateGradient(parameters, element);
}
parameters.apply(gradient);             // typically scaled by learningRate / batchSize

Code: Select all

for (element in batch) {
    parameters.apply(calculateGradient(parameters, element));   // update after every element
}
... is what I do.
I update the parameters with every element. After that has been done for every element, the next epoch starts.
I do this for a fixed number of epochs (currently 200), and I use a fixed learning rate of alpha = 0.1.
Well, I am sure this can be done in a better way, but it already works pretty well.
On this note, I would also add that, in the end, Elo is the ultimate check of whether something improves your engine or not. I have seen it often enough that a new evaluation feature significantly improves the training error without big performance hits, and yet it doesn't improve playing strength at all. So yeah, you will have to experiment.
Of course, the goal is to obtain parameter values that improve the engine's performance. But I often miss a very simple alternative
when I think about how to use the result of the optimizer. Instead of discarding the values, one can keep them and adjust other engine components (like the search) instead, if there is confidence that, based on the data and the optimization, the values are objectively better for a static evaluation than the ones used so far. The perspective is then simply different: not the values are the problem, but their interaction with the current implementation of the other components. (By the way, I think many people already do this when an NN replaces an HCE.)

From my personal point of view, the biggest difficulty is not the optimization algorithm. Neither is the motivation to adopt the values in the end, for reasons like playing style or playing strength or similar. The biggest difficulty is to determine the data source that is relevant for the intended goal. On the one hand the data source must not already be biased; on the other hand it must be special enough to contain information that is relevant to be learned / optimized.

Regards
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: SGD basics

Post by j.t. »

Desperado wrote: Fri Feb 11, 2022 8:12 pm ... is what I do.
Oh OK, that's basically stochastic gradient descent with a batch size of 1.
The biggest difficulty is to determine the data source that is relevant for the intended goal.
What I did was to run a bunch of games with my engine and collect some randomly sampled positions that my engine's evaluation function encounters, because positions like these are exactly what my evaluation function is supposed to evaluate accurately. (I also removed non-quiet positions, but I haven't experimented yet with keeping them, so this could be nonsensical.)

The hard part is then to create target labels for these positions. I used very short self-play games for that (like 50 ms per move or so). Because this is so hardware-intensive, many people use data sets created by others. I still use the zurichess quiet-labeled set to enhance my own data set, because it gives something like 10 Elo extra.
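
As a sketch, that labeling step could look like the loop below; play_selfplay_game() and the position/label arrays are hypothetical names, not the actual tooling:

Code: Select all

// For each sampled position, play one fast self-play game starting from it
// and use the outcome as the training target (hypothetical helper names).
for (int i = 0; i < num_positions; i++)
{
    double result = play_selfplay_game(&positions[i], 50 /* ms per move */);
    labels[i] = result;                     // 0 = loss, 0.5 = draw, 1 = win (from White's view)
}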
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: SGD basics / Tuning sessions and Experiments

Post by Desperado »

Hello,

I did some tests and experiments. You might want to jump to the results first and read about the resources afterwards.

Engine description

The strength is somewhere in the range of 2200-2300 Elo. Currently there is no need to measure that precisely, because I am only interested in the relative strength between the original and the tuned versions. The reason I mention this is that the effect of the Elo progress is certainly different in a more advanced engine.

• pvs
• transposition table
• mvv-lva
• killer
• best move to front at the root
• staged move generation
• qs (captures only)
• eval = material + pst

Data source

1. quiet-labeled.epd

This publicly known data set is a good point to start with and to compare results against. There have been successful tuning sessions with this data; a well-known example is Rofchade. The data includes 750,000 training positions.

2. self-play
The engine played 150K games at a level of 64K nodes per move. Then five positions were randomly picked out of each game, so the training set finally includes 750K positions, each satisfying:

• position not in check
• abs(staticEval - qsEval) < 40
• fabs(sigmoid(staticEval) - result) <= 0.8 // fishy

Although these criteria influence the basic setup, I doubt that the specific choice has a high impact on the general process. More important than the selection criteria is the source itself, imo.
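
As a sketch, the three criteria could be combined into a filter like the one below. Position, in_check(), static_eval() and qs_eval() are assumed engine helpers, and the sigmoid here uses a plain 1/(1 + 10^(-s/400)) scaling; none of this is the engine's actual code.

Code: Select all

#include <math.h>
#include <stdlib.h>

static double sigmoid(double s) { return 1.0 / (1.0 + pow(10.0, -s / 400.0)); }

// Hypothetical filter implementing the three selection criteria above.
int keep_position(const Position *pos, double result /* 0, 0.5 or 1 */)
{
    if (in_check(pos))
        return 0;                                    // criterion 1: not in check
    int se = static_eval(pos);
    if (abs(se - qs_eval(pos)) >= 40)
        return 0;                                    // criterion 2: quiet position
    if (fabs(sigmoid((double)se) - result) > 0.8)
        return 0;                                    // criterion 3: drop "fishy" outliers
    return 1;
}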

Tuning Setup

• Algorithm : SGD (one update step is sketched after this list)
• Learning rate : const 0.1
• Epochs : 500
• PST : asymmetric
• Parameters : 768
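
For reference, one per-element update step under a sigmoid-MSE error might look like the sketch below, assuming a linear evaluation (a dot product of the 768 PST weights with a per-position feature vector) and the common scaling sigmoid(s) = 1/(1 + 10^(-K*s/400)). All names are illustrative, and the constant factors of the derivative are folded into the learning rate.

Code: Select all

#include <math.h>

#define NUM_PARAMS 768

typedef struct {
    double features[NUM_PARAMS];                   // e.g. +1/-1/0 piece-square occupancy
    double result;                                 // game result: 0, 0.5 or 1
} Entry;

// One SGD step on a single position (illustrative sketch, not the engine's code).
void sgd_step(double w[NUM_PARAMS], const Entry *ent, double K, double alpha)
{
    double s = 0.0;                                // static eval = dot(weights, features)
    for (int j = 0; j < NUM_PARAMS; j++)
        s += w[j] * ent->features[j];

    double p   = 1.0 / (1.0 + pow(10.0, -K * s / 400.0));  // predicted score
    double err = ent->result - p;

    // d(err^2)/dw[j] = -2 * err * p * (1 - p) * c * features[j] for a constant c;
    // those constants are folded into alpha below.
    double g = err * p * (1.0 - p);
    for (int j = 0; j < NUM_PARAMS; j++)
        w[j] += alpha * g * ent->features[j];      // gradient descent step
}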

Experiment 1 – Test against Original PSTs

• opening : book UHO_XXL_100_129
• trained with: quiet-labeled.epd
• Score WLD : 315 - 97 - 56 [0.733] 468
• Elo : +175.35 +/- 32.86
• Stopped : SPRT llr 2.97, lbound -2.94, ubound 2.94 - H1 was accepted


• opening : book UHO_XXL_100_129
• trained with: 64k_5.epd
• Score WLD : 389 - 168 - 76 [0.675] 633
• Elo : +126.62 +/- 26.87
• Stopped : SPRT llr 2.95, lbound -2.94, ubound 2.94 - H1 was accepted

Experiment 2 – Test against Rofchade PSTs

Result vs Rofchade PSTs (match 1)

• opening : book UHO_XXL_100_129
• trained with : quiet-labeled.epd
• Score WLD : 7505 - 7311 - 4813 [0.505] 19629
• Elo : +3.43 +/- 4.22
• Stopped : SPRT llr 2.96, lbound -2.94, ubound 2.94 - H1 was accepted

Result vs Rofchade PSTs (match 2)

• opening : 8moves_gm_lb.epd
• trained with : quiet-labeled.epd
• Score WLD : 5674 - 5784 - 5257 [0.497] 16715
• Elo : -2.29 +/- 4.35
• Stopped : by hand

Conclusion

The most important point is that the tuning process works in two respects. First, the optimizer minimizes the error of the evaluation function. Second, it improves the Elo strength. The impressive Elo improvements show that different kinds of data sources have very different impacts. Although the result with the well-known quiet-labeled.epd is better than with the self-play database, it is important to know that, based on the current engine level, you can produce new data with the potential to make further progress. That property might be more important than the better result of a single tuning session.

The second experiment is very exciting too. The Rofchade PSTs are based on the same data source, the quiet-labeled.epd. For this test the engine was prepared to plug in these Rofchade values. As you can see from the match result, it is a head-to-head match.
While the match statistics are close, the MSE scores for the quiet-labeled data set are:

K : 1.663756460
MSE Rofchade : 0.0615515096270968
MSE Engine   : 0.0614646864265578
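
For context, the reported MSE is presumably of the usual Texel-tuning form; a minimal sketch, using the same sigmoid scaling assumed in the update sketch above (the engine's exact formula may differ):

Code: Select all

#include <math.h>

typedef struct {
    int    staticEval;                     // evaluation in centipawns
    double result;                         // game result: 0, 0.5 or 1
} Sample;

// Mean squared error between game results and predicted scores,
// with sigmoid(s) = 1 / (1 + 10^(-K*s/400)).
double mse(const Sample *data, int n, double K)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double p   = 1.0 / (1.0 + pow(10.0, -K * data[i].staticEval / 400.0));
        double err = data[i].result - p;
        sum += err * err;
    }
    return sum / n;
}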

Maybe it is important to say that the original values (before tuning) were not useless tables.
Here are the initial values (before tuning) for the knight and the king, which have since been improved.
So the reported Elo gains are relative to tables that already played reasonable chess.

Code: Select all


void Eval::setup_knights()
{ 
     // A1 ... H8
    const int mg[ID64] = {
        -28, -22, -18, -12, -12, -18, -22, -28,
        -10,  -4,   0,   6,   6,   0,  -4, -10,
         -6,   0,   4,  10,  10,   4,   0,  -6,
          0,   6,  10,  16,  16,  10,   6,   0,
          0,   6,  10,  16,  16,  10,   6,   0,
         -6,   0,   4,  10,  10,   4,   0,  -6,
        -10,  -4,   0,   6,   6,   0,  -4, -10,
        -16, -10,  -6,   0,   0,  -6, -10, -16
    };
    const int eg[ID64] = {
        -24, -16, -10,  -4,  -4, -10, -16, -24,
        -16,  -8,  -2,   4,   4,  -2,  -8, -16,
        -10,  -2,   4,  10,  10,   4,  -2, -10,
         -4,   4,  10,  16,  16,  10,   4,  -4,
         -4,   4,  10,  16,  16,  10,   4,  -4,
        -10,  -2,   4,  10,  10,   4,  -2, -10,
        -16,  -8,  -2,   4,   4,  -2,  -8, -16,
        -24, -16, -10,  -4,  -4, -10, -16, -24
    };

    for(int i = A1; i <= H8; i++) {
        pst[KNIGHT][i] = Score(mg[i],eg[i]);
    }
}

    // Presumably the king tables (the post mentions "king and knight");
    // the enclosing setup function was omitted in the original post.
    const int mg[ID64] = {
          8,  16,   8,  -4,   0,  -4,  16,   8,
          0,   0,  -4,  -4,  -4,  -4,   0,   0,
         -8,  -8,  -8,  -8,  -8,  -8,  -8,  -8,
        -12, -12, -12, -12, -12, -12, -12, -12,
        -16, -16, -16, -16, -16, -16, -16, -16,
        -20, -20, -20, -20, -20, -20, -20, -20,
        -24, -24, -24, -24, -24, -24, -24, -24,
        -28, -28, -28, -28, -28, -28, -28, -28
    };

    const int eg[ID64] = {
        -48, -20, -16,  -8,  -8, -16, -20, -48,
        -20,   8,  12,  20,  20,  12,   8, -20,
        -16,  12,  16,  24,  24,  16,  12, -16,
         -4,  24,  28,  36,  36,  28,  24,  -4,
          0,  28,  32,  40,  40,  32,  28,   0,
        -12,  16,  20,  28,  28,  20,  16, -12,
        -16,  12,  16,  24,  24,  16,  12, -16,
        -44, -16, -12,  -4,  -4, -12, -16, -44
    };
    
Finally, I have a look at the optimization process.

Time duration
Most impressive for me is that a tuning session took between 15 and 20 minutes for 500 epochs. So far I did not output the time duration. To be honest, I don't care (at the moment) whether it can be done in five or even two minutes, because the validation (the test match) will take much longer anyway. It is not an overnight process or a matter of days; we are talking about minutes. This is basically independent of the number of parameters, whether eight values, 800, or even 8000.

Hill climbing or oscillation
Another interesting point: I introduced two indicators in my output that show the ratio of successful epochs and the latest successful update. When the tuning session starts it looks like 1/1, 2/2, ..., 35/35. Approaching epoch 100 it begins to look like 76/96 (current 100), which means roughly 75% of the epochs improved the optimum over the run and the latest improvement was in epoch 96.
So while getting closer to the (an) optimum it looks like it is oscillating, but it isn't.
It just climbs small hills, because at some point the optimum is improved again. A good starting point to learn something about the learning rate and to spend some time on it.

Epochs and parameters
Although the patterns and the produced numbers between 200 and 500 epochs look very similar (to the human eye), the test match afterwards sometimes reported negative Elo differences between 10 and 50 Elo. This might depend on how I implemented the optimizer.

Sample size
Well, I have EPD files that include more than 100 million positions. But it looks like even the generated 64k_5.epd, based on a low-level engine and 750K positions, can lead to massive improvements with a very basic selection process.

Now that the basics are working, I will spend some time on advanced GD techniques. Exploring the world of optimization by myself took a lot of time, and of course I am very happy that things have become successful now. In particular, I would like to elaborate more on the choice of data.

On the fly, I did some different tests too. One was to check symmetric tables against asymmetric tables. The result was very clear, so I did not mention it above, but relative to the original scores it was also an improvement. Another test was to tune the tables one after the other, like K, P, N, B, R, Q. That worked too, but it was not as efficient as tuning all parameters in one run. And I did a lot of useless stuff that I won't talk about :-).

If someone is interested in the resulting tables, just send a PM.

P.S.: I will think about separate White and Black tables. Because the asymmetric tables are 40 to 60 Elo better than the symmetric tables, that might give another boost. It won't take long to create such tables :D

Michael