SGD basics

Discussion of chess software programming and technical issues.

Moderator: Ras

Chessnut1071
Posts: 313
Joined: Tue Aug 03, 2021 2:41 pm
Full name: Bill Beame

Re: SGD basics

Post by Chessnut1071 »

j.t. wrote: Sun Feb 06, 2022 10:12 pm
Desperado wrote: Sun Feb 06, 2022 10:02 pm Now, suppose I divide 10K elements into 1K mini-batches. I calculate the average gradient per mini-batch and then update the coefficients for the data contained in this mini-batch. In this scenario, it is neither "online" since I am not using the gradient of the individual data element as a basis for an update, nor is it "stochastic" since I am ultimately using all the elements in the epoch to adjust the parameters.
It is stochastic, because you don't use all the elements to calculate a gradient. In my experience, stochastic GD is not really necessary for hand-crafted evaluations; non-stochastic gradient descent seems to work fast enough. Stochastic gradient descent is only really necessary if you would otherwise see very bad performance.

EDIT 1: Here is a reply I wrote without seeing your second comment, maybe it helps anyway:

Gradient descent means that you compute the gradient over all the data you have, and then update the parameters according to the calculated gradient. Stochastic gradient descent means that you calculate the gradient using only a number of randomly selected elements (for example, 15, 200, or 10000). The number of elements you look at each time you calculate a new gradient is often called the "batch size". What you described in your second example is, I believe, stochastic gradient descent with a batch size of 1. This shouldn't really work well in most cases (including chess evaluations); the batch size should be larger.
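As a rough sketch of the difference (Sample and sampleGradient are placeholders for illustration, not code from any particular engine):

Code: Select all

#include <algorithm>
#include <random>
#include <vector>

// Placeholder for one training element (position, game result, ...).
struct Sample { /* ... */ };

// Placeholder: gradient of a single sample's error w.r.t. the parameters.
std::vector<double> sampleGradient(const Sample& s, const std::vector<double>& params);

// One epoch of mini-batch SGD: shuffle, then update after every batch.
void sgdEpoch(std::vector<Sample>& data, std::vector<double>& params,
              std::size_t batchSize, double learningRate)
{
    static std::mt19937 rng{std::random_device{}()};
    std::shuffle(data.begin(), data.end(), rng);   // the "stochastic" part

    for (std::size_t start = 0; start < data.size(); start += batchSize) {
        const std::size_t end = std::min(start + batchSize, data.size());

        // Average the gradient over this mini-batch only.
        std::vector<double> grad(params.size(), 0.0);
        for (std::size_t i = start; i < end; ++i) {
            const std::vector<double> g = sampleGradient(data[i], params);
            for (std::size_t k = 0; k < params.size(); ++k)
                grad[k] += g[k] / double(end - start);
        }

        // The parameters are updated after every mini-batch, not once per epoch.
        for (std::size_t k = 0; k < params.size(); ++k)
            params[k] -= learningRate * grad[k];
    }
}
With batchSize == data.size() this degenerates into plain gradient descent (one update per epoch), and with batchSize == 1 it is the online case from your second example.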

When you have a relatively simple evaluation function (i.e. no neural network), you don't need huge data sets (1 million to 10 million positions will very likely be enough, and most people already have great success with the ~750,000 positions from the zurichess quiet-labeled set). Then you also don't really need the stochastic part of gradient descent: you can just calculate the gradient over all the data, if you're careful about your implementation's performance. I am not sure, but maybe that's what you described as what you are already doing in your first example.

EDIT 2: your understanding of "online" makes sense to me. Online algorithms are interesting if you don't have all the input data when you start the algorithm.
Gradient methods are quick, but they have all sorts of issues with convergence and local minima. That's why I use a slower but more effective search: a modified Hooke & Jeeves step search, where you can set the level of precision. I use it to optimize parameters to minimize engine calls for checkmate problems, and it performs, on average, 25% better on my bank of 500 games up to 4- to 20-ply. The last time I ran it, I had a 94% reduction; however, it ran for 4 straight days. You don't have as many issues with local minima and aliases. With all of these methods you have to be very clever about what you choose as your objective function.
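For reference, a basic (unmodified) Hooke & Jeeves exploratory search looks roughly like the sketch below; objective is a placeholder, and the classic pattern move along the improving direction is omitted for brevity:

Code: Select all

#include <vector>

// Placeholder objective, e.g. total engine calls over a test suite (lower is better).
double objective(const std::vector<double>& params);

// Derivative-free pattern search: probe each coordinate up and down by the
// current step, keep improvements, and halve the step when nothing helps.
std::vector<double> hookeJeeves(std::vector<double> params, double step, double precision)
{
    double best = objective(params);
    while (step > precision) {
        bool improved = false;
        for (std::size_t k = 0; k < params.size(); ++k) {
            for (double delta : {+step, -step}) {
                params[k] += delta;
                const double trial = objective(params);
                if (trial < best) {
                    best = trial;
                    improved = true;
                    break;              // keep the probe, go to the next coordinate
                }
                params[k] -= delta;     // revert the failed probe
            }
        }
        if (!improved)
            step *= 0.5;                // refine until the requested precision
    }
    return params;
}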
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: SGD basics

Post by lithander »

Thanks for the detailed write-up!

The PSTs I'm using for MinimalChess were also tuned on the quiet-labeled.epd dataset that you are using. I tinkered with my tuner for a long time, trying to get better results than with the Pesto tables (Rofchade PSTs), and finally succeeded in that quest (+20 Elo) as described here:
forum3/viewtopic.php?f=7&t=77089&sid=a4 ... 30#p891122

These PSTs are also what I'm currently using for my new engine Leorik (which otherwise has the exact same feature set as yours, btw!), and your findings confirm my impression that the quality of the tables depends more on the training data than on the actual tuner. I want to revisit the topic eventually and, like you, try to cook up my own set of training positions. But between your own training data and quiet-labeled.epd there is still a 50 Elo difference! That really makes me wonder what the process behind Zurichess' data is. There's a lot of information on how to write a working tuner, but very little (to my knowledge) on how the data should be structured for best results.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: SGD basics

Post by j.t. »

lithander wrote: Sun Feb 13, 2022 6:43 pm That really makes me wonder what the process behind Zurichess' data is.
From the README.txt:
How was the data generated
====

1. 75000 games were played between three slightly different versions of zurichess
using 2moves_v1.pgn opening book to ensure high play variability.
2. From each game 20 positions were sampled from the millions of positions evaluated by
the engine during the game play. This resulted in 1500k random positions which were stored in violent.epd.
3. From the set were removed all positions on which quiescence search found a winning capture.
The remaining positions were stored in quiet.epd.
4. Next, from each quiet position a game using Stockfish 080916 was played.
The results were stored in quiet-labeled.epd.
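Step 3 presumably amounts to a filter along these lines (Position, evaluate, and qsearch stand in for the engine's own types and functions; this is a guess at the idea, not zurichess' actual code):

Code: Select all

struct Position;                                   // the engine's position type
int evaluate(Position& pos);                       // static eval, side-to-move pov
int qsearch(Position& pos, int alpha, int beta);   // quiescence search

const int INF = 30000;   // any bound wider than all evaluation scores

// A position counts as quiet if quiescence search cannot improve on the
// static evaluation, i.e. it finds no winning capture for the side to move.
bool isQuiet(Position& pos)
{
    const int staticEval = evaluate(pos);
    const int qScore     = qsearch(pos, -INF, +INF);
    return qScore <= staticEval;   // true: the position is kept for quiet.epd
}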
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: SGD basics

Post by lithander »

j.t. wrote: Sun Feb 13, 2022 8:11 pm From the README.txt:
How was the data generated
Thanks. Fairly straightforward!
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: SGD basics

Post by Mike Sherwin »

Why just one set of trained PSTs? Why not a set for each common opening like the French, Sicilian, Caro-Kann, Queen's Pawn, Indian, English, etc.? Or for every common two-move sequence? Or even for the top 100 or top 1000 sequences? It would be a lot of data, but so what: only one set of PSTs will be used in any one search, and therefore it would not slow down the search at all. Romi loads the subtree into the TT from the learn.dat file on the hard drive before each search with no noticeable lag at all, so just retrieving the PSTs from the hard drive would also work. I don't think explaining the benefits to playing strength is necessary. Maybe just start with two additional sets, e4 and d4, just to see if there is some gain in Elo? Whoever starts this would be performing a service to the community of engine authors. :D
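To illustrate the idea with a minimal sketch (all names here are hypothetical, and the scheme is only an analogy to Romi's learn.dat loading): hash the opening moves once before the search and pick the matching PST set, falling back to the generic tables.

Code: Select all

#include <array>
#include <cstdint>
#include <unordered_map>

using PstSet = std::array<std::array<int, 64>, 6>;   // 6 piece types x 64 squares

std::unordered_map<std::uint64_t, PstSet> pstByOpening;  // loaded from disk at startup
PstSet defaultPst;                                       // the single tuned set used today

// Chosen once per game, before the search starts; the search itself only
// ever sees one set of tables, so there is no slowdown.
const PstSet& selectPst(std::uint64_t openingKey)
{
    const auto it = pstByOpening.find(openingKey);
    return it != pstByOpening.end() ? it->second : defaultPst;
}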
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: SGD basics

Post by j.t. »

Mike Sherwin wrote: Mon Feb 14, 2022 6:01 pm Why just one set of trained PSTs? Why not a set for each common opening like the French, Sicilian, Caro-Kann, Queen's Pawn, Indian, English, etc.?
I think in this case the main goal is to test different approaches to tuning evaluation parameters. A simple PST is probably easier to work with and reason about.
I found that some Elo can be gained by having different PSTs depending on where the kings are in the current position, so I think your general idea of specialized PSTs would be interesting to explore further. However, at some point you'll have so many evaluation parameters that an evaluation-parameter object no longer fits on the stack. So I have avoided adding many more parameters so far.
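A minimal sketch of what "depending on where the kings are" can mean (the two-halves bucketing here is made up for illustration; engines use various schemes):

Code: Select all

// Four PST buckets chosen by which half of the board each king occupies;
// the tables in each bucket are tuned independently.
int kingBucket(int whiteKingSq, int blackKingSq)
{
    const int w = (whiteKingSq & 7) < 4 ? 0 : 1;   // white king: queenside or kingside
    const int b = (blackKingSq & 7) < 4 ? 0 : 1;   // black king: queenside or kingside
    return w * 2 + b;
}

// The evaluation then indexes pst[kingBucket(wk, bk)][piece][square]
// instead of a single pst[piece][square] table.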
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: SGD basics

Post by Desperado »

Mike Sherwin wrote: Mon Feb 14, 2022 6:01 pm Why just one set of trained PSTs? Why not a set for each common opening like the French, Sicilian, Caro-Kann, Queen's Pawn, Indian, English, etc.? Or for every common two-move sequence? Or even for the top 100 or top 1000 sequences? It would be a lot of data, but so what: only one set of PSTs will be used in any one search, and therefore it would not slow down the search at all. Romi loads the subtree into the TT from the learn.dat file on the hard drive before each search with no noticeable lag at all, so just retrieving the PSTs from the hard drive would also work. I don't think explaining the benefits to playing strength is necessary. Maybe just start with two additional sets, e4 and d4, just to see if there is some gain in Elo? Whoever starts this would be performing a service to the community of engine authors. :D
To me, the sequence looks like this:

1. symmetric tables
2. asymmetric tables
3. color-related tables

The piece positions are very different for White and Black, especially in the opening.
I think the effect will be similarly strong as between points 1 and 2, where we are talking about 40 to 60 Elo.
Many engines would not even need an update to their data structures for that.

In this thread, however, the focus is on the optimizer, not on the resulting values. The material values and PST values are merely a means to an end. At least I plan to explore the topic further. The properties of the parameters, their number, and also their effect on playing strength are very well suited to examining the overall optimization process. One set of data should be enough for that.

Of course, at some point it will be interesting to talk about the possibilities.
HCE evaluation could look quite different if we could use large tuned tables based on data analysis and optimization algorithms instead of guessing.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: SGD basics

Post by Desperado »

Hello again.

There is a thing that has puzzled me for a long time.

Currently I compute everything based on the interpolated score and the corresponding game phase.

Code: Select all

            // Returning the interpolated score from wpov. At the same time the
            // function provides the delta vector, the phase and phase values for MG/EG.
            score = eval(&pos);

            // using interpolated score
            error_mg = Fitness::single_error(i,score) * se.phase / PHASE_MAX;
            error_eg = Fitness::single_error(i,score) * (PHASE_MAX - se.phase) / PHASE_MAX;
What you can also do, and what makes a lot of sense to me, is to tune both phase values separately.
In the end, it is nothing other than tuning two evaluation functions, each weighted by the phase value.

Code: Select all

            // using phase scores
            error_mg = Fitness::single_error(i,se.mg) * se.phase / PHASE_MAX;
            error_eg = Fitness::single_error(i,se.eg) * (PHASE_MAX - se.phase) / PHASE_MAX;
This is what the phase scores ultimately are: the outputs of two different evaluation functions.

1. What do you think is the correct or better way?
2. Are there different ways to handle the phase weights or the phase values?
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: SGD basics

Post by j.t. »

Desperado wrote: Sun Feb 20, 2022 9:10 am Hello again.

There is a thing that has puzzled me for a long time.

Currently I compute everything based on the interpolated score and the corresponding game phase.

Code: Select all

            // Returning the interpolated score from wpov. At the same time the
            // function provides the delta vector, the phase and phase values for MG/EG.
            score = eval(&pos);

            // using interpolated score
            error_mg = Fitness::single_error(i,score) * se.phase / PHASE_MAX;
            error_eg = Fitness::single_error(i,score) * (PHASE_MAX - se.phase) / PHASE_MAX;
What you can also do, and what makes a lot of sense to me, is to tune both phase values separately.
In the end, it is nothing other than tuning two evaluation functions, each weighted by the phase value.

Code: Select all

            // using phase scores
            error_mg = Fitness::single_error(i,se.mg) * se.phase / PHASE_MAX;
            error_eg = Fitness::single_error(i,se.eg) * (PHASE_MAX - se.phase) / PHASE_MAX;
This is what the phase scores ultimately are: the outputs of two different evaluation functions.

1. What do you think is the correct or better way?
2. Are there different ways to handle the phase weights or the phase values?
I do it like this. Generally for the evaluation, too, I only interpolate the final score between the endgame and opening values. However, running the two functions truly separately would probably make the eval slower, so what I do is calculate the endgame and opening values side by side. This neatly avoids doing interpolations all the time.
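One common way to hold both phase values side by side (a generic illustration of the pattern, not a specific engine's code) is a small (mg, eg) pair that is accumulated term by term and interpolated exactly once at the end:

Code: Select all

struct Score { int mg, eg; };   // midgame and endgame value of one eval term

inline Score operator+(Score a, Score b) { return {a.mg + b.mg, a.eg + b.eg}; }
inline Score operator-(Score a, Score b) { return {a.mg - b.mg, a.eg - b.eg}; }

// The single interpolation at the very end of the evaluation.
inline int interpolate(Score s, int phase, int phaseMax)
{
    return (s.mg * phase + s.eg * (phaseMax - phase)) / phaseMax;
}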
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: SGD basics

Post by Desperado »

j.t. wrote: Sun Feb 20, 2022 1:25 pm
I do it like this. Generally for the evaluation, too, I only interpolate the final score between the endgame and opening values. However, running the two functions truly separately would probably make the eval slower, so what I do is calculate the endgame and opening values side by side. This neatly avoids doing interpolations all the time.
Just to be sure we do not confuse each other: I understand that you weight the phases as above.
But I am not sure which score(s) you use for reducing the error: the phase scores or the interpolated score?

Code: Select all

            error_mg = Fitness::single_error(i,INTERPOLATED_SCORE) * se.phase / PHASE_MAX;
            error_eg = Fitness::single_error(i,INTERPOLATED_SCORE) * (PHASE_MAX - se.phase) / PHASE_MAX;
or

Code: Select all

            error_mg = Fitness::single_error(i,MIDGAME_SCORE) * se.phase / PHASE_MAX;
            error_eg = Fitness::single_error(i,ENDGAME_SCORE) * (PHASE_MAX - se.phase) / PHASE_MAX;

This doesn't take any extra time, because the evaluation function already collects the phase scores and the phase.

Code: Select all

    // phase interpolation + pov
    int phase = Pos::phase(pos);

    #if TUNE // wpov
    GD::se.mg = ScoreMG(score);
    GD::se.eg = ScoreEG(score);
    GD::se.phase = phase;
    #endif

    int s = phase_score(ScoreMG(score),ScoreEG(score),phase);
    return pos->stm == WHITE ? s : -s;
These are two very different things.