Tapered Evaluation and MSE (Texel Tuning)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ferdy »

Sven wrote: Mon Jan 11, 2021 1:15 pm
Ferdy wrote: Mon Jan 11, 2021 11:26 am [...] I tune the next param once the MSE has improved. I also tried one side at a time, that is, +1, and if that fails, try -1. If +1 improves the MSE, then go to the next param.
I do roughly the same in Jumbo, and I am satisfied with the resulting convergence behaviour. With a small number of parameters, like 11 (material parameters only), it needs less than a minute to converge from guessed values to a local optimum, even with several hundred thousand positions and running single-threaded (Jumbo can use parallel threads for calculating the static evaluation of positions). The full static evaluation is used (with lazy eval disabled), which is quite slow in Jumbo.

This is a simplified version of the essential part of Jumbo's tuning code (I left out things like threading code, special eval topics like pawn hash, range checking of parameter values, error handling, diagnostic output etc.):

Code: Select all

struct TuningPosition {
    char    m_fen[120];
    double  m_result; // 1.0/0.5/0.0 from white viewpoint
};

class Tuner {
public:
    void run(std::string trainingFilePath, std::string parameterFilePath, uint maxIterations);

private:
    double averageError                 ();

    std::vector<TuningPosition>         m_tuningPos;   // filled by run() from the training file
    Board                               m_board;
};

static Parameter * tuningParam[] = {
    &(PARAM_INSTANCE(Material_Pawn_MG)),
    &(PARAM_INSTANCE(Material_Knight_MG)),
    &(PARAM_INSTANCE(Material_Bishop_MG)),
    &(PARAM_INSTANCE(Material_Rook_MG)),
    &(PARAM_INSTANCE(Material_Queen_MG)),

    //&(PARAM_INSTANCE(Material_Pawn_EG)),
    &(PARAM_INSTANCE(Material_Knight_EG)),
    &(PARAM_INSTANCE(Material_Bishop_EG)),
    &(PARAM_INSTANCE(Material_Rook_EG)),
    &(PARAM_INSTANCE(Material_Queen_EG)),

    // ...
};

static double sigmoid(int score)
{
    static double const K = 1.0; // TODO
    return 1.0 / (1.0 + pow(10.0, -K * score / 400));
}

double Tuner::averageError()
{
    double sum = 0.0;
    uint nPos = (uint) m_tuningPos.size();
    for (uint i = 0; i < nPos; i++) {
        TuningPosition const & tp = m_tuningPos[i];
        (void) m_board.setupFromFEN(tp.m_fen);
        int score = evaluateForWhite(m_board);
        double sig = sigmoid(score);
        sum += (tp.m_result - sig) * (tp.m_result - sig);
    }
    return sum / nPos;
}

void Tuner::run(std::string trainingFilePath, std::string parameterFilePath, uint maxIterations)
{
    // read training data (FEN + result) from training file into vector "m_tuningPos"
    // ...

    double e0 = averageError();
    bool isImproved = true;
    for (uint it = 0; it < maxIterations && isImproved; it++) {
        isImproved = false;
        for (uint i = 0; i < ARRAY_SIZE(tuningParam); i++) {
            Parameter & param = *(tuningParam[i]);
            constexpr int inc[2] = { +1, -1 };  // try +1 first, then -1 (coordinate descent)
            for (uint j = 0; j < ARRAY_SIZE(inc) && !isImproved; j++) {
                param.add(inc[j]);
                double e = averageError();
                isImproved = (e < e0);
                if (isImproved) {
                    e0 = e;
                } else {
                    param.add(-inc[j]);         // revert the change if it did not reduce the error
                }
            }
        }
    }

    // save parameters to parameter file
    // ...
}
The evaluation score I have is from Deuterium's qsearch.
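As an aside on the K constant marked "TODO" in the quoted sigmoid: one common approach (not necessarily what Sven or Deuterium do) is to keep the evaluation fixed and scan for the K that minimizes the mean squared error before tuning any parameters. A minimal sketch, assuming a hypothetical averageErrorWithK(K) helper that computes the MSE with the given scaling constant:

Code: Select all

// Sketch only: fit the sigmoid scaling constant K before tuning the eval parameters.
// averageErrorWithK(K) is a hypothetical helper that computes the MSE of the
// current (untuned) evaluation using the given K in the sigmoid.
double averageErrorWithK(double K);

double fitK()
{
    double bestK = 1.0;
    double bestE = averageErrorWithK(bestK);
    for (double step = 1.0; step >= 0.001; step /= 10.0) {  // coarse-to-fine scan
        bool improved = true;
        while (improved) {
            improved = false;
            double candUp   = bestK + step;
            double candDown = bestK - step;
            double eUp      = averageErrorWithK(candUp);
            double eDown    = averageErrorWithK(candDown);
            if (eUp < bestE)        { bestE = eUp;   bestK = candUp;   improved = true; }
            else if (eDown < bestE) { bestE = eDown; bestK = candDown; improved = true; }
        }
    }
    return bestK;
}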
BrianNeal
Posts: 8
Joined: Sat Dec 26, 2020 5:58 pm
Full name: Brian Neal

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by BrianNeal »

Ferdy wrote: Mon Jan 11, 2021 11:26 am In every iteration I change the 300k pos by first shuffling the 1m+ training pos randomly.
So you are using mini-batches for training. Is that standard for Texel tuning, or is using a batch size equal to the size of the training set also fine? (leaving time considerations aside)
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Hello,

just an update on how I am trying to find the issue.

1. I switched to a material-only evaluator to reduce the number of potential bug sources.
2. In the end I changed to the CPW algorithm to get a picture of the situation.

So currently I measure things based on these changes, which gives me a baseline to compare against.
This takes some time because I play with the following settings in various combinations:

* using qs()
* using static evaluation
* different batch sizes
* different values of the scaling factor K
* step size

Because I now use a publicly known algorithm and a material-only evaluation, it is possible to compare
results with other people. For that it might be possible to use the "quiet-labeled.epd" file, which I think is available somewhere.

Thanks Pio
Thanks Ferdy
Thanks Sven

Regards
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Joost Buijs »

BrianNeal wrote: Mon Jan 11, 2021 7:21 pm
Ferdy wrote: Mon Jan 11, 2021 11:26 am In every iteration I change the 300k pos by first shuffling the 1m+ training pos randomly.
So you are using mini-batches for training. Is that standard for Texel tuning, or is using a batch size equal to the size of the training set also fine? (leaving time considerations aside)
Using a batch size equal to the size of the training set is perfectly fine; then you are doing batch GD instead of mini-batch GD. It will take somewhat longer to train, but it has the advantage that you don't have to shuffle your training positions.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Desperado wrote: Mon Jan 11, 2021 7:23 pm Hello,

just an update on how I am trying to find the issue.

1. I switched to a material-only evaluator to reduce the number of potential bug sources.
2. In the end I changed to the CPW algorithm to get a picture of the situation.

So currently I measure things based on these changes, which gives me a baseline to compare against.
This takes some time because I play with the following settings in various combinations:

* using qs()
* using static evaluation
* different batch sizes
* different values of the scaling factor K
* step size

Because I now use a publicly known algorithm and a material-only evaluation, it is possible to compare
results with other people. For that it might be possible to use the "quiet-labeled.epd" file, which I think is available somewhere.

Thanks Pio
Thanks Ferdy
Thanks Sven

Regards
I was able to identify the real problem now. I can say it is not an anchor, scaling factor or static evaluation issue.
Three algorithms now work fine, with slightly different behaviour. No particular combination matters at all. There is no problem with my code.

It is my self-created data. Neither my static evaluation nor my qs() are able to handle it.
Using the known "quiet-labeled.epd", everything works as expected.

I need to check whether I messed something up when creating my files with millions of positions,
or whether the positions are simply too noisy for my evaluation routines to handle.

Maybe I will come back with some questions on this topic.

All I can say for now is that I used CCRL games with both players rated above 3000, for example.
I created various files for various rating ranges, and some of them contain more than 10M or even more than 100M positions.
I pick random subsets as my working files, without any special filters.

I must say that this is really the last place where I expected the problem to be.

Regards
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ferdy »

BrianNeal wrote: Mon Jan 11, 2021 7:21 pm
Ferdy wrote: Mon Jan 11, 2021 11:26 am In every iteration I change the 300k pos by first shuffling the 1m+ training pos randomly.
So you are using mini-batches for training. Is that standard for Texel tuning, or is using a batch size equal to the size of the training set also fine? (leaving time considerations aside)
Sort of like a mini-batch, but with random shuffling, so a training position from a previous iteration may come back in later iterations. It also depends on what you are tuning; since this is just material and I know my training data is full of material imbalances, I don't need to evaluate the full 1m+ positions.
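A minimal sketch of this shuffle-then-take-a-batch scheme (names assumed, not Ferdy's actual code), reusing the TuningPosition struct from the code quoted earlier:

Code: Select all

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Sketch only: draw a fresh random mini-batch of batchSize positions from the
// full training set at the start of every tuning iteration.
std::vector<TuningPosition> makeBatch(std::vector<TuningPosition> & all,
                                      std::size_t batchSize,
                                      std::mt19937 & rng)
{
    std::shuffle(all.begin(), all.end(), rng);      // reshuffle the 1m+ positions
    batchSize = std::min(batchSize, all.size());    // e.g. take 300k of them
    return std::vector<TuningPosition>(all.begin(), all.begin() + batchSize);
}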
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Desperado wrote: Mon Jan 11, 2021 9:00 pm
Desperado wrote: Mon Jan 11, 2021 7:23 pm Hello,

just an update on how I am trying to find the issue.

1. I switched to a material-only evaluator to reduce the number of potential bug sources.
2. In the end I changed to the CPW algorithm to get a picture of the situation.

So currently I measure things based on these changes, which gives me a baseline to compare against.
This takes some time because I play with the following settings in various combinations:

* using qs()
* using static evaluation
* different batch sizes
* different values of the scaling factor K
* step size

Because I now use a publicly known algorithm and a material-only evaluation, it is possible to compare
results with other people. For that it might be possible to use the "quiet-labeled.epd" file, which I think is available somewhere.

Thanks Pio
Thanks Ferdy
Thanks Sven

Regards
I was able to identify the real problem now. I can say it is not an anchor, scaling factor or static evaluation issue.
Three algorithms now work fine, with slightly different behaviour. No particular combination matters at all. There is no problem with my code.

It is my self-created data. Neither my static evaluation nor my qs() are able to handle it.
Using the known "quiet-labeled.epd", everything works as expected.

I need to check whether I messed something up when creating my files with millions of positions,
or whether the positions are simply too noisy for my evaluation routines to handle.

Maybe I will come back with some questions on this topic.

All I can say for now is that I used CCRL games with both players rated above 3000, for example.
I created various files for various rating ranges, and some of them contain more than 10M or even more than 100M positions.
I pick random subsets as my working files, without any special filters.

I must say that this is really the last place where I expected the problem to be.

Regards
Yes, I think you are right. What your data says is that the EG result correlates a lot more with the result of the game. I can only see two reasons why that is the case. Either you lose a lot of games due to time losses (but I don't think that is the case), or the positions you have generated are not quiet, and since endgames have a greater proportion of quiet positions, those positions correlate a lot more with the endgame result.

I think that if you change your loss function to minimise the absolute error instead of the squared error, you will get better results, since the labelling errors will probably even out in that case. Of course, it is probably better to try to find out if and why the positions are not quiet.
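For reference, switching from the squared to the absolute error only changes the per-position term that gets accumulated in a loop like the averageError() shown earlier; a minimal sketch:

Code: Select all

#include <cmath>

// Sketch only: per-position error terms for the two loss functions discussed here.
// result is 1.0/0.5/0.0 from the white viewpoint, sig is the sigmoid of the score.
double squaredErrorTerm(double result, double sig)
{
    return (result - sig) * (result - sig);   // what the MSE accumulates
}

double absoluteErrorTerm(double result, double sig)
{
    return std::fabs(result - sig);           // what the mean absolute error accumulates
}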
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ferdy »

Loss functions that can be used in Texel tuning: for log loss, just consider the 1-0 or 0-1 results.

Code: Select all

from math import log

def CrossEntropy(sig_value, result):
    return -(result * log(sig_value) + (1 - result) * log(1 - sig_value))
Log loss is nice: the penalty grows the further your prediction is from the target result.

User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Pio wrote: Tue Jan 12, 2021 8:38 am
Desperado wrote: Mon Jan 11, 2021 9:00 pm
Desperado wrote: Mon Jan 11, 2021 7:23 pm Hello,

just an update on how I am trying to find the issue.

1. I switched to a material-only evaluator to reduce the number of potential bug sources.
2. In the end I changed to the CPW algorithm to get a picture of the situation.

So currently I measure things based on these changes, which gives me a baseline to compare against.
This takes some time because I play with the following settings in various combinations:

* using qs()
* using static evaluation
* different batch sizes
* different values of the scaling factor K
* step size

Because I now use a publicly known algorithm and a material-only evaluation, it is possible to compare
results with other people. For that it might be possible to use the "quiet-labeled.epd" file, which I think is available somewhere.

Thanks Pio
Thanks Ferdy
Thanks Sven

Regards
I was able to identify the real problem now. I can say it is not an anchor, scaling factor or static evaluation issue.
Three algorithms now work fine, with slightly different behaviour. No particular combination matters at all. There is no problem with my code.

It is my self-created data. Neither my static evaluation nor my qs() are able to handle it.
Using the known "quiet-labeled.epd", everything works as expected.

I need to check whether I messed something up when creating my files with millions of positions,
or whether the positions are simply too noisy for my evaluation routines to handle.

Maybe I will come back with some questions on this topic.

All I can say for now is that I used CCRL games with both players rated above 3000, for example.
I created various files for various rating ranges, and some of them contain more than 10M or even more than 100M positions.
I pick random subsets as my working files, without any special filters.

I must say that this is really the last place where I expected the problem to be.

Regards
Yes, I think you are right. What your data says is that the EG result correlates a lot more with the result of the game. I can only see two reasons why that is the case. Either you lose a lot of games due to time losses (but I don't think that is the case), or the positions you have generated are not quiet, and since endgames have a greater proportion of quiet positions, those positions correlate a lot more with the endgame result.

I think that if you change your loss function to minimise the absolute error instead of the squared error, you will get better results, since the labelling errors will probably even out in that case. Of course, it is probably better to try to find out if and why the positions are not quiet.
Hi Pio,

there are definitely a lot of positions included that are not quiet. The reason is very simple: I just took the CCRL PGN files,
created EPD files out of them and shuffled the positions. That's all. I hoped that using my quiescence search would handle that.

But it is not enough. The second point, which I realized from the start and which you now tend to confirm, is that the correlation
between the MG/EG scores plays a more important role than one would expect.

In the end it seems to be a mixture of both factors.

As a bonus I was able to find a small logical error. I use the qs() from my engine, which includes TT lookups. Of course those
need to be switched off when searching with different parameter vectors. But that did not change much; in any case, it did not
make the problem go away.
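A minimal sketch of how such a switch could look (a hypothetical flag, not Desperado's actual code): the point is that TT entries were scored with an older parameter vector, so qsearch must neither probe nor store them while the tuner measures the error.

Code: Select all

// Sketch only: disable transposition-table use inside qsearch while tuning.
static bool g_tuningMode = false;

// qsearch calls this before every TT probe and before every TT store.
bool ttUsable()
{
    return !g_tuningMode;
}

// In the tuner, around each error measurement:
//     g_tuningMode = true;
//     double e = averageError();
//     g_tuningMode = false;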

I am satisfied now with the result of my research. There are some new questions for me, but I will open new topics for them.

For now, I would like to take a closer look at the optimizations you mentioned before.
Maybe you can point me to the right place or give me a short introduction to what you talked about.

Regards
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Hello everybody,

to understand what is going on, I thought I would use a database that I did not generate myself.
So I used ccrl-40-15-elo-3200.epd from https://rebel13.nl/misc/epd.html.

Setup:

1. material-only evaluator
2. CPW algorithm
3. scaling factor K = 1.0
4. pawn values serve as the anchor for MG and EG: [100,100]
5. starting values [300,300], [300,300], [500,500], [1000,1000]
6. loss function uses the squared error
7. 50K sample size
8. phase value computation (see code below)

Code: Select all

    // game phase: 24 with full material, 0 in a pure pawn endgame
    int phase = 1 * Bit::popcnt(Pos::minors(pos));
    phase += 2 * Bit::popcnt(Pos::rooks(pos));
    phase += 4 * Bit::popcnt(Pos::queens(pos));
    phase = min(phase, 24);

    // tapered evaluation: interpolate between the MG and EG scores by phase
    int s = (score.mg * phase + score.eg * (24 - phase)) / 24;
    return pos->stm == WHITE ? s : -s;
Result: N[30 180] B[35 180] R[20 265] Q[-90 485] best: 0.112642 epoch: 218 (stepsize 5)
Result: N[176 212] B[183 214] R[193 295] Q[427 443] Done: best: 0.114679 epoch: 579 (stepsize 1)

It is clearly a data-related problem, in which the outcome of the game and the
distribution of phase values are strongly correlated.

It is interesting that the stepsize 1 solution completely hides that information.

I draw this conclusion because the result must be a local minimum and the mean error is significantly larger.

However, everybody is free to reproduce this now or to ignore it.
For me it is still an open question how to compensate for this effect when selecting training data.
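One common way to address this at the data-selection stage (a general heuristic, not something used by the engines in this thread) is to keep only positions that are already quiet, for example by requiring that qsearch does not change the static evaluation by more than a small margin. The helpers inCheck() and qsearchForWhite() and the margin below are assumptions; TuningPosition, Board and evaluateForWhite() are taken from the code quoted earlier.

Code: Select all

#include <cstdlib>
#include <vector>

// Sketch only: filter out non-quiet positions before tuning.
std::vector<TuningPosition> filterQuiet(std::vector<TuningPosition> const & all,
                                        Board & board,
                                        int margin)   // e.g. a few centipawns
{
    std::vector<TuningPosition> quiet;
    for (TuningPosition const & tp : all) {
        board.setupFromFEN(tp.m_fen);
        if (board.inCheck())
            continue;                                 // positions in check are never quiet
        int staticScore = evaluateForWhite(board);
        int qsScore     = qsearchForWhite(board);
        if (std::abs(qsScore - staticScore) <= margin)
            quiet.push_back(tp);
    }
    return quiet;
}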