Tapered Evaluation and MSE (Texel Tuning)
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Tapered Evaluation and MSE (Texel Tuning)
I would only believe that if you would show the result with various constant evaluations with the new database, and it would look like a parabola.
Because it should not matter what database you use. It should be a parabola for any database. And it doesn't seem to me that changing an interpretation should change the numbers. The only thing that could cure it is to change the calculation. Has that been done?
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Tapered Evaluation and MSE (Texel Tuning)
hgm wrote: ↑Sat Jan 16, 2021 5:16 pm
I would only believe that if you would show the result with various constant evaluations with the new database, and it would look like a parabola.
Because it should not matter what database you use. It should be a parabola for any database. And it doesn't seem to me that changing an interpretation should change the numbers. The only thing that could cure it is to change the calculation. Has that been done?
Here you are, constant eval 0...5. MSE for 4041988 positions.
Code: Select all
constant 0: 0.1029451349187578
constant 1: 0.1028744839553837
constant 2: 0.1028079761630773
constant 3: 0.1027456123021462
constant 4: 0.1026873929086546
constant 5: 0.1026333182236804
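A minimal sketch of this constant-eval check (hypothetical names; standard Texel sigmoid) also shows why the curve has to come out as a parabola for any database:
Code: Select all
#include <cmath>
#include <vector>

// Texel sigmoid: expected score for a centipawn eval at scaling K
double sigmoid(double score, double K)
{
    return 1.0 / (1.0 + std::pow(10.0, -K * score / 400.0));
}

// MSE for a constant evaluation c. Writing p = sigmoid(c, K):
//   mse(c) = avg((r_i - p)^2) = var(r) + (avg(r) - p)^2,
// an exact parabola in p with its minimum where p = avg(r),
// independent of which database supplies the results r_i.
double mseConstant(const std::vector<double>& results, double c, double K)
{
    double e = 0.0;
    for (double r : results)
    {
        double d = r - sigmoid(c, K);
        e += d * d;
    }
    return e / results.size();
}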
There was an stm issue, a different database, and maybe another thing I am not aware of.
I did not need to change my error-computation code, which was posted somewhere near the beginning of the thread; even that you can look up.
There is another route you can trust: compute the MSE yourself and compare it with the mentioned posts. Then you have verified for yourself that the numbers are now OK.
Nothing more to say on that. The code computes the error correctly. If you don't believe it, please report a different MSE for the given database; then we can discuss further which one is right or wrong.
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Tapered Evaluation and MSE (Texel Tuning)
hgm wrote: ↑Sat Jan 16, 2021 5:16 pm
I would only believe that if you would show the result with various constant evaluations with the new database, and it would look like a parabola.
Because it should not matter what database you use. It should be a parabola for any database. And it doesn't seem to me that changing an interpretation should change the numbers. The only thing that could cure it is to change the calculation. Has that been done?
Additionally, with "interpretation" I referred to the "stm pov" or "white pov" content of the used data set and the stored results, of course not to some interpretation of the measured numbers. However, it is not relevant anymore.
If you are interested, you can follow the thread from "Fri Jan 15, 2021 4:30 pm". Beside the outdated numbers you referred to, we should have the same base now for further discussion.
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Tapered Evaluation and MSE (Texel Tuning)
OK, this looks like a parabola; the minimum extrapolates to a value for the constant around 18. So I am convinced, and we seem to be able to fit a 1-parameter evaluation function.
Now a useful first step would be to forget about the tapering, and test if you can fit sensible piece values that are constant during the game.
Then, because it is the opening values that are consistently getting garbage values, I would take the subset of all positions with phase >=28, and see what happens if you fit phase-independent values to these.
If that doesn't give sensible values, I would select the subset of the positions with phase=31, to see what it makes of that. Or those with phase=32, to see if it can get at least a reasonable Pawn value from that. (Obviously the other piece values would not have any effect then, and could get any value.)
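A minimal sketch of that subset selection, assuming a hypothetical entry_t layout:
Code: Select all
#include <vector>

// Hypothetical tuning entry: phase plus game result per position.
struct entry_t
{
    int phase;      // whatever scale the engine uses (0-32 here)
    double result;  // game result, 1 / 0.5 / 0
};

// Keep only positions at or above a phase threshold, so that
// phase-independent piece values can be fitted to the opening-ish
// subset in isolation (minPhase = 28, or 31/32 for the narrow tests).
std::vector<entry_t> phaseSubset(const std::vector<entry_t>& all, int minPhase)
{
    std::vector<entry_t> sub;
    for (const entry_t& e : all)
        if (e.phase >= minPhase)
            sub.push_back(e);
    return sub;
}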
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Tapered Evaluation and MSE (Texel Tuning)
hgm wrote: ↑Sat Jan 16, 2021 6:40 pm
OK, this looks like a parabola; the minimum extrapolates to a value for the constant around 18. So I am convinced, and we seem to be able to fit a 1-parameter evaluation function.
Now a useful first step would be to forget about the tapering, and test if you can fit sensible piece values that are constant during the game.
Then, because it is the opening values that are consistently getting garbage values, I would take the subset of all positions with phase >=28, and see what happens if you fit phase-independent values to these.
If that doesn't give sensible values, I would select the subset of the positions with phase=31, to see what it makes of that. Or those with phase=32, to see if it can get at least a reasonable Pawn value from that. (Obviously the other piece values would not have any effect then, and could get any value.)
Algorithm: cpw-algorithm
Data: first n positions of the database
initial vector: 100,300,300,500,1000
anchor: P(100)
K: 1.0
batchsize: 50K
evaltype: static evaluation - non-tapered
param-content: N,B,R,Q
Code: Select all
75 85 110 135 best: 0.115371 epoch: 173 // ccrl_3200_texel.epd
390 410 640 1260 best: 0.073142 epoch: 52 // quiet-labeled.epd
Data: first n positions of the database
initial vector: 100,300,300,500,1000
anchor: P(100)
K: 1.0
batchsize: 50K
evaltype: qs() - non tapered
param-content: N,B,R,Q
Code: Select all
95 105 150 235 best: 0.113058 epoch: 153 // ccrl_3200_texel.epd
345 360 560 1045 best: 0.097876 epoch: 20 // quiet-labeled.epd
Especially for ccrl_3200_texel.epd the impact is significantly higher, because of the many non-quiet positions with a corresponding mg phase. So please do not discuss this now; I will come back to that topic later.
More important is that the MG-EG ratio is very different in the two sources. While the first database includes a natural MG-EG ratio, because it is generated out of games, quiet-labeled is a collection. More on that later, please.
So, these are comprehensible values. At the same time it shows that the algorithm works (quiet-labeled.epd).
Well, just for you HG: my gamephase values are 0...24, where 24 is a full board (weights 1, 2, 4 for minor, rook, queen).
What do you want to do next? I am open minded.
(Did you already read the latest posts from Ferdy and my answer?)
NON-TAPERED EVAL
Code: Select all
int Eval::full(pos_t* pos)
{
    int score = 0;
    for(int c = WHITE; c <= BLACK; c++)
    {
        for(int p = WP + c; p <= WQ + c; p += 2)
        {
            score += Bit::popcnt(pos->bb[p]) * mgMat[PID(p)];
        }
        // negate after each color pass, so that after both passes
        // the score is white material minus black material
        score = -score;
    }
    // note: this returns a white-pov score; the stm conversion used
    // in the tapered version below is missing here, which is the bug
    // acknowledged a few posts further down
    return score;
}
TAPERED EVAL
Code: Select all
int Eval::full(pos_t* pos)
{
    int cnt;
    score_t score = {0,0};
    for(int c = WHITE; c <= BLACK; c++)
    {
        for(int p = WP + c; p <= WQ + c; p += 2)
        {
            cnt = Bit::popcnt(pos->bb[p]);
            score.mg += cnt * mgMat[PID(p)];
            score.eg += cnt * egMat[PID(p)];
        }
        // negate after each color pass: after both passes the scores
        // are white minus black
        score.mg = -score.mg;
        score.eg = -score.eg;
    }
    // game phase 0...24: weights 1/2/4 for minor/rook/queen,
    // 24 = full board
    int phase = 1 * Bit::popcnt(Pos::minors(pos));
    phase += 2 * Bit::popcnt(Pos::rooks(pos));
    phase += 4 * Bit::popcnt(Pos::queens(pos));
    phase = min(phase, 24);
    // linear interpolation between mg and eg score by phase
    int s = (score.mg * phase + score.eg * (24 - phase)) / 24;
    // convert from white pov to stm pov
    return pos->stm == WHITE ? s : -s;
}
Code: Select all
double Tuner::minimize(param_t* param)
{
    double fitness;
    bool improved = TRUE;
    int backup;
    // mse for the current vector
    double bestFitness = mse();
    // loop over each parameter in the vector
    for(int i = 0; i < pcount; i++)
    {
        improved = FALSE;
        backup = *param[i].score;
        // try a step of +5 first (the fixed step of 5 is why all
        // tuned values come out as multiples of 5)
        *param[i].score = backup + 5;
        fitness = mse();
        if(fitness < bestFitness)
        {
            bestFitness = fitness;
            improved = TRUE;
        }
        else
        {
            // +5 did not help, try -5 instead
            *param[i].score = backup - 5;
            fitness = mse();
            if(fitness < bestFitness)
            {
                bestFitness = fitness;
                improved = TRUE;
            }
        }
        // neither direction improved: restore the original value
        if(improved == FALSE)
            *param[i].score = backup;
    }
    return bestFitness;
}
Code: Select all
void Tuner::solve()
{
    double fitness, bestFitness = mse();
    int epoch;
    printf("\nStart with %.16f", bestFitness);
    getchar();
    // repeat full passes over the parameter vector until a pass
    // no longer improves the mse
    for(epoch = 0; epoch < 10000; epoch++)
    {
        fitness = minimize(param);
        if(fitness < bestFitness)
            bestFitness = fitness;
        else break;
        printParameters(param);
        printf(" best: %f epoch: %d", bestFitness, epoch);
    }
    printParameters(param);
    printf("Done: best: %f epoch: %d", bestFitness, epoch);
}
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Tapered Evaluation and MSE (Texel Tuning)
Oh, sorry, I somehow switched back to thinking that they were 0-32.
Do you anchor the P value (to 100) and fix K (to 1.0)? That would be wrong. You can only fix one of the two. If I understand correctly what you are posting, you have two sets of positions; the one gives reasonable values (apart from some scaling), like N/B/R/Q = 390/410/640/1260, the other gives garbage.
I guess the conclusion should be that the ccrl_3200 set just doesn't contain information on the piece values. Or at least, not enough to stick out above the noise caused by the other position aspects, so that it tries to fit the other aspects by tuning the piece values. (Which is possible through accidental correlations between the material balance and the other aspects.)
You cannot extract information that is not there.
To shed more light on this you can split the set into sub-sets containing only positions with a certain given imbalance, and check whether the score for that imbalance is approximately what you expect. E.g. you get a Q only marginally worth more than a Pawn (135 vs 100). So from the entire ccrl_3200 set, take all positions where the imbalance is a Queen vs 1 or 2 Pawns. And to be sure that there are no positions in there where the Pawn unavoidably promotes, only accept positions with phase >= 12.
The 'best' values found by the algorithm predict that to be about equal, right? Queen vs 1 Pawn is a slight advantage for the Queen, Queen vs 2 Pawns a larger advantage for the Pawns. So on average the Pawns should beat the Queen. Count the total score in that data set to see if it indeed supports that. If it does, there must either be something very wrong in the scoring of the data set (and it should be obvious what, by looking at a few examples from that huge majority of positions where 1 or 2 Pawns beat a Queen), or the positions must be very atypical for Q vs 2P positions.
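A minimal sketch of that counting experiment, assuming a hypothetical entry layout with white-minus-black piece-count differences and white-pov results:
Code: Select all
#include <vector>

// Hypothetical entry: diff[] holds the white-minus-black piece counts
// for P,N,B,R,Q; result is the game result from white's pov.
struct entry_t
{
    int diff[5];
    int phase;
    double result;
};
enum { P, N, B, R, Q };

// Average score of the queen side over all positions whose only
// imbalance is Q vs 1 or 2 pawns, with phase >= 12 so that no pawn
// is about to promote. A value above 0.5 means the queen side wins
// on average, contradicting near-equal Q and P piece values.
double qVsPawnsScore(const std::vector<entry_t>& set)
{
    double sum = 0.0;
    int n = 0;
    for (const entry_t& e : set)
    {
        if (e.phase < 12 || e.diff[N] || e.diff[B] || e.diff[R])
            continue;
        if (e.diff[Q] == 1 && (e.diff[P] == -1 || e.diff[P] == -2))
        {
            sum += e.result;          // white has the queen
            n++;
        }
        else if (e.diff[Q] == -1 && (e.diff[P] == 1 || e.diff[P] == 2))
        {
            sum += 1.0 - e.result;    // black has the queen: mirror the result
            n++;
        }
    }
    return n ? sum / n : 0.5;
}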
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Tapered Evaluation and MSE (Texel Tuning)
Sorry, before I read your post, I need to measure again. I introduced a bug in the non-tapered eval that I had written because of your request.
To be clear, I did not use it before. It is important that I apply the stm condition when returning a score. I'll fix that...
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Tapered Evaluation and MSE (Texel Tuning)
Sorry for the noise. Here are the latest measurements:
Sorry, HG, for wasting your time because of this careless mistake. I guess you need to look at the numbers again.
Algorithm: cpw-algorithm
Data: first n positions of the database
initial vector: 100,300,300,500,1000
anchor: P(100)
K: 1.0
batchsize: 50K
evaltype: static evaluation - non-tapered
param-content: N,B,R,Q
Code: Select all
210 225 310 625 best: 0.108797 epoch: 75 // ccrl_3200_texel.epd
390 410 640 1260 best: 0.073142 epoch: 52 // quiet-labeled
Algorithm: cpw-algorithm
Data: first n positions of the database
initial vector: 100,300,300,500,1000
anchor: P(100)
K: 1.0
batchsize: 50K
evaltype: qs() non-tapered
param-content: N,B,R,Q
Code: Select all
320 335 475 955 best: 0.100884 epoch: 9 // ccrl_3200_texel.epd
395 420 655 1285 best: 0.067833 epoch: 57 // quiet-labeled
They look pretty normal. The lower numbers for ccrl_3200_texel.epd could be noise, because it includes more non-quiet/tactical positions; it is just based on a PGN-to-EPD conversion (that is my information). Well, I did not fix K; 1.0 is just the initial setup and the natural K. Using an anchor now was an arbitrary choice.
I will now measure and provide the tapered-eval results for the same settings.
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Tapered Evaluation and MSE (Texel Tuning)
It seems that you have indeed many non-quiet positions in there, which spoil the values. And that using QS instead of eval solves that. The values obtained with QS look quite normal.
I am still a bit worried about this anchor business. Perhaps K=1.0 happens to be the value that belongs with P=100, so that it doesn't hurt much.
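One way to settle the anchor question is to fit K itself while the evaluation (with its P = 100 anchor) stays fixed; a minimal grid-scan sketch, where mseForK is a hypothetical callback that recomputes the error with sigmoid(score, K):
Code: Select all
#include <functional>

// Scan for the scaling constant K that minimizes the MSE while the
// evaluation stays fixed; this is the alternative to fixing both
// P = 100 and K = 1.0 at once.
double fitK(const std::function<double(double)>& mseForK)
{
    double bestK = 1.0;
    double bestE = mseForK(bestK);
    for (double K = 0.1; K <= 3.0; K += 0.01)
    {
        double e = mseForK(K);
        if (e < bestE)
        {
            bestE = e;
            bestK = K;
        }
    }
    return bestK;
}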
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Tapered Evaluation and MSE (Texel Tuning)
I guess my analysis comes to an end.
Algorithm: cpw-algorithm
Data: first n positions of the database
initial vector: 100,300,300,500,1000
anchor: P(100)
K: 1.0
batchsize: 50K
evaltype: static evaluation - tapered
param-content: P,N,B,R,Q
Code: Select all
// columns: P N B R Q ('--' = anchored pawn MG)
MG: --  10  20 -45 -100
EG: 15 250 260 400 795 best: 0.103522 epoch: 220 // ccrl_3200_texel.epd
MG: -- 520 545 675 1500
EG: 185 445 475 820 1495 best: 0.069622 epoch: 100 // quiet-labeled.epd
Algorithm: cpw-algorithm
Data: first n positions of the database
initial vector: 100,300,300,500,1000
anchor: P(100)
K: 1.0
batchsize: 50K
evaltype: qs() tapered
param-content: P,N,B,R,Q
Code: Select all
// columns: P N B R Q ('--' = anchored pawn MG)
MG: -- 275 290 295 880
EG: 55 280 300 490 810 best: 0.099596 epoch: 41 // ccrl_3200_texel.epd
MG: -- 570 600 730 1610
EG: 195 455 490 850 1540 best: 0.063943 epoch: 122 // quiet-labeled.epd
The data looks comprehensible now. There are no diverging numbers anymore when using qs().
The static-evaluation results stay like that, and they must be like that, because many unbalanced positions exist where a recapture can be done, or something like that; too many unbalanced positions weigh in too much. qs() balances that out. But why did I have the diverging numbers in the beginning?
I simulated the bug I had in the days before with the current data, to show the effect. It was another database, and I used the result as white pov while it really was stm pov. At the beginning qs() also included TT lookups, which surely reinforced the problem.
Code: Select all
// columns: P N B R Q
BUG1 (epd result info was stm pov but used as white pov; anchor pawn MG)
MG: -- 155 140 205 120
EG: -45 -25 -10 -30 115 best: 0.103880 epoch: 177 // ccrl_3200_texel.epd
BUG2 (epd result info was stm pov but used as white pov; anchor pawn EG)
MG: -45 10 -10 -15 100
EG: -- 135 160 225 315 best: 0.104583 epoch: 181 // ccrl_3200_texel.epd
That looks very similar to the static-evaluation result in the context of the diverging numbers.
My assumption was that both diverging results were related to the same cause. But the first one is fine: the quiet criterion is missing there, so that will happen whenever the data is not prepared in that manner.
The second was a simple bug (not directly in my code, but in the processing step, so to say); a sketch of the missing conversion follows below.
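A minimal sketch of the conversion that was missing:
Code: Select all
enum { WHITE, BLACK };

// The pov bug in one line: a white-pov game result has to be flipped
// for black-to-move positions before it is compared with an stm-pov
// score (sketch; WHITE/BLACK as in the engine).
double resultForStm(double whitePovResult, int stm)
{
    return stm == WHITE ? whitePovResult : 1.0 - whitePovResult;
}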
Finally, I will report some useful numbers with optimal K values and a bigger batch size. Maybe some followers will want to use those values as startup values.
Already now, a very special thank you to these three guys:
1. Sven - he triggered me to simplify my code when he told me about the cpw algorithm.
2. Ferdy - a big thank you for the effort and for providing the MSE information to verify my code and error computation. That was so helpful.
3. HG - he triggered me to check the pov of the results in the database (not of my code).
I'll report some material values from ccrl_3200_texel.epd later. I really hope I can close the book on this subject for myself.
It was a real journey and a hard job.
