Tapered Evaluation and MSE (Texel Tuning)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Desperado wrote: Wed Jan 13, 2021 11:44 pm
hgm wrote: Wed Jan 13, 2021 11:09 pm
Desperado wrote: Wed Jan 13, 2021 10:36 pm But we talk about why the tuner will diverge the mg and eg values while keeping the average of both constant.
That much I understood. But doing this should greatly drive up the rms error, agreed? Because keeping the average constant will prevent any effect on the prediction for those positions that have gamePhase = 0.5, but it sure as hell will have an enormous impact on positions with gamePhase = 0 or gamePhase = 1. Because these only depend on the mg or eg values, and do not care in the slightest about the other or the average of them.

And in particular the mg values are completely wrong, with a score of 68 cP for a position where you are a Queen up.

So basically you are talking about why the optimizer tries to make the error as large as possible, rather than minimize it...
Well, it is not intended to keep the average constant, that is part of the puzzle.

Yes, the mg values are completely wrong, and that is why this thread exists. I want to find out why and how that can happen.
The effect can be softened (using qs() instead of eval and using a better scaling factor K), but it does not disappear.
The effect does not appear at all without a tapered eval, because then the tuner is not able to split a term (for example a knight value) in any form.

As you point out, the phase bounds have a massive impact on the matter; that's why I began to analyse the data and produced a
file where the total error per phase is about equal. Each phase has the same share of influence on the total error and the mse.

That does not change the result either, but that is the place where I want to continue my analysis.
Somehow the tuner keeps reducing the mg values by an unreasonable proportion.

I cannot follow why the tuner would try to make the error as large as possible.
If the frequency of phase 24 (mg) is 15% and the error component of such a position is larger than the average error, then the two quantities together also produce the largest contribution to the total error. Apply the same logic to phases 22 and 20, and the three most middlegame-like phases already account for a share of 45%.
At the same time, the middlegame positions will produce a larger share of the error than an endgame position (quiet criterion) when using static evaluation. Of course, the tuner will get a better result if it accepts the endgame errors and minimizes the middlegame errors.
Unfortunately, the tuner achieves this by generating two values that are mathematically useful but nonsensical in terms of content.
I am a little bit tired and have probably not thought things through, but could it be that your set of data is very unbalanced? What I mean is that maybe you have not included 50% wins/losses and 50% draws. If you have generated the positions from games where your engine plays itself from a variety of balanced positions, you will probably have 90% draws. If that is the case and you haven't divided them into equally big groups, it becomes much more important for the tuner to get the prediction of the 90% right than that of the 10% of wins/losses.

This might explain your strange numbers. That the numbers differ between the experiments with step size 5 and step size 1 can then be explained by it being much more important to get a fine-grained eval than a good predictor of wins/losses. You can emulate a much finer eval since you have more degrees of freedom with MG and EG interpolation, and since the approximation of draws might be of much greater importance for you.

I have a couple more ideas on how to improve the Texel algorithm and the data generation, but I can take that up later once we have sorted out the big issues.
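For reference, a minimal sketch of the tapered evaluation and the Texel-style mean squared error being discussed (the 0...24 phase, the mg/eg blend and the sigmoid with scaling factor K follow the description above; the struct and function names are just illustrative):

Code: Select all

#include <math.h>

/* Illustrative position record: game result in {0.0, 0.5, 1.0} and a game
   phase in 0..24 (24 = full board). In a real tuner the mg/eg scores would
   be recomputed from the current parameter vector for every position. */
typedef struct { double result; int phase; int mg; int eg; } Pos;

/* Tapered eval: at phase 24 only the mg score counts, at phase 0 only the
   eg score. Positions in between see only the blend, which is why the tuner
   can push mg and eg apart while their weighted average stays constant. */
static int taperedEval(int mg, int eg, int phase) {
    return (mg * phase + eg * (24 - phase)) / 24;
}

/* Texel-style prediction: a logistic of the score with scaling factor K. */
static double sigmoid(double score, double K) {
    return 1.0 / (1.0 + pow(10.0, -K * score / 400.0));
}

/* Mean squared error over a batch of positions. */
static double mse(const Pos *pos, int n, double K) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double p = sigmoid(taperedEval(pos[i].mg, pos[i].eg, pos[i].phase), K);
        double e = pos[i].result - p;
        sum += e * e;
    }
    return sum / n;
}

Only positions with phase 0 or phase 24 constrain eg or mg individually; everything in between constrains only their blend.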
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

I am a little bit tired and have probably not thought things through, but could it be that your set of data might be very unbalanced? What I mean is that maybe you have not included 50% wins/losses and 50% draws. If you have generated the positions from games where your engine plays itself from a variety of balanced positions, you will probably have 90% draws. If that is the case and you haven't divided them into equally big groups, it becomes much more important for the tuner to get the prediction of the 90% right than that of the 10% of wins/losses.

This might explain your strange numbers. That the numbers differ between the experiments with step size 5 and step size 1 can then be explained by it being much more important to get a fine-grained eval for draws than a good predictor of wins/losses. You can emulate a much finer eval since you have more degrees of freedom with MG and EG interpolation, and since the approximation of draws might be of much greater importance for you, the algorithm will use all the freedom it can.

I have a couple more ideas on how to improve the Texel algorithm and the data generation, but I can take that up later once we have sorted out the big issues.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Hi Pio,

Maybe you did not see that I switched to ccrl-40-15-elo-3200.epd from the Rebel website.
I summarized the settings for my tuning experiments one or two posts back.
Everything can be reproduced now outside my box. Ferdy mentioned he will be providing some results for that data too.

It's already late for me...
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Sorry, I missed that, but couldn't the dataset be unbalanced too? I couldn't see whether it was balanced.

Going to sleep as well
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

That the numbers differ between the experiments with step size 5 and step size 1 can then be explained by it being much more important to get a fine-grained eval than a good predictor of wins/losses.
Did you notice that the total error with step size 5 was less than with step size 1? Significantly!
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ferdy »

Desperado wrote: Wed Jan 13, 2021 10:22 pm
Ferdy wrote: Wed Jan 13, 2021 9:42 pm
Desperado wrote: Wed Jan 13, 2021 1:04 pm Hello everybody,

To understand what is going on, I thought I could use a database that I did not generate myself.
So I used ccrl-40-15-elo-3200.epd from https://rebel13.nl/misc/epd.html.
Tried the ccrl-3200.epd, will post results later.
Nice!

What is your complete setup?

* Algorithm: I used the cpw algorithm
* Phase: 0...24 (1,2,4 / minor,rook,queen), 24 == full board
* Start: 100 fixed / 300 / 300 / 500 / 1000 (both mg,eg)
* Stepsize: 1 and 5
* Loss function: squared error
* Batchsize: 50K (the first 50K of the file, no modification)
* EvalParam: I used a material-only eval
* EvalFunc: static eval (qs is possible)
* scalingfactorK: 1.0 (can be configured to anything)?

Besides the algorithm (I used cpw), everything is easy to configure on my side.
I can use the same configuration as you, so we can compare later. I only need to know the setup parameters.
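For reference, a minimal sketch of how a 0...24 phase with those weights is typically computed (1 per minor, 2 per rook, 4 per queen); the piece counts are just illustrative inputs, not the actual data structures used here:

Code: Select all

/* Game phase from the piece counts of both sides: minor = 1, rook = 2,
   queen = 4, so 4 minors + 4 rooks + 2 queens = 24 = full board. */
static int gamePhase(int minors, int rooks, int queens) {
    int phase = 1 * minors + 2 * rooks + 4 * queens;
    return phase > 24 ? 24 : phase;   /* cap in case of promoted pieces */
}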
I set my settings to be the same as yours except the batch size. Step size is only 5.

This training data is difficult for material-only eval tuning. I used a 50k batch size out of 4m positions, and the initial mse was not improved, so I ended up with param values without changes.

What happened was: first I calculate the initial mse (from the initial param values) using 50k positions from the randomized 4m positions, and call it the current best mse. If this mse happens to be small, then the succeeding mse values may no longer be able to improve on it.

After calculating the initial mse, the first iteration begins: randomize the 4m training positions and select 50k of them. Now, if none of the parameter values tried at +/-5 can improve over the initial mse, go on to the next iteration, randomize the training positions again and select a new 50k. If there are 3 successive iterations that do not improve the current mse, I abort the tuning. That is what happened when I tried the 50k batch size.
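A rough sketch of that loop, assuming simple +/-5 probing per parameter on a freshly shuffled batch each iteration and an abort after 3 iterations in a row without improvement (Pos is the illustrative record from the sketch further up; shuffle() and batchMse() are assumed helper names, not the actual code):

Code: Select all

#define STEP      5
#define MAX_FAILS 3

/* Assumed helpers: shuffle the position array; compute the mse of the first
   `batch` positions of the (shuffled) array for the given parameters. */
void   shuffle(Pos *pos, int n);
double batchMse(const int *param, const Pos *pos, int batch, double K);

void tune(int *param, int nParams, Pos *all, int nAll, int batch, double K) {
    shuffle(all, nAll);
    double best = batchMse(param, all, batch, K);        /* initial mse */
    int fails = 0;

    while (fails < MAX_FAILS) {
        shuffle(all, nAll);                              /* fresh random batch */
        int improved = 0;
        for (int i = 0; i < nParams; i++) {
            param[i] += STEP;                            /* try +5 */
            double e = batchMse(param, all, batch, K);
            if (e < best) { best = e; improved = 1; continue; }
            param[i] -= 2 * STEP;                        /* try -5 */
            e = batchMse(param, all, batch, K);
            if (e < best) { best = e; improved = 1; continue; }
            param[i] += STEP;                            /* revert */
        }
        fails = improved ? 0 : fails + 1;
    }
}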

Next is 120k pos, so far so good.



Tuning aborted.

Code: Select all

successive iteration without error improvement: 3
Exit tuning, error is not improved.

Iterations: 6
Best mse: 0.11087187526675499
Best parameters:
+----------+--------+---------+
| par      |   init |   tuned |
+==========+========+=========+
| KnightOp |    300 |     310 |
+----------+--------+---------+
| KnightEn |    300 |     290 |
+----------+--------+---------+
| BishopOp |    300 |     310 |
+----------+--------+---------+
| BishopEn |    300 |     310 |
+----------+--------+---------+
| RookOp   |    500 |     490 |
+----------+--------+---------+
| RookEn   |    500 |     490 |
+----------+--------+---------+
| QueenOp  |   1000 |    1005 |
+----------+--------+---------+
| QueenEn  |   1000 |     990 |
+----------+--------+---------+

Next I will try to run at 300k batch size.
Last edited by Ferdy on Thu Jan 14, 2021 8:36 am, edited 2 times in total.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Desperado wrote: Thu Jan 14, 2021 12:50 am
That the numbers differ between the experiments with step size 5 and step size 1 can then be explained by it being much more important to get a fine-grained eval than a good predictor of wins/losses.
Did you notice that the total error with step size 5 was less than with step size 1? Significantly!
The total error for step size 5 was a tiny bit better than for step size 1, and that might be because step size 1 happened to settle in a local optimum worse than the one step size 5 found.
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Pio »

Probably the balancing should be like 1/3 wins, 1/3 draws and 1/3 losses. If you don’t balance the games with respect to results it will try to predict that almost everything is a draw.

Will you go sailing only during winter if you know that 90 % of the drowning accidents occur during summer?
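A minimal sketch of what such a result-balanced selection could look like, reusing the illustrative Pos record from the earlier sketch (the function is an assumption for illustration, not code from anyone's tuner):

Code: Select all

/* Take the same number of wins, draws and losses from the training set so
   that draws cannot dominate the mse. `out` must have room for nIn entries. */
int balanceByResult(const Pos *in, int nIn, Pos *out) {
    int cnt[3] = {0, 0, 0};                   /* losses, draws, wins */
    for (int i = 0; i < nIn; i++) cnt[(int)(in[i].result * 2.0)]++;

    int take = cnt[0];                        /* smallest class decides */
    if (cnt[1] < take) take = cnt[1];
    if (cnt[2] < take) take = cnt[2];

    int used[3] = {0, 0, 0}, n = 0;
    for (int i = 0; i < nIn; i++)
        if (used[(int)(in[i].result * 2.0)]++ < take) out[n++] = in[i];
    return n;                                 /* size of the balanced set */
}

Instead of discarding positions, one could also weight the error terms per result class; equal-sized groups are simply the easiest way to test the point.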

Putting anchors on both the mg and the eg pawn value is really bad. If you use an anchor, only do it for one of the two values.
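A tiny illustration of that anchoring idea, with made-up parameter indices (PAWN_MG and PAWN_EG are assumptions, not names from the thread):

Code: Select all

/* Pin only the mg pawn value as the scale anchor and let the eg pawn value
   float. In the tuning loop it is enough to skip the anchored index. */
enum { PAWN_MG, PAWN_EG, KNIGHT_MG, KNIGHT_EG /* , ... */ };

static int isAnchored(int i) {
    return i == PAWN_MG;   /* exactly one anchor */
}

/* inside the parameter loop:  if (isAnchored(i)) continue;  */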
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Hello Ferdy,

First of all, thank you for your efforts.

However, it makes me wonder a little bit that a random selection of positions cannot produce a single change to the defined input vector (100,300,300,500,900 mg+eg).
That doesn't sound plausible.

It is also interesting that your algorithm treats the data differently. You define a Best-MSE for a selection of 50K and then change the data in the course of the algorithm. Of course, I understand that you then determine a new Best-MSE value as the reference.

I use the first 50K of the file (this is ultimately also random, or are the positions related to each other, since they follow from a game sequence? I can easily avoid that by shuffling the data). I do not change the database until the parameter vector remains unchanged.

Anyway, I can adjust my algorithm without any effort. I have the appropriate functions at my disposal.
Last edited by Desperado on Thu Jan 14, 2021 9:36 am, edited 3 times in total.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Pio wrote: Thu Jan 14, 2021 8:52 am Probably the balancing should be like 1/3 wins, 1/3 draws and 1/3 losses. If you don’t balance the games with respect to results it will try to predict that almost everything is a draw.
Interesting point! I will have a closer look into that idea. Until now I have only tried to balance the game phases or the total error per game phase.
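In the same spirit, a small sketch of balancing by phase count rather than by total error per phase (a simplification of what is described above; Pos is the illustrative record from the earlier sketches):

Code: Select all

/* Cap every phase bucket (0..24) at the size of the smallest non-empty one,
   so that no phase dominates the mse simply by being more frequent. */
int balanceByPhase(const Pos *in, int nIn, Pos *out) {
    int count[25] = {0}, used[25] = {0};
    for (int i = 0; i < nIn; i++) count[in[i].phase]++;

    int cap = nIn;                            /* smallest non-empty bucket */
    for (int p = 0; p <= 24; p++)
        if (count[p] > 0 && count[p] < cap) cap = count[p];

    int n = 0;
    for (int i = 0; i < nIn; i++)
        if (used[in[i].phase]++ < cap) out[n++] = in[i];
    return n;
}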