Tapered Evaluation and MSE (Texel Tuning)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Here is the phase distribution of the file ccrl-40-15-elo-3200.epd, computed with the formula above.

Code: Select all

 Phase 0: [7252] 0.005
 Phase 1: [5564] 0.004
 Phase 2: [66899] 0.044
 Phase 3: [20326] 0.013
 Phase 4: [100173] 0.065
 Phase 5: [30753] 0.020
 Phase 6: [116946] 0.076
 Phase 7: [30194] 0.020
 Phase 8: [86460] 0.056
 Phase 9: [21681] 0.014
 Phase 10: [87846] 0.057
 Phase 11: [19751] 0.013
 Phase 12: [70576] 0.046
 Phase 13: [14432] 0.009
 Phase 14: [72276] 0.047
 Phase 15: [15655] 0.010
 Phase 16: [59808] 0.039
 Phase 17: [19346] 0.013
 Phase 18: [98192] 0.064
 Phase 19: [16546] 0.011
 Phase 20: [142588] 0.093
 Phase 21: [15115] 0.010
 Phase 22: [168848] 0.110
 Phase 23: [13979] 0.009
 Phase 24: [236174] 0.154

More than 5%...

 Phase 4: [100173] 0.065
 Phase 6: [116946] 0.076
 Phase 8: [86460] 0.056
 Phase 10: [87846] 0.057
 Phase 18: [98192] 0.064
 Phase 20: [142588] 0.093
 Phase 22: [168848] 0.110
 Phase 24: [236174] 0.154
Phases 20, 22 and 24 together account for about 45%. As pointed out already, the error for mg positions should be larger
than the eg error; of course I can measure that too. For me it is still clear that the tuner wants to eliminate the mg error
as much as possible.
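
For reference, here is a minimal sketch of how the phase and such a histogram could be computed, assuming the weights used in this thread (minor=1, rook=2, queen=4, full board == 24). The function and variable names are illustrative, not the actual tuner code.

Code: Select all

// Sketch: game phase 0..24 from total piece counts, plus a phase histogram.
#include <cstdio>

int computePhase(int knights, int bishops, int rooks, int queens)
{
    int phase = 1 * (knights + bishops) + 2 * rooks + 4 * queens;
    return phase > 24 ? 24 : phase;   // clamp, e.g. after promotions
}

int main()
{
    long long histogram[25] = {0};
    // for each position in the EPD file: histogram[computePhase(...)]++;
    histogram[computePhase(4, 4, 4, 2)]++;   // start position -> phase 24
    for (int p = 0; p <= 24; ++p)
        printf("Phase %d: [%lld]\n", p, histogram[p]);
    return 0;
}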

Now I will build a file with balanced phases and look into the results...
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Analysis: phase / count / error (values 100,300,300,500,1000 for both mg and eg)

Phase 0: [7252] 1873.056
Phase 1: [5564] 1768.902
Phase 2: [66899] 17770.634
Phase 3: [20326] 5905.695
Phase 4: [100173] 26904.363
Phase 5: [30753] 9425.472
Phase 6: [116946] 31809.222
Phase 7: [30194] 9135.382
Phase 8: [86460] 23592.215
Phase 9: [21681] 6660.410
Phase 10: [87846] 23705.280
Phase 11: [19751] 5999.591
Phase 12: [70576] 19088.990
Phase 13: [14432] 4545.347
Phase 14: [72276] 19569.744
Phase 15: [15655] 4799.077
Phase 16: [59808] 16465.990
Phase 17: [19346] 5883.957
Phase 18: [98192] 26205.976
Phase 19: [16546] 5369.263
Phase 20: [142588] 37114.395
Phase 21: [15115] 5115.681
Phase 22: [168848] 43101.501
Phase 23: [13979] 4969.940
Phase 24: [236174] 59650.151

avErrSum: 17351.259757
New file: the sum of error per phase is nearly equal
Phase 0: [7252] 196.126
Phase 1: [5564] 602.491
Phase 2: [66899] 3565.721
Phase 3: [20326] 2539.556
Phase 4: [69277] 5000.015
Phase 5: [26542] 5000.009
Phase 6: [50012] 5000.024
Phase 7: [28497] 5000.002
Phase 8: [48888] 5000.119
Phase 9: [21681] 4087.962
Phase 10: [49110] 5000.001
Phase 11: [19751] 3516.368
Phase 12: [47448] 5000.116
Phase 13: [14432] 2886.263
Phase 14: [42699] 5000.205
Phase 15: [15655] 3016.635
Phase 16: [38649] 5000.005
Phase 17: [19346] 4016.885
Phase 18: [38632] 5000.191
Phase 19: [16546] 3588.302
Phase 20: [37445] 5000.100
Phase 21: [15115] 3328.338
Phase 22: [39339] 5000.239
Phase 23: [13979] 3110.675
Phase 24: [42593] 5000.145

avErrSum: 4144.020453
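
A minimal sketch of one way such a phase-balanced file could be built, assuming positions are simply skipped once their phase has used up a fixed error budget (the ~5000 cap visible in the listing above). The Sample struct and function names are placeholders, not the actual code.

Code: Select all

// Sketch: build a phase-balanced subset by capping the summed error per phase.
#include <vector>

struct Sample { int phase; double error; /* fen, result, ... */ };

std::vector<Sample> balanceByPhase(const std::vector<Sample>& positions,
                                   double errorBudgetPerPhase /* e.g. 5000.0 */)
{
    double used[25] = {0.0};
    std::vector<Sample> subset;
    for (const Sample& s : positions) {
        if (used[s.phase] + s.error > errorBudgetPerPhase)
            continue;                      // this phase already has its share
        used[s.phase] += s.error;
        subset.push_back(s);
    }
    return subset;
}
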
Using this file, "balanced" with respect to the error, doesn't help either! The tuner still reduces the mg values a lot more.
N 40 150 / B 50 155 / R 10 245 /Q -30 375 Done: best: 0.112146 epoch: 206

I am slowly running out of ideas
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Desperado wrote: Sun Jan 10, 2021 12:46 pm

Code: Select all

Material:   P,  N,  B,  R,  Q      P,  N,  B,   R,   Q
Start: MG: 80,300,320,500,980 EG:100,300,320, 500, 980
End:   MG:  5, 24, 32, 48, 68 EG:192,595,610,1020,1990
I have not read this entire discussion, and am entering it only now, so forgive me if I address points that have already been addressed.

Either there is something very wrong in your optimizing algorithm that prevents convergence, or you are not feeding it a wide enough variety of positions to determine the parameters unambiguously. The MG piece values you get as 'optimum' above are unacceptably low. With these values you will get nonsensical result predictions for early middle-game positions. Basically every such position will be predicted as a draw, or at least very close to 50-50, even those where one of the players is a Queen ahead. That should give a huge error, as in practice it would be a 100% result. So it cannot give an optimal fit with Q=68; it should always be better to increase the Q value.
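
To make that concrete, a small illustration (not from the thread itself), using the usual Texel sigmoid 1/(1 + 10^(-K*score/400)) with K = 1.0 as configured earlier: for a pure middlegame position (gamePhase = 1) where one side is a Queen up, compare the prediction with Q = 68 against Q = 980.

Code: Select all

// Illustration: predicted result for a pure-mg position a Queen up, K = 1.0.
#include <cstdio>
#include <cmath>

double winProb(double scoreCp, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * scoreCp / 400.0));
}

int main() {
    double K = 1.0;
    double queenValues[] = {68.0, 980.0};
    for (double q : queenValues) {
        double p = winProb(q, K);
        printf("Q = %4.0f: predicted %.3f, actual ~1.0 -> squared error %.5f\n",
               q, p, (1.0 - p) * (1.0 - p));
    }
    return 0;
}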

Of course when you have no positions in your test set where one of the players has all pieces, and the other only lacks a Queen, this prediction error will remain totally unnoticed, as the prediction will never have to be made. It becomes a bit like trying to tune the Rook value on a test set of positions none of which contains a Rook: anything flies.

My first guess is that this is your problem. You don't have positions that are heavily unbalanced in a very early game stage in the test set, because you took the positions from high-quality games, and GMs do not blunder away Queens early in the game. Opening positions in high-quality games will always be approximately equal, and by only showing it such positions you deluded the optimizer into thinking that only the game phase matters, and that early middle-game positions can always go either way. To make it understand that being a Queen or Rook behind is totally fatal even if it is the first piece you lose, a significant fraction of the positions should be Queen-odds or Rook-odds positions.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

hgm wrote: Wed Jan 13, 2021 5:31 pm
Desperado wrote: Sun Jan 10, 2021 12:46 pm

Code: Select all

Material:   P,  N,  B,  R,  Q      P,  N,  B,   R,   Q
Start: MG: 80,300,320,500,980 EG:100,300,320, 500, 980
End:   MG:  5, 24, 32, 48, 68 EG:192,595,610,1020,1990
I have not read this entire discussion, and am entering it only now, so forgive me if I address points that have already been addressed.

Either there is something very wrong in your optimizing algorithm that prevents convergence, or you are not feeding it a wide enough variety of positions to determine the parameters unambiguously. The MG piece values you get as 'optimum' above are unacceptably low. With these values you will get nonsensical result predictions for early middle-game positions. Basically every such position will be predicted as a draw, or at least very close to 50-50, even those where one of the players is a Queen ahead. That should give a huge error, as in practice it would be a 100% result. So it cannot give an optimal fit with Q=68; it should always be better to increase the Q value.

Of course when you have no positions in your test set where one of the players has all pieces, and the other only lacks a Queen, this prediction error will remain totally unnoticed, as the prediction will never have to be made. It becomes a bit like trying to tune the Rook value on a test set of positions none of which contains a Rook: anything flies.

My first guess is that this is your problem. You don't have positions that are heavily unbalanced in a very early game stage in the test set, because you took the positions from high-quality games, and GMs do not blunder away Queens early in the game. Opening positions in high-quality games will always be approximately equal, and by only showing it such positions you deluded the optimizer into thinking that only the game phase matters, and that early middle-game positions can always go either way. To make it understand that being a Queen or Rook behind is totally fatal even if it is the first piece you lose, a significant fraction of the positions should be Queen-odds or Rook-odds positions.
Hello HG,

thanks for joining. To give you a short summary:

From the start of the thread until now, every component has been simplified to isolate the issue. The current setup is...

* material only evaluation
* cpw algorithm (stepsize 5 or 1; see the sketch below)
* standard phase calculation - mg(24) eg(0), minor(1), rook(2), queen(4)
* scalingFactorK 1.0 (to analyse the situation)
* 50K sample size
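
For readers who do not know it, a rough sketch of the cpw-style local search mentioned above (adapted from the Chess Programming Wiki idea: try +step and -step on each parameter and keep the change if the mean squared error improves). The mse callback is a placeholder for evaluating all training positions; this is not the actual tuner code.

Code: Select all

// Sketch of a Texel-style local search: one pass over all params per epoch.
#include <vector>
#include <functional>
#include <cstddef>

std::vector<int> localSearch(std::vector<int> params,
                             const std::function<double(const std::vector<int>&)>& mse,
                             int step /* e.g. 5 or 1 */)
{
    double bestE = mse(params);
    bool improved = true;
    while (improved) {
        improved = false;
        for (std::size_t i = 0; i < params.size(); ++i) {
            params[i] += step;                 // try increasing
            double e = mse(params);
            if (e < bestE) { bestE = e; improved = true; continue; }
            params[i] -= 2 * step;             // try decreasing
            e = mse(params);
            if (e < bestE) { bestE = e; improved = true; continue; }
            params[i] += step;                 // no improvement, restore
        }
    }
    return params;
}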

The conclusion is that it is not about the algorithm or the code implementation; the challenge lies in the data.
As you pointed out, it might be connected with missing/unbalanced properties in the positions.

The latest step was to switch to the publicly available EPD file I mentioned before, which is created from CCRL games.
So was my own, but now everybody has access and can reproduce the issue. I use static evaluation instead of qs(),
but the issue stays nearly the same.

At least my own data was shuffled and I used batch sizes of 1M, to be sure every important type of position is represented.
Nothing changes the result, unfortunately.

Even balancing out the impact of the error per phase did not change anything. The mg error still dominates, so the tuner reduces
it much more.

My next thought, although I'll take a break first, is that the error is a property of the position. So it might be natural that the mg error can
be reduced much more than the eg error: the distance from an mg position to the outcome of a game is usually bigger.

With a statistically balanced error across the game phases, using qs() instead of static eval might now give the desired results.
Alternatively, the amount of error per game phase might be adjusted, with more endgame positions or something like that... I need to think.

In any case, the problem is caused by the training data.

Need to leave now...
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Well, a fact is that the piece values you arrived at should produce an ENORMOUS prediction error on almost every chess position with 28-31 pieces present. You can easily verify that: e.g. take all positions in your test set with 31 pieces, where one player lacks a Knight, and compare the average of all the winning probabilities predicted for those to the actual result of the games these positions were from.
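
A sketch of what such a check could look like, as an illustration only: the Sample struct and its fields are placeholders that would be filled from the EPD file and the engine's prediction, not an existing API.

Code: Select all

// Sketch: average predicted win probability vs. average actual result for
// 31-piece positions where the side that is ahead is up a full Knight.
#include <cstdio>
#include <vector>

struct Sample {
    int pieceCount;        // total men on the board
    bool knightOdds;       // one player has full material, the other lacks a Knight
    double predicted;      // sigmoid(K * score), from the side ahead's view
    double result;         // game result from the side ahead's view (1, 0.5, 0)
};

void checkKnightOdds(const std::vector<Sample>& samples)
{
    double predSum = 0.0, resSum = 0.0; long n = 0;
    for (const Sample& s : samples) {
        if (s.pieceCount != 31 || !s.knightOdds) continue;
        predSum += s.predicted; resSum += s.result; ++n;
    }
    if (n > 0)
        printf("31 pieces, Knight ahead: avg predicted %.3f vs avg result %.3f (%ld positions)\n",
               predSum / n, resSum / n, n);
}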
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ferdy »

Desperado wrote: Wed Jan 13, 2021 1:04 pm Hello everybody,

To understand what is going on, I thought I could use a database that I did not generate myself.
So I used ccrl-40-15-elo-3200.epd from https://rebel13.nl/misc/epd.html.
I tried the ccrl-3200.epd; I will post results later.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Ferdy wrote: Wed Jan 13, 2021 9:42 pm
Desperado wrote: Wed Jan 13, 2021 1:04 pm Hello everybody,

To understand what is going on, I thought I could use a database that I did not generate myself.
So I used ccrl-40-15-elo-3200.epd from https://rebel13.nl/misc/epd.html.
I tried the ccrl-3200.epd; I will post results later.
Nice!

What is your complete setup?

* Algorithm: I used the cpw algorithm
* Phase: 0...24 (1,2,4 / minor,rook,queen), 24 == full board
* Start: 100 fixed / 300 / 300 / 500 / 1000 (both mg,eg)
* Stepsize: 1 and 5
* Loss function: squared error (see the sketch after this list)
* Batchsize: 50K (the first 50K positions, no modification of the file)
* EvalParam: material-only eval
* EvalFunc: static eval (qs is possible)
* scalingFactorK: 1.0 (can be configured to anything)?
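
For comparison purposes, a minimal sketch of the loss this configuration appears to describe: the standard Texel mean squared error between game result and sigmoid(K * score), with K = 1.0. The Sample struct is a placeholder; score would come from static eval or qs().

Code: Select all

// Sketch of the squared-error loss over a batch.
#include <cmath>
#include <vector>

struct Sample { double score; double result; };   // result: 1.0 / 0.5 / 0.0

double mse(const std::vector<Sample>& batch, double K /* = 1.0 */)
{
    double sum = 0.0;
    for (const Sample& s : batch) {
        double p = 1.0 / (1.0 + std::pow(10.0, -K * s.score / 400.0));
        sum += (s.result - p) * (s.result - p);
    }
    return batch.empty() ? 0.0 : sum / batch.size();
}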

Besides the algorithm (I used cpw), everything is easy to configure on my side.
I can use the same configuration as you, so we can compare later; I only need to know your setup parameters.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

hgm wrote: Wed Jan 13, 2021 6:31 pm Well, a fact is that the piece values you arrived at should produce an ENORMOUS prediction error on almost every chess position with 28-31 pieces present. You can easily verify that: e.g. take all positions in your test set with 31 pieces, where one player lacks a Knight, and compare the average of all the winning probabilities predicted for those to the actual result of the games these positions were from.
Without measuring, I can say that the prediction error must be too high, because the data is not filtered by quiescence or any other criteria,
so using a static evaluation will produce that high error. But we are talking about why the tuner diverges the mg and eg values while keeping
the average of both constant. To reduce the error itself, I can use qs() instead of static eval and update to an optimal scalingFactorK.
The problem is still there then, but because of the more realistic (yet still useless) relation, the issue gets hidden.

If you scan the thread, you will surely notice the most important passages on the topic.
(I have repeated myself so many times and do not want to bore the others :) )

I'm very interested in what Ferdy will report later.
User avatar
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Desperado wrote: Wed Jan 13, 2021 10:36 pm But we talk about why the tuner will diverge the mg and eg values while keeping
the average of both constant.
That much I understood. But doing this should greatly drive up the rms error, agreed? Because keeping the average constant will prevent any effect on the prediction for those positions that have gamePhase = 0.5, but it sure as hell will have an enormous impact on positions with gamePhase=0 or gamePhase=1. Because these only depend on the mg or eg values, and do not care the slightest about the other or the average of them.
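
To illustrate the point with the usual tapered interpolation score = (phase * mg + (24 - phase) * eg) / 24: at phase 12 only the average (mg + eg) / 2 enters the score, while phase 24 sees mg alone and phase 0 sees eg alone. A tiny example of my own, using the Queen values quoted earlier:

Code: Select all

// Tapered interpolation between mg and eg values.
#include <cstdio>

int tapered(int mg, int eg, int phase /* 0..24 */) {
    return (phase * mg + (24 - phase) * eg) / 24;
}

int main() {
    // Queen values from the quoted run: mg 68, eg 1990
    printf("phase 24: %d  phase 12: %d  phase 0: %d\n",
           tapered(68, 1990, 24), tapered(68, 1990, 12), tapered(68, 1990, 0));
    return 0;
}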

And in particular the mg values are completely wrong, with a score of 68 cP for a position where you are a Queen up.

So basically you are talking about why the optimizer tries to make the error as large as possible, rather than minimize it...
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

hgm wrote: Wed Jan 13, 2021 11:09 pm
Desperado wrote: Wed Jan 13, 2021 10:36 pm But we talk about why the tuner will diverge the mg and eg values while keeping
the average of both constant.
That much I understood. But doing this should greatly drive up the rms error, agreed? Because keeping the average constant will prevent any effect on the prediction for those positions that have gamePhase = 0.5, but it sure as hell will have an enormous impact on positions with gamePhase=0 or gamePhase=1. Because these only depend on the mg or eg values, and do not care the slightest about the other or the average of them.

And in particular the mg values are completely wrong, with a score of 68 cP for a position where you are a Queen up.

So basically you are talking about why the optimizer tries to make the error as large as possible, rather than minimize it...
Well, it is not intended to keep the average constant; that is part of the puzzle.

Yes, the mg values are completely wrong; that is why this thread exists. I want to find out why and how that can happen.
The effect can be softened (using qs() instead of eval and using a better scalingFactorK), but it does not disappear.
The effect does not even appear without tapered eval, because then the tuner is not able to split a term (for example the knight value) in any form.

As you point out, the phase bounds have a massive impact on that matter; that's why I began to analyse the data and produced a
file where the total error per phase is about equal. Each phase has the same share of influence on the total error and the MSE.

That does not change the result either, but that is the place where I want to continue my analysis.
Somehow the tuner keeps reducing the mg values in an unreasonable proportion.

I cannot follow why the tuner would try to make the error as large as possible.
If the frequency of phase 24 (mg) is 15% and the error component of such a position is larger than the average error, then the two quantities together also produce the largest contribution to the total error. Apply the same logic to phases 22 and 20, and the three middlegame phases already have a share of 45%.
At the same time, middlegame positions will produce a larger error than endgame positions (quiet criterion) when using static evaluation. Of course, the tuner will get the better result if it accepts the endgame errors and minimizes the middlegame errors.
Unfortunately, the tuner achieves this by generating values that are mathematically useful but content-wise nonsensical.