Tapered Evaluation and MSE (Texel Tuning)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Ferdy wrote: Tue Jan 19, 2021 8:56 am
Desperado wrote: Sun Jan 17, 2021 9:12 pm Now the really interesting part...

THE VALIDATION SETUP
Algorithm: cpw-algorithm
Stepsize: 5
evaltype: qs() tapered - material only
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: material.epd 13994 positions
batchsize: 13994
Data: no modification

Code: Select all

MG:  65 350 350 410 970
EG: 115 330 370 615 1080 best: 0.136443 epoch: 31
THE CHALLENGE SETUP
Algorithm: cpw-algorithm
Stepsize: 8,4,2,1
evaltype: qs() tapered - material only
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: material.epd 13994 positions
batchsize: 13994
Data: no modification

Code: Select all

MG:  64 348 344 404  952
EG: 124 384 436 696 1260 best: 0.135957 epoch: 56 (Ferdy reports mse: 0.13605793198602772)
THE TRIAL SETUP
Algorithm: cpw-algorithm
Stepsize: 8,7,6,5,4,3,2,1
evaltype: qs() tapered - material only
initial vector: 100,100,300,300,300,300,500,500,1000,1000
param-content: P,P,N,N,B,B,R,R,Q,Q
anchor: none
K: 1.0
database: material.epd 13994 positions
batchsize: 13994
Data: no modification

Code: Select all

MG:  62 333 333 389   899
EG: 134 416 476 752 1400 best: 0.135663 epoch: 107 (even better)
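
For readers who want to reproduce this, the "cpw-algorithm" with a step-size schedule boils down to a coordinate-descent loop like the one below. This is only a minimal sketch with illustrative names and data layout, not the tuner used here; the sigmoid form 1/(1 + 10^(-K*s/400)) is the usual Texel choice and an assumption.

Code: Select all

# Minimal sketch of a CPW-style coordinate-descent tuner over tapered
# material values (illustrative names and data layout, not the actual code).
# params = [P_mg, P_eg, N_mg, N_eg, B_mg, B_eg, R_mg, R_eg, Q_mg, Q_eg]
# Each position: {'counts': (dP, dN, dB, dR, dQ),  # white minus black
#                 'phase':  0..256,                # 256 = pure middlegame
#                 'result': 1.0 / 0.5 / 0.0}       # from White's point of view

K = 1.0  # scaling constant, as in the setups above

def tapered_material(pos, params):
    mg = sum(d * params[2 * i]     for i, d in enumerate(pos['counts']))
    eg = sum(d * params[2 * i + 1] for i, d in enumerate(pos['counts']))
    return (mg * pos['phase'] + eg * (256 - pos['phase'])) / 256.0

def sigmoid(score_cp):
    # Usual Texel form; an assumption here.
    return 1.0 / (1.0 + 10.0 ** (-K * score_cp / 400.0))

def mse(positions, params):
    return sum((p['result'] - sigmoid(tapered_material(p, params))) ** 2
               for p in positions) / len(positions)

def tune(positions, params, step_sizes=(8, 4, 2, 1)):
    # step_sizes would be (5,), (8, 4, 2, 1) or (8, 7, 6, 5, 4, 3, 2, 1)
    # in the three setups above.
    best = mse(positions, params)
    for step in step_sizes:
        improved = True
        while improved:
            improved = False
            for i in range(len(params)):
                for delta in (+step, -step):
                    params[i] += delta
                    err = mse(positions, params)
                    if err < best:
                        best = err
                        improved = True
                    else:
                        params[i] -= delta   # undo the change
    return params, best
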
The better the MSE, the more the material values of the two phases diverge. Ferdy chose the data ;-) (it is a subset of ccrl_3200_texel.epd)
The material.epd is not a subset of ccrl_3200_texel.epd.
Ok, sorry for spreading wrong information.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

hgm wrote: Mon Jan 18, 2021 10:39 pm
Sven wrote: Mon Jan 18, 2021 5:51 pmE.g. when allowing a step size of 8 or even more, I have observed that the pawn MG value jumps away to more than 200cp and both queen values to 1400, 1500 cp and more, leading to weaker play. I have never observed such extremes when using a small step size (1).
Basically what you are saying is that Texel tuning sucks, and that you should try to break it as much as possible to prevent it from completing its task, so that it won't spoil your starting parameters as much as it otherwise would.

I don't doubt your experience (in fact it is exactly what I expect from the currently derived set: pretty weak play). But your reaction to this is not consistent. More logical would be to not use it at all, or at least not to use it for the piece values, but to fix those and use it to tune only the more subtle eval terms.

And the problem is not in the method, the problem is in the tuning set. If you tune on a set that doesn't contain any info on the piece values, you will of course get poor piece values, because it will abuse the piece values to improve some minor terms that accidentally correlate with the material composition, not caring how much that spoils the evaluation of the heavily unbalanced positions missing from your test set.

Engines that play over 3200 Elo search a tree that for >99.9% consists of moronic play, visiting positions even a 1000 Elo player would not tolerate in his games. They can only find the 3200 Elo PV if they score the garbage positions well enough to realize they are worse than the PV. Tuning the eval on the very narrow range of 'sensible' positions will give an eval that extrapolates very wrongly to idiotic positions, to the point where it might start preferring those, e.g. trading the Queen for two Bishops if the piece values are off.
Best BS post ever, from someone with a broad practical experience in applying Texel tuning 😊

Jumbo got a boost of slightly more than 100 Elo points by correctly applying the method. I used the original description, except for calling eval() instead of qsearch() since I used the well-known "quiet-labeled.epd" dataset. Using step sizes greater than 1 resulted in different parameter values that caused Jumbo to play weaker than with those values resulting from tuning with step size 1.

There is no reason to believe that the data set I used does not contain sufficient information to tune piece values. In fact I tuned all my several hundreds of eval parameters at once and have been successful with it.

The attempt to only tune piece values with a different algorithm while using a much smaller data set may work or not ...
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Ferdy wrote: Tue Jan 19, 2021 7:15 am
Desperado wrote: Mon Jan 18, 2021 5:52 pm
Ferdy wrote: Mon Jan 18, 2021 5:28 pm A good challenge next would be tuning the material and passer. The position can be dominated by material on one side but can be countered by the presence of a passed pawn on the other side.
Hello Ferdy,

please be so kind as to give me a clear answer to the question: do you compare the MSE of data that differ from each other? Yes or no?
I can compare the MSE of two different data sets, but I have to know what is in those data sets, and of course I know what parameters I am trying to optimize.
Desperado wrote: Mon Jan 18, 2021 5:52 pm And what do you say to the fact that the values drift further and further apart the more efficiently the algorithm works?
The values shown should also yield a smaller MSE for you. Can you please confirm or deny this? more results
The training position has a result field; if the result is 1-0, its target score is 1.0. Calculating the material-only evaluation score of the engine would, say, give you +300cp for a position with one knight ahead. Now convert that to a sigmoid and you get a value sig. The error is 1.0 - sig. As you increase the knight value, sig also increases, and if sig increases the error goes down.
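
In code, the per-position error described here amounts to something like the following sketch (the sigmoid form is the usual Texel choice and an assumption; K = 1.0 as in the setups above).

Code: Select all

def sigmoid(score_cp, k=1.0):
    # Map a centipawn score to an expected game result in [0, 1].
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

target = 1.0             # result field is "1-0"
score  = 300             # material-only eval: one knight ahead, say +300cp
sig    = sigmoid(score)  # roughly 0.85
error  = target - sig    # shrinks as the knight value (and thus the score) grows
# The tuner minimizes the mean of error**2 over all training positions.
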
You can compare two MSEs from different data, but that is wrong. The results of the tuner do not prove that what you are doing is right.
On the contrary, it certainly introduces an error into your algorithm.

HGM gave more information on the effect.
That means you risk a regression if you do not ensure that the MSE of the total set is better, for sure!

You must at least adjust the basis for comparing two vectors: you need the MSE on the current data set for both vectors you are comparing. Only then are they comparable at all. (This requirement is mandatory.) In this case, and without considering the whole data set, you could at least determine whether a new vector is better on this new data set. (However, a regression is still not excluded.)
The properties of the positions are irrelevant in that context.

Don't get me wrong, I like the idea of exploring a bigger space of positions while keeping the sample size constant. Really nice idea!
But it requires updating the MSE on the corresponding data; only then can you compare against it.

But what HGM did not point out is that there are algorithms that work on "your" idea.
You need to define a threshold function that controls the potential regression. The basic idea is that you would be able to leave local optima.
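
As a minimal sketch of both points (hypothetical names; mse() stands for the tuner's usual error function over a batch, and the threshold rule is only one possible choice, in the spirit of simulated annealing):

Code: Select all

def compare_on_batch(batch, current_params, candidate_params, mse, threshold=0.0):
    # When the batch changes, both vectors must be re-scored on the *same*
    # new batch before they can be compared at all.
    current_mse   = mse(batch, current_params)
    candidate_mse = mse(batch, candidate_params)
    # With threshold > 0 a small regression is accepted on purpose,
    # which lets the search leave a local minimum.
    if candidate_mse <= current_mse + threshold:
        return candidate_params, candidate_mse
    return current_params, current_mse
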

Your second answer does not address my question at all. Sorry.
Either you present the MSE of the vectors I provided, or you show with a smaller MSE that the material values of a piece
stay close together, e.g. pawn 40/60 instead of 10/110.
hgm
Posts: 27816
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by hgm »

Sven wrote: Tue Jan 19, 2021 9:26 amJumbo got a boost of slightly more than 100 Elo points by correctly applying the method.
Except that you just told us that you did not correctly apply the method at all. Correctly applying it would be to minimize the MSE. Instead you sabotaged the optimizer to prevent it from finding the minimum MSE. As you now repeat, correctly applying the method would make the engine weaker.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Ferdy »

Desperado wrote: Tue Jan 19, 2021 9:38 am
Ferdy wrote: Tue Jan 19, 2021 7:15 am
Desperado wrote: Mon Jan 18, 2021 5:52 pm
Ferdy wrote: Mon Jan 18, 2021 5:28 pm A good challenge next would be tuning the material and passer. The position can be dominated by material on one side but can be countered by the presence of a passed pawn on the other side.
Hello Ferdy,

please be so kind as to give me a clear answer to the question: do you compare the MSE of data that differ from each other? Yes or no?
I can compare the MSE of two different data sets, but I have to know what is in those data sets, and of course I know what parameters I am trying to optimize.
Desperado wrote: Mon Jan 18, 2021 5:52 pm And what do you say to the fact that the values drift further and further apart the more efficiently the algorithm works?
The values shown should also yield a smaller MSE for you. Can you please confirm or deny this? more results
The training position has a result field; if the result is 1-0, its target score is 1.0. Calculating the material-only evaluation score of the engine would, say, give you +300cp for a position with one knight ahead. Now convert that to a sigmoid and you get a value sig. The error is 1.0 - sig. As you increase the knight value, sig also increases, and if sig increases the error goes down.
You can compare two MSEs from different data, but that is wrong. The results of the tuner do not prove that what you are doing is right.
On the contrary, it certainly introduces an error into your algorithm.

HGM gave more information on the effect.
That means you risk a regression if you do not ensure that the MSE of the total set is better, for sure!

You must at least adjust the basis for comparing two vectors: you need the MSE on the current data set for both vectors you are comparing. Only then are they comparable at all. (This requirement is mandatory.) In this case, and without considering the whole data set, you could at least determine whether a new vector is better on this new data set. (However, a regression is still not excluded.)
The properties of the positions are irrelevant in that context.

Don't get me wrong, I like the idea of exploring a bigger space of positions while keeping the sample size constant. Really nice idea!
But it requires updating the MSE on the corresponding data; only then can you compare against it.

But what HGM did not point out is that there are algorithms that work on "your" idea.
You need to define a threshold function that controls the potential regression. The basic idea is that you would be able to leave local optima.

Your second answer does not address my question at all. Sorry.
Either you present the MSE of the vectors I provided, or you show with a smaller MSE that the material values of a piece
stay close together, e.g. pawn 40/60 instead of 10/110.
I know what I am doing.

I hope you fixed the bugs in your tuner :)
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Sven wrote: Tue Jan 19, 2021 9:26 am
hgm wrote: Mon Jan 18, 2021 10:39 pm
Sven wrote: Mon Jan 18, 2021 5:51 pmE.g. when allowing a step size of 8 or even more, I have observed that the pawn MG value jumps away to more than 200cp and both queen values to 1400, 1500 cp and more, leading to weaker play. I have never observed such extremes when using a small step size (1).
Basically what you are saying is that Texel tuning sucks, and that you should try to break it as much as possible to prevent it from completing its task, so that it won't spoil your starting parameters as much as it otherwise would.

I don't doubt your experience (in fact it is exactly what I expect from the currently derived set: pretty weak play). But your reaction to this is not consistent. More logical would be to not use it at all, or at least not to use it for the piece values, but to fix those and use it to tune only the more subtle eval terms.

And the problem is not in the method, the problem is in the tuning set. If you tune on a set that doesn't contain any info on the piece values, you will of course get poor piece values, because it will abuse the piece values to improve some minor terms that accidentally correlate with the material composition, not caring how much that spoils the evaluation of the heavily unbalanced positions missing from your test set.

Engines that play over 3200 Elo search a tree that for >99.9% consists of moronic play, visiting positions even a 1000 Elo player would not tolerate in his games. They can only find the 3200 Elo PV if they score the garbage positions well enough to realize they are worse than the PV. Tuning the eval on the very narrow range of 'sensible' positions will give an eval that extrapolates very wrongly to idiotic positions, to the point where it might start preferring those, e.g. trading the Queen for two Bishops if the piece values are off.
Best BS post ever, from someone with a broad practical experience in applying Texel tuning 😊

Jumbo got a boost of slightly more than 100 Elo points by correctly applying the method. I used the original description, except for calling eval() instead of qsearch() since I used the well-known "quiet-labeled.epd" dataset. Using step sizes greater than 1 resulted in different parameter values that caused Jumbo to play weaker than with those values resulting from tuning with step size 1.

There is no reason to believe that the data set I used does not contain sufficient information to tune piece values. In fact I tuned all my several hundreds of eval parameters at once and have been successful with it.

The attempt to only tune piece values with a different algorithm while using a much smaller data set may work or not ...
Hi Sven,

I agree with HGM on at least one point: it is not consistent to ignore results that you do not like. That makes the complete process random.
Of course you do not need to put BS values for material into your engine just because they result in a lower MSE. Another useful approach would be
to find out why the tuner reports an obvious BS vector as best. You can try to improve your data, for example; that is useful too.

On the other hand, I agree with you that the Texel tuning method can help a lot. The difficulty is to prepare the prerequisites.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Ferdy wrote: Tue Jan 19, 2021 10:03 am
Desperado wrote: Tue Jan 19, 2021 9:38 am
Ferdy wrote: Tue Jan 19, 2021 7:15 am
Desperado wrote: Mon Jan 18, 2021 5:52 pm
Ferdy wrote: Mon Jan 18, 2021 5:28 pm A good challenge next would be tuning the material and passer. The position can be dominated by material on one side but can be countered by the presence of a passed pawn on the other side.
Hello Ferdy,

please be so kind as to give me a clear answer to the question: do you compare the MSE of data that differ from each other? Yes or no?
I can compare the MSE of two different data sets, but I have to know what is in those data sets, and of course I know what parameters I am trying to optimize.
Desperado wrote: Mon Jan 18, 2021 5:52 pm And what do you say to the fact that the values drift further and further apart the more efficiently the algorithm works?
The values shown should also yield a smaller MSE for you. Can you please confirm or deny this? more results
The training position has a result field; if the result is 1-0, its target score is 1.0. Calculating the material-only evaluation score of the engine would, say, give you +300cp for a position with one knight ahead. Now convert that to a sigmoid and you get a value sig. The error is 1.0 - sig. As you increase the knight value, sig also increases, and if sig increases the error goes down.
You can compare two MSEs from different data, but that is wrong. The results of the tuner do not prove that what you are doing is right.
On the contrary, it certainly introduces an error into your algorithm.

HGM gave more information on the effect.
That means you risk a regression if you do not ensure that the MSE of the total set is better, for sure!

You must at least adjust the basis for comparing two vectors: you need the MSE on the current data set for both vectors you are comparing. Only then are they comparable at all. (This requirement is mandatory.) In this case, and without considering the whole data set, you could at least determine whether a new vector is better on this new data set. (However, a regression is still not excluded.)
The properties of the positions are irrelevant in that context.

Don't get me wrong, I like the idea of exploring a bigger space of positions while keeping the sample size constant. Really nice idea!
But it requires updating the MSE on the corresponding data; only then can you compare against it.

But what HGM did not point out is that there are algorithms that work on "your" idea.
You need to define a threshold function that controls the potential regression. The basic idea is that you would be able to leave local optima.

Your second answer does not address my question at all. Sorry.
Either you present the MSE of the vectors I provided, or you show with a smaller MSE that the material values of a piece
stay close together, e.g. pawn 40/60 instead of 10/110.
I know what I am doing.

I hope you fixed the bugs in your tuner :)
I know what I am doing.
That is not a serious or professional answer ;-). A lot of people here know what they are doing.
That does not mean they don't make mistakes! If you don't trust me, then trust other experts. Ask them...

Keep an open mind and do not be offended.

Well, the point is that there are no bugs in my tuner. The longer the thread went on, the clearer it became that it is the data.
I showed that on the given data the tuner produces a lower MSE than what other people, for example you, have reported.

The only way to bring evidence is to report an MSE that is lower than what I reported and that comes with "normal" values.
Since I provided these facts, you either do not answer factually or ignore the questions. Please report facts, not statements like "I know what I am doing". Thanks a lot (especially for the effort and the numbers you have already reported).
Last edited by Desperado on Tue Jan 19, 2021 10:30 am, edited 1 time in total.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

hgm wrote: Tue Jan 19, 2021 10:01 am
Sven wrote: Tue Jan 19, 2021 9:26 amJumbo got a boost of slightly more than 100 Elo points by correctly applying the method.
Except that you just told us that you did not correctly apply the method at all. Correctly applying it would be to minimize the MSE. Instead you sabotaged the optimizer to prevent it from finding the minimum MSE. As you now repeat, correctly applying the method would make the engine weaker.
Parameter tuning without measuring playing strength is nonsense. And I have not seen any proof that "MSE(paramVector1) < MSE(paramVector2)" always implies "Elo(paramVector1) >= Elo(paramVector2)". Whether a parameter vector that results in a smaller MSE for a given data set really improves playing strength depends heavily on many properties of the engine. I would accept a statement like "then it is your engine that sucks" since Jumbo is still slightly below 2600 Elo @CCRL ... but it was even far below that level before tuning it.

Also, I did not "sabotage the optimizer"; that is another bullshit statement. I used the same algorithm and the same step size as the original author of Texel tuning.
Last edited by Sven on Tue Jan 19, 2021 10:28 am, edited 1 time in total.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Sven »

Desperado wrote: Tue Jan 19, 2021 10:16 am
Sven wrote: Tue Jan 19, 2021 9:26 am
hgm wrote: Mon Jan 18, 2021 10:39 pm
Sven wrote: Mon Jan 18, 2021 5:51 pmE.g. when allowing a step size of 8 or even more, I have observed that the pawn MG value jumps away to more than 200cp and both queen values to 1400, 1500 cp and more, leading to weaker play. I have never observed such extremes when using a small step size (1).
Basically what you are saying is that Texel tuning sucks, and that you should try to break it as much as possible to prevent it from completing its task, so that it won't spoil your starting parameters as much as it otherwise would.

I don't doubt your experience (in fact it is exactly what I expect from the currently derived set: pretty weak play). But your reaction to this is not consistent. More logical would be to not use it at all, or at least not to use it for the piece values, but to fix those and use it to tune only the more subtle eval terms.

And the problem is not in the method, the problem is in the tuning set. If you tune on a set that doesn't contain any info on the piece values, you will of course get poor piece values, because it will abuse the piece values to improve some minor terms that accidentally correlate with the material composition, not caring how much that spoils the evaluation of the heavily unbalanced positions missing from your test set.

Engines that play over 3200 Elo search a tree that for >99.9% consists of moronic play, visiting positions even a 1000 Elo player would not tolerate in his games. They can only find the 3200 Elo PV if they score the garbage positions well enough to realize they are worse than the PV. Tuning the eval on the very narrow range of 'sensible' positions will give an eval that extrapolates very wrongly to idiotic positions, to the point where it might start preferring those, e.g. trading the Queen for two Bishops if the piece values are off.
Best BS post ever, from someone with a broad practical experience in applying Texel tuning 😊

Jumbo got a boost of slightly more than 100 Elo points by correctly applying the method. I used the original description, except for calling eval() instead of qsearch() since I used the well-known "quiet-labeled.epd" dataset. Using step sizes greater than 1 resulted in different parameter values that caused Jumbo to play weaker than with those values resulting from tuning with step size 1.

There is no reason to believe that the data set I used does not contain sufficient information to tune piece values. In fact I tuned all my several hundreds of eval parameters at once and have been successful with it.

The attempt to only tune piece values with a different algorithm while using a much smaller data set may work or not ...
Hi Sven,

I agree with HGM on at least one point: it is not consistent to ignore results that you do not like. That makes the complete process random.
Of course you do not need to put BS values for material into your engine just because they result in a lower MSE. Another useful approach would be
to find out why the tuner reports an obvious BS vector as best. You can try to improve your data, for example; that is useful too.

On the other hand, I agree with you that the Texel tuning method can help a lot. The difficulty is to prepare the prerequisites.
My point was not that I did not like results, it was that I discarded tuning results due to a significant difference in playing strength.
Sven Schüle (engine author: Jumbo, KnockOut, Surprise)
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: Tapered Evaluation and MSE (Texel Tuning)

Post by Desperado »

Sven wrote: Tue Jan 19, 2021 10:27 am
Desperado wrote: Tue Jan 19, 2021 10:16 am
Sven wrote: Tue Jan 19, 2021 9:26 am
hgm wrote: Mon Jan 18, 2021 10:39 pm
Sven wrote: Mon Jan 18, 2021 5:51 pmE.g. when allowing a step size of 8 or even more, I have observed that the pawn MG value jumps away to more than 200cp and both queen values to 1400, 1500 cp and more, leading to weaker play. I have never observed such extremes when using a small step size (1).
Basically what you are saying is that Texel tuning sucks, and that you should try to break it as much as possible to prevent it from completing its task, so that it won't spoil your starting parameters as much as it otherwise would.

I don't doubt your experience (in fact it is exactly what I expect from the currently derived set: pretty weak play). But your reaction to this is not consistent. More logical would be to not use it at all, or at least not to use it for the piece values, but to fix those and use it to tune only the more subtle eval terms.

And the problem is not in the method, the problem is in the tuning set. If you tune on a set that doesn't contain any info on the piece values, you will of course get poor piece values, because it will abuse the piece values to improve some minor terms that accidentally correlate with the material composition, not caring how much that spoils the evaluation of the heavily unbalanced positions missing from your test set.

Engines that play over 3200 Elo search a tree that for >99.9% consists of moronic play, visiting positions even a 1000 Elo player would not tolerate in his games. They can only find the 3200 Elo PV if they score the garbage positions well enough to realize they are worse than the PV. Tuning the eval on the very narrow range of 'sensible' positions will give an eval that extrapolates very wrongly to idiotic positions, to the point where it might start preferring those, e.g. trading the Queen for two Bishops if the piece values are off.
Best BS post ever, from someone with a broad practical experience in applying Texel tuning 😊

Jumbo got a boost of slightly more than 100 Elo points by correctly applying the method. I used the original description, except for calling eval() instead of qsearch() since I used the well-known "quiet-labeled.epd" dataset. Using step sizes greater than 1 resulted in different parameter values that caused Jumbo to play weaker than with those values resulting from tuning with step size 1.

There is no reason to believe that the data set I used does not contain sufficient information to tune piece values. In fact I tuned all my several hundreds of eval parameters at once and have been successful with it.

The attempt to only tune piece values with a different algorithm while using a much smaller data set may work or not ...
Hi Sven,

I agree with HGM on at least one point: it is not consistent to ignore results that you do not like. That makes the complete process random.
Of course you do not need to put BS values for material into your engine just because they result in a lower MSE. Another useful approach would be
to find out why the tuner reports an obvious BS vector as best. You can try to improve your data, for example; that is useful too.

On the other hand, I agree with you that the Texel tuning method can help a lot. The difficulty is to prepare the prerequisites.
My point was not that I did not like results, it was that I discarded tuning results due to a significant difference in playing strength.
I agree, that is another subject. The translation into Elo is important and is measured in a completely different way.
Well, it is not the fault of the tuner or the tuning algorithm itself if a smaller MSE does not translate into better gameplay. At the same time I would always ask: why the hell do I get a lower MSE that does not do that? Your goal should be to maximize how often it helps and to close the remaining gaps as far as possible. The efficiency will rise then, IMHO.

@HG if there is something wrong in the complete scenario, then it would certainly be the human-made assumption that the lowest MSE always leads to better gameplay. The tuner or the tuning algorithm only provides this information; it does not tell anybody that the result will perform better in gameplay. You simply do not ask it that question, you only ask what the best vector would look like for a given data set. That is basically a different question.
Last edited by Desperado on Tue Jan 19, 2021 11:03 am, edited 1 time in total.