Weird Results from Texel Tuning

Discussion of chess software programming and technical issues.

Moderators: hgm, chrisw, Rebel

arjunbasandrai
Posts: 10
Joined: Thu Oct 26, 2023 8:41 pm
Full name: Arjun Basandrai

Weird Results from Texel Tuning

Post by arjunbasandrai »

Hi everyone!

I am currently in the process of adding Texel's Tuning to Shuffle. This is my implementation (https://github.com/ArjunBasandrai/shuff ... el/texel.c), which is heavily inspired by Berserk's (https://github.com/jhonnold/berserk/blo ... xel.c#L793)

However, I am facing some issues:

1. When I use a function to determine the best value of K, it always comes out very close to 0 (0.05-0.1).
2. After a few iterations (15 to be precise), I used the tuned values and ran a test against the untuned version, but the new version only scored around 38%.

Can anybody help me figure out what could be causing these issues?
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Weird Results from Texel Tuning

Post by lithander »

For me "Texel Tuning" means a simple approach to tuning your evaluation parameters. As you continue working on your engine, you will naturally add more complexity to make the process more efficient or accurate. So when you look at a top-10 engine's code, everything seems dauntingly complex, and I'm not sure Berserk's implementation should really guide your first steps; it already has a lot of these improvements integrated.

The idea however is really simple. And it works like this:

What makes a good evaluation function? A good evaluation is able to look at a position and tell you who is winning.
You play a bunch of games and then look at how they ended. You create a big dataset of positions and their outcomes (in EPD format). Now you want to tweak your evaluation function so that its predictive quality on this dataset improves.

So first you have to be able to measure the predictive quality of your evaluation function. Usually evaluation functions use centipawns as a unit. A balanced position hovers around zero but a lost or won position is scored with hundreds or even thousands of centipawns.

So what you do now is convert the output of your eval into the range [-1..+1] and interpret it as a prediction of the outcome: -1 means black wins, +1 means white wins, 0 is a draw.
A good way to do that mapping is to use the sigmoid function.

Now... there's an infinite number of different sigmoid functions to choose from. You'll want to pick one that does a good job of mapping your eval output to a value that predicts the outcome.

This is what finding 'K' is about. It's not yet about tuning your eval parameters, it's about tuning your sigmoid function to fit your existing evaluation!

So how do you find 'K'? The simplest way is to just try different values for 'K' and settle on the best one. But before you can do that, you have to be able to measure the predictive quality of your eval on the dataset. For that you calculate the mean squared error over the dataset.

For each position you get the eval score. You map it to a probabilistic value using your current flavor of sigmoid function. You subtract the result from the game result, which gets you the error: how well did you predict the real outcome? You take that error and square it (big errors matter more than small ones).

Now you add all these squared errors up and at the end divide by the number of positions in your dataset... and voilà! You have computed the mean squared error over your dataset. This is how you measure the predictive quality of your evaluation.
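
As a rough illustration, here is a minimal sketch of that error computation in C. It uses the (0, 1) convention from the CPW article (result encoded as 1 / 0.5 / 0 and the base-10 sigmoid with the /400 scaling); the [-1..+1] mapping described above works the same way with the outcome encoded as +1 / 0 / -1. The types and the evaluate_entry hook are placeholders, not Shuffle's or Berserk's actual code:

#include <math.h>
#include <stddef.h>

/* Placeholder dataset entry: in a real tuner the board representation and the
   loader are your own; only the outcome encoding matters for the formula. */
typedef struct {
    /* ... board representation ... */
    double result;   /* game outcome: 1.0 = white win, 0.5 = draw, 0.0 = black win */
} Entry;

/* Hypothetical hook into the engine: static (or better, quiescence-resolved)
   evaluation of the stored position, in centipawns, from White's point of view. */
extern double evaluate_entry(const Entry *e);

/* One common flavour of sigmoid (the one on the CPW page):
   maps a centipawn score to a predicted outcome in (0, 1). */
static double sigmoid(double score, double K) {
    return 1.0 / (1.0 + pow(10.0, -K * score / 400.0));
}

/* Mean squared error of the prediction over the whole dataset. */
static double mse(const Entry *entries, size_t n, double K) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double error = entries[i].result - sigmoid(evaluate_entry(&entries[i]), K);
        sum += error * error;   /* squaring: big errors matter more than small ones */
    }
    return sum / (double)n;
}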

And from here on everything you do serves one purpose: Minimizing the MSE!

Now try to optimize your mapping first. Try different sigmoids, or if you have chosen one with a constant K (like Berserk), then try different values of K. Once your mapping is solid, you start tweaking your evaluation parameters.
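
For example, a brute-force search for K, reusing the mse() sketch above (the scan ranges and the number of refinement passes are arbitrary placeholder choices):

/* Coarse-to-fine scan for the K that minimises the MSE of the *untuned* eval. */
static double find_best_k(const Entry *entries, size_t n) {
    double best_k = 1.0;
    double best_err = mse(entries, n, best_k);
    double step = 1.0;
    for (int pass = 0; pass < 4; pass++) {          /* refine: step 1.0, 0.1, 0.01, 0.001 */
        double lo = best_k - 10.0 * step, hi = best_k + 10.0 * step;
        for (double k = lo; k <= hi; k += step) {
            double err = mse(entries, n, k);
            if (err < best_err) { best_err = err; best_k = k; }
        }
        step /= 10.0;
    }
    return best_k;
}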

And while Berserk seems to be using stochastic gradient descent to tweak the values (which is a good idea, but complicated), any optimization method works. And if you're just starting out and trying to find good PSQT values, then by all means keep it simple and stupid.
Pick an eval parameter, add +1 and see if the MSE improves. Continue adding +1 as long as you find improvements. Do the same with subtracting 1... do that a lot and you arrive at a local optimum where you find no further parameters you could modify for a tiny improvement.
Congrats, you have tuned your eval values, and you should set up a self-play match of the new values vs the old values to verify your success.
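
A bare-bones version of that local search could look like the following sketch, again building on the mse() sketch above; the params array and its connection to evaluate_entry() are hypothetical hooks into your own eval:

/* Hypothetical flattened view of the tunable eval parameters (e.g. PSQT entries);
   evaluate_entry() from the sketch above must read its values from this array. */
extern int params[];
extern size_t num_params;

/* Texel-style local search: nudge each parameter by +1 / -1 and keep any change
   that lowers the MSE; repeat until a full pass finds no improvement. */
static void tune(const Entry *entries, size_t n, double K) {
    double best_err = mse(entries, n, K);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (size_t i = 0; i < num_params; i++) {
            params[i] += 1;
            double err = mse(entries, n, K);
            if (err < best_err) { best_err = err; improved = 1; continue; }
            params[i] -= 2;                          /* undo the +1 and try -1 */
            err = mse(entries, n, K);
            if (err < best_err) { best_err = err; improved = 1; continue; }
            params[i] += 1;                          /* neither direction helped: restore */
        }
    }
}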

Now you can start refining! With Berserk you are looking at an implementation that has seen a lot of refinements, and that looks quite daunting. But with all the optimizations and refinements stripped away, the basic concept is super simple.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Ras
Posts: 2638
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Weird Results from Texel Tuning

Post by Ras »

arjunbasandrai wrote: Wed Jan 17, 2024 12:21 pm 1. When I use a function to determine the best value of K, it always comes out very close to 0 (0.05-0.1).
Which means that your evaluation is not really related to the training data. It could be a bug in the K calculation or in the MSE, or you're parsing the game result wrong, or you're using non-quiet positions with a static eval instead of quiescence.
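
As a quick sanity check for the result parsing, here is a sketch that assumes the outcome is stored as a result string such as "1-0", "1/2-1/2" or "0-1" (for example in a c9 opcode, as in the Zurichess quiet set); adjust it to whatever your dataset actually uses:

#include <string.h>

/* Map the result string to the training target.
   The convention must match the sigmoid: 1.0 = white win, 0.5 = draw, 0.0 = black win. */
static double parse_result(const char *s) {
    if (strstr(s, "1/2")) return 0.5;
    if (strstr(s, "1-0")) return 1.0;
    if (strstr(s, "0-1")) return 0.0;
    return -1.0;   /* unknown: skip the position instead of training on garbage */
}

Also make sure the score you feed into the sigmoid uses the same point of view as the result encoding (e.g. both from White's perspective): if the two don't match, the eval barely correlates with the outcomes, and the fitted K tends to collapse towards 0.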
Rasmus Althoff
https://www.ct800.net
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Weird Results from Texel Tuning

Post by mvanthoor »

lithander wrote: Fri Jan 19, 2024 12:50 pm Now try to optimize your mapping first. Try different sigmoids, or if you have chosen one with a constant K (like Berserk), then try different values of K. Once your mapping is solid, you start tweaking your evaluation parameters.
So how do you know when your K is 'perfect' or your 'mapping is solid'?

I was planning to do 16 Texel runs at the same time, each with a different K-value, and then test each of the outcomes in self-play matches.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Ras
Posts: 2638
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Weird Results from Texel Tuning

Post by Ras »

mvanthoor wrote: Fri Jan 19, 2024 6:57 pm So how do you know when your K is 'perfect' or your 'mapping is solid'?
When the MSE before tuning is minimised over K. Otherwise, you'll end up wasting the training time on scaling the evaluation as a whole instead of tuning the parameters to discern position features.
Rasmus Althoff
https://www.ct800.net
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Weird Results from Texel Tuning

Post by mvanthoor »

Ras wrote: Fri Jan 19, 2024 7:39 pm When the MSE before tuning is minimised over K. Otherwise, you'll end up wasting the training time on scaling the evaluation as a whole instead of tuning the parameters to discern position features.
So if I understand this correctly: you should FIRST tune K, to get the lowest possible MSE with your eval, WITHOUT touching any of the eval parameters?

I've never encountered that step in my research; I might have missed it, but it sounds logical. So I'll add this to the pipeline.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Ras
Posts: 2638
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Weird Results from Texel Tuning

Post by Ras »

mvanthoor wrote: Fri Jan 19, 2024 7:55 pm So if I understand this correctly: you should FIRST tune K, to get the lowest possible MSE with your eval, WITHOUT touching any of the eval parameters?
Exactly, and then you never change that K again for these training data.
I've never encountered that step in my research
From CPW (https://www.chessprogramming.org/Texel% ... ing_Method):
K is a scaling constant.
Compute the K that minimizes E. K is never changed again by the algorithm.
If needed, refactor the source code so that the qScore function depends on a set of evaluation function parameters w_j, 1 <= j <= M.
The average error E is now a function of the M parameters. Find parameter values such that E is a local minimum in parameter space.
So the K scaling is determined before the actual weight tuning starts.
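
Expressed as a tiny driver, and reusing the hypothetical helpers sketched earlier in this thread, that ordering is simply:

/* The order matters: K is fitted once to the untuned eval, then only the weights move. */
void texel_tune(const Entry *entries, size_t n) {
    double K = find_best_k(entries, n);   /* 1) fit the sigmoid to the current eval */
    tune(entries, n, K);                  /* 2) minimise the error over the weights, K fixed */
    /* 3) write out the tuned weights and verify them in a self-play match */
}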
Rasmus Althoff
https://www.ct800.net
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Weird Results from Texel Tuning

Post by mvanthoor »

Ras wrote: Fri Jan 19, 2024 8:09 pm Exactly, and then you never change that K again for these training data.
I've never encountered that step in my research
From CPW (https://www.chessprogramming.org/Texel% ... ing_Method):
K is a scaling constant.
Compute the K that minimizes E. K is never changed again by the algorithm.
If needed, refactor the source code so that the qScore function depends on a set of evaluation function parameters w_j, 1 <= j <= M.
The average error E is now a function of the M parameters. Find parameter values such that E is a local minimum in parameter space.
So the K scaling is determined before the actual weight tuning starts.
It seems this is one more thing I didn't quite fully understand until now. I still have two questions:

- Is K computed in centipawns (150, for example) or pawns (1.5)?
- When you compute K, you say that it never changes for this set of training data. Is that in combination with the evaluation as it is now, or without? In other words: if the training data stays the same, but the training data changes, do you NOT recompute K because the training data stayed the same, or DO you recompute K because the evaluation changed? I'd say the latter, because a different evaluation would give a different MSE, and thus a different K to minimize it.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
Ras
Posts: 2638
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Weird Results from Texel Tuning

Post by Ras »

mvanthoor wrote: Fri Jan 19, 2024 10:34 pm - Is K computed in centipawns (150, for example) or pawns (1.5)?
No, it's a unit-less scaling constant in your pre-existing eval. If your eval were perfect in centipawns already with regard to winning advantage in the training data, K would probably be around 1.
- When you compute K, you say that it never changes for this set of training data. Is that in combination with the evaluation as it is now, or without?
It's about the current, pre-trained eval, and the reasoning is that you want to improve the quality of your eval, but you don't want to change its general scaling.
In other words: if the training data stays the same, but the training data changes
That doesn't make sense to me?
do you NOT recompute K because the training data stayed the same, or DO you recompute K because the evaluation changed?
For the same training data, you don't recompute. The idea is to find a K where your pre-trained eval matches best before any training, and then all the rest of the training happens with the same K, so that you optimise how well the eval discerns winning/losing features instead of re-scaling the evaluation. That's because you have clearly winning positions such as KQ:K where you can always gain a slightly better sigmoid fit by just increasing all material values, until you end up with a queen being worth 100 pawns, a pawn being worth 10 pawns, and so on.
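
A small self-contained toy check of that point, using the CPW-style sigmoid purely as an example: because K only ever multiplies the score, rescaling the whole eval by a factor c (with K fixed) produces exactly the same predictions as keeping the eval and using c*K instead, so once K has been fitted to the pre-trained eval, a pure global rescale cannot beat it. The same identity is why K is unit-less: an eval in pawns instead of centipawns just ends up with a fitted K about 100 times larger.

#include <assert.h>
#include <math.h>
#include <stdio.h>

/* CPW-style sigmoid: 1 / (1 + 10^(-K * s / 400)). K and s only appear as a product. */
static double sigmoid(double score, double K) {
    return 1.0 / (1.0 + pow(10.0, -K * score / 400.0));
}

int main(void) {
    double s = 150.0;          /* some eval score in centipawns */
    double K = 1.5, c = 2.0;   /* arbitrary example values */
    /* Scaling the eval by c (K fixed) == scaling K by c (eval fixed): */
    assert(fabs(sigmoid(c * s, K) - sigmoid(s, c * K)) < 1e-12);
    printf("same prediction either way: %f\n", sigmoid(s, c * K));
    return 0;
}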

The most difficult part is finding good training data, and the Zurichess quiet positions are a very good starting point. I found an even better set in Lichess Big3 Resolved, but not by much. I also improved on the Texel method in that my tuner takes the number of iterations and a step width as parameters, so that I first do a coarse tuning, then a medium one, and only then the fine tuning with step width 1.

A pretty huge performance win for training was adding a function that tells whether a parameter is even relevant in a given position. As in, if a position does not have a white king on F2, then it won't contribute to the white-king-on-F2 piece-square table entry. Or, the bishop-pair advantage is only relevant in positions where exactly one side has the bishop pair.
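
One common way to realise that idea (Ras's tuner may differ in the details) is to treat the eval as linear in its weights and store each training position as a sparse list of the coefficients that are actually nonzero; a rough sketch with hypothetical names:

#include <stddef.h>

/* If the eval is (approximately) linear in its weights,
       eval(pos) = sum_j coeff_j(pos) * w_j,
   then a parameter with coeff_j(pos) == 0 (white king not on F2, no bishop-pair
   imbalance, ...) can never change that position's error, so the tuner can skip it. */
typedef struct {
    size_t index;    /* which weight */
    int    coeff;    /* e.g. +1 for a white feature, -1 for the black mirror */
} Term;

typedef struct {
    Term  *terms;    /* only the nonzero terms of this position */
    size_t count;
    double result;   /* game outcome */
} SparseEntry;

/* Re-evaluate one position from its precomputed sparse terms. */
static double eval_sparse(const SparseEntry *e, const double *weights) {
    double score = 0.0;
    for (size_t i = 0; i < e->count; i++)
        score += e->terms[i].coeff * weights[e->terms[i].index];
    return score;
}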
Rasmus Althoff
https://www.ct800.net
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Weird Results from Texel Tuning

Post by mvanthoor »

Ras wrote: Fri Jan 19, 2024 10:54 pm That doesn't make sense to me?
What I mean is: if you have training data T and evaluation E, you can compute K to get the lowest MSE by running the training data T through evaluation E before tuning. Understood.

When you use a different training set T2 with evaluation E, you should recompute K.
Do you also need to recompute K if you use the old training data T with a new evaluation E2?
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL