I've recently taken a stab at writing a texel tuner, and after a bit of work it seems to be working; it's pretty cool seeing the tuner automatically update the piece-square tables from the training set. However, I'm curious about two things right now.
The tuner seems reasonably fast and calculates the mean square error of my training set (~3.5M quiet positions from Zurichess) in about 8-12 seconds. Still, I'm trying to tune 768 weights (6 middle-game and 6 end-game piece-square tables, for tapered eval), so going by the time it takes to recalculate the mean square error after adjusting a single weight, one iteration through the entire set of weights would take anywhere between two and four hours. And I'm planning on adding more weights for tuning mobility bonuses and the like.
I'm currently just using the naive implementation of the error-minimization algorithm found on the CPW, so I knew it would be slow. What would be some good ways to speed it up? I know of gradient descent, and while I understand the concept, the exact math and implementation details are still pretty fuzzy to me, especially when variants of gradient descent are brought up. I'll keep looking into this, and any good resources would be appreciated.
Second, from some preliminary testing: while I can see the tuning is starting to encode basic chess principles into the piece-square tables, the tables are still pretty awful and don't even compete with handcrafted ones. I'm assuming that running through the full dataset every iteration, and running several hundred iterations, will significantly improve things? Along with going through several training sessions and shuffling the EPD file? Lastly, if it matters, I'm baking the baseline material values into the piece-square tables themselves and letting the tuner run from there (e.g. every square in the pawn middle-game table starts at 100, every square in the knight middle-game table starts at 300, etc.). That way, I don't need to count material separately.
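For anyone following along, the naive local-optimization loop from the CPW that's being discussed can be sketched roughly like this in Go. Everything here is an illustrative toy (the feature-vector eval, the two-position dataset, the function names), not any engine's actual code; the point is just the +1/-1 step-and-keep structure and the sigmoid-based mean square error:

```go
package main

import (
	"fmt"
	"math"
)

// Toy stand-ins for a real tuner: each "position" has a feature vector
// and a game result in [0, 1] (1 = white win, 0.5 = draw, 0 = black win).
type position struct {
	features []int   // how many times each weight fires in this position
	result   float64 // actual game outcome
}

// eval is a linear evaluation: dot product of weights and features.
func eval(weights []int, p position) int {
	score := 0
	for i, f := range p.features {
		score += weights[i] * f
	}
	return score
}

// meanSquaredError maps each eval through a sigmoid and compares it to
// the game result, as on the CPW Texel tuning page. k is the scaling
// constant that fits the sigmoid to the engine's eval scale.
func meanSquaredError(weights []int, positions []position, k float64) float64 {
	sum := 0.0
	for _, p := range positions {
		sigmoid := 1.0 / (1.0 + math.Pow(10, -k*float64(eval(weights, p))/400.0))
		diff := p.result - sigmoid
		sum += diff * diff
	}
	return sum / float64(len(positions))
}

// tune is the naive local search: nudge each weight by +1 and keep the
// change if the error drops; otherwise try -1; repeat full passes until
// no single step improves the error.
func tune(weights []int, positions []position, k float64) []int {
	bestErr := meanSquaredError(weights, positions, k)
	improved := true
	for improved {
		improved = false
		for i := range weights {
			weights[i]++
			if e := meanSquaredError(weights, positions, k); e < bestErr {
				bestErr = e
				improved = true
				continue
			}
			weights[i] -= 2
			if e := meanSquaredError(weights, positions, k); e < bestErr {
				bestErr = e
				improved = true
				continue
			}
			weights[i]++ // neither direction helped; restore
		}
	}
	return weights
}

func main() {
	positions := []position{
		{features: []int{1, 0}, result: 1.0},
		{features: []int{0, 1}, result: 0.0},
	}
	fmt.Println(tune([]int{0, 0}, positions, 1.0))
}
```

Note the cost structure this makes visible: every single +1/-1 probe pays for a full pass over the dataset, which is why one iteration over 768 weights and 3.5M positions gets so expensive.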
Texel tuning questions
Moderator: Ras
-
amanjpro
- Posts: 883
- Joined: Sat Mar 13, 2021 1:47 am
- Full name: Amanj Sherwany
Re: Texel tuning questions
algerbrex wrote: ↑Tue Aug 17, 2021 4:34 pm [full quote snipped; see above]
My engine is also written in Go, and I too implement the naive algorithm (really local optimization). You can do the following:
- Parse all the positions once, and load them into memory
- Add bounds to values where it makes sense; for example, you should probably never have a passed-pawn bonus that is negative!
- Add multi-threading (if you don't have it)
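The first suggestion, parsing once and keeping everything in memory, might look something like this sketch. The result-comment format shown is the one the Zurichess labeled EPD sets use (a `c9` opcode holding `"1-0"`, `"1/2-1/2"`, or `"0-1"`); the struct fields and function names are just illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// labeledPosition holds everything the tuner needs, parsed exactly once
// up front instead of re-parsed on every pass over the dataset.
type labeledPosition struct {
	fen    string  // the EPD fields before the result opcode
	result float64 // game outcome: 1.0 white win, 0.5 draw, 0.0 black win
}

// parseEPD converts Zurichess-style labeled EPD lines, where the game
// result is stored in a c9 opcode, into an in-memory slice. Malformed
// lines are skipped here for brevity; a real tuner should report them.
func parseEPD(lines []string) []labeledPosition {
	var positions []labeledPosition
	for _, line := range lines {
		idx := strings.Index(line, `c9 "`)
		if idx < 0 {
			continue
		}
		fen := strings.TrimSpace(line[:idx])
		var result float64
		switch {
		case strings.Contains(line[idx:], "1-0"):
			result = 1.0
		case strings.Contains(line[idx:], "1/2-1/2"):
			result = 0.5
		case strings.Contains(line[idx:], "0-1"):
			result = 0.0
		default:
			continue
		}
		positions = append(positions, labeledPosition{fen: fen, result: result})
	}
	return positions
}

func main() {
	sample := []string{
		`rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - c9 "1/2-1/2";`,
		`4k3/8/8/8/8/8/4P3/4K3 w - - c9 "1-0";`,
	}
	positions := parseEPD(sample)
	fmt.Println(len(positions), positions[0].result, positions[1].result)
}
```

In a real tuner you would go one step further and also convert each FEN into the engine's board representation (or even a feature vector) at load time, so the hot loop touches only in-memory data.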
-
AndrewGrant
- Posts: 1960
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: Texel tuning questions
https://github.com/AndyGrant/Ethereal/b ... Tuning.pdf
This was my best contribution to computer chess aside from OpenBench I suppose. Ethereal's quick rise was heavily due to the speed at which new terms could be created, inserted into the tuner, and tuned to a degree of certainty such that you could discard the term or not based just on the SPRT testing.
-
algerbrex
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Texel tuning questions
amanjpro wrote: ↑Tue Aug 17, 2021 4:39 pm [full quote snipped; see above]
Hi Amanj, thanks for the advice. I'll definitely go ahead and make the change to load all the positions into memory.
With regards to your last two points:
- That sounds like a good idea, and you're right: I noticed that my tuner would sometimes assign stupidly high or low values. (Ironically enough, passers being valued too highly is my issue. I suppose the positions I sampled had passed pawns as the deciding factor, so I have piece-square tables where pawns on the 7th and 2nd ranks are valued the same as two minor pieces!) How would I go about doing this efficiently, though? I already have a good idea of which values should be bounded and where, but I have to think about how exactly I'd implement that. It'll probably occur to me soon enough.
- Another good idea, but I'm not sure how to go about multithreading the process either, particularly in Go. I'm aware of goroutines and so on, but it's not yet clear to me how to piece everything together, especially in the context of texel tuning. But I'll keep doing some more reading.
-
algerbrex
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Texel tuning questions
AndrewGrant wrote: ↑Tue Aug 17, 2021 4:58 pm https://github.com/AndyGrant/Ethereal/b ... Tuning.pdf
Thanks, Andrew, I'll definitely look through that paper, although it'll probably take me a while to grok everything.
AndrewGrant wrote: ↑Tue Aug 17, 2021 4:58 pm Ethereal's quick rise was heavily due to the speed at which new terms could be created, inserted into the tuner, and tuned to a degree of certainty such that you could discard the term or not based just on the SPRT testing.
That makes sense. I've heard from several experienced chess engine writers that testing is the key to making a strong engine, so my hope is that once I invest in some better computing hardware, I can begin using tuning and SPRT testing to make strong progress.
-
algerbrex
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Texel tuning questions
amanjpro wrote: ↑Tue Aug 17, 2021 4:39 pm [full quote snipped; see above]
By the way, did you need to run through many iterations of the full training set before you started to get good piece-square tables? I'm just trying to figure out the optimal ratio of iterations to positions evaluated. My intuition is that it's better to go through the full dataset with fewer iterations, and vice versa. The only problem is that one run through the full training set looks like it'll take an hour or so. But I'll go ahead and implement your other two suggestions and see what kind of speed improvement I get.
-
jdart
- Posts: 4410
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: Texel tuning questions
>did you need to run through many iterations of the full training set before you started to get good piece-square tables?
My tuner is multithreaded and typically takes about 150 iterations to converge. That's maybe an hour of runtime on a 24-core Xeon.
-
amanjpro
- Posts: 883
- Joined: Sat Mar 13, 2021 1:47 am
- Full name: Amanj Sherwany
Re: Texel tuning questions
Hey Christian,
You can definitely take a look at my code, and if you want, feel free to copy-adapt-paste (if you want to understand what is happening, that is a different story): https://github.com/amanjpro/zahak/blob/ ... _tuning.go
The idea for multi-threading is simple. Let's say you have 10 cores and 1M positions:
- A single-threaded loop tunes one single weight (+1 or -1)
- You launch a job to test the evaluation on the positions (Eval and MSE)
- Since Eval produces a single integer, and integer addition can be done in any order (x+y+z == z+y+x), you can run the eval function over the positions in any order and still get the same result. Knowing that, you can tell each core to compute the evals of 100K positions, then sum their results to get the MSE. And you continue doing that.
- Here is my multi-threaded MSE: https://github.com/amanjpro/zahak/blob/ ... #L401-L414
As for the bounds: my weights have both upper and lower bounds, and I track them naively. You don't need to squeeze out every bit of performance here, as this is not the bottleneck. Here is how I do it: https://github.com/amanjpro/zahak/blob/ ... ng.go#L327
I was lucky that my weights are ordered in a way that everything before index 768 can go below zero and everything after it cannot; you can do something similar.
If your tuner takes 2 hours, run it for 2 hours... you should give it time to produce optimal weights. I have no idea what the behaviour will be if you cut the tuner short.
-
algerbrex
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Texel tuning questions
Hmm, ok, thanks. The reason I asked is that by my estimate, even one iteration through the full training set is going to take my tuner 2-4 hours. So for any reasonable convergence to occur, I'm looking at days to weeks of runtime, which seems a little extreme, especially given the numbers you posted. I'm going to try out some of the recommendations I've been given here and see if that speeds things up.
-
algerbrex
- Posts: 608
- Joined: Sun May 30, 2021 5:03 am
- Location: United States
- Full name: Christian Dean
Re: Texel tuning questions
amanjpro wrote: ↑Tue Aug 17, 2021 6:23 pm [full quote snipped; see above]
Thanks, Amanj, all of the links you posted were very helpful, and my code actually looks pretty much identical to yours, so that's encouraging.
I'll add the bounds the way you're doing it in your code, so hopefully that'll help.
And thanks for the explanation of multi-threading; I think I understand what you mean, so I'll try to implement it in my own code.