Progress on Rustic

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Progress on Rustic

Post by amanjpro »

Congrats, and I guess I am waiting for a new Rustic binary?
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Progress on Rustic

Post by mvanthoor »

amanjpro wrote: Sat Jun 19, 2021 5:35 am Congrats, and I guess I am waiting for a new Rustic binary?
No; Alpha 3.0.0 is basically the same as you're running now.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
lithander
Posts: 880
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Progress on Rustic

Post by lithander »

mvanthoor wrote: Fri Jun 18, 2021 11:29 pm I'll finally have to look into the tuning part. Maybe it's a chance to use the Rust Rayon library. That thing is awesome. If you have stuff that needs to run from beginning to end (such as a for-loop that doesn't break), you can have Rayon parallelize it automatically.
One of my biggest regrets with my tuner (and the reason why it has never found its way into the 'master' branch) is that I never bothered with making it fast by using multithreading. I should have done that right from the beginning. If it really is almost automatic in Rust, that's a pretty awesome feature you get for free there!
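For anyone curious what "have Rayon parallelize it automatically" looks like in practice, here's a minimal sketch of a parallel MSE loop over the training positions. This is not Rustic's actual code: the Position type, evaluate() and the scaling constant k are placeholders.

use rayon::prelude::*; // rayon = "1" in Cargo.toml

// Placeholders -- a real tuner would plug in the engine's own board and eval.
struct Position;
fn evaluate(_pos: &Position, _params: &[i32]) -> i32 { 0 } // hypothetical stub

// Texel-style sigmoid: map a centipawn score to an expected game result.
fn sigmoid(cp: f64, k: f64) -> f64 {
    1.0 / (1.0 + 10f64.powf(-k * cp / 400.0))
}

// Mean squared error over the training set. Swapping iter() for par_iter()
// is all Rayon needs to spread the loop over every core.
fn mse(data: &[(Position, f64)], params: &[i32], k: f64) -> f64 {
    let sum: f64 = data
        .par_iter()
        .map(|(pos, result)| {
            let predicted = sigmoid(evaluate(pos, params) as f64, k);
            (*result - predicted).powi(2)
        })
        .sum();
    sum / data.len() as f64
}

Rayon's work-stealing scheduler takes care of splitting the slice and balancing the load across threads.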
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
amanjpro
Posts: 883
Joined: Sat Mar 13, 2021 1:47 am
Full name: Amanj Sherwany

Re: Progress on Rustic

Post by amanjpro »

lithander wrote: Sun Jun 20, 2021 2:27 am
mvanthoor wrote: Fri Jun 18, 2021 11:29 pm I'll finally have to look into the tuning part. Maybe it's a chance to use the Rust Rayon library. That thing is awesome. If you have stuff that needs to run from beginning to end (such as a for-loop that doesn't break), you can have Rayon parallelize it automatically.
One of my biggest regrets with my tuner (and the reason why it has never found its way into the 'master' branch) is that I never bothered with making it fast by using multithreading. I should have done that right from the beginning. If it really is almost automatic in Rust, that's a pretty awesome feature you get for free there!
For Zahak, I basically had all the EPD positions in an array, and I had N goroutines (close to actors in Erlang and other languages; you can even use futures or any other similar concept for that), each responsible for running the evaluation function on a portion of the array. And since the result of the evaluation is a monoid (it has a zero value and is associative and commutative), I could simply aggregate the results after waiting for all routines to finish.

I believe you can easily do the same in C#.

Granted, this won't give you parallelism when changing the values of the evaluation parameters, only when testing them, but I believe that is the only place where it matters.
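For comparison, here is the same chunk-and-aggregate idea sketched in Rust with scoped threads instead of goroutines. Position and per_position_error() are hypothetical placeholders, and workers is assumed to be at least 1.

use std::thread;

struct Position; // placeholder
fn per_position_error(_pos: &Position) -> f64 { 0.0 } // hypothetical stub

// Split the position array into one chunk per worker, sum each chunk on its
// own thread, then add the partial sums. The final addition is valid exactly
// because the error is a plain sum: it has a zero element and is associative
// and commutative.
fn total_error(positions: &[Position], workers: usize) -> f64 {
    let chunk_size = ((positions.len() + workers - 1) / workers).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = positions
            .chunks(chunk_size)
            .map(|chunk| {
                // Each thread works on its own slice of the array.
                s.spawn(move || chunk.iter().map(per_position_error).sum::<f64>())
            })
            .collect();
        // Wait for all workers and aggregate their partial results.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<f64>()
    })
}

Rayon's par_iter() does the same thing with far less ceremony, as mentioned above.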
lithander
Posts: 880
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Progress on Rustic

Post by lithander »

amanjpro wrote: Sun Jun 20, 2021 4:51 am For Zahak, I basically had all the EPD positions in an array, and I had N goroutines (close to actors in Erlang and other languages; you can even use futures or any other similar concept for that), each responsible for running the evaluation function on a portion of the array. And since the result of the evaluation is a monoid (it has a zero value and is associative and commutative), I could simply aggregate the results after waiting for all routines to finish.

I believe you can easily do the same in C#.
That it would have been so easy is exactly why I regret not having done it. ;) I was just running the tuner during work hours, where I can't spare that many cores anyway and where it didn't really matter how fast the tuning process was. But the tuner is clearly worse off for the lack of parallelism. I just didn't really consider it part of the product for the longest time.

Well maybe at some point I'm going to revisit the topic. There are other things I wanted to try but never got around to... simulated annealing for example. (anybody ever tried that successfully?)
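For reference, the bare-bones shape of simulated annealing over the evaluation parameters would be roughly this. It's an untested sketch: mse stands in for whatever error function the tuner already computes, and the schedule constants are arbitrary.

use rand::Rng; // rand = "0.8" in Cargo.toml

// Bare-bones simulated annealing over integer eval parameters.
fn anneal(mut params: Vec<i32>, mse: impl Fn(&[i32]) -> f64) -> Vec<i32> {
    let mut rng = rand::thread_rng();
    let mut current_err = mse(&params);
    let mut temp = 1.0_f64;

    while temp > 1e-4 {
        // Propose a small random change to one parameter.
        let i = rng.gen_range(0..params.len());
        let delta = if rng.gen_bool(0.5) { 1 } else { -1 };
        params[i] += delta;

        let err = mse(&params);
        let diff = err - current_err;
        // Metropolis rule: always accept improvements, sometimes accept
        // regressions (more often while the temperature is still high).
        if diff <= 0.0 || rng.gen::<f64>() < (-diff / temp).exp() {
            current_err = err;
        } else {
            params[i] -= delta; // revert the change
        }
        temp *= 0.999; // cool down
    }
    params
}

In practice you would tune the temperature schedule and step size, and keep the best parameter set seen so far rather than just the final one.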
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Progress on Rustic

Post by connor_mcmonigle »

lithander wrote: Sun Jun 20, 2021 10:54 pm ...
Well maybe at some point I'm going to revisit the topic. There are other things I wanted to try but never got around to... simulated annealing for example. (anybody ever tried that successfully?)
If you're just trying to fit some set of labeled positions with a linear model (linear in parameter space) and using MSE (or MSE-sigmoid, or cross-entropy) for the loss, you can't do any better than gradient descent, as the problem is convex. Gradient descent is guaranteed to converge to the global optimum.
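For reference, the standard argument, writing the model as e(x) = w^T phi(x) (phi(x) being the 0/1 feature vector) and using plain MSE:

L(w) = \frac{1}{N}\sum_{i=1}^{N}\bigl(w^\top \phi(x_i) - y_i\bigr)^2,
\qquad
\nabla^2 L(w) = \frac{2}{N}\sum_{i=1}^{N}\phi(x_i)\,\phi(x_i)^\top \succeq 0

The Hessian is positive semidefinite everywhere, so the loss is convex and any stationary point gradient descent finds is a global minimum.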
lithander
Posts: 880
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Progress on Rustic

Post by lithander »

connor_mcmonigle wrote: Mon Jun 21, 2021 12:41 am If you're just trying to fit some set of labeled positions with a linear model (linear in parameter space) and using MSE (or MSE-sigmoid, or cross-entropy) for the loss, you can't do any better than gradient descent, as the problem is convex. Gradient descent is guaranteed to converge to the global optimum.
So a tapered PST evaluation would still be considered a linear model? Just lots of features (the presence of pieces on squares, either 1 or 0) multiplied by a lot of weights (the material value of the piece on that square) and added up. And that's like a line in multidimensional space, so there are no local maxima you could get stuck on?

So if you use the same tuner and the same dataset and you randomize the PSTs before starting the tuning, it will always converge on the same set of PSTs?

(It's been 15 years since my last maths class. Intuitively, it felt like there were many sets of tables where the tuner wouldn't find any way forward anymore. I assumed this was because of getting stuck on some kind of local maximum or "hill".)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Progress on Rustic

Post by connor_mcmonigle »

lithander wrote: Mon Jun 21, 2021 1:18 am
connor_mcmonigle wrote: Mon Jun 21, 2021 12:41 am If you're just trying to fit some set of labeled positions with a linear model (linear in parameter space) and using MSE (or MSE-sigmoid, or cross-entropy) for the loss, you can't do any better than gradient descent, as the problem is convex. Gradient descent is guaranteed to converge to the global optimum.
So a tapered PST evaluation would still be considered a linear model? Just lots of features (the presence of pieces on squares, either 1 or 0) multiplied by a lot of weights (the material value of the piece on that square) and added up. And that's like a line in multidimensional space, so there are no local maxima you could get stuck on?

So if you use the same tuner and the same dataset and you randomize the PSTs before starting the tuning, it will always converge on the same set of PSTs?

(It's been 15 years since my last maths class. Intuitively, it felt like there were many sets of tables where the tuner wouldn't find any way forward anymore. I assumed this was because of getting stuck on some kind of local maximum or "hill".)
This is correct in my understanding. Even with a tapered evaluation, the model is still linear in parameter space (the tapering scaling can just be pushed back to the split terms, if that makes sense), which means that one should expect gradient descent to converge to very nearly the exact same values for a given dataset, irrespective of the initialization. In fact, even a coordinate descent approach should converge to the same global optimum, albeit slowly.
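To spell out the "pushed back to the split terms" part: write the game phase as p in [0, 1], which depends only on the position, never on the weights. Then

\mathrm{eval}(x) = p\, w_{mg}^\top f(x) + (1-p)\, w_{eg}^\top f(x)
= \begin{bmatrix} w_{mg} \\ w_{eg} \end{bmatrix}^\top
\begin{bmatrix} p\, f(x) \\ (1-p)\, f(x) \end{bmatrix}

so the tapered evaluation is still one linear model over the stacked middlegame/endgame weight vector, just with phase-scaled features.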
j.t.
Posts: 239
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Progress on Rustic

Post by j.t. »

While I am not sure about optimizing only PSTs, there are definitely situations where you can't find the global optimum using gradient descent if you want to optimize all evaluation parameters, so experimenting with optimization methods other than gradient-based searches makes sense, in my opinion.

Regarding PSTs, I usually convert the final evaluation value from centipawns to a winning probability to calculate the error, as 1000 cp vs. 1100 cp is not a significant difference in evaluation accuracy, but 0 cp vs. 100 cp is. Intuitively, this already means that my PST evaluation is very similar to a simple neural network with an activation function. And neural networks with activation functions definitely have local optima, so I always assumed that a PST does too.
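For concreteness, the conversion I mean is essentially the usual Texel-tuning sigmoid (the scaling constant K has to be fitted per engine, so treat the exact value as an assumption):

P(\text{win}) \approx \frac{1}{1 + 10^{-K \cdot \mathrm{cp} / 400}}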
connor_mcmonigle
Posts: 530
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Progress on Rustic

Post by connor_mcmonigle »

j.t. wrote: Mon Jun 21, 2021 2:14 am While I am not sure about optimizing only PSTs, there are definitely situations where you can't find the global optimum using gradient descent if you want to optimize all evaluation parameters, so experimenting with optimization methods other than gradient-based searches makes sense, in my opinion.

Regarding PSTs, I usually convert the final evaluation value from centipawns to a winning probability to calculate the error, as 1000 cp vs. 1100 cp is not a significant difference in evaluation accuracy, but 0 cp vs. 100 cp is. Intuitively, this already means that my PST evaluation is very similar to a simple neural network with an activation function. And neural networks with activation functions definitely have local optima, so I always assumed that a PST does too.
Neural networks definitely have local optima (though this is mostly irrelevant in high-dimensional spaces, as there is almost always a direction which leads down). Just because there is a nonlinearity on the output does not mean that the loss space isn't convex (see GLMs). However, I'm quite confident that PST and tapered PST are both linear in parameter space, which implies convexity (provided MSE is used for the loss) and that gradient descent will converge to the global optimum.

I'm not sure whether this convergence guarantee can be extended to the *"MSE-sigmoid" loss function that is popular in computer chess. Intuitively, I think convergence can still be guaranteed, as the loss space should still be quasi-convex with only one stationary point (which I think is sufficient for guaranteed convergence). I'm too lazy to try to prove this, though.

*L=(sigmoid(x)-label)^2
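For what it's worth, with the textbook logistic sigma(s) = 1/(1+e^-s) and s = w^T phi(x) (the 10^(-Ks/400) form used in chess just rescales s by a constant), the per-position gradient is

\frac{\partial L}{\partial w} = 2\,\bigl(\sigma(s) - \mathrm{label}\bigr)\,\sigma(s)\,\bigl(1-\sigma(s)\bigr)\,\phi(x)

Each position contributes its residual times phi(x), scaled by the strictly positive factor sigma(s)(1-sigma(s)), so the nonlinearity rescales but never flips the sign of a position's contribution. That is the intuition behind the single-stationary-point hunch, but as said, it is not a proof.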