For anyone interested, I posted on my website the C# code I wrote that implements a multithreaded particle swarm optimizer. See MadChess 3.0 Beta Build 75 (Eval Param Tuning). I find the idea rather straightforward and easy to understand. Each iteration a particle's velocity is influenced by inertia and is a) drawn toward their individual best location and b) their swarm's best location and c) the global best location. The magnitude of each vector (a, b, and c) is randomized so the particles jitter in parameterspace.
Has anyone else used PSO to minimize evaluation error? If not, what algorithm(s) have you used? Did you write the code yourself or did you write an evaluation error cost function and plug it into a thirdparty optimization library? Just curious what others have done.
Particle Swarm Optimization Code
Moderators: hgm, Harvey Williamson, bob
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Particle Swarm Optimization Code
My C# chess engine: http://www.madchess.net

 Posts: 3662
 Joined: Fri Mar 10, 2006 4:23 am
 Location: http://www.arasanchess.org
Re: Particle Swarm Optimization Code
For machine learning, the standard algorithm is Stochastic Gradient descent (SGD).
I use a socalled batch gradient descent algorithm called ADAM (https://arxiv.org/pdf/1412.6980.pdf).
These methods can tune parameters to find the minimum error over very large data sets of training examples. For this use case the function that you are minimizing is a sum of errors over the training set. So it is a sum of an error measure based on the evaluation function for chess, or the result of search, over possibly millions of training examples. For this you need a method that converges with relatively few evaluations of this overall error measure. (SGD "cheats" by evaluating only a sample of the training examples in each iteration).
The problem with global optimization methods including PSO is that they can be slow to converge, especially when there are large numbers of variables (a chess evaluation function can have hundreds). They also generally assume that it is virtually free to execute the function being optimized, so the "budget" of evaluations can be a very large number. But for the eval tuning use case this is not true.
I use a socalled batch gradient descent algorithm called ADAM (https://arxiv.org/pdf/1412.6980.pdf).
These methods can tune parameters to find the minimum error over very large data sets of training examples. For this use case the function that you are minimizing is a sum of errors over the training set. So it is a sum of an error measure based on the evaluation function for chess, or the result of search, over possibly millions of training examples. For this you need a method that converges with relatively few evaluations of this overall error measure. (SGD "cheats" by evaluating only a sample of the training examples in each iteration).
The problem with global optimization methods including PSO is that they can be slow to converge, especially when there are large numbers of variables (a chess evaluation function can have hundreds). They also generally assume that it is virtually free to execute the function being optimized, so the "budget" of evaluations can be a very large number. But for the eval tuning use case this is not true.
Re: Particle Swarm Optimization Code
Yes, I know that that's what my code does.The function that you are minimizing is a sum of errors over the training set.
You lay out the challenges quite clearly. Many of these global optimization methods are slow to converge, and even slower because a chess engine's evaluation cost function is so timeconsuming to compute.
My C# chess engine: http://www.madchess.net

 Posts: 3662
 Joined: Fri Mar 10, 2006 4:23 am
 Location: http://www.arasanchess.org
Re: Particle Swarm Optimization Code
The other point is that generally evals are a linear function of eval weights and you can evaluate the gradient of the evaluation function easily. So with a linear function and the gradient there are gradient descent methods that converge quite rapidly.
If though you want a good global method that works with nonlinear functions and does not require gradients, there are a bunch of those. CMAES is one of the better ones. See also https://github.com/avaneev/biteopt, and https://ccse.lbl.gov/people/julianem/index.html. But I don't think the chess use case is a good fit for these. They might have some use optimizing noneval parameters.
Jon
If though you want a good global method that works with nonlinear functions and does not require gradients, there are a bunch of those. CMAES is one of the better ones. See also https://github.com/avaneev/biteopt, and https://ccse.lbl.gov/people/julianem/index.html. But I don't think the chess use case is a good fit for these. They might have some use optimizing noneval parameters.
Jon
Re: Particle Swarm Optimization Code
Most everything in my evaluation function is linear. Though passed pawns, mobility, and king safety are not.The other point is that generally evals are a linear function of eval weights and you can evaluate the gradient of the evaluation function easily.
I wrote the PSO because it's so simple, and frankly, I was too impatient to get my head around the math of other techniques. Not that I can't do it I majored in physics in college and did well in calculus, diffy Q, linear algebra, etc. I'm just out of practice when it comes to reading math in scientific papers, comprehending it, and reformulating the ideas in my own code. It's funny because between my sophomore and junior year in college, I stayed on campus during the summer to write a global optimizer for my physics professor. I remember poring over the Numerical Recipes in C book trying to decipher the LevenbergMarquardt optimization algorithm. I managed to get it working by the end of the summer, optimizing 18 variables of his theoretical equation to the data he had collected in his biophysics research. The program drew a graph on screen, updated after each iteration was calculated. So it was rewarding to see the theoretical graph lines approaching the empirical data closer and closer each iteration. That was 23 years ago.
But I digress... My intent is to run the PSO as I add features to my engine's evaluation function. Perhaps not after every feature, maybe after every two or three. If the PSO fails to find improvements, I'll consider implementing SGD or one of the gradientfree, nonlinear global optimizers you mention. Thanks for your suggestions. Do you know if any of these algorithms are better suited for multithreading than others?
My C# chess engine: http://www.madchess.net

 Posts: 3662
 Joined: Fri Mar 10, 2006 4:23 am
 Location: http://www.arasanchess.org
Re: Particle Swarm Optimization Code
Batch gradient descent parallelizes nicely  each thread just computes a part of the training examples and the parameters are constant for each iteration so you can use an evaluation cache if you have one.
SGD is harder to parallelize because the parameters aren't constant but you can do "minibatches" where they are. There is also a parallel SGD variant called HogWild (https://people.eecs.berkeley.edu/~brech ... wildTR.pdf) but it assumes sparsity in the parameters which is generally not true for chess.
CMAES can support parallelized evaluation I believe.
Jon
SGD is harder to parallelize because the parameters aren't constant but you can do "minibatches" where they are. There is also a parallel SGD variant called HogWild (https://people.eecs.berkeley.edu/~brech ... wildTR.pdf) but it assumes sparsity in the parameters which is generally not true for chess.
CMAES can support parallelized evaluation I believe.
Jon
Re: Particle Swarm Optimization Code
I think the SGD is an wrong name.
Just simply GD is the right name. (Gradient Descent)
Different Types of GDs
 Full  Batch or (vanilla) or (basic): Looping over every training example
 Mini  Batch : we process n training examples at once
 Stochastic  GD: we use JUST USE ONE training example
I use Full Batch GD.
I experiment with three different methods:
 ADAM
 Nesterov Momentum
 Different learning rate for all weights
Each gives similar results..
I'm trying to use the L2 penalty as well.
I feel I have not found yet the perfect hyperparameters. I'm still experimenting.
But the results so far are already very good: ~ 100 elo better than handtuned version.
But there are weights that are few examples. I think they work better if they are not linear. (eg: kingsafety, passed pawn..)