Particle Swarm Optimization Code

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Post Reply
User avatar
emadsen
Posts: 184
Joined: Wed Apr 25, 2012 11:51 pm
Location: Naperville, IL, USA
Contact:

Particle Swarm Optimization Code

Post by emadsen » Sat Nov 24, 2018 10:38 pm

For anyone interested, I posted on my website the C# code I wrote that implements a multi-threaded particle swarm optimizer. See MadChess 3.0 Beta Build 75 (Eval Param Tuning). I find the idea rather straight-forward and easy to understand. Each iteration a particle's velocity is influenced by inertia and is a) drawn toward their individual best location and b) their swarm's best location and c) the global best location. The magnitude of each vector (a, b, and c) is randomized so the particles jitter in parameter-space.

Has anyone else used PSO to minimize evaluation error? If not, what algorithm(s) have you used? Did you write the code yourself or did you write an evaluation error cost function and plug it into a third-party optimization library? Just curious what others have done.
My C# chess engine: http://www.madchess.net

jdart
Posts: 3624
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Particle Swarm Optimization Code

Post by jdart » Sun Nov 25, 2018 1:51 am

For machine learning, the standard algorithm is Stochastic Gradient descent (SGD).

I use a so-called batch gradient descent algorithm called ADAM (https://arxiv.org/pdf/1412.6980.pdf).

These methods can tune parameters to find the minimum error over very large data sets of training examples. For this use case the function that you are minimizing is a sum of errors over the training set. So it is a sum of an error measure based on the evaluation function for chess, or the result of search, over possibly millions of training examples. For this you need a method that converges with relatively few evaluations of this overall error measure. (SGD "cheats" by evaluating only a sample of the training examples in each iteration).

The problem with global optimization methods including PSO is that they can be slow to converge, especially when there are large numbers of variables (a chess evaluation function can have hundreds). They also generally assume that it is virtually free to execute the function being optimized, so the "budget" of evaluations can be a very large number. But for the eval tuning use case this is not true.

User avatar
emadsen
Posts: 184
Joined: Wed Apr 25, 2012 11:51 pm
Location: Naperville, IL, USA
Contact:

Re: Particle Swarm Optimization Code

Post by emadsen » Tue Nov 27, 2018 1:10 pm

The function that you are minimizing is a sum of errors over the training set.
Yes, I know that- that's what my code does.

You lay out the challenges quite clearly. Many of these global optimization methods are slow to converge, and even slower because a chess engine's evaluation cost function is so time-consuming to compute.
My C# chess engine: http://www.madchess.net

jdart
Posts: 3624
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Particle Swarm Optimization Code

Post by jdart » Wed Nov 28, 2018 2:58 am

The other point is that generally evals are a linear function of eval weights and you can evaluate the gradient of the evaluation function easily. So with a linear function and the gradient there are gradient descent methods that converge quite rapidly.

If though you want a good global method that works with nonlinear functions and does not require gradients, there are a bunch of those. CMA-ES is one of the better ones. See also https://github.com/avaneev/biteopt, and https://ccse.lbl.gov/people/julianem/index.html. But I don't think the chess use case is a good fit for these. They might have some use optimizing non-eval parameters.

--Jon

User avatar
emadsen
Posts: 184
Joined: Wed Apr 25, 2012 11:51 pm
Location: Naperville, IL, USA
Contact:

Re: Particle Swarm Optimization Code

Post by emadsen » Fri Nov 30, 2018 5:11 am

The other point is that generally evals are a linear function of eval weights and you can evaluate the gradient of the evaluation function easily.
Most everything in my evaluation function is linear. Though passed pawns, mobility, and king safety are not.

I wrote the PSO because it's so simple, and frankly, I was too impatient to get my head around the math of other techniques. Not that I can't do it- I majored in physics in college and did well in calculus, diffy Q, linear algebra, etc. I'm just out of practice when it comes to reading math in scientific papers, comprehending it, and reformulating the ideas in my own code. It's funny because between my sophomore and junior year in college, I stayed on campus during the summer to write a global optimizer for my physics professor. I remember poring over the Numerical Recipes in C book trying to decipher the Levenberg-Marquardt optimization algorithm. I managed to get it working by the end of the summer, optimizing 18 variables of his theoretical equation to the data he had collected in his bio-physics research. The program drew a graph on screen, updated after each iteration was calculated. So it was rewarding to see the theoretical graph lines approaching the empirical data closer and closer each iteration. That was 23 years ago.

But I digress... My intent is to run the PSO as I add features to my engine's evaluation function. Perhaps not after every feature, maybe after every two or three. If the PSO fails to find improvements, I'll consider implementing SGD or one of the gradient-free, non-linear global optimizers you mention. Thanks for your suggestions. Do you know if any of these algorithms are better suited for multi-threading than others?
My C# chess engine: http://www.madchess.net

jdart
Posts: 3624
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Particle Swarm Optimization Code

Post by jdart » Fri Nov 30, 2018 3:01 pm

Batch gradient descent parallelizes nicely - each thread just computes a part of the training examples and the parameters are constant for each iteration so you can use an evaluation cache if you have one.

SGD is harder to parallelize because the parameters aren't constant but you can do "mini-batches" where they are. There is also a parallel SGD variant called HogWild (https://people.eecs.berkeley.edu/~brech ... wildTR.pdf) but it assumes sparsity in the parameters which is generally not true for chess.

CMA-ES can support parallelized evaluation I believe.

--Jon

tomitank
Posts: 185
Joined: Sat Mar 04, 2017 11:24 am
Location: Hungary

Re: Particle Swarm Optimization Code

Post by tomitank » Fri Nov 30, 2018 3:58 pm

jdart wrote:
Sun Nov 25, 2018 1:51 am
For machine learning, the standard algorithm is Stochastic Gradient descent (SGD).
I think the SGD is an wrong name.
Just simply GD is the right name. (Gradient Descent)

Different Types of GDs
- Full - Batch or (vanilla) or (basic): Looping over every training example
- Mini -  Batch : we process n training examples at once
- Stochastic  - GD: we use JUST USE ONE training example

I use Full- Batch GD.
I experiment with three different methods:
- ADAM
- Nesterov Momentum
- Different learning rate for all weights

Each gives similar results..

I'm trying to use the L2 penalty as well.

I feel I have not found yet the perfect hyper-parameters. I'm still experimenting.

But the results so far are already very good: ~ 100 elo better than hand-tuned version.

But there are weights that are few examples. I think they work better if they are not linear. (eg: king-safety, passed pawn..)

Post Reply