parameter tuning in chess4j and Prophet

jswaff
Posts: 107
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

parameter tuning in chess4j and Prophet

Post by jswaff »

Hi all.

I've written a blog article about my experience with auto-tuning in chess4j and Prophet.

http://jamesswafford.com/2022/07/02/aut ... n-chess4j/

I hope you enjoy. :)

--
James
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: parameter tuning in chess4j and Prophet

Post by dangi12012 »

Good read - thanks for sharing.

Additional thoughts:
1)
You don't have to dive deep into neural networks, because what you are describing is already a single-layer network. What you call the hypothesis is the activation function.

2)
It comes up time and time again, but how hard would it be to adapt your tuning during actual gameplay (at runtime)? The keyword is dynamic learning.
Essentially, your training set for "opening", "midgame", and "endgame" would have to be computed at runtime from the current position rather than fixed from the start.
You would then get better piece values for your current position, provided this can be done in a few hundred milliseconds.

3)
Who says that there need to be only 3 pawn PST networks? If training is easy, why not train them per ply? Ideally the values would not change much from ply to ply, and it could already be interesting if you are using linear interpolation for values between midgame and endgame.
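
A minimal sketch of the linear interpolation mentioned in point 3, assuming a game phase that runs from 1.0 (middlegame) down to 0.0 (endgame). The function name and the phase convention are illustrative assumptions, not taken from chess4j or Prophet.

```python
def tapered_value(mg_value: float, eg_value: float, phase: float) -> float:
    """Blend a middlegame and an endgame value linearly.

    Assumption: phase is 1.0 with full middlegame material and 0.0 in a
    bare endgame. How the phase is derived from material is engine-specific.
    """
    if not 0.0 <= phase <= 1.0:
        raise ValueError("phase must be in [0, 1]")
    return phase * mg_value + (1.0 - phase) * eg_value
```

A per-ply table, as suggested above, would replace the two endpoints with one value per ply; the interpolation then becomes a lookup.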
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: parameter tuning in chess4j and Prophet

Post by algerbrex »

Nice read, I enjoyed it a lot.

Indeed, gradient descent is much more practical than naive Texel tuning. Now that I've finally switched over, development is much quicker. I'm now able to run 50K tuning iterations on 1-2M positions in only about 4 hours, whereas with the old tuner I could use 400-600K positions at most, and that would take 10-12 hours or sometimes even longer. And the values I get are better!

It was pretty incredible running the tuner for the first time on the Zurichess dataset, and in only about 5 minutes and 10K iterations, I had a set of evaluation parameters, tuned completely from scratch, that beat the old ones that I had spent hours tuning by 60 Elo!
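
For readers who have not seen the gradient-descent form of Texel tuning, here is a rough sketch. It assumes a linear evaluation (a weighted sum of features), a sigmoid that maps the score to an expected game result, and a mean squared error against results of 1, 0.5, or 0. All names and the scaling constant `k` are illustrative assumptions, not code from Blunder, chess4j, or Prophet.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, features, k=1.0):
    """Expected result in (0, 1) for one position under a linear eval."""
    score = sum(w * f for w, f in zip(weights, features))
    return sigmoid(k * score)


def mean_squared_error(weights, positions, k=1.0):
    """positions is a list of (features, result) pairs."""
    total = sum((result - predict(weights, features, k)) ** 2
                for features, result in positions)
    return total / len(positions)


def gradient_step(weights, positions, learning_rate=0.1, k=1.0):
    """One full-batch gradient descent step on the mean squared error."""
    grad = [0.0] * len(weights)
    for features, result in positions:
        p = predict(weights, features, k)
        # d/dw of (result - p)^2, using dp/dscore = k * p * (1 - p)
        common = -2.0 * (result - p) * p * (1.0 - p) * k
        for i, f in enumerate(features):
            grad[i] += common * f
    n = len(positions)
    return [w - learning_rate * g / n for w, g in zip(weights, grad)]
```

Each step costs one pass over the data regardless of how many parameters there are, which is why this scales to far more positions than nudging one parameter at a time.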
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: parameter tuning in chess4j and Prophet

Post by algerbrex »

dangi12012 wrote: Sat Jul 02, 2022 7:26 pm You don't have to dive deep into neural networks, because what you are describing is already a single-layer network. What you call the hypothesis is the activation function.
Correct me if I'm wrong since I'm also planning to begin working on a simple (768 -> 16 -> 1) neural network to experiment with in Blunder, but isn't the activation function something that's per neuron? Or is the term also used to apply to the whole network?
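
To make the question concrete, here is a hedged sketch of a forward pass through a 768 -> 16 -> 1 network. The activation is applied to each hidden neuron individually and then once more to the single output neuron. The choice of ReLU for the hidden layer and a sigmoid for the output is an assumption for illustration, as are all names; this is not Blunder's code.

```python
import math
import random

INPUTS, HIDDEN = 768, 16


def relu(x: float) -> float:
    return max(0.0, x)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def random_network(seed=0):
    """Small random weights, only so the sketch can be run."""
    rng = random.Random(seed)
    hidden_weights = [[rng.uniform(-0.1, 0.1) for _ in range(INPUTS)]
                      for _ in range(HIDDEN)]
    hidden_biases = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
    output_weights = [rng.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
    output_bias = rng.uniform(-0.1, 0.1)
    return hidden_weights, hidden_biases, output_weights, output_bias


def forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    # Each hidden neuron: weighted sum of the inputs, then its own activation.
    hidden = [relu(sum(w * f for w, f in zip(row, features)) + b)
              for row, b in zip(hidden_weights, hidden_biases)]
    # The output neuron: weighted sum of hidden outputs, then a sigmoid.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
```

With the hidden layer removed, the output neuron alone is the single-layer case from the quoted post: a weighted sum of features passed through one sigmoid.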
zenpawn
Posts: 349
Joined: Sat Aug 06, 2016 8:31 pm
Location: United States

Re: parameter tuning in chess4j and Prophet

Post by zenpawn »

algerbrex wrote: Sat Jul 02, 2022 9:58 pm Indeed, gradient descent is much more practical than naive Texel tuning. Now that I've finally switched over, development is much quicker. I'm now able to run 50K tuning iterations on 1-2M positions in only about 4 hours, whereas with the old tuner I could use 400-600K positions at most, and that would take 10-12 hours or sometimes even longer. And the values I get are better!

It was pretty incredible running the tuner for the first time on the Zurichess dataset, and in only about 5 minutes and 10K iterations, I had a set of evaluation parameters, tuned completely from scratch, that beat the old ones that I had spent hours tuning by 60 Elo!
Tens of thousands of iterations? Do you not have stopping conditions? My tuning sessions usually only need 100 or so iterations before the error stops improving.
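
One common form of stopping condition is "patience": stop once the error has failed to improve by some minimum amount for a number of consecutive iterations. The sketch below assumes the tuner exposes a `step` function that returns updated weights and an `error` function that scores them. Both names, and the default thresholds, are hypothetical.

```python
def tune_with_early_stopping(step, error, weights,
                             max_iterations=10_000, patience=10, min_delta=1e-7):
    """Run step() until the error stops improving or max_iterations is hit.

    Returns (best_weights, best_error, iterations_run).
    """
    best_weights = weights
    best_error = error(weights)
    stale = 0          # consecutive iterations without a real improvement
    iterations = 0
    for iterations in range(1, max_iterations + 1):
        weights = step(weights)
        current = error(weights)
        if best_error - current > min_delta:
            best_weights, best_error, stale = weights, current, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_weights, best_error, iterations
```

With a small learning rate, `min_delta` has to be small as well, otherwise the loop stops while the error is still creeping downward.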
Erin Dame
Author of RookieMonster
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: parameter tuning in chess4j and Prophet

Post by algerbrex »

zenpawn wrote: Sun Jul 03, 2022 2:17 am Tens of thousands of iterations? Do you not have stopping conditions? My tuning sessions usually only need 100 or so iterations before the error stops improving.
I currently don't, no, but that's something I should look into. What I found with my tuner, though, is that I got much higher-quality results from a much smaller learning rate and more iterations than from a higher learning rate and only a couple of hundred iterations. 50K iterations gave better results than 10K.

And to make sure I don't overfit, I have my tuner record the error about 100 times over the course of tuning and save the results to a file so I can plot them. The plot from the most recent tuning session (image not preserved here) shows the error trending downward, even after several thousand games.

Now, I'm not saying my approach is optimal. I'm not an expert in gradient descent, but it's worked very well for me so far.
jdart
Posts: 4410
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: parameter tuning in chess4j and Prophet

Post by jdart »

I used ADAM, which tunes each parameter with separate decay schedules.
Convergence typically occurs in 100-200 iterations.
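
For reference, a sketch of the standard Adam update. It keeps a running average of the gradient and of the squared gradient for every parameter, so each parameter ends up with its own effective step size. The class name and interface are illustrative assumptions and not Arasan's implementation; the default hyperparameters are the commonly used ones.

```python
import math


class Adam:
    def __init__(self, size, learning_rate=0.01, beta1=0.9, beta2=0.999, epsilon=1e-8):
        self.learning_rate = learning_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon
        self.m = [0.0] * size   # running average of gradients
        self.v = [0.0] * size   # running average of squared gradients
        self.t = 0              # number of steps taken so far

    def step(self, weights, gradients):
        """Return updated weights for one Adam step."""
        self.t += 1
        updated = []
        for i, (w, g) in enumerate(zip(weights, gradients)):
            self.m[i] = self.beta1 * self.m[i] + (1.0 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1.0 - self.beta2) * g * g
            # Bias correction compensates for m and v starting at zero.
            m_hat = self.m[i] / (1.0 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1.0 - self.beta2 ** self.t)
            updated.append(w - self.learning_rate * m_hat / (math.sqrt(v_hat) + self.epsilon))
        return updated
```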
AndrewGrant
Posts: 1960
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: parameter tuning in chess4j and Prophet

Post by AndrewGrant »

jswaff wrote: Sat Jul 02, 2022 6:37 pm I've written a blog article about my experience with auto-tuning in chess4j and Prophet.

http://jamesswafford.com/2022/07/02/aut ... n-chess4j/
Thanks for the post, James, and thanks for the shoutout. Now if only this could be done for search parameters instead of just eval features, then we would really be cooking :)

I noticed this portion of your blog post (emphasis mine):
For each training session, the test data was shuffled and then split into two subsets. The first subset contained 80% of the records and was used for the actual training. The other 20% were used for validation (technically the “test set”). Ideally, the training error should decrease after every iteration, but with gradient descent (particularly stochastic gradient descent) that may not be the case. The important point is that the error trends downward (this is tough if you’re a perfectionist).
There is a problem here in my view. For any particular game G, 20% of the positions from G will be in the validation set, and 80% in the training set. These are not completely independent from one another. A sounder option would be to take 20% of the games and turn them into validation data, and turn the other 80% into training data, so that the subsets are as independent as they can be.
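
A small sketch of the game-level split described above, assuming each record carries an identifier for the game it came from. The `(game_id, position, result)` record layout and the function name are assumptions for illustration.

```python
import random


def split_by_game(records, validation_fraction=0.2, seed=0):
    """Split records so that no game contributes to both subsets.

    records is a list of (game_id, position, result) tuples.
    Returns (training_records, validation_records).
    """
    game_ids = sorted({game_id for game_id, _, _ in records})
    rng = random.Random(seed)
    rng.shuffle(game_ids)
    n_validation = max(1, round(len(game_ids) * validation_fraction))
    validation_games = set(game_ids[:n_validation])
    training = [r for r in records if r[0] not in validation_games]
    validation = [r for r in records if r[0] in validation_games]
    return training, validation
```

Because whole games are moved, the record counts will only be approximately 80/20 when games differ in length.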

But my experience with NNUE (and HCE years ago) is that validation loss and training loss don't seem to be great metrics for anything. I'll do runs and get much lower loss, but not have it equate to elo. I'll do runs with equal loss, and have it equate to elo.

---

An aside -- my trainer always used Adagrad as well. Mostly, because my first attempt at doing ADAM was bugged (!), and I only found out some time later when writing the NNUE trainer. I had put ADAM into the HCE trainer and done additional runs, gaining no elo.

HOWEVER, it is my belief that if you are going to add new terms to the trainer, and then train only those terms, that ADAM is a better solution. My reason stems from my NNUE experience, which is that Adagrad simply cannot get the job done from a random init. As a result, if you add a new term and set the defaults to 0, I would expect ADAM to do a better job than Adagrad.

Speculation, however.
algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: parameter tuning in chess4j and Prophet

Post by algerbrex »

AndrewGrant wrote: Sun Jul 03, 2022 11:46 am Thanks for the post, James, and thanks for the shoutout. Now if only this could be done for search parameters instead of just eval features, then we would really be cooking :)
I've been thinking about ways to tune the search parameters a good bit, and after reading Thomas Petzke's blog from a couple of years ago, I've been tempted to try some sort of genetic tuning of the search parameters.

Now, of course, the difficulty is selecting a good fitness function. It would be nice if there were a quick fitness function that could be used, as there is for Texel tuning, but I suspect I'll only see good results using a fitness function based on playing many games. Nevertheless, I have come across several papers (like this one: https://arxiv.org/pdf/1711.08337.pdf) that used a fitness function based on finding the highest number of correct moves, and using the fewest nodes to find them, for certain positions from grandmaster games.

I'm skeptical that such a fitness function would work well for me in practice, but since it's relatively quick, I'll probably start with it first and see what happens. I'd be surprised if I was able to start from random values and even equalize with the search parameters in the current code.
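
A rough sketch of what such a move-matching fitness function might look like. It assumes a hypothetical `search(position)` callable that returns the chosen move and the node count for one parameter set. Ranking candidates first by positions solved and then by nodes spent on them is an assumption made here for illustration; it is not necessarily how the linked paper combines the two.

```python
def fitness(search, test_positions):
    """Score one set of search parameters on a suite of test positions.

    search(position) is assumed to return (move, nodes_searched).
    test_positions is a list of (position, expected_move) pairs.
    The returned tuple compares lexicographically: more solved positions
    wins, and fewer nodes spent on the solved ones breaks ties.
    """
    solved = 0
    nodes_on_solved = 0
    for position, expected_move in test_positions:
        move, nodes = search(position)
        if move == expected_move:
            solved += 1
            nodes_on_solved += nodes
    return (solved, -nodes_on_solved)
```

A genetic tuner could then keep the candidates with the largest tuples in each generation.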
jswaff
Posts: 107
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Re: parameter tuning in chess4j and Prophet

Post by jswaff »

AndrewGrant wrote: Sun Jul 03, 2022 11:46 am There is a problem here in my view. For any particular game G, 20% of the positions from G will be in the validation set, and 80% in the training set. These are not completely independent from one another. A sounder option would be to take 20% of the games and turn them into validation data, and turn the other 80% into training data, so that the subsets are as independent as they can be.

But my experience with NNUE (and HCE years ago) is that validation loss and training loss don't seem to be great metrics for anything. I'll do runs and get much lower loss, but not have it equate to elo. I'll do runs with equal loss, and have it equate to elo.

---

An aside -- my trainer always used Adagrad as well. Mostly, because my first attempt at doing ADAM was bugged (!), and I only found out some time later when writing the NNUE trainer. I had put ADAM into the HCE trainer and done additional runs, gaining no elo.

HOWEVER, it is my belief that if you are going to add new terms to the trainer, and then train only those terms, that ADAM is a better solution. My reason stems from my NNUE experience, which is that Adagrad simply cannot get the job done from a random init. As a result, if you add a new term and set the defaults to 0, I would expect ADAM to do a better job than Adagrad.

Speculation, however.
Good point about the data in each set being independent, but in practice I'm not sure it's mattered. The datasets you've provided are large enough that I don't think overfitting is really an issue anyway, at least not for my program.

I hadn't considered training only a new term or set of terms without rebalancing all terms. That seems dangerous to me actually.

Thanks for the pointers regarding ADAM and Adagrad. Something else to put on the list to investigate.