Pedantic Developer's Log Stardate...

Discussion of chess software programming and technical issues.

Moderators: hgm, chrisw, Rebel

User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Pedantic Developer's Log Stardate...

Post by lithander »

JoAnnP38 wrote: Thu Jan 26, 2023 3:50 pm At this point I have looked at extensions both ways. I really need to get serious about setting up a test suite I can use to evaluate the benefit of these changes. Many of the changes may only benefit a minority of positions but without a good test I only have my hunches to lead me (which is probably a very bad idea.)
My standard test is to run a depth-15 search on the WAC positions. For the longest time it held true that solving more of the positions, or solving the same number in less time, usually meant an increase in Elo.

But that's not a natural law. If you only commit changes to an actual engine-vs-engine test when they do well in the test suite, you may discard a promising feature too early. There's an improvement I'm currently working on that seems to give +40 Elo, but at the same time it reduces the number of solved positions from 295 to 284... :shock:

I found it only because I exposed the values that control how I prune as UCI parameters. Then I tried different sets of values with the same build, so I never ran the test suite on them. I wanted to push the boundaries and realized that some crazy values did much better than expected. When I ran the test suite on them later, the results were so horrible that I would never have submitted these values to an actual self-play match.
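
A minimal sketch of what I mean by exposing such a value (the option name FutilityMargin is just a made-up example here, not one of my actual parameters):

Code: Select all

using System;

static class SearchOptions
{
    // Default used by the search; overridden at runtime via "setoption".
    public static int FutilityMargin = 80;

    public static void PrintOptions()
    {
        Console.WriteLine($"option name FutilityMargin type spin default {FutilityMargin} min 0 max 500");
    }

    // Handles a line like: "setoption name FutilityMargin value 120"
    public static void HandleSetOption(string line)
    {
        string[] tokens = line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        int nameIdx = Array.IndexOf(tokens, "name");
        int valueIdx = Array.IndexOf(tokens, "value");
        if (nameIdx < 0 || valueIdx < 0 || valueIdx + 1 >= tokens.Length)
            return;

        if (tokens[nameIdx + 1] == "FutilityMargin" && int.TryParse(tokens[valueIdx + 1], out int value))
            FutilityMargin = value;
    }
}

With the option advertised like that you can sweep the value from a script or tournament manager without rebuilding the engine.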
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Pedantic Developer's Log Stardate...

Post by JoAnnP38 »

It's time for an update on Pedantic's progress. I finally have everything in place to kick off the evolution of my genetic algorithm, whose purpose is to create a better-optimized set of weights for my evaluation function. I'll have to admit the endeavor wasn't quite as successful as I had hoped, for various reasons. Here are my findings:
  1. A single elimination tournament does not serve as an acceptable fitness function.
  2. Very tight ranges have to be maintained on random weights to result in reasonable time-to-convergence.
  3. An active or stable population of 64 "solutions" may not provide enough genetic diversity.
  4. Solutions resulting from very tight time controls may not be the best solutions with looser or more generous times.
Even with very tight time controls (i.e. 350ms per move + 50ms), running a single elimination tournament of 64 participants takes too much time to complete. I started with 2 games per encounter and that took an hour and a half (or so). However, I quickly found out that such a tournament is not a very good indication of which "solutions" are best. Too often the winner of the tournament would become a first-round loser in the next generation and be kicked out of the gene pool. Even increasing the number of games per encounter to 4 did not improve the results appreciably (although it had the "wonderful" side-effect of doubling my average tournament times!) The best thing I can say about the winner of a single elimination tournament with 2-4 games per encounter is that it is probably not the worst contestant. Given enough time and a more relaxed convergence criterion even this might lead to an eventual solution, but I'm not optimistic.

From the very beginning I have always enforced ranges on my random weights. After all, I wanted my evaluation function to return a value that fits into a 16-bit integer. However, even with my initial ranges too many solutions and immigrants (my term for the randomly generated "solutions" needed to keep the population stable) were essentially DOA. (Games are lost pretty quickly when your pawns are worth more than your queen!) I tried a couple of approaches to address this problem. First, I collected statistics on the current top 128 solutions and generated normally distributed random weights based on those statistics. But if your population has started to converge on an early suboptimal solution, it can take quite some time for the new immigrants to provide any genetic diversity. Second, I really tightened down the ranges of values, from -20% of the minimum values I've seen in other programs to +20% of the maximum values I've seen in other programs. I eventually found that this second approach was superior and went forward using it.
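
In rough C# terms, the two immigrant-generation strategies look something like this (a simplified sketch, not the exact code; the reference ranges and statistics are supplied per weight):

Code: Select all

using System;

class ImmigrantFactory
{
    private readonly Random rng = new Random();

    // Strategy 1: sample from a normal distribution fitted to the current
    // top solutions (mean/stddev collected per weight), clamped to the legal range.
    public short SampleFromPopulation(double mean, double stdDev, short lo, short hi)
    {
        // Box-Muller transform for a standard normal deviate.
        double u1 = 1.0 - rng.NextDouble();
        double u2 = rng.NextDouble();
        double z = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
        return (short)Math.Clamp(mean + z * stdDev, lo, hi);
    }

    // Strategy 2: uniform sample in a band 20% below the reference minimum
    // and 20% above the reference maximum observed in other programs.
    public short SampleFromReferenceRange(short refMin, short refMax)
    {
        double lo = refMin - 0.2 * Math.Abs(refMin);
        double hi = refMax + 0.2 * Math.Abs(refMax);
        return (short)(lo + rng.NextDouble() * (hi - lo));
    }
}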

After my first reasonable convergence, I proceeded to play that solution against my standard weights. Even though my standard weights did not win a single tournament during the evolution process, they ended up beating the winner in a 20-game match at 5 seconds per move! Ugh :( :x :oops: :cry: :evil:

At this point I am ready to conclude that using a GA for optimizing weights failed low on my potential solution tree. Perhaps I just need better move ordering? Next up -- gradient descent.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Pedantic Developer's Log Stardate...

Post by dangi12012 »

JoAnnP38 wrote: Wed Feb 22, 2023 7:05 pm
  1. Solutions resulting from very tight time controls may not be the best solutions with looser or more generous times.
At this point I am ready to conclude that using a GA for optimizing weights failed low on my potential solution tree. Perhaps I just need better move ordering? Next up -- gradient descent.
I read that multiple times now.
Any theories on why short time controls don't train so well? A neural network of a fixed size has the same computational cost no matter what.
In chess the better/deeper move always wins, so IMO it should tilt towards a far-ahead-looking network even at short time controls.
Do you see what I mean?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Pedantic Developer's Log Stardate...

Post by JoAnnP38 »

dangi12012 wrote: Wed Feb 22, 2023 10:24 pm
JoAnnP38 wrote: Wed Feb 22, 2023 7:05 pm
  1. Solutions resulting from very tight time controls may not be the best solutions with looser or more generous times.
I read that multiple times now.
Any theories on why short time controls don't train so well? A neural network of a fixed size has the same computational cost no matter what.
In chess the better/deeper move always wins, so IMO it should tilt towards a far-ahead-looking network even at short time controls.
Do you see what I mean?
I'm not sure, but it seems to me that it would be very difficult to infer anything from my results (i.e. using a genetic algorithm to optimize weights for an HCE) that might inform someone on the training of a neural network. Perhaps some aspects of evaluation have a more long-term impact on the outcome of the game? If so, then maybe their value can only be truly appreciated if given the opportunity to search to greater depths. That's a shot in the dark, but that's the best I have.
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Pedantic Developer's Log Stardate...

Post by JoAnnP38 »

Okay, I lied. Gradient descent was not up next. Instead, I decided to rework my genetic algorithm. Rather than using a single elimination tournament to decide "fitness", I now use the sum of the squared errors. Each generation tries to find a solution that minimizes this. It can generally find a better solution than my standard weights within the first 10 generations. I am waiting on the current run to complete and then I'll confirm that it is better with tournament play. Here is some output from the current process. Generation 0 represents the fitness of my standard weights. As you can see, at generation 9 it has already found a "better" solution. It seems promising so far.

Code: Select all

Population size: 100, Sample size: 50000, Max generations: 150
K = 0.84
Generation   0: 0.1160333
Generation   1: 0.1161791 from (pop: 100)
Generation   2: 0.1161710 from (pop: 100)
Generation   3: 0.1161167 from (pop: 100)
Generation   4: 0.1160827 from (pop: 100)
Generation   5: 0.1160726 from (pop: 100)
Generation   6: 0.1160560 from (pop: 100)
Generation   7: 0.1160485 from (pop: 100)
Generation   8: 0.1160434 from (pop: 100)
Generation   9: 0.1160166 from (pop: 100)
Generation  10: 0.1159900 from (pop: 100)
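
For context, the fitness value printed above is roughly the following computation: a sigmoid of the static eval (using the fitted K) compared against the game result, averaged over the sampled positions. A simplified sketch, with the position type and evaluation callback left generic since the real code has more plumbing:

Code: Select all

using System;
using System.Collections.Generic;

static class Fitness
{
    // Scaling constant fitted beforehand (0.84 in the run above).
    const double K = 0.84;

    // Maps a centipawn score to an expected game result in [0, 1].
    static double Sigmoid(int eval) => 1.0 / (1.0 + Math.Pow(10.0, -K * eval / 400.0));

    // samples: (position, result) pairs with result in {1.0, 0.5, 0.0} from White's point of view.
    public static double MeanSquaredError<TPosition>(
        IReadOnlyList<(TPosition Pos, double Result)> samples,
        Func<TPosition, short[], int> evaluate,
        short[] weights)
    {
        double sum = 0.0;
        foreach (var (pos, result) in samples)
        {
            double error = result - Sigmoid(evaluate(pos, weights));
            sum += error * error;
        }
        return sum / samples.Count;   // lower is fitter
    }
}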
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Pedantic Developer's Log Stardate...

Post by JoAnnP38 »

For the last few weeks I have devoted myself strictly to optimizing my evaluation function using the Texel method. At first, I was almost ready to chuck the idea because after getting my first set of weights from minimizing the errors I not only didn't get any improvement, I lost Elo!!! I did eventually find a bug in my code that led to this result, but it still left me with a sense of doubt about whether this was a worthwhile endeavor or not. So, I decided to conduct an experiment to prove to myself that this method works.

Taking 500K positions from games played by computers with an Elo of 2800 or greater, I started up my iterative process that minimizes the square of the errors to see whether it affects the Elo or not. This process is described in the Chess Programming Wiki for the Texel tuning method. The only design change I made is to go through the weights in random order on each iteration to try and spread out the changes. While iteratively searching for better and better solutions I saved intermediate sets of weights along the way. Initially I saved weights after 5, 10, 15, 20, 25, 30, 50, 75, 100 and 150 iterations, and my routine converged after 157 iterations. I then set up all those weights in a round-robin tournament along with the iteration-0 progenitor (paragon) to see how they all relate to one another in terms of Elo. I was trying to show myself that as the error was minimized, at least up to the point of overfitting, the Elo would also go up. Here are the results of that first experiment:

Code: Select all

Rank Name                          Elo     +/-   Games   Score    Draw
   1 pass75                          9      11    1100   51.4%   72.2%
   2 pass158                         4      10    1100   50.6%   75.5%
   3 pass100                         3      10    1100   50.5%   75.6%
   4 pass50                          3      10    1100   50.4%   75.5%
   5 pass15                          3      10    1100   50.4%   75.5%
   6 pass30                          2      10    1100   50.3%   76.5%
   7 pass5                          -1      10    1100   49.9%   74.9%
   8 pass40                         -2      10    1100   49.8%   77.7%
   9 pass10                         -3      10    1100   49.5%   76.3%
  10 pass25                         -5      10    1100   49.3%   76.6%
  11 paragon                        -6      11    1100   49.2%   73.6%
  12 pass20                         -8      10    1100   48.8%   75.8%
These results weren't quite what I was hoping for, but they did show that in a very general sense the Elo goes up with more passes (i.e. less error). The weights in the upper half were created from an aggregate of 428 passes, while the lower half accounts for only 43 passes, with paragon being pass 0. However, as you can see there isn't an ordering that pops out at you that would allow someone to say a 40-pass solution is better than a 15-pass solution. Still, it would appear I could pick up 15 Elo by choosing pass 75. I repeated this experiment with a different data set. This time I sampled 2 million records from human grandmaster games (i.e. > 2500 Elo) and preserved the weights at passes 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140 and 150; my routine actually converged on the 157th pass. I then set up another round-robin tournament, and here are the results:

Code: Select all

Rank Name                          Elo     +/-   Games    Wins  Losses   Draws   Points   Score    Draw
   1 pass110                        12       9    1024     116      82     826    529.0   51.7%   80.7%
   2 pass50                          7      10    1024     117      97     810    522.0   51.0%   79.1%
   3 pass40                          6      10    1024     112      94     818    521.0   50.9%   79.9%
   4 pass150                         5      10    1024     121     105     798    520.0   50.8%   77.9%
   5 pass60                          5       9    1024     105      89     830    520.0   50.8%   81.1%
   6 pass157                         5      10    1024     116     101     807    519.5   50.7%   78.8%
   7 pass140                         5      10    1024     123     108     793    519.5   50.7%   77.4%
   8 pass70                          3       9    1024     104      94     826    517.0   50.5%   80.7%
   9 pass100                         1      10    1024     109     105     810    514.0   50.2%   79.1%
  10 pass20                          0      10    1024     110     109     805    512.5   50.0%   78.6%
  11 pass130                        -0      10    1024     103     104     817    511.5   50.0%   79.8%
  12 pass90                         -5      10    1024     106     120     798    505.0   49.3%   77.9%
  13 pass80                         -5      10    1024      97     113     814    504.0   49.2%   79.5%
  14 pass30                         -6      10    1024     103     121     800    503.0   49.1%   78.1%
  15 pass120                        -7       9    1024      90     110     824    502.0   49.0%   80.5%
  16 pass10                        -13      11    1024     111     148     765    493.5   48.2%   74.7%
  17 paragon                       -15      10    1024     101     144     779    490.5   47.9%   76.1%
These results are very similar to those from my first experiment. So, while the Texel method does work in general, you may need to test all of the solutions it creates to determine which are better. Of course, my testing method was limited to 100 games per round in the first experiment and 64 games per round in the second, so it's very possible that my results might change significantly if the number of games were increased, but I doubt they would ever line up such that higher passes (i.e. less error) were always better than lower passes. In the meantime it looks like I may be able to pick up 27 Elo!
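
For reference, the optimization loop itself is essentially the local search from the wiki article, with the one twist mentioned above: the weights are visited in a random order on every pass. A simplified sketch (the error callback stands in for the mean-squared-error computation over the training positions; the real code has more bookkeeping):

Code: Select all

using System;

static class Tuner
{
    public static short[] Tune(short[] weights, Func<short[], double> computeError, int maxPasses = 200)
    {
        var rng = new Random();
        double bestError = computeError(weights);
        int[] order = new int[weights.Length];
        for (int i = 0; i < order.Length; i++) order[i] = i;

        for (int pass = 1; pass <= maxPasses; pass++)
        {
            // Shuffle the visit order so changes get spread across the whole weight vector.
            for (int i = order.Length - 1; i > 0; i--)
            {
                int j = rng.Next(i + 1);
                (order[i], order[j]) = (order[j], order[i]);
            }

            bool improved = false;
            foreach (int w in order)
            {
                foreach (short delta in new short[] { 1, -1 })
                {
                    weights[w] += delta;
                    double error = computeError(weights);
                    if (error < bestError)
                    {
                        bestError = error;    // keep the change
                        improved = true;
                        break;
                    }
                    weights[w] -= delta;      // revert
                }
            }

            Console.WriteLine($"Pass {pass}: {bestError:F7}");
            if (!improved) break;             // converged: no weight moved this pass
        }
        return weights;
    }
}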
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Pedantic Developer's Log Stardate...

Post by lithander »

I like your scientific approach to the problem! But it feels like I'm missing something crucial here. The results, especially in the first table, seem to vary only within the error margin of +/- 10 Elo. To me it seems like the tuning passes aren't doing a lot for you. So... does that mean your 'paragon' state is already optimized somehow? What are you tuning from?

In my own experience, if you reset the weights to something random or even to vanilla material values (e.g. all queen squares are 900cp) the engine loses >1000 Elo, and then you can do a few tuning passes with really crappy data (in my case derived from self-play of the crippled version) and you'll gain several hundred Elo back. So to me your results only make sense if your weights were already pretty optimized to begin with. To verify that your tuning approach works I would use intentionally bad starting weights and check whether your evaluation recovers as expected.
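
To be concrete, "vanilla material values" just means something like this (illustrative layout, not any particular engine's code; one 64-entry table per piece type, every square set to the bare piece value):

Code: Select all

static class VanillaWeights
{
    // Piece order: pawn, knight, bishop, rook, queen, king.
    static readonly short[] PieceValues = { 100, 300, 300, 500, 900, 0 };

    public static short[] Create()
    {
        short[] weights = new short[6 * 64];    // one 64-square table per piece type
        for (int piece = 0; piece < 6; piece++)
            for (int square = 0; square < 64; square++)
                weights[piece * 64 + square] = PieceValues[piece];
        return weights;
    }
}

If the tuner can climb back from a start like that to something close to your paragon strength, you know the pipeline works.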
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Pedantic Developer's Log Stardate...

Post by JoAnnP38 »

lithander wrote: Mon Mar 06, 2023 12:28 am I like your scientific approach to the problem! But it feels like I'm missing something crucial here. The results, especially in the first table, seem to vary only within the error margin of +/- 10 Elo. To me it seems like the tuning passes aren't doing a lot for you. So... does that mean your 'paragon' state is already optimized somehow? What are you tuning from?
You are quite right to ask that. My paragon weights were initially derived from other engines. Most importantly, my piece-square tables and piece weights are the PeSTO tables and values as described on the Chess Programming Wiki. However, the rest of my weights were mostly estimated (guessed) by myself. I see an important use of the tuning as recalibrating all of these values, so that if there are any orthogonal or partially orthogonal evaluation parameters they will be rebalanced during the tuning. I have the ability to create a "purely" random set of weights, but I have not used those as a pass 0. So, the short answer is -- yes, the weights had already been optimized in a prior pass. However, I have added some new evaluation parameters and needed to make sure they are optimized. Perhaps when I get another moment, I can rerun the test starting with random weights.
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Pedantic Developer's Log Stardate...

Post by JoAnnP38 »

My development environment now consists of two mini-PCs (i.e. the Beelink GTR6), each with an AMD Ryzen 9 6900HX processor. The one I'm using for development is outfitted with 64GB of RAM and 4TB of SSD storage, while the other has the default configuration of 32GB of RAM and 500GB of SSD storage; I'm using that second one as my test/tournament PC. Both of these fit side-by-side underneath my monitor stand, and each fits comfortably in the palm of my hand. No more bulky desktops for me! I can only run about 12 simultaneous games, but that's 3 times what I could on my laptop.

Image

Sorry for the SPAM. I'm just happy to have a reasonable development environment up and working.

(I tried scaling the bitmap by including a width x height parameter in the BBCode, but it didn't work for me.)
User avatar
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Pedantic Developer's Log Stardate...

Post by lithander »

When I'm trying to concentrate I find the noise that small notebook fans make very irritating. That's the main reason I still use a "bulky" desktop with proper cooling. I suspect your mini-PCs get pretty loud under full load, given the power they pack into such a small form factor?
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess