Devlog of Leorik

Discussion of chess software programming and technical issues.

Moderator: Ras

algerbrex
Posts: 608
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Devlog of Leorik

Post by algerbrex »

clayt wrote: Thu Jun 09, 2022 8:12 pm Doing one epoch in under a second should be well within reason - I have a gradient descent tuner with 777 data points per board, and it does a full epoch of ~30M positions in about half a second. I'm using roughly the same approach as Leorik for my personal engine (using sigmoids to make error in strongly winning and losing positions not matter so much). Maybe there's a mistake in your computation of the gradient which makes it more convoluted than it has to be?

I strongly recommend using a sparse representation of each datapoint, as usually only 30 or so points in any one position will be nonzero.
Hmm, maybe so. I sat down and did the derivations myself, so maybe I misunderstood the math. The issue is that each partial derivative in the gradient contains a summation over the entire batch/dataset, so you end up going through all the positions to compute each partial before you can use the gradient to update the weights. Or no?
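If I understand the suggestion right, the trick would be to accumulate every partial in a single pass over the sparsely-stored positions, something like this sketch (Python, all names invented, just to check my understanding — not anyone's actual tuner):

```python
import math

def sigmoid(x, k=0.5):
    return 1.0 / (1.0 + math.exp(-k * x))

def gradient(positions, weights, k=0.5):
    """One pass over the dataset accumulates every partial derivative at once.

    Each position is (result, features), where features is a sparse list of
    (index, value) pairs -- typically only ~30 nonzero entries per position.
    Cost is O(N * nnz) instead of O(N * M)."""
    grad = [0.0] * len(weights)
    n = len(positions)
    for result, features in positions:
        score = sum(weights[i] * v for i, v in features)
        s = sigmoid(score, k)
        # common factor of d/dw_j of (result - s)^2, shared by all partials
        common = -2.0 * (result - s) * s * (1.0 - s) * k / n
        for i, v in features:
            grad[i] += common * v
    return grad
```

So each partial does contain a sum over all positions, but every position's contribution to every affected partial can be added as you visit it, which is what makes the single pass cheap.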

Also, if this discussion gets too long I'll make another thread, as this is Thomas's thread for Leorik development.

Re: Devlog of Leorik

Post by algerbrex »

clayt wrote: Thu Jun 09, 2022 8:12 pm ...
What I mean when I say computing the gradient is slow is that I'm taking the formula I derived and applying it to every weight to compute a gradient vector. And each time I apply the formula to a weight, I'm iterating over the whole dataset of positions with length N. So the complexity ends up being O(M × N), where M is the number of weights I have (in this case 778). Not great.

But since I anticipate this discussion will be going on for a bit, I'll ask a dedicated question so we don't pollute Thomas's dev log: forum3/viewtopic.php?f=7&t=80069
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

dangi12012 wrote: Thu Jun 09, 2022 12:12 am Hey, it took some time to create a pointer version. It was an interesting exercise, since there are many ways to allocate memory other than arrays in C#
(stackalloc, Marshal.AllocHGlobal, MemoryBuffer, Memory, Span, Buffer, NativeMemory, CompilerServices.Unsafe, as well as P/Invoke), all with different pros and cons.

Some classes live in arrays and cannot be structs - so not all code can become pointer arithmetic.

This is the speed delta when making everything unsafe (which is not recommended):
Pointer Version: https://1drv.ms/u/s!AoDIyNlDG2QTg8tIm6R ... A?e=mPWK2p
Safe Version: https://1drv.ms/u/s!AoDIyNlDG2QTg8tL281 ... A?e=JiEge7

I tip my hat to Thomas Jahn.
Sorry for the late reply, I was away from the desktop for a few days. Did you make any changes to the Safe Version compared to what I have committed on GitHub? It seems to perform roughly the same on my computer:

Code: Select all

Master:
Searched 300 positions for 100ms each.
176820K nodes visited. Took 30,008 seconds!
5892K NPS.
Best move found in 265 / 300 positions!

Dangi's Safe Version:
Searched 300 positions for 100ms each.
178222K nodes visited. Took 30,007 seconds!
5939K NPS.
Best move found in 265 / 300 positions!

Dangi's Pointer Version:
Searched 300 positions for 100ms each.
198911K nodes visited. Took 30,007 seconds!
6628K NPS.
Best move found in 271 / 300 positions!
6 more solved positions from only ~12% more NPS came as a surprise! I first assumed you might have made other changes, but if I increase the time budget to 112ms I also get 271 solved positions on 'master'. Strange!

And I was a bit sad that you didn't include source code. I was really curious to see your changes! Code reviews are always welcome, too, of course.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess

Re: Devlog of Leorik - *New* Version 2.1

Post by algerbrex »

lithander wrote: Tue May 31, 2022 2:06 am I wish there was an easier way to assess how much the current implementation exhausts the theoretical potential. So (@all) what's your experience with pawn structure eval terms? How much Elo did you gain from adding it in your engines?
Now that I've started experimenting with pawn structure evaluation, I remembered that you asked this question.

I can confirm that I've also been struggling to get any good results from adding pawn structure evaluation terms to Blunder and running them through the tuner.

A full-fledged PSQT just for passed pawns seemed overly fine-grained for a first approach, so instead I opted for two 8-element arrays, one for the middlegame and one for the endgame, where each entry represents the bonus a pawn gets for being passed and on a certain rank. The closer the rank to the promotion rank, the bigger the bonus.
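Roughly like this, in sketch form (Python with made-up numbers, just to illustrate the shape of the idea — not my actual values):

```python
# Made-up bonus tables, indexed by rank from the pawn's own side (0..7).
# A passed pawn close to promotion is worth more, especially in the endgame.
PASSED_MG = [0, 5, 10, 15, 25, 40, 60, 0]
PASSED_EG = [0, 10, 20, 35, 55, 80, 110, 0]

def passed_pawn_bonus(rank, phase):
    """phase: 1.0 = pure middlegame, 0.0 = pure endgame (tapered eval)."""
    return PASSED_MG[rank] * phase + PASSED_EG[rank] * (1.0 - phase)
```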

I tried a good number of configurations, tuning the learning rate, trying different batch sizes, numbers of iterations, etc. But at the end of the day, the engine with the new, tuned evaluation including the passed-pawn terms usually lost by ~15-20 Elo.

I have some other configurations I'm going to play with to get things working, including holding the passed-pawn evaluation terms constant and letting the tuner only touch the material and PSQT values. I suspect overfitting may also have been an issue, as well as some redundancy between terms.

But to answer your original question: I remember adding doubled and isolated pawn evaluation terms was good for about 20 Elo, and that was probably without tuning them very well either. And I estimate the passed-pawn evaluation was good for another 20 Elo, so overall 40-ish Elo; again without tuning very well. If I had the patience to tweak an evaluation term slightly and re-run another 2000 games to confirm its effectiveness, many, many times over, I probably could've gotten much better results. But I'm afraid I don't have that patience :P

So your numbers sound pretty good to me, at least compared to what I got in Blunder 7.6.0.

Re: Devlog of Leorik - *New* Version 2.1

Post by lithander »

algerbrex wrote: Mon Jun 13, 2022 7:48 pm I have some other configurations I'm going to play with to get things working, including holding the passed pawn evaluation terms constant and letting the tuner only touch the material and PSQT values.
That's also what I did in the end. I only used the additional Pawn-Feature PSQT to get a feel for what a certain type of pawn is worth but then hard-coded either a simple formula (passed pawns) or even just flat modifiers (Isolated, Connected & Protected). In the end there are only 5 coefficients to my entire pawn structure evaluation.

To find good values for those I started with something close to what seemed to be the average of the table. Then I retuned with the formula AND the Pawn-Feature PSQTs both contributing to the tuner's evaluation. The goal was to capture all relevant patterns in the formula and turn the PSQTs essentially into noise with an average value around zero.

This was an annoyingly manual process, but thanks to the fast tuner and using the MSE value as a metric for improvement instead of actually running engine vs engine tests, it went quickly enough. The result is fairly simple and elegant code. But I fear there's a lot of room for improvement... For example, I didn't manage to incorporate "Doubled Pawns" in a way that helped drive the MSE down. But does that mean doubled pawns are not relevant? I doubt it. They probably just have pros and cons depending on other factors, and my simple approach doesn't do them justice.
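In sketch form, such a 5-coefficient pawn-structure evaluation could look something like this (illustrative Python, all values invented — the real coefficients live in Leorik's source):

```python
# Invented illustrative values -- not Leorik's actual coefficients.
ISOLATED_PENALTY = -12
CONNECTED_BONUS = 8
PROTECTED_BONUS = 6
PASSED_BASE = 20
PASSED_PER_RANK = 15

def pawn_structure_score(pawns):
    """pawns: list of dicts with 'rank' (1..6 from own side) and boolean flags.
    Flat modifiers for isolated/connected/protected, a linear formula for
    passed pawns instead of a full passed-pawn PSQT."""
    score = 0
    for p in pawns:
        if p.get('isolated'):
            score += ISOLATED_PENALTY
        if p.get('connected'):
            score += CONNECTED_BONUS
        if p.get('protected'):
            score += PROTECTED_BONUS
        if p.get('passed'):
            score += PASSED_BASE + PASSED_PER_RANK * (p['rank'] - 1)
    return score
```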

Re: Devlog of Leorik - *New* Version 2.1

Post by lithander »

algerbrex wrote: Mon Jun 13, 2022 7:48 pm so overall 40-ish Elo [...] So your numbers sound pretty good to me, at least compared to what I got in Blunder 7.6.0.
Selfplay numbers usually inflate the real gains in my experience. I'm curious what the gains will be in the CCRL lists. Currently it's an amazing +63 Elo on the 40/15 list but only a disappointing +11 Elo on the Blitz list. So... I can't decide whether to feel amazed or disappointed right now! :mrgreen:

Re: Devlog of Leorik - *New* Version 2.1

Post by algerbrex »

lithander wrote: Mon Jun 13, 2022 8:44 pm That's also what I did in the end. I only used the additional Pawn-Feature PSQT to get a feel for what a certain type of pawn is worth but then hard-coded either a simple formula (passed pawns) or even just flat modifiers (Isolated, Connected & Protected). In the end there are only 5 coefficients to my entire pawn structure evaluation.
Right, I think I noticed you mention that, and decided to move this idea up my list, because I think some evaluation values are best hand-tuned; you then let the tuner re-adjust the other values to fit your new terms.

I think king safety was a good example of this for me. I believe I ended up getting a good 60-ish Elo from king safety in Blunder, but that was only after manually tuning a lot of the values, even after trying to use the tuner. And at the same time, when I did use the tuner, the initial values I chose to start from had a big impact on how good a result I got.

Of course YMMV.
lithander wrote: Mon Jun 13, 2022 8:44 pm This was an annoyingly manual process, but thanks to the fast tuner and using the MSE value as a metric for improvement instead of actually running engine vs engine tests, it went quickly enough. The result is fairly simple and elegant code. But I fear there's a lot of room for improvement... For example, I didn't manage to incorporate "Doubled Pawns" in a way that helped drive the MSE down. But does that mean doubled pawns are not relevant? I doubt it. They probably just have pros and cons depending on other factors, and my simple approach doesn't do them justice.
Oh, it's definitely much nicer having a faster tuner. Even the basic, vanilla gradient descent tuner I have implemented right now is orders of magnitude faster than the old one. To get the current best material and PSQT values in Blunder 7.6.0, I probably had to do an 11-12 hour tuning session.

Once I ironed the bugs out of my gradient descent tuner and ran it using the same 725K quiet positions from Zurichess, not only did it finish in 5 minutes, but the values also beat the old ones by 60 Elo. Now of course that's only self-play, but even assuming the worst-case scenario of no gain in gauntlet testing, the fact that I can match my current major eval terms with only 5 minutes of tuning feels insane.
lithander wrote: Mon Jun 13, 2022 8:56 pm Selfplay numbers usually inflate the real gains in my experience. I'm curious what the gains will be in the CCRL lists. Currently it's an amazing +63 Elo on the 40/15 list but only a disappointing +11 Elo on the Blitz list. So... I can't decide whether to feel amazed or disappointed right now! :mrgreen:
Right, very true. Still, for a few simple pawn structure formulas and, what, maybe 20-30 seconds of individual tuning sessions, I'd say 11 Elo isn't too shabby at all :wink:

Re: Devlog of Leorik - *New* Version 2.1

Post by lithander »

algerbrex wrote: Mon Jun 13, 2022 9:25 pm Once I ironed the bugs out of my gradient descent tuner and ran it using the same 725K quiet positions from Zurichess, not only did it finish running in 5 minutes, but the values also beat the old ones by 60 Elo.
Wow, that's amazing! What else are you planning to do before Blunder 8?

Re: Devlog of Leorik - *New* Version 2.1

Post by algerbrex »

lithander wrote: Mon Jun 13, 2022 10:56 pm
algerbrex wrote: Mon Jun 13, 2022 9:25 pm Once I ironed the bugs out of my gradient descent tuner and ran it using the same 725K quiet positions from Zurichess, not only did it finish running in 5 minutes, but the values also beat the old ones by 60 Elo.
Wow, that's amazing! What else are you planning to do before Blunder 8?
Thanks!

The "official" goal I have set is to reach feature parity with Blunder 7.6.0, but be stronger, hopefully by a good 50-60 Elo.

Basically, the approach I took two weeks ago was to strip Blunder down to the bare essentials and rebuild from the ground up, optimizing as much as I could. If I remember correctly, I sped up the move generator a good bit, and basically started from the feature set of Blunder 1, building from there.

I'm making sure to document every feature I add back to the search and evaluation, noting the Elo and exactly what it was. I'm mostly doing this to keep myself honest about not adding in features prematurely. But I also hope it'll be a useful document for new chess programmers, so they can get some idea of what to expect from some less well-documented features.

Right now the dev version of Blunder 8 is quite a bit weaker than Leorik, and weaker still than Blunder 7.4. The dev version and 7.4 actually have basically feature parity, but 7.4's evaluation is much more extensive and tweaked, so it's still stronger. I ran a test earlier today where I replaced 7.4's evaluation with the evaluation of an older version of Blunder, one that roughly had feature parity with Blunder dev. And Blunder dev actually seemed to be a little stronger.

So hopefully once I'm able to add back in the old evaluation features, Blunder 8 will prove to be a good step up from the latest release.

Re: Devlog of Leorik

Post by lithander »

The last update to Leorik's devlog was almost a month ago, when I announced the release of version 2.1. Since then it was tested by the CCRL and is listed at 2583 Elo (+27) on the Blitz and 2606 Elo (+63) on the 40/15 rating list.

Leorik 2.1 has also played in Division 7 of Graham's 94th Amateur Series and ended up in 4th place, half a point behind Blunder 7.6.

Before the next tournament starts I hope to have a version ready that can compete for the top spot. But I also know that the competition doesn't sleep. I'm especially scared of facing Odonata 0.6, but version 8.0 of Blunder will probably also be ready by then. And of course I'd like Leorik to claim the title of the world's strongest C# chess engine, which (to the best of my knowledge) is currently MadChess 3.0. But reading Erik's blog, I know that the real champion is MadChess 3.1 Beta! ;)

So, long story short: it's the friendly competition that provides the motivation to improve Leorik further, and if I were to release version 2.2 now it would contain three changes:
  • I disabled null-move pruning when the side to move has only pawns and the king left on the board.
  • I added a new mobility term to the evaluation. (in addition to material & pawn structure)
  • I have improved the replacement scheme of my transposition table, which should help at the longer time controls used in tournaments or analysis.
Because my changes to the TT are already documented in this thread, the only feature worth discussing in more detail is the mobility evaluation.

I started by adding a new type of feature for my tuner to consider alongside material: each specific piece/move-count combination gets its own dedicated feature in the vector. For example, a queen that can move 23 squares is a different feature than a queen that can move 24 squares.
Each piece has between 0 and X legal moves, and I scanned my set of FENs to find out what X is; it turned out I needed to add exactly 88 boolean features in total. Of course it's not really 'boolean', because you can have two pieces of the same type with the same move count on the board.

Running that through the tuner I could look at the 88 newly tuned weights and see something like this:

Code: Select all

Moves     0    1    2    3   4   5   6   7   8   9   10   11   12   13
======================================================================
Bishop = -27, -20, -12, -8, -2,  2,  5,  6,  8,  9,  12,  5,   12, -8
What that means is that for the final evaluation, each bishop's material value (which depends on square and phase) should be modified by a centipawn value from this table based on how mobile it is. If it can't move at all it loses 27 cp of its worth. With 5 or more moves it gains a small bonus. If I use these modifiers verbatim I already gain some good Elo.

In the next step I did some curve-fitting to get rid of the noise from my too small dataset. I tried to fit the table to a segment on an arc defined by this formula:

Code: Select all

int Arc(int x, int width, int height)
{
    //Looks like an inverted parabola with Arc(0) = 0 Arc(width) = 0 and Arc(width/2) = height
    return height * 4 * (width * x - x * x) / (width * width);
}
And while I really like this formula, in the end it was too complicated. I got pretty good results by fitting a linear function to the tuned data, especially when I truncated it at high move counts.

Code: Select all

static int BishopValue(int moves) => Min(7, moves) * 5;
static int RookValue(int moves) => Min(11, moves) * 5;
static int QueenValue(int moves) => Min(25, moves) * 3;
static int KingValue(int moves) => moves * -5;
What about knights and pawns, you may ask? Whether I included or skipped the knight features didn't change the MSE of the tuned results much. In other words: the mobility of knights doesn't matter. Or so the tuner told me... (it's my oracle^^)

And pawns get a special treatment because their move table looked like this:

Code: Select all

Pawn: -10, 0, -1, 0, 74, 0, 0, 0, 0, 0, 0, 0, 0,
So the number of moves a pawn has doesn't matter much, except that pawns with 0 moves are pretty bad (they are stuck) and pawns with 4 moves deserve a huge bonus. 4 moves? Well, that's a pawn on the 7th rank with an empty target square ahead of it. So no curve fitting required. I just subtract 10 from the eval for each stuck pawn and add 74 for each pawn ready to promote. (Such a pawn is already valuable in the PSQTs, but the square in front of it being empty makes it extra valuable.)
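In sketch form the pawn treatment collapses to two constants (Python for illustration; the constants are the tuned values from the table above):

```python
STUCK_PAWN_PENALTY = -10   # a pawn with zero moves is blocked
PROMO_READY_BONUS = 74     # 4 moves = pawn on the 7th with an empty square ahead

def pawn_mobility(moves):
    """Only the two extremes of the tuned pawn table matter."""
    if moves == 0:
        return STUCK_PAWN_PENALTY
    if moves == 4:
        return PROMO_READY_BONUS
    return 0
```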

With a fast tuner you can just modify a value here and there, retune, and see what the MSE does... It's like manual Texel tuning! Or gambling...? In any case: it's quite fun! ;) So really no sophisticated tech here. I just use the tuner to guide me to coefficients that minimize the MSE after tuning, and this seems to correlate well with actual playing strength.

MSE(material) = 0.2460
MSE(material + pawns) = 0.2417
MSE(material + pawns + mobility) = 0.2364

...I haven't really bothered yet to measure how much better my dev version is compared to 2.1, because I first hope to complete my work on the eval (for now) by adding something about king safety. I don't know much more about it than that many engines have it in their eval. So I will again start by coming up with a plethora of "features" for the tuner and then see which of them have the most significance in predicting the game outcome. Then I'll try to simplify again... it has worked for pawn structure and mobility, so I'm positive it will also work here.