Devlog of Leorik

algerbrex · Post by **algerbrex** » Tue Jul 05, 2022 5:22 pm

lithander wrote: ↑Tue Jul 05, 2022 12:49 pm Here's the source code of one of my attempts to add a king-safety related term. Basically it counts how many squares around the king are threatened by enemy pieces. If the same square is attacked by multiple pieces, the threatcounter increases each time, so the count can be bigger than the number of squares around the king. Then you modify the evaluation based on that counter and a lookup, in the two arrays, of how that should affect the evaluation. One adjusts the base score (midgame), the other governs the modification of that base score as the game transitions into the endgame.
Code: Select all
static short[] KingThreatsBase = new short[20] { -56, -50, -48, -51, -47, -36, -27, 1, 8, 37, 81, 42, 51, 91, 4, 0, 0, 0, 0, 0, };
static short[] KingThreatsEndgame = new short[20] { 45, 39, 23, 37, 34, 15, 14, -23, -29, -66, -94, -34, 22, 15, 0, 0, 0, 0, 0, 0, };

public static void Update(BoardState board, ref EvalTerm eval)
{
    // White
    int count = Features.CountBlackKingThreats(board);
    eval.Base += KingThreatsBase[count];
    eval.Endgame += KingThreatsEndgame[count];
    // Black
    count = Features.CountWhiteKingThreats(board);
    eval.Base -= KingThreatsBase[count];
    eval.Endgame -= KingThreatsEndgame[count];
}
The values in these arrays are tuned by extending the feature-vector so that each position will set a component to one that is exclusively associated with a specific threatcount on a specific king. So each position will set two components - one for the black king and one for the white king.

In the middle game it seems that a threatcount of 7 or more on the opposing king starts to be a sign of a winning position, and a lower count starts to contribute to a losing position. Each slot in the array represents a few percent of positions and can be tuned linearly. The result is a lookup table for a function mapping the threatcount to a value, and this function is not constrained to be linear as far as I can see.

...but on the other hand it doesn't work. So I'm sure I'm missing something important. Looking forward to your paper!
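The one-hot encoding described in the quote can be sketched in a few lines. This is a Python sketch rather than Leorik's C#, and `threat_features` and `evaluate_base` are hypothetical names. It assumes, as the table values suggest, that the count added to the score is the number of threats on the black king.

```python
# Midgame table copied from the quoted post; index = threat count on a king.
KING_THREATS_BASE = [-56, -50, -48, -51, -47, -36, -27, 1, 8, 37,
                     81, 42, 51, 91, 4, 0, 0, 0, 0, 0]


def threat_features(threats_on_black_king, threats_on_white_king, size=20):
    """Build the two feature components for one position (hypothetical helper)."""
    features = [0] * size
    features[threats_on_black_king] += 1  # added to the score
    features[threats_on_white_king] -= 1  # subtracted from the score
    return features


def evaluate_base(features, weights=KING_THREATS_BASE):
    """The term is linear in the weights: a plain dot product."""
    return sum(f * w for f, w in zip(features, weights))


features = threat_features(threats_on_black_king=9, threats_on_white_king=3)
score = evaluate_base(features)  # 37 - (-51) = 88
```

Because every slot has its own weight, the mapping from threat count to score can take any shape even though the tuner only fits linear weights.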

Ah, ok, interesting approach. Here are some thoughts.

In Blunder, I have a couple of conditions for when king safety can be evaluated. The queen has to be on the board, there have to be at least two pieces attacking the king, including the queen, and king safety is only applied to the middlegame score.

The first two restrictions help Blunder avoid getting itself into losing positions over an unsound attack, especially one with no queen. The last helps Blunder avoid hiding its king away as the endgame approaches, since there are fewer threats and the king should start becoming more active.

So trying to add some of the above restrictions to your scheme may help things work a little bit better.
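A minimal sketch of those three restrictions, in Python for brevity. The `KingAttackInfo` type, its fields and `king_safety_terms` are hypothetical and not taken from Blunder's source. It also assumes that "at least two pieces, including the queen" means the queen counts toward the two attackers.

```python
from dataclasses import dataclass


@dataclass
class KingAttackInfo:
    attackers: int        # enemy pieces attacking the king zone (queen included)
    queen_on_board: bool  # does the attacking side still have its queen?
    raw_score: int        # unscaled king-safety score in centipawns


def king_safety_terms(info, min_attackers=2):
    """Return (midgame, endgame) contributions of the king-safety term."""
    if not info.queen_on_board or info.attackers < min_attackers:
        return (0, 0)  # attack judged too weak to be evaluated at all
    return (info.raw_score, 0)  # applied to the midgame score only


sound = king_safety_terms(KingAttackInfo(3, True, 40))
no_queen = king_safety_terms(KingAttackInfo(3, False, 40))
lone_attacker = king_safety_terms(KingAttackInfo(1, True, 40))
```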

Second, I think it might help to score attack squares based on the type of piece that's attacking. So a queen attacking three squares isn't treated the same as a knight attacking three squares. I think this would help teach the engine when it has a good attack (e.g. queen and two minors bearing down around an exposed king), versus a more unsound one (e.g. bishop and rook laser-beaming an exposed king).

Lastly, I remember in the past I tried directly tuning the attack table for king safety, but it never produced very good results, and I was advised by some other authors on the forum that if I wanted to tune an attack table, it'd be better to tune the coefficients of some formula which is used to create a non-linear attack table, instead of exposing the table directly to the tuner. So that might be something to try as well.
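To make the last two suggestions concrete, here is a rough Python sketch. The piece weights, the capped quadratic curve and all names are assumptions chosen for illustration, not values from Blunder or Leorik. The point is only that the tuner would see two coefficients (`scale` and `cap`) instead of every table slot.

```python
# Hypothetical weights: a queen attacking a square counts more than a knight.
ATTACK_WEIGHT = {"N": 2, "B": 2, "R": 3, "Q": 5}


def attack_units(attacked_squares):
    """attacked_squares maps a piece letter to the king-zone squares it attacks."""
    return sum(ATTACK_WEIGHT[piece] * n for piece, n in attacked_squares.items())


def build_attack_table(scale, cap, size=64):
    """Non-linear table generated from a formula: quadratic growth, capped at `cap`."""
    return [min(cap, int(scale * units * units)) for units in range(size)]


table = build_attack_table(scale=1.0, cap=400)
units = attack_units({"Q": 3, "N": 2})          # 5*3 + 2*2 = 19
penalty = table[min(units, len(table) - 1)]     # 19*19 = 361
```

With the same number of attacked squares, a queen now produces more attack units than a knight, and the curve's shape is fixed by the formula rather than by independent per-slot weights.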

And thanks, hopefully this paper will be wrapped up pretty soon.

Mike Sherwin · Post by **Mike Sherwin** » Tue Jul 05, 2022 5:27 pm

algerbrex wrote: ↑Tue Jul 05, 2022 4:44 pm
Mike Sherwin wrote: ↑Tue Jul 05, 2022 4:30 pm There is something very simple that can be tried. For each root move count the checkmates for and against. Do some math on those numbers. Adjust that root moves score.
I remember you mentioning this idea before, and it's something I meant to try in Blunder as well, so thanks for reminding me! I think I remember adding it in before and it seemed promising, but I never got the chance to run a full, rigorous test.

I think a simple ratio with the biggest number on top times a constant would be a good place to start. Of course if the opponent has the biggest number the result is negative.

algerbrex · Post by **algerbrex** » Tue Jul 05, 2022 5:31 pm

Mike Sherwin wrote: ↑Tue Jul 05, 2022 5:27 pm
I think a simple ratio with the biggest number on top times a constant would be a good place to start. Of course if the opponent has the biggest number the result is negative.

Cool, that makes sense to me and I believe it's the approach I tried last time since I saw you mentioned it elsewhere on here. I actually just finished running the last test in my queue, so I think I'll give this idea a go, if I can get it working right.

Mike Sherwin · Post by **Mike Sherwin** » Tue Jul 05, 2022 9:35 pm

algerbrex wrote: ↑Tue Jul 05, 2022 5:31 pm
Cool, that makes sense to me and I believe it's the approach I tried last time since I saw you mentioned it elsewhere on here. I actually just finished running the last test in my queue, so I think I'll give this idea a go, if I can get it working right.

I know that you know this already; however, for clarity, the polarity of odd vs. even ply has to be accounted for in the counting. So I guess the calculations for all root moves would start with allMatesFor[stm] / allMoves[stm]; stm = 1 - stm; allMatesFor[stm] / allMoves[stm]; then divide the bigger by the smaller and negate if the bigger belongs to the opponent. Then find the constant to multiply by that will normalize the best/worst results to some range between zero and some max value for both positive and negative results. That is just my thinking. There are probably a million different ways to do it.
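One possible reading of that recipe for a single root move, sketched in Python. The function name, the constant `k` and the clamp `max_bonus` are placeholders. The handling of the case where the smaller rate is zero is an assumption, since the post does not say what to do there.

```python
def mate_ratio_bonus(mates_for, moves_for, mates_against, moves_against,
                     k=10.0, max_bonus=50.0):
    """Score adjustment for one root move from mate counts seen in its subtree."""
    rate_for = mates_for / moves_for if moves_for else 0.0
    rate_against = mates_against / moves_against if moves_against else 0.0
    if rate_for == rate_against:
        return 0.0  # no side has the bigger number
    bigger = max(rate_for, rate_against)
    smaller = min(rate_for, rate_against)
    # Assumption: an unbounded ratio (smaller == 0) is simply clamped.
    ratio = bigger / smaller if smaller > 0 else float("inf")
    bonus = min(max_bonus, k * ratio)
    # Negative when the opponent has the bigger mate rate.
    return bonus if rate_for > rate_against else -bonus


ours = mate_ratio_bonus(4, 1000, 2, 1000)    # ratio 2 in our favour
theirs = mate_ratio_bonus(2, 1000, 4, 1000)  # same ratio, opponent ahead
```

Note that the ratio is always at least 1, so the bonus jumps from 0 to `k` as soon as the two rates differ. Whether that jump is desirable would need testing.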

lithander · Post by **lithander** » Sat Jul 16, 2022 12:37 pm

I've just released a "minor" new version that adds a mobility term to the evaluation along with improved time-control logic and a new TT replacement scheme better suited to long matches.
https://github.com/lithander/Leorik/releases/tag/2.2

A gauntlet I played at a 40/20 time control (~500ms per move, so quite fast) looks promising:

Code: Select all

   # PLAYER           :  RATING  POINTS  PLAYED   (%)
   1 odonata-0.6.2    :  2737.0   184.5     360    51
   2 Leorik 2.0       :  2711.2  1121.5    2148    52
   3 Ceibo_0.8        :  2702.0   176.5     358    49
   4 dumb-1.9         :  2698.0   177.5     361    49
   5 Inanis-1.0.1     :  2690.0   174.0     360    48
   6 blunder-8.0.0    :  2690.0   161.0     361    45
   7 zevra-2.5        :  2655.0   153.0     348    44

2700 Elo in the CCRL lists would be a great result considering the set of changes I made. (2.1 is listed at 2583 and 2602.) But I'm a bit disappointed about all the things I worked on that *didn't* make it into this release. I have spent a lot of time trying to get king safety to work. I also had some bishop-specific evaluation implemented that gave a bonus for having the bishop pair and otherwise evaluated the placement of pieces with regard to the color of the remaining single bishop. It looked quite promising, but after thorough testing I felt it was not improving playing strength as much as it should have, based on the reduction of MSE I saw while tuning. So I feared it was causing problems in some situations while improving the eval only on average... That is a very dangerous thing to do long term, so I decided to release a simple but solid version 2.2 and shelve the rest.

Well... it wouldn't be fun if things were always easy!

Mike Sherwin wrote: ↑Wed Jun 01, 2022 10:27 pm Big improvement in playing style! Much more human like. Needs pawn storm code. This was a very interesting game!!

I'd be very happy if you'd try the new version. I'm very curious what your verdict will be and if you can still beat it!

algerbrex · Post by **algerbrex** » Sat Jul 16, 2022 1:13 pm

lithander wrote: ↑Sat Jul 16, 2022 12:37 pm I've just released a "minor" new version that adds a mobility term to the evaluation along with improved time-control logic and a new TT replacement scheme better suited to long matches.
https://github.com/lithander/Leorik/releases/tag/2.2

...
2700 Elo in the CCRL lists would be a great result considering the set of changes I made. (2.1 is listed at 2583 and 2602)

Congrats! Leorik's performance does indeed look promising. I'm looking forward to running some tests between it and the dev version of Blunder.

lithander wrote: ↑Sat Jul 16, 2022 12:37 pm But I'm a bit disappointed about all the things I worked on that *didn't* make it into this release. I have spent a lot of time trying to get king safety to work. I also had some bishop-specific evaluation implemented that gave a bonus for having the bishop pair and otherwise evaluated the placement of pieces with regard to the color of the remaining single bishop. It looked quite promising, but after thorough testing I felt it was not improving playing strength as much as it should have, based on the reduction of MSE I saw while tuning. So I feared it was causing problems in some situations while improving the eval only on average... That is a very dangerous thing to do long term, so I decided to release a simple but solid version 2.2 and shelve the rest. Well... it wouldn't be fun if things were always easy!

Yep, I know the feeling.

What I've generally found is that it's better not to focus too much on how much the MSE was reduced, and to focus more on whether it was reduced at all and what values the tuner chose. In my testing, a larger MSE drop often didn't correlate much with more Elo, although sometimes it did.
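For context, the MSE in question is usually computed along these lines in Texel-style tuning. This is a generic Python sketch, not code from either engine. The logistic mapping with a scaling constant `k` is the common convention and is assumed here. The sample data is made up purely to show the call.

```python
def win_probability(eval_cp, k=1.0):
    """Map a centipawn score to an expected game result in [0, 1]."""
    return 1.0 / (1.0 + 10.0 ** (-k * eval_cp / 400.0))


def mean_squared_error(samples, k=1.0):
    """samples: (static eval in centipawns, game result as 1.0 / 0.5 / 0.0)."""
    total = sum((result - win_probability(ev, k)) ** 2 for ev, result in samples)
    return total / len(samples)


# Made-up positions: (eval from White's view, result from White's view)
samples = [(120, 1.0), (-35, 0.5), (-300, 0.0), (10, 0.5)]
mse = mean_squared_error(samples)
```

A lower value only says the eval predicts results better on average over the dataset. It says nothing about which positions improved, which fits the observation that the size of the drop is a weak predictor of Elo.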

It'd be quite impressive though in my mind for an engine to reach 2700-2800 rating without any king safety!

algerbrex · Post by **algerbrex** » Sat Jul 16, 2022 2:04 pm

lithander wrote: ↑Sat Jul 16, 2022 12:37 pm I've just released a "minor" new version that adds a mobility term to the evaluation along with improved time-control logic and a new TT replacement scheme better suited to long matches.
https://github.com/lithander/Leorik/releases/tag/2.2
...

Too early in the match to draw conclusions (10+0.1s) I suppose, but right now Leorik 2.2 is blowing Blunder out of the water:

Code: Select all

Score of Blunder 8.4.4 vs Leorik 2.2: 41 - 104 - 55  [0.343] 200
...      Blunder 8.4.4 playing White: 22 - 47 - 32  [0.376] 101
...      Blunder 8.4.4 playing Black: 19 - 57 - 23  [0.308] 99
...      White vs Black: 79 - 66 - 55  [0.532] 200
Elo difference: -113.3 +/- 42.5, LOS: 0.0 %, DrawRatio: 27.5 %
SPRT: llr -1.08 (-36.6%), lbound -2.94, ubound 2.94

Thomas, what did you put in that thing?

Edit: after 550 games, H0 was accepted:

Code: Select all

Finished game 552 (Leorik 2.2 vs Blunder 8.4.4): * {No result}
Score of Blunder 8.4.4 vs Leorik 2.2: 114 - 288 - 148  [0.342] 550
...      Blunder 8.4.4 playing White: 63 - 137 - 75  [0.365] 275
...      Blunder 8.4.4 playing Black: 51 - 151 - 73  [0.318] 275
...      White vs Black: 214 - 188 - 148  [0.524] 550
Elo difference: -113.8 +/- 25.7, LOS: 0.0 %, DrawRatio: 26.9 %
SPRT: llr -2.95 (-100.1%), lbound -2.94, ubound 2.94 - H0 was accepted
Finished match

Quite odd to me: either Blunder's much weaker than I thought, or Leorik is much stronger than your testing shows.
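As a sanity check on the match output above, the reported Elo differences follow from the score fractions via the standard logistic formula. The Python sketch below also shows that SPRT bounds of ±2.94 are what alpha = beta = 0.05 would give. That setting is an inference from the numbers, since the post does not state it.

```python
import math


def score_fraction(wins, losses, draws):
    """Points scored per game, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + losses + draws)


def elo_difference(score):
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def sprt_bounds(alpha=0.05, beta=0.05):
    """Log-likelihood-ratio thresholds for accepting H0 (lower) or H1 (upper)."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


early = elo_difference(score_fraction(41, 104, 55))     # after 200 games
final = elo_difference(score_fraction(114, 288, 148))   # after 550 games
lower, upper = sprt_bounds()
print(f"{early:.1f} {final:.1f} {lower:.2f} {upper:.2f}")
# prints: -113.3 -113.8 -2.94 2.94
```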

Modern Times · Post by **Modern Times** » Sun Jul 17, 2022 8:24 am

lithander wrote: ↑Sat Jul 16, 2022 12:37 pm 2700 Elo in the CCRL lists would be a great result considering the set of changes I made. (2.1 is listed at 2583 and 2602)

Yes, a 2700 Elo engine but only just, and subject to more games. Could go either way.

Code: Select all

CCRL 40/15 main list:
Leorik 2.2 64-bit is #152-154 with rating of 2702 Elo points (+36 -35),
based on 260 games: 102 wins, 76 losses and 82 draws
Score: 55.0%, Average opponent: −36.6, Draws: 31.5%

Pairwise results:
     Opponent                              Elo     Score                  LOS   Perf
 - Blunder 8.0.0 64-bit                    2718  24.0-28.0  (+13-17=22)   28.9    -6
 - Daydreamer 1.75 64-bit                  2681  24.5-27.5  (+14-17=21)   84.8   -37
 - Supernova 2.4 64-bit                    2668  30.5-21.5  (+23-14=15)   95.1   +25
 - Movei 00.8.438                          2647  29.0-23.0  (+22-16=14)   99.6   -14
 - Pharaon 3.5.1                           2613  35.0-17.0  (+30-12=10)  100.0   +40

lithander · Post by **lithander** » Sun Jul 17, 2022 11:32 pm

algerbrex wrote: ↑Sat Jul 16, 2022 2:04 pm Thomas, what did you put in that thing?
...
Quite odd to me, either Blunder's much weaker than I thought, or Leorik is much stronger than your testing shows.

In my own tests against Blunder 8.0 the difference has been around +30 Elo for Leorik. So this comes a bit as a surprise to me, too. I think part of the explanation is that you are testing at very fast time controls and Leorik is traditionally very strong there. (Which has often led to me being a bit disappointed after a release^^) But there could also be a regression between 8.0.0 and 8.4.4 when pairing Blunder with Leorik and it could be interesting to find out what change caused it. If Leorik can exploit it other engines maybe also can... until recently our engines were both tuned on the same dataset, right? Maybe it's got something to do with that. But whatever it is I'm pretty sure that Leorik is not much stronger than 2700!

algerbrex wrote: ↑Sat Jul 16, 2022 1:13 pm It'd be quite impressive though in my mind for an engine to reach 2700-2800 rating without any king safety!

I think that even without an explicit king safety evaluation the engine can keep the king reasonably safe through other means. E.g. the PSQTs reward castling and an intact pawn shield in the midgame. Or the mobility eval deducts a few CP for an overly mobile king. But most importantly, all short-term king-safety issues should be uncovered by the search. Or at least that's what I'm telling myself, because I'm pretty sure I'm not going to touch that cursed topic again anytime soon!

Modern Times wrote: ↑Sun Jul 17, 2022 8:24 am Yes, a 2700 Elo engine but only just, and subject to more games. Could go either way.

Thanks for testing!! I'm looking forward to the final result but this already looks promising!

algerbrex · Post by **algerbrex** » Mon Jul 18, 2022 12:04 am

lithander wrote: ↑Sun Jul 17, 2022 11:32 pm In my own tests against Blunder 8.0 the difference has been around +30 Elo for Leorik. So this comes a bit as a surprise to me, too. I think part of the explanation is that you are testing at very fast time controls and Leorik is traditionally very strong there. (Which has often led to me being a bit disappointed after a release^^) But there could also be a regression between 8.0.0 and 8.4.4 when pairing Blunder with Leorik and it could be interesting to find out what change caused it. If Leorik can exploit it other engines maybe also can... until recently our engines were both tuned on the same dataset, right? Maybe it's got something to do with that. But whatever it is I'm pretty sure that Leorik is not much stronger than 2700!

True, when I get back to my laptop again I’m going to run another test between Leorik 2.2 and Blunder 8.4.5, the latest dev version, at a time control of 40 moves in 20 seconds to match your gauntlet time control, which should give double the time per move. And after that maybe one at 60+0.6s.

For the test above I used 10+0.1s with 16 MB of hash, which should give around 0.25 seconds per move. What time control did you use for your test against Blunder 8.0.0?
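The per-move figures can be reproduced with a back-of-the-envelope calculation, sketched here in Python. The expected game length of 67 moves is an assumption picked to match the 0.25 s estimate. Real games vary, and engines do not spend their time evenly.

```python
def per_move_increment(base_seconds, increment_seconds, expected_moves):
    """Average time per move for a base+increment control such as 10+0.1s."""
    return base_seconds / expected_moves + increment_seconds


def per_move_repeating(moves, seconds):
    """Average time per move for a repeating control such as 40 moves in 20 seconds."""
    return seconds / moves


fast = per_move_increment(10.0, 0.1, expected_moves=67)  # roughly 0.25 s
gauntlet = per_move_repeating(40, 20.0)                  # 0.5 s
```

Under these assumptions the 40/20 control gives about twice the time per move of 10+0.1s.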

Definitely could be a regression, unfortunately for me. My latest testing shows the dev version should now have 50-60 Elo on 8.0.0, but of course self-play can be very deceiving.

To confirm the dev version isn’t a regression, I’m going to select a variety of engines to run a gauntlet against 8.0.0, including Leorik 2.2, and then 8.4.5.

Even if it’s just Leorik, it’s still quite strange to me just how much weaker it would be, even at faster time controls.

You’re correct, though, that our datasets are different now. Using the extended dataset showed an Elo gain against a gauntlet, so it seems that overall it’s a gain, but perhaps Leorik is exploiting it somehow. I may retune the evaluation and re-test.

A lot to consider; looks like I have plenty to keep me busy.
