Progress on Rustic

Ras · Post by **Ras** » Sun Mar 07, 2021 11:52 pm

mvanthoor wrote: ↑Sun Mar 07, 2021 8:52 pm postponing a capture by checks, in-between moves, and tactical round-about antics, finally making the capture in some sort of weird, unexpected way.

Do you score with distance from root? I.e. do you reward reaching good positions earlier, and bad ones later?

mvanthoor · Post by **mvanthoor** » Mon Mar 08, 2021 12:01 am

Ras wrote: ↑Sun Mar 07, 2021 11:52 pm
mvanthoor wrote: ↑Sun Mar 07, 2021 8:52 pm postponing a capture by checks, in-between moves, and tactical round-about antics, finally making the capture in some sort of weird, unexpected way.
Do you score with distance from root? I.e. do you reward reaching good positions earlier, and bad ones later?

No; I just search and evaluate. So I should subtract ply from the score, such that the same score will be lower when reached later. Same as adding ply to minus checkmate.

(Would it be helpful somehow to use 2 * ply for more distinction?)

I understand the idea, but I don’t understand why an engine would postpone good things...

Ras · Post by **Ras** » Mon Mar 08, 2021 12:18 am

mvanthoor wrote: ↑Mon Mar 08, 2021 12:01 am So I should subtract ply from the score, such that the same score will be lower when reached later. Same as adding ply to minus checkmate.

Yes, similar approach. I subtract the ply from eval if eval is positive, but don't reduce it to less than +1. If eval is negative, I add the ply, but not to more than -1 as result. I think you're right that some minor eval noise such as slightly better positioning at the end of the chain could be the cause. Not sure whether 2*ply would be better. That could be up to some experiments after you have added the basic distance handling.
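For reference, the adjustment described above can be sketched in a few lines of Rust. This is only an illustration: the function name, the `i32` centipawn score and the `ply` parameter are assumptions, not code taken from either engine.

```rust
/// Nudge an evaluation by the distance from the root, so that the same
/// score is worth slightly less when reached later.
/// `eval` is a centipawn score from the side to move's point of view and
/// `ply` is the distance from the root; both names are hypothetical.
fn distance_adjusted(eval: i32, ply: i32) -> i32 {
    if eval > 0 {
        // Good positions lose a little value per ply, but never drop below +1.
        (eval - ply).max(1)
    } else if eval < 0 {
        // Bad positions look a little less bad per ply, but never rise above -1.
        (eval + ply).min(-1)
    } else {
        // An exactly equal score is left alone.
        0
    }
}
```

With this sketch, an eval of +50 at ply 4 becomes +46, while an eval of +3 at ply 10 is clamped to +1 instead of flipping sign.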

mvanthoor · Post by **mvanthoor** » Mon Mar 08, 2021 12:41 am

Thanks. I’ll add this as well. Saw it before, mainly in the end game, but now that there’s a TT this behavior gets really obvious (and annoying).

Do you by any chance have an idea why the TT move ordering does seem to have a positive impact on nodes searched (fewer) and time to depth (shorter), while not gaining any strength over the course of 200 games?

Could it actually be that 200 games isn’t enough...?

Ras · Post by **Ras** » Mon Mar 08, 2021 10:04 am

mvanthoor wrote: ↑Mon Mar 08, 2021 12:41 am while not gaining any strength over the course of 200 games?

Calculating deeper only leads to more strength if that actually translates into finding better moves. Take the extreme case of eval always returning the same value, then deeper calculations will not be useful.

mvanthoor · Post by **mvanthoor** » Tue Mar 09, 2021 5:21 pm

Ras wrote: ↑Mon Mar 08, 2021 12:18 am Yes, similar approach. I subtract the ply from eval if eval is positive, but don't reduce it to less than +1. If eval is negative, I add the ply, but not to more than -1 as result. I think you're right that some minor eval noise such as slightly better positioning at the end of the chain could be the cause. Not sure whether 2*ply would be better. That could be up to some experiments after you have added the basic distance handling.

This didn't work when I tested it. The evaluation isn't differentiated enough; there are too many similar moves. Adding or subtracting the depth has too big an impact.

Progress: I've had the TT addition tested during the night. It seems to gain about 105 Elo, with a margin of about 25.

Gauntlet for Rustic Alpha 1:

Rank Name                   Elo   +/-  Games   Score   Draws
   0 Rustic Alpha 1         -81    29    500   38.5%   16.2%
   1 Deepov 0.4             210   109     50   77.0%   14.0%
   2 Wukong JS 1.4          182   110     50   74.0%    8.0%
   3 Clueless 1.4           164   107     50   72.0%    8.0%
   4 CDrill Build 4          92   100     50   63.0%    6.0%
   5 Pigeon 1.5.1            85    82     50   62.0%   32.0%
   6 TSCP 1.81               78    98     50   61.0%    6.0%
   7 Shallow Blue 2.0        35    89     50   55.0%   18.0%
   8 Mizar 3                 28    93     50   54.0%   12.0%
   9 Celestial 1.0            7    87     50   51.0%   22.0%
  10 FracTal 1.0            -28    78     50   46.0%   36.0%

Same engine pool, but with Alpha 2 (rc5):

Rank Name                   Elo   +/-  Games   Score   Draws
   0 Rustic Alpha 2 rc5      15    27    500   52.1%   20.6%
   1 Clueless 1.4           108    98     50   65.0%   10.0%
   2 Pigeon 1.5.1            42    79     50   56.0%   36.0%
   3 CDrill Build 4          28    90     50   54.0%   16.0%
   4 Wukong JS 1.4           21    87     50   53.0%   22.0%
   5 Celestial 1.0           -7    87     50   49.0%   22.0%
   6 Deepov 0.4             -14    90     50   48.0%   16.0%
   7 FracTal 1.0            -21    74     50   47.0%   42.0%
   8 TSCP 1.81              -63    93     50   41.0%   14.0%
   9 Shallow Blue 2.0       -78    94     50   39.0%   14.0%
  10 Mizar 3               -173   103     50   27.0%   14.0%

All 1000 games processed by BayesElo, with the list being calibrated against the rating of RA1 on CCRL (= 1677):

Rank Name                      Elo    +    - games score oppo. draws
   1 Clueless 1.4             1882   62   59   100   69%  1729    9%
   2 Wukong JS 1.4            1837   58   56   100   64%  1729   15%
   3 Deepov 0.4               1825   58   56   100   63%  1729   15%
   4 CDrill Build 4           1799   58   57   100   59%  1729   11%
   5 Pigeon 1.5.1             1790   53   52   100   59%  1729   34%
   6 Rustic Alpha 2 rc5       1781   25   25   500   52%  1767   21%
   7 TSCP 1.81                1738   57   57   100   51%  1729   10%
   8 Celestial 1.0            1728   55   55   100   50%  1729   22%
   9 FracTal 1.0              1709   51   52   100   47%  1729   39%
  10 Shallow Blue 2.0         1708   56   56   100   47%  1729   16%
  11 Rustic Alpha 1           1677   25   26   500   39%  1767   16%
  12 Mizar 3                  1657   56   58   100   41%  1729   13%

The ratings for most engines seem to be in the CCRL ballpark, +/- 25 points. The two real surprises are Fractal and Mizar.

Fractal 1.0 got a trouncing by Alpha 1 on CCRL; its rating of 1709 in the list above is +59 Elo vs. CCRL, so Fractal performed much better than would be expected against Rustic with TT. It actually _gained_ rating against it as compared to the first gauntlet. Mizar 3.0 is an engine about as strong as Alpha 1, but Alpha 1 plays badly against it. With the TT included, however, Rustic suddenly has no problems with Mizar 3 anymore.

With some luck (and depending on the engines the CCRL-testers choose), with Alpha 2 performing at the top of the Elo margins, the engine might just start scratching 1800. I just need to add the UCI options and make a release over the weekend.

adnoh · Post by **adnoh** » Tue Mar 09, 2021 7:29 pm

Just wanted to thank Marcel and say how much I am enjoying this thread.

lithander · Post by **lithander** » Tue Mar 09, 2021 8:28 pm

I really like that you include so many different engines in the gauntlets! You've got your own little CCRL version here!

But when I run a match between two engines of 1000 games I still get only a result within an error window of +/- 15.5 ELO. How much bigger is your error window with only 50 games per engine? Could it explain the surprising results regarding Fractal?

Example:

Score of MinimalChess 0.3 vs MinimalChess Dev: 10 - 7 - 39  [0.527] 56
...      MinimalChess 0.3 playing White: 6 - 2 - 20  [0.571] 28
...      MinimalChess 0.3 playing Black: 4 - 5 - 19  [0.482] 28
...      White vs Black: 11 - 6 - 39  [0.545] 56
Elo difference: 18.6 +/- 50.3, LOS: 76.7 %, DrawRatio: 69.6 %%

The above versions should be identical. But one won 3 more games than the other. The calculated ELO could mislead me into thinking one is better than the other. But given that the error window is 100 ELO wide the test really just shows that I need to run more tests before I conclude anything.
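As an illustration of where numbers like these come from, here is a minimal sketch that derives an Elo difference and an approximate 95% margin from wins, losses and draws. It assumes the usual logistic Elo model and a normal approximation of the per-game score; whether the tool that printed the output above uses exactly this formula is an assumption, and the function names are made up for the example.

```rust
/// Convert a score fraction (strictly between 0 and 1) into an Elo difference.
fn elo_from_score(score: f64) -> f64 {
    -400.0 * (1.0 / score - 1.0).log10()
}

/// Elo difference and approximate 95% error margin for a match result.
/// A result of all wins or all losses gives an infinite Elo difference.
fn elo_with_margin(wins: u32, losses: u32, draws: u32) -> (f64, f64) {
    let n = f64::from(wins + losses + draws);
    let (w, l, d) = (
        f64::from(wins) / n,
        f64::from(losses) / n,
        f64::from(draws) / n,
    );
    // Mean score per game: a win counts 1, a draw 0.5, a loss 0.
    let mu = w + 0.5 * d;
    // Variance of a single game's score around that mean.
    let variance = w * (1.0 - mu).powi(2) + d * (0.5 - mu).powi(2) + l * mu.powi(2);
    // Standard error of the mean over n games.
    let std_err = (variance / n).sqrt();
    // Two-sided 95% quantile of the normal distribution.
    let z = 1.959964;
    let low = elo_from_score(mu - z * std_err);
    let high = elo_from_score(mu + z * std_err);
    (elo_from_score(mu), (high - low) / 2.0)
}
```

Calling `elo_with_margin(10, 7, 39)` gives roughly 18.6 Elo with a margin of roughly 50, in line with the output above. With so few games the interval comfortably includes zero.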

mvanthoor · Post by **mvanthoor** » Tue Mar 09, 2021 10:11 pm

lithander wrote: ↑Tue Mar 09, 2021 8:28 pm I really like that you include so many different engines in the gauntlets! You've got your own little CCRL version here!

But when I run a match between two engines of 1000 games I still get only a result within an error window of +/- 15.5 ELO. How much bigger is your error window with only 50 games per engine? Could it explain the surprising results regarding Fractal?

The reason I include a lot of engines is that every engine behaves differently. For example: I _know_ Alpha 1 plays badly against Mizar 3.0 and Celestial, but I also know it performs well against Shallow Blue (and even better against Pulse; so well that I actually decided not to include it for fear of skewing the results). These engines are all in the 1650-1725 range, but Rustic performs in a 1600-1800 range against them.

As you can also see, some engines took a nose-dive against Rustic+TT. Deepov dropped from +210 Elo (which, given Alpha 1's rating of 1677, would have put Deepov at 1887, which is about 40 points too much) to -14. If Deepov is -14 against Alpha 2, while it stood at 1887 against Alpha 1, that would put Alpha 2 at 1901. Which is probably too much.

On the other hand, Celestial wasn't really bothered by Rustic's TT: it dropped from +7 to -7. Fractal 1.0 rose from -28 to -21. Both are within the margin of error when going head-to-head. If these results are somewhat representative (which would only be possible by playing a thousand games against each engine, which I can't do), then the TT would seem to have no effect. Wukong also dropped a great deal, while Clueless dropped, but still stayed above +100 Elo.

Because of these differences, it's always advisable to test against many engines. I've set the bar at 10 engines, with 50 games per engine, at 1m+0.6, because it's similar to what CCRL does, but at a faster time control.

And yes, that would explain the problem / surprise with Fractal. As you can see, the error bars for the other engines are much bigger than those for Rustic.

What I plan to do is to "migrate up the list." As you can see, Alpha 1 scored -81 Elo (just short of 40%) against this engine field. Alpha 2 scored +15 Elo against it, which is just over 50%. Alpha 3 will have only killer moves and history ordering, which I don't expect will gain 100 points again; let's say 30. This version would then sit at +45, or about 56% against the field.
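The percentages quoted here follow from the standard logistic Elo formula. A minimal sketch, with a function name chosen only for illustration:

```rust
/// Expected score (0.0 to 1.0) for a given Elo difference against the field,
/// using the standard logistic Elo model.
fn expected_score(elo_diff: f64) -> f64 {
    1.0 / (1.0 + 10.0_f64.powf(-elo_diff / 400.0))
}
```

For the values above, `expected_score(-81.0)` is about 0.385, `expected_score(15.0)` about 0.522, and `expected_score(45.0)` about 0.564.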

At some point after testing version X (don't know which one yet) I'll drop the weakest 2-3 engines from the pool and replace them with 2-3 engines that are stronger than Clueless. Then I'll test both X and X+1 in that new pool. (So I can see the improvement from X to X+1 in the new pool.) All the games will go into a database, and be processed by BayesElo, which CCRL also uses. After processing, the list is calibrated against Alpha 1's rating in CCRL. Then I can exactly see what the expected result of a new version in CCRL would be.
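The calibration step amounts to shifting every rating in the list by the same offset so that the anchor engine lands on its known rating. A rough sketch of that idea follows; the function name and the sample input in the note below are hypothetical, and this is not BayesElo's own code.

```rust
/// Shift all ratings by one common offset so that `anchor` ends up at
/// `anchor_rating`. Returns `None` if the anchor is not in the list.
fn calibrate(
    ratings: &[(&str, f64)],
    anchor: &str,
    anchor_rating: f64,
) -> Option<Vec<(String, f64)>> {
    // Offset that moves the anchor's current rating onto the target rating.
    let offset = anchor_rating - ratings.iter().find(|(name, _)| *name == anchor)?.1;
    Some(
        ratings
            .iter()
            .map(|(name, rating)| (name.to_string(), rating + offset))
            .collect(),
    )
}
```

For instance, with made-up relative ratings of -52 for Alpha 1 and +52 for Alpha 2, anchoring Alpha 1 at 1677 yields 1677 and 1781. Rating differences between engines are unchanged by the shift.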

I don't know exactly where I'll put the cutoff, but it probably won't be above a 65% score, because at that point, Rustic will be at over +100 Elo already. I want to keep it in the middle of the pool. It's fun to see your engine at +inf Elo because it wins all the games, but it's not useful for testing obviously.

lithander wrote: ↑Tue Mar 09, 2021 8:28 pm Example:

Score of MinimalChess 0.3 vs MinimalChess Dev: 10 - 7 - 39  [0.527] 56
...      MinimalChess 0.3 playing White: 6 - 2 - 20  [0.571] 28
...      MinimalChess 0.3 playing Black: 4 - 5 - 19  [0.482] 28
...      White vs Black: 11 - 6 - 39  [0.545] 56
Elo difference: 18.6 +/- 50.3, LOS: 76.7 %, DrawRatio: 69.6 %%

The above versions should be identical. But one won 3 more games than the other. The calculated ELO could mislead me into thinking one is better than the other. But given that the error window is 100 ELO wide the test really just shows that I need to run more tests before I conclude anything.

Yes. I'm happy with 500 games in a gauntlet, because it puts the error bars at +/- 30. I don't know how many games I would need to play to get that down to let's say +/- 10 Elo. I'm neither a statistician, nor a mathematician.
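As a rough rule of thumb, the error margin shrinks with the square root of the number of games, so cutting it to a third takes about nine times as many games. A sketch under that assumption (it ignores changes in draw ratio and the slight non-linearity of the Elo scale, and the function name is hypothetical):

```rust
/// Estimate how many games are needed to shrink an error margin,
/// assuming the margin is proportional to 1 / sqrt(games).
fn games_for_margin(current_games: u32, current_margin: f64, target_margin: f64) -> u32 {
    // Shrinking the margin by a factor k requires k^2 times as many games.
    let factor = (current_margin / target_margin).powi(2);
    (f64::from(current_games) * factor).ceil() as u32
}
```

By this estimate, going from +/- 30 at 500 games to +/- 10 would take `games_for_margin(500, 30.0, 10.0)`, which is 4500 games.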

That's the reason why I said that, if I'm lucky and Rustic performs at the top of the Elo rating margin (and/or CCRL happens to choose engines against which Rustic overperforms), it could -JUST- reach 1800 Elo.

mvanthoor · Post by **mvanthoor** » Tue Mar 09, 2021 10:12 pm

adnoh wrote: ↑Tue Mar 09, 2021 7:29 pm Just wanted to thank Marcel and say how much I am enjoying this thread.

I'm glad you enjoy the thread.

I'll try to keep it running for some time to come.
