Do you score with distance from root? I.e. do you reward reaching good positions earlier, and bad ones later?
Progress on Rustic
-
- Posts: 2488
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Progress on Rustic
Rasmus Althoff
https://www.ct800.net
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Progress on Rustic
No; I just search and evaluate. So I should subtract ply from the score, so that the same score is lower when it is reached later. Same as adding ply to minus checkmate.
(Would it be helpful somehow to use 2 * ply for more distinction?)
I understand the idea, but I don’t understand why an engine would postpone good things...
-
- Posts: 2488
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Progress on Rustic
Yes, similar approach. I subtract the ply from eval if eval is positive, but don't reduce it to less than +1. If eval is negative, I add the ply, but not to more than -1 as result. I think you're right that some minor eval noise such as slightly better positioning at the end of the chain could be the cause. Not sure whether 2*ply would be better. That could be up to some experiments after you have added the basic distance handling.
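The scheme described above could be sketched roughly as follows. This is only a minimal illustration, not code from either engine: the function name `distance_adjusted_eval` is hypothetical, and it assumes scores are integers in centipawns from the side to move's point of view, with `ply` counting half-moves from the root.

```python
def distance_adjusted_eval(static_eval: int, ply: int) -> int:
    """Nudge a leaf evaluation toward zero by its distance from the root.

    Hypothetical helper: static_eval is assumed to be in centipawns from
    the side to move's point of view; ply is the number of half-moves
    from the root.
    """
    if static_eval > 0:
        # A good position is worth slightly less the later it is reached,
        # but never less than +1, so it stays a positive score.
        return max(static_eval - ply, 1)
    if static_eval < 0:
        # A bad position looks slightly less bad the later it occurs,
        # but never better than -1, so it stays a negative score.
        return min(static_eval + ply, -1)
    # A dead-even score is left alone.
    return 0


print(distance_adjusted_eval(50, 4))    # same +50 eval, reached at ply 4
print(distance_adjusted_eval(50, 10))   # same +50 eval, reached at ply 10
```

With this adjustment the same static score found at a shallower ply compares as better, which is the intended tie-break. Whether the step should be 1 or 2 per ply is left open here, as in the post above.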
Rasmus Althoff
https://www.ct800.net
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Progress on Rustic
Thanks. I’ll add this as well. I saw it before, mainly in the end game, but now that there’s a TT this behavior gets really obvious (and annoying).
Do you by any chance have an idea why the TT move ordering does seem to have a positive impact on nodes searched (less) and Time to Depth (shorter), while not gaining any strength over the course of 200 games?
Could it actually be that 200 games isn’t enough...?
-
- Posts: 2488
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: Progress on Rustic
Calculating deeper only leads to more strength if that actually translates into finding better moves. Take the extreme case of eval always returning the same value, then deeper calculations will not be useful.
Rasmus Althoff
https://www.ct800.net
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Progress on Rustic
Ras wrote: ↑Mon Mar 08, 2021 12:18 am
Yes, similar approach. I subtract the ply from eval if eval is positive, but don't reduce it to less than +1. If eval is negative, I add the ply, but not to more than -1 as result. I think you're right that some minor eval noise such as slightly better positioning at the end of the chain could be the cause. Not sure whether 2*ply would be better. That could be up to some experiments after you have added the basic distance handling.

This didn't work when I tested it. The evaluation isn't differentiated enough; there are too many similar moves. Adding or subtracting the depth has too big an impact.
Progress: I've had the TT addition tested during the night. It seems to gain roughly 105 Elo, with a margin of about 25.
Gauntlet for Rustic Alpha 1:
Code: Select all
Rank Name               Elo  +/-  Games  Score  Draws
   0 Rustic Alpha 1     -81   29    500  38.5%  16.2%
   1 Deepov 0.4         210  109     50  77.0%  14.0%
   2 Wukong JS 1.4      182  110     50  74.0%   8.0%
   3 Clueless 1.4       164  107     50  72.0%   8.0%
   4 CDrill Build 4      92  100     50  63.0%   6.0%
   5 Pigeon 1.5.1        85   82     50  62.0%  32.0%
   6 TSCP 1.81           78   98     50  61.0%   6.0%
   7 Shallow Blue 2.0    35   89     50  55.0%  18.0%
   8 Mizar 3             28   93     50  54.0%  12.0%
   9 Celestial 1.0        7   87     50  51.0%  22.0%
  10 FracTal 1.0        -28   78     50  46.0%  36.0%
Gauntlet for Rustic Alpha 2 rc5:
Code: Select all
Rank Name                 Elo  +/-  Games  Score  Draws
   0 Rustic Alpha 2 rc5    15   27    500  52.1%  20.6%
   1 Clueless 1.4         108   98     50  65.0%  10.0%
   2 Pigeon 1.5.1          42   79     50  56.0%  36.0%
   3 CDrill Build 4        28   90     50  54.0%  16.0%
   4 Wukong JS 1.4         21   87     50  53.0%  22.0%
   5 Celestial 1.0         -7   87     50  49.0%  22.0%
   6 Deepov 0.4           -14   90     50  48.0%  16.0%
   7 FracTal 1.0          -21   74     50  47.0%  42.0%
   8 TSCP 1.81            -63   93     50  41.0%  14.0%
   9 Shallow Blue 2.0     -78   94     50  39.0%  14.0%
  10 Mizar 3             -173  103     50  27.0%  14.0%
Code: Select all
Rank Name                 Elo    +    -  games  score  oppo.  draws
   1 Clueless 1.4        1882   62   59    100    69%   1729     9%
   2 Wukong JS 1.4       1837   58   56    100    64%   1729    15%
   3 Deepov 0.4          1825   58   56    100    63%   1729    15%
   4 CDrill Build 4      1799   58   57    100    59%   1729    11%
   5 Pigeon 1.5.1        1790   53   52    100    59%   1729    34%
   6 Rustic Alpha 2 rc5  1781   25   25    500    52%   1767    21%
   7 TSCP 1.81           1738   57   57    100    51%   1729    10%
   8 Celestial 1.0       1728   55   55    100    50%   1729    22%
   9 FracTal 1.0         1709   51   52    100    47%   1729    39%
  10 Shallow Blue 2.0    1708   56   56    100    47%   1729    16%
  11 Rustic Alpha 1      1677   25   26    500    39%   1767    16%
  12 Mizar 3             1657   56   58    100    41%   1729    13%
FracTal 1.0 got a trouncing by Alpha 1 on CCRL; its rating of 1709 in the list above is +59 Elo vs. CCRL, so FracTal performed much better than would be expected against Rustic with TT. It actually _gained_ rating against it compared to the first gauntlet. Mizar 3.0 is an engine about as strong as Alpha 1, but Alpha 1 plays badly against it. With the TT included, however, Rustic suddenly has no problems with Mizar 3 anymore.
With some luck (and depending on the engines the CCRL testers choose), with Alpha 2 performing at the top of the Elo margins, the engine might just start scratching 1800. I just need to add the UCI options and make a release over the weekend.
-
- Posts: 72
- Joined: Tue Jun 26, 2007 6:31 am
- Full name: Charles Wong
Re: Progress on Rustic
Just wanted to thank Marcel and say how much I am enjoying this thread.
-
- Posts: 881
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Progress on Rustic
I really like that you include so many different engines in the gauntlets! You've got your own little CCRL version here!
But when I run a match of 1000 games between two engines, I still only get a result within an error window of +/- 15.5 Elo. How much bigger is your error window with only 50 games per engine? Could it explain the surprising results regarding Fractal?
Example:
Code: Select all
Score of MinimalChess 0.3 vs MinimalChess Dev: 10 - 7 - 39 [0.527] 56
... MinimalChess 0.3 playing White: 6 - 2 - 20 [0.571] 28
... MinimalChess 0.3 playing Black: 4 - 5 - 19 [0.482] 28
... White vs Black: 11 - 6 - 39 [0.545] 56
Elo difference: 18.6 +/- 50.3, LOS: 76.7 %, DrawRatio: 69.6 %
The above versions should be identical. But one won 3 more games than the other. The calculated Elo could mislead me into thinking one is better than the other. But given that the error window is 100 Elo wide, the test really just shows that I need to run more tests before I conclude anything.
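The Elo difference and error margin in the match output above can be approximately reproduced from the win/loss/draw counts alone. The sketch below assumes a normal approximation on the per-game score and a 95% confidence level; the function names are hypothetical, and the match tool that printed the output may compute its margin slightly differently.

```python
import math
from statistics import NormalDist


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, losses: int, draws: int, confidence: float = 0.95):
    """Return (elo_difference, error_margin) for one match result.

    Assumption: the mean score per game is treated as normally distributed,
    and the confidence bounds stay strictly between 0 and 1.
    """
    games = wins + losses + draws
    w, l, d = wins / games, losses / games, draws / games
    mean = w + 0.5 * d
    # Variance of a single game's score (1, 0.5 or 0) around the mean.
    variance = w * (1.0 - mean) ** 2 + d * (0.5 - mean) ** 2 + l * (0.0 - mean) ** 2
    std_err = math.sqrt(variance / games)
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    low = elo_from_score(mean - z * std_err)
    high = elo_from_score(mean + z * std_err)
    return elo_from_score(mean), (high - low) / 2.0


elo, margin = elo_with_margin(wins=10, losses=7, draws=39)
print(f"Elo difference: {elo:+.1f} +/- {margin:.1f}")
```

For the 10 - 7 - 39 result this lands very close to the reported 18.6 +/- 50.3. Any small difference in the last digit would come from how the tool approximates the normal quantile.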
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor
Re: Progress on Rustic
lithander wrote: ↑Tue Mar 09, 2021 8:28 pm
I really like that you include so many different engines in the gauntlets! You've got your own little CCRL version here!
But when I run a match between two engines of 1000 games I still get only a result within an error window of +/- 15.5 ELO. How much bigger is your error window with only 50 games per engine? Could it explain the surprising results regarding Fractal?

The reason I include a lot of engines is that every engine behaves differently. For example: I _know_ Alpha 1 plays badly against Mizar 3.0 and Celestial, but I also know it performs well against Shallow Blue (and even better against Pulse; so well that I actually decided not to include it for fear of skewing the results). These engines are all in the 1650-1725 range, but Rustic performs in a 1600-1800 range against them.
As you can also see, some engines took a nose-dive against Rustic+TT. Deepov dropped from +210 Elo (which, given Alpha 1's rating of 1677, would have put Deepov at 1887, which is about 40 points too much) to -14. If Deepov is -14 against Alpha 2, while it stood at 1887 against Alpha 1, that would put Alpha 2 at 1901, which is probably too much.
On the other hand, Celestial wasn't really bothered by Rustic's TT: it dropped from +7 to -7. Fractal 1.0 rose from -28 to -21. Both are within the margin of error when going head-to-head. If these results are somewhat representative (which would only be possible by playing a thousand games against each engine, which I can't do), then the TT would seem to have no effect. Wukong also dropped a great deal, while Clueless dropped, but still stayed above +100 Elo.
Because of these differences, it's always advisable to test against many engines. I've set the bar at 10 engines, with 50 games per engine, at 1m+0.6, because it's similar to what CCRL does, but at a faster time control.
And yes, that would explain the problem / surprise with Fractal. As you can see, the error bars for the other engines are much bigger than those for Rustic.
What I plan to do is to "migrate up the list." As you can see, Alpha 1 scored -81 Elo (just short of 40%) against this engine field. Alpha 2 scored +15 Elo against it, which is just over 50%. Alpha 3 will have only killer moves and history ordering, which I don't expect will gain 100 points again; let's say 30. This version would then sit at +45, or about 56% against the field.
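The conversions between score percentage and Elo difference used in this paragraph follow the standard logistic Elo formula. A small sketch, with hypothetical function names, to check the numbers:

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score fraction against a field that is elo_diff weaker."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# Gauntlet scores from the tables: 38.5% for Alpha 1, 52.1% for Alpha 2.
print(round(elo_from_score(0.385)))   # Alpha 1 against the field
print(round(elo_from_score(0.521)))   # Alpha 2 against the field
# Projected +45 Elo for Alpha 3, expressed as a score fraction.
print(round(expected_score(45), 3))
```

These give about -81, +15 and 0.564, matching the gauntlet rows and the "about 56%" estimate.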
At some point after testing version X (don't know which one yet) I'll drop the weakest 2-3 engines from the pool and replace them with 2-3 engines that are stronger than Clueless. Then I'll test both X and X+1 in that new pool. (So I can see the improvement from X to X+1 in the new pool.) All the games will go into a database, and be processed by BayesElo, which CCRL also uses. After processing, the list is calibrated against Alpha 1's rating in CCRL. Then I can exactly see what the expected result of a new version in CCRL would be.
I don't know exactly where I'll put the cutoff, but it probably won't be above a 65% score, because at that point, Rustic will be at over +100 Elo already. I want to keep it in the middle of the pool. It's fun to see your engine at +inf Elo because it wins all the games, but it's not useful for testing obviously.
lithander wrote: ↑Tue Mar 09, 2021 8:28 pm
Example:
Code: Select all
Score of MinimalChess 0.3 vs MinimalChess Dev: 10 - 7 - 39 [0.527] 56
... MinimalChess 0.3 playing White: 6 - 2 - 20 [0.571] 28
... MinimalChess 0.3 playing Black: 4 - 5 - 19 [0.482] 28
... White vs Black: 11 - 6 - 39 [0.545] 56
Elo difference: 18.6 +/- 50.3, LOS: 76.7 %, DrawRatio: 69.6 %
The above versions should be identical. But one won 3 more games than the other. The calculated ELO could mislead me into thinking one is better than the other. But given that the error window is 100 ELO wide the test really just shows that I need to run more tests before I conclude anything.

Yes. I'm happy with 500 games in a gauntlet, because it puts the error bars at +/- 30. I don't know how many games I would need to play to get that down to let's say +/- 10 Elo. I'm neither a statistician, nor a mathematician.
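On the question of how many games it would take to shrink the error bars: as a rough rule of thumb, the margin scales with one over the square root of the number of games, so cutting it to a third takes about nine times as many games. A back-of-the-envelope sketch (the function name is hypothetical, and it assumes the draw rate and score stay roughly the same as in the existing sample):

```python
import math


def games_needed(current_games: int, current_margin: float, target_margin: float) -> int:
    """Estimate games required to reach target_margin.

    Assumption: the error margin is proportional to 1 / sqrt(games),
    which holds approximately when draw rate and score do not change.
    """
    return math.ceil(current_games * (current_margin / target_margin) ** 2)


# 500 games gave about +/- 30 Elo; estimate what +/- 10 Elo would need.
print(games_needed(500, 30, 10))
```

Under that assumption, going from +/- 30 to +/- 10 takes on the order of 4500 games per gauntlet.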
That's the reason why I said that, if I'm lucky and Rustic performs at the top of the Elo rating margin (and/or CCRL happens to choose engines against which Rustic overperforms), it could -JUST- reach 1800 Elo.
-
- Posts: 1784
- Joined: Wed Jul 03, 2019 4:42 pm
- Location: Netherlands
- Full name: Marcel Vanthoor