Progress on Rustic

mvanthoor · Post by **mvanthoor** » Mon Jun 07, 2021 11:17 pm

Today I finished the SPRT tests. Each test is against the previous version. The TT-cuts test at +41 Elo plus TT-move ordering at +102 Elo (about 143 Elo combined) is eerily close to the 140 Elo Rustic Alpha 2 advanced over Alpha 1 in the CCRL list, even though the tests below are self-play. Adding Killers on top of Alpha 2 gains 56 Elo, and adding PVS on top of Killers gains another 54, at least in self-play. It would be great if Alpha 3 gained 110 Elo, which would put it at around 1925 CCRL. I would be happy with a rating around 1900. (I had hoped to hit 2000 by also including AW and History, but in the current engine they don't seem to do anything. Probably the evaluation is not good enough yet.)

TT-cuts:
Score of Rustic Alpha 1.5 vs Rustic Alpha 1.1: 869 - 646 - 357 [0.560] 1872
... Rustic Alpha 1.5 playing White: 431 - 321 - 184 [0.559] 936
... Rustic Alpha 1.5 playing Black: 438 - 325 - 173 [0.560] 936
... White vs Black: 756 - 759 - 357 [0.499] 1872
Elo difference: 41.6 +/- 14.2, LOS: 100.0 %, DrawRatio: 19.1 %
SPRT: llr 2.95 (100.3%), lbound -2.94, ubound 2.94 - H1 was accepted

TT-move ordering:
Score of Rustic Alpha 2 vs Rustic Alpha 1.5: 409 - 197 - 133 [0.643] 739
... Rustic Alpha 2 playing White: 218 - 84 - 68 [0.681] 370
... Rustic Alpha 2 playing Black: 191 - 113 - 65 [0.606] 369
... White vs Black: 331 - 275 - 133 [0.538] 739
Elo difference: 102.5 +/- 23.5, LOS: 100.0 %, DrawRatio: 18.0 %
SPRT: llr 2.95 (100.1%), lbound -2.94, ubound 2.94 - H1 was accepted

Killers:
Score of Rustic Alpha 2.1.100 vs Rustic Alpha 2: 628 - 416 - 277 [0.580] 1321
... Rustic Alpha 2.1.100 playing White: 357 - 167 - 137 [0.644] 661
... Rustic Alpha 2.1.100 playing Black: 271 - 249 - 140 [0.517] 660
... White vs Black: 606 - 438 - 277 [0.564] 1321
Elo difference: 56.2 +/- 16.8, LOS: 100.0 %, DrawRatio: 21.0 %
SPRT: llr 2.94 (100.0%), lbound -2.94, ubound 2.94 - H1 was accepted

PVS:
Score of Rustic Alpha 2.2.100 vs Rustic Alpha 2.1.100: 591 - 388 - 318 [0.578] 1297
... Rustic Alpha 2.2.100 playing White: 334 - 171 - 143 [0.626] 648
... Rustic Alpha 2.2.100 playing Black: 257 - 217 - 175 [0.531] 649
... White vs Black: 551 - 428 - 318 [0.547] 1297
Elo difference: 54.8 +/- 16.6, LOS: 100.0 %, DrawRatio: 24.5 %
SPRT: llr 2.95 (100.3%), lbound -2.94, ubound 2.94 - H1 was accepted
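
As a cross-check, the "Elo difference" lines above can be reproduced from the win/loss/draw counts with the standard logistic Elo formula. This is a minimal sketch, not CuteChess's own code. The function name `elo_from_wld` is made up for this example, and the arguments follow the wins - losses - draws order of the score lines.

```rust
/// Elo difference implied by a match result, from the first engine's view.
/// Hypothetical helper for illustration only.
fn elo_from_wld(wins: u32, losses: u32, draws: u32) -> f64 {
    let games = (wins + losses + draws) as f64;
    // A draw counts as half a point.
    let score = (wins as f64 + 0.5 * draws as f64) / games;
    // Invert the logistic expectation: score = 1 / (1 + 10^(-elo / 400)).
    // A score of exactly 0 or 1 yields an infinite result.
    -400.0 * (1.0 / score - 1.0).log10()
}
```

For the TT-cuts run, `elo_from_wld(869, 646, 357)` comes out at roughly 41.6, in line with the reported value.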

This was the SPRT hypothesis for CuteChess:
-sprt elo0=1 elo1=5 alpha=0.05 beta=0.05

H1: Engine A is at least 1 Elo stronger than Engine B.
H0: Engine A is NOT more than 5 Elo stronger than Engine B.

So, if A is 50 Elo stronger than B, it's clear that H1 is accepted: the engine is at least 1 Elo stronger, and "it is NOT more than 5 Elo stronger" isn't true.

What if the engine is 3 Elo stronger? In that case, both H1 and H0 are true? It is more than 1 Elo stronger, but also NOT more than 5 Elo stronger. Or is that exactly the point when H0 is accepted?

What would happen if I set this to:
-sprt elo0=20 elo1=100 alpha=0.05 beta=0.05

Assume engine A is 50 Elo stronger. It's at least 20 stronger, but it's also not 100 Elo stronger, so both hypotheses are true at the same time. How is it decided which one is accepted? (As said: I assume that if the Elo difference falls between the two bounds, it's H0 that is accepted, but I don't know for sure. I can't really find a definitive answer.)

Reversing Elo0 and Elo1 also feels logical, but seems to be wrong, according to several posts I've found:
-sprt elo0=20 elo1=5 alpha=0.05 beta=0.05

H1: Accept if engine A is at least 20 Elo stronger
H0: Accept if engine A is not more than 5 Elo stronger

But: if the engine is 10 Elo stronger, both H1 and H0 would fail, because:
- A is at least 20 Elo stronger is false...
- A is not more than 5 Elo stronger is also false...

So no way to accept one.

(For me, the confusing/weird part is that H1 is the main hypothesis, as in "Engine A is at least X Elo stronger", which is set by Elo0, and H0 is the null-hypothesis, set by Elo1. At least, as far as I can interpret from CuteChess's description.)
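
For reference, the `lbound` and `ubound` values in the output above depend only on `alpha` and `beta`: they are the usual Wald SPRT thresholds for the log-likelihood ratio (llr). As the test is commonly described, H0 is "the Elo difference equals elo0" and H1 is "the Elo difference equals elo1", and the run stops as soon as the llr crosses either bound. A minimal sketch of the bounds only (the llr computation itself is not shown), with a made-up function name:

```rust
/// Wald SPRT stopping bounds for the log-likelihood ratio.
/// Returns (lower, upper): H0 is accepted at or below `lower`,
/// H1 at or above `upper`. Hypothetical helper for illustration.
fn sprt_bounds(alpha: f64, beta: f64) -> (f64, f64) {
    let lower = (beta / (1.0 - alpha)).ln();
    let upper = ((1.0 - beta) / alpha).ln();
    (lower, upper)
}
```

With `alpha = beta = 0.05` this gives roughly -2.94 and 2.94, the same bounds CuteChess prints.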

===

It seems there's a definite improvement for the version with Killers and PVS. I'll test that in a gauntlet, and if this improvement holds there, I'll turn that into Alpha 3 (with a new build-script based on a Makefile), and then move on to tapering and tuning. That would then become Rustic 4. (I think the engine can be considered to have left "basic camp" after tapering and tuning the eval; I assume it'll at least hit 2150 after tapering and tuning, if the increases of other engines such as MinimalChess are indicative of what tapering and tuning can deliver.)

emadsen · Post by **emadsen** » Sat Jun 12, 2021 1:10 am

mvanthoor wrote: ↑Sun Jun 06, 2021 11:07 pm The test with Aspiration Window (on top of killers and pvs, 50cp window, reset to INFINITY if it fails) is now running. It's at 51% +/- 0.2% against the version with killers+pvs, so it's unlikely this is going to make a huge difference. (Edit: while I was typing this post, the version with AW dropped to 49.8%. We're 1200 games into the test. So, I feel as if AW are not going to make any clear difference, at least not with the 0.5 window, and a simple evaluation.)

That was my experience with AW. They didn't provide any strength increase beyond what PVS had already added. What's challenging, though, is that there are so many possible implementations (open one side of the window, open both sides, increase the window in small increments, increase by doubling, etc). I tried a few. But at some point I gave up, no longer expecting to find any gain.
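
To make the variants concrete, here is a minimal sketch of the simplest one, the version described in the quote: search with a narrow window around the previous score and fall back to a full-width re-search on any failure. The names `aspiration_search`, `search` and `INF` are hypothetical; `search(alpha, beta)` stands in for an engine's root search and is not Rustic's or MadChess's actual code.

```rust
/// Stand-in for "infinity" in centipawns (assumed value).
const INF: i32 = 30_000;

/// Runs one iteration with an aspiration window around `prev_score`.
/// `search(alpha, beta)` is a placeholder for the engine's root search.
fn aspiration_search<F>(prev_score: i32, window: i32, mut search: F) -> i32
where
    F: FnMut(i32, i32) -> i32,
{
    let alpha = prev_score - window;
    let beta = prev_score + window;
    let score = search(alpha, beta);
    if score <= alpha || score >= beta {
        // Fail low or fail high: the true score lies outside the window,
        // so repeat the search with the window fully open.
        return search(-INF, INF);
    }
    score
}
```

The other implementations mentioned above differ only in the failure branch: instead of opening both sides at once, they widen just the failing side, either in fixed steps or by doubling.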

mar · Post by **mar** » Sat Jun 12, 2021 1:31 am

aspiration windows were a clear gain for me.
first they reduce number of nodes significantly (assuming you start with a sufficiently small window)

I use them to improve time management as well (this is actually worth quite a bit of elo IIRC), when a move starts failing low, I resolve the fail low first to know how bad it actually is instead of trying to look for a better move - this is typically fast and if the score goes below some margin from previous iteration, I enter the "panic mode" where I try to spend up to a certain fraction of total available time to try to finish the iteration and hopefully find a move that avoids a blunder. I also enter the panic mode when I get a fail low after a fail high, which is rare.
(it's a bit tricky to get this right though and I could probably do better here)

another thing I do is if a move starts to fail high and time is up, I play that move anyway, betting that it'd ultimately improve

mvanthoor · Post by **mvanthoor** » Sat Jun 12, 2021 1:37 am

emadsen wrote: ↑Sat Jun 12, 2021 1:10 am ...

mar wrote: ↑Sat Jun 12, 2021 1:31 am ...

Thanks, Eric and mar. I haven't written off AW yet. I also had no success with history (but I also have no evaluation to speak of at this point). I'll try them both again later.

I've been running that TT experiment from the other thread. It's the last thing I wanted to research before finally releasing a new version. Alpha 3 is marginally improved. Killers and PVS have a clear gain in Elo, somewhere between 35 and 85 Elo in total, depending on time controls and opponents.

The two greatest perks for Alpha 3 will be its new build script (Makefile) which builds all CPU targets for the current OS, and a fairly large update for the documentation at https://rustic-chess.org/. That book thing is finally coming together.

And obviously, I learned a lot again. I can at least quickly re-implement and re-test AW and history later.

After Alpha 3 I'll start on the tapered evaluation (which is fairly simple, just 2 incremental sets of tables and the game phase), and the tuning (which I need to research completely from scratch).
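
A minimal sketch of the interpolation step, assuming the commonly used phase weights (1 per minor piece, 2 per rook, 4 per queen, so 24 for the starting position). The weights and the names `MAX_PHASE`, `game_phase` and `tapered_score` are assumptions for illustration, not Rustic's actual code.

```rust
/// Phase of the starting position under the assumed weights:
/// 8 minor pieces (1 each), 4 rooks (2 each), 2 queens (4 each).
const MAX_PHASE: i32 = 24;

/// Game phase from the piece counts of both sides combined.
/// Capped so that extra queens from promotions cannot exceed MAX_PHASE.
fn game_phase(minors: i32, rooks: i32, queens: i32) -> i32 {
    (minors + 2 * rooks + 4 * queens).min(MAX_PHASE)
}

/// Blends the middlegame and endgame scores: phase == MAX_PHASE gives
/// the pure middlegame score, phase == 0 the pure endgame score.
fn tapered_score(mg: i32, eg: i32, phase: i32) -> i32 {
    (mg * phase + eg * (MAX_PHASE - phase)) / MAX_PHASE
}
```

Here `mg` and `eg` would be the two incrementally updated PST sums, so the only per-node cost is the blend itself.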

My hopes were to reach 2000 with Killers, PVS, AW and history; but AW and history aren't in because they gain nothing yet; I'm glad if the engine actually reaches 1900 now. With the tapering and tuning, I expect to gain _at least_ 200 Elo, so I hope to release Rustic 4 (without the "Alpha") at a strength of at least 2100.

At that point, it still won't have any evaluation terms or pruning.

Features would then be MVV_LVA, TT-move ordering, the TT itself, Killers, PVS, tapered eval, and tuned PST's, and that'd be it. Next would be null move and some other search improvements; especially looking into how I can cut down on the massive QSearch.

Somewhere down the line I'll probably need to write some eval terms, and tune them too. Seeing the engine lose because it has no concept of passed pawns, king safety or mobility is becoming... irritating.

Ras · Post by **Ras** » Wed Jun 16, 2021 11:38 pm

mvanthoor wrote: ↑Sat Jun 12, 2021 1:37 am especially looking into how I can cut down on the massive QSearch.

Delta pruning and recapture-only mode will already help a lot.
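
Delta pruning, as it is usually described: in quiescence search, a capture is skipped when even winning the captured piece plus a safety margin cannot lift the static evaluation back up to alpha. A minimal sketch, assuming centipawn scores; the 200cp margin and the function name are assumptions for illustration, not values taken from any particular engine. Recapture-only mode is not shown.

```rust
/// Safety margin in centipawns (assumed value, would need tuning).
const DELTA_MARGIN: i32 = 200;

/// Returns true if a capture can be skipped in quiescence search:
/// the stand-pat score plus the value of the captured piece plus the
/// margin still stays below alpha.
fn can_delta_prune(stand_pat: i32, captured_value: i32, alpha: i32) -> bool {
    stand_pat + captured_value + DELTA_MARGIN < alpha
}
```

With a stand-pat score of -500 and alpha at 0, capturing a pawn (100) is pruned, while capturing a queen (900) is still searched.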

mvanthoor · Post by **mvanthoor** » Fri Jun 18, 2021 11:29 pm

Release of Rustic Alpha 3.0.0

Killer Moves + PVS. Self-play gain is +81 Elo +/- 20 Elo. (CuteChess SPRT-test.)

Now I can finally start on the tapering and tuning. The tapering will be easy.

Then I'll finally have to look into the tuning part. Maybe it's a chance to use the Rust Rayon library. That thing is awesome. If you have stuff that needs to run from beginning to end (such as a for-loop that doesn't break), you can have Rayon parallelize it automatically. In Rustic, that's not very useful (nor would it be ethical, because the engine would effectively become partially multi-threaded and thus skew testing results), but if the tuner does things like "run this loop 5 million times", it will be worthwhile.

mar · Post by **mar** » Sat Jun 19, 2021 1:10 am

first, congratulations on your progress

mvanthoor wrote: ↑Fri Jun 18, 2021 11:29 pm Maybe it's a chance to use the Rust Rayon library. That thing is awesome. If you have stuff that needs to run from beginning to end (such as a for-loop that doesn't break), you can have Rayon parallelize it automatically. In Rustic, that's not very useful (nor would it be ethical, because the engine would effectively become partially multi-threaded and thus skew testing results), but if the tuner does things like "run this loop 5 million times", it will be worthwhile

parallelizing (embarrassingly parallel) for loops is easy, but comes with a catch. the loop needs to do a lot of work to make it worthwhile
also don't forget that waking up a worker takes ~15 microseconds, unless you spin the workers forever, which is not a great use of resources in a typical program
typically a parallel for waits for the loop to complete (i.e. it blocks the issuing thread until the loop is done), so unless the workload is perfectly balanced, you waste time at the end of each such loop
surely this can't be used to parallelize alphabeta, because you'd have to search all root moves with full bounds, which isn't great

so - unlike perft, search is not easy to parallelize, but there's an easy way to avoid this by using things like Lazy SMP where you let the worker threads chew on their own and communicate only through TT, if you allow data races (which are harmless in this case), no synchronization is required during search, which is a massive improvement when a lot of threads are involved

mvanthoor · Post by **mvanthoor** » Sat Jun 19, 2021 1:38 am

mar wrote: ↑Sat Jun 19, 2021 1:10 am first, congratulations on your progress

Thanks

I look forward to implementing the tapering (which is basically just implementing another incremental PST and the phase transition/interpolation), to see how much strength it gains.

For MinimalChess, I remember that it gained something like 250 Elo (and another 50-60 with staged move generation and some other optimizations). If I could gain 50 Elo with Killers + PVS and another 250 with tapering/tuning, Rustic would be over 2100. (I consider reaching 2000 to be out of "alpha camp".)

I'd be happy if I can reach 2100 without even having implemented any pruning (except for a/b obviously) and having no evaluation terms yet.

What I find a bit strange is:

Alpha 2 + Killers vs. Alpha 2: +54 Elo (+/- 15)
Alpha 2 + PVS vs. Alpha 2: +50 Elo (+/- 20)

One would expect >= 100 Elo in self play, but if you combine PVS and Killers:

Alpha 3 (= Alpha 2 + killers + PVS) vs Alpha 2: +81 Elo (+/- 20).

The runs were not matches; they were done with SPRT. Probably it's just the error margins.
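
A rough check of that guess, assuming the runs are independent and the margins can be treated as ordinary error bars (numbers from SPRT-stopped runs are only approximately that): independent margins combine as a root sum of squares rather than adding up. The function name is made up for this sketch.

```rust
/// Combined error margin of independent estimates (root sum of squares).
/// Hypothetical helper for illustration only.
fn combined_margin(margins: &[f64]) -> f64 {
    margins.iter().map(|m| m * m).sum::<f64>().sqrt()
}
```

`combined_margin(&[15.0, 20.0])` is 25, so the two separate gains add up to 104 +/- 25. Against the combined run at 81 +/- 20, the gap of 23 Elo is smaller than `combined_margin(&[25.0, 20.0])`, which is about 32, so the numbers are consistent with each other.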

I've also noticed that the effectiveness of killers + PVS depends heavily on the engine Rustic plays against. Against some engines, killers + PVS indeed gain almost 100 Elo; against some other engines they gain only 25 Elo together. Alpha 3.0.0 is able to comfortably defeat some engines in the 1900-1950 Elo range in 2000-game head-to-head matches, while also being unable to defeat some other ~1875 rated engines in 2000-game matches.

I wonder why it is that Rustic performs at +30 against some 1950 rated engines, while also performing -50 against some 1875 rated engines. I haven't been able to explain this.

mar wrote: ↑Sat Jun 19, 2021 1:10 am parallelizing (embarrassingly parallel) for loops is easy, but comes with a catch. the loop needs to do a lot of work to make it worthwhile
also don't forget that waking up a worker takes ~15 microseconds, unless you spin the workers forever, which is not a great use of resources in a typical program
typically a parallel for waits for the loop to complete (i.e. it blocks the issuing thread until the loop is done), so unless the workload is perfectly balanced, you waste time at the end of each such loop
surely this can't be used to parallelize alphabeta, because you'd have to search all root moves with full bounds, which isn't great

so - unlike perft, search is not easy to parallelize, but there's an easy way to avoid this by using things like Lazy SMP where you let the worker threads chew on their own and communicate only through TT, if you allow data races (which are harmless in this case), no synchronization is required during search, which is a massive improvement when a lot of threads are involved

I know, but as far as I've seen, many tuning algorithms just take a data set and do things like: "for the entire data set { ... }". That's why I said that Rayon works especially well for loops that run all the way through and don't suddenly do a break during the run. It might just work for the tuner. I'm not going to use it in the engine.

Even if it doesn't scale perfectly, like boosting speed by 2.75x on 4 cores, instead of 3.5x for a perfect hand-made implementation, it's worth it. I don't want to spend a huge amount of time on speeding up the tuner (at least not for the first version), but I'll take what I can get, especially if it doesn't cost me any extra work.
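
A minimal sketch of such a "for the entire data set" loop, split across threads. Rayon is an external crate, so to keep the example standard-library only it uses `std::thread::scope` with manual chunking instead; with Rayon the same reduction would typically be a `par_iter()` chain. The function name and the `f` parameter are hypothetical, and a real tuner would sum its own error term.

```rust
use std::thread;

/// Sums `f(x)` over `data`, splitting the slice across `workers` threads.
/// Hypothetical helper for illustration only.
fn parallel_sum(data: &[f64], workers: usize, f: fn(f64) -> f64) -> f64 {
    let workers = workers.max(1);
    // Round up so that at most `workers` chunks are produced.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn one thread per chunk; collect() starts them all before joining.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|part| scope.spawn(move || part.iter().map(|&x| f(x)).sum::<f64>()))
            .collect();
        // Joining blocks until every chunk has finished.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<f64>()
    })
}
```

Equal-sized chunks only balance well when every element costs about the same, which is the case for a loop that always runs to the end; this is also where the end-of-loop waiting mentioned in the quote shows up if it is not.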

mar · Post by **mar** » Sat Jun 19, 2021 1:48 am

mvanthoor wrote: ↑Sat Jun 19, 2021 1:38 am I know, but as far as I've seen, many tuning algorithms just take a data set and do things like: "for the entire data set { ... }". That's why I said that Rayon works especially well for loops that run all the way through and don't suddenly do a break during the run. It might just work for the tuner. I'm not going to use it in the engine.

Even if it doesn't scale perfectly, like boosting speed by 2.75x on 4 cores, instead of 3.5x for a perfect hand-made implementation, it's worth it. I don't want to spend a huge amount of time on speeding up the tuner (at least not for the first version), but I'll take what I can get, especially if it doesn't cost me any extra work.

ah yes, I thought you were talking about search

of course you want to parallelize your tuner and you should get the expected 3.5x (or even a bit more with hyperthreading - from my experience you typically get around 30% speedup by using 8 threads instead of 4 on a quad, but it depends on what you do)

mar · Post by **mar** » Sat Jun 19, 2021 1:50 am

mvanthoor wrote: ↑Sat Jun 19, 2021 1:38 am I wonder why it is that Rustic performs at +30 against some 1950 rated engines, while also performing -50 against some 1875 rated engines. I haven't been able to explain this.

this is normal, the programs are completely different so you can do better against some and worse against other opponents in the same elo range
