FGRL rating lists - Winter 0.8, Pirarucu 3.3.5, Marvin 3.6.0

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw

fastgm
Posts: 818
Joined: Mon Aug 19, 2013 6:57 pm

FGRL rating lists - Winter 0.8, Pirarucu 3.3.5, Marvin 3.6.0

Post by fastgm »

FGRL rating list - 10 min + 6 sec

32. Winter 0.8 (+81 to Winter 0.7)
41. Pirarucu 3.3.5 (+18 to Pirarucu 3.2.4)

FGRL rating list - 60 sec + 0.6 sec
Now with 400 engine versions and 6.182.250 games!

32. Pirarucu 3.3.5 (-3 to Pirarucu 3.2.4)
36. Marvin 3.6.0 (+58 to Marvin 3.5.0)

http://www.fastgm.de
jorose
Posts: 360
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: FGRL rating lists - Winter 0.8, Pirarucu 3.3.5, Marvin 3.6.0

Post by jorose »

Thank you for testing Winter at LTC!

It kind of fascinates me that Winter sits between Texel and Igel in the rapid list, which are 129 and 94 Elo above it, respectively, in the bullet list.
-Jonathan
Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

Re: FGRL rating lists - Winter 0.8, Pirarucu 3.3.5, Marvin 3.6.0

Post by Alayan »

Poor nps hurts in the bullet list. You can also look at how Booot moves up the ranks as the TC gets longer. A 100 elo change from bullet to 10m is certainly above what would be expected from a pure nps issue, however.

But it's important to understand that "good scaling" should be judged only by how good an engine can be at a longer TC, not by how horrible it can be at a short TC.

To take this to the extreme, imagine this: a Stockfish fork that searches to depth 1, then waits 1s doing nothing, then resumes the search as usual. If time management says to move during the wait period, it does so with its depth-1 move. This Stockfish fork would have horrendous bullet performance, yet at rapid it would come close to normal Stockfish, and it would be almost indistinguishable at long classical TC.

Does this mean this Stockfish fork has "better scaling" than regular Stockfish? Not really. When there is legitimately better scaling, the engine that scales better will asymptotically achieve a higher performance. It might need an impractical amount of time to do so, but it will do so.

Another important element is that elo gets harder and harder to gain as strength increases, because mistakes are less and less obvious. A 60 elo gap compared to Stockfish @ classical TC is much bigger than a 100 elo gap compared to Crafty @ bullet TC. This "elo compression" means that looking at the elo-difference at iso-time as time increases is misleading when it comes to scaling.
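To put rough numbers on that compression (the win and draw percentages below are made up for illustration, and rising draw rates are only part of why elo gets harder to gain), the same superiority in decisive games translates into far fewer elo points once most games are drawn:

[code]
import math

def elo_diff(score):
    # Elo difference implied by an expected score (standard logistic model).
    return -400 * math.log10(1 / score - 1)

# Same superiority in decisive games (winning 70% of them), different draw rates.
for draw_rate in (0.2, 0.8):
    score = (1 - draw_rate) * 0.7 + draw_rate * 0.5   # expected score per game
    print(f"draw rate {draw_rate:.0%}: score {score:.2f} -> {elo_diff(score):+.0f} elo")
[/code]

With 20% draws this comes out around +115 elo, with 80% draws only around +28 elo.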

What's the correct metric for assessing scaling, then?

Instead of elo-difference at iso-time, the proper metric is time-difference at iso-elo. An extremely nice property of this metric is that if you take two functionally identical versions of an engine, the only difference being that one is faster than the other, the time needed at iso-elo differs by a constant factor. A constant factor tells us that both scale exactly the same, which is the correct result.

If that factor tends to increase as elo goes up, then the stronger engine is scaling better, while if it tends to go down, the weaker engine might actually be the stronger one at a long enough TC. My "wait 1s doing nothing" example still somewhat games this metric if you only look at low elo, but because interpretation focuses on the highest-elo data points it does okay. In more realistic situations (e.g. badly optimized engines vs decently optimized engines), it works out very well.

This isn't perfect because elo isn't transitive and you'd need a multi-value system to better capture some erratic behaviors, but it's the closest you can get with elo.
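To make the comparison concrete, here is a minimal sketch of a time-at-iso-elo measurement; the engine names, elo numbers, and the log-time interpolation are assumptions for illustration, not real list data:

[code]
import numpy as np

# Hypothetical data: elo measured at several time controls for two engines.
times = np.array([60, 120, 240, 480, 960, 1920])        # seconds per game
elo_a = np.array([3000, 3060, 3110, 3150, 3180, 3205])  # engine A
elo_b = np.array([2950, 3020, 3080, 3130, 3170, 3200])  # engine B

def time_to_reach(target_elo, elos, times):
    # Interpolate in log-time, since elo grows roughly linearly per time doubling.
    return float(np.exp(np.interp(target_elo, elos, np.log(times))))

# Iso-elo comparison: the factor by which B needs more time than A.
for target in (3000, 3050, 3100, 3150, 3200):
    factor = time_to_reach(target, elo_b, times) / time_to_reach(target, elo_a, times)
    print(f"elo {target}: B needs {factor:.2f}x the time of A")
[/code]

A roughly constant factor means both scale the same; a factor that shrinks as the target elo rises means B scales better and might overtake A at a long enough TC.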
jorose
Posts: 360
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: FGRL rating lists - Winter 0.8, Pirarucu 3.3.5, Marvin 3.6.0

Post by jorose »

Nice writeup and I like your thought experiment with SF "do nothing". I agree that a large part of Winter's scaling probably comes from bad fast-TC performance and low N/s relative to other classical engines. I do wonder how best to measure scaling.

Assuming an equivalent definition of a node, one way might be to benchmark N/s for various engines and create a speed-normalized rating list. Equivalent in this context means counting the nodes the same way and having the nodes be incapable of doing more than a quiescent position evaluation. This second point is important, as the large CNNs in Leela and company can do some pseudo-QSearch and thus cannot really be compared with classical engines, even if you could compare MCTS node counting to AB node counting.
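As a rough sketch of the normalization step (all nps numbers and engine names below are hypothetical), the benchmarked speeds would be turned into per-engine time controls that give everyone roughly the same node budget:

[code]
# Hypothetical nps benchmarks; the reference engine keeps the base TC.
nps = {"EngineA": 1_500_000, "EngineB": 600_000, "EngineC": 3_000_000}
reference = "EngineA"
base_tc = 60.0  # seconds per game for the reference engine

# Scale each engine's time so all of them get roughly the same node budget.
for name, speed in nps.items():
    tc = base_tc * nps[reference] / speed
    print(f"{name}: {tc:.0f}s per game (~{speed * tc / 1e6:.0f}M nodes)")
[/code]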

How nodes are counted at least used to be very unstandardized and differed from engine to engine; I don't know if this has changed much. Winter counts nodes at the top of QSearch and at the top of the AB search, unless the AB search is at depth 0 and drops down into the QSearch. IIRC SF does something similar, and I imagine developers researching this try to do the same thing as SF. I honestly don't recall if my current definition matches the definition from SF, but I do remember at one point looking into the SF code for the purpose of having a similar and hopefully comparable definition.
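For what it's worth, here is a toy sketch of that counting convention (not Winter's or Stockfish's actual code; the evaluation and move generation are placeholders): a node is counted at the top of the main search and at the top of quiescence, except that a depth-0 main-search call drops straight into quiescence and is counted only once there.

[code]
import random

random.seed(1)
nodes = 0

def evaluate(state):
    # Placeholder static evaluation.
    return random.randint(-100, 100)

def children(state):
    # Placeholder move generation with a fixed branching factor of 3.
    return [state * 3 + i for i in range(1, 4)]

def qsearch(state, alpha, beta):
    global nodes
    nodes += 1                                 # count at the top of quiescence
    return evaluate(state)                     # stand-pat only in this toy version

def search(state, depth, alpha, beta):
    global nodes
    if depth == 0:
        return qsearch(state, alpha, beta)     # drop into qsearch; counted once there
    nodes += 1                                 # count at the top of the main search
    best = -10**9
    for child in children(state):
        score = -search(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                              # beta cutoff
    return best

search(1, 3, -10**9, 10**9)
print("nodes:", nodes)
[/code]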

Obviously such a speed-normalized list does not reflect the actual playing strength of the engines; however, it should shift engines to within the same part of the scaling curve. After that you can increase and reduce the time to see how the relative ranking of the engines changes. While it is true that the draw rate should increase as the TC increases, the relative order of the engines should be more stable, assuming equivalent scaling under this format.

Critical for this idea to work is that there cannot be a single normalization parameter; at each TC you would need to benchmark the engines again. SF "do nothing" would end up scaled to 2s per move against regular SF's 1s per move, while with proper per-TC scaling, 1 minute per move would roughly match 1 minute per move of regular SF. Unfortunately speed tends not to be constant over the course of the game, so this is generally nontrivial.
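A quick back-of-the-envelope illustration of that point, using the "wait 1s" fork from earlier (the numbers are purely illustrative): because the fork wastes a fixed second per move, its effective slowdown factor depends on the time control instead of being a constant.

[code]
# The "wait 1s" fork loses a fixed second per move, so its effective
# slowdown factor shrinks as the time per move grows.
for tc in (2.0, 10.0, 60.0):        # seconds per move
    useful = tc - 1.0               # time actually spent searching
    print(f"{tc:.0f}s/move: effective factor {tc / useful:.2f}x")
[/code]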
-Jonathan
voffka
Posts: 288
Joined: Sat Jun 30, 2018 10:58 pm
Location: Ukraine
Full name: Volodymyr Shcherbyna

Re: FGRL rating lists - Winter 0.8, Pirarucu 3.3.5, Marvin 3.6.0

Post by voffka »

Hey Jonathan,
jorose wrote: Thu Jun 11, 2020 9:36 am Thank you for testing Winter at LTC!

It kind of fascinates me that Winter sits between Texel and Igel in the rapid list, which are 129 and 94 Elo above it, respectively, in the bullet list.
To me this is Winter's "magic trick" each time I test Igel vs Winter :) At LTC it is quite strong, while in bullet Winter is weaker. I think this is an overall balance between the search and a heavier eval: at LTC, Winter's eval function probably has many more weights than Igel's and is more precise as well.

Currently Igel has 2600+ weight parameters; I am curious how many Winter has?
Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

Re: FGRL rating lists - Winter 0.8, Pirarucu 3.3.5, Marvin 3.6.0

Post by Alayan »

jorose wrote: Thu Jun 11, 2020 7:23 pm Nice writeup and I like your thought experiment with SF "do nothing". I agree that a large part of Winter's scaling probably comes from bad fast-TC performance and low N/s relative to other classical engines. I do wonder how best to measure scaling.
Nice uplift from 10m+6s to 60m+15s too.

I've read your post with interest. Trying to figure out a measurement method that can rank engines and be safe from tricks like SF "do nothing" is nice. However, I came up with this extreme example to show why looking at how bad an engine can be at a very short TC is not a good way to assess scaling. The more reasonable case of an engine simply being twice as slow is more relevant, as it means that speed optimizations could make an engine appear to "scale worse" under the old naive method.

My thinking is that the best scaling indicator we have is how good an engine can be at a long TC. Being faster will still give some advantage as the rating measurement won't be done at ultra-long TC, but in the real world, any engine user will allot it finite time, so it's a decent compromise I think. But bullet is insufficient for those purposes.

Comparing with pseudo-nodestime settings as you suggest could still be distorted by engines misreporting node counts, à la Rybka or Houdini. So if the main point is to get around deceptive tricks, it doesn't seem worth it compared to my suggested methods. Node-counting standardization issues only compound this; Allie vs Leela comes to mind as an egregious divergence in reported nps.

Once you've derived the curves for the engines you want to rate (you can't know which TC is needed for which elo before the test, so you take several datapoints to then do a reliable enough interpolation), you could still do nps measurements and shift the TC curves to have an equal-nps comparison, or normalize by one of the curve and shifts to have a x1 factor for all at one TC and see if the other engines get relatively slower/faster to reach a target elo as time increase.