Scaling of engines from FGRL rating list

jhellis3 · Post by **jhellis3** » Mon Apr 10, 2017 11:06 pm

A small post explaining the issue with drawing conclusions from the data Mark posted.

The data:

TC = 3' + 2" -40 Elo (5000 games)
TC = 10' + 6" -30 Elo (5000 games)
TC = 30' + 15" -21 Elo (2000 games)
TC = 90' + 30" -10 Elo ( 500 games)

So what we have presented here is an approximate 10 Elo gain for each tripling of time. If we just glance at the data, a clear trend is immediately apparent: the aforementioned 10 Elo gain for 3x time. Or is it a 10 Elo loss for each 1/3rd of time? Do we really know?

In order to be absolutely certain we are witnessing a genuine Elo gain as opposed to Elo compression, the absolute best evidence we could have is a data point where Komodo is clearly stronger than Stockfish. Assuming the naive path continues, the engines should be equal at 270' + 90", and at 810' + 270" Komodo should have a 10 Elo lead.
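The naive extrapolation can be checked with a small least-squares fit of Elo against log3 of the base time. This is only a sketch: it ignores the increment, treats the four results as exact, and the function name `fit_log_linear` is made up for illustration.

```python
import math

# Data from the post: (base minutes, Elo of Komodo relative to Stockfish).
DATA = [(3, -40), (10, -30), (30, -21), (90, -10)]

def fit_log_linear(points):
    """Ordinary least squares of Elo against log3(base time)."""
    xs = [math.log(t, 3) for t, _ in points]
    ys = [elo for _, elo in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # Elo per tripling of time
    intercept = mean_y - slope * mean_x    # Elo at log3(t) = 0, i.e. t = 1 minute
    return slope, intercept

slope, intercept = fit_log_linear(DATA)
# Base time at which the fitted line reaches 0 Elo (engines equal).
crossover_minutes = 3 ** (-intercept / slope)

print(f"Elo per tripling: {slope:.1f}")
print(f"Fitted crossover: about {crossover_minutes:.0f} minutes base time")
```

The fit lands close to the hand estimate of 10 Elo per tripling and a crossover in the neighbourhood of 270 to 300 minutes, but it says nothing about whether the line actually continues beyond 90 minutes.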

Obviously, this presents practical problems, but luckily we have a solution. I believe Mark & Larry have alluded to better scaling for Komodo with core count as well. Thus, assuming this claim to be true, we can up the core count and reduce the time, and Komodo should still come out ahead.

(810' + 270") / 32 ≈ 25.3' + 8.4"

Obviously scaling won't be perfect, so let's round up to 30' + 15" on 32+ cores. With both superior time and thread scaling, that should be more than enough for Komodo to assert its dominance. Now, all we need is someone with such a machine to run a 2000 game match (we will forget that we have seen fishtest results swing after even 15-20k games).
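To put the game counts in perspective, here is a rough sketch of the 95% error margin on a measured Elo difference. The draw rate of 60% is an assumption (the post gives none), games are treated as independent, and `elo_margin_95` is a hypothetical helper, not something from a rating tool.

```python
import math

def score_from_elo(elo):
    """Expected score under the logistic Elo model."""
    return 1 / (1 + 10 ** (-elo / 400))

def elo_from_score(score):
    """Inverse of score_from_elo."""
    return -400 * math.log10(1 / score - 1)

def elo_margin_95(elo, games, draw_rate):
    """Approximate 95% half-width (in Elo) for a match of `games` games."""
    s = score_from_elo(elo)
    win = s - draw_rate / 2
    loss = 1 - win - draw_rate
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = win * (1 - s) ** 2 + draw_rate * (0.5 - s) ** 2 + loss * s ** 2
    half = 1.96 * math.sqrt(var / games)
    return (elo_from_score(s + half) - elo_from_score(s - half)) / 2

DRAW_RATE = 0.6  # assumed, not taken from the thread
for elo, games in [(-40, 5000), (-30, 5000), (-21, 2000), (-10, 500)]:
    margin = elo_margin_95(elo, games, DRAW_RATE)
    print(f"{games:5d} games at {elo:+d} Elo: roughly +/- {margin:.0f} Elo")
```

Under these assumptions the 500-game point carries a margin that is larger than the 10 Elo it reports, while the 5000-game points are several times tighter.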

But we do have one fairly recent data point to look upon, the last TCEC. It was only 20 cores for stage 3 but the time control of 150' + 15" was certainly large enough. And yet....

I would say that in order to make the claim that engine A genuinely out-scales engine B, then you need to be able to show a (reasonable) data point where engine A actually beats engine B.

Why is this necessary? Let's look at the data Mark posted once again. Now, imagine I fiddle with SF's time management a tiny bit to make it slightly sub-optimal. This change will certainly be felt at low time controls, but will essentially disappear as T approaches infinity. At any rate, the upshot is I artificially lower SF's strength by 30 Elo at the shortest time control, but SF's strength at the longest time control is virtually unchanged. Now, SF simply has a 10 Elo advantage forever according to the "scaling" trajectory. Or maybe Komodo is the one with worse time management.
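The thought experiment can be made concrete with two toy models that agree on the four measured points but disagree beyond them. Both are illustrative assumptions, not fits anyone in this thread has proposed: `genuine_scaling` keeps gaining 10 Elo per tripling, while `fading_handicap` gives Komodo a fixed 10 Elo deficit plus a short-time-control handicap that is gone by 90 minutes.

```python
import math

# Measured Elo of Komodo relative to Stockfish, keyed by base minutes.
OBSERVED = {3: -40, 10: -30, 30: -21, 90: -10}

def triplings(minutes):
    """Number of triplings of base time relative to 3 minutes."""
    return math.log(minutes / 3, 3)

def genuine_scaling(minutes):
    """Toy model A: +10 Elo per tripling of time, without limit."""
    return -40 + 10 * triplings(minutes)

def fading_handicap(minutes):
    """Toy model B: constant -10 Elo plus a handicap that fades to zero."""
    handicap = max(0.0, 30 - 10 * triplings(minutes))
    return -10 - handicap

for minutes in (3, 10, 30, 90, 270, 810):
    seen = OBSERVED.get(minutes, "n/a")
    print(f"{minutes:4d}' observed={seen!s:>4} "
          f"A={genuine_scaling(minutes):+6.1f} B={fading_handicap(minutes):+6.1f}")
```

Both models stay within about 2 Elo of every measured point, yet at 810 minutes model A predicts a Komodo lead and model B still predicts a 10 Elo deficit.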

The truth is I don't know. Nobody can know just from that set of data. It could be Elo compression. It could be poor time management by Komodo. It could be that Komodo does indeed scale better with time than Stockfish. It could also be a whole host of other issues from eval, to pruning techniques, to opening selections, to branching factor, etc... I don't know. But I do know that for someone to say they do know means that either they haven't thought it all through, made a mistake, are rather arrogant, or are being purposefully disingenuous. Personally, I like to give people the benefit of the doubt and just assume they made a mistake and/or forgot to take some factors into consideration. I know I do that all the time (usually several times per day).

Dann Corbit · Post by **Dann Corbit** » Mon Apr 10, 2017 11:29 pm

We may not know why, but we still know what.
Even without a detailed examination of the machinery.

cdani · Post by **cdani** » Mon Apr 10, 2017 11:36 pm

Sometimes I test at STC and LTC regardless of the STC results. Sometimes I get surprising results (good at LTC, bad at STC). As I do this quite often, it is reasonable that Andscacs has more patches that scale well than other engines that don't do this.

Does Andscacs scale better at LTC or scale worse at STC?

I accept more patches that are bad at STC, but my STC is around 7-25 seconds, and my LTC is sometimes 30, sometimes 80 seconds, or in between. Since most rating lists, even the STC ones, mostly use games of several minutes, the patches found this way should be good for those rating lists. So relative to those rating lists, Andscacs scales better at LTC, as each change is supposed to increase its strength on them.

Is it possible that Andscacs would be stronger now if I had not accepted such changes? I don't find it very logical to think so, but I cannot rule it out 100%.

mjlef · Post by **mjlef** » Mon Apr 10, 2017 11:37 pm

jhellis3 wrote: A small post explaining the issue with drawing conclusions from the data Mark posted.

The data:

TC = 3' + 2" -40 Elo (5000 games)
TC = 10' + 6" -30 Elo (5000 games)
TC = 30' + 15" -21 Elo (2000 games)
TC = 90' + 30" -10 Elo ( 500 games)

So what we have presented here is an approximate 10 Elo gain for each tripling of time. Now, if we just glance at the data a clear trend is immediately apparent, the aforementioned 10 Elo gain for 3x time, or is it a 10 Elo loss for each 1/3rd of time.... Do we really know?

In order to be absolutely certain we are witnessing a genuine Elo gain as opposed to Elo compression, the absolute best evidence we could have is a data point where Komodo is clearly stronger than Stockfish. Assuming, the naive path continues, 270' + 90" should be equal, and at 810' + 270" Komodo should have a 10 Elo lead.

Obviously, this presents practical problems, but luckily we have a solution. I believe Mark & Larry have alluded to better scaling for Komodo with core count as well. Thus, assuming this claim to be true, we can up the core count and reduce the time, and Komodo should still come out ahead.

(810' + 270") / 32 ≈ 25.3' + 8.4"

Obviously scaling won't be perfect, so let's round up to 30' + 15" on 32+ cores. With both superior time and thread scaling, that should be more than enough for Komodo to assert its dominance. Now, all we need is someone with such a machine to run a 2000 game match (we will forget that we have seen fishtest results swing after even 15-20k games).

But we do have one fairly recent data point to look upon, the last TCEC. It was only 20 cores for stage 3 but the time control of 150' + 15" was certainly large enough. And yet....

I would say that in order to make the claim that engine A genuinely out-scales engine B, then you need to be able to show a (reasonable) data point where engine A actually beats engine B.

Why is this necessary? Let's look at the data Mark posted once again. Now, imagine I fiddle with SF's time management a tiny bit to make it slightly sub-optimal. This change will certainly be felt at low time controls, but will essentially disappear as T approaches infinity. At any rate, the upshot is I artificially lower SF's strength by 30 Elo at the shortest time control, but SF's strength at the longest time control is virtually unchanged. Now, SF simply has a 10 Elo advantage forever according to the "scaling" trajectory. Or maybe Komodo is the one with worse time management.

The truth is I don't know. Nobody can know just from that set of data. It could be Elo compression. It could be poor time management by Komodo. It could be that Komodo does indeed scale better with time than Stockfish. It could also be a whole host of other issues from eval, to pruning techniques, to opening selections, to branching factor, etc... I don't know. But I do know that for someone to say they do know means that either they haven't thought it all through, made a mistake, are rather arrogant, or are being purposefully disingenuous. Personally, I like to give people the benefit of the doubt and just assume they made a mistake and/or forgot to take some factors into consideration. I know I do that all the time (usually several times per day).

You are writing hogwash. The scaling has been clearly shown at reasonable time controls, both against Stockfish and against a lot of other programs. Komodo scales well, with some other programs scaling even better than Komodo.

Your rhetoric is nonsense. In science there is no "absolute certainty", but we can reach a level of reasonable certainty. Not with your 150-300 game runs, but with a lot more. What is this nonsense of making Stockfish weaker? This whole thread was not about which program is strongest at a specific playing time. This is a thread about how Elo changes with time (scaling). You present no data and suggest worthless modifications to programs to ruin what data might be collected. Kai just showed the scaling effect in the time control ranges he presented. It is possible that scaling could change remarkably at a much longer time control. But we have not said it would, and neither has Kai.

You are not taking this seriously, so I will stop taking you seriously too.

jhellis3 · Post by **jhellis3** » Mon Apr 10, 2017 11:42 pm

You are writing hogwash.

Classy.

In science there is no "absolute certainty"

Actually, there is. It is called reality.

You present no data and suggest worthless modifications to programs to ruin what data might be collected.

Aye, kind of like contempt... my bad.

You are not taking this seriously, so I will stop taking you seriously too.

I don't have anything for sale here, and I am not the one making dubious claims. I am perfectly fine letting the public form their own opinions.

As for your behavior towards me, pretty sad. But then, if there is one thing I have learned in life, it is that people will disappoint you. SSDD.

jhellis3 · Post by **jhellis3** » Tue Apr 11, 2017 12:07 am

We may not know why, but we still know what.

But you don't... Is that evidence of Komodo scaling poorly with less time and benefiting from Elo compression, or is it evidence of Komodo scaling well with more time, so that it will eventually surpass SF? AFAICS, there is no rational reason to believe one hypothesis over the other. But maybe I am missing something...

jhellis3 · Post by **jhellis3** » Tue Apr 11, 2017 12:29 am

Is it possible that Andscacs would be stronger now if I had not accepted such changes? I don't find it very logical to think so, but I cannot rule it out 100%.

That is an interesting point, and I don't think anyone would find fault with such methodology. One simply has to make the best decision one can with the resources available. Certainly, the success and Elo gain of the SF framework demonstrate the effectiveness of such an approach.

Does Andscacs scale better at LTC or scale worse at STC?

I have no idea. Maybe a bit of both. But I do think Andscacs is the most interesting engine to study in this regard. And I would guess that if any engine is going to demonstrate significant results compared to the field, it would be Andscacs (at least from what I have seen thus far).

If we look at Kai's post regarding branching factors, we can see that Andscacs has a somewhat larger branching factor than other top engines at lower depths (most similar to Komodo), which then tapers off but doesn't quite catch up to Komodo and SF. I would guess this is what hurts both Komodo and Andscacs at STC, but eventually begins to pay off (or at least becomes insignificant) at LTC. But that is just a guess...

If true though, that would indicate your search is already a bit better than SF's (at medium to LTC), no small accomplishment! And that you could see the biggest/easiest gains by focusing on eval.

mjlef · Post by **mjlef** » Tue Apr 11, 2017 12:57 am

jhellis3 wrote:
You are writing hogwash.
Classy.

In science there is no "absolute certainty"
Actually, there is. It is called reality.

You present no data and suggest worthless modifications to programs to ruin what data might be collected.
Aye, kind of like contempt... my bad.

You are not taking this seriously, so I will stop taking you seriously too.
I don't have anything for sale here, and I am not the one making dubious claims. I am perfectly fine letting the public form their own opinions.

As for your behavior towards me, pretty sad. But then, if there is one thing I have learned in life, it is that people will disappoint you. SSDD.

Science and absolute certainty: https://www.wsj.com/articles/SB10001424 ... 1041127168

Contempt:
Contempt does just what we say it does, and is like what a human player would do against a much weaker (or much stronger) opponent. It improves the program's chances (when set properly) and causes the program to try to avoid draws (or seek them if negative). It is not trying to ruin data. Users are welcome to run tests with Contempt set to any value they want. And the large run of a development Komodo against Stockfish 8 used a Contempt of 0, meaning it has no effect on that data.

Behavior towards you: I have just defended the work of Kai and Larry. This open forum is exactly the place to send out data and let each person form their own opinions. I like backing up my claims with facts.

I think your flippant remarks are not helping you convince people. At least not ones with a strong belief in science. You cannot just make claims and not back them with data if you want to be taken seriously.

mjlef · Post by **mjlef** » Tue Apr 11, 2017 1:03 am

jhellis3 wrote:
We may not know why, but we still know what.
But you don't... Is that evidence of Komodo scaling poorly with less time and benefiting from Elo compression, or is it evidence of Komodo scaling well with more time, so that it will eventually surpass SF? AFAICS, there is no rational reason to believe one hypothesis over the other. But maybe I am missing something...

It is the latter. Kai calculated a measurement of Elo scaling with time in two ways. In the second way, the "Wilos" method, he removed the draws (which are the cause of Elo contraction at longer time controls). He describes it here:

http://www.talkchess.com/forum/viewtopi ... 47&t=63687

Both methods showed the same programs scaling better or worse.
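As a rough illustration of why removing draws matters, the sketch below compares ordinary Elo with a rating computed from decisive games only. It assumes "Wilo" means the logistic rating difference implied by the win/loss ratio alone; the exact definition is in the linked post and may differ. The game counts are invented for the example.

```python
import math

def elo_from_results(wins, draws, losses):
    """Ordinary logistic Elo difference, draws counted as half a point."""
    score = (wins + draws / 2) / (wins + draws + losses)
    return 400 * math.log10(score / (1 - score))

def wilo_from_results(wins, losses):
    """Assumed 'Wilo': the same formula applied to decisive games only."""
    return 400 * math.log10(wins / losses)

# Two hypothetical 1000-game matches with the same 3:2 win/loss ratio
# but different draw rates (50% and 75%).
low_draw = (300, 500, 200)
high_draw = (150, 750, 100)

for wins, draws, losses in (low_draw, high_draw):
    print(f"W/D/L = {wins}/{draws}/{losses}: "
          f"Elo = {elo_from_results(wins, draws, losses):+.1f}, "
          f"Wilo = {wilo_from_results(wins, losses):+.1f}")
```

With the same win/loss ratio, the higher draw rate roughly halves the Elo difference while the draw-free figure does not move, which is the compression effect the two-method comparison is meant to separate out.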

jhellis3 · Post by **jhellis3** » Tue Apr 11, 2017 1:08 am

Science and absolute certainty:

Thanks for the lecture prof Mark, you are such a smart guy. Nothing I like more than being talked down to....

I think your flippant remarks are not helping you convince people.

Like I said earlier (perhaps you are a bit slow on the uptake?), I am not here to convince anybody. I am not here to promote an agenda *cough*. I present my viewpoints, and let other people do with them what they may.

In my view, false belief is its own punishment.
