A small post explaining the issue with drawing conclusions from the data Mark posted.
The data:
TC = 3' + 2" -40 Elo (5000 games)
TC = 10' + 6" -30 Elo (5000 games)
TC = 30' + 15" -21 Elo (2000 games)
TC = 90' + 30" -10 Elo ( 500 games)
So what we have presented here is an approximate 10 Elo gain for each tripling of time. If we just glance at the data, a clear trend is immediately apparent: the aforementioned 10 Elo gain for 3x time. Or is it a 10 Elo loss for each 1/3 of the time? Do we really know?
In order to be absolutely certain we are witnessing a genuine Elo gain as opposed to Elo compression, the absolute best evidence we could have is a data point where Komodo is clearly stronger than Stockfish. Assuming the naive trend continues, the two should be equal at 270' + 90", and at 810' + 270" Komodo should have a 10 Elo lead.
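For what it's worth, the naive extrapolation above can be sketched in a few lines. This is only an illustration of the arithmetic, assuming every step in the table counts as one "tripling" (3' to 10' is slightly more than 3x) and that the trend stays linear; the function names are made up for this sketch.

```python
# Data from the post: (base time in minutes, Komodo Elo relative to Stockfish)
DATA = [(3, -40), (10, -30), (30, -21), (90, -10)]

def naive_gain_per_step(data):
    """Average Elo change between consecutive time controls (each ~3x the last)."""
    first, last = data[0][1], data[-1][1]
    return (last - first) / (len(data) - 1)

def extrapolate(data, extra_steps):
    """Project the last data point forward by extra_steps more triplings."""
    minutes, elo = data[-1]
    step = naive_gain_per_step(data)
    return minutes * 3 ** extra_steps, elo + step * extra_steps

if __name__ == "__main__":
    for k in (1, 2):
        minutes, elo = extrapolate(DATA, k)
        print(f"{minutes}' base: {elo:+.0f} Elo")
```

Under those assumptions the projection reaches parity at a 270' base and a 10 Elo Komodo lead at an 810' base, which is all the "naive path" claims.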
Obviously, this presents practical problems, but luckily we have a solution. I believe Mark & Larry have alluded to better scaling for Komodo with core count as well. Thus, assuming this claim to be true, we can up the core count and reduce the time, and Komodo should still come out ahead.
(810' + 270") / 32 ≈ 25.3' + 8.4"
Obviously scaling won't be perfect, so let's round up to 30' + 15" on 32+ cores. With both superior time and thread scaling, that should be more than enough for Komodo to assert its dominance. Now, all we need is someone with such a machine to run a 2000 game match (we will forget that we have seen fishtest results swing after even 15-20k games).
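To put rough numbers on how much a match of a given length can tell us, here is a minimal sketch of the usual normal-approximation error margin on an Elo estimate. The 60% draw rate is purely an assumption (the posted data gives no draw counts), and `elo_margin_95` is a name invented for this sketch.

```python
import math

def elo_margin_95(games, score=0.5, draw_rate=0.6):
    """Approximate 95% margin (in Elo) for a match of `games` games.

    draw_rate is a hypothetical value; the post does not report one.
    """
    win_rate = score - draw_rate / 2.0
    # Per-game variance of a result scored 1 / 0.5 / 0
    variance = win_rate + draw_rate / 4.0 - score ** 2
    std_err = math.sqrt(variance / games)
    # Slope of the Elo curve at `score` (delta method)
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return 1.96 * slope * std_err

for n in (500, 2000, 5000, 20000):
    print(f"{n:>6} games: +/- {elo_margin_95(n):.1f} Elo")
```

With that assumed draw rate, 500 games gives a margin of roughly 19 Elo either way, and the margin only halves each time the number of games is quadrupled.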
But we do have one fairly recent data point to look upon, the last TCEC. It was only 20 cores for stage 3 but the time control of 150' + 15" was certainly large enough. And yet....
I would say that to claim engine A genuinely out-scales engine B, you need to be able to show a (reasonable) data point where engine A actually beats engine B.
Why is this necessary? Let's look at the data Mark posted once again. Now, imagine I fiddle with SF's time management a tiny bit to make it slightly sub-optimal. This change will certainly be felt at low time controls, but will essentially disappear as T approaches infinity. At any rate, the upshot is I artificially lower SF's strength by 30 Elo at the shortest time control, but SF's strength at the longest time control is virtually unchanged. Now, SF simply has a 10 Elo advantage forever according to the "scaling" trajectory. Or maybe Komodo is the one with worse time management.
The truth is I don't know. Nobody can know just from that set of data. It could be Elo compression. It could be poor time management by Komodo. It could be that Komodo does indeed scale better with time than Stockfish. It could also be a whole host of other issues from eval, to pruning techniques, to opening selections, to branching factor, etc... I don't know. But I do know for someone to say they do know means that either they haven't thought it all through, made a mistake, are rather arrogant, or are being purposefully disingenuous. Personally, I like to give people the benefit of the doubt and just assume they made a mistake and/or forgot to take some factors into consideration. I know I do that all the time (usually several times per day).
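The thought experiment can be sketched numerically. The handicap values below are made up: 30 Elo at the shortest time control, fading to zero at the longest. They are only meant to show that a time-dependent weakness in one engine can erase (or, read the other way, create) an apparent scaling trend.

```python
# Komodo's Elo relative to Stockfish at each time control, from the post
OBSERVED = [-40, -30, -21, -10]

# Hypothetical Stockfish handicap from poor time management: 30 Elo at the
# shortest time control, shrinking to nothing at the longest (made-up values)
SF_HANDICAP = [30, 20, 10, 0]

def with_handicap(observed, handicap):
    """Komodo's relative Elo once Stockfish is weakened by `handicap`."""
    return [o + h for o, h in zip(observed, handicap)]

def gain_per_step(series):
    """Average Elo change per time-control step."""
    return (series[-1] - series[0]) / (len(series) - 1)

adjusted = with_handicap(OBSERVED, SF_HANDICAP)
print("observed:", OBSERVED, "gain/step:", gain_per_step(OBSERVED))
print("adjusted:", adjusted, "gain/step:", gain_per_step(adjusted))
```

With the handicap applied, the series is essentially flat at about -10 Elo, so the same kind of match data would show no scaling difference at all.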
Scaling of engines from FGRL rating list
-
- Posts: 546
- Joined: Sat Aug 17, 2013 12:36 am
-
- Posts: 12542
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Scaling of engines from FGRL rating list.
We may not know why, but we still know what.
Even without a detailed examination of the machinery.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: Scaling of engines from FGRL rating list
Sometimes I test at stc and ltc regardless of the stc results, and sometimes I get surprising results (good at ltc, bad at stc). As I do this quite often, it is reasonable that Andscacs has more patches that scale well than other engines that don't do this.
Does Andscacs scale better at ltc, or worse at stc?
I accept more patches that are bad at stc, but my stc is around 7-25 seconds, and my ltc is sometimes 30 seconds, sometimes 80, or in between. Since most rating lists, even the stc ones, consist mostly of games lasting several minutes, the patches I find are supposed to be good for those rating lists. So relative to those rating lists, Andscacs scales better at ltc, as each change is supposed to increase its strength on them.
Is it possible that if I had not accepted such changes, Andscacs would be stronger now? I do not find that very logical, but I cannot rule it out 100%.
Last edited by cdani on Mon Apr 10, 2017 11:37 pm, edited 1 time in total.
Daniel José - http://www.andscacs.com
-
- Posts: 1494
- Joined: Thu Mar 30, 2006 2:08 pm
Re: Scaling of engines from FGRL rating list.
You are writing hogwash. The scaling has been clearly shown at reasonable time controls, both against Stockfish and against a lot of other programs. Komodo scales well, with some other programs scaling even better than Komodo.
jhellis3 wrote: A small post explaining the issue with drawing conclusions from the data Mark posted.
Your rhetoric is nonsense. In science there is no "absolute certainty", but we can reach a level of reasonable certainty. Not with your 150-300 game runs, but with a lot more. What is this nonsense of making Stockfish weaker? This whole thread was not about which program is strongest at a specific playing time. This is a thread about how Elo changes with time (scaling). You present no data and suggest worthless modifications to programs to ruin what data might be collected. Kai just showed the scaling effect in the time control ranges he presented. It is possible that scaling could change remarkably at a much longer time control. But we have not said it would, and neither has Kai.
You are not taking this seriously, so I will stop taking you seriously too.
-
- Posts: 546
- Joined: Sat Aug 17, 2013 12:36 am
Re: Scaling of engines from FGRL rating list.
Quote: You are writing hogwash.
Classy.
Quote: In science there is no "absolute certainty"
Actually, there is. It is called reality.
Quote: You present no data and suggest worthless modifications to programs to ruin what data might be collected.
Aye, kind of like contempt... my bad.
Quote: You are not taking this seriously, so I will stop taking you seriously too.
I don't have anything for sale here, and I am not the one making dubious claims. I am perfectly fine letting the public form their own opinions.
As for your behavior towards me, pretty sad. But then, if there is one thing I have learned in life, it is that people will disappoint you. SSDD.
-
- Posts: 546
- Joined: Sat Aug 17, 2013 12:36 am
Re: Scaling of engines from FGRL rating list.
Quote: We may not know why, but we still know what.
But you don't... Is that evidence of Komodo scaling poorly with less time and benefiting from Elo compression, or is it evidence of Komodo scaling well with more time, so that it will eventually surpass SF? AFAICS, there is no rational reason to believe one lemma over another. But maybe I am missing something...
-
- Posts: 546
- Joined: Sat Aug 17, 2013 12:36 am
Re: Scaling of engines from FGRL rating list
Quote: Is it possible that if I had not accepted such changes, Andscacs would be stronger now? I do not find that very logical, but I cannot rule it out 100%.
That is an interesting point, and I don't think anyone would find fault with such methodology. One simply has to make the best decision one can with the resources available. Certainly, the success and Elo gain of the SF framework demonstrate the effectiveness of such an approach.
Quote: Does Andscacs scale better at ltc, or worse at stc?
I have no idea. Maybe a bit of both. But I do think Andscacs is the most interesting engine to study in this regard. And I would guess that if any engine is going to demonstrate significant results compared to the field, it would be Andscacs (at least from what I have seen thus far).
If we look at Kai's post regarding branching factors, we can see that Andscacs has a somewhat larger branching factor than other top engines at lower depths, most similar to Komodo's, which then tapers off but doesn't quite catch up to Komodo and SF. I would guess this is what hurts both Komodo and Andscacs at STC, but eventually begins to pay off (or at least becomes insignificant) at LTC. But that is just a guess...
If true though, that would indicate your search is already a bit better than SF's (at medium to LTC), no small accomplishment! And that you could see the biggest/easiest gains by focusing on eval.
-
- Posts: 1494
- Joined: Thu Mar 30, 2006 2:08 pm
Re: Scaling of engines from FGRL rating list.
Science and absolute certainty: https://www.wsj.com/articles/SB10001424 ... 1041127168
Contempt:
Contempt does just what we say it does, and is like what a human player would do against a much weaker (or much stronger) opponent. It improves the program's chances (when set properly) and causes the program to try to avoid draws (or seek them if negative). It is not trying to ruin data. Users are welcome to run tests with Contempt set to any value they want. And the large run of a development Komodo against Stockfish 8 used a Contempt of 0, meaning it has no effect on that data.
Behavior towards you: I have just defended the work of Kai and Larry. This open forum is exactly the way to send out data and let each person form their own opinions. I like backing up my claims with facts.
I think your flippant remarks are not helping you convince people. At least not ones with a strong belief in science. You cannot just make claims and not back them with data if you want to be taken seriously.
-
- Posts: 1494
- Joined: Thu Mar 30, 2006 2:08 pm
Re: Scaling of engines from FGRL rating list.
Quote: We may not know why, but we still know what.
jhellis3 wrote: But you don't... Is that evidence of Komodo scaling poorly with less time and benefiting from Elo compression or is it evidence of Komodo scaling well with more time and eventually it will surpass SF. AFAICS, there is no rational reason to believe one lemma over another. But maybe I am missing something...
It is the latter. Kai calculated a measurement of Elo scaling with time two ways. In the second, the "Wilos" method, he removed the draws (which are the cause of Elo contraction at longer time controls). He describes it here:
http://www.talkchess.com/forum/viewtopi ... 47&t=63687
Both methods showed the same programs scaling better or worse.
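As a rough illustration of why dropping draws matters, here is a small sketch comparing a conventional Elo estimate with a wins-and-losses-only figure. The `wilo` function is only my reading of "removed the draws"; the exact definition is in the linked thread and may differ. The match results are made up.

```python
import math

def elo(wins, draws, losses):
    """Conventional Elo difference from the score, draws counted as half."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))

def wilo(wins, losses):
    """Same formula with draws discarded (assumed reading of "Wilos")."""
    return 400.0 * math.log10(wins / losses)

# Made-up results: same win/loss ratio, more draws at the longer time control
short_tc = (300, 400, 200)   # wins, draws, losses
long_tc = (150, 700, 100)

for name, (w, d, l) in (("short", short_tc), ("long", long_tc)):
    print(f"{name}: Elo {elo(w, d, l):+.1f}, Wilo {wilo(w, l):+.1f}")
```

In this made-up case the Elo figure shrinks at the longer time control purely because of the extra draws, while the draw-free figure stays the same.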
-
- Posts: 546
- Joined: Sat Aug 17, 2013 12:36 am
Re: Scaling of engines from FGRL rating list.
Quote: Science and absolute certainty:
Thanks for the lecture, prof Mark, you are such a smart guy. Nothing I like more than being talked down to....
Quote: I think your flippant remarks are not helping you convince people.
Like I said earlier (perhaps you are a bit slow on the uptake?), I am not here to convince anybody. I am not here to promote an agenda *cough*. I present my viewpoints, and let other people do with them what they may.
In my view, false belief is its own punishment.