I tested Stockfish 160913 at FRC at 40/4 repeating, and got no time losses at all in 1,800 games with the default value of 50.
On the other hand, I'm running Stockfish 091013 + Syzygy 5men bases 4CPU at 5+3, and was getting a few time losses there. I increased to 200 and got none at all after that.
1000 is 1000 ms = 1 second. I guess at fast bullet chess that could disadvantage Stockfish, but at blitz or longer time control I don't see it having any impact.
Next time you run Stockfish, try 200. It will probably be enough.
Scaling Study
Moderator: Ras
-
lkaufman
- Posts: 6284
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Scaling Study
I note that there is also an Emergency Move Time, which I suppose is a per-move buffer in milliseconds? Since I notice that the time forfeit problem seems to be related to using non-trivial increments, might it be better to increase that number rather than the Base Time number? Maybe it doesn't matter appreciably.
-
Modern Times
- Posts: 3806
- Joined: Thu Jun 07, 2012 11:02 pm
Re: Scaling Study
I haven't tried altering Emergency Move Time. But it is Emergency Base Time that the Stockfish team suggested be increased to avoid the issue.
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Scaling Study
Wouldn't adding time cause more forfeits?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
lkaufman
- Posts: 6284
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Scaling Study
I assume it's a buffer to be subtracted from the real time for safety, just like our "move overhead milliseconds" but it's for the whole game, not per move. So far there have been no Stockfish time forfeits using the suggested value of 200.
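A rough sketch of how a whole-game buffer and a per-move buffer could enter a time budget. This is not Stockfish's actual time manager; the function name, parameter names and the even-split allocation are assumptions made only for illustration.

```python
def move_budget_ms(remaining_ms, increment_ms, moves_to_go,
                   base_buffer_ms=200, move_buffer_ms=0):
    """Hypothetical per-move time budget in milliseconds.

    base_buffer_ms is reserved once for the whole game,
    move_buffer_ms is reserved once per remaining move.
    """
    moves_to_go = max(1, moves_to_go)
    # Time expected over the horizon, counting increments still to come.
    total = remaining_ms + increment_ms * (moves_to_go - 1)
    reserve = base_buffer_ms + move_buffer_ms * moves_to_go
    share = max(0.0, (total - reserve) / moves_to_go)
    # The increment arrives only after the move is made, so never plan
    # to spend more than what is on the clock now minus the base buffer.
    hard_cap = max(0.0, remaining_ms - base_buffer_ms)
    return min(share, hard_cap)
```

With plenty of time on the clock the buffer barely changes the budget. It only bites when the clock is nearly empty, where a larger buffer makes the engine move at once instead of risking a forfeit. That fits the guess that subtracting more, not adding, is what prevents the time losses.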
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Scaling Study
Emergency Base Time is about as obfuscated as it gets.
-
Modern Times
- Posts: 3806
- Joined: Thu Jun 07, 2012 11:02 pm
Re: Scaling Study
That seems to be what it is, yes.
-
beram
- Posts: 1187
- Joined: Wed Jan 06, 2010 3:11 pm
Re: Scaling Study
Modern Times wrote: Most rating lists don't have the large numbers of games that Larry has above.
I'm also confident in Larry's results now that he has HT off on his machines.
Those results are certainly very interesting, but it is just one opponent (Komodo vs Houdini), so it doesn't tell us much. If he repeated those same tests with Komodo vs Stockfish, that would be fascinating, I think.
Ray,
It is true that these lists do not have very large numbers of games.
But one list that does is the LS list, which has 547-453 after 1000 games at bullet speed, i.e. 54.7% for Houdini against Komodo 6.
The latest CCRL 40/40 4cpu list (26 Oct.) has 124 games between them, with a score of 54% for Houdini 3 4cpu (on 1 cpu it is 54.7%, from just 32 games).
If Larry's doubling theory of an 8 Elo increase were true, then this would be a very strange anomaly in the results between these opponents.
As far as we can observe from these lists at the moment, it just makes no sense.
Besides that, you are right that more tests against other opponents are needed.
-
Modern Times
- Posts: 3806
- Joined: Thu Jun 07, 2012 11:02 pm
Re: Scaling Study
Indeed, the LS list is the only one where the statistical margin of error is consistently low.
I think you know I ran 500 games between this pair, 5'+3" 4CPU, and got 273.5 - 226.5 (+148 =251 -101) in Houdini's favour, which coincidentally is also 54.7%. So that is four times the number of CPUs and six times longer thinking time, for exactly the same result. Hardware was different of course (all my chess machines are AMD), so that certainly might affect things.
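For anyone who wants to check the arithmetic, here is a small sketch that turns a win/draw/loss record into a score, a standard error on that score, and an Elo difference under the usual logistic model. The function name is made up for this example, and it treats the games as independent, which is an assumption.

```python
import math

def score_stats(wins, draws, losses):
    """Return (score, standard_error, elo_diff) for a W/D/L record."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game scores 1, 0.5 or 0.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    std_err = math.sqrt(var / n)
    # Logistic Elo model: score = 1 / (1 + 10 ** (-elo / 400)).
    elo_diff = -400.0 * math.log10(1.0 / score - 1.0)
    return score, std_err, elo_diff

score, std_err, elo_diff = score_stats(148, 251, 101)
print(f"score={score:.3f}  std_err={std_err:.4f}  elo={elo_diff:.1f}")
```

For the +148 =251 -101 record this gives a score of 0.547, which corresponds to roughly 33 Elo. The standard error shows how wide the uncertainty on a 500-game match still is.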
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Scaling Study
A huge problem with the LS list is that multiple dev versions of Stockfish are included. That's not very scientific because it leads to statistical anomalies.
The reason is that if several versions of any given program are in the same list, then it's not likely to be the best one that is on top; instead it will be the luckiest one. If you submit 3 or 4 more, one of them will likely end up ahead of Houdini even if there is no improvement at all, simply because of statistical anomaly. It's sort of like flipping a coin and, if you lose, saying, "let's try again." Sooner or later you will win the coin flip.
There should be a rule that you are not allowed to have more than 1 version every 4 months or something like that, and that if you do, all the others should not be reported. It's just not right to do this. I see this in other lists where they rate the same program twice, sometimes using different "modes" such as "Houdini tactical" along with Houdini normal. Because Houdini tactical is significantly weaker it's probably no big deal - but it does represent 2 opportunities to be on top, which only certain programs get. If I could put 10 versions of Komodo on the lists (and they all had very minor changes), people would pick the one on top and ignore the rest, and I would get the benefit of the sampling noise which other programs don't get.
Playing a huge number of games helps mitigate this problem, but the error margins are still given as 5 Elo, and the "true" error margin is higher since this is not a controlled study with a fixed and stated number of games to be played.
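The "luckiest version ends up on top" effect is easy to see in a toy simulation. The sketch below assumes several versions of identical true strength, a rival that is truly a few Elo stronger, and independent Gaussian noise on every measured rating. All names and numbers here are illustrative assumptions, not taken from any real rating list.

```python
import random

def prob_some_version_on_top(n_versions, gap_elo=5.0, sigma_elo=5.0,
                             trials=20000, seed=1):
    """Estimate how often at least one of n equal versions is rated above the rival.

    The rival is truly gap_elo stronger. Every measured rating carries
    Gaussian noise with standard deviation sigma_elo.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        rival = gap_elo + rng.gauss(0.0, sigma_elo)
        best = max(rng.gauss(0.0, sigma_elo) for _ in range(n_versions))
        if best > rival:
            hits += 1
    return hits / trials

for k in (1, 4, 10):
    print(k, prob_some_version_on_top(k))
```

Under these assumptions a single version is rated above the stronger rival only a minority of the time, and the chance climbs steadily as more near-identical versions are entered. That is the multiple-coin-flip effect described above.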
The Ipon test was the only test I believed to be run scientifically and correctly, but it suffered from a very low sample size. Even though that was eventually improved upon, it was still a problem, but at least it did not give some programs multiple chances to defeat the error margins.