Scaling Study

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

Modern Times
Posts: 3806
Joined: Thu Jun 07, 2012 11:02 pm

Re: Scaling Study

Post by Modern Times »

I tested Stockfish 160913 at FRC at 40/4 repeating, and got no time losses at all in 1,800 games with the default value of 50.

On the other hand, I'm running Stockfish 091013 + Syzygy 5men bases 4CPU at 5+3, and was getting a few time losses there. I increased to 200 and got none at all after that.

1000 is 1000 ms = 1 second. I guess at fast bullet chess that could disadvantage Stockfish, but at blitz or longer time control I don't see it having any impact.

Next time you run Stockfish, try 200. It will probably be enough.
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Scaling Study

Post by lkaufman »

Modern Times wrote:I tested Stockfish 160913 at FRC at 40/4 repeating, and got no time losses at all in 1,800 games with the default value of 50.

On the other hand, I'm running Stockfish 091013 + Syzygy 5men bases 4CPU at 5+3, and was getting a few time losses there. I increased to 200 and got none at all after that.

1000 is 1000 ms = 1 second. I guess at fast bullet chess that could disadvantage Stockfish, but at blitz or longer time control I don't see it having any impact.

Next time you run Stockfish, try 200. It will probably be enough.
I note that there is also an Emergency Move Time, which I suppose is a per-move buffer in milliseconds? Since I notice that the time forfeit problem seems to be related to using non-trivial increments, might it be better to increase that number rather than the Base Time number? Maybe it doesn't matter appreciably.
Modern Times
Posts: 3806
Joined: Thu Jun 07, 2012 11:02 pm

Re: Scaling Study

Post by Modern Times »

I haven't tried altering Emergency Move Time. But it is Emergency Base Time that the Stockfish team suggested be increased to avoid the issue.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scaling Study

Post by Don »

Modern Times wrote:I haven't tried altering Emergency Move Time. But it is Emergency Base Time that the Stockfish team suggested be increased to avoid the issue.
Wouldn't adding time cause more forfeits?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Scaling Study

Post by lkaufman »

Don wrote:
Modern Times wrote:I haven't tried altering Emergency Move Time. But it is Emergency Base Time that the Stockfish team suggested be increased to avoid the issue.
Wouldn't adding time cause more forfeits?
I assume it's a buffer to be subtracted from the real time for safety, just like our "move overhead milliseconds" but it's for the whole game, not per move. So far there have been no Stockfish time forfeits using the suggested value of 200.
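To make the assumed distinction concrete, here is a minimal sketch of a whole-game buffer versus a per-move buffer. It is not Stockfish's or Komodo's actual time management: the function name, the parameter names and the naive "split the clock evenly over the remaining moves" rule are all hypothetical, chosen only to show that a larger buffer means less time spent per move, not more.

```python
def move_budget_ms(remaining_ms, increment_ms, moves_to_go=30,
                   base_buffer_ms=200, per_move_buffer_ms=0):
    """Hypothetical time budget for one move, in milliseconds.

    base_buffer_ms     -- reserve held back once from the whole clock
                          (the assumed meaning of "Emergency Base Time")
    per_move_buffer_ms -- overhead deducted from every move's budget
                          (the assumed meaning of a per-move setting)
    """
    # Clock time we allow ourselves to use after holding back the reserve.
    usable = max(remaining_ms - base_buffer_ms, 0)
    # Naive even split over the expected remaining moves, plus the increment,
    # minus the per-move overhead.
    budget = usable / moves_to_go + increment_ms - per_move_buffer_ms
    # Never plan to spend more than the usable clock, and never a negative time.
    return max(min(budget, usable), 0)


if __name__ == "__main__":
    # 5 minutes left with a 3 second increment, buffer of 50 ms vs 200 ms.
    print(move_budget_ms(300_000, 3_000, base_buffer_ms=50))
    print(move_budget_ms(300_000, 3_000, base_buffer_ms=200))
```

Under these assumptions, raising the whole-game buffer from 50 to 200 costs only a few milliseconds per move at 5+3, which fits the observation that it should not matter at blitz or longer time controls.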
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scaling Study

Post by Don »

lkaufman wrote:
Don wrote:
Modern Times wrote:I haven't tried altering Emergency Move Time. But it is Emergency Base Time that the Stockfish team suggested be increased to avoid the issue.
Wouldn't adding time cause more forfeits?
I assume it's a buffer to be subtracted from the real time for safety, just like our "move overhead milliseconds" but it's for the whole game, not per move. So far there have been no Stockfish time forfeits using the suggested value of 200.
Emergency Base Time is about as obfuscated as it gets.
Modern Times
Posts: 3806
Joined: Thu Jun 07, 2012 11:02 pm

Re: Scaling Study

Post by Modern Times »

lkaufman wrote: I assume it's a buffer to be subtracted from the real time for safety, just like our "move overhead milliseconds" but it's for the whole game, not per move.
That seems to be what it is, yes.
beram
Posts: 1187
Joined: Wed Jan 06, 2010 3:11 pm

Re: Scaling Study

Post by beram »

Modern Times wrote:Most rating lists don't have the large numbers of games that Larry has above.
I'm also confident in Larry's results now that he has HT off on his machines.

Those results are certainly very interesting, but it is just one opponent (Komodo vs Houdini) so it doesn't tell us much. If he repeated those same tests Komodo vs Stockfish, that would be fascinating I think.

Ray,
It is true that these lists do not have very large numbers of games.
But one list that has such numbers is the LS list, which has 547-453 after 1000 games at bullet speed, i.e. 54.7% for Houdini against Komodo 6.
The latest CCRL 40/40 4CPU list (26 Oct.) has 124 games between them, with a score of 54% for Houdini 3 4CPU (on 1 CPU, 54.7% from just 32 games).
If Larry's doubling theory of an 8 Elo increase were true, then this would be a very strange anomaly in the results between these opponents.
As far as we can observe from these lists at the moment, it just makes no sense.

Besides that, you are right that more tests against other opponents are needed.
Modern Times
Posts: 3806
Joined: Thu Jun 07, 2012 11:02 pm

Re: Scaling Study

Post by Modern Times »

beram wrote: But one list that has such numbers is the LS list, which has 547-453 after 1000 games at bullet speed, i.e. 54.7% for Houdini against Komodo 6.
Indeed, the LS list is the only one where the statistical margin of error is consistently low.

I think you know I ran 500 games between this pair, 5'+3" 4CPU, and got 273.5 - 226.5 (+148 =251 -101) in Houdini's favour, which is also 54.7% coincidentally. So that is four times the number of CPUs and six times longer thinking time, for exactly the same result. Hardware was different of course, all my chess machines are AMD, so that certainly might affect things.
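For anyone who wants to check the arithmetic, here is a small sketch that turns a win/draw/loss record into a score fraction, an Elo difference under the standard logistic model, and a standard error for the score. The function names are mine, and the error estimate treats games as independent, which is an assumption.

```python
import math


def score_fraction(wins, draws, losses):
    """Points scored divided by games played (a draw counts as half a point)."""
    return (wins + 0.5 * draws) / (wins + draws + losses)


def elo_diff(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return -400 * math.log10(1 / score - 1)


def score_std_error(wins, draws, losses):
    """Standard error of the score fraction, assuming independent games."""
    n = wins + draws + losses
    s = score_fraction(wins, draws, losses)
    # Per-game variance from the three possible outcomes 1, 0.5 and 0.
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    return math.sqrt(var / n)


if __name__ == "__main__":
    w, d, l = 148, 251, 101  # the 500-game result quoted above
    s = score_fraction(w, d, l)
    print(f"score {s:.3f}, Elo diff {elo_diff(s):.1f}, "
          f"std error {score_std_error(w, d, l):.4f}")
```

For +148 =251 -101 this gives a score of 0.547, roughly 33 Elo, with a standard error of about 1.6 percentage points on the score.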
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Scaling Study

Post by Don »

beram wrote:
Modern Times wrote:Most rating lists don't have the large numbers of games that Larry has above.
I'm also confident in Larry's results now that he has HT off on his machines.

Those results are certainly very interesting, but it is just one opponent (Komodo vs Houdini) so it doesn't tell us much. If he repeated those same tests Komodo vs Stockfish, that would be fascinating I think.

Ray,
It is true that these lists do not have very large numbers of games.
But one list that has such numbers is the LS list, which has 547-453 after 1000 games at bullet speed, i.e. 54.7% for Houdini against Komodo 6.
The latest CCRL 40/40 4CPU list (26 Oct.) has 124 games between them, with a score of 54% for Houdini 3 4CPU (on 1 CPU, 54.7% from just 32 games).
If Larry's doubling theory of an 8 Elo increase were true, then this would be a very strange anomaly in the results between these opponents.
As far as we can observe from these lists at the moment, it just makes no sense.

Besides that, you are right that more tests against other opponents are needed.
A huge problem with the LS list is that multiple dev versions of Stockfish are included. That's not very scientific because it leads to statistical anomalies.

The reason is that if several versions of any given program are in the same list, then it's not likely to be the best one that is on top; instead it will be the luckiest one. If you submit 3 or 4 more, one of them will likely end up ahead of Houdini even if there is no improvement at all, simply because of statistical noise. It's sort of like flipping a coin and, if you lose, saying, "let's try again." Sooner or later you will win the coin flip.
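The effect is easy to reproduce with a toy simulation. The sketch below is only an illustration under simplifying assumptions: every "version" has exactly the same true score against a fixed opponent, games are independent, and draws are ignored (which overstates the noise compared with real engine matches). The function names and the numbers used are hypothetical.

```python
import random


def measured_score(true_score, games, rng):
    """Observed score over `games` win/loss games at a fixed true score."""
    wins = sum(rng.random() < true_score for _ in range(games))
    return wins / games


def average_top_score(versions, true_score, games, trials, seed=1):
    """Average, over many simulated lists, of the best observed score
    among `versions` entries that are all equally strong."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(measured_score(true_score, games, rng)
                     for _ in range(versions))
    return total / trials


if __name__ == "__main__":
    # One entry versus four identical entries, 1000 games each, true score 50%.
    print(average_top_score(1, 0.5, 1000, 200))
    print(average_top_score(4, 0.5, 1000, 200))
```

With a single entry the average observed score stays close to the true 50%, while the best of four identical entries averages noticeably higher, even though nothing has improved.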

There should be a rule that you are not allowed to have more than 1 version every 4 months or something like that, and that if you do, all the others should not be reported. It's just not right to do this. I see this in other lists where they rate the same program twice, sometimes using different "modes" such as "Houdini tactical" along with Houdini normal. Because Houdini tactical is significantly weaker it's probably no big deal - but it does represent 2 opportunities to be on top which only certain programs get. If I could put 10 versions of Komodo on the lists (and they all had very minor changes), people would pick the one on top and ignore the rest, and I would get the benefit of the sampling noise which other programs don't get.

Playing a huge number of games helps mitigate this problem, but the error margins are still given as 5 ELO and the "true" error margin is higher since this is not a controlled study with a fixed and stated number of games to be played.

The Ipon test was the only test I believed to be run scientifically and correctly, but it suffered from a very low sample size. Even though that was eventually improved upon, it was still a problem, but at least it did not give some programs multiple chances to beat the error margins.