Stockfish testing at STC and LTC: one question

Jouni · Post by **Jouni** » Tue Sep 19, 2017 8:53 am

It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3,6 level we probably gain NOTHING? And maybe even a regression at tournament level!

cdani · Post by **cdani** » Tue Sep 19, 2017 5:39 pm

Who knows; without more games it is impossible to tell. When I want to be more sure, I just extend the test. Flexibility in testing methods is... well, necessary.

jhellis3 · Post by **jhellis3** » Tue Sep 19, 2017 5:40 pm

Yes, this is why Stockfish has lost hundreds of Elo over the last couple of years and has had miserable results at TCEC.

cdani · Post by **cdani** » Tue Sep 19, 2017 5:41 pm

jhellis3 wrote:Yes, this is why Stockfish has lost hundreds of Elo over the last couple of years and has had miserable results at TCEC.

Not at all, of course, but always something more can be done.

Houdini · Post by **Houdini** » Tue Sep 19, 2017 6:17 pm

Jouni wrote:It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3,6 level we probably gain NOTHING? And maybe even a regression at tournament level!

The amount of noise in engine testing is such that it's nearly impossible to extrapolate the results to longer TC.
The error margins are very big compared to the difference in results.
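
To put a number on "very big", here is a minimal sketch of the usual error-margin calculation for a match result. The function names and the win/draw/loss counts are hypothetical, chosen only for illustration; the conversion assumes the standard logistic Elo model and a normal approximation for the mean score.

```python
import math
from typing import Tuple


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a mean score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96) -> Tuple[float, float]:
    """Return (Elo estimate, approximate half-width of the z-sigma interval)."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the score for a win/draw/loss outcome.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / games
    std_error = math.sqrt(variance / games)
    low = elo_from_score(score - z * std_error)
    high = elo_from_score(score + z * std_error)
    return elo_from_score(score), (high - low) / 2.0


if __name__ == "__main__":
    # Hypothetical 20,000-game match with a 60% draw rate.
    elo, margin = elo_with_margin(wins=4100, draws=12000, losses=3900)
    print(f"Elo {elo:+.2f} +/- {margin:.2f}")
```

With these made-up counts the margin comes out at about ±3 Elo, the same size as the measured difference, so even 20,000 games cannot separate a +4 patch from a +2 patch.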

Evert · Post by **Evert** » Tue Sep 19, 2017 6:20 pm

How do you get those Elo estimates? Elo estimates based on the SPRT test runs are not reliable. All I'm seeing from the numbers you quote is an increased draw rate with longer time control, which I think is expected.

mjlef · Post by **mjlef** » Thu Sep 21, 2017 3:19 am

Jouni wrote:It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3,6 level we probably gain NOTHING? And maybe even a regression at tournament level!

Take a look at draw percentages. At longer time controls they increase a lot. For example, in CCRL a typical draw percentage for the stronger programs is 40% at 40/4, but at 40/40 it is around 60%. You just get more draws as programs search deeper and play better. So a contraction from 4 Elo to 2 Elo at a longer time control is quite normal.

It is very hard to get enough data at very long time controls to prove whether a change is good or not. Anything more than a few seconds per move just takes too long. I really appreciate the in-between lists like IPON's 5'+3". It is a reasonable attempt to get enough games to say something with reasonable error margins. Perhaps as computers get cheaper on the cloud we can test at a much longer time control.
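
The draw-rate argument above can be sketched numerically. This is an illustration under one stated assumption that is not from the thread: the win:loss ratio among decisive games stays the same while the draw rate rises from 40% to 60%. The function names and the 1.04 ratio are hypothetical.

```python
import math


def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a mean score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_at_draw_rate(win_loss_ratio: float, draw_rate: float) -> float:
    """Elo implied by a fixed win:loss ratio once a fraction of games is drawn."""
    decisive = 1.0 - draw_rate
    win = decisive * win_loss_ratio / (1.0 + win_loss_ratio)
    score = win + 0.5 * draw_rate
    return elo_from_score(score)


if __name__ == "__main__":
    ratio = 1.04  # hypothetical: 104 wins for every 100 losses
    for draw_rate in (0.40, 0.60):
        print(f"draw rate {draw_rate:.0%}: {elo_at_draw_rate(ratio, draw_rate):+.2f} Elo")
```

Under that assumption the measured Elo shrinks roughly in proportion to the share of decisive games, here to about two thirds of its value. That is most of the way from +4 to +2 without the patch itself becoming any worse.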

Uri Blass · Post by **Uri Blass** » Fri Sep 22, 2017 4:33 am

Jouni wrote:It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3,6 level we probably gain NOTHING? And maybe even a regression at tournament level!

It is usual that the Stockfish team has no interest in how much Elo they gain; otherwise they could do better by using a fixed number of games.

We know almost nothing about the Elo improvement of a patch from the results of SPRT.

A performance of +4 Elo when a patch passes SPRT at STC means nothing, because if you test the same patch many times it may fail SPRT in some of the runs and show 0 or 1 Elo there, so the average result is clearly less than 4 Elo.
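
The selection effect described above can be seen in a small simulation. This is a simplified sketch, not fishtest's actual implementation: the SPRT bounds of 0 and 5 Elo, the 5% error rates, the 60% draw rate, the true value of +1 Elo, and all function names are assumptions made for illustration. Game results are drawn in batches from a normal approximation, with a fixed per-game variance, rather than as individual wins, draws, and losses.

```python
import math
import random


def score_from_elo(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def elo_from_score(score):
    """Inverse of score_from_elo."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def run_sprt(true_elo, rng, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05,
             draw_rate=0.6, batch=100, max_games=400_000):
    """Simulate one SPRT run; return (passed, Elo estimate at stopping time)."""
    s0, s1 = score_from_elo(elo0), score_from_elo(elo1)
    mu = score_from_elo(true_elo)
    var = (1.0 - draw_rate) / 4.0  # per-game score variance near a 50% score
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    total, games = 0.0, 0
    while games < max_games:
        # Sum of `batch` game scores, drawn from a normal approximation.
        total += rng.gauss(batch * mu, math.sqrt(batch * var))
        games += batch
        # Log-likelihood ratio of H1 against H0 for a normal mean, known variance.
        llr = (s1 - s0) / var * (total - games * (s0 + s1) / 2.0)
        if llr >= upper:
            return True, elo_from_score(total / games)
        if llr <= lower:
            return False, elo_from_score(total / games)
    return False, elo_from_score(total / games)


def simulate(true_elo=1.0, trials=400, seed=1):
    """Return (pass rate, mean Elo estimate over the passing runs only)."""
    rng = random.Random(seed)
    results = [run_sprt(true_elo, rng) for _ in range(trials)]
    passed = [elo for ok, elo in results if ok]
    mean_passed = sum(passed) / len(passed) if passed else float("nan")
    return len(passed) / trials, mean_passed


if __name__ == "__main__":
    rate, mean_elo = simulate()
    print(f"pass rate {rate:.1%}, mean Elo estimate of passing runs {mean_elo:+.2f}")
```

In this model a run can only pass once its observed score is above the midpoint of the two hypotheses, so every passing run reports more than about 2.5 Elo, even though the patch is worth +1 by construction. The Elo shown by a passed SPRT is therefore biased upward.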

jdart · Post by **jdart** » Sun Sep 24, 2017 8:42 pm

I have observed myself that STC results appear to be a lot noisier than LTC results. So a positive STC result is a bad predictor of what LTC or real tournament results will be. This is a bit surprising because for years, starting with Rybka, engines were using hyper-bullet games for testing. There is some validity to that method because many got a good ELO gain from it. But it is not the best or most reliable method. It is a way to short cut testing at real time controls, which would require a huge number of processor cores to perform in a reasonable time period.

--Jon

Dann Corbit · Post by **Dann Corbit** » Mon Sep 25, 2017 8:41 pm

jdart wrote:I have observed myself that STC results appear to be a lot noisier than LTC results. So a positive STC result is a bad predictor of what LTC or real tournament results will be. This is a bit surprising because for years, starting with Rybka, engines were using hyper-bullet games for testing. There is some validity to that method because many got a good ELO gain from it. But it is not the best or most reliable method. It is a way to short cut testing at real time controls, which would require a huge number of processor cores to perform in a reasonable time period.

--Jon

When you test with a certain set of conditions, the results are totally valid for exactly those conditions. The results may or may not translate to another set of conditions.

Generally speaking, things that work well at ultra-high speed will work well at other speeds too. That is why the model tends to work and Stockfish is an extremely strong engine.

On the other hand, they are tuning SF for high-speed blitz games, so that is what they will achieve.
But I think every other engine is doing the same thing, so it really won't make any difference anyway.
Besides which, nobody has the resources to test at tournament time control.
