Stockfish testing at STC and LTC: one question

Uri Blass · Post by **Uri Blass** » Fri Sep 22, 2017 4:33 am

Jouni wrote: It's usual that an excellent patch gives +4 Elo at STC but +2 at LTC. So does this indicate that at the 360+3,6 level we probably get NOTHING? And maybe a regression at tournament level!

It is usual that the Stockfish team has no interest in how much Elo a patch gains; otherwise they would do better to test with a fixed number of games.

We know almost nothing about the Elo improvement of a patch from SPRT results.

A performance of +4 Elo when a patch passes SPRT at STC means nothing, because if you test the same patch many times it may fail SPRT in some of the runs and show 0 or 1 Elo there, so the average result is clearly less than 4 Elo.
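To make the point above concrete, here is a minimal simulation sketch, not the actual fishtest code. It assumes a simplified win/loss model with no draws, SPRT bounds of [0, 5] Elo with alpha = beta = 0.05, and a patch whose true gain is 1 Elo; all of these numbers and the function names are assumptions chosen for illustration. It repeats the same test many times and looks only at the runs that pass.

```python
import math
import random


def elo_to_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def score_to_elo(score):
    """Inverse of elo_to_score; score must be strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))


def run_sprt(true_elo, elo0, elo1, alpha, beta, rng, max_games=1_000_000):
    """Play simulated win/loss games until the SPRT stops.

    Returns (passed, wins, games). Draws are ignored, which is a
    simplifying assumption; real engine tests contain many draws.
    """
    p_true = elo_to_score(true_elo)
    p0, p1 = elo_to_score(elo0), elo_to_score(elo1)
    win_step = math.log(p1 / p0)                    # LLR change after a win
    loss_step = math.log((1.0 - p1) / (1.0 - p0))   # LLR change after a loss
    lower = math.log(beta / (1.0 - alpha))          # accept H0 (fail)
    upper = math.log((1.0 - beta) / alpha)          # accept H1 (pass)
    llr, wins, games = 0.0, 0, 0
    while lower < llr < upper and games < max_games:
        games += 1
        if rng.random() < p_true:
            wins += 1
            llr += win_step
        else:
            llr += loss_step
    return llr >= upper, wins, games


# Hypothetical setup: the same 1 Elo patch is submitted 100 times.
TRUE_ELO = 1.0
rng = random.Random(2017)
runs = [run_sprt(TRUE_ELO, 0.0, 5.0, 0.05, 0.05, rng) for _ in range(100)]

# Observed Elo of the runs that passed, i.e. what a reader of the
# test page would see for an accepted patch.
passed_elos = [score_to_elo(w / n) for passed, w, n in runs if passed]

print(f"runs passed: {len(passed_elos)} of {len(runs)}")
if passed_elos:
    mean_elo = sum(passed_elos) / len(passed_elos)
    print(f"true Elo: {TRUE_ELO}, mean observed Elo of passing runs: {mean_elo:.2f}")
```

Under these assumptions a run can only cross the upper threshold while its observed score sits above roughly the midpoint of the two hypotheses, so every passing run reports more Elo than the patch is really worth. The failed runs of the same patch are the ones that would pull the average back down.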

jdart · Post by **jdart** » Sun Sep 24, 2017 8:42 pm

I have observed myself that STC results appear to be a lot noisier than LTC results. So a positive STC result is a bad predictor of what LTC or real tournament results will be. This is a bit surprising because for years, starting with Rybka, engines were using hyper-bullet games for testing. There is some validity to that method because many got a good ELO gain from it. But it is not the best or most reliable method. It is a way to short cut testing at real time controls, which would require a huge number of processor cores to perform in a reasonable time period.

--Jon

Dann Corbit · Post by **Dann Corbit** » Mon Sep 25, 2017 8:41 pm

jdart wrote:I have observed myself that STC results appear to be a lot noisier than LTC results. So a positive STC result is a bad predictor of what LTC or real tournament results will be. This is a bit surprising because for years, starting with Rybka, engines were using hyper-bullet games for testing. There is some validity to that method because many got a good ELO gain from it. But it is not the best or most reliable method. It is a way to short cut testing at real time controls, which would require a huge number of processor cores to perform in a reasonable time period.

--Jon

When you test with a certain set of conditions, the results are totally valid for exactly those conditions. The results may or may not translate to another set of conditions.

Generally speaking, things that work well at ultra high speed will work well at other speeds too. That is why the model tends to work and Stockfish is an extremely strong engine.

On the other hand, they are tuning SF for high speed blitz games so they will achieve that.
But I think every other engine is doing the same thing, so it really won't make any difference anyway.
Besides which, nobody has the resources to test at tournament time control.

jdart · Post by **jdart** » Mon Sep 25, 2017 9:12 pm

Dann Corbit wrote: Besides which, nobody has the resources to test at tournament time control.

That is true now, but I am not sure it is going to be always true.

If you had 2000 cores available, you could run a match of 50 games on each core and get 100,000 games. 50 games blitz (say 5+3) wouldn't take more than a few hours. 50 games rapid might take a day or so. And a lot of tests wouldn't take 100k games to show a significant result.
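The arithmetic in that paragraph can be sketched as below. This is only a back-of-the-envelope estimate: it assumes each core plays one game at a time, and the average game durations (5, 10 and 30 minutes) are hypothetical placeholders, not measurements. The function name is made up for this example.

```python
def estimate_test_run(cores, games_per_core, avg_game_minutes):
    """Return (total_games, wall_clock_hours) for a distributed match.

    Assumes every core plays its games one after another, so the
    wall-clock time depends only on games_per_core and the average
    game duration, not on the number of cores.
    """
    total_games = cores * games_per_core
    wall_clock_hours = games_per_core * avg_game_minutes / 60.0
    return total_games, wall_clock_hours


CORES = 2000
GAMES_PER_CORE = 50

# Hypothetical average game lengths in minutes; replace with measured values.
estimates = {}
for avg_minutes in (5.0, 10.0, 30.0):
    total_games, hours = estimate_test_run(CORES, GAMES_PER_CORE, avg_minutes)
    estimates[avg_minutes] = hours
    print(f"{avg_minutes:>5.1f} min/game: {total_games} games in about {hours:.1f} hours")
```

The total is 100,000 games in every case. Only the wall-clock time changes with the time control, which is why the estimate is so sensitive to the assumed average game length.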

Stockfish has a few hundred cores in its testing network now. And the core counts on processors just keep going up.

--Jon

Uri Blass · Post by **Uri Blass** » Mon Sep 25, 2017 11:05 pm

jdart wrote:
Besides which, nobody has the resources to test at tournament time control.
That is true now, but I am not sure it is going to be always true.

If you had 2000 cores available, you could run a match of 50 games on each core and get 100,000 games. 50 games blitz (say 5+3) wouldn't take more than a few hours. 50 games rapid might take a day or so. And a lot of tests wouldn't take 100k games to show a significant result.

Stockfish has a few hundred cores in its testing network now. And the core counts on processors just keep going up.

--Jon

It is something that I do not understand.

Why do so many people choose to donate computer time to Stockfish when they cannot even choose the time control or what they test?

I think a better model would allow people who donate computer time to choose the patch they test if they want to, with the default option being not to choose a specific patch.

Patches that the team considers uninteresting could go into a special set, and they would get computer time only if a contributor asks to test them.

Dann Corbit · Post by **Dann Corbit** » Mon Sep 25, 2017 11:44 pm

Uri Blass wrote:
jdart wrote:
Besides which, nobody has the resources to test at tournament time control.
That is true now, but I am not sure it is going to be always true.

If you had 2000 cores available, you could run a match of 50 games on each core and get 100,000 games. 50 games blitz (say 5+3) wouldn't take more than a few hours. 50 games rapid might take a day or so. And a lot of tests wouldn't take 100k games to show a significant result.

Stockfish has a few hundred cores in its testing network now. And the core counts on processors just keep going up.

--Jon
It is something that I do not understand.

Why do so many people choose to donate computer time to Stockfish when they cannot even choose the time control or what they test?

I think a better model would allow people who donate computer time to choose the patch they test if they want to, with the default option being not to choose a specific patch.

Patches that the team considers uninteresting could go into a special set, and they would get computer time only if a contributor asks to test them.

I guess that if you ran a test using their own time control multiplied by 10 and showed that the patch passes, they would accept it.
