Stockfish testing at STC and LTC: one question


Uri Blass
Posts: 10322
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish testing at STC and LTC: one question

Post by Uri Blass »

Jouni wrote: It is usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3.6 level we probably get NOTHING? And maybe even a regression at tournament level!
It is usual that the Stockfish team has no interest in how much Elo they gain; otherwise they could do better by using a fixed number of games.

We know almost nothing about the Elo improvement of a patch from the results of SPRT.

A performance of +4 Elo when a patch passes SPRT at STC means nothing, because if you test the same patch many times it may fail SPRT in some of the runs and show only 0 or 1 Elo there, so the average result is clearly less than 4 Elo.
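
The selection effect described here can be illustrated with a small simulation. The following is a minimal sketch, not fishtest's actual code: it assumes SPRT bounds of [0, 5] Elo with alpha = beta = 0.05, a 60% draw ratio, batch score totals drawn from a normal approximation, and a known per-game variance. The function names `run_sprt` and `summarize` are made up for this illustration.

```python
import math
import random


def elo_to_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def score_to_elo(score):
    """Inverse of elo_to_score."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def run_sprt(true_elo, elo0=0.0, elo1=5.0, alpha=0.05, beta=0.05,
             draw_ratio=0.6, batch=100, max_games=400_000, rng=random):
    """Simulate one SPRT run and return (passed, games, observed_elo).

    Assumptions (hypothetical, for illustration only): scores are generated
    per batch from a normal approximation, and the per-game variance is
    treated as known rather than estimated from the results.
    """
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    s_true = elo_to_score(true_elo)
    var = (1.0 - draw_ratio) / 4.0          # per-game score variance near a 50% score
    lower = math.log(beta / (1.0 - alpha))  # crossing this bound: patch fails
    upper = math.log((1.0 - beta) / alpha)  # crossing this bound: patch passes
    total, games, llr = 0.0, 0, 0.0
    while games < max_games:
        total += rng.gauss(batch * s_true, math.sqrt(batch * var))
        games += batch
        # Log-likelihood ratio of H1 (elo1) against H0 (elo0), normal model.
        llr = (s1 - s0) / var * (total - games * (s0 + s1) / 2.0)
        if llr >= upper or llr <= lower:
            break
    return llr >= upper, games, score_to_elo(total / games)


def summarize(true_elo, runs=500, seed=1):
    """Return (pass rate, mean observed Elo among the passing runs)."""
    rng = random.Random(seed)
    results = [run_sprt(true_elo, rng=rng) for _ in range(runs)]
    passed = [obs for ok, _, obs in results if ok]
    mean_passed = sum(passed) / len(passed) if passed else float("nan")
    return len(passed) / runs, mean_passed


if __name__ == "__main__":
    for elo in (0.0, 1.0, 2.5):
        rate, mean_passed = summarize(elo)
        print(f"true Elo {elo:4.1f}: pass rate {rate:.2f}, "
              f"mean observed Elo when passing {mean_passed:.2f}")
```

For example, `summarize(1.0)` simulates a patch whose true gain is 1 Elo. Under this model a run can only pass when its observed score sits above the midpoint of the two bounds, so the runs that pass always report more than the true 1 Elo, while most runs of the same patch fail.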
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Stockfish testing at STC and LTC: one question

Post by jdart »

I have observed myself that STC results appear to be a lot noisier than LTC results, so a positive STC result is a poor predictor of LTC or real tournament results. This is a bit surprising because for years, starting with Rybka, engines were tested with hyper-bullet games. There is some validity to that method, because many engines gained a good amount of Elo from it. But it is not the best or most reliable method. It is a shortcut around testing at real time controls, which would require a huge number of processor cores to finish in a reasonable time.

--Jon
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish testing at STC and LTC: one question

Post by Dann Corbit »

jdart wrote: I have observed myself that STC results appear to be a lot noisier than LTC results, so a positive STC result is a poor predictor of LTC or real tournament results. This is a bit surprising because for years, starting with Rybka, engines were tested with hyper-bullet games. There is some validity to that method, because many engines gained a good amount of Elo from it. But it is not the best or most reliable method. It is a shortcut around testing at real time controls, which would require a huge number of processor cores to finish in a reasonable time.

--Jon
When you test with a certain set of conditions, the results are totally valid for exactly those conditions. The results may or may not translate to another set of conditions.

Generally speaking, things that work well at ultra-high speed will work well at other speeds too. That is why the model tends to work and Stockfish is an extremely strong engine.

On the other hand, they are tuning SF for high-speed blitz games, so that is what they will achieve.
But I think every other engine is doing the same thing, so it really won't make any difference anyway.
Besides which, nobody has the resources to test at tournament time control.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Stockfish testing at STC and LTC: one question

Post by jdart »

Dann Corbit wrote:
Besides which, nobody has the resources to test at tournament time control.
That is true now, but I am not sure it is going to be always true.

If you had 2000 cores available, you could run a match of 50 games on each core and get 100,000 games. 50 games blitz (say 5+3) wouldn't take more than a few hours. 50 games rapid might take a day or so. And a lot of tests wouldn't take 100k games to show a significant result.

Stockfish has a few hundred cores in its testing network now. And the core counts on processors just keep going up.
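
The trade-off between game count and precision can be made concrete with a back-of-the-envelope calculation. This is only a sketch: it assumes two nearly equal engines, a 60% draw ratio (an assumption, not a measured value), and a normal approximation for the match score. The function names are hypothetical.

```python
import math

# Slope of the Elo curve at a 50% score: d(Elo)/d(score) = 1600 / ln(10).
ELO_PER_SCORE = 1600.0 / math.log(10)


def elo_margin_95(games, draw_ratio=0.6):
    """Approximate 95% error margin in Elo after `games` games."""
    var = (1.0 - draw_ratio) / 4.0   # per-game score variance near a 50% score
    return 1.96 * math.sqrt(var / games) * ELO_PER_SCORE


def games_for_margin(margin_elo, draw_ratio=0.6):
    """Smallest number of games whose 95% margin is at most `margin_elo`."""
    var = (1.0 - draw_ratio) / 4.0
    return math.ceil(var * (1.96 * ELO_PER_SCORE / margin_elo) ** 2)


if __name__ == "__main__":
    cores, games_per_core = 2000, 50
    total = cores * games_per_core   # 100,000 games, as in the post
    print(f"{total} games -> +/- {elo_margin_95(total):.2f} Elo")
    for margin in (5.0, 2.0, 1.0):
        print(f"+/- {margin} Elo needs about {games_for_margin(margin)} games")
```

Under these assumptions 100,000 games give a margin of roughly ±1.4 Elo, and halving the margin costs four times as many games.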

--Jon
Uri Blass
Posts: 10322
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish testing at STC and LTC: one question

Post by Uri Blass »

jdart wrote:
Besides which, nobody has the resources to test at tournament time control.
That is true now, but I am not sure it is going to be always true.

If you had 2000 cores available, you could run a match of 50 games on each core and get 100,000 games. 50 games blitz (say 5+3) wouldn't take more than a few hours. 50 games rapid might take a day or so. And a lot of tests wouldn't take 100k games to show a significant result.

Stockfish has a few hundred cores in its testing network now. And the core counts on processors just keep going up.

--Jon
It is something that I do not understand.

Why do so many people choose to donate computer time to Stockfish when they cannot even choose the time control or which patches they test?

I think a better model would let people who donate computer time choose the patch they test if they want to, with the default being not to choose a specific patch.

Patches that the team considers uninteresting could go into a special set and get computer time only when a contributor asks to test them.
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish testing at STC and LTC: one question

Post by Dann Corbit »

Uri Blass wrote:
It is something that I do not understand.

Why do so many people choose to donate computer time to Stockfish when they cannot even choose the time control or which patches they test?

I think a better model would let people who donate computer time choose the patch they test if they want to, with the default being not to choose a specific patch.

Patches that the team considers uninteresting could go into a special set and get computer time only when a contributor asks to test them.
I guess that if you ran a test at their own time control multiplied by 10 and showed that the patch passes, they would accept it.