questions about stockfish testing

Uri Blass · Post by **Uri Blass** » Thu May 02, 2013 11:37 am

I see basically two types of tests:

15+0.05 and 60+0.05
If I understood correctly, only patches that are good at both tests are accepted.

http://tests.stockfishchess.org/tests?page=4

I wonder if testers are also free to test at different time controls.

For example, a change in time management might be productive at 40/40 but not at an incremental time control.

I also wonder whether testers who believe that a change is counterproductive at 15+0.05 but productive at a longer time control are free to test at the longer time control.

I thought of asking this question in the Stockfish developer forum, but I understand that Marco does not want people there who are neither patch authors nor testers, so I decided not to ask there.

mcostalba · Post by **mcostalba** » Thu May 02, 2013 12:37 pm

> Uri Blass wrote: I thought of asking this question in the Stockfish developer forum, but I understand that Marco does not want people there who are neither patch authors nor testers, so I decided not to ask there.

I have already answered this exact question for you in the dev forum: "Everybody with write access to fishtest can test whatever they want"

I am wondering whether your aim is to read the answers or just to create trouble.

Uri Blass · Post by **Uri Blass** » Thu May 02, 2013 1:31 pm

> mcostalba wrote:
> > Uri Blass wrote: I thought of asking this question in the Stockfish developer forum, but I understand that Marco does not want people there who are neither patch authors nor testers, so I decided not to ask there.
>
> I have already answered this exact question for you in the dev forum: "Everybody with write access to fishtest can test whatever they want"
>
> I am wondering whether your aim is to read the answers or just to create trouble.

I understood from your answer that everybody can test any code change they want, but it was not clear whether that applies only to code changes or also to the time control.

I simply see that all the recent tests are at 15+0.05, followed by 60+0.05 when the 15+0.05 result is positive.

I simply thought that some change might cost 2 Elo at 15+0.05 but gain 3 Elo at 60+0.05 and 5 Elo at 240+2, and that the Stockfish team could miss it because nobody is going to test at 60+0.05 after a failure at 15+0.05.

gladius · Post by **gladius** » Thu May 02, 2013 1:53 pm

> Uri Blass wrote:
> > mcostalba wrote:
> > > Uri Blass wrote: I thought of asking this question in the Stockfish developer forum, but I understand that Marco does not want people there who are neither patch authors nor testers, so I decided not to ask there.
> >
> > I have already answered this exact question for you in the dev forum: "Everybody with write access to fishtest can test whatever they want"
> >
> > I am wondering whether your aim is to read the answers or just to create trouble.
>
> I understood from your answer that everybody can test any code change they want, but it was not clear whether that applies only to code changes or also to the time control.
>
> I simply see that all the recent tests are at 15+0.05, followed by 60+0.05 when the 15+0.05 result is positive.
>
> I simply thought that some change might cost 2 Elo at 15+0.05 but gain 3 Elo at 60+0.05 and 5 Elo at 240+2, and that the Stockfish team could miss it because nobody is going to test at 60+0.05 after a failure at 15+0.05.

The TC is editable by anyone setting up the test. It's always a question of using the testing resources efficiently though. If a test doesn't pass at 15s/game, it has a high likelihood of failing at 60s/game as well.
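
To put rough numbers on the resource question, here is a back-of-the-envelope sketch. It is not taken from fishtest. It assumes a fixed-length match, a logistic Elo model, a draw ratio of 60% and about 60 moves per side, all of which are illustrative guesses. The function names `games_needed` and `game_seconds` are hypothetical.

```python
import math

# Elo per unit of score near a 50% score (inverse slope of the logistic Elo curve).
ELO_PER_SCORE = 400.0 / (math.log(10.0) * 0.25)


def games_needed(elo_diff, draw_ratio=0.6, z=1.96):
    """Games needed so that elo_diff is z standard errors away from zero.

    Assumes two roughly equal engines, so the per-game score variance
    is (1 - draw_ratio) / 4. The draw ratio is an assumed value.
    """
    per_game_sd = math.sqrt((1.0 - draw_ratio) / 4.0)
    return math.ceil((z * per_game_sd * ELO_PER_SCORE / elo_diff) ** 2)


def game_seconds(base, inc, moves_per_side=60):
    """Upper bound on game length if both sides use all of their time.

    moves_per_side is an assumption, not a measured value.
    """
    return 2.0 * (base + inc * moves_per_side)


if __name__ == "__main__":
    n = games_needed(2.0)
    for base, inc in [(15, 0.05), (60, 0.05)]:
        hours = n * game_seconds(base, inc) / 3600.0
        print(f"{base}+{inc}: about {n} games, up to {hours:.0f} core-hours")
```

Under these assumptions, the same number of games costs several times more machine time at 60+0.05 than at 15+0.05, and halving the Elo difference you want to resolve roughly quadruples the number of games.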

Uri Blass · Post by **Uri Blass** » Thu May 02, 2013 2:25 pm

> gladius wrote:
> > Uri Blass wrote:
> > > mcostalba wrote:
> > > > Uri Blass wrote: I thought of asking this question in the Stockfish developer forum, but I understand that Marco does not want people there who are neither patch authors nor testers, so I decided not to ask there.
> > >
> > > I have already answered this exact question for you in the dev forum: "Everybody with write access to fishtest can test whatever they want"
> > >
> > > I am wondering whether your aim is to read the answers or just to create trouble.
> >
> > I understood from your answer that everybody can test any code change they want, but it was not clear whether that applies only to code changes or also to the time control.
> >
> > I simply see that all the recent tests are at 15+0.05, followed by 60+0.05 when the 15+0.05 result is positive.
> >
> > I simply thought that some change might cost 2 Elo at 15+0.05 but gain 3 Elo at 60+0.05 and 5 Elo at 240+2, and that the Stockfish team could miss it because nobody is going to test at 60+0.05 after a failure at 15+0.05.
>
> The TC is editable by anyone setting up the test. It's always a question of using the testing resources efficiently though. If a test doesn't pass at 15s/game, it has a high likelihood of failing at 60s/game as well.

If a change passed at 15 s/game and did not pass at 60 s/game, then it is logical to try a change in the opposite direction, which might pass at 60 s/game and not at 15 s/game.

lucasart · Post by **lucasart** » Thu May 02, 2013 2:37 pm

> Uri Blass wrote: If a change passed at 15 s/game and did not pass at 60 s/game, then it is logical to try a change in the opposite direction, which might pass at 60 s/game and not at 15 s/game.

In practice this (almost) never happens, and there are a couple of good reasons for that:
1/ Almost all patches perform less well at 60+0.05 than at 15+0.05, which is not surprising.
2/ If you look at the SPRT bounds of the 15+0.05 and 60+0.05 tests, you will see that the 15+0.05 test is much more tolerant, so a "not so good" patch at 15+0.05 is still given a chance at 60+0.05.
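
For readers unfamiliar with SPRT bounds, the sketch below shows how a pair of Elo hypotheses turns a win/draw/loss count into a pass, fail or continue decision. It uses a common normal approximation to the log-likelihood ratio and is not claimed to be the code fishtest runs. The example bounds `SHORT_TC_BOUNDS` and `LONG_TC_BOUNDS` are hypothetical values chosen only to show what "more tolerant" means. The actual bounds are not given in this thread.

```python
import math


def expected_score(elo_diff):
    """Expected score for a given Elo advantage (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Uses a normal approximation based on the observed mean score and
    per-game variance. Returns 0.0 when there is too little data.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean ** 2
    if var <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's stopping thresholds (lower, upper) for the LLR."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def sprt_decision(wins, draws, losses, elo0, elo1, alpha=0.05, beta=0.05):
    """Return 'pass', 'fail' or 'continue' for the current result."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    lower, upper = sprt_bounds(alpha, beta)
    if llr >= upper:
        return "pass"
    if llr <= lower:
        return "fail"
    return "continue"


# Hypothetical bounds, for illustration only (not taken from the thread).
SHORT_TC_BOUNDS = (-1.5, 4.5)  # more tolerant: H0 sits below zero Elo
LONG_TC_BOUNDS = (0.0, 6.0)    # stricter: H0 is exactly zero Elo

if __name__ == "__main__":
    result = (2100, 6000, 1900)  # made-up wins, draws, losses
    for name, (e0, e1) in [("short", SHORT_TC_BOUNDS), ("long", LONG_TC_BOUNDS)]:
        print(name, round(sprt_llr(*result, e0, e1), 2), sprt_decision(*result, e0, e1))
```

With the same game results, the lower pair of bounds yields a larger log-likelihood ratio, so a marginal patch reaches the pass threshold more easily. That is the sense in which the first stage can be "more tolerant" than the second.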

Don · Post by **Don** » Thu May 02, 2013 4:51 pm

> Uri Blass wrote:
> > gladius wrote:
> > > Uri Blass wrote:
> > > > mcostalba wrote:
> > > > > Uri Blass wrote: I thought of asking this question in the Stockfish developer forum, but I understand that Marco does not want people there who are neither patch authors nor testers, so I decided not to ask there.
> > > >
> > > > I have already answered this exact question for you in the dev forum: "Everybody with write access to fishtest can test whatever they want"
> > > >
> > > > I am wondering whether your aim is to read the answers or just to create trouble.
> > >
> > > I understood from your answer that everybody can test any code change they want, but it was not clear whether that applies only to code changes or also to the time control.
> > >
> > > I simply see that all the recent tests are at 15+0.05, followed by 60+0.05 when the 15+0.05 result is positive.
> > >
> > > I simply thought that some change might cost 2 Elo at 15+0.05 but gain 3 Elo at 60+0.05 and 5 Elo at 240+2, and that the Stockfish team could miss it because nobody is going to test at 60+0.05 after a failure at 15+0.05.
> >
> > The TC is editable by anyone setting up the test. It's always a question of using the testing resources efficiently though. If a test doesn't pass at 15s/game, it has a high likelihood of failing at 60s/game as well.
>
> If a change passed at 15 s/game and did not pass at 60 s/game, then it is logical to try a change in the opposite direction, which might pass at 60 s/game and not at 15 s/game.

Uri,

I know from observing you on talkchess that you have at least a mild obsessive disorder with respect to computer chess. You feel that there are always stones left that need to be turned over just to be sure but it never ends.

But due to limited testing resources, you cannot obsess forever on a single change or you would never make progress. Yes, you can imagine almost anything happening and you could chase down every possibility if you wanted to, but you would never be able to move on to the next test.

I know because I am more like you and Larry is more pragmatic, almost to the point of being sloppy from my point of view and he probably thinks I'm too anal retentive. I'll even do exploratory tests to get an upper or lower bound and I like to "study" it thoroughly and Larry will wonder what the point is of running something that I know won't test well. My view is that I need to know!

But I am pragmatic too and know that you are forced to take some shortcuts, make some concessions, find the right balance. For example, rejecting changes that do not test well at 15 seconds - we know from experience that it's very possible to reject good changes that WOULD test well at 60 seconds. But if you do not draw the line somewhere you will never make progress. It always comes down to how much time you want to spend on any given change. It's a calculated risk. If you don't spend enough time you might reject something that could have been a good thing. It's like being on a road trip and thinking you passed your exit but not being sure. How far do you go before turning around? Turning around might be the wrong choice but continuing on might also waste time. It's annoying when you are in a hurry or late and that happens.

Generally, if you have a good change but don't know how to set the key parameter, a pragmatic approach is to do an initial fast test with a conservative value because if the idea is good it will show up at all levels with very few exceptions. You can tune it later.
But in your example I would say that if it doesn't test well at the longer time control you should not expect a parameter change to reverse that. I'm not saying it cannot happen, but how much time do you have to invest in one change when 30 are in the queue?

By the way, with respect to scalability issues we also run fast tests as a kind of sanity test, but we don't take the fast results as seriously. If it's a close call we might make the judgement to run the long test anyway, if we believe it's likely to be something that would have more than a trivial impact at different search depths.

I would love to have a black-box testing procedure that takes all human judgement out of the picture, but it turns out that human judgement is too important to ignore. But human judgement also introduces bias. We stop some tests early based on many factors, such as a priori belief (or lack of belief) in the change, or our own judgement that a bad start will not recover even though we know it's possible. If we believe the change cannot be more than something very minor we are not as interested. If we were publishing results it would not be proper to do any of that.

If you really want to drive yourself crazy, when you make a change you need to test all other changes you have ever made again, because the new change WILL with 100% certainty interact with at least some of them. I hope that thought does not cause you to stay up at night.
