testing question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing question

Post by bob »

michiguel wrote:
bob wrote:
lkaufman wrote:
Dann Corbit wrote:To me, it seems that self-testing is the most logical way to make your program improve against earlier versions of itself, and foreign testing is the best way to make it improve against the programs you test it against.

I remember Quark self-tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose; for some reason it was a nemesis beyond what the numerical difference suggested.

I suspect that both types of testing have value and will produce different kinds of improvement.

Consider:
Program 'A' has bad king safety. We improve the pawn structure understanding of program 'A', and now 'A-prime' can beat the pants off of program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it exploits our weak king safety, so the improvement we see against this program will be much smaller.
Your answer implies that if the goal is to improve against foreign opponents, we should just do foreign testing. However, your example merely implies that rating improvements from self-testing exaggerate the "real" gains, which I know to be true. It does not suggest that gains from self-testing could be worthless or harmful against foreign opponents. My further question would therefore be: has anyone ever experienced a program improvement based on self-play which turned out to be harmful against other opponents, based on a statistically significant sample in each case?
Yes. I reported this curious effect a year or two ago. I don't remember the specific eval term, but the idea was that I added a new term, so that A' had the term while A did not. And A' won with a very high confidence that it was better. I then dropped it into our cluster testing approach, and it was worse.

We have seen several such cases where A vs A' suggests that a change is better, but then testing against a suite of opponents shows that the new term is worse more often than it is better.
I believe this is an exception rather than the rule.

Miguel
It's the exceptions that kill you. :)




The only time I test A vs A' any longer is when I do a "stress test" to make sure that everything works correctly at impossibly fast time controls, to detect errors that cause unexpected results. And I rarely do that kind of testing unless I make a major change in something (say parallel search) where I want to make sure it doesn't break something seriously. Game in 1 second (or less) plays a ton of games and usually exposes serious bugs quickly.
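
For reference, here is a minimal sketch of that kind of ultra-fast A vs A' stress loop, written with the python-chess library rather than any particular match manager. The engine paths (./engine-old, ./engine-new), game count and per-move time are placeholder assumptions, not the actual setup described above; the point is only that at these speeds crashes, hangs and broken results surface very quickly.

Code:

import chess
import chess.engine

GAMES = 200          # in practice you would play far more
MOVE_TIME = 0.05     # seconds per move -- "impossibly fast"

def play_game(white_path, black_path):
    """Play one game and return the result string, or 'CRASH' if an engine dies."""
    engines = [chess.engine.SimpleEngine.popen_uci(white_path),
               chess.engine.SimpleEngine.popen_uci(black_path)]
    board = chess.Board()
    try:
        while not board.is_game_over(claim_draw=True):
            side = 0 if board.turn == chess.WHITE else 1
            result = engines[side].play(board, chess.engine.Limit(time=MOVE_TIME))
            board.push(result.move)
        return board.result(claim_draw=True)
    except chess.engine.EngineTerminatedError:
        return "CRASH"   # exactly the kind of bug this test is meant to expose
    finally:
        for e in engines:
            try:
                e.quit()
            except chess.engine.EngineTerminatedError:
                pass

if __name__ == "__main__":
    for i in range(GAMES):
        # alternate colors between the old and new binaries
        paths = ("./engine-old", "./engine-new") if i % 2 == 0 else ("./engine-new", "./engine-old")
        print(i, play_game(*paths))
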
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: testing question

Post by Don »

F. Bluemers wrote:
lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
One of the problems with self-testing is that in most, if not all, tournaments you are not playing against a previous version.
Of course, Larry addressed that - didn't you read what he posted?

I think the question Larry asked was, "Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?"

A post like this is sure to bring out more opinions than facts, and I think that is why he asked that very specific question. I don't mind hearing opinions, and I will listen to your intuitions too, but I don't consider them very reliable; I don't even trust my own intuitions on things like this.

We do both kinds of testing, and we put more weight on testing against foreign programs. However, I would point out that testing against 3 other foreign programs is just a less severe kind of self-testing. The programs are of similar strength and, for the most part, do things the way all the top programs do, so this only partially addresses any concerns over "incestuous" testing methods and intransitivities. The burning question is whether we should actually be concerned or not.

Don
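
To put rough numbers behind "fewer games to reach a conclusion with a given amount of confidence", here is a back-of-the-envelope sketch; it is a standard normal-approximation calculation, not Komodo's actual tool, and the example game counts are invented. It converts a match score into an Elo estimate with an approximate 95% error bar, which shows why the exaggerated differences from self-testing clear zero with fewer games than the smaller differences typically seen against foreign opponents.

Code:

import math

def elo_and_margin(wins, losses, draws):
    """Elo estimate and ~95% confidence bounds from a match result."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n                 # scoring fraction
    elo = -400.0 * math.log10(1.0 / score - 1.0)     # logistic Elo model
    # per-game variance of the score, then standard error of the mean
    var = (wins * (1 - score) ** 2 + losses * score ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)
    # convert the +/- 1.96 SE band on the score into an Elo band
    lo = -400.0 * math.log10(1.0 / max(score - 1.96 * se, 1e-9) - 1.0)
    hi = -400.0 * math.log10(1.0 / min(score + 1.96 * se, 1.0 - 1e-9) - 1.0)
    return elo, lo, hi

# Invented example: 1000 games each way. A self-test that exaggerates the
# change to a 56% score is clearly positive; a foreign gauntlet showing the
# same change as only 52% still has an error bar that straddles zero.
print(elo_and_margin(330, 210, 460))   # ~56% score: roughly +42 Elo, +/- ~16
print(elo_and_margin(290, 250, 460))   # ~52% score: roughly +14 Elo, +/- ~16
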
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: testing question

Post by Don »

lkaufman wrote:Yes, gains against foreign programs are almost always less than predicted by self-testing. Furthermore, we already do what you say, a mix of the two types of testing. So I guess what I really want to know is which of the two types of testing we should do mostly. It is unlikely that exactly a 50-50 split between the two types of tests is optimal. When I worked on Rybka we relied 99% on self-testing, and obviously it worked well, but this was primarily because there were then no other programs close to Rybka's level. Now this is no longer a problem, so the answer is not at all obvious.
It's gradually becoming a problem, however. We can no longer handicap the programs we test against, and even without the handicaps we are no longer at the bottom of our own lists, so this will become a real issue as we approach and then pass Houdini.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: testing question

Post by michiguel »

Don wrote:
F. Bluemers wrote:
lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
One of the problems with self-testing is that in most, if not all, tournaments you are not playing against a previous version.
Unless your name is Fabien ;-)

Of course, Larry addressed that - didn't you read what he posted?

I think the question Larry asked was, "Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?"

A post like this is sure to bring out more opinions than facts, and I think that is why he asked that very specific question. I don't mind hearing opinions, and I will listen to your intuitions too, but I don't consider them very reliable; I don't even trust my own intuitions on things like this.

We do both kinds of testing, and we put more weight on testing against foreign programs. However, I would point out that testing against 3 other foreign programs is just a less severe kind of self-testing.
I absolutely agree.
The programs are of similar strength and, for the most part, do things the way all the top programs do, so this only partially addresses any concerns over "incestuous" testing methods and intransitivities. The burning question is whether we should actually be concerned or not.
Don
The reason the problem may not become obvious is that you end up playing against those same programs in tournaments and testing.

Miguel
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: testing question

Post by michiguel »

bob wrote:
michiguel wrote:
bob wrote:
lkaufman wrote:
Dann Corbit wrote:To me, it seems that self-testing is the most logical way to make your program improve against earlier versions of itself, and foreign testing is the best way to make it improve against the programs you test it against.

I remember Quark self-tests that showed a big improvement, and then Anmon would (once again) give Quark a bloody nose; for some reason it was a nemesis beyond what the numerical difference suggested.

I suspect that both types of testing have value and will produce different kinds of improvement.

Consider:
Program 'A' has bad king safety. We improve the pawn structure understanding of program 'A', and now 'A-prime' can beat the pants off of program 'A' with an incredible 100 Elo improvement. However, when we play program 'A-prime' against program 'B', it exploits our weak king safety, so the improvement we see against this program will be much smaller.
Your answer implies that if the goal is to improve against foreign opponents, we should just do foreign testing. However, your example merely implies that rating improvements from self-testing exaggerate the "real" gains, which I know to be true. It does not suggest that gains from self-testing could be worthless or harmful against foreign opponents. My further question would therefore be: has anyone ever experienced a program improvement based on self-play which turned out to be harmful against other opponents, based on a statistically significant sample in each case?
Yes. I reported this curious effect a year or two ago. I don't remember the specific eval term, but the idea was that I added a new term, so that A' had the term while A did not. And A' won with a very high confidence that it was better. I then dropped it into our cluster testing approach, and it was worse.

We have seen several such cases where A vs A' suggests that a change is better, but then testing against a suite of opponents shows that the new term is worse more often than it is better.
I believe this is an exception rather than the rule.

Miguel
It's the exceptions that kill you. :)
Not if you confirm the progress with foreign testing.

Miguel




The only time I test A vs A' any longer is when I do a "stress test" to make sure that everything works correctly at impossibly fast time controls, to detect errors that cause unexpected results. And I rarely do that kind of testing unless I make a major change in something (say parallel search) where I want to make sure it doesn't break something seriously. Game in 1 second (or less) plays a ton of games and usually exposes serious bugs quickly.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: testing question

Post by michiguel »

lkaufman wrote:
bob wrote:I am not sure how testing A against A' requires fewer games than when testing A vs B, C, D, E and F. I've not seen where the opponents matter, just the total number of games.

In fact, the opposite might be the case, because when testing A vs A', the difference is typically very small, which requires many more games to reach a reasonable error margin.
I don't understand this comment. If A' is one Elo better than A, it will take a huge number of games to prove this regardless of whether they play each other or foreign programs. My testing has always shown that it takes more games to prove an improvement with foreign testing than with self-testing. The argument for foreign testing must be that self-testing just doesn't correlate highly enough with it, as you suggest in another response here.

Even if self-testing were not more sensitive, it would take fewer games. When you do "foreign testing" you compare A vs B, then A' vs B. So you double the number of games, and you also have to combine two error bars. A vs A' gives you only one error bar and needs fewer games to reach the same confidence level. The only problem is reliability, but statistically speaking, it is an obvious winner.

Miguel
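
A quick numeric illustration of the "two error bars" point (my own arithmetic, with assumed round numbers, not Miguel's figures): when A vs B and A' vs B are measured separately, their standard errors add in quadrature the moment you subtract one result from the other. So with the same total number of games split between the two gauntlets, the indirect estimate of the A-to-A' difference is about twice as noisy as a direct A vs A' match, and matching the direct error bar takes roughly four times as many games, before even counting the way self-play magnifies the gap.

Code:

import math

SIGMA_GAME = 0.30      # assumed per-game score standard deviation (typical with many draws)
ELO_PER_SCORE = 700.0  # ~7 Elo per 1% of score, valid near a 50% result

def elo_error(total_games, comparisons):
    """Standard error (in Elo) of the measured A-to-A' difference.

    comparisons = 1 -> direct A vs A' match using all the games
    comparisons = 2 -> A vs B and A' vs B with half the games each,
                       whose errors add in quadrature when subtracted
    """
    games_each = total_games / comparisons
    se_score = SIGMA_GAME / math.sqrt(games_each)
    return math.sqrt(comparisons) * se_score * ELO_PER_SCORE

for n in (1000, 4000, 16000):
    print(n, round(elo_error(n, 1), 1), round(elo_error(n, 2), 1))
# 1000 games:  ~6.6 Elo directly vs ~13.3 Elo via a common opponent
# 4000 games:  ~3.3 vs ~6.6
# 16000 games: ~1.7 vs ~3.3
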
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: testing question

Post by Don »

michiguel wrote:
Don wrote:
F. Bluemers wrote:
lkaufman wrote:There are basically two methods to test whether a new version of your program is stronger than the previous one. You can play a direct match between them (which I call "self-testing"), or you can play each against a set of unrelated programs (let's call that "foreign testing"). Self-testing is more efficient in that you need fewer games to reach a conclusion with a given amount of confidence, but there is the question of how well self-testing predicts the results of foreign testing. Self-testing tends to exaggerate rating differences, but that is a good thing as it further reduces the need for more games. We use both methods in Komodo, but are still unsure about the relative merits of the two methods.
So what do all of you think? If you only had time to play a thousand games, would you self-test or foreign test? Does it depend on the nature of the difference between the versions? Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?
One of the problems with self-testing is that in most, if not all, tournaments you are not playing against a previous version.
Unless your name is Fabien ;-)

Of course, Larry addressed that - didn't you read what he posted?

I think the question Larry asked was, "Is there any solid empirical data that shows that self-testing is not a reliable predictor of foreign-testing?"

A post like this is sure to bring out more opinions than facts, and I think that is why he asked that very specific question. I don't mind hearing opinions, and I will listen to your intuitions too, but I don't consider them very reliable; I don't even trust my own intuitions on things like this.

We do both kinds of testing, and we put more weight on testing against foreign programs. However, I would point out that testing against 3 other foreign programs is just a less severe kind of self-testing.
I absolutely agree.
The programs are of similar strength and, for the most part, do things the way all the top programs do, so this only partially addresses any concerns over "incestuous" testing methods and intransitivities. The burning question is whether we should actually be concerned or not.
Don
The reason the problem may not become obvious is that you end up playing against those same programs in tournaments and testing.
That is true. We are testing against the strongest programs we have that run well on Linux: Critter, Stockfish and Robbo. We have not looked much at the many other clones, and Robbo is more or less representative of them. Houdini is the only program now that is substantially stronger at the levels we test at (even the 32-bit version), but we can only use the 32-bit version under wine, and it's not rock solid (some time forfeits, etc.). We have to be able to play thousands of games unattended.

If we can do well against those programs, then we are not so worried about the lesser players, although ANY top-10 program can be quite dangerous.

Miguel