New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

I have always agreed that the _only_ type of correlation I see here is caused by the fact that the same programs are playing the same set of positions. Some want to turn this into a type of correlation where game a physically affects the outcome of game b, and I do not believe that is possible without some very intentional intervention by someone. Say that in a 60-game match between A and B, A is favored by biasing the time in its favor, but the second time around B is favored by biasing the time in the opposite direction. That would nicely do the trick and produce results even worse than mine, by far. But that is not a possibility here, in any form...

But the same two programs _are_ playing, and although the clock jitter scrambles the results significantly, my observation has been that the overall win/loss result has not changed significantly with respect to who finished at the top, etc. Which is, I believe, what you are trying to explain. The results of a given match are not completely random. The better program generally wins more than it loses (although there are always bad exceptions... I count cards when playing blackjack, and a couple of years ago I went on one of those terrible three-month, -2 sigma trips even though I was playing with a roughly 1.5% advantage over the house. But those are rare, thankfully, although they are certainly something that must be expected every now and then). So the one with the edge is the one I would bet on, as you are explaining. And on occasion, I am going to lose. But lifelong, it will be a money-maker, I agree. Just as I believe in Hi-Lo. :)
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
When you say no duplicates, do you mean that no two games are the same, or do you mean only that the first game did not repeat?

Note that these two options are different.

Uri
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 4 sets of data

Post by Uri Blass »

bob wrote:
Uri Blass wrote:You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
The performance penalty from searching only complete iterations is clearly less than 100 Elo (I would guess something near 20 Elo), and I believe that there are programmers who do not care about these small numbers and prefer deterministic results in testing so that they can reproduce everything. (My opinion is that it is better to sacrifice 20 Elo to be able to reproduce everything easily.)
I play so many games against other computers on ICC, and I see the analysis they whisper/kibitz when I am watching and we both turn it on. Even in CCT events as well. And _everybody_ that I played was producing output claiming that not all root moves were searched each time a move was played. Sometimes they started a new depth and got nothing back, sometimes they were in the middle, and perhaps very rarely they timed out when the iteration ended. But it is easy to watch 'em. I don't believe _anybody_ designs a program in any way other than to make it as strong as possible. And that includes me. I know how to make a perfectly deterministic parallel search, but with absolutely nowhere near the speedup I now produce. So non-determinism is accepted along with better performance. Ditto for time allocation and every other decision we make. But find me any _serious_ chess programmer who would say "sure, I will give up significant Elo just to make testing more deterministic (although no more accurate)."

You can be sure that not all programs at a level close to Crafty's exhibit the variability that you talk about.
When I first ran into this, I tried several: both Glaurungs, Fruit, GNU Chess, Crafty, Arasan 9/10, and a couple of others that I won't name but which I could get to run under Linux. And they _all_ had this issue. Yes, there could certainly be a program out there that terminates searches on iteration boundaries, that clears the hash between moves, that does anything else needed to minimize variability. But there certainly are not very many of 'em, because we all take any rating improvement we can get, and we aren't going to throw away 30 here, 20 there, since enough of that produces a pure patzer.



Not all programmers of programs at that level care much about playing strength, and you can be sure that there are people who do not care whether their program is 500 Elo weaker than Rybka or 480 Elo weaker.

I have not worked on Movei lately, but if I come back to it, then at a level more than 100 Elo weaker than Rybka I will certainly not care about a small improvement that makes results non-reproducible and makes it harder to find bugs, because I may see the program play a move that I cannot reproduce.

Uri
I'm the opposite, trying to take everything I can get. Most are...
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration; of course, on a ponder hit I may move in 0 seconds, not at the end of an iteration).

Clearing the hash clearly can help in finding bugs; it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and all you have is the game.

You know that the program made some mistake that you simply cannot reproduce in your first analysis attempts, because you cannot reproduce the hash state.

You will have a harder time catching the bug that caused the program to make the mistake.

And if you need more time to discover bugs, it means that you sacrifice future improvements.

Uri
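
The iteration-boundary trade-off these posts keep circling can be made concrete. Below is a minimal Python sketch; none of it is taken from Crafty, Movei, or any real engine. search_to_depth is a stand-in for a real alpha-beta iteration, and the node budget, deadline, and function names are all illustrative assumptions.

Code:

import time

def search_to_depth(position, depth, tt, deadline=None):
    """Stand-in for one alpha-beta iteration: visits ~8**depth 'nodes' and
    returns (move, nodes_visited, completed).  A real engine searches here."""
    nodes = 8 ** depth
    for i in range(nodes):
        if deadline is not None and i % 1024 == 0 and time.monotonic() >= deadline:
            return None, i, False          # aborted mid-iteration
    return "move_at_depth_%d" % depth, nodes, True

def pick_move(position, tt, node_budget=None, target_time=None):
    """Iterative deepening with the two stop rules under discussion.

    node_budget (deterministic): clear the hash, search only complete
    iterations, stop once a fixed node count is spent; same position,
    same move, every run, at some cost in strength.
    target_time (non-deterministic): keep the hash, abort mid-iteration
    when the wall clock expires; stronger, but clock jitter now decides
    exactly where the search is cut off."""
    best, spent, depth = None, 0, 1
    start = time.monotonic()
    if node_budget is not None:
        tt.clear()                         # identical hash state every run
    while True:
        if node_budget is not None:
            move, n, _ = search_to_depth(position, depth, tt)
            best, spent = move, spent + n
            if spent >= node_budget:       # boundary-only, clock-free stop
                break
        else:
            move, n, done = search_to_depth(position, depth, tt,
                                            deadline=start + target_time)
            if move is not None:
                best = move
            if not done:
                break
        depth += 1
    return best

tt = {}
print(pick_move("startpos", tt, node_budget=500_000))   # reproducible
print(pick_move("startpos", tt, target_time=0.2))       # jitter-dependent
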
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

bob wrote:I have always agreed that the _only_ type of correlation I see here is caused by the fact that the same programs are playing the same set of positions. Some want to turn this into a type of correlation where game a physically affects the outcome of game b, and I do not believe that is possible without some very intentional intervention by someone.
I think that it is possible that the correlation is the result of a third factor that affects both the outcome of game a and the outcome of game b, even when games a and b are from different positions.

I do not say that others do not have that correlation.
I have no reason to assume that a computer runs at exactly the same speed all the time.

I am no hardware expert, but I believe that the speed of the computer changes slightly over time, that the change is not random, and that you can have one day when the computer is, on average, slightly faster.
For example, it may be possible that the hardware is slightly slower when the temperature of the room is slightly higher (or the opposite).

That is enough to explain a possible correlation that may cause a big difference.

It means that the results do not show that you have a bug in your cluster, but rather that you did not repeat the same experiment twice.

The results clearly caused me to suspect that there was some bug, but without the games we cannot check it, and I agree that a bug is not the only possible explanation for the initial results.

Uri
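
Uri's hypothesis is at least testable. Here is a minimal sketch of such a test, assuming a fixed deterministic workload is a fair proxy for a fixed-depth search of one position; in practice, OS scheduling noise would have to be separated from any true hardware drift before concluding anything.

Code:

import time

def fixed_workload():
    """A fixed, deterministic CPU-bound task; a stand-in for, say, a
    fixed-depth search of one position."""
    x = 0
    for _ in range(2_000_000):
        x = (x * 1103515245 + 12345) & 0xFFFFFFFF
    return x

# Time the identical workload repeatedly.  If the machine really does run
# slightly faster at some times than at others (temperature, whatever),
# the samples should drift systematically rather than scatter randomly.
samples = []
for _ in range(10):
    t0 = time.perf_counter()
    fixed_workload()
    samples.append(time.perf_counter() - t0)

mean = sum(samples) / len(samples)
spread = (max(samples) - min(samples)) / mean
print("mean %.4fs, relative spread %.2f%%" % (mean, spread * 100))
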
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
When you say no duplicates, do you mean that no two games are the same, or do you mean only that the first game did not repeat?

Note that these two options are different.

Uri
"no duplicates" has only possible interpretation. for the set of 100 games, no two were the same.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:
bob wrote:I have always agreed that the _only_ type of correlation I see here is caused by the fact that the same programs are playing the same set of positions. Some want to turn this into a type of correlation where game a physically affects the outcome of game b, and I do not believe that is possible without some very intentional intervention by someone.
I think that it is possible that the correlation is the result of a third factor that affects both the outcome of game a and the outcome of game b, even when games a and b are from different positions.

I do not say that others do not have that correlation.
I have no reason to assume that a computer runs at exactly the same speed all the time.

I am no hardware expert, but I believe that the speed of the computer changes slightly over time, that the change is not random, and that you can have one day when the computer is, on average, slightly faster.
Absolutely does _not_ happen. Ask any good EE you know about how a PLL is used to generate a CPU clock frequency using a crystal oscillator that vibrates at _one_ frequency and doesn't vary enough that it can be measured in its effect on CPU instruction rate. That is as constant as it gets. Don't know where you come up with some of this stuff, but you need to stop drinking whatever it is that brings these bizarre ideas to the surface. Computers just do _not_ work like that.
For example, it may be possible that the hardware is slightly slower when the temperature of the room is slightly higher (or the opposite).
Does not happen, or at least does not happen to the point where the speed of the processor's instruction execution cycle varies by even a single tenth of a nanosecond, which is 0.0000000001 seconds.

That is enough to explain a possible correlation that may cause a big difference.

It means that the results do not show that you have a bug in your cluster, but rather that you did not repeat the same experiment twice.
If this were true, and that is a big IF at best, although I know that it does not happen, then _everyone_ would have that problem, and there would be absolutely nothing anyone could do about it.

The results clearly caused me to suspect that there was some bug, but without the games we cannot check it, and I agree that a bug is not the only possible explanation for the initial results.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
The performance penalty from searching only complete iterations is clearly less than 100 Elo (I would guess something near 20 Elo), and I believe that there are programmers who do not care about these small numbers and prefer deterministic results in testing so that they can reproduce everything. (My opinion is that it is better to sacrifice 20 Elo to be able to reproduce everything easily.)
I play so many games against other computers on ICC, and I see the analysis they whisper/kibitz when I am watching and we both turn it on. Even in CCT events as well. And _everybody_ that I played was producing output claiming that not all root moves were searched each time a move was played. Sometimes they started a new depth and got nothing back, sometimes they were in the middle, and perhaps very rarely they timed out when the iteration ended. But it is easy to watch 'em. I don't believe _anybody_ designs a program in any way other than to make it as strong as possible. And that includes me. I know how to make a perfectly deterministic parallel search, but with absolutely nowhere near the speedup I now produce. So non-determinism is accepted along with better performance. Ditto for time allocation and every other decision we make. But find me any _serious_ chess programmer who would say "sure, I will give up significant Elo just to make testing more deterministic (although no more accurate)."

You can be sure that not all programs at a level close to Crafty's exhibit the variability that you talk about.
When I first ran into this, I tried several: both Glaurungs, Fruit, GNU Chess, Crafty, Arasan 9/10, and a couple of others that I won't name but which I could get to run under Linux. And they _all_ had this issue. Yes, there could certainly be a program out there that terminates searches on iteration boundaries, that clears the hash between moves, that does anything else needed to minimize variability. But there certainly are not very many of 'em, because we all take any rating improvement we can get, and we aren't going to throw away 30 here, 20 there, since enough of that produces a pure patzer.



Not all programmers of programs at that level care much about playing strength, and you can be sure that there are people who do not care whether their program is 500 Elo weaker than Rybka or 480 Elo weaker.

I have not worked on Movei lately, but if I come back to it, then at a level more than 100 Elo weaker than Rybka I will certainly not care about a small improvement that makes results non-reproducible and makes it harder to find bugs, because I may see the program play a move that I cannot reproduce.

Uri
I'm the opposite, trying to take everything I can get. Most are...
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration; of course, on a ponder hit I may move in 0 seconds, not at the end of an iteration).

Clearing the hash clearly can help in finding bugs; it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and all you have is the game.

You know that the program made some mistake that you simply cannot reproduce in your first analysis attempts, because you cannot reproduce the hash state.

You will have a harder time catching the bug that caused the program to make the mistake.

And if you need more time to discover bugs, it means that you sacrifice future improvements.

Uri
That is an insane approach to designing the strongest possible chess program. Would you then choose not to add major blocks of evaluation code because those might be harder to debug than if you left them out? Even though the additions are not going to add 100 Elo, or even 20 Elo, do you add them or not? There is only one sane answer there, and it is the same sane answer to any question about sacrificing Elo for (possibly unnecessary) future debugging...
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

<snipped>
bob wrote:
does not happen, or at least does not happen to the point where the speed of the processor's instruction execution cycle varies by a single tenth of a nanosecond, which is .0000000001

I do not understand this part.
.0000000001 seconds relative to what?
.0000000001 seconds can be a small change if it is relative to 1 second, but can be a big change if it is relative to a clearly smaller time.

The fact that the change is less than a tenth of a nanosecond tells me nothing, because a change from 1 nanosecond to 1.09 nanoseconds is a relatively big change in speed, while a change from 1 second to 1 second and 0.09 nanoseconds is a tiny speed change.

Uri
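
To make the relative-versus-absolute point concrete, here is a worked example under an assumed, purely illustrative 3 GHz clock: one cycle then lasts about 0.33 ns, so a variation of 0.1 ns per cycle would be a speed change of roughly 30 percent, which would be enormous. Measured against a full second, the same 0.1 ns is one part in 10^10, which is negligible. The quoted bound only carries information once it is stated relative to the cycle time.
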
Fritzlein

Re: Correlated data discussion

Post by Fritzlein »

Fritzlein wrote:Let's take the heat off of the two particular positions by rotating the experiment ninety degrees. Instead of having two positions played 201 times each, let's have 201 positions played twice each. Then we measure whether the winner of the first play of each position is likely to be the winner of the second play of each position. If the playouts are uncorrelated, then there should be as many pairs of playouts with different winners as there are pairs of playouts with the same winner. If the two repeat playouts are at all correlated, then the winners will be the same more often than they are different. (If it is truly random, then the winners might even be different more often than they are the same, in which case the correlation will look negative, but one must suppose that is a fluke.)

Now if your 201 positions are a representative sample from your opening book, you might suppose that white has a 55% chance to win, ignoring draws for the moment. The chance that the results are the same is 0.55*0.55 + 0.45*0.45 = 50.5%, and the chance they are different is 0.55*0.45 + 0.45*0.55 = 49.5%. Aha! There is positive correlation since more are the same than different.
Whoops, I biffed that explanation. Having more the same than different doesn't necessarily mean they are correlated. If every position is 55% for white then we can have two test runs that are perfectly uncorrelated even though there will be more repeat winners than different winners. My bad. Where correlation actually creeps in is like in my first example where one position is 90% for white and the next is 10% for white in a sample of positions that are overall 55% for white. Or, more realistically, where one position is 65% for white and the next is 45% for white.

Getting back to Crafty vs. Fruit, let me assume that Fruit's average score across all positions is 60%. Now if every position is 60% for Fruit in repeated playouts, then there will be no correlation in repeated trials on the same position set. If, however, some positions are 75% for Fruit when replayed, others are 45% for Fruit when replayed, etc., then there will be correlation in repeated runs over the same position set (as hgm was saying). To eliminate that correlation we need to use each position only once (as hgm was not saying but I am).

Correlation is correlation, regardless of whether the exact reason for a particular position being merely 45% for Fruit is that white simply stands worse there, or that Fruit doesn't understand opposite-side castling so well, or that the clock doesn't jitter enough and the expected score would revert to 60% if we had enough randomness. I guess I need to back off my confident assertion that clock jitter isn't enough to "randomize" the outcome. What I should claim instead is that clock jitter isn't sufficient to make a series of playouts from one position behave like a series of playouts from a series of different positions chosen at random and used only once each. The effect of correlation on the statistical significance of Bob's original tests is potentially huge regardless of the source of the correlation, but I am more confident that correlation is present between repeat playouts of the same positions than I am confident that I know what causes it.
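
The corrected argument above is easy to check numerically. Here is a minimal simulation sketch using the win probabilities from Fritzlein's own examples; draws are ignored and all numbers are illustrative.

Code:

import random

random.seed(2008)

def same_winner_rate(position_probs, trials=2000):
    """Play every position twice per trial (draws ignored) and return how
    often the two playouts of a position produce the same winner."""
    same = total = 0
    for _ in range(trials):
        for p in position_probs:
            a = random.random() < p        # first playout: does white win?
            b = random.random() < p        # second playout of same position
            same += (a == b)
            total += 1
    return same / total

# Homogeneous case: every position is 55% for white.  Repeat playouts
# agree 0.55^2 + 0.45^2 = 50.5% of the time, yet are uncorrelated;
# exactly the trap described in the correction above.
print(same_winner_rate([0.55] * 201))                  # ~0.505

# Heterogeneous case: still 55% for white on average, but half the
# positions are 65% and half are 45%.  Repeat playouts now agree about
# 52.5% of the time; the excess over 50.5% is the correlation.
print(same_winner_rate([0.65, 0.45] * 100 + [0.55]))   # ~0.525
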
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 4 sets of data

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
The performance penalty from searching only complete iterations is clearly less than 100 Elo (I would guess something near 20 Elo), and I believe that there are programmers who do not care about these small numbers and prefer deterministic results in testing so that they can reproduce everything. (My opinion is that it is better to sacrifice 20 Elo to be able to reproduce everything easily.)
I play so many games against other computers on ICC, and I see the analysis they whisper/kibitz when I am watching and we both turn it on. Even in CCT events as well. And _everybody_ that I played was producing output claiming that not all root moves were searched each time a move was played. Sometimes they started a new depth and got nothing back, sometimes they were in the middle, and perhaps very rarely they timed out when the iteration ended. But it is easy to watch 'em. I don't believe _anybody_ designs a program in any way other than to make it as strong as possible. And that includes me. I know how to make a perfectly deterministic parallel search, but with absolutely nowhere near the speedup I now produce. So non-determinism is accepted along with better performance. Ditto for time allocation and every other decision we make. But find me any _serious_ chess programmer who would say "sure, I will give up significant Elo just to make testing more deterministic (although no more accurate)."

You can be sure that not all programs at a level close to Crafty's exhibit the variability that you talk about.
When I first ran into this, I tried several: both Glaurungs, Fruit, GNU Chess, Crafty, Arasan 9/10, and a couple of others that I won't name but which I could get to run under Linux. And they _all_ had this issue. Yes, there could certainly be a program out there that terminates searches on iteration boundaries, that clears the hash between moves, that does anything else needed to minimize variability. But there certainly are not very many of 'em, because we all take any rating improvement we can get, and we aren't going to throw away 30 here, 20 there, since enough of that produces a pure patzer.



Not all programmers of programs at that level care much about playing strength, and you can be sure that there are people who do not care whether their program is 500 Elo weaker than Rybka or 480 Elo weaker.

I have not worked on Movei lately, but if I come back to it, then at a level more than 100 Elo weaker than Rybka I will certainly not care about a small improvement that makes results non-reproducible and makes it harder to find bugs, because I may see the program play a move that I cannot reproduce.

Uri
I'm the opposite, trying to take everything I can get. Most are...
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration; of course, on a ponder hit I may move in 0 seconds, not at the end of an iteration).

Clearing the hash clearly can help in finding bugs; it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and all you have is the game.

You know that the program made some mistake that you simply cannot reproduce in your first analysis attempts, because you cannot reproduce the hash state.

You will have a harder time catching the bug that caused the program to make the mistake.

And if you need more time to discover bugs, it means that you sacrifice future improvements.

Uri
That is an insane approach to designing the strongest possible chess program. Would you then choose not to add major blocks of evaluation code because those might be harder to debug than if you left them out? Even though the additions are not going to add 100 Elo, or even 20 Elo, do you add them or not? There is only one sane answer there, and it is the same sane answer to any question about sacrificing Elo for (possibly unnecessary) future debugging...
Evaluation is different, because the results after adding evaluation terms can still be reproduced.

Uri