Observator bias or...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Alessandro Scotti

Re: Observator bias or...

Post by Alessandro Scotti »

H. G. makes a very good point that the games might not be independent... Now I'm going to stop for a while and make sure everything is OK before running another test.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Observator bias or...

Post by bob »

hgm wrote:
Alessandro Scotti wrote:I remember from testing with Kiwi that results over 100 games are very unreliable. It sometimes happens that a version gets a bad start but improves by the end of a long test. On the other hand, I had a version reach 64% after 100 games and finish with a disappointing 50% after 720 games... I will now increase the number to 800 and see if that brings some benefit (not that much is expected, though).
64% after 100 games between approximately equal engines is extreme: the standard error over 100 games should be 0.4/sqrt(100) = 4%, so a 14% deviation represents 3.5 sigma. This should happen on average only about once in ~4000 tries.

I noticed a very strange effect when I was testing uMax in self-play. The standard error over 100 games should be 4%, but when I played 1000 games between the same versions and looked at the scores of the ten individual 100-game runs, those results deviated on average much more from each other (and from the final average) than the calculated standard error would predict. This can only happen if the games are not independent! I indeed cannot exclude this, as all the games were played in a single run, each one starting from the random seed the previous game ended with. So with a bad randomizer, if a single game repeats because of an equal or very close seed at the start, the following game might repeat as well, destroying the independence of the games.

Whatever the cause, the effect was that the error in the win percentage was always a lot larger than you would expect based on the number of games.
My current testing methodology is to play 40 positions, once as Black and once as White, to do this 32 times (64 games per position), and to repeat the whole run against multiple opponents. This gives pretty stable results and allows me to compare two versions with reasonable reliability. Anything less is not enough, based on a few hundred thousand games of testing this. :)
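
In rough numbers, the error bars discussed above work out as follows (a minimal sketch, assuming a per-game standard deviation of about 0.4, as in the quote):

Code:

import math

def standard_error(games, sigma_per_game=0.4):
    """Standard error of the mean score over `games` games.
    A per-game standard deviation of ~0.4 is assumed, which is typical
    for roughly equal engines with a normal draw rate."""
    return sigma_per_game / math.sqrt(games)

# 100 games: ~4% standard error, so a 64% score is a 3.5-sigma outlier.
se_100 = standard_error(100)            # 0.04
sigmas = (0.64 - 0.50) / se_100         # 3.5

# To resolve differences at the 0.5%-score level (a few Elo points) with one
# standard error, you need sigma/sqrt(N) <= 0.005, i.e. N >= (0.4/0.005)^2.
n_needed = (0.4 / 0.005) ** 2           # 6400 games

print(f"SE over 100 games: {se_100:.3f} ({sigmas:.1f} sigma for a 64% score)")
print(f"Games needed for a 0.5% standard error: {n_needed:.0f}")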
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Observator bias or...

Post by bob »

I am working on an ICGA paper on this very topic. I'll make it electronically available once it has made it through the review process. There are lots of places where randomness can creep in, but the main one is timing. Timing is just not that accurate, and at today's NPS rates, a few milliseconds of variance on a move can make a new best move pop up and change the game outcome randomly.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Observator bias or...

Post by hgm »

Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.

For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generating all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of fewer than 100 games, yet give you the accuracy of 10,000 games that are subject to chance effects.
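
A sketch of how such an enumeration might be organized is below; the helpers candidate_moves() and game_result() are hypothetical placeholders, not an actual implementation of the idea:

Code:

def enumerate_realizations(position, engines, results, ply=0, max_ply=300):
    """Recursively walk every realization of a game in which the side to move
    might realistically pick any of several near-equal candidate moves.

    Hypothetical helpers assumed here:
      game_result(pos)             -> "1-0", "0-1", "1/2-1/2", or None if ongoing
      candidate_moves(engine, pos) -> the moves inside the engine's
                                      nondeterministic noise band (often just one)
      pos.play(move)               -> the position after the move
    """
    outcome = game_result(position)
    if outcome is not None or ply >= max_ply:
        results.append(outcome or "unfinished")
        return
    engine = engines[ply % 2]            # engines[0] made the first move
    for move in candidate_moves(engine, position):
        enumerate_realizations(position.play(move), engines, results, ply + 1)

# If the two versions are close, most positions have only one candidate move,
# so the full tree may hold well under 100 leaves, yet it covers every outcome
# that chance could have produced from the chosen start position.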
Eelco de Groot
Posts: 4561
Joined: Sun Mar 12, 2006 2:40 am

Re: Observator bias or...

Post by Eelco de Groot »

hgm wrote:
Alessandro Scotti wrote:I remember from testing with Kiwi that results over 100 games are very unreliable. It sometimes happens that a version gets a bad start but improves by the end of a long test. On the other hand, I had a version reach 64% after 100 games and finish with a disappointing 50% after 720 games... I will now increase the number to 800 and see if that brings some benefit (not that much is expected, though).
64% after 100 games between approximately equal engines is extreme: the standard error over 100 games should be 0.4/sqrt(100) = 4%, so a 14% deviation represents 3.5 sigma. This should happen on average only about once in ~4000 tries.

I noticed a very strange effect when I was testing uMax in self-play. The standard error over 100 games should be 4%, but when I played 1000 games between the same versions and looked at the scores of the ten individual 100-game runs, those results deviated on average much more from each other (and from the final average) than the calculated standard error would predict. This can only happen if the games are not independent! I indeed cannot exclude this, as all the games were played in a single run, each one starting from the random seed the previous game ended with. So with a bad randomizer, if a single game repeats because of an equal or very close seed at the start, the following game might repeat as well, destroying the independence of the games.

Whatever the cause, the effect was that the error in the win percentage was always a lot larger than you would expect based on the number of games.
Hello Harm,

Sorry if I am interpreting all this wrongly, but is it not more likely that this interdependence you saw between games played shortly before or after one another has to do with how the operating system assigns resources to the different processes (rather than with your random seed generator)? At any one time Windows, for instance, has to divide memory and CPU time among, on average, thirty processes on my PC, and this can never be handled so perfectly that two chess programs would get completely equal resources.

I'm no Windows expert at all, but even if you unload and reload the two chess programs after every game, I would think this could make matters even worse if, in the worst-case scenario, freeing memory and terminating processes is not done without leaving any traces.

In Ed's case I suppose he would prefer to test everything under DOS, but as this was about Chessbase matches I suspect it was all done under Windows? Rebel never ran perfectly under Windows, so I would expect that some imperfect assignment of memory and time could very well have occurred because of Windows imperfections, and not because of Chessbase bugs.

Maybe certain chess programs are more susceptible than others, and not cleaning up used memory etc. could degrade match results for some programs more than for others in the long run?

Well, that is just my first guess at one possible cause of what could have happened.

Regards, Eelco
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Observator bias or...

Post by bob »

hgm wrote:Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.

For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generating all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of fewer than 100 games, yet give you the accuracy of 10,000 games that are subject to chance effects.
A fixed number of nodes is absolutely worthless. To prove that to yourself, do the following. Play a match using the same starting position, where _both_ programs search a fixed number of nodes (say 20,000,000). Record the results. Then re-play, but have both search 20,010,000 nodes (10K more than before). Now look at the results. They won't be anywhere near the same. Which one is more correct? Answer: it's hopeless, as you take a small random sample (the games with 20M nodes per side) from a much larger set of random results, and you base your decisions on that? May as well flip a coin...

my upcoming ICGA paper will show just how horrible this is...
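
For illustration, a rough harness for the experiment described above might look like the sketch below, using the python-chess library; the engine paths, the starting position and the game count are placeholders, not anything taken from the thread:

Code:

import chess
import chess.engine

def fixed_node_match(path_a, path_b, fen, games, nodes):
    """Play `games` games from one position with both sides limited to `nodes`
    nodes per move, alternating colors. Returns engine A's total score."""
    score = 0.0
    with chess.engine.SimpleEngine.popen_uci(path_a) as a, \
         chess.engine.SimpleEngine.popen_uci(path_b) as b:
        for g in range(games):
            board = chess.Board(fen)
            white, black = (a, b) if g % 2 == 0 else (b, a)
            while not board.is_game_over(claim_draw=True):
                eng = white if board.turn == chess.WHITE else black
                played = eng.play(board, chess.engine.Limit(nodes=nodes))
                board.push(played.move)
            outcome = board.result(claim_draw=True)   # "1-0", "0-1", "1/2-1/2"
            a_was_white = (g % 2 == 0)
            if outcome == "1/2-1/2":
                score += 0.5
            elif (outcome == "1-0") == a_was_white:
                score += 1.0
    return score

# Same position, nearly identical node budgets; compare the two results:
# r1 = fixed_node_match("./engine_a", "./engine_b", chess.STARTING_FEN, 100, 20_000_000)
# r2 = fixed_node_match("./engine_a", "./engine_b", chess.STARTING_FEN, 100, 20_010_000)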
Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Observator bias or...

Post by Uri Blass »

bob wrote:
hgm wrote:Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.

For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generating all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of fewer than 100 games, yet give you the accuracy of 10,000 games that are subject to chance effects.
A fixed number of nodes is absolutely worthless. To prove that to yourself, do the following. Play a match using the same starting position, where _both_ programs search a fixed number of nodes (say 20,000,000). Record the results. Then re-play, but have both search 20,010,000 nodes (10K more than before). Now look at the results. They won't be anywhere near the same. Which one is more correct? Answer: it's hopeless, as you take a small random sample (the games with 20M nodes per side) from a much larger set of random results, and you base your decisions on that? May as well flip a coin...

my upcoming ICGA paper will show just how horrible this is...
I disagree.
Of course, if you do not have enough games the weaker program may win, but the advantage of a fixed number of nodes is that you do not need to worry about one program being slowed down by a significant factor, and you can do other things on the same computer at the same time without changing the result.

I use a fixed number of nodes in my tests.
I am not sure that every change is an improvement, because I do not have enough games, but hopefully the total result after many changes is an improvement.

Uri
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Observator bias or...

Post by Michael Sherwin »

Alessandro Scotti wrote:I seem to have reached a plateau with Hamsters where new features and even bug fixes hardly contribute any Elo to it.
This would bother me little if not for the fact that each and every test starts by feeding me the illusion of improvement, and only later falls back to the same old stats.
A 700-game test tournament might go like this:
- games 0-100: 53% (ehi, not bad)
- games 101-200: 54% (great!)
- games 201-300: 54% (yo-hoo!... goes out to buy champagne!)
- games 301-400: 53% (just a little glitch!)
- games 401-500: 52% (some bad luck here...)
- games 501-600: 51% (you son of a...)
- games 601-700: 50% (nooooooooooooooo!!!)
So is it just my imagination, or does this happen to you too?! One would think that 400 games would already provide a good approximation, yet... :(
Hi Alessandro,

I just saw your post. I have been busy giving and receiving grief in the ctf, and wow, was that a lot of fun. :roll: Luckily I was able to get my humour circuits powered up again for some light entertainment! :D

You should have sent me an email or a PM on this. You should know that I have played more games against Hamsters than anyone else, except possibly you. I have played many thousands of RomiChess vs Hamsters games.

Yes, I have noticed this behavior with RomiChess also, but almost exclusively when playing against Hamsters. Wild variation in results. One hundred-game match from fixed positions had 56 games with results different from the match before. All I had changed in my code between the two tests was an 8 to a 16.

There seems to be some random factor (?) in Hamsters that causes it to play differently a good part of the time. There are times when Romi just blows Hamsters off the board, and then from the same position in the next test there seems to be nothing Romi can do against Hamsters, and she is herself often blown away.

Ron has also noticed this about Hamsters. I am surprised that he did not mention it to you. We have both labeled Hamsters as too volatile for reliable testing!

If Rybka were as random as Hamsters seems to be, I would bet that Rybka would suffer a several-hundred-point drop in rating. Randomness limits an engine's upper ceiling, no matter how many improvements are made. Randomness will lead to a certain percentage of losses regardless of how strong the engine is.

This is just an educated guess on my part from playing so many games against Hamsters. I really do not know for sure. I came to the conclusion that you must really have put 5 different hamsters in your program, all with different personalities, and that one is selected randomly before each game!

Unless you put some randomness into Hamsters on purpose, I do not see how it can be there. I hope there is something in all I have said that might be of some help.

Mike
Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Observator bias or...

Post by Uri Blass »

Michael Sherwin wrote:
Alessandro Scotti wrote:I seem to have reached a plateau with Hamsters where new features and even bug fixes hardly contribute any Elo to it.
This would bother me little if not for the fact that each and every test starts by feeding me the illusion of improvement, and only later falls back to the same old stats.
A 700-game test tournament might go like this:
- games 0-100: 53% (ehi, not bad)
- games 101-200: 54% (great!)
- games 201-300: 54% (yo-hoo!... goes out to buy champagne!)
- games 301-400: 53% (just a little glitch!)
- games 401-500: 52% (some bad luck here...)
- games 501-600: 51% (you son of a...)
- games 601-700: 50% (nooooooooooooooo!!!)
So is it just my imagination, or does this happen to you too?! One would think that 400 games would already provide a good approximation, yet... :(
Hi Alessandro,

I just saw your post. I have been busy giving and receiving grief in the ctf, and wow, was that a lot of fun. :roll: Luckily I was able to get my humour circuits powered up again for some light entertainment! :D

You should have sent me an email or a PM on this. You should know that I have played more games against Hamsters than anyone else, except possibly you. I have played many thousands of RomiChess vs Hamsters games.

Yes, I have noticed this behavior with RomiChess also, but almost exclusively when playing against Hamsters. Wild variation in results. One hundred-game match from fixed positions had 56 games with results different from the match before. All I had changed in my code between the two tests was an 8 to a 16.

There seems to be some random factor (?) in Hamsters that causes it to play differently a good part of the time. There are times when Romi just blows Hamsters off the board, and then from the same position in the next test there seems to be nothing Romi can do against Hamsters, and she is herself often blown away.

Ron has also noticed this about Hamsters. I am surprised that he did not mention it to you. We have both labeled Hamsters as too volatile for reliable testing!

If Rybka were as random as Hamsters seems to be, I would bet that Rybka would suffer a several-hundred-point drop in rating. Randomness limits an engine's upper ceiling, no matter how many improvements are made. Randomness will lead to a certain percentage of losses regardless of how strong the engine is.

This is just an educated guess on my part from playing so many games against Hamsters. I really do not know for sure. I came to the conclusion that you must really have put 5 different hamsters in your program, all with different personalities, and that one is selected randomly before each game!

Unless you put some randomness into Hamsters on purpose, I do not see how it can be there. I hope there is something in all I have said that might be of some help.

Mike
I disagree with you.

It is easy to change a program's style by changing only the evaluation, without changing the playing strength significantly.

I know this from Movei, and I believe the same is true for Rybka; Vas could also release versions of Rybka that play differently with no big difference in playing strength (less than a 40 Elo difference).

Uri
Alessandro Scotti

Re: Observator bias or...

Post by Alessandro Scotti »

Michael Sherwin wrote:There seems to be some random factor (?) in Hamsters that causes it to play differently a good part of the time. There are times when Romi just blows Hamsters off the board, and then from the same position in the next test there seems to be nothing Romi can do against Hamsters, and she is herself often blown away.

Ron has also noticed this about Hamsters. I am surprised that he did not mention it to you. We have both labeled Hamsters as too volatile for reliable testing!
Ouch, Michael... this screams "bug" all the way! :-( I must say I haven't noticed _extremely_ wild fluctuations in my own tests, but yes, there is almost definitely something strange in my engine...

P.S. Version 0.2 is particularly buggy though!