more on engine testing
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: more on engine testing
Oh yes, I forgot. They don't know sarcasm on Betelgeuse...
How is your method flawed? Well, I can think of hundreds of ways, all of them much more likely than that the presented result is due to chance fluctuations in an identical experiment. After all, the probability of the latter is only 2 in a billion.
In fact I already mentioned one (a defective memory bit). Others are: one or more of the participating engines randomize their moves by explicitly reading the clock. Or unintentionally, by generating the Zobrist keys after initializing the PRNG from the clock (perhaps because they are also FRC engines, and need to be able to generate truly random starting positions). An uninitialized local variable used in the evaluation might coincide with an address used to store the most-significant byte of the time used to determine if the search should go on. (Far fetched? Well, it happened to me.) The order in which the engines are loaded into memory might matter, and be persistent over many games because of the way the OS memory manager works. The list is endless. And even if everything I could think of could be ruled out, the fact remains that you are standing there with a broken bottle, and no amount of talking on your part will make the bottle whole.
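A minimal sketch of how the clock-seeded Zobrist scenario can creep in (hypothetical illustration code, not taken from any particular engine):

Code: Select all

import random, time

# If the PRNG is seeded from the wall clock *before* the Zobrist keys are
# generated, every run produces a different key set, so transposition-table
# collisions (and occasionally the chosen move) differ from run to run even
# with identical settings.
random.seed(time.time())                 # clock-dependent seed
zobrist = [[random.getrandbits(64)       # one 64-bit key per piece type...
            for _ in range(64)]          # ...and square
           for _ in range(12)]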
So you want us to do your science for you? Give us more data, then! Give us the complete list of game results (all 50,000), with the time they were played and the number of the core they were running on. Then we can slice it up per engine, per time slice, per core, per position, and check the result histograms of the slices. So that we can do a Fourier transform on the results, to make any correlations show up in the power spectrum. Plenty of angles to attack the problem. Didn't they teach you anything in elementary school, other than throwing up your hands in the air?
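A rough sketch of that kind of slicing and spectral check, assuming the results were exported as one row per game; the file name and the column names (score, engine, core, time) are made up for illustration:

Code: Select all

import numpy as np
import pandas as pd

# Hypothetical export: one row per game, Crafty's score in [0, 1].
games = pd.read_csv("results.csv")   # columns: score, engine, core, time

# Result histograms per slice: per opponent engine and per core.
print(games.groupby("engine")["score"].agg(["mean", "count"]))
print(games.groupby("core")["score"].agg(["mean", "count"]))

# Power spectrum of the time-ordered results: independent games give a flat
# spectrum, while correlations between games show up as excess power.
scores = games.sort_values("time")["score"].to_numpy()
spectrum = np.abs(np.fft.rfft(scores - scores.mean())) ** 2
print(spectrum[:20])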

-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: more on engine testing
bob wrote: Again, what are you talking about? I presented _all_ the data that I had. I made four 800-game runs. The Elo was somewhat surprising, although I knew 800 games would not cut it anyway. But I wanted to quickly see how the Elo numbers would look. I then ran two "normal" runs, which finished right as I was making the original post. I had planned on reporting those later, but when I noticed they had finished, I just added 'em to the bottom rather than doing as I had said earlier in the post and reporting them in a later follow-on.

Why are you rambling on about the 800-game runs? We already concluded that there was absolutely nothing remarkable about the results of those, had we not?
As you call the other runs "normal", I suppose you have done more of those. You never did two of those with identical engines before? Because those are the results I would like to see: how much the results of two of your identical runs typically differ, by seeing a few hundred examples rather than just one. You never did two identical 25,000-game runs, they were all different? Then just artificially split them into a first half and a second half (timewise), and give us a few hundred of those 12,500-game results...
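For illustration, a sketch of that first-half/second-half comparison, assuming each run is available as a time-ordered list of Crafty's game scores (1, 0.5, 0):

Code: Select all

import numpy as np

def half_split(scores):
    """Mean score of the first and second (time-ordered) half of one run."""
    s = np.asarray(scores, dtype=float)
    half = len(s) // 2
    return s[:half].mean(), s[half:].mean()

# With independent games the two halves should rarely differ by more than a
# few times sqrt(0.25/N + 0.25/N); much larger differences would hint at a
# dependence between games, or a drift over time.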
-
- Posts: 290
- Joined: Mon Mar 13, 2006 5:23 pm
- Location: Québec
- Full name: Mathieu Pagé
Re: more on engine testing
I think there is a simple way to verify Dr. Hyatt's claim (note that I do believe his results are genuine).

All you need to do is take the results of the 50,000 games (the two 25,000-game matches) and mix them, so they are now in a completely random order. You then split the games into two 25,000-game samples and compute the engines' Elo for those games. Rinse and repeat a couple of times.

If Dr. Hyatt is right, the Crafty Elo of each sample should still vary "a lot".
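A sketch of that procedure, assuming the pooled 50,000 results are available as a flat list of Crafty's per-game scores (1, 0.5, 0); the score percentage of each half is used here as a simple stand-in for a full BayesElo run:

Code: Select all

import random

def shuffled_halves(scores, repeats=5):
    """Shuffle the pooled results, split into two halves, repeat a few times."""
    pool = list(scores)
    for _ in range(repeats):
        random.shuffle(pool)
        half = len(pool) // 2
        first, second = pool[:half], pool[half:]
        yield sum(first) / len(first), sum(second) / len(second)

# Note: randomly re-split halves of one fixed pool can only differ by ordinary
# sampling error, so the interesting comparison is against the difference
# between the two original (time-ordered) 25,000-game matches.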
Mathieu Pagé
mathieu@mathieupage.com
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: more on engine testing
mathmoi wrote: If Dr. Hyatt is right, the Crafty Elo of each sample should still vary "a lot".

I don't understand that conclusion. I would say that if two randomly picked 25,000-game samples from the 50,000 games differ too much, it shows you did not pick randomly. That and only that.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more on engine testing
bob wrote: These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.

Michael Sherwin wrote: Okay, I have one computer to test on, what do I do? Just give up and quit, I guess.

bob wrote: I actually am not sure, to be honest. It is a real problem, however...

krazyken wrote: I think the answer is to become less dependent on statistics, and less obsessed with Elo points. I would think the right way to go would be to identify problems in play and understand what causes each problem. Playing many, many games is more helpful if you can analyze those games and find out why you lost the ones you did.

How do you decide if something you add is good or bad? I can't count the number of times I have added something I was sure made the program better, only to find out later that it was worse... That's what I (and many others) are trying to measure with these test games...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more on engine testing
hgm wrote: Why are you rambling on about the 800-game runs? We already concluded that there was absolutely nothing remarkable about the results of those, had we not?

Depends on your definition of "nothing remarkable". If you mean "worthless for evaluating any sort of program change", then I will agree...

hgm wrote: As you call the other runs "normal", I suppose you have done more of those. You never did two of those with identical engines before? Because those are the results I would like to see: how much the results of two of your identical runs typically differ, by seeing a few hundred examples rather than just one. You never did two identical 25,000-game runs, they were all different? Then just artificially split them into a first half and a second half (timewise), and give us a few hundred of those 12,500-game results...

I run two or more "normal" runs using exactly the same versions every now and then, yes. These were the first two that had been passed through BayesElo, which was the reason I posted them. I'll post more over time, because the issue is important.
-
- Posts: 28354
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: more on engine testing
bob wrote: Depends on your definition of "nothing remarkable". If you mean "worthless for evaluating any sort of program change", then I will agree...

Well, BayesElo won't tell you anything that you would not immediately see from the scores. The Elo differences are all very small, and you are entirely in the linear range, so that 1% corresponds to 7 Elo.

I don't see what you are so hung up about. A sampling process has a statistical error associated with it, which can be accurately calculated using no other information than that the result of a single game is limited to the interval [0,1] and that the games are independent. This requires nothing more fancy than taking a square root. And if the difference you want to measure is smaller than that error, then yes, the result is of course worthless.

One doesn't have to play 2,400 games to come to that conclusion. The 2,400 games do more or less confirm that the error bars given by BayesElo are correct, though.

Worthless for evaluating changes smaller than 18*sqrt(2) Elo. That is a bit different from "any sort of".
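The arithmetic, as a small sketch; the only input is the [0,1] bound on a single game result, which caps the per-game standard deviation at 0.5, so these are upper bounds (draws reduce the per-game spread, which is presumably why BayesElo's reported bars come out somewhat tighter):

Code: Select all

from math import sqrt

def max_elo_error(n_games, sigmas=2.0):
    """Upper bound on the Elo error bar of one n-game match result."""
    se_score = 0.5 / sqrt(n_games)          # per-game sd is at most 0.5
    return sigmas * se_score * 100.0 * 7.0  # 1% ~ 7 Elo in the linear range

for n in (800, 25000):
    e = max_elo_error(n)
    # comparing two independent runs widens the error bar by sqrt(2)
    print(n, round(e, 1), round(e * sqrt(2), 1))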
Last edited by hgm on Fri Aug 01, 2008 10:46 pm, edited 1 time in total.
-
- Posts: 2851
- Joined: Wed Mar 08, 2006 10:01 pm
- Location: Irvine, CA, USA
Re: more on engine testing
mathmoi wrote: If Dr. Hyatt is right, the Crafty Elo of each sample should still vary "a lot".

hgm wrote: I don't understand that conclusion. I would say that if two randomly picked 25,000-game samples from the 50,000 games differ too much, it shows you did not pick randomly. That and only that.

That and nothing more. Quoth the raven, "nevermore"...

Or that there is something wrong with your, and BayesElo's, statistics.

Would such a test, if it failed to show the large swings he has been seeing, convince Bob that something is wrong with his cluster testing, even if he can't find the cause?
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: more on engine testing
hgm wrote: Oh yes, I forgot. They don't know sarcasm on Betelgeuse... How is your method flawed? Well, I can think of hundreds of ways, all of them much more likely than that the presented result is due to chance fluctuations in an identical experiment. After all, the probability of the latter is only 2 in a billion. In fact I already mentioned one (a defective memory bit).

This is ECC memory. Not going to happen for a single bit. We run a monthly sanity check on the entire cluster to exercise everything from memory to CPU to disks to Infiniband.

hgm wrote: Others are: one or more of the participating engines randomize their moves by explicitly reading the clock.

And exactly how would that make the result of one game dependent on the result of another game? Are you trying to say that a program with a random evaluation can't be evaluated statistically? There goes Monte Carlo analysis, I guess. In any case, none of them do that, because they all support opening books and it would not work with random hash keys. And random hash keys would not produce inter-game dependencies in any case.

hgm wrote: Or unintentionally, by generating the Zobrist keys after initializing the PRNG from the clock (perhaps because they are also FRC engines, and need to be able to generate truly random starting positions). An uninitialized local variable used in the evaluation might coincide with an address used to store the most-significant byte of the time used to determine if the search should go on. (Far fetched? Well, it happened to me.) The order in which the engines are loaded into memory might matter, and be persistent over many games because of the way the OS memory manager works. The list is endless. And even if everything I could think of could be ruled out, the fact remains that you are standing there with a broken bottle, and no amount of talking on your part will make the bottle whole.

You are assuming the bottle is broken, and wildly thrashing about trying to justify that. I'll point out that if my testing could somehow be biased by memory management, then _everybody's_ testing would be biased. So what would be the solution? As for program bugs, if the opponents I use have bugs, then everyone who is using those opponents, including the rating lists, will have the same problem. So what would be the solution?

Nothing you have suggested above would cause any sort of dependency between any two games. They would just increase the randomness of the behavior, and more games should make it settle down. Assuming computers conform to the Elo model, which I am certainly not convinced of.

hgm wrote: So you want us to do your science for you? Give us more data, then! Give us the complete list of game results (all 50,000), with the time they were played and the number of the core they were running on. Then we can slice it up per engine, per time slice, per core, per position, and check the result histograms of the slices. So that we can do a Fourier transform on the results, to make any correlations show up in the power spectrum. Plenty of angles to attack the problem. Didn't they teach you anything in elementary school, other than throwing up your hands in the air?

They taught me not to worry about some odd remote possibility if it is one that _everyone_ has to deal with in their testing. Do you believe my hardware operates under different physical laws than what you (or others) use for testing? Do you believe that running one set of games on a CPU core at 2.133 GHz and another set on a core at 2.155 GHz would make any difference? Can you not mix games run on different speeds of processors? (That is not happening here, but it shows that the idea is senseless in the extreme.)

To make a meaningful argument, you need to come up with something that is unique to _my_ testing methodology, that is flawed, and that does not cause a problem for others. I'll bet that, except for those of us using a controlled cluster like either of the ones I use, _nobody_ can produce this consistency of performance, because there are _no_ odd processes running. No email. No network pings, handshakes, and such. No crontab activity of any kind. Etc. So if anything, these clusters are _more_ consistent than any practical user testing facility, rather than less.