more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Dirt wrote:Unless you changed the default, BayesElo shows a 95% confidence range, or two sigma. So your 21 point change in Crafty (which was +/-4 in both runs) was a ten sigma change. Either BayesElo is calculating the ranges wrong or there was something different about the two runs. Or you have some freakish luck, which keeps happening over and over and ... play the lottery much?
This is stock BayesElo. The only thing different about the two runs is that one started at time T, the next started at time T+N, where N is how long it took the first run to complete. I've been describing this problem for quite a while. Note that the four runs I posted were four consecutive runs. I had queued up some 5000+ game matches, but our A/C in the cluster room had a problem during the night and the cluster was powered down to avoid overheating.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:I don't see any problem here. The results of the 800-game matches are all the same within the quoted statistical error. So of course one cannot conclude that one version is better than the other. I am not sure if the intervals quoted by BayesElo are 68% confidence or 95% confidence (i.e. 1 sigma or 2 sigma), but, when repeated often enough, of course you are bound to also find results deviating more than the quoted statistical uncertainty. In 200 tries, 10 results should lie outside 2 sigma. This is why it is called a 95% confidence interval, and not a 100% one.

So 800 games are not enough to show one of the versions to be better. In fact no number of games would be enough, as they are indeed equal. But if they were not equal, and one would score 90% on average against the opponents where the other would score only 10%, a dozen games each would already be enough to point out the better one with >99.9% confidence. (40%*sqrt(2/12) ~ 16%, so that would be a 5-sigma deviation.) Surprise: small differences are more difficult to measure than large differences...

In the 25,000-game test you know at least that it is extremely unlikely that the version with Elo=+2 would not be better than the one with -19. The uncertainty in the difference is sqrt(4*4+4*4) = 6, while the difference is 21. I guess the 6 is even a 2-sigma interval, after all (25,000 games give a standard deviation in the score of 40%/sqrt(25,000) = 0.25%, which translates to roughly 1.8 Elo points, while 4 was indicated). So you deviate by 6 sigma, which makes it 99.99...% certain that one of the versions is better.

But that was statistics. The question of course should be: "better at what?". And the answer is of course: "better at beating these particular 5 opponents starting from the Silver positions".

As not all starting positions give the same score expectation value (i.e. score after asymptotically many games) between the same opponents, there is a statistical error associated with sampling the positions as well. Analyze your data by starting position, and you will have a pretty good idea how large it is (40 positions are enough to give a good impression of the standard deviation of the entire population). Say it is 5% (optimistic, IMO...). Then, averaged over 40 positions, a 1-sigma uncertainty of 0.8% in score expectancy will remain. This would show up as 11 Elo points in the Elo bounds, when converted to units comparable to the BayesElo-given error bounds.

So the likelihood that the 'better' Crafty version is also better at scoring points against opponents picked from the entire population of Chess-playing entities, starting from arbitrary positions, is not nearly as large as you might think from the bounds given by BayesElo, as these bounds only apply to the statistical error in the game results, and not to the errors due to position selection and opponent selection. (Note that the latter also contributes a sampling error, as the score of one particular program against two different opponents will in general not be the same. And a sample of 5 is very, very small...)

So in short: no matter if you have 10,000, 25,000, 1,000,000 or 100,000,000 games, as long as you use only 5 opponents and 40 starting positions, the results will be pretty useless for determining which version would be better in general.
Here's the point that continually gets overlooked. If 800 games gives two rating ranges that do _not_ overlap, for two identical programs, then it is impossible to use 800-game matches to determine whether A' is better than A, when that change will, by definition, be small. The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all, without playing hundreds of thousands of games at a minimum.

The BayesElo confidence interval is 95%. Which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.

Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 out of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.

BTW the stuff about the 40 positions, or the 5 opponents, is way beside the point. No matter what 40 positions I choose (and it would seem to me that the smaller the number, the more stable the results), I ought to be able to produce a stable answer about those 40 positions; whether or not that carries over to other positions depends on how "general" those positions are (and the Silver positions are pretty generic/representative of opening positions). My point is this: if I can't play a long match against _one_ opponent, using 5 starting positions, and get similar results for each run (in terms of Elo), then adding more programs and more positions only increases the noise; it certainly doesn't reduce it.
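
To put rough numbers on the error bars being discussed, here is a small Python sketch. It assumes hgm's rules of thumb from above (a per-game standard deviation of about 40% of a point, roughly 7 Elo per 1% of score near an even score, and 2-sigma/95% BayesElo bounds); it is an illustration of the arithmetic, not output from BayesElo:

import math

SIGMA_GAME = 0.40   # assumed per-game standard deviation of the score, in points
ELO_PER_PCT = 7.0   # assumed Elo change per 1% change in score, near a 50% score

def elo_error_95(games):
    """Approximate 95% (2-sigma) error bar, in Elo, on a single match result."""
    score_sigma = SIGMA_GAME / math.sqrt(games)    # std. dev. of the match score fraction
    return 2.0 * score_sigma * 100.0 * ELO_PER_PCT

for n in (160, 800, 25000, 80000):
    print(f"{n:6d} games: +/- {elo_error_95(n):4.1f} Elo at 95%")

# Games per version needed before the 95% bar on the *difference* between two
# runs shrinks to about 1 Elo: solve 2*sqrt(2)*SIGMA_GAME/sqrt(N)*100*ELO_PER_PCT = 1.
n_needed = (2.0 * math.sqrt(2.0) * SIGMA_GAME * 100.0 * ELO_PER_PCT) ** 2
print(f"about {n_needed:,.0f} games per version to resolve a 1 Elo difference")

With those assumptions, 800 games gives roughly +/-20 Elo, 25,000 games roughly +/-3.5 Elo, and resolving a 1 Elo change needs on the order of 600,000 games per version, which is consistent both with the BayesElo bounds quoted above and with the "hundreds of thousands of games" figure.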
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:
hgm wrote:I don't see any problem here. The results of the 800-game matches are all the same within the quoted statistical error. So of course one cannot conclude that one version is better than the other. I am not sure if the intervals quoted by BayesElo are 68% confidence or 95% confidence (i.e. 1 sigma or 2 sigma), but, when repeated often enough, of course you are bound to also find results deviating more than the quoted statistical uncertainty. In 200 tries, 10 results should lie outside 2 sigma. This is why it is called a 95% confidence interval, and not a 100% one.

So 800 games are not enough to show one of the versions to be better. In fact no number of games would be enough, as they are indeed equal. But if they were not equal, and one would score 90% on average against the opponents where the other would score only 10%, a dozen games each would already be enough to point out the better one with >99.9% confidence. (40%*sqrt(2/12) ~ 16%, so that would be a 5-sigma deviation.) Surprise: small differences are more difficult to measure than large differences...

In the 25,000-game test you know at least that it is extremely unlikely that the version with Elo=+2 would not be better than the one with -19. The uncertainty in the difference is sqrt(4*4+4*4) = 6, while the difference is 21. I guess the 6 is even a 2-sigma interval, after all (25,000 games give a standard deviation in the score of 40%/sqrt(25,000) = 0.25%, which translates to roughly 1.8 Elo points, while 4 was indicated). So you deviate by 6 sigma, which makes it 99.99...% certain that one of the versions is better.

But that was statistics. The question of course should be: "better at what?". And the answer is of course: "better at beating these particular 5 opponents starting from the Silver positions".

As not all starting positions give the same score expectation value (i.e. score after asymptotically many games) between the same opponents, there is a statistical error associated with sampling the positions as well. Analyze your data by starting position, and you will have a pretty good idea how large it is (40 positions are enough to give a good impression of the standard deviation of the entire population). Say it is 5% (optimistic, IMO...). Then, averaged over 40 positions, a 1-sigma uncertainty of 0.8% in score expectancy will remain. This would show up as 11 Elo points in the Elo bounds, when converted to units comparable to the BayesElo-given error bounds.

So the likelihood that the 'better' Crafty version is also better at scoring points against opponents picked from the entire population of Chess-playing entities, starting from arbitrary positions, is not nearly as large as you might think from the bounds given by BayesElo, as these bounds only apply to the statistical error in the game results, and not to the errors due to position selection and opponent selection. (Note that the latter also contributes a sampling error, as the score of one particular program against two different opponents will in general not be the same. And a sample of 5 is very, very small...)

So in short: no matter if you have 10,000, 25,000, 1,000,000 or 100,000,000 games, as long as you use only 5 opponents and 40 starting positions, the results will be pretty useless for determining which version would be better in general.
Here's the point that continually gets overlooked. If 800 games gives two rating ranges that do _not_ overlap, for two identical programs, then it is impossible to use 800-game matches to determine whether A' is better than A, when that change will, by definition, be small. The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all, without playing hundreds of thousands of games at a minimum.

The BayesElo confidence interval is 95%. Which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.

Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 out of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
If the results are typical, the only conclusion is that the results are not independent.

If you write down the number of nodes per second for every move, then I suggest looking at the smallest numbers and the biggest numbers.

You may discover that, out of 1600 games, the 100 smallest numbers all happened in games 200-500.

Another possibility is that Crafty does not leave the computer's memory after every game; in that case some bug, like an uninitialized variable, could cause random changes in Crafty's behaviour during the match, so the games are dependent.

Uri
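
Uri's check is easy to automate. A minimal sketch in Python, assuming you have already extracted one average nodes-per-second figure per game (the data here is made up, since no log format is given in the thread):

import random

def slowest_game_numbers(nps_per_game, k=100):
    """Return the (1-based) game numbers of the k games with the lowest average NPS."""
    ranked = sorted(range(len(nps_per_game)), key=lambda i: nps_per_game[i])
    return sorted(i + 1 for i in ranked[:k])

# Made-up data standing in for a real 1600-game run.
random.seed(1)
nps = [random.gauss(1_500_000, 50_000) for _ in range(1600)]

slow = slowest_game_numbers(nps)
print(f"the 100 slowest games span game numbers {slow[0]}..{slow[-1]}")
# If the slowest games all cluster in one stretch (e.g. games 200-500), the game
# results were probably not independent: a slow node, thermal throttling, or a
# leftover process was affecting a block of consecutive games.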
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

krazyken wrote:Nice statistics! It's the kind of thing my stats prof would give the class. Take this data and build a case for X. Now take the same data and build a case against X.

Your two 25k matches are nice. Of course there is a possibility that the results could happen that way with no changes.

It is more likely that something did change, even if the software was controlled (learning off, same openings and opponents played in the same order, etc.). SMP algorithms are frequently non-deterministic, aren't they? It was the two SMP engines that varied the most.
Except that all programs use one cpu here. This makes the testing go faster and also lets me test against programs that don't have SMP, without exaggerating the difference due to a speed advantage. All my normal testing is just one cpu for Crafty vs one cpu for the opponent.

I am going to try, before long, to write a paper describing the SMP decisions I made in Crafty, and then run some matches on our bigger cluster where Crafty uses 1, then 2, then 4 and then 8 processors, to the opponent's one, to get an idea of how Elo is affected by SMP speedup. But not in these tests.

Then, of course, one should question whether the environmental variables were the same. Possibilities include CPU temp, background processes, available memory and things of that sort.
Our cluster is consistent in that regard. No processes run on a node except for mine; once I am given a node it is mine, and the queue scheduler won't run anything else on that node at all. There is always the issue of network traffic, although the nodes themselves are behind a "front end node" and do not see "outside world" network traffic. They do see "internal" traffic, but those runs were made with me using the entire cluster, so the only network traffic was writing result files to a humongous file-storage subsystem that sits on the InfiniBand network. But this is as constant as it can be, from that perspective. Memory is also constant at 12 gigs per node, although I use nowhere near 1/3 of that for these fairly fast games. There is just a tremendous amount of variability in measuring time on these systems, which lets the searches from the same position search different numbers of nodes on successive runs. And I have discovered that varying by just 100 nodes in each search is way more than enough to produce completely different games, even though that seems impossible.
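
The timing point is easy to demonstrate with a toy loop (this is not Crafty's code, just a sketch of the mechanism): a search that stops on elapsed time, and that only polls the clock every few thousand nodes, will report a different node count on every run, simply because of OS scheduling and timer jitter:

import time

def timed_search(budget_seconds, check_interval=4096):
    """Toy 'search' that counts nodes until a wall-clock budget expires."""
    nodes = 0
    start = time.perf_counter()
    while True:
        nodes += 1                        # stand-in for visiting one node
        if nodes % check_interval == 0:   # real engines also poll the clock only occasionally
            if time.perf_counter() - start >= budget_seconds:
                return nodes

counts = [timed_search(0.05) for _ in range(5)]
print(counts)   # the counts typically differ by far more than 100 nodes between runs

In a real engine, a different node count at any move can change the move played, and from that point on the whole game diverges.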
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

Note that it is important that all the engines are unloaded from memory after every game.

Otherwise even a bug that is not in Crafty may cause a difference in the results.

As an extreme example, a program may simply search to a fixed depth of 1 ply after something rare happens, and if it starts to search to fixed depth 1 at game number 500 you get a significantly different result relative to the case where it starts to search to fixed depth 1 at game number 100.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:Well, he said they were changed between the runs, didn't he?

Nope. Go back and re-read. _zero_ changes. Crafty-22.2 was _exactly_ the same for _all_ the runs... Which was my point.

If these are the same, I would not trust it: then there would be very strong evidence that you are not doing what you think you are doing (whatever the reason may be), as the probability of a 6-sigma fluke is virtually nil. Unless, of course, this was not a typical result, but the most extreme fluctuation selected from a million tries. Then it could become quite likely again. It is very easy to fool yourself with statistics, by selecting samples instead of sampling randomly. It will always be possible to quote examples of incredibly unlikely events, after enough tries.
OK, one more time. The 800-game matches were run back-to-back, on the same cluster, with nobody else using the cluster at all, much less using any single node randomly. The versions of _all_ programs were unchanged for the testing. Crafty was unchanged. The only thing that changed was whatever random effects influence time measurement, as these were all just simple 5+5 games, which probably turns into something around 15 seconds per move on average.

For the two 25,000-game matches, same deal. Nothing was changed. Same Crafty-22.2 version for both runs.

The first four runs were run consecutively. I didn't cherry-pick. If you want to test my honesty there, I'd propose the following: you choose the number of games per match and the number of trials to run, with the same opponents and the same positions (since I already have this set up and ready to run). We agree on an approximate start time, so that at the proper time you post the number of games and matches you want here. I'll be online so that I will see the conditions within a minute of your posting. I'll run the tests and post the results as soon as they are done. That way I would not be able to run 4N tests and pick the N results with the most variance.

The last two 25,000-game runs were run consecutively as well. I only ran them because the cluster was idle and I thought the data would be interesting, which it was.

There is absolutely no way to run a _more_ controlled experiment. This is on a cluster with 128 two-CPU nodes. I run 256 games at a time. No pondering. No SMP search. No other users even looking at the cluster nodes while the test is running. All nodes are identical with respect to O/S, InfiniBand network, memory, clock speed, etc. So yes, there is significant variance. No, it wasn't cherry-picked. And we can at least address this last point with the test conditions I gave above, if you suspect (for whatever reason) some sort of "foul play" going on.
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:Here's the point that continually gets overlooked. If 800 games gives two rating ranges that do _not_ overlap, for two identical programs, then it is impossible to use 800-game matches to determine whether A' is better than A, when that change will, by definition, be small.
This is not so much overlooked as rejected, because it is wrong. If two identical programs habitually produce rating ranges that do not overlap from identical tests (i.e. if they are outside the 95% confidence interval in more than 5% of the cases), it follows that the 800-game test match was not done properly (as the games were apparently not independent). It does not follow that 800-game matches cannot be used to measure a small program difference when conducted properly (i.e. with independent games).

You seem to be attempting to burn down your own straw man, though, as for the 800-game matches you show, the rating ranges did overlap. There was nothing suspicious about these results. They were distributed as one would expect from standard statistics.
The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all, without playing hundreds of thousands of games at a minimum.
I think they can, as they undoubtedly conduct the tests properly, keeping an eye on the statistics, and confirming that the games were indeed independent.
The BayesElo confidence interval is 95%. Which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.
I am sorry I have to say this for the umpteenth time, but this is pure nonsense. Non-determinism cannot explain these results. Only dependence of the game results can.
Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 out of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
As I said before, there was nothing suspicious about these 4 results. You have no case.
BTW the stuff about the 40 positions, or the 5 opponents, is way beside the point. No matter what 40 positions I choose (and it would seem to me that the smaller the number, the more stable the results), I ought to be able to produce a stable answer about those 40 positions; whether or not that carries over to other positions depends on how "general" those positions are (and the Silver positions are pretty generic/representative of opening positions). My point is this: if I can't play a long match against _one_ opponent, using 5 starting positions, and get similar results for each run (in terms of Elo), then adding more programs and more positions only increases the noise; it certainly doesn't reduce it.
Indeed. Like I said, it proves your testing method is flawed, because the games are apparently dependent. Not because there aren't enough games. The number of positions and opponents only shows that the effort to obtain more accurate results by playing more games is doomed from the outset, even if it were not executed in a flawed way. Even if it had converged, it would not converge to the correct number, and conclusions based on it can be wrong.
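
The check hgm keeps asking for can be written down in a few lines. A sketch in Python, comparing the observed scatter of repeated match scores against the scatter predicted for independent games; the 0.40 per-game standard deviation is the same assumed rule of thumb as above, and the four score fractions are hypothetical:

import math

def expected_sigma(games_per_match, per_game_sigma=0.40):
    """Predicted std. dev. of the match score fraction if the games are independent."""
    return per_game_sigma / math.sqrt(games_per_match)

def observed_sigma(score_fractions):
    """Sample std. dev. of the score fractions of the repeated matches."""
    n = len(score_fractions)
    mean = sum(score_fractions) / n
    return math.sqrt(sum((s - mean) ** 2 for s in score_fractions) / (n - 1))

# Hypothetical example: four 800-game matches between identical versions.
scores = [0.512, 0.496, 0.533, 0.471]
obs, exp = observed_sigma(scores), expected_sigma(800)
print(f"observed sigma {obs:.4f}, expected {exp:.4f}, ratio {obs / exp:.2f}")
# A ratio well above 1 (over many repeats, not just four) means the games were
# not independent; a ratio near 1 means the spread is just normal statistics.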
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: more on engine testing

Post by Michael Sherwin »

bob wrote:These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
Okay, I have one computer to test on. What do I do? Just give up and quit, I guess.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: more on engine testing

Post by Dirt »

bob wrote:This is stock BayesElo. The only thing different about the two runs is that one started at time T, the next started at time T+N where N is how long it took the first run to occur. I've been describing this problem for quite a while. Note that the four runs I posted were four consecutive runs. I had queued up some 5000+ game matches, but our A/C in the cluster room had a problem during the night and the cluster was powered down to avoid overheating.
Time? Do any of these programs use the time of day to seed a random number generator? But that's a silly question; you would have already checked to see if that was a problem.

Something changes, or each of the runs would be exactly the same. It would be reasonable to expect that those differences are random from game to game, but your results suggest that they aren't for some reason.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:Well, if the Crafty version was the same, and the other engines were as well, it just shows one of two things:

1) the test method is completely broken.
2) the test result was selected as the most atypical case from a huge database of results.
Or it might show that you are simply not _reading_ what I wrote. The four runs, and then the two runs, were made consecutively and were the only runs I made (I have started playing with BayesElo as a method of measuring change, rather than just raw game totals, to see if it provides any better answers). So there was no cherry-picking, and in another post I gave you the option of testing this for yourself. I ran the four 160-game matches to see what BayesElo would say about the results. And once I saw the variability, I decided to run two full (25,000-game) runs to see if it was any better. It wasn't...

You'd have to explain how the test method is "completely broken". I run on a cluster of 128 dual-CPU nodes. Nobody else is running while my test is running, verified by a monitor program that runs on the "head" (you can't get into the cluster without first logging into the "head"). All nodes are identical. All run a "minimal Linux" system (without all the silly daemons that are worthless in this environment). So I don't see what can be "broken", except for the conception that this kind of result can't happen. It can, and does, happen all the time...


Case 1 could occur, for example, because one of the machines you had been using for running the test had a defective memory bit, which reduced the playing strength of the engine using that memory, and Crafty was suffering from this for a large number of games in a row (because it stayed loaded between games), while in another test one of the opponents was similarly suffering. If the distribution of 25,000-game results does have a standard deviation large enough that results like the reported one are likely, it proves that the results of individual games are not independent. No matter how unlikely any mechanism you could imagine to cause this is (supposing you made an effort to exclude the likely causes), it is still astronomically more likely that this is the case than that the reported result would be due to chance.

Case 2 I already discussed. If you try something a million times, you will almost certainly (well, with 1-1/e probability, actually) get at least one 'one-in-a-million' fluke. This is actually why they call it a one-in-a-million fluke. So it would be silly to start bragging about how lucky / unlucky you were.

So the moral lesson is this: believe the result distributions you see. If they do not match the normal distribution that is predicted for independent games, you know your test setup and/or method is flawed, and the width of the distribution might give you a pointer as to what caused the problem. (The effective number of independent games will tell you how big the correlated clusters of results were, and that might, for example, correspond to the number of games played on each machine, or against each opponent.)
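
hgm's "effective number of independent games" can be estimated directly from that width. A sketch, with the same assumed 0.40 per-game standard deviation and a hypothetical observed scatter:

def effective_games(observed_match_sigma, per_game_sigma=0.40):
    """Games-worth of independent information implied by the observed scatter
    of repeated match score fractions."""
    return (per_game_sigma / observed_match_sigma) ** 2

# Hypothetical: repeated 25,000-game matches whose scores scatter with a standard
# deviation of 0.75% (three times the 0.25% expected for independent games).
n_eff = effective_games(0.0075)
cluster = 25000 / n_eff
print(f"effective sample size ~{n_eff:,.0f} games; correlated clusters of ~{cluster:.0f} games")

The cluster size is the kind of pointer mentioned above: if it happens to match the number of games played per node, or per opponent, that tells you where the dependence is coming from.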