more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Well, sorry for being slow then. :roll: Can you explain to me exactly what your issue is with these first 4 runs?

What I see is ratings of -33, -18, -17 and -12, with a 2-sigma error of 18. So which rating ranges, according to you, did not overlap? The average is -20, and the deviations from that average are -13, +2, +3 and +8. Which of those numbers is larger than 18? To me it seems only one of them is larger than 9 (= 1 sigma). Which is about what you expect: 32% of the results should lie outside the interval (-sigma, +sigma).

So explain to me what I overlooked.
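
A quick numeric restatement of that check, as a small sketch in C (the four Crafty ratings and the +/- 18 two-sigma bound are copied from the BayesElo tables quoted later in the thread):

Code:

#include <stdio.h>
#include <math.h>

int main(void) {
    /* Crafty's Elo from the four 800-game runs quoted below */
    double elo[4] = {-18, -17, -12, -33};
    double two_sigma = 18.0;              /* BayesElo's "+/- 18", taken as a 2-sigma bound */
    double sigma = two_sigma / 2.0;

    double mean = 0.0;
    for (int i = 0; i < 4; i++) mean += elo[i];
    mean /= 4.0;

    printf("mean = %.1f\n", mean);
    for (int i = 0; i < 4; i++) {
        double dev = elo[i] - mean;
        printf("run %d: deviation %+5.1f  (%.1f sigma)\n", i + 1, dev, fabs(dev) / sigma);
    }
    return 0;
}

Only the -13 deviation exceeds one sigma, which is roughly what one expects from four samples.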
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:You'd have to explain how the test-method is "completely broken".
No, I don't. If I see a pile of glass fragments lying on the floor, I can conclude that the bottle is broken without having to explain how it got there. Seeing the fragments justifies the conclusion that it is a broken bottle no matter what. And you show us the fragments. (For the 25,000-game runs, that is. The 800-game runs look like a perfectly OK bottle to me. :lol: )

It is for you to figure out why you did not succeed in playing independent games. It is your cluster, after all, so you fix it! If you were to offer me equal time on it, I would of course be willing to help you out... :roll:
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

BTW, let me add one more line for each test that I removed for brevity before, showing the date/time at which each run finished:

Code:

Wed Jul 30 19:38:08 CDT 2008
time control = 5+5
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   121   42   41   160   68%   -18   17%
   2 Glaurung 1.1 SMP        61   42   41   160   60%   -18   13%
   3 Fruit 2.1               49   41   40   160   59%   -18   15%
   4 opponent-21.7           13   38   38   159   55%   -18   33%
   5 Crafty-22.2            -18   18   18   799   47%     4   19%
   6 Arasan 10.0           -226   42   45   160   23%   -18   18%
Wed Jul 30 22:00:07 CDT 2008
time control = 5+5
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    81   42   41   160   63%   -17   16%
   2 opponent-21.7           61   38   38   159   62%   -17   33%
   3 Glaurung 1.1 SMP        46   42   41   160   58%   -17   13%
   4 Fruit 2.1               35   40   40   160   57%   -17   19%
   5 Crafty-22.2            -17   18   18   799   47%     3   19%
   6 Arasan 10.0           -205   42   45   160   26%   -17   16%
Thu Jul 31 00:27:13 CDT 2008
time control = 5+5
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   113   43   41   160   66%   -12   12%
   2 opponent-21.7           73   39   38   159   63%   -12   32%
   3 Fruit 2.1               21   41   40   160   54%   -12   15%
   4 Crafty-22.2            -12   18   18   799   48%     2   18%
   5 Glaurung 1.1 SMP       -35   41   41   160   47%   -12   11%
   6 Arasan 10.0           -161   41   43   160   30%   -12   18%
Thu Jul 31 03:10:15 CDT 2008
time control = 5+5
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   131   45   42   160   70%   -33   10%
   2 Fruit 2.1               64   41   40   160   63%   -33   19%
   3 Glaurung 1.1 SMP        25   41   40   160   58%   -33   15%
   4 opponent-21.7           13   37   37   160   57%   -33   36%
   5 Crafty-22.2            -33   18   18   800   45%     7   19%
   6 Arasan 10.0           -199   42   44   160   29%   -33   15%
These runs take about 2.5 hours each, and don't take the entire cluster since there are only 40 positions and 5 opponents, giving 200 individual matches of 4 games each. The cluster is busy for about an hour and a half or so, then the load quickly drops off since the games are run in groups of four for efficiency when running large numbers of games. Notice the times, which give an idea of how long a run takes (the timestamp is taken at the end of the run, when the statistics are gathered and saved into the results file)...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:Note that it is important that all the players leave the memory after every game.

Otherwise even a bug that is not in Crafty may cause a difference in the results.

As an extreme example, a program may simply search to a fixed depth of 1 ply after something rare happens, and if it starts searching to fixed depth 1 at game number 500 you get a significantly different result than if it starts doing so at game number 100.

Uri
This is how the tests are run. I have a C "referee" program I wrote years ago. It forks two processes, each process executes an engine for one game, the processes are terminated, and then this is repeated. A single instance of the referee plays just one position, between just two opponents, for some specified number of games (four in this case). So there are no persistent hash issues or anything else: each program plays exactly one game and is then terminated and restarted from scratch.

However, even if this weren't the case, the results should be identical if the match is played the same way each time...
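
For illustration, here is a minimal sketch of the kind of per-game referee loop described above. This is not Bob's actual program: the engine paths are placeholders, and the move relaying and game adjudication are omitted.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Start a fresh copy of each engine for every game, play one game, and
       wait for both processes to exit, so nothing (hash tables, learning,
       stray state from a bug) can persist from one game to the next. */
    const char *engine[2] = {"./engine_a", "./engine_b"};  /* placeholder paths */
    const int games = 4;                                   /* games per starting position */

    for (int g = 0; g < games; g++) {
        pid_t pid[2];
        for (int e = 0; e < 2; e++) {
            pid[e] = fork();
            if (pid[e] == 0) {                     /* child: run one engine for one game */
                execl(engine[e], engine[e], (char *)NULL);
                perror("execl");                   /* reached only if the engine can't start */
                _exit(1);
            }
        }
        /* ... relay moves between the two engines over pipes and adjudicate
           the game here (omitted in this sketch) ... */
        for (int e = 0; e < 2; e++)
            waitpid(pid[e], NULL, 0);              /* both engine processes are gone after the game */
        printf("game %d finished; fresh engine processes next game\n", g + 1);
    }
    return 0;
}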
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
bob wrote:
hgm wrote:I don't see any problem here. The results of the 800-game matches are all the same within the quoted statistical error. So of course one cannot conclude that one version is better than the other. I am not sure if the intervals quoted by BayesElo are 68% confidence or 95% confidence (i.e. 1 sigma or 2 sigma), but, when repeated often enough, of course you are bound to also find results deviating more than the quoted statistical uncertainty. In 200 tries, 10 results should lie outside 2 sigma. This is why it is called a 95% confidence interval, and not a 100%.

So 800 games are not enough to show one of the versions to be better. In fact no number of games would be enough, as they are indeed equal. But if they were not equal, and one would score 90% on average against the opponents where the other would score only 10%, a dozen games each would already be enough to point out the better one with >99.9% confidence. (40%*sqrt(2/12) ~ 16%, so that would be a 5-sigma deviation.) Surprise! Small differences are more difficult to measure than large differences...

In the 25,000-games test you know at least that it is extremely unlikely that the version with Elo=+2 would not be better than that with -19. The uncertainty in the difference is sqrt(4*4+4*4) = 6, the difference is 21. I guess the 6 is even a 2-sigma interval, after all (25,000 games give a standard deviation in the score of 40%/sqrt(25,000) = 0.25%, which translates to roughly 1.8 Elo points, while 4 was indicated). So you deviate by 6 sigma, which makes it 99.99...% certain that one of the versions is better.

But that was statistics. The question of course should be: "better at what?". And the answer is of course: "better at beating these particular 5 opponents starting from the Silver positions".

As not all starting positions give the same score expectation value (i.e. score after asymptotically many games) between the same opponents, there is a statistical error associated with sampling the positions as well. Analyze your data by starting position, and you will have a pretty good idea how large it is (40 positions are enough to give a good impression of the standard deviation of the entire population). Say it is 5% (optimistic, IMO...). Then, averaged over 40 positions, a 1-sigma uncertainty of 0.8% in score expectancy will remain. This would show up as 11 Elo points in the Elo bounds, when converted to units comparable to the BayesElo-given error bounds.

So the likelihood that the 'better' Crafty version is also better at scoring points against opponents picked from the entire population of chess-playing entities, starting from arbitrary positions, is not nearly as large as you might think from the bounds given by BayesElo, as these bounds only apply to the statistical error in the game results, and not to the errors due to position selection and opponent selection. (Note that the latter also contributes a sampling error, as the score of one particular program against two different opponents will in general not be the same. And a sample of 5 is very, very small...)

So in short: no matter if you have 10,000, 25,000, 1,000,000 or 100,000,000 games, as long as you use only 5 opponents and 40 starting positions, the results will be pretty useless for determining which version would be better in general.
Here's the point that continually gets overlooked. If 800 games give two rating ranges that do _not_ overlap, for two identical programs, then it is impossible to use 800-game matches to determine whether A' is better than A, when that change will, by definition, be small. The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all without playing hundreds of thousands of games at a minimum.
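
As a rough illustration of why that many games are needed, here is a back-of-the-envelope sketch in C. It assumes a per-game score standard deviation of about 0.4 (the 40% figure used elsewhere in this thread) and calls a difference resolved when it is twice the standard error of the measured score, roughly a 95% criterion; both choices are assumptions for illustration, not anyone's official methodology.

Code:

#include <stdio.h>
#include <math.h>

int main(void) {
    const double per_game_sd   = 0.4;                  /* assumed score SD of a single game */
    const double score_per_elo = log(10.0) / 1600.0;   /* slope of the logistic Elo curve at 50% */

    for (double d = 1.0; d <= 16.0; d *= 2.0) {
        double score_shift = d * score_per_elo;        /* expected score change for a d-Elo edge */
        double games = pow(2.0 * per_game_sd / score_shift, 2.0);
        printf("%4.0f Elo difference: roughly %8.0f games\n", d, games);
    }
    return 0;
}

For a 1-Elo difference this comes out around 300,000 games, and the requirement falls off with the square of the difference, which is why large improvements are easy to confirm and small ones are not.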

The BayesElo confidence interval is 95%. Which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.

Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 out of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.
If the results are typical the only conclusion is that the results are not independent.

If you record the number of nodes per second for every move, then I suggest looking for the smallest numbers and the biggest numbers.
Already done. No _significant_ variance. We did this early on, as it was our first guess, but it was not the case. The "nodes" count varies significantly here and there, because one time a program has just enough time to fail low, and the next time the game is played it doesn't quite search long enough to find it. But the NPS is not varying at all from Crafty's perspective. Note that I do some serious rounding to report numbers like 2.1M nps. But if you'd like, I can certainly grab all the NPS values from two different runs and post them side-by-side.
You may discover that out of 1600 games the 100 smallest numbers all happened at games 200-500.

Another possibility is that Crafty does not leave the memory of the computer after every game, and in that case some bug like an uninitialized variable can cause a random change in the behaviour of Crafty during the match, so the games are dependent.

Uri
I already said this is not happening: the referee program starts a new copy of each engine for each game, precisely to avoid that. Again, we thought this might be an issue when this testing variability first came up. It wasn't.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Dirt wrote:
bob wrote:This is stock BayesElo. The only thing different about the two runs is that one started at time T, the next started at time T+N, where N is how long the first run took. I've been describing this problem for quite a while. Note that the four runs I posted were four consecutive runs. I had queued up some 5000+ game matches, but our A/C in the cluster room had a problem during the night and the cluster was powered down to avoid overheating.
Time? Do any of these programs use time of day to seed a random number generator? But that's a silly question, you would have already checked to see if that was a problem.

Something changes, or each of the runs would be exactly the same. It would be reasonable to expect that those differences are random from game to game, but your results suggest that they aren't for some reason.
What is changing is the time per move, because there is jitter in the normal clock function. Significant jitter. That means successive games from the same starting position will have slightly different node counts for each search, which can significantly alter the search results. I do not know about the random-seed issue, but as far as I know, none of the programs have any sort of randomness in the evaluation or search. These are just stock versions of the programs given in the output, unchanged by me.
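
To illustrate the effect, here is a toy C sketch (not Crafty code): it just counts iterations until a fixed wall-clock budget expires, and successive trials with an identical budget still complete different amounts of work because of clock granularity and system noise. In a real engine the analogous node-count difference can be enough to flip a fail-low or change the move played.

Code:

#include <stdio.h>
#include <sys/time.h>

/* Current wall-clock time in milliseconds. */
static double now_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

int main(void) {
    const double budget_ms = 100.0;            /* same "time per move" for every trial */
    for (int trial = 1; trial <= 5; trial++) {
        double start = now_ms();
        long nodes = 0;
        while (now_ms() - start < budget_ms)
            nodes++;                           /* stand-in for searching one node */
        printf("trial %d: %ld iterations in %.0f ms\n", trial, nodes, budget_ms);
    }
    return 0;
}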
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:A while back I mentioned how difficult it is to draw conclusions about relatively modest changes in a chess program, requiring a ton of games to get usable comparisons. Here is a sample to show that in a way that is pretty easy to understand.

First, I played crafty vs 5 other opponents, including an older 21.7 version. The version I am testing here is not particularly good yet, representing some significant "removals" from the evaluation. So the results are not particularly interesting from that perspective. The 5 opponents were played on 40 starting positions, playing 4 rounds for each position, alternating colors. So a total of 800 games per match, and I am giving 4 consecutive match results, all the same opponents, all played at a time control of 5 + 5 (5 minutes on clock, 5 seconds increment added per move). I lost a game here and there due to data corruption on our big storage system, so some of the matches show 799 rather than 800 games because once in a while the PGN for the last game would be somehow corrupted (a different issue).

I ran these 800-game matches through Remi's BayesElo. You can look at the four sets of results, but imagine that in each of those tests, crafty-22.2 was a slightly different version with a tweak or two added. Which of the four looks the best? And then realize that all programs are identical for the 4 matches. How would one reliably draw any conclusion from a match containing only 800 games, since the error bar is significant and the variability is even more significant? First, the data:

Code:

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   121   42   41   160   68%   -18   17%
   2 Glaurung 1.1 SMP        61   42   41   160   60%   -18   13%
   3 Fruit 2.1               49   41   40   160   59%   -18   15%
   4 opponent-21.7           13   38   38   159   55%   -18   33%
   5 Crafty-22.2            -18   18   18   799   47%     4   19%
   6 Arasan 10.0           -226   42   45   160   23%   -18   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    81   42   41   160   63%   -17   16%
   2 opponent-21.7           61   38   38   159   62%   -17   33%
   3 Glaurung 1.1 SMP        46   42   41   160   58%   -17   13%
   4 Fruit 2.1               35   40   40   160   57%   -17   19%
   5 Crafty-22.2            -17   18   18   799   47%     3   19%
   6 Arasan 10.0           -205   42   45   160   26%   -17   16%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   113   43   41   160   66%   -12   12%
   2 opponent-21.7           73   39   38   159   63%   -12   32%
   3 Fruit 2.1               21   41   40   160   54%   -12   15%
   4 Crafty-22.2            -12   18   18   799   48%     2   18%
   5 Glaurung 1.1 SMP       -35   41   41   160   47%   -12   11%
   6 Arasan 10.0           -161   41   43   160   30%   -12   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   131   45   42   160   70%   -33   10%
   2 Fruit 2.1               64   41   40   160   63%   -33   19%
   3 Glaurung 1.1 SMP        25   41   40   160   58%   -33   15%
   4 opponent-21.7           13   37   37   160   57%   -33   36%
   5 Crafty-22.2            -33   18   18   800   45%     7   19%
   6 Arasan 10.0           -199   42   44   160   29%   -33   15%
Notice first that _everybody_ in the test is getting significantly different results each match. The overall order (with the exception of Glaurung 2 which stays at the top) flips around significantly.

Now does anyone _really_ believe that 800 games are enough? Later I will show some _much_ bigger matches as well, showing the same kind of variability. Here are two quickies that represent 25,000 games per match for two matches, just for starters (same time control):

Code:

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   123    8    8  5120   66%     2   15%
   2 Fruit 2.1               38    8    7  5119   55%     2   19%
   3 opponent-21.7           28    7    7  5119   54%     2   34%
   4 Crafty-22.2              2    4    4 25597   50%     0   19%
   5 Glaurung 1.1 SMP         2    8    8  5120   50%     2   14%
   6 Arasan 10.0           -193    8    9  5119   26%     2   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   118    8    8  5120   67%   -19   13%
   2 Fruit 2.1               42    8    8  5120   58%   -19   17%
   3 opponent-21.7           32    7    7  5115   58%   -19   36%
   4 Glaurung 1.1 SMP        20    8    8  5120   55%   -19   12%
   5 Crafty-22.2            -19    4    4 25595   47%     4   19%
   6 Arasan 10.0           -193    8    8  5120   28%   -19   16%
The question you want to answer from the above is this: crafty-22.2 in the first run was slightly modified for the second run. Was the change good or bad? How sure are you? Then I will add that crafty-22.2 for _both_ runs was identical. Now which one is better? :) There is a 21 Elo difference between the two. The first result says 2 +/- 4, while the second says -19 +/- 4. The ranges don't even overlap. Which points out that this kind of statistic is good for the sample under observation, but not necessarily representative of the total population of potential games, without playing a _lot_ more games. Some would say that the second match says crafty is somewhere between -15 and -23. Which is OK. But then what does the first of the bigger matches say? :)

"things that make you go hmmm......."

Looking at your data (and I am not talking about the 800 games but about the last table of more than 20,000 games), it seems that Crafty is not the same.

The difference in rating for Crafty is bigger than the difference in rating for every one of the opponents, in spite of the fact that Crafty played more games.

All opponents scored better against the latest Crafty.

Glaurung 2-epsilon/5 66%->67%
Fruit 2.1 55%->58%
opponent-21.7 54%->58%
Glaurung 1.1 SMP 50%->55%
Arasan 10.0 26%->28%

Uri
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

And I don't believe for a minute that these are just the first two 25,000-game runs he ever tried, and that Bob would also have made this post if they had been dead-on the same score. A 5+5 game lasts about 20 min, so 25,000 of them is 347 days on a single core. Seems to me that

1) Bob is in this business for more than one year
2) he uses significantly more than a single core

So he must have made many such runs before?

And that of course would raise the question: why did he post this time, and why didn't he post any of these earlier runs? Not enough difference in those cases to be noteworthy? This has the words 'selected data' stamped all over it!
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: more on engine testing

Post by Dirt »

hgm wrote:And I don't believe for a minute that these are just the first two 25,000-game runs he ever tried, and that Bob would also have made this post if they had been dead-on the same score. A 5+5 game lasts about 20 min, so 25,000 of them is 347 days on a single core. Seems to me that

1) Bob is in this business for more than one year
2) he uses significantly more than a single core

So he must have made many such runs before?

And that of course would raise the question: why did he post this time, and why didn't he post any of these earlier runs? Not enough difference in those cases to be noteworthy? This has the words 'selected data' stamped all over it!
Even as the worst consecutive runs he's ever had, those last two runs would be surprising.
Uri Blass
Posts: 10819
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

hgm wrote:And I don't believe for a minute that these are just the first two 25,000-game runs he ever tried, and that Bob would also have made this post if they had been dead-on the same score. A 5+5 game lasts about 20 min, so 25,000 of them is 347 days on a single core. Seems to me that

1) Bob is in this business for more than one year
2) he uses significantly more than a single core

So he must have made many such runs before?

And that of course would raise the question: why did he post this time, and why didn't he post any of these earlier runs? Not enough difference in those cases to be noteworthy? This has the words 'selected data' stamped all over it!
The difference here is so high that it does not change much if it is selected data, one run out of 10.

The probability that a difference this high happens is something like 1/1,000,000 (I did not calculate it; it is a fast estimate), and changing it to 1/100,000 is not going to make me believe that it is because of luck.
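
For what it's worth, a quick back-of-the-envelope check of that order of magnitude, as a sketch in C. It assumes the two 25,000-game Crafty results are independent and normally distributed, and since the thread does not settle whether BayesElo's +/- column is a 1-sigma or a 2-sigma bound, both cases are printed; either way the chance of a 21-Elo gap between identical versions from sampling noise alone is very small.

Code:

#include <stdio.h>
#include <math.h>

int main(void) {
    double diff   = 21.0;                         /* Elo gap between the two 25,000-game runs */
    double bound  = 4.0;                          /* BayesElo "+/-" value quoted for each run */
    double sd_sum = sqrt(2.0 * bound * bound);    /* same bound propagated to the difference */

    /* Two-sided normal tail probability: P(|Z| > z) = erfc(z / sqrt(2)). */
    double z1 = diff / sd_sum;                    /* if the +/- 4 is a 1-sigma bound */
    double z2 = diff / (sd_sum / 2.0);            /* if the +/- 4 is a 2-sigma bound */
    printf("+/-4 as 1 sigma: z = %.1f, p ~ %.1e\n", z1, erfc(z1 / sqrt(2.0)));
    printf("+/-4 as 2 sigma: z = %.1f, p ~ %.1e\n", z2, erfc(z2 / sqrt(2.0)));
    return 0;
}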

Uri