New testing thread


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

xsadar wrote:
bob wrote:
xsadar wrote:
bob wrote:
Zach Wegner wrote:
Karl wrote:If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).
Isn't this pretty much exactly what I've been saying?
Yes, but there is a fatal flaw. You already saw that 10,001,000 nodes produces different results from 10,000,000 nodes, correct? Now make one tiny change to your evaluation or search, and guess what happens to the tree? It will change by _more_ than 1,000 nodes, and you get a different random result.

There is no way to search the same exact tree with a different evaluation or search, so you are right back to where we started. There will be a random offset between the two sample sets just as there is with random times at present...
The changes in the search tree are part of what we're trying to test, whereas fluctuations in the clock are not.

However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is something we would definitely want included in our test. That is, after all, the major tradeoff when adding new evaluation terms. Also, in my experience, nps varies drastically from one stage of the game to another, so I don't think it would even work to use an average nps in combination with a node count.
You are completely missing the point.
Could be. I'm having a hard time understanding this post, so I'll go point by point, to see if I can understand any better.
We want to test whatever changes we made, would you agree? And do you follow how a tiny change in time measurement changes the node count of the tree being searched? And how that changes the result? So a change you make could just as easily change the shape of the tree just enough that you now get a win where before you got a loss, or vice versa.
Yes/correct to all of the above.
Having absolutely _nothing_ to do with a better or worse evaluation, just one that produces a different value,
I'm not sure I agree with this. To me it means that it is better or worse for this particular sample (consisting of the position and number of nodes searched).
which then alters how cutoffs occur, which is the same thing that happens when the node count is changed slightly.
OK, let's step back just a minute. Do you believe that searching 100 nodes more or less will have any effect on the results at all? I know it will, and posted the data to show that the effect is not minor at all. Now, if I search 100 fewer nodes per position and I get a significantly better result, do you believe that searching a tiny bit less was really a significant improvement? Of course not. And that is the effect I am talking about. A tiny evaluation change changes the shape of the tree, and that is all that is needed to change the result; and the really interesting part is that the change in shape is _not_ correlated with the result. One time fewer nodes makes a better result, next time it makes a worse one...
Except that when the node count is changed, that has nothing to do with changes in the program we're testing, whereas the size and shape of the tree does. So I don't understand why this is relevant, but perhaps that's because I'm not understanding your point.
The easiest way to understand this is to
No offense, but I think I understood the above a lot better than what follows. :)
view the entire population of games we could possibly play using a time limit. There are only so many possible node count variations.
Okay so we have all possible games that can be played by your engine from one starting point and with a specific time limit, right? Or is this just all possible games of chess that can be played with that engine? Or all possible games that can be played by anybody?
In this context, all possible games that could be played by my engine at that specific time control. The time control is fixed, but the timing jitter will produce different node counts most of the time. If you played every possible node count, you would enumerate them all.

So you slice up this total population by individual node counts, and now you have a super-set of my change-the-node-count games.
I don't understand what you're doing here. Are you saying that the games resulting from a time limit are a superset of those resulting from varying node counts?
The opposite. You just enumerated games for every _possible_ node count I could produce given a fixed time per search. If we play enough timed games, I _might_ produce them all, but I might not be able to hit _every_ possible node count for various reasons. So what I could actually reach would be a sub-set of what is actually possible in the set of every possible node count.
If so, isn't the time limit itself just varying the node counts?
Except your super-set has sets of games from all possible different node counts. But how do the sub-sets compare to the entire set?
So the subsets are the set of games we actually play in our tests, I assume.
Ought to be representative, statistically, but there is no guarantee.
Probably depends on how we choose our subsets. So you're wondering how we know whether our means of selecting subsets is any good?
Just to make sure we're talking about the same thing: of course we can only expect our subset to be representative of the superset under consideration. If the superset only consists of games from one or a few starting positions, we can't expect the subset to be representative of games from all possible starting positions.
It would be ugly to just alter the node count slightly and get a better result, which is what sometimes happened in the tests I ran.
But that is where you were testing with only forty start positions, right?
Yes. And remember, so far nothing says that more positions are going to change the results. I have four big runs queued up; the first is about halfway done. When all are done we will see how the four runs stack up, stability-wise, through BayesElo.
It would be just as ugly to add a new eval term that was not very important, but since it changes the tree shape just a bit, I get over into a different one of those node-count samples and get a better (or worse) result and draw a conclusion about the change, when in reality any such change will affect the final result.
So we don't know if our different result is due to the quality of the change or something similar to the slight-change-in-node-count effect that you've demonstrated in the past? Maybe I understand you now. Personally, I think this effect is mainly due to the correlation caused by playing multiple games from the same positions, which is what you're in the process of testing. But I don't see why that in and of itself would prevent us from testing A and A' with the same node counts in addition to using more positions. The only effect I can see a switch to node counts having is to stabilize the results slightly. So, if we were to disregard the issue of any changes also changing nps, I don't see what it would hurt, and there would even be the possibility that it would help. But maybe I'm still not understanding your point.
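(As an aside, the bookkeeping that "same node counts for A and A'" would require is small. Below is a purely hypothetical sketch in C of one way to do it: draw a random node budget for each game once, write it to a file, and have the A' run read the same file back instead of drawing new budgets. The file name, constants and seed are all invented; this is not anyone's actual test harness.)

Code: Select all

/* Hypothetical sketch: generate one random node budget per game and save it,
 * so that A and A' can later be tested against exactly the same budgets.
 * All names and numbers are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

#define GAMES      1000
#define BASE_NODES 3000000L
#define JITTER      100000L      /* budgets drawn from BASE_NODES +/- JITTER */

int main(void)
{
    FILE *f = fopen("node_budgets.txt", "w");
    if (f == NULL) return 1;

    srand(12345);                /* fixed seed: the schedule is reproducible */
    for (int game = 0; game < GAMES; game++) {
        long offset = (long)((2.0 * JITTER + 1.0) * (rand() / (RAND_MAX + 1.0)));
        long budget = BASE_NODES - JITTER + offset;
        fprintf(f, "%d %ld\n", game, budget);  /* read back unchanged for the A' run */
    }
    fclose(f);
    return 0;
}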
Short form: from a given starting position, there are a _huge_ number of possible different games that can be played, some wins, some draws and some losses. The number of nodes searched directly chooses which game we are going to follow. Using time, we follow one of a very large number of different games, but the game is different each time we replay it because the time changes slightly. If we use fixed nodes, we just choose one of those games to play, but if you alter how that fixed number is allocated, by changing any aspect of the program, you again change the game path.

The real problem is that if every possible game from a position were a win, or every one a loss, this would be fine; but with balanced positions there are millions of possible wins and millions of possible losses, and if you use a small set (and maybe even 50,000 games is a small set) you can get unexpected results by picking a win in one test and a loss in the next, where the choice is essentially random, caused by the shape of the tree you search and not by the skill used to play through the game...
That is the problem that is difficult to deal with. Any change alters the entire system, and it seems to take a large number of games/positions/opponents to reduce this noise to an acceptable level.
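(A back-of-the-envelope sketch of the numbers involved, mine rather than anything from the thread: treating games as independent coin flips and ignoring draws, the standard error of a match score is sqrt(p(1-p)/N), and near a 50% score one percentage point is roughly 7 Elo. The little program below just tabulates that.)

Code: Select all

/* Back-of-the-envelope only: statistical noise of a match result versus the
 * number of games, assuming independent games and ignoring draws (worst case). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double p = 0.5;                        /* assumed true score */
    const int n[] = { 800, 5000, 25000, 100000 };

    for (int i = 0; i < 4; i++) {
        double se_score = sqrt(p * (1.0 - p) / n[i]);
        /* near 50%, dElo/dscore = 400 / (ln 10 * p * (1-p)), about 695 */
        double se_elo = se_score * 400.0 / (log(10.0) * p * (1.0 - p));
        printf("%6d games: score +/- %.4f, roughly +/- %.1f Elo (one sigma)\n",
               n[i], se_score, se_elo);
    }
    return 0;
}

Correlation between games (re-used positions, identical opponents) only makes the effective number of independent games smaller than the nominal one, which is part of what is being argued about above.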
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Intermediate results...

Post by bob »

Here are the original results for the 26,000-game runs I posted earlier; the current results follow further down, just for information, since the test is not finished yet...

Code: Select all

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   123    8    8  5120   66%     2   15%
   2 Fruit 2.1               38    8    7  5119   55%     2   19%
   3 opponent-21.7           28    7    7  5119   54%     2   34%
   4 Crafty-22.2              2    4    4 25597   50%     0   19%
   5 Glaurung 1.1 SMP         2    8    8  5120   50%     2   14%
   6 Arasan 10.0           -193    8    9  5119   26%     2   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   118    8    8  5120   67%   -19   13%
   2 Fruit 2.1               42    8    8  5120   58%   -19   17%
   3 opponent-21.7           32    7    7  5115   58%   -19   36%
   4 Glaurung 1.1 SMP        20    8    8  5120   55%   -19   12%
   5 Crafty-22.2            -19    4    4 25595   47%     4   19%
   6 Arasan 10.0           -193    8    8  5120   28%   -19   16% 
Here are the new results so far:

Code: Select all

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   111    9    9  3611   68%   -19   20% 
   2 Fruit 2.1               64    9    9  3608   62%   -19   24% 
   3 opponent-21.7           28    8    8  3611   57%   -19   33% 
   4 Glaurung 1.1 SMP         9    9    9  3611   54%   -19   21% 
   5 Crafty-22.2            -19    5    5 18059   46%     4   23% 
   6 Arasan 10.0           -193    9    9  3618   27%   -19   18% 
So far, just over 18,000 games have finished. I started this somewhere around 8pm or so, so another 8-10 hours will see the thing finished (one run of four). the interesting part will be the second, third and fourth runs to see if the results suddenly become stable enough to use.

More as the things finish...
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

Dirt wrote: Using the node count instead of time would mean you're not testing the time management code, right? I guess you'd then have to test that separately, which is really starting to make the testing complex.
This would not be true if you use the node-based time-control mode of WinBoard 4.3.14.

Then the time-control code would be fully active. It is just that the engine would be fed (and would internally use) a virtual time derived from its node count. But apart from the clock routine, the rest of the engine would never know that it is not dealing with real time.
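For what it's worth, the mechanism is easy to picture. Below is a minimal sketch (my illustration, not WinBoard's or any engine's actual code) of a clock routine that reports virtual milliseconds derived from the node count at an assumed nominal speed; everything downstream of it, including the time-management logic, runs unchanged.

Code: Select all

/* Illustration only -- not WinBoard's implementation. The engine's usual
 * clock routine is replaced by one that converts the node count into
 * "virtual time" at an assumed nominal speed, so the time-management code
 * never knows it is not dealing with real time. */
#include <stdint.h>

#define NOMINAL_NPS 1000000ULL        /* assumed speed: 1M nodes per second */

extern uint64_t nodes_searched;       /* incremented by the search as usual */

/* virtual milliseconds elapsed since the search started */
uint64_t virtual_elapsed_ms(uint64_t nodes_at_search_start)
{
    uint64_t nodes = nodes_searched - nodes_at_search_start;
    return nodes * 1000ULL / NOMINAL_NPS;
}

Because the budget is now deterministic in nodes, the timing jitter disappears, while the time-allocation code is still exercised on every move.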
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 4 sets of data

Post by hgm »

bob wrote:No, but if you'd just follow the discussion once in a while,
you would see that reducing the number of games is good enough to test the hypothesis that a round-robin will stabilize the ratings more.
Well, as I pointed out, even two games would be enough to 'test' the hypothesis that 1+1=2. Test whatever trivialities you want, but don't expect me to follow your muddling with interest.
One does have to have an attention span measured in something longer than milliseconds to be able to carry on discussions here of course. I've never proposed using 800 games for the good/bad testing.
A blatant lie. Even in the previous post you were complaining that the 800-game runs were not good enough for the tiny differences you wanted to measure. "Tons of variation", remember? (Yes, hard, I admit. You did say that more than a msec ago...)
But I did suggest using it to test the hypothesis that playing a round robin would help, otherwise the test would take over a week for one run, not a day.

Get that now? Nobody is saying _reduce_ the number of games for real testing. At least not me...
Are you sure you don't mean "FORget it now?" :lol:
easy enough. Because the engines run on "virgin" machines each time. The machines do not cyclically ramp up their clock, then ramp it back down.
Indeed not, but I was not asking about the machines, but about the engines. The engines do have access to the date and time of day, do they not? It would be trivial for me to write an engine that would wreck your testing: Just add the line:

if( timeOfDay.month & 1 ) thinkingTime /= 2;   /* use half the thinking time in odd months */

and that engine would play 70 Elo weaker in odd months than in even months. So if you made a 25,000-game run with it in June and repeated that run in July, you would get totally different results. The score would be off by as much as 10%.
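(The arithmetic behind that 10%, using nothing more than the standard logistic Elo model; the little program is just the formula made explicit.)

Code: Select all

/* Expected score of an engine suddenly playing 70 Elo below itself,
 * from the standard logistic Elo model. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double diff = -70.0;   /* Elo difference relative to its normal self */
    double expected = 1.0 / (1.0 + pow(10.0, -diff / 400.0));
    printf("expected score at %+.0f Elo: %.2f\n", diff, expected);   /* ~0.40 */
    return 0;
}

So where it scored 50% before, it now scores about 40%, which is where the "as much as 10%" comes from.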

So are your machines virgin enough that you reset the system time to zero before each run? Safe bet you don't...
Which would not introduce a dependency anyway. So somehow the act of winning or losing a game on machine X would have to alter the clock on machine X so that the next time a game is played there, the altered clock has to somehow affect (in a consistent way) the outcome of that game.

It is all absolutely poppycock and you (as well as everyone else) know that it is simply not possible. I also don't want to have to prove that sunspots, cosmic rays, close-proximity black holes, gravity waves, etc. are introducing dependencies. So back to the real world. If my cluster produces dependencies, so does everyone else's testing, particularly those just using a paltry single machine or 2 or 4.
Even if that were true (and it does not necessarily have to be true, as the problem might be in one of the engines, and other people might not use that engine), the main point is that others do not care. They do not make 25,000-game runs, so their errors are always dominated by sampling statistics. Systematic errors that are 6-sigma in a 25,000-game run are only 0.6 sigma in a 250-game run (the statistical sigma shrinks as 1/sqrt(N), so the same fixed bias measured in sigmas is sqrt(25,000/250) = 10 times smaller in the short run), and of no consequence. You could not see them, and they would not affect your testing accuracy.

If you want to be able to do precision measurements, you have to work harder to eliminate noise sources. Every physicist knows this.
Your signal/noise ratio is so bad, I am not aware of your offering any useful signal anyway. So I believe I will just "muddle on" and before long you will actually know how to test engines as a result, since you obviously do not know how at present... And neither do I, at present. But at least _I_ am working on changing that situation.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

Fritzlein wrote:What we apparently disagree about is the potential that we will be able to re-use positions in a way that is independent. I highly doubt it can be done.
Is this just a gut feeling, or is this doubt somehow based on mathematical calculation?

The variances that would result from the different sampling procedures I defined are actually very easy to calculate. So I urge you to do it. Then you will see that the variance for independent sampling over a number of classes (like different initial positions) is simply the sum of the variances of the sampling within each class, plus the variance of the class averages.
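(The decomposition being appealed to is presumably the law of total variance: the variance of a single sampled result equals the average within-class variance plus the variance of the class means. A tiny numerical check, with invented per-position numbers, is below.)

Code: Select all

/* Numerical check of Var(X) = E[Var(X|pos)] + Var(E[X|pos]) for a game result
 * drawn by first picking a starting position uniformly and then sampling a
 * result from it. The per-position means and variances are invented. */
#include <stdio.h>

int main(void)
{
    double mean[4] = { 0.50, 0.55, 0.40, 0.65 };   /* expected score per position */
    double var [4] = { 0.24, 0.20, 0.22, 0.18 };   /* result variance per position */
    int n = 4;

    double avg_mean = 0, avg_var = 0;
    for (int i = 0; i < n; i++) { avg_mean += mean[i] / n; avg_var += var[i] / n; }

    double var_of_means = 0;
    for (int i = 0; i < n; i++)
        var_of_means += (mean[i] - avg_mean) * (mean[i] - avg_mean) / n;

    /* direct computation from the mixture: E[X^2] - (E[X])^2 */
    double ex2 = 0;
    for (int i = 0; i < n; i++) ex2 += (var[i] + mean[i] * mean[i]) / n;

    printf("within-class part : %.6f\n", avg_var);
    printf("between-class part: %.6f\n", var_of_means);
    printf("sum               : %.6f\n", avg_var + var_of_means);
    printf("direct total      : %.6f\n", ex2 - avg_mean * avg_mean);
    return 0;
}

The last two lines print the same number, which is all the identity says.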
Therefore, I say we need 10,000 positions, not to remove the source of noise you are talking about, but to remove the source of noise that comes from correlations when a position is re-used.
Then we must miscommunicate, because that was exactly the source of noise I was talking about.
I contend that if Crafty playing white from a given position beats Fruit limited to exactly 3,000,000 nodes, then I can make a lot of money betting on Crafty to win playing white from the same position against Fruit limited to 3,000,000 nodes +/- 100,000 nodes. Yes, at some of the other node counts Fruit will win, but I still expect to consistently make money, because I just don't believe the results will be independent.
Well, this is something Bob vehemently denies, and in this respect I have no reason at all to doubt his statement. I think you would lose tons of money, as in general Fruit is stronger than Crafty, and the (not so extremely rare) exception that Crafty beats Fruit for this precise combination of conditions will not carry over to cases with even a tiny difference in number of nodes. This is the 'butterfly effect'. (Unless the position was not equal, of course, and Crafty was playing a position with a +8 score or something like that, which is not really relevant for this discussion.)
MartinBryant

Re: Correlated data discussion

Post by MartinBryant »

Fritzlein wrote: I contend that if Crafty playing white from a given position beats Fruit limited to exactly 3,000,000 nodes, then I can make a lot of money betting on Crafty to win playing white from the same position against Fruit limited to 3,000,000 nodes +/- 100,000 nodes. Yes, at some of the other node counts Fruit will win, but I still expect to consistently make money, because I just don't believe the results will be independent.
I'd be careful how you bet your money!

Bob has already demo'd that tiny changes (1,000 nodes) can make move choices, and hence the game result, fluctuate wildly, and my own experience/experiments agree with this. It may be that the result is so sensitive to the node count that the addition of just a _single_ node would do the same. (Now that would be an interesting experiment.) The results through the 200,001 possible games may be totally chaotic.

In fact, because Fruit is stronger than Crafty, your Crafty win at 3,000,000 nodes could well be sitting in the middle of a sea of losses.

If you took your +/-100,000-node interval and ran 201 games at the 1,000-node steps, I'd happily bet MY money that Fruit would win the majority.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Intermediate results...

Post by hgm »

bob wrote:So far, just over 18,000 games have finished. I started this somewhere around 8pm or so, so another 8-10 hours will see the thing finished (one run of four). the interesting part will be the second, third and fourth runs to see if the results suddenly become stable enough to use.

More as the things finish...
Well, it seems things are pretty much heading for a within-bounds replication of the results of your second run. So it is apparently the first run that produced an atypical result, most likely (i.e. 99.9999% likely) due to some unknown flaw in the way you conducted the experiment.

Lessons to be drawn from this for the would-be scientific data collector:

- Do not accept data points with a deviation of more than 4-sigma; they are more likely to spoil your data than improve it
- Always repeat an experiment a number of times, to get an idea of how large the standard deviation is and whether it matches theoretical expectation. This does not have to be time consuming, as long runs can often be subdivided into shorter identical runs, and the variability in the results of these sub-runs is a good indicator of the standard deviation of the total run.
- Sub-division into shorter runs allows you to pinpoint corrupted sub-runs in cases where the unknown flaw was localized to a single sub-run; you can then rescue the data by simply deleting the corrupted sub-run and don't have to re-do everything. Some sciences even go so far as to ALWAYS delete the data points (= sub-runs) with the largest and smallest result, even if they are not suspect at all, because this is not supposed to affect the average of the remaining results, and erroneous results have a larger probability of being deleted in this procedure than correct results, so the standard error in your average generally decreases. (Despite the fact that the average is now taken over fewer samples.)
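(To make the sub-run bookkeeping concrete: a small sketch, with invented scores, of the recipe above. Split the run, use the spread of the sub-run means as the error estimate, and optionally drop the highest and lowest sub-run before averaging. It illustrates the procedure; it is not anyone's actual tooling.)

Code: Select all

/* Sketch of the sub-run procedure described above, with invented numbers:
 * the spread of the sub-run means estimates the error of the full run, and
 * dropping the best and worst sub-run gives a trimmed mean. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* invented match scores (percent) for 8 sub-runs of one long run;
     * the 55.6 plays the role of a suspect sub-run */
    double sub[8] = { 49.1, 50.3, 48.7, 50.9, 49.8, 55.6, 49.5, 50.2 };
    int n = 8;

    double mean = 0;
    for (int i = 0; i < n; i++) mean += sub[i] / n;

    double var = 0;
    for (int i = 0; i < n; i++) var += (sub[i] - mean) * (sub[i] - mean) / (n - 1);
    double sd = sqrt(var);

    /* trimmed mean: drop the largest and the smallest sub-run result */
    int lo = 0, hi = 0;
    for (int i = 1; i < n; i++) {
        if (sub[i] < sub[lo]) lo = i;
        if (sub[i] > sub[hi]) hi = i;
    }
    double trimmed = 0;
    for (int i = 0; i < n; i++)
        if (i != lo && i != hi) trimmed += sub[i] / (n - 2);

    printf("mean of sub-runs: %.2f  (sd %.2f, sd of the mean %.2f)\n",
           mean, sd, sd / sqrt((double)n));
    printf("trimmed mean (best and worst dropped): %.2f\n", trimmed);
    return 0;
}

The outlier shifts the plain mean noticeably but barely moves the trimmed one, which is the point of the trimming rule.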
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Intermediate results...

Post by bob »

hgm wrote:
bob wrote:So far, just over 18,000 games have finished. I started this somewhere around 8pm or so, so another 8-10 hours will see the thing finished (one run of four). the interesting part will be the second, third and fourth runs to see if the results suddenly become stable enough to use.

More as the things finish...
Well, it seems things are pretty-much heading for a within-bounds replication of the results of your second run. So it is apparently the first run that produced an a-typical result, most likely (i.e. 99.9999% likely) to be due to some unknown flaw in the way you conducted the experiment.

Lessons to be drawn from this for the would-be scientific data collector:

- Do not accept data points with a deviation of more than 4-sigma; they are more likely to spoil your data than improve it

Another one: do not draw conclusions from _one_ partial result. The test has not even finished yet and here you go, drawing a conclusion. The way the test is run, the partial results can be highly misleading due to white/black bias. One does best by waiting until _all_ the data is in.
- Always repeat an experiment a number of times, to get an idea of how much the standard deviation is, and if it matches theoretical expectation. This does not have to be time consuming, as long runs can be often subdivided into shorter identical runs, and the variability in the results for these sub runs is a good indicator of the standard deviation of the total run.
- Sub-division into shorter runs allows you to pinpoint corrupted sub-runs, in cases where the unknown flaw was localized to a single sub-run, and you could rescue the data by simply deleting the corrupted sub-run, and don't have to re-do everything. Some sciences even go so far as to delete the data points (=sub-runs) with the largest and smallest result ALWAYS, even if it is not suspect at all, because this is not supposed to affect the average of the remaining results, and erroneous results have a larger probability to be deleted in this procedure than correct results, so the standard error in your average in general decreases by this. (Despite the fact that the average is now taken over fewer samples.)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

hgm wrote:
Fritzlein wrote:What we apparently disagree about is the potential that we will be able to re-use positions in a way that is independent. I highly doubt it can be done.
Is this just a gut feeling, or is this doubt somehow based on mathematical calculation?

The variances that would result from the different sampling procedures I defined are actually very easy to calculate. So I urge you to do it. Then you will see that the variance for independent sampling over a number of classes (like different initial positions) is simply the sum of the variances of the sampling within each class, plus the variance of the class averages.
Therefore, I say we need 10,000 positions, not to remove the source of noise you are talking about, but to remove the source of noise that comes from correlations when a position is re-used.
Then we must miscommunicate, because that was exactly the source of noise I was talking about.
I contend that if Crafty playing white from a given position beats Fruit limited to exactly 3,000,000 nodes, then I can make a lot of money betting on Crafty to win playing white from the same position against Fruit limited to 3,000,000 nodes +/- 100,000 nodes. Yes, at some of the other node counts Fruit will win, but I still expect to consistently make money, because I just don't believe the results will be independent.
Well, this is something Bob vehemently denies, and in this respect I have no reason at all to doubt his statement. I think you would lose tons of money, as in general Fruit is stronger than Crafty, and the (not so extremely rare) exception that Crafty beats Fruit for this precise combination of conditions will not carry over to cases with even a tiny difference in number of nodes. This is the 'butterfly effect'. (Unless the position was not equal, of course, and Crafty was playing a position with a +8 score or something like that, which is not really relevant for this discussion.)
Where do I deny that? I am onto _accurate_ measurement. The overall "trend" of these matches seems to accurately "order" the programs and predict who will win. But that is not what I am trying to measure, and that is not what any of this discussion is about.

So again, I am not trying to measure who is better. I am trying to measure precisely (as much as possible with reasonable resource limits) how much better (or worse) a new version is. A subtle, but _important_ difference in what I am trying to do.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

bob wrote:Another one: do not draw conclusions from _one_ partial result. The test has not even finished yet and here you go, drawing a conclusion. The way the test is run, the partial results can be highly misleading due to white/black bias. One does best by waiting until _all_ the data is in.
You think that was a conclusion from your partial result? Actually this is freshman teaching material, the basics of every experimental science...
Where do I deny that? I am onto _accurate_ measurement.
Oh, man, why don't you wait until you are awake before you start posting? It says you deny that the game results will be practically the same when you change the node count by 100,000. Or are you denying now that you deny that?