Observator bias or...

bob · Post by **bob** » Sun Jun 10, 2007 4:24 am

hgm wrote:Well, excuse me for saying so, but if all you said was that 80 games is not enough, then I don't understand why you said that at all. As the topic under discussion, or at least the one that I addressed and that you reacted on, was the one raised by Uri, if it is preferable to test based on time or based on node count, not how many games are enough:
bob wrote:
hgm wrote:Yes, for this reason testing at a fixed number of nodes and recording the ime, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.

I was directly addressing _both_ issues. Using node counts is bad. I can give several reasons but for shortness, consider two:

(1) nodes = 50M will produce a sample set of games. nodes=50M+10K will produce a _different_ sample set of games. Which set is most representative? Who knows.

(2) some programs have a significant variance in NPS from opening to middlegame, others do not. How do you pick a value for N (nodes searched before terminating the search) that is fair and representative of what you would expect in real play? A program that speeds up in the endgame will either be handicapped in the endgame because as its nps goes up, its time per move goes down since nodes are fixed. Or, if you use the endgame NPS then it will be favored in the opening due to it taking much longer than expected to move.

I mentioned the number of games because most use _way_ too few games (just look at game totals mentioned by yourself, Uri and others when making "go/no-go" decisions about recent changes. That point needs to be continually brought up to keep results in perspective. 80 or 160 or 320 games is _not_ enough to draw accurate conclusions. Yet that is exactly what the testing is all about...

For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generatng all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of less than 100 games, but might give you the accuracy of a 10,000 games that are subject to chance effects.

As I previously mentioned, it just won't work that way. The sample set of games you produce might or might not reflect the total population of games that contains the "truth". Any small sub-set is not going to be useful in these comparisons, because any eval change in version N+1 can completely alter the shape of the tree making the "fixed nodes" approach exactly as random as the time-based games...

fixed number of nodes is absolutely worthless. To prove that to yourself, do the following. Play a match using the same starting position, where _both_ programs search a fixed number of nodes (say 20,000,000). Record the results. Then re-play but have both search 20,010,000 nodes (10K more nodes than before). Now look at the results. They won't be anywhere near the same. Which one is more correct? Answer: that's hopeless as you take a small random (the games with 20M nodes per side) from a much larger set of random results, and you base your decisions on that? May as well flip a coin...

my upcoming ICGA paper will show just how horrible this is...
Here you make the general statement that testing with a fixed number of nodes is useless. Without referring to any number of games. And I don't think that 'useless' is the same as 'not enough' (for a particular purpose). Even if you want to stick to the 80 games that suddenly popped out of nowhere, if I have two versions and one of them scored 45 out of 80, while the other scored 0 out of 80, then 80 games are clearly enough to draw the far-reaching conclusion that you have broken something, and it would be plain silly to continue playing 100,000 games with this version. But all of that is standard statistics, which was never an issue in this discussion thread.

Searching a fixed number of nodes is worthless if the goal is to compare program version N to program version N+1, by playing one or more opponents. The results are exactly as reliable as just using clock time... You can easily confirm this yourself with a lot of test runs. I've done it... we had long discussions and hundreds of thousands of games played when we were studying this in the "crafty team" mailing list.

To talk about things that are absolutely useles: testing uMax against Crafty, Fruit, Arasan, comes pretty close pretty to that. The more games I would play, the more useles it would be, for in 100,000 games both the old and the improved version of uMax would score 0 points. So How would I know if my improvement worked, or if I had completely broken it? It would just be a giant waste of time. An easy analysis shows that you obtain maximum information per game (so that you get the desired reliability with the smallest number of games) if you test against engines of about equal strength.

You have to do _both_. You have test against programs close to you, worse than you, and significantly better than you. Otherwise imbalances can creep in that make you play better against weaker opponents but much worse against better opponents. We've also seen this happen in our cluster testing with various crafty versions.

On top of that, for those that are testing engines in a higher ELO range, I would even reverse the statement: if Crafty, Fruit, Arasan,... are not able to reproduce their games despite 'random' being switched off, and despite being set for a fixed ply depth, (so that random factors outside of the engines cannot affect their logic), they are clearly not suitable test opponents and are best excluded from any gauntlets you make to evaluate tiny changes in your engine. As using such unpredictable engines needlesly add an enormous statistical variance to the quantity under measurement. Better stick to engines that behave according to specifications.

After all, the idea is to make testing to a certain accuracy as easy as possible. That you could also make it much harder on yourself by picking certain engines with nasty peculiarities, is quite irrelevant if you are smart enough to stay away from them!

I challenge you to find _any_ program that uses time for search limit, and yet play the same games over and over given the same moves by the opponent. 10K nodes makes _major_ differences in 80 game matches as I showed. At todays search speeds that might be less than a millisecond, and time jitters more than that on every O/S I have tested.

bob · Post by **bob** » Sun Jun 10, 2007 5:30 pm

ed wrote:
hgm wrote:Did you ever solve this puzzle, Ed?

Not satisfactory. I reran the whole thing at a higher time control (40/20) and the problem disappeared. But doing so 800 games lasted 800 hours which was unacceptable for me, even if you have 4 PC's at your disposal at the time.

My educated guess as that due to tiny Windows interferences stealing 100-200ms so now and then (the infamous windows swap-file comes to mind) programs go one iteration deeper or 1 ply less deeper than a former game resulting in different moves, thus randomness plays a role after all.

The higher time control you play, the fewer such things occurs.

It seems that +/- 1% is really a bit better than you can expect over 800 games: the standard error in a single run should be 0.4/sqrt(800) = 1.4%, but the difference of the result of two independent runs should be sqrt(2) larger, i.e. 2%. And in 32% of the cases the difference would be larger than that.

If I calculate the spread of the 3 percentages you quote, it is 1.9%. This is a bit larger than expected, but not so much that it is wildly improbable.

But one might argue that the different runs are not sufficiently independent. If this is the case one would really have to know the experimental spread of the results of the run under exactly the same condition.

If it turns out to be a real effect, the only thing I can think of is that the OS is to blame. I don't know under what conditions you played these matches (ponder on/off, single/dual core), but if there is competition between the engines for CPU and/or memory, the OS might systematically favor one because of its position in the process table (determined by the order of starting them). Or the GUI might of course cheat, giving a higher priority to the engine it starts with...
Everything was done to surpress randomness as much as possible, so single cpu, no PB, same openings in reverse, no learning, no TB etc.

Ed

Amazingly, I have now played over 1,000,000 games studying this "randomness" Unfortunately, the largest source of randomness remaining is the effect of time "jitter". Adding even 10K nodes here and there changes the game and the outcome, and there appears to be no way to eliminate this without rendering the entire test pointless... So the final solution (for me) is to play a lot of games... and by doing so, I can get a final result that is pretty consistent and which reflects changes pretty reasonably. But it takes a bunch of games, not a few dozen (or even a few hundred)...

Observator bias or...

Re: Observator bias or...

Re: Observator bias or...