Re: Observator bias or...
Posted: Sun Jun 10, 2007 2:24 am
hgm wrote:
> bob wrote:
> > hgm wrote:
> > > Yes, for this reason testing at a fixed number of nodes and recording the time, rather than fixing the time, seems preferable. But of course you cannot get rid of the randomness induced by SMP that way.
> >
> > As I previously mentioned, it just won't work that way. The sample set of games you produce might or might not reflect the total population of games that contains the "truth". Any small sub-set is not going to be useful in these comparisons, because any eval change in version N+1 can completely alter the shape of the tree, making the "fixed nodes" approach exactly as random as the time-based games...
>
> Well, excuse me for saying so, but if all you said was that 80 games is not enough, then I don't understand why you said that at all. The topic under discussion, or at least the one that I addressed and that you reacted to, was the one raised by Uri: whether it is preferable to test based on time or based on node count, not how many games are enough.

Searching a fixed number of nodes is worthless if the goal is to compare program version N to program version N+1 by playing one or more opponents. The results are exactly as reliable as just using clock time... You can easily confirm this yourself with a lot of test runs. I've done it... We had long discussions and hundreds of thousands of games played when we were studying this on the "crafty team" mailing list.
I was directly addressing _both_ issues. Using node counts is bad. I can give several reasons, but for brevity, consider two:
(1) nodes = 50M will produce a sample set of games. nodes=50M+10K will produce a _different_ sample set of games. Which set is most representative? Who knows.
(2) some programs have a significant variance in NPS from opening to middlegame to endgame; others do not. How do you pick a value for N (nodes searched before terminating the search) that is fair and representative of what you would expect in real play? A program that speeds up in the endgame will either be handicapped there, because as its NPS goes up its time per move goes down since the node count is fixed; or, if you choose N based on the endgame NPS, it will be favored in the opening, where it takes much longer than expected to move.
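The NPS objection is simple arithmetic. A toy sketch (the speeds and the 50M-node budget are made-up numbers, purely for illustration) shows how a fixed node budget translates into very different thinking times across game phases:

```python
def seconds_per_move(node_budget: int, nps: float) -> float:
    """Time a move takes when the search stops at a fixed node count."""
    return node_budget / nps

# Hypothetical program: 1.0M nodes/sec in the middlegame, speeding up
# to 2.5M nodes/sec in the endgame as the board empties.
# With a fixed 50M-node budget, its effective thinking time collapses
# in exactly the phase where the speedup occurs:
print(seconds_per_move(50_000_000, 1_000_000))   # middlegame: 50 s/move
print(seconds_per_move(50_000_000, 2_500_000))   # endgame: 20 s/move
```

Under a real clock both moves would get the same time, and the endgame speedup would buy extra depth instead of being cancelled out.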
I mentioned the number of games because most use _way_ too few games (just look at the game totals mentioned by yourself, Uri and others when making "go/no-go" decisions about recent changes). That point needs to be continually brought up to keep results in perspective. 80 or 160 or 320 games is _not_ enough to draw accurate conclusions. Yet that is exactly what the testing is all about...
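To put a number on that claim, here is a back-of-the-envelope sketch (my own illustration, not from the post: normal approximation to the binomial, draws ignored, and the standard logistic Elo model) of how wide the uncertainty on an 80-game match really is:

```python
import math

def match_score_interval(score, games, z=1.96):
    """Approximate 95% confidence interval for the true score fraction,
    given an observed match result (normal approximation; ignoring draws
    slightly overstates the variance, which is fine for a rough bound)."""
    p = score / games
    se = math.sqrt(p * (1 - p) / games)
    return p - z * se, p + z * se

def elo_from_score(p):
    """Convert a score fraction to an Elo difference (logistic model)."""
    return -400 * math.log10(1 / p - 1)

# A version scores 45/80. What range of Elo differences is consistent
# with that result? The interval spans well over 100 Elo points.
lo, hi = match_score_interval(45, 80)
print(elo_from_score(lo), elo_from_score(hi))
```

An observed 45/80 cannot even rule out that the "improved" version is weaker, which is exactly why 80-game go/no-go decisions are unreliable.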
For this reason I still want to implement the tree comparison idea I proposed here lately. This would eliminate the randomness not by sampling enough games and relying on the (tediously slow) 1/sqrt(N) convergence, but by exhaustively generating all possible realizations of the game from a given initial position. If the versions under comparison are quite close (the case that is most difficult to test with conventional methods), the entire game tree might consist of fewer than 100 games, yet might give you the accuracy of 10,000 games that are subject to chance effects.
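The exhaustive-enumeration idea can be illustrated with a toy model (entirely hypothetical; a real implementation would enumerate the engines' near-equal move choices in actual positions): if a game contains only a handful of positions where the outcome genuinely branches, averaging over every branch combination gives the exact expected score with zero sampling noise:

```python
from itertools import product

def exact_expected_score(branch_points, result_of):
    """Average the game result over every combination of choices at the
    binary decision points -- exhaustive enumeration, no sampling noise.
    result_of maps a tuple of 0/1 choices to a result in {0, 0.5, 1}."""
    outcomes = [result_of(c) for c in product((0, 1), repeat=branch_points)]
    return sum(outcomes) / len(outcomes)

# Hypothetical game: drawn unless the sharp line (choice 1) is taken at
# all 3 branch points, which wins. 2**3 = 8 "games" cover everything,
# where a sampling approach would need thousands to pin the score down.
print(exact_expected_score(3, lambda c: 1.0 if all(c) else 0.5))
```

Eight enumerated games here replace the 1/sqrt(N) convergence of random sampling entirely: the answer is exact, not an estimate.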
Here you make the general statement that testing with a fixed number of nodes is useless, without referring to any number of games. And I don't think that 'useless' is the same as 'not enough' (for a particular purpose). Even if you want to stick to the 80 games that suddenly popped out of nowhere: if I have two versions and one of them scored 45 out of 80 while the other scored 0 out of 80, then 80 games are clearly enough to draw the far-reaching conclusion that you have broken something, and it would be plain silly to continue playing 100,000 games with this version. But all of that is standard statistics, which was never an issue in this discussion thread.
A fixed number of nodes is absolutely worthless. To prove that to yourself, do the following. Play a match using the same starting position, where _both_ programs search a fixed number of nodes (say 20,000,000). Record the results. Then re-play, but have both search 20,010,000 nodes (10K more than before). Now look at the results. They won't be anywhere near the same. Which one is more correct? Answer: that's hopeless, as you take a small random sample (the games with 20M nodes per side) from a much larger set of random results, and base your decisions on that. May as well flip a coin...
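Statistically, two runs that differ only in an irrelevant 10K-node perturbation are just two independent draws from the same underlying distribution, and 80-game draws disagree often. A quick Monte Carlo sketch (the win/draw probabilities are invented for illustration; any similar values show the same effect):

```python
import random

def simulate_match(games, p_win, p_draw, rng):
    """Simulate one match as independent games with fixed win/draw
    probabilities; returns the total score (win=1, draw=0.5)."""
    score = 0.0
    for _ in range(games):
        r = rng.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score

# Pair up many "20,000,000-node" and "20,010,000-node" runs -- i.e. two
# independent 80-game samples from the identical distribution -- and see
# how often the two runs disagree by a substantial margin.
rng = random.Random(1)
gaps = [abs(simulate_match(80, 0.4, 0.3, rng) - simulate_match(80, 0.4, 0.3, rng))
        for _ in range(2000)]
print(sum(g >= 5 for g in gaps) / len(gaps))  # fraction of pairs 5+ points apart
```

A sizeable fraction of identical-strength pairs differ by five or more match points, which is the scale of difference people routinely treat as a real improvement.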
My upcoming ICGA paper will show just how horrible this is...
To talk about things that are absolutely useless: testing uMax against Crafty, Fruit, Arasan comes pretty close to that. The more games I would play, the more useless it would be, for in 100,000 games both the old and the improved version of uMax would score 0 points. So how would I know if my improvement worked, or if I had completely broken it? It would just be a giant waste of time. An easy analysis shows that you obtain maximum information per game (so that you get the desired reliability with the smallest number of games) if you test against engines of about equal strength.
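The "maximum information per game" claim follows from the logistic Elo curve: the slope of expected score versus rating difference is steepest at zero difference, so a game against an equal opponent moves the measurement most per Elo point of real change. A small sketch (standard Elo model; the numerical derivative is my own illustration):

```python
def expected_score(elo_diff):
    """Standard logistic (Elo) model: expected score against an opponent
    who is elo_diff points weaker."""
    return 1 / (1 + 10 ** (-elo_diff / 400))

def sensitivity(elo_diff, h=1.0):
    """Numerical slope of expected score per Elo point at a given
    strength gap: how much signal one game carries about a small change."""
    return (expected_score(elo_diff + h) - expected_score(elo_diff - h)) / (2 * h)

for gap in (0, 200, 400, 800):
    print(gap, sensitivity(gap))   # the slope shrinks fast as the gap grows
```

Against an 800-Elo-stronger field (the uMax-vs-Crafty situation described above), the slope is essentially zero: both versions score near 0, and the games carry almost no information about which version is better.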
You have to do _both_. You have to test against programs close to you, worse than you, and significantly better than you. Otherwise imbalances can creep in that make you play better against weaker opponents but much worse against stronger ones. We've also seen this happen in our cluster testing with various Crafty versions.
I challenge you to find _any_ program that uses time as the search limit, and yet plays the same games over and over given the same moves by the opponent. 10K nodes makes _major_ differences in 80-game matches, as I showed. At today's search speeds that might be less than a millisecond, and time jitters by more than that on every O/S I have tested.
On top of that, for those that are testing engines in a higher Elo range, I would even reverse the statement: if Crafty, Fruit, Arasan, ... are not able to reproduce their games despite 'random' being switched off, and despite being set to a fixed ply depth (so that random factors outside the engines cannot affect their logic), they are clearly not suitable test opponents and are best excluded from any gauntlets you make to evaluate tiny changes in your engine, as using such unpredictable engines needlessly adds an enormous statistical variance to the quantity under measurement. Better to stick to engines that behave according to specification.
After all, the idea is to make testing to a certain accuracy as easy as possible. That you could also make it much harder on yourself by picking certain engines with nasty peculiarities is quite irrelevant, if you are smart enough to stay away from them!