bob wrote: Is that not a _huge_ change? 10 less wins, 10 less losses, 20 more draws. Just by adding 1000 nodes.
Yes it is, but it is exactly what you'd expect because of the non-determinacy in mature engines.
I ran a different experiment...
I played Fruit against itself for 100 games with no book at 0.5 seconds / move to see how many games it took before it repeated the first game.
Now if the engine was deterministic we'd get a repeat on the second game.
But anybody care to guess how many games before a repeat occurred?
The answer is not a single one!
In fact, not only did the first game never repeat, there wasn't a single duplicate in the whole match!
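For anyone wanting to repeat the check, here is a minimal sketch of how the duplicates could be counted. It assumes each game is available as a list of move strings; the function name `first_repeat` and that input format are my own illustration, not anything Fruit or a particular GUI provides.

```python
def first_repeat(games):
    """Return (earlier_index, later_index) of the first duplicated game, or None.

    `games` is assumed to be a list of games, each a list of move strings.
    """
    seen = {}  # move sequence -> index of the game where it first appeared
    for index, moves in enumerate(games):
        key = tuple(moves)  # lists are not hashable, tuples are
        if key in seen:
            return seen[key], index
        seen[key] = index
    return None


# Toy data: the third game repeats the first.
games = [["e4", "e5", "Nf3"], ["d4", "d5", "c4"], ["e4", "e5", "Nf3"]]
print(first_repeat(games))
```

A result of `None` over the whole match corresponds to what I saw: no duplicates at all.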
(You might even conclude that using a set of opening positions is irrelevant! Just leave the engines to it! That's probably too extreme for most people's tastes though.)
The fluctuations in timing are accentuated in this kind of test, as the search is terminated mid-flow as opposed to at a couple of well-defined decision points (do I start another iteration? do I terminate this iteration after this root move?), but it demonstrates the problem nicely.
A few different transposition table entries early on rapidly multiply until the root move scores start to change and eventually even the root move choice.
I believe it's analogous to the butterfly effect in chaos theory. An innocent table entry flaps its wings and later on there's a hurricane in the game result.
Your test is effectively just giving a consistent +ve clock wobble every move, as opposed to a random +/- on some moves.
Could it even be that a SINGLE node extra could cause such variation? Fancy trying a 1,000,000 node run and a 1,000,001 node run?
The vast majority of this non-determinacy can, of course, be removed by simply clearing the transposition table before each move.
Of course this is not something you want to do in a real game but might be worth a few experiments when testing? Of course your opponents would have to have this option available to try too! Gut feeling though, I don't think it would stop you getting the fluctuations in the overall results that you see. I believe there's something else going on that we're all not seeing, but I too have no idea what it is.
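To make the table-clearing idea concrete, here is a toy sketch. `ToyEngine`, its binary-tree "game" and the `clear_hash_each_move` flag are all made up for illustration and are not any real engine's code or option name. It only shows that a table carried over from an earlier search changes how much work the next search does. In a real engine running against a clock, that changed node count becomes a changed depth, and eventually perhaps a changed move.

```python
class ToyEngine:
    """Hypothetical engine: fixed-depth negamax over a toy binary tree."""

    def __init__(self, clear_hash_each_move=False):
        self.clear_hash_each_move = clear_hash_each_move
        self.tt = {}    # (position, depth) -> score, kept between searches
        self.nodes = 0  # nodes visited by the most recent search

    def evaluate(self, pos):
        # Arbitrary but deterministic static evaluation for the toy tree.
        return (pos * 37) % 101 - 50

    def negamax(self, pos, depth):
        self.nodes += 1
        key = (pos, depth)
        if key in self.tt:
            return self.tt[key]  # table hit: no further search below here
        if depth == 0:
            score = self.evaluate(pos)
        else:
            children = (2 * pos + 1, 2 * pos + 2)
            score = max(-self.negamax(child, depth - 1) for child in children)
        self.tt[key] = score
        return score

    def search(self, pos, depth):
        if self.clear_hash_each_move:
            self.tt.clear()  # start every move from an empty table
        self.nodes = 0
        score = self.negamax(pos, depth)
        return score, self.nodes


sticky = ToyEngine()                          # table survives between searches
fresh = ToyEngine(clear_hash_each_move=True)  # table wiped before each search
print(sticky.search(0, 4), sticky.search(0, 4))
print(fresh.search(0, 4), fresh.search(0, 4))
```

With the table cleared, the two searches of the same position do identical work; with it kept, the second search is answered straight from the table.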
FWIW, I sympathise with your original problem and I can entirely believe your data. (Although my Cambridge mathematician son was incredulous... at least he didn't accuse you of lying!)
I used to play 1000-game runs to try to make a good/bad decision, but I too was getting too many inconsistent results if I re-ran. (Apologies for never posting anything, I didn't realise I was meant to!)
I've now decided that using any arbitrary number of games is incorrect and changed the process as follows.
My GUI graphs the Elo rating as a match progresses. I start an open-ended match and let it go until the graph has flatlined to a +/-1 window for a thousand games. With this method I can sometimes get runs of <2000 games, but runs of >4000 games are not uncommon.
I'm not sure that this method will stand the test of time either, but it's what 'gives me confidence' (or keeps me sane!) currently.
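The flatline rule can be sketched roughly as below. This is not my GUI's actual code. It assumes the rating difference is taken from the running score fraction with the usual logistic Elo formula, and that "a +/-1 window" means the last thousand plotted values span no more than 2 Elo in total. The names `elo_from_score` and `games_until_flat` are invented for the example.

```python
import math
from collections import deque


def elo_from_score(score_fraction):
    """Elo difference implied by a score fraction (logistic model)."""
    # Clamp so that a 0% or 100% score does not divide by zero.
    s = min(max(score_fraction, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / s - 1.0)


def games_until_flat(results, window=1000, tolerance=1.0):
    """Return the game count at which the Elo graph has stayed flat, or None.

    `results` yields 1.0 (win), 0.5 (draw) or 0.0 (loss) per game.
    "Flat" is assumed to mean the last `window` Elo values lie within a
    band of total width 2 * tolerance.
    """
    total = 0.0
    recent = deque(maxlen=window)  # Elo after each of the last `window` games
    for played, result in enumerate(results, start=1):
        total += result
        recent.append(elo_from_score(total / played))
        if len(recent) == window and max(recent) - min(recent) <= 2 * tolerance:
            return played
    return None


# A match of nothing but draws is flat from the start,
# so the rule fires as soon as the window is full.
print(games_until_flat([0.5] * 1500))
```

Because the rule watches a running average, it will always settle eventually; whether a flat graph really means a trustworthy result is exactly the part I'm not sure about.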
For those that are interested, I achieve this number of games in a reasonable time as follows...
I don't have a cluster, but I do have up to 6 test PCs available at different times. Most are old kit that has been retired from real use but is still good for sitting in a corner blindly churning out chess moves. Sometimes I steal the kids' PCs too!
I also play at very fast time controls, which I believe are as good as any other. I think any choice of time control is entirely arbitrary, as nobody ever specifies CPU speeds when they quote such things. Modern CPUs can go much deeper in 100ms than we were achieving at tournament time controls 25 years ago! So why should a 5+5 game be any more 'respectable' than a 0.5+0.5?
Anyways, good luck with your testing and please keep us posted on any findings as at least some of us are interested in your results!