MartinBryant wrote:
bob wrote:
Is that not a _huge_ change? 10 fewer wins, 10 fewer losses, 20 more draws. Just by adding 1000 nodes.
Yes it is, but it is exactly what you'd expect because of the non-determinacy in mature engines.
It is what _I_ expect, yes. But not "everyone" believes engines are that non-deterministic. That's why I ran the test, to actually measure this effect, which is pronounced, regardless of the opinions of others to the contrary.
I ran a different experiment...
I played Fruit against itself for 100 games with no book at 0.5 seconds / move to see how many games it took before it repeated the first game.
Now if the engine were deterministic we'd get a repeat on the second game.
But anybody care to guess how many games before a repeat occurred?
The answer is not a single one!
In fact, not only did the first game never repeat, there wasn't a single duplicate in the whole match!
(You might even conclude that using a set of opening positions is irrelevant! Just leave the engines to it! That's probably too extreme for most people's tastes though.)
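For anyone who wants to try this at home, here is a minimal sketch of the duplicate test using the python-chess library. The engine path is a placeholder, and having one engine process play both sides is my assumption about the setup:

```python
# Sketch of the self-play duplicate test described above, using python-chess.
# The engine path is a placeholder; one engine process plays both sides.
import chess
import chess.engine

ENGINE_PATH = "./fruit"   # placeholder: any UCI engine binary
GAMES = 100
MOVE_TIME = 0.5           # seconds per move, as in the experiment

def play_one_game(engine, game_id):
    """Self-play one game from the start position; return the move sequence."""
    board = chess.Board()
    while not board.is_game_over():
        # Passing a fresh `game` object asks python-chess to signal a new
        # game (ucinewgame) to the engine between games.
        result = engine.play(board, chess.engine.Limit(time=MOVE_TIME),
                             game=game_id)
        board.push(result.move)
    return tuple(m.uci() for m in board.move_stack)

with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
    seen = {}
    for g in range(1, GAMES + 1):
        moves = play_one_game(engine, g)
        if moves in seen:
            print(f"game {g} repeats game {seen[moves]}")
        seen.setdefault(moves, g)
    print(f"{len(seen)} distinct games out of {GAMES}")
```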
Not surprising to me. But again, others have suggested that this non-deterministic effect I have been seeing is something that is unique to crafty, rather than uniformly applicable to nearly all engines today, except for the brain-dead simple ones that are not important to the discussion.
The fluctuations in timing are accentuated in this kind of test because the search is terminated mid-flow, as opposed to at a couple of well-defined decision points (do I start another iteration? do I terminate this iteration after this root move?), but it demonstrates the problem nicely.
A few different transposition table entries early on rapidly multiply until the root move scores start to change and eventually even the root move choice.
I believe it's analogous to the butterfly effect in chaos theory. An innocent table entry flaps its wings and later on there's a hurricane in the game result.
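As a toy illustration (my own construction, not any real engine), the sketch below counts alpha-beta nodes over the same random game tree under two slightly different move orderings. A changed TT "best move" hint perturbs ordering in exactly this way, and the node counts typically diverge even though the score does not:

```python
# Toy illustration: alpha-beta node counts are highly sensitive to move
# ordering, which is what a changed transposition-table entry perturbs.
# The game tree here is random and purely illustrative.
import random

DEPTH, BRANCH = 5, 5

def leaf_value(path):
    # Deterministic pseudo-random value per leaf, independent of visit order.
    return random.Random("leaf" + repr(path)).randint(-100, 100)

def alphabeta(path, depth, alpha, beta, order, counter):
    counter[0] += 1
    if depth == 0:
        return leaf_value(path)
    best = -10**9
    for move in order(range(BRANCH)):
        score = -alphabeta(path + (move,), depth - 1, -beta, -alpha,
                           order, counter)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # cutoff: how soon this happens depends on move ordering
    return best

def natural(moves):
    return list(moves)

def swapped(moves):
    # Same tree, but the first two moves are tried in the opposite order at
    # every node -- a stand-in for a different TT "best move" hint.
    m = list(moves)
    m[0], m[1] = m[1], m[0]
    return m

for name, order in (("natural", natural), ("swapped", swapped)):
    counter = [0]
    score = alphabeta((), DEPTH, -10**9, 10**9, order, counter)
    print(f"{name:8s} ordering: score={score}, nodes searched={counter[0]}")
```

The root score is identical either way (alpha-beta is exact at the root), but the work done to find it differs, and in a timed search that difference is what snowballs.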
That could easily have been cut/pasted from several of my past posts on the topic.
Your test is effectively just giving a consistent +ve clock wobble every move, as opposed to a random +/- on some moves.
It isn't always +ve. The problem is that the clock sort of "jumps" along in discrete steps, while I can sample at any point between the steps. So I could start something right before one of these jumps, sample right after, and conclude that I had used enough time when I had actually used almost none. Also, using more time on one move leaves less time for the moves after it, so a sort of oscillation is set into motion somewhere and ripples throughout the rest of the game.
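You can watch those discrete steps directly. A small sketch that samples whatever clock the platform exposes and prints the observed jump sizes; on an old coarse tick (e.g. the classic ~10ms scheduler tick) the steps are huge, on a modern high-resolution clock they're tiny, but they are still steps:

```python
# Sketch: observe the discrete "jumps" of a clock by sampling it in a tight
# loop and recording the increments. If you start a search just before a
# jump and sample just after it, the measured elapsed time is a full tick
# even though almost no real time has passed.
import time

samples = []
last = time.time()
while len(samples) < 20:
    now = time.time()
    if now != last:               # the clock just "jumped"
        samples.append(now - last)
        last = now

print("observed tick sizes (seconds):")
for step in samples:
    print(f"  {step:.9f}")
```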
Could it even be that a SINGLE node extra could cause such variation? Fancy trying a 1,000,000 node run and a 1,000,001 node run?
I believe I ran such a thing; let me try to dig up the data. If I can't find it, I can run it again. I ran lots of +/- 100 node tests...
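For anyone who wants to try the same comparison, a node-limited search is easy to drive over UCI. A sketch with python-chess follows; the engine path is a placeholder, and note that engines typically stop only *near* the requested node count:

```python
# Sketch: compare two searches that differ by a single node in their limit.
# Engines treat "go nodes N" as "stop as soon as possible after N nodes",
# so the actual counts may overshoot slightly.
import chess
import chess.engine

ENGINE_PATH = "./fruit"  # placeholder: any UCI engine binary

with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
    board = chess.Board()
    for nodes in (1_000_000, 1_000_001):
        # A fresh `game` object makes python-chess send ucinewgame, so the
        # second search doesn't silently reuse the first search's hash.
        info = engine.analyse(board, chess.engine.Limit(nodes=nodes),
                              game=object())
        print(nodes, info.get("pv", [None])[0], info.get("score"))
```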
The vast majority of this non-determinism can be removed of course by simply clearing the transposition table before each move.
Of course this is not something you want to do in a real game, but it might be worth a few experiments when testing. (Your opponents would have to have this option available to try too!) Gut feeling though: I don't think it would stop you getting the fluctuations in the overall results that you see. I believe there's something else going on that we're all not seeing, but I too have no idea what it is.
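In UCI terms that's just pressing the engine's "Clear Hash" button before every search. A sketch of a self-play loop doing so with python-chess, assuming the engine actually exposes that option (and assuming python-chess triggers button options via configure with None, as its option handling suggests):

```python
# Sketch: clear the transposition table before every move, assuming the
# engine exposes the common UCI "Clear Hash" button option.
import chess
import chess.engine

ENGINE_PATH = "./fruit"  # placeholder: any UCI engine binary

with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
    board = chess.Board()
    while not board.is_game_over():
        if "Clear Hash" in engine.options:        # button option, no value
            engine.configure({"Clear Hash": None})
        result = engine.play(board, chess.engine.Limit(time=0.5))
        board.push(result.move)
    print(board.result())
```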
FWIW, I sympathise with your original problem and I can entirely believe your data. (Although my Cambridge mathematician son was incredulous... at least he didn't accuse you of lying!)
Good to know that some people have common sense and want to test for themselves, as opposed to stamping their feet and shouting "can't be... can't be".
I used to play 1000-game runs to try to make a good/bad decision, but I too was getting too many inconsistent results if I re-ran. (Apologies for never posting anything; I didn't realise I was meant to!)
I've now decided that using any arbitrary number of games is incorrect and changed the process as follows.
My GUI graphs the Elo rating as a match progresses. I start an open-ended match and let it go until the graph has flatlined to a +/-1 window for a thousand games. With this method I can sometimes get runs of <2000 games, but runs of >4000 games are not uncommon.
I'm not sure that this method will stand the test of time either, but it's what 'gives me confidence' (or keeps me sane!) currently.
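For concreteness, a sketch of that stopping rule. The +/-1 window and the 1000-game span are the numbers above, the logistic score-to-Elo conversion is the standard one, and the `play_game` hook is a hypothetical stand-in for actually running a game:

```python
# Sketch of the "flatline" stopping rule: keep playing until the running
# Elo estimate has stayed within a +/-1 window for 1000 consecutive games.
from collections import deque
import math

def elo_diff(score):
    """Standard logistic conversion from average score (0..1) to Elo."""
    score = min(max(score, 1e-6), 1 - 1e-6)   # avoid log of 0
    return -400 * math.log10(1 / score - 1)

def run_until_flat(play_game, window=1.0, span=1000):
    """play_game() -> 1, 0.5, or 0 from engine A's point of view (assumed)."""
    recent = deque(maxlen=span)   # the last `span` Elo estimates
    total = games = 0
    while True:
        total += play_game()
        games += 1
        recent.append(elo_diff(total / games))
        if len(recent) == span and max(recent) - min(recent) <= 2 * window:
            return recent[-1], games   # estimate has flatlined
```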
However, if you re-run, won't you get _different_ results once again? I have tried that approach, in that at any instant I can grab the current BayesElo output for all games completed so far. My original intent was to play "just enough" games. But when I re-run, I still get different results.
For those that are interested, I achieve these number of games in a reasonable time as follows...
I don't have a cluster, but I do have up to 6 test PCs available at different times. Most are old kit that has been retired from real use but is still good for sitting in a corner blindly churning out chess moves. Sometimes I steal the kids PCs too!
I also play at very fast time controls, which I believe are as good as any other. I think any choice of time control is entirely arbitrary, as nobody ever specifies CPU speeds when they quote such things. Modern CPUs can go much deeper in 100ms than we were achieving at tournament time controls 25 years ago! So why should a 5+5 game be any more 'respectable' than a 0.5+0.5?
Anyways, good luck with your testing and please keep us posted on any findings as at least some of us are interested in your results!