New testing thread

Fritzlein

Re: Correlated data discussion - Another Experiment...

Post by Fritzlein »

MartinBryant wrote:OK Karl,
here's another experiment for you, not quite the one we were discussing but perhaps shedding some more light...

I thought it would be interesting to play a game between two engines at a fixed node count and then gradually increase that count to see how long it took until the game changed. I thought I would increase the node count by 1, 2, 4, 8, 16, etc. nodes, doubling the increment each time in case it took quite a while for a change to occur.

[snip]

So in 12 games we have 7 unique games where the maximum percentage change was 0.1024%
Also most changes (all much smaller than our previously discussed 1000) produce a game change.
Now I guess this data doesn't help much with your desired correlation test but to a layman it does seem to add even more fuel to the 'boy isn't it erratic!' side of the argument.
Yes, it is impressive that a change of a mere two nodes made the game different, and that the games were changed by node-count differences well below the 1000 we were discussing.

To me it is even more telling that not all twelve games were drawn. There was one win for black in there along with eleven draws. But as long as we are just eyeballing results, eleven draws out of twelve doesn't seem like a highly erratic result.

To relate this to the Crafty vs. Fruit discussion, suppose that our bet about correlated game results had arisen in a slightly different way than it actually did. Suppose that we had been discussing the probability of draws as opposed to decisive results. Suppose that a great mass of data from all sorts of positions shows that Spike playing itself draws 40% of the time. I might have claimed that, due to correlated game results, if I can peek at the result of the first playout from a position, and that first playout is a draw, I'll bet that more than half the remaining playouts from this position are draws. You say that my bet is a loser because there are only 40% draws on average, and there is a ton of variation in playouts due to changing node counts. I would have won this bet about correlation in convincing fashion.
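To make the mechanism concrete, here is a minimal simulation sketch in Python (the per-position draw probabilities are invented purely for illustration): if the draw rate varies from position to position around a 40% average, then playouts are correlated through the position, even though they are independent given the position.

import random

random.seed(1)

# Hypothetical spread of per-position draw probabilities, averaging 40%.
positions = [random.choice([0.15, 0.40, 0.65]) for _ in range(100_000)]

first_draw = 0
both_draw = 0
for p in positions:
    a = random.random() < p  # first playout is a draw?
    b = random.random() < p  # second playout is a draw?
    if a:
        first_draw += 1
        if b:
            both_draw += 1

print("overall draw rate       :", sum(positions) / len(positions))  # ~0.40
print("P(draw | first is draw) :", both_draw / first_draw)           # ~0.50

The conditional draw rate comes out near 50%, well above the 40% base rate, which is exactly the winning side of the bet described above.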
MartinBryant wrote:So is there another experiment you could propose using Spike v Spike or Spike v A.N.Other (I presume I can find another UCI engine that supports fixed node counts) which would help you?
Yes indeed. Thanks for offering! Let's try Spike against another bot where node counts can be fixed. Let's pick 201 random starting positions that are roughly balanced. For each position, pick which color Spike plays at random.

Now play out each position twice with Spike playing the same color both times. In the first playout, use 1,000,000 nodes for both engines, and in the second use 1% more, i.e. 1,010,000 nodes for both engines. If the NPS of the two bots are vastly different, choose a constant ratio of node counts to make the games roughly even. For example, if the opponent is slower/smarter by a factor of two, in the first game use 1,000,000 nodes for Spike and 500,000 for the opponent, and in the second game use 1,010,000 nodes for Spike and 505,000 for the opponent. Use whatever makes the engines roughly the same strength, since we get the most information from nearly-equal games.
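As a quick sketch of that node-budget arithmetic (the 2x NPS ratio is just the hypothetical example from the paragraph above):

spike_nodes = 1_000_000
ratio = 0.5                    # opponent slower/smarter by a factor of two
for bump in (1.00, 1.01):      # second playout gets 1% more nodes
    print(round(spike_nodes * bump), round(spike_nodes * bump * ratio))
# -> 1000000 500000
# -> 1010000 505000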

For each pair of games record the ordered pair of Spike's two results, e.g. (0.5, 1) for a draw and win, or (1, 0) for a win and a loss. That will give us 201 measurements of random variable X (=first playout) and random variable Y (=second playout). If there is no correlation, then Spike's wins, draws, and losses in the first playout won't especially line up with its wins, draws, and losses in the second playout. I'm betting they will line up, though, perhaps even enough for you to eyeball without calculating the coefficient of correlation.
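The analysis needs nothing beyond a plain Pearson coefficient over the 201 ordered pairs. A self-contained sketch (the pairs listed are made-up placeholders, not real results):

import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Spike's (first playout, second playout) scores: 1 = win, 0.5 = draw, 0 = loss.
pairs = [(1, 1), (0.5, 0.5), (0.5, 1), (0, 0), (0.5, 0.5)]  # ... 201 of these
xs, ys = zip(*pairs)
print("correlation:", pearson(xs, ys))  # near 0 if the playouts don't line up

A clearly positive coefficient over 201 pairs would settle the bet in favor of correlation.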

As Kenny Dail was saying, I expect there to be more correlation in game results if the 201 positions are chosen from a bit later in the opening. Although I'm asking you to tilt the result in a way that I expect to be in my favor by picking such positions instead of running from the initial position of chess, using later positions is a good idea anyway since it is more like what Bob was doing, right?

It is way cool of you to offer to run an experiment, since I'm not able to do anything but spout off theories myself.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion - Silver Position Result

Post by bob »

Let me add that this position really was chosen randomly. I did not look at it with a diagram to see what it looks like, and I still have not. Any time you choose a single position randomly, you can land on extreme variation in the character of the position. Some positions are pretty forced, some are not. I could have tried to choose one that was well-balanced, and I might have chosen one by accident for all I know...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:
MartinBryant wrote:
Uri Blass wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
When you say no duplicates, do you mean that no 2 games are the same, or do you mean that the first game did not repeat?

Note that those are two different claims.

Uri
Yes I mean no duplicates at all, no 2 games are the same (well according to SCID anyway!).
Note that if there were only 1000 possible games, each occurring with probability 1/1000, then in 100 games you would expect more than one pair of identical games, so the result that you report seems very surprising, and I suspect that there may be some learning or some other bug (like a bug in SCID).

I think that looking at the games may be productive to see if there is a problem and if SCID is reliable in detecting identical games.

I would like to know the probability that ply k is different given that plies 1, 2, ..., k-1 are identical.

Uri
I am not sure that I buy a "birthday paradox" similarity here. From a single starting position, the probability of playing any one game is 1/N, where N is the number of different games that can be derived from that position. What you are doing is assuming that if you play 1,000 games, there are only 1,000 possible games that could be played during that test, which is wrong. If the average position has 38 legal moves, and the average game goes 50 moves, then you have 38^50 possible games, not just the 1,000 you chose to enumerate. In actuality the odds are not that extreme, because pure time variance probably won't ever play every possible game. But even if timing only produces 3 possible variations at each move, that is huge. And I did not even factor in that 38^50 is actually wrong; it should be 38^100, since each side can vary...


So the probability of 1/1000 is simply wrong. Producing any one of those games has a probability of 1/N where N is huge, which makes all the assumptions invalid, including the common suggestion of "a bug". The "bug" is in the analysis above, not in how the test was run.
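The whole disagreement reduces to one number: among 100 games drawn from N equally likely games, the expected count of duplicate pairs is C(100,2)/N. A sketch of both models (uniform probabilities are a simplification on either side; real playout probabilities are certainly not uniform):

from math import comb, prod

def expected_dup_pairs(games, N):
    return comb(games, 2) / N   # one term per pair of games

def p_no_duplicate(games, N):
    return prod(1 - k / N for k in range(games))

print(expected_dup_pairs(100, 1_000))      # Uri's model: ~4.95 duplicate pairs
print(p_no_duplicate(100, 1_000))          # well under 1% chance of zero duplicates
print(expected_dup_pairs(100, 38 ** 100))  # Bob's model: effectively zero

Under the 1,000-game model, zero duplicates in 100 games would indeed be startling; under a huge-N model, it is the only plausible outcome.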
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:<snipped>
bob wrote:
does not happen, or at least does not happen to the point where the speed of the processor's instruction execution cycle varies by a single tenth of a nanosecond, which is .0000000001 seconds

I do not understand this part.
.0000000001 seconds relative to what?
Relative to a second. What do you mean? If the clock should tick at 1,000,000,000 times per second, it might go at 1,000,000,001, or 999,999,999 ticks per second.
.0000000001 seconds can be a small change if it is relative to 1 second but can be a big change if it is relative to clearly smaller time.


The fact that the change is less than a tenth of a nanosecond tells me nothing, because a change from 1 nanosecond to 1.09 nanoseconds is a relatively big change in speed, while a change from 1 second to 1 second and 0.09 nanoseconds is a small speed change.

Uri
Again, if you transmit at 2,000,000,000 hertz, and your receiver is looking for a frequency of 1,000,900,000 hertz, no communication occurs. Even worse, FM broadcasting modulates the frequency to transmit the data. If the carrier frequency changed by itself, that would "add" information (noise) that would be intolerable.
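A scale check on this argument (the NPS figure is a hypothetical round number): even a one-part-per-billion oscillator error over a full one-second search shifts the node budget by a thousandth of a node, supporting the point that the oscillator itself is not the source of the node-count variability seen in these experiments.

clock_error = 1e-9      # relative frequency error (one part per billion)
search_time = 1.0       # seconds spent on one move
nps = 1_000_000         # hypothetical nodes per second
print("node shift:", clock_error * search_time * nps)  # -> 0.001 nodes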
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion - Another Experiment...

Post by bob »

Remember I had already said that in some cases, one extra node was all it took. I ran the same sort of experiment you did. And that's where I got into the "time jitter" study because the results of adding even a single node can cause different games from some starting positions.

Personally, when I ran this, the result was _totally_ unexpected, and it had me looking for other influences for quite a while last summer (2007) when this was coming to light. Certainly not an expected result in my mind. But now we know it is not an unusual observation at all; it is a very frequent event.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion - Another Experiment...

Post by bob »

The interesting thing the two of you are doing is that you are not going thru the "stampee feet, test is flawed, stampee feet, cluster is broken, stampee feet, data is 'cherry-picked', stampee feet, stampee feet."

Instead, you are investigating _exactly_ what I investigated starting last Summer, to see what is going on, and why.

Meanwhile I am still waiting on word about the A/C so that I can try this large number of starting positions and see what shakes loose from those results...
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion - Silver Position Result

Post by xsadar »

MartinBryant wrote:OK I've re-run the test on Bob's randomly selected Silver position.

Firstly, I did as Uri (I think it was Uri? things just get buried here so fast!) suggested and ran a deterministic test just to ensure the setup is valid.
I played Fruit v Fruit from the opening position, no books, 7-ply fixed depth search for 10 games and got 10 identical games as expected. (I could've done 100 but that just seemed redundant...)
I used fixed depth as Fruit doesn't seem to work correctly with fixed nodes I'm afraid.

I then ran Fruit (2.1 by the way) v Fruit, 100 games, from Bob's Silver position, 0.5 secs/move.
(I actually made the book repeatedly select the line (12 ply) which leads to Bob's position rather than giving it the FEN as a starting position. It is just easier to do it that way with my GUI.)
In 100 games there was ONE duplicate game pair (out of 4,950 possible game pairings).
Games 9 and 27 played exactly the same moves (PGN below) but if you look carefully the scores do vary slightly throughout the game.

[Event "test"]
[Site "DELL8"]
[Date "2008.08.11"]
[Round "9"]
[White "Fruit 2.1"]
[Black "Fruit 2.1"]
[Result "1/2-1/2"]
[ECO "C02"]
[Termination "Repetition"]
[TimeControl "0.5 secs/move"]
[Opening "French"]
[Variation "Advance, 5.Nf3 Qb6 6.a3 c4"]

1.e4 e6 2.d4 d5 3.e5 c5 4.c3 Nc6 5.Nf3 Qb6 6.a3 c4 7.Be2 {-0.02/10} 7...Nge7 {-0.02/10} 8.O-O {-0.07/10} 8...Nf5 {+0.01/10} 9.Qc2 {-0.26/10} 9...Be7 {-0.18/10} 10.Bf4 {-0.24/10} 10...O-O {-0.03/10} 11.Nbd2 {-0.03/10} 11...h5 {+0.01/9} 12.Ng5 {-0.05/10} 12...Qd8 {+0.00/10} 13.h4 {+0.03/10} 13...g6 {+0.03/10} 14.Ndf3 {-0.09/9} 14...Bd7 {-0.06/9} 15.Rfe1 {-0.27/9} 15...Qb6 {-0.27/9} 16.g3 {-0.20/8} 16...a6 {-0.10/9} 17.Nd2 {-0.10/9} 17...Na5 {-0.16/8} 18.Bf3 {-0.21/9} 18...Rac8 {-0.20/9} 19.Rab1 {-0.14/8} 19...Nb3 {-0.28/9} 20.Nxb3 {-0.22/9} 20...Ba4 {-0.27/10} 21.Qe2 {-0.32/9} 21...Qxb3 {-0.27/10} 22.Bg2 {-0.27/10} 22...Qc2 {-0.33/10} 23.Qxc2 {-0.35/10} 23...Bxc2 {-0.35/11} 24.Ra1 {-0.35/11} 24...Rc6 {-0.28/11} 25.a4 {-0.33/11} 25...Rb6 {-0.29/11} 26.Ra2 {-0.32/10} 26...Rc8 {-0.32/10} 27.Nf3 {-0.24/10} 27...Kg7 {-0.17/10} 28.Bg5 {-0.15/11} 28...Bb3 {-0.17/11} 29.Raa1 {-0.12/12} 29...Bxg5 {-0.12/12} 30.Nxg5 {-0.11/12} 30...Bc2 {-0.08/12} 31.a5 {-0.05/12} 31...Rb3 {-0.07/12} 32.Re2 {-0.05/12} 32...Bd3 {+0.00/12} 33.Rd2 {-0.05/13} 33...Ne7 {-0.04/12} 34.Bf3 {+0.00/13} 34...Bf5 {+0.00/13} 35.Bd1 {+0.00/13} 35...Rb5 {+0.00/12} 36.Be2 {-0.17/12} 36...b6 {-0.17/12} 37.axb6 {-0.08/12} 37...Rxb6 {+0.06/11} 38.f3 {+0.03/11} 38...Nc6 {+0.09/11} 39.Bd1 {+0.00/11} 39...Rh8 {-0.03/11} 40.Kf2 {-0.01/11} 40...a5 {-0.03/11} 41.g4 {+0.22/11} 41...Bd3 {+0.20/11} 42.Kg3 {+0.21/11} 42...Rhb8 {+0.00/11} 43.Ba4 {+0.00/12} 43...Na7 {+0.00/12} 44.Bd7 {+0.00/11} 44...Rd8 {+0.00/12} 45.Ba4 {+0.00/14} 45...Rdb8 {+0.00/16} 46.Bd7 {+0.00/12} 46...Rd8 {+0.00/14} 47.Ba4 {+0.00/15} 47...Rdb8 {+0.00/16} 1/2-1/2

[Event "test"]
[Site "DELL8"]
[Date "2008.08.11"]
[Round "27"]
[White "Fruit 2.1"]
[Black "Fruit 2.1 - Duplicate"]
[Result "1/2-1/2"]
[ECO "C02"]
[Termination "Repetition"]
[TimeControl "0.5 secs/move"]
[Opening "French"]
[Variation "Advance, 5.Nf3 Qb6 6.a3 c4"]

1.e4 e6 2.d4 d5 3.e5 c5 4.c3 Nc6 5.Nf3 Qb6 6.a3 c4 7.Be2 {-0.02/10} 7...Nge7 {-0.02/10} 8.O-O {-0.07/10} 8...Nf5 {-0.07/10} 9.Qc2 {-0.26/10} 9...Be7 {-0.18/10} 10.Bf4 {-0.24/10} 10...O-O {-0.03/10} 11.Nbd2 {-0.03/10} 11...h5 {+0.01/9} 12.Ng5 {-0.05/10} 12...Qd8 {+0.00/10} 13.h4 {+0.03/10} 13...g6 {+0.03/10} 14.Ndf3 {-0.09/9} 14...Bd7 {-0.09/9} 15.Rfe1 {-0.06/9} 15...Qb6 {-0.27/9} 16.g3 {-0.20/8} 16...a6 {-0.22/9} 17.Nd2 {-0.10/9} 17...Na5 {-0.16/8} 18.Bf3 {-0.20/9} 18...Rac8 {-0.21/9} 19.Rab1 {-0.19/9} 19...Nb3 {-0.28/9} 20.Nxb3 {-0.22/9} 20...Ba4 {-0.27/10} 21.Qe2 {-0.27/9} 21...Qxb3 {-0.27/10} 22.Bg2 {-0.28/9} 22...Qc2 {-0.33/10} 23.Qxc2 {-0.35/10} 23...Bxc2 {-0.35/11} 24.Ra1 {-0.35/11} 24...Rc6 {-0.28/11} 25.a4 {-0.33/10} 25...Rb6 {-0.29/11} 26.Ra2 {-0.29/10} 26...Rc8 {-0.31/10} 27.Nf3 {-0.24/10} 27...Kg7 {-0.17/10} 28.Bg5 {-0.15/11} 28...Bb3 {-0.17/11} 29.Raa1 {-0.12/12} 29...Bxg5 {-0.12/12} 30.Nxg5 {-0.08/12} 30...Bc2 {-0.08/12} 31.a5 {-0.22/12} 31...Rb3 {-0.06/12} 32.Re2 {-0.02/12} 32...Bd3 {+0.00/12} 33.Rd2 {-0.05/13} 33...Ne7 {-0.04/12} 34.Bf3 {+0.00/13} 34...Bf5 {+0.00/12} 35.Bd1 {+0.00/14} 35...Rb5 {+0.00/12} 36.Be2 {-0.17/12} 36...b6 {-0.17/12} 37.axb6 {-0.08/12} 37...Rxb6 {+0.06/11} 38.f3 {+0.03/11} 38...Nc6 {+0.41/11} 39.Bd1 {+0.00/11} 39...Rh8 {-0.03/11} 40.Kf2 {-0.01/11} 40...a5 {-0.03/11} 41.g4 {+0.22/11} 41...Bd3 {+0.20/11} 42.Kg3 {+0.03/11} 42...Rhb8 {+0.00/11} 43.Ba4 {+0.00/12} 43...Na7 {+0.00/12} 44.Bd7 {+0.00/11} 44...Rd8 {+0.00/12} 45.Ba4 {+0.00/14} 45...Rdb8 {+0.00/16} 46.Bd7 {+0.00/12} 46...Rd8 {+0.00/14} 47.Ba4 {+0.00/15} 47...Rdb8 {+0.00/16} 1/2-1/2


So there you have it.
Draw from that whatever conclusions you like.

My 2 cents...
1) Ken may have a point that moving the position further into the game does increase the chance of more duplicates (although these two experiments by no means prove that). Looking at the games, there were many that finished with similar locked pawn chains (an artefact of this particular opening position), which perhaps reduces the number of reasonable choices.
2) There's still a mass (excuse the un-scientific term) of variation.
A lot of variation, but also evidence that there is in fact some correlation present, however small it may be. It would have been interesting to have the node counts for each move, to see whether they showed any correlation in the clock jitter.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

hgm wrote:
bob wrote:
hgm wrote:
bob wrote:
hgm wrote:
Dirt wrote: Using the node count instead of time would mean you're not testing the time management code, right? I guess you'd then have to test that separately, which is really starting to make the testing complex.
This would not be true if you use the node-based time-control mode of WinBoard 4.3.14.

Then the time-control code would be fully active. It is just that the engine would be fed (and would use internally) a virtual time derived from its node count. But apart from the clock routine, the rest of the engine would never know that it is not dealing with real time.
How can that be? You tell me to search a specific number of nodes, then how can I go over that as I would if I fail low on a position? How can I go over that by variable amounts as I do now, depending on how much I appear to be "losing"???

If you tell me exactly how long, there is going to be node count jitter. If you tell me exactly how many nodes to search, there will be no timing jitter and I can't test my time allocation code at all.

Can't be both ways, at least in my program...
Of course it can. You get a given number of nodes for a given number of moves, and the engine will itself determine how it distributes those nodes over the individual moves. Just like it divides its time quota over the moves in a normal N moves / M minutes time control. What you had in mind is the direct equivalent to the 'st' command, that gives you a fixed, never-exceed time for a single move. But there is no reason to restrict node-based time controls to that mode.

So if you want to play 100,000 nodes (max) per move, you start WinBoard with arguments

-st 1 -nps 100000

If you want to play 4,000,000 nodes for 40 moves, you specify

-mps 40 -tc 0:40 -nps 100000

That's all. Of course the engine has to implement the WinBoard nps command. So indeed you could not do it on Crafty.
OK, how does that make a fair test? Program A is fairly constant in NPS throughout the game. Program B varies by a factor of 3x from opening to endgame (Ferret was an example of this). So how many nodes do you tell Ferret to search in comparison to the other program?

So how is this going to give reasonable estimates of skill when it introduces a potential time bias toward one opponent or the other???
The purpose of node-based time controls is not to have a fair test of skill between programs, but to create a reproducible testing environment that is insensitive to CPU loading. I just point out that node-based time controls are not limited to 'st' type time controls, but come in all flavors: classical, incremental, sudden-death, or multi-session combinations of those.

Engines do not count nodes the same way, and do not have the same nps rate anyway. So if a tester wants node-based time control that does not have the character of a time-odds game, it is unavoidable that he adapts the nps parameter for each engine separately to equalize time use.

To prevent the problem with highly variable nps that you point out, I would encourage any program that shows such behavior to lie about its node count accordingly in nps mode, e.g. counting each tablebase probe as 100 nodes if that is necessary to make the nps rate in the endgame equal to that in the middlegame. Use any internal engine statistic to make the reported node count reflect CPU-time use as accurately as possible.
OK, so to get reproducible results, I have to cripple my engine with regard to how it handles time allocation, and play with a time advantage or a time handicap relative to my opponent. And then I use _those_ results to draw conclusions about program changes?

In general a 2x time advantage could make any sort of change look good.

While the idea of reproducible matches is interesting, if you continue to follow this thread, you might conclude that it is pointless, since someone else has now reported that a single extra node per move in a game can change the result.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
Performance penalty from searching only complete iterations is clearly less than 100 elo (and I guess something near 20 elo), and I believe that there are programmers who do not care about these small numbers and prefer deterministic results in testing so they can reproduce everything. (My opinion is that it is better to sacrifice 20 elo to be able to reproduce everything easily.)
I play so many games against other computers on ICC, and I see the analysis they whisper/kibitz when I am watching and we both turn it on. Even in CCT events as well. And _everybody_ that I played was producing output showing that not all root moves were searched each time a move was played. Sometimes they started a new depth and got nothing back, sometimes they were in the middle, and perhaps very rarely they timed out just as the iteration ended. But it is easy to watch 'em. I don't believe _anybody_ designs a program in any way other than to make it as strong as possible. And that includes me. I know how to make a perfectly deterministic parallel search, but with absolutely nowhere near the speedup I now produce. So non-determinism is accepted along with better performance. Ditto for time allocation and every other decision we make. But find me any _serious_ chess programmer who would say "sure, I will give up significant Elo just to make testing more deterministic (although no more accurate)".

You can be sure that not all programs at a level close to Crafty's exhibit the variability that you talk about.
When I first ran into this, I tried several: both Glaurungs, Fruit, GNU Chess, Crafty, Arasan 9/10, and a couple of others that I won't name but which I could get to run under Linux. And they _all_ had this issue. Yes, there could certainly be a program out there that terminates searches on iteration boundaries, clears the hash between moves, and does anything else needed to minimize variability. But there certainly are not very many of 'em, because we all take any rating improvement we can get, and we aren't going to throw away 30 Elo here, 20 there; enough of that produces a pure patzer.



Not all programmers of programs at that level care much about playing strength, and you can be sure that there are people who do not care whether their program is 500 elo weaker than Rybka or 480 elo weaker than Rybka.

I have not worked on Movei lately, but if I come back to it, at a level more than 100 elo weaker than Rybka I will certainly not care about a small improvement that makes results non-reproducible and makes it harder to find bugs, because I would see the program play a move that I cannot reproduce.

Uri
I'm the opposite, trying to take everything I can get. Most are...
Stopping at the end of the iteration may not be important for finding bugs. (And in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration, and of course on a ponder hit I may play in 0 seconds, not at the end of an iteration.)

Clearing the hash clearly can help with finding bugs; it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and you have only the game.

You know that the program made some mistake that you simply cannot reproduce on your first tries in analysis, because you cannot reproduce the hash situation.

You will have a harder time catching the bug that caused the program to make the mistake.

If you need more time to discover bugs, then it means that you sacrifice future improvements.

Uri
That is an insane approach to designing the strongest possible chess program. Would you then choose not to add major blocks of eval code because they might be harder to debug than if you left them out? Even though the additions are not going to add 100 Elo, or even 20 Elo, do you add them or not? There is only one sane answer there, and it is the same sane answer to any question about sacrificing Elo for (possibly unnecessary) future debugging...
evaluation is different because the results after adding evaluation terms can be reproduced.

Uri
Are you following this thread? A tiny eval change alters the shape of the game tree. A single node change can change a move or the game result. So there is no way that last statement is true. Just run a couple of positions, then make a tiny eval change and re-run them, and look at the node counts for the same fixed depth. Then this can be put to rest for all time.
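For anyone wanting to run that check, a sketch using the python-chess package and any UCI engine (the binary path is a placeholder; run it once with the original build and once with the tweaked-eval build, then compare node counts at the same fixed depth):

import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("./engine_binary")  # placeholder path
board = chess.Board()  # starting position; substitute any test FEN
info = engine.analyse(board, chess.engine.Limit(depth=10))  # fixed-depth search
print("nodes searched to depth 10:", info.get("nodes"))
engine.quit()

If the node counts differ between the two builds at the same depth, the eval change has altered the shape of the tree, which is the point being made above.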
Fritzlein

Re: Correlated data discussion - Silver Position Result

Post by Fritzlein »

MartinBryant wrote:OK I've re-run the test on Bob's randomly selected Silver position.
[...]
In 100 games there was ONE duplicate game pair
For comparison, if I had a bag of 7,213 game results, and I picked out of that bag with equal likelihood of getting each result, then in 100 picks I would have a 50% chance of getting no duplicates (birthday paradox). Therefore the actual result of getting 1 duplicate game in 100 playouts is consistent with there being only 7,213 playouts you will ever get, each of them equally likely. Of course, in practice there will probably be more possible playouts if you try long enough, because they won't all be equally likely; some will be more likely than others, and that's where you will get your duplicates. I just wanted to throw out a ballpark number to help guide the intuition of how much variation this is.

Come to think of it, it's pretty much the same as the ballpark number you threw out there yourself, namely 4,950 (pairs of comparisons)...
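Where the 7,213 comes from, as a sketch: solve P(no duplicate in 100 uniform picks from N outcomes) = 0.5. The standard approximation N = n^2/(2 ln 2) gives almost exactly that figure, and the exact product lands nearby:

import math

def p_no_dup(picks, N):
    p = 1.0
    for k in range(picks):
        p *= 1 - k / N  # k-th pick must avoid the k results already seen
    return p

print(100 ** 2 / (2 * math.log(2)))  # ~7213.5, the approximation
print(p_no_dup(100, 7213))           # ~0.50, confirming the ballpark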
MartinBryant wrote:There's still a mass (excuse the un-scientific term) of variation.
Yes indeed, lots of different playouts possible. But if we look at the results of the playouts (win, draw, loss), does the distribution correspond to the distribution for a randomly chosen position? If the average over all positions (or even just over all balanced positions) is 30% white wins, 45% draws, 25% black wins, while the actual playouts of this position (drawn from a pool of 4,950 or 7,213 or whatever) give 10% white wins, 40% draws, and 50% black wins, then reusing this position is causing strong correlation in scores.

Blame insufficient clock jitter, blame the engine for not understanding black's strategic weaknesses, blame sunspots if you like, but if the results of playouts of this position are not the same shape as results of playouts from a random position, then there is correlation of the results to the position.
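That "same shape" test is a textbook goodness-of-fit check. A sketch using the made-up percentages from above (100 playouts of one position against the hypothetical all-positions baseline):

observed = {"white": 10, "draw": 40, "black": 50}        # playouts of this position
baseline = {"white": 0.30, "draw": 0.45, "black": 0.25}  # hypothetical all-positions average

n = sum(observed.values())
chi2 = sum((observed[k] - n * baseline[k]) ** 2 / (n * baseline[k]) for k in baseline)
print("chi-square:", chi2)  # ~38.9, far above the 5% cutoff of ~5.99 for 2 degrees of freedom

A value that far above the cutoff would mean the position's result distribution does not match the baseline, i.e., the results are correlated with the position.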