hgm wrote:
bob wrote: And if one would recognize that I have already done as much of the above as is possible, we wouldn't be wasting time. Games are run at random times. They are run on random cores. Since the games last for random amounts of time, things are further scrambled. If you'd like, I can certainly run a couple of tests and list which node each match script ran on, to show that it is different every time due to the way the SGE scheduler works. But don't let details interfere with blindness.
Interesting. Because if the games of two long runs between the same opponents were randomly interleaved, the actual act of playing them becomes the same no matter how you interleave them. They would all simply be games between the same two opponents.
So even if the computer on which the games were played were to totally doctor the results, e.g. let A win all games in the first half and B in the second, or A in all odd games and B in all even games, it could not affect the difference in result of the long runs. These results would remain distributed with a standard deviation as if the games were totally uncorrelated, as it would just be the equivalent of randomly drawing half the marbles out of a vase of colored marbles. And that is totally insensitive to the algorithm that was used to color the marbles, whether they were colored in groups and with intent, or randomly. The act of drawing them would totally randomize the result again.
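This marble argument is easy to check numerically. A quick sketch (an illustration only, not anyone's actual test harness): color all 2N games with maximal intent (A wins the entire first half, B the entire second), deal them randomly into two runs of N, and the spread of the score difference still matches the fully uncorrelated case.

Code:

import random
import statistics

N = 5120         # games per run, matching the per-opponent count later in the thread
TRIALS = 2000

diffs = []
for _ in range(TRIALS):
    games = [1] * N + [0] * N      # fully "doctored": A wins the whole first half
    random.shuffle(games)          # the random interleaving decides run membership
    run1, run2 = games[:N], games[N:]
    diffs.append(sum(run1) - sum(run2))

# For two independent N-game runs at p = 0.5 the score difference has
# sigma = sqrt(2 * N * p * (1 - p)); the random drawing reproduces it
# no matter how the "marbles" were colored.
print("observed sigma:     %.1f" % statistics.stdev(diffs))
print("uncorrelated sigma: %.1f" % ((2 * N * 0.25) ** 0.5))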
So if you observe a correlation in that case, it cannot be due to the computer that played the games at all. It must be the random selection process that decides which game counts for which run.
But of course what you say above totally denies what you were saying earlier: you said you started one run after the other finished. So the games of the two runs were not randomly interleaved at all, which they should have been if one of the runs was intended as a background correction for the other, to correct for low-frequency noise in the engine strength. So either you were bullshitting us then, or you are bullshitting us now...
What I said above doesn't "totally deny" anything. Might be a bit of a vocabulary issue, but here is what is done. First, let's define a "run" as a complete set of games: 40 positions, 4 games per position, 5 opponents to play against current crafty, N repeats. N varies but is typically set to 32, so that one opponent plays crafty 40 * 4 * 32 = 5120 games. With 5 opponents, that is 5 * 5120 = 25,600 total games in one run. That matches the second of the two sets of results I posted; the first set used a number smaller than 32 to produce just 800 games. This is called a run.
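In code form, the run arithmetic is just this (trivial sketch, same numbers as above; N = 1 works out to the 800-game first set):

Code:

positions = 40
games_per_position = 4      # 4 games per position, alternating colors
repeats = 32                # N; N = 1 gives the 800-game first set
opponents = 5

per_opponent = positions * games_per_position * repeats   # 40 * 4 * 32 = 5120
per_run = opponents * per_opponent                         # 5 * 5120 = 25,600
print(per_opponent, per_run)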
To actually perform this test, assuming I am using the entire cluster of 260 processors, I have a shell script that first creates a "command" to play each single position 4 times, alternating colors (I could also produce 4 separate commands, but the grouping is done for efficiency, which I will explain in a minute). If you do the math, that turns into 25,600 / 4 = 6,400 commands. These are saved in a file. I then run, on the "head node" (not one of the 130 nodes we have), a program that submits N of these commands to the SGE queueing system. Typically N is 300 here, so that there are 260 running and 40 waiting to run when anything finishes. As one of these 4-game mini-matches completes, the program fires off another command from the file, so that there are always jobs waiting in the queue.
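In outline, those two pieces look something like the sketch below (hypothetical names: play_minimatch stands in for the real match script, and the qstat parsing is an assumption; qsub -b y and qstat themselves are ordinary SGE commands):

Code:

import subprocess
import time

OPPONENTS, POSITIONS, REPEATS = 5, 40, 32
TARGET = 300    # ~260 running plus ~40 waiting in the queue

# 25,600 / 4 = 6,400 commands, one per 4-game mini-match
commands = ["play_minimatch %d %d %d" % (x, z, y)    # hypothetical wrapper script
            for x in range(1, OPPONENTS + 1)
            for z in range(1, POSITIONS + 1)
            for y in range(1, REPEATS + 1)]

def jobs_in_flight():
    # count our queued + running jobs; qstat prints two header lines
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    return max(0, len(out.splitlines()) - 2)

while commands:
    while jobs_in_flight() >= TARGET:
        time.sleep(10)                  # queue is topped up; wait
    # qsub -b y submits a plain command line instead of a job script
    subprocess.run(["qsub", "-b", "y"] + commands.pop(0).split())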
The SGE engine schedules the jobs on random nodes. Whenever one finishes, the grid engine waits until the node shows that nothing is left running (the load average drops to near zero) and then picks the next command in the queue and schedules it on that processor. This continues until all of the commands have been executed and the test is completed. The initial seeding of the commands is completely random, and then as a command finishes, the next one falls into that node and begins. I'd be happy to show you the scheduling log to see what ran where. It is always different, but since all nodes are absolutely identical, that makes no difference. I have the option of rebooting a node before using it. Since these are bare linux nodes, that takes under 30 extra seconds, but I have found no difference between rebooting and not rebooting with respect to the variability of the results.
Hopefully, that explains the process clearly and precisely. We have the gridengine configured so that it will _never_ start a command on a node whose load factor is not below 0.05, which means nothing running. The only downside is that this wastes quite a bit of time, as it takes a while for the load average to settle to near zero after being at 1.0.
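SGE enforces that gate with a load threshold in the queue configuration (np_load_avg); as a stand-in illustration of the same check on a Linux node:

Code:

import time

def node_is_idle(threshold=0.05):
    # the first field of /proc/loadavg is the one-minute load average
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0]) < threshold

def wait_for_idle():
    # this is the wasted time mentioned above: after sitting at 1.0,
    # the one-minute average takes a while to decay below 0.05
    while not node_is_idle():
        time.sleep(5)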
Running like this, I can produce 100% repeatable results, so long as I avoid using time as the limiting constraint for the game tree searches. Node count limits work perfectly. Depth limits work perfectly. But all of the programs I am testing against do partial last iterations and time out when they feel like it, and timing jitter makes these games completely non-reproducible.
I assume you saw the post where someone played fruit vs fruit, same starting position, no book, and did not get one duplicate game? That is what I see as well. Some games actually do repeat. They are apparently forced enough that there are no last minute score / best-move changes that would be sensitive to time jitter. But most games, even if they have the same result, have a different sequence of moves at some point...
Clear enough? I have explained that exact set-up many times. It hasn't changed. There's no bullshit coming from me. You might have some in your ears of course, but I can't help that. The nodes are intentionally assigned randomly, since this cluster can also run MPI jobs, PVM jobs, etc., which depend on network bandwidth/latency as well. This randomness makes the results show "average performance" since the assignment is mixed up every time. My testing does no network activity until after the games are played, when I save the PGN to a common location.
Now before you make any more comments, read the above carefully and see if there are any points you don't understand or want clarification on. I have several ways I can modify the testing. For lots of runs, I submit "commands" that play 4 games per processor rather than just one. This is more efficient, as there are fewer of those "pause until load average drops to near zero" conditions that waste time. I can crank this up to 8, 16, or whatever I want. Going too far begins to create commands that vary way too much in how long they run, causing load-balancing issues as the run winds down; running too few creates a lot of idle time waiting for the load average to drop.

My "automatic submit" program also looks at what else is running, and if it notices other users with jobs in the queue, it slows down the submission process to use fewer nodes and let those jobs run (a rough sketch of that logic follows below). As the queue empties, it picks the pace back up. This way I can run as much as I want without starving other users who are also trying to run things (nobody uses all the nodes as I do, so there is always some major fraction of the cluster available).
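The throttling logic, roughly (a hypothetical sketch, not the real submit program; the username, the back-off rate, and the floor are all assumptions):

Code:

import subprocess

CLUSTER_NODES = 260
MIN_NODES = 64          # assumed floor, not a value given above

def other_users_waiting(me="bob"):       # placeholder username
    # qstat -u '*' lists every user's jobs; state "qw" = queued and waiting
    out = subprocess.run(["qstat", "-u", "*"], capture_output=True, text=True).stdout
    return sum(1 for line in out.splitlines()[2:]
               if " qw " in line and me not in line)

def submit_target():
    waiting = other_users_waiting()
    if waiting == 0:
        return CLUSTER_NODES             # nobody else queued: use everything
    # back off while others wait; recovers automatically as their queue drains
    return max(MIN_NODES, CLUSTER_NODES - 4 * waiting)   # 4 = arbitrary back-off rate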
any questions or comments???
Edit:
As far as "which game goes with which run" goes, there are two issues.
(1) One run (as defined above) is completed before another is started, so the 25,600 games of one run are not intermingled with another 25,600-game run.

(2) As far as the individual games go, each game is assigned a unique "ID". I use this so that, if I choose to create them, crafty can use this "ID" as the logfile number, so that each different game gets a different logfile when I am looking for problems. It is then easy enough to collect the individual game results and "group" them in the same order as the commands were initially produced, so that if I want the first 4 games from each position, I can see those precise games. In fact, the PGN is actually stored like that. The filenames are:
matchX.Y.Z
X is a number between 1 and 5 indicating which opponent Crafty played, Z is the position number (1-40), and Y is the number of the 4-game mini-match (1-32 in the usual case, sometimes smaller). So it is easy to figure out the "logical order" even if not the physical order in which the games were played. I use the "logical" order in everything I do, since the games are collected that way. In general, if you ignore the parallel activity of 260 simultaneous games, the games are actually played in that order, or at least started in that order; what happens after a game starts is subject to the timing issues already discussed, which can make one of the four supposedly identical games take far longer (or shorter) than the rest.
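Recovering that logical order from the file names is then a simple sort; a sketch (the directory and glob pattern are assumptions):

Code:

import glob

def key(path):
    # "matchX.Y.Z" -> (opponent, position, mini-match): the order in which
    # the commands were generated, i.e. the "logical" order
    name = path.rsplit("/", 1)[-1]
    x, y, z = map(int, name[len("match"):].split("."))
    return (x, z, y)

pgns = sorted(glob.glob("results/match*.*.*"), key=key)
first_minimatch = [p for p in pgns if key(p)[2] == 1]   # the first 4 games per position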