Harald wrote:
bob wrote: The only issue with my referee is that one of the opponents has to be Crafty, as the referee depends on Crafty to accurately tell it when a game is technically over, which is simpler than having the referee keep up with the board position and such here...
Sorry, I did not follow all the arguments and descriptions in this thread,
so I may be wrong, but I have a question.
You have two big 25000 games tests comparing crafty_a to crafty_b.
The collected test data show an Elo difference that should not be there.
One of the engines is used as a referee. Did you use the same engine as
referee in both tests, that is
run_test referee=crafty_a other=crafty_b
or was one of the tests started like this
run_test referee=crafty_b other=crafty_a ?
If your test results show a strange result and Elo shift, then the fact that
one engine also acts as the referee may cost (or gain) it a few Elo points.
This is only one of many possible problems with a test experiment.
With samples this big (2 × 25,000 games) I trust the statistics, probability
and mathematics so much that I think there _must_ be a hidden problem.
We want to help you to find it or at least learn to avoid such problems
when we all get a big test cluster.
Harald
None of the above. The referee is similar to winboard/xboard. I start the referee; it then uses fork() to create two processes, pipe() to create a connection to each of the two new processes, and then each process uses exec() to execute one of the two engines being tested. The referee sends both engines the starting FEN position and the initial time control, and instructs the correct opponent to move. It then simply alternates taking the move from the first opponent and sending it, plus the time/otim commands, to the second opponent. This goes back and forth until the game ends.

The referee is not part of either engine; it is a separate program much like xboard, except with no graphical interface, since I don't want to see up to 540 graphical boards at one time. (I can play up to 540 games at a time on the faster cluster; I do not use both clusters for one run since their processors are not identical.)
The referee has been sanity-checked (although it is a simple program) by playing a few dozen 160-game matches, taking the PGN for each game and comparing it to the log file produced by Crafty, to verify that the PGN result is consistent with Crafty's expectation. Way back we actually found a bug in Crafty's draw detection doing this: Crafty thought it was winning while the PGN said "draw by repetition"; when we checked, the PGN was correct and Crafty had overlooked the repetition. But at least we know that the PGN collection code is saving sane information.
I believe that what we are seeing is just an artifact of computers playing the game. In human play, you won't find two opponents where the one who is 200 Elo stronger loses _many_ consecutive games, then wins many _more_ consecutive games, in the kind of streaks we see regularly in comp vs comp matches. Such behavior gives the impression that a group of games were somehow dependent when they are not.
So, to recap.
For the tests run, I ran on 128 nodes with nothing else running, and no possibility that anything else could run since I was on a "closed" system. Each node is absolutely identical to every other node. This is a 128-blade cluster in 5 racks. All use the same 3.2GHz Intel Xeon processors (older 64-bit processors), with 4.0 gigabytes of RAM and a big local hard drive, where the directories being used are wiped clean after each set of 4 games (2 with black and 2 with white) is played.
The referee starts two programs, sets the position and then interfaces between the two programs. No pondering. No SMP search. No learning. No opening books. After a game ends, both engines are "killed", two new instances are started up, and this is repeated for a total of 4 games with the given starting position, alternating colors after each game. That is a single "mini-match" of 4 games. There are 40 starting positions, for a total of 160 games per opponent for the set of mini-matches. 5 opponents gives 800 games. In the big runs, this 800-game set is repeated 32 times, producing 25,600 games.
Each 4-game mini-match is a single "shell script" that is submitted. Each script is scheduled when a node is unused, and runs to completion. When that run completes, another one is scheduled, although there is a 1-2 minute gap while the node "settles down" to appear unused again. The runs use local disk space, and after completion the final PGN results are copied to a multi-terabyte file storage system based on hardware RAID.
The PGN is organized into a bunch of files named matchN.X.Y, where N is the number of the opponent (1-5), X is the run number (1-32) and Y is the position number (1-40).
There is nothing carried between games (engines are re-started, no learning, no book/book learning, etc.) The games are run essentially in order by player number first, then by position, and finally by run number, so that partial results give a better cross-section of what is happening while the big match is still in progress...
I have a test in Crafty that monitors NPS to make sure that it doesn't fluctuate more than a normal amount. This was done because in the past, before some scheduling changes on the cluster, it was possible for a node to pick up a third (runaway) process which would consume CPU time and interfere with the running programs. I put the check into Crafty to catch this. It no longer happens, although the test is always there as a sanity check. There are no endgame tables (although I could copy the 3-4-5 piece files to all the nodes locally, or even use all the 6-piece files over the InfiniBand network, which would not perform particularly well).
From that, you can see why I believe there is zero potential for any single game to be dependent on any other game result. It is as controlled and precise as I can make it, and it is _far_ more controlled than what most are doing in their basement testing...