more on engine testing


Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: An important proposal (Re: more on engine testing)

Post by Sven »

Hi Bob,

how much effort would it take for you to provide the following results?

- A round-robin tournament of your five test opponents, with 160 games for each single pairing (i.e. a total of (5*4/2)*160 = 1600 games), under identical conditions to those you described for your Crafty test, including scores and relative Elo ratings given by BayesElo;

- the same repeated three times, so an overall total of 4*1600 = 6400 games, while saving each subset of 1600 games in a separate PGN (not important but convenient, see below).

With the saved PGN, and assuming four very close outcomes regarding the ratings, the next step could be to select any of the four PGN parts, concatenate it with the PGN of one of your Crafty test runs, and calculate Elo ratings.

Compare with the ratings that you get by combining the same PGN part with the PGN of your second (or any other different) 800-game Crafty test run.
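The mechanical part is trivial. A rough sketch of the combine-and-rate step (file names are placeholders, and the BayesElo command sequence is the usual readpgn/elo/mm/ratings one, which may differ slightly between versions):

    # Concatenate one round-robin PGN part with one Crafty run and feed the
    # result to BayesElo.
    import subprocess

    def rate(pgn_parts, combined="combined.pgn"):
        with open(combined, "w") as dst:
            for part in pgn_parts:
                with open(part) as src:
                    dst.write(src.read())
                    dst.write("\n")
        commands = "readpgn %s\nelo\nmm\nexactdist\nratings\nx\nx\n" % combined
        out = subprocess.run(["bayeselo"], input=commands,
                             capture_output=True, text=True)
        print(out.stdout)

    rate(["roundrobin_part1.pgn", "crafty_run1.pgn"])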

I hope you can prove me wrong :-)

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An important proposal (Re: more on engine testing)

Post by bob »

Sven Schüle wrote:Hi Bob,

how much effort would it take for you to provide the following results?

- A round-robin tournament of your five test opponents, with 160 games for each single pairing (i.e. a total of (5*4/2)*160 = 1600 games), under identical conditions to those you described for your Crafty test, including scores and relative Elo ratings given by BayesElo;

- the same repeated three times, so an overall total of 4*1600 = 6400 games, while saving each subset of 1600 games in a separate PGN (not important but convenient, see below).

With the saved PGN, and assuming four very close outcomes regarding the ratings, the next step could be to select any of the four PGN parts, concatenate it with the PGN of one of your Crafty test runs, and calculate Elo ratings.

Compare with the ratings that you get by combining the same PGN part with the PGN of your second (or any other different) 800-game Crafty test run.

I hope you can prove me wrong :-)

Sven
Some of that is not so easy to do with my current setup. For example, the version of Arasan I use sends the wrong "result" string depending on which color it is playing. My referee doesn't know about the game in progress; it just relays moves and uses the actual Crafty result to end the games. That's why I have been so careful to make sure that the result from Crafty is always correct, by visually checking the log for the scores and then looking at the result it sends (also recorded in the log) to make sure it is consistent. This way I don't get bogus draw claims, or wins when a program was actually losing. I can't use xboard to play these games as the remote X display introduces huge network traffic when 256 games are going at once, which adds a +lot+ of timing jitter to an already sensitive environment.

But if I do not use the cluster, I could do it on my office box using xboard. But the time control needs to be quick. What about the following: all games are 1 0 (game in one minute). That means 30 games an hour or more (not all games will need the full minute for each side), which would come out to almost 800 games a day. Ack. Probably too slow.

So, what time control is acceptable? How many games between opponents? If I do an 80-game match between each pair of opponents, that is 15 matches, which is 1200 games, which is doable on my office box using xboard (although xboard will accept bogus result strings, unfortunately)... does that work for you?
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: An important proposal (Re: more on engine testing)

Post by Sven »

bob wrote:So, what time control is acceptable? How many games between opponents? If I do an 80-game match between each pair of opponents, that is 15 matches, which is 1200 games, which is doable on my office box using xboard (although xboard will accept bogus result strings, unfortunately)... does that work for you?
Yes, of course.

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An important proposal (Re: more on engine testing)

Post by bob »

Sven Schüle wrote:
bob wrote:So, what time control is acceptable? How many games between opponents? If I do an 80-game match between each pair of opponents, that is 15 matches, which is 1200 games, which is doable on my office box using xboard (although xboard will accept bogus result strings, unfortunately)... does that work for you?
Yes, of course.

Sven
OK. To clarify:

1. Same 6 opponents (which include an old and a current Crafty).

2. Game in 1 minute time control.

3. 80 games using the Silver positions, one as Black and one as White for each position.

4. Round robin where each opponent plays every other opponent for the above 80 games.

5. Save the BayesElo results and the PGN.

Should I repeat this 2-3-4 times to see how the results vary, which is at the heart of this discussion? These games will be run on my office dual PIV Xeon box, so I can play 2 games at a time with no problem, to cut the total time to something barely reasonable.

And I could run BayesElo on each test, both including all the games and including only the Crafty-22.2 vs. the rest games, to see how the Elo numbers compare as well?
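Pulling the Crafty-only games out of the full PGN is simple enough; a throwaway sketch (the player string and file names are placeholders, and it assumes [Event ...] is the first tag of every game):

    # Keep only the games in which Crafty-22.2 is one of the players, so the
    # same data can be rated twice: full round robin vs. Crafty games only.
    import re

    def filter_pgn(src, dst, player="Crafty-22.2"):
        with open(src) as f:
            games = re.split(r"\n\n(?=\[Event )", f.read())
        kept = [g for g in games if player in g]
        with open(dst, "w") as f:
            f.write("\n\n".join(kept) + "\n")

    filter_pgn("roundrobin.pgn", "crafty_only.pgn")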
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: An important proposal (Re: more on engine testing)

Post by Sven »

Yes, that sounds like what I intended. I'm curious to see the results :-)

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An important proposal (Re: more on engine testing)

Post by bob »

Sven Schüle wrote:Yes, that sounds like what I intended. I'm curious to see the results :-)

Sven
It will take a bit to write a shell script to automate that, and get all the stuff set up (polyglot, .ini files, etc)... working on it.
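Roughly something like this, I think (the engine names are placeholders and the xboard flag spellings are from memory, so check the man page before trusting them):

    # Loop over all 15 pairings and let xboard play an 80-game match for each,
    # at game in one minute, saving every match to its own PGN.
    import itertools, subprocess

    engines = ["engine1", "engine2", "engine3", "engine4", "engine5", "engine6"]

    for first, second in itertools.combinations(engines, 2):
        subprocess.run([
            "xboard",
            "-fcp", first, "-scp", second,
            "-tc", "1", "-inc", "0",      # game in one minute
            "-mg", "80",                  # 80-game match for this pairing
            "-xponder",                   # pondering off
            "-sgf", "%s_vs_%s.pgn" % (first, second),
        ])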
MartinBryant

Re: more...

Post by MartinBryant »

bob wrote:Is that not a _huge_ change? 10 less wins, 10 less losses, 20 more draws. Just by adding 1000 nodes.
Yes it is, but it is exactly what you'd expect because of the non-determinacy in mature engines.

I ran a different experiment...
I played Fruit against itself for 100 games with no book at 0.5 seconds / move to see how many games it took before it repeated the first game.
Now if the engine were deterministic we'd get a repeat on the second game.
But anybody care to guess how many games before a repeat occurred?
The answer is not a single one!
In fact, not only did the first game never repeat, there wasn't a single duplicate in the whole match!
(You might even conclude that using a set of opening positions is irrelevant! Just leave the engines to it! That's probably too extreme for most people's tastes though.)
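For anyone who wants to repeat the duplicate check, something this crude is enough (the file name is a placeholder and it assumes each game starts with an [Event ...] tag):

    # Strip the header tags, normalise whitespace in each movetext and count
    # how often each complete game occurs.
    import re
    from collections import Counter

    def movetexts(path):
        with open(path) as f:
            games = re.split(r"\n\n(?=\[Event )", f.read())
        for game in games:
            moves = " ".join(line for line in game.splitlines()
                             if line and not line.startswith("["))
            yield " ".join(moves.split())

    counts = Counter(movetexts("fruit_selfplay.pgn"))
    duplicates = [m for m, n in counts.items() if n > 1]
    print(len(counts), "distinct games,", len(duplicates), "repeated")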

The fluctuations in timing are accentuated in this kind of test, as the search is terminated mid-flow as opposed to at a couple of well-defined decision points (do I start another iteration? do I terminate this iteration after this root move?), but it demonstrates the problem nicely.

A few different transposition table entries early on rapidly multiply until the root move scores start to change and eventually even the root move choice.
I believe it's analogous to the butterfly effect in chaos theory. An innocent table entry flaps its wings and later on there's a hurricane in the game result.

Your test is effectively just giving a consistent +ve clock wobble every move, as opposed to a random +/- on some moves.

Could it even be that a SINGLE node extra could cause such variation? Fancy trying a 1,000,000 node run and a 1,000,001 node run?

The vast majority of this non-determinism can of course be removed by simply clearing the transposition table before each move.
Of course this is not something you want to do in a real game, but it might be worth a few experiments when testing. Your opponents would have to have this option available to try too! Gut feeling though: I don't think it would stop you getting the fluctuations in the overall results that you see. I believe there's something else going on that we're all not seeing, but I too have no idea what it is.

FWIW, I sympathise with your original problem and I can entirely believe your data. (Although my Cambridge mathematician son was incredulous... at least he didn't accuse you of lying! :wink: )

I used to play 1000-game runs to try to make a good/bad decision, but I too was getting too many inconsistent results if I re-ran. (Apologies for never posting anything, I didn't realise I was meant to!)
I've now decided that using any arbitrary number of games is incorrect and changed the process as follows.
My GUI graphs the Elo rating as a match progresses. I start an open-ended match and let it go until the graph has flatlined to a +/-1 window for a thousand games. With this method I can sometimes get runs of <2000 games, but runs of >4000 games are not uncommon.
I'm not sure that this method will stand the test of time either, but it's what 'gives me confidence' (or keeps me sane!) currently.
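In rough pseudo-Python the stopping rule looks like this (play_one_game stands in for the real match driver and just returns 1, 0.5 or 0 for the engine under test; the Elo formula is the usual logistic one):

    # Keep a running Elo estimate after every game and stop once every
    # estimate over the last 1000 games fits inside a +/-1 Elo window.
    import math

    def running_elo(score, games):
        p = min(max(score / games, 1e-6), 1.0 - 1e-6)   # clamp away from 0 and 1
        return -400.0 * math.log10(1.0 / p - 1.0)

    def run_until_flat(play_one_game, window=1000, tol=1.0, max_games=20000):
        score, history = 0.0, []
        for n in range(1, max_games + 1):
            score += play_one_game()
            history.append(running_elo(score, n))
            recent = history[-window:]
            if n >= window and max(recent) - min(recent) <= 2 * tol:
                return n, history[-1]
        return max_games, history[-1]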

For those that are interested, I achieve this number of games in a reasonable time as follows...
I don't have a cluster, but I do have up to 6 test PCs available at different times. Most are old kit that has been retired from real use but is still good for sitting in a corner blindly churning out chess moves. Sometimes I steal the kids' PCs too!
I also play at very fast time controls, which I believe are as good as any other. I think any choice of time control is entirely arbitrary, as nobody ever specifies CPU speeds when they quote such things. Modern CPUs can go much deeper in 100ms than we were achieving at tournament time controls 25 years ago! So why should a 5+5 game be any more 'respectable' than a 0.5+0.5 game?

Anyways, good luck with your testing and please keep us posted on any findings as at least some of us are interested in your results!
Kirill Kryukov
Posts: 518
Joined: Sun Mar 19, 2006 4:12 am
Full name: Kirill Kryukov

Re: more on engine testing

Post by Kirill Kryukov »

As I am one of the "raters", this is an interesting discussion for me. As far as I can see here, Bob's games are clearly not independent. It would be funny to even try arguing against that (as long as we don't take one-in-a-million luck as a working hypothesis). The big question is: considering Bob's expertise and his inability to perform valid testing, how much are the rest of us affected?

The main theories seem to be:
1. (Bob) There are hidden mechanisms (timing, etc.) that make any independent sampling of game results impossible (or very hard).
2. (H.G.M.) Bob's cluster or software is broken.

As much as I hope that "2" is correct, I am worrying about the possibility of "1". Clearly we need more observations from more sources.
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Kirill Kryukov wrote: As much as I hope that "2" is correct, I am worrying about the possibility of "1". Clearly we need more observations from more sources.
There are many well-established techniques in data sampling to make it much less sensitive to artifacts. If humans are involved, double-blind tests help a lot. For slowly drifting signals one does a baseline correction; low-frequency noise can be eliminated by rapid multi-scanning. Games of a given match that logically belong together can be randomly interleaved with games from other matches, and randomly assigned to cores or times. And of course careful study of the raw data usually tells you which part of your data-collection machinery is broken.
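The interleaving part, for instance, is just a shuffle of the complete schedule before any game is assigned to a core (names and numbers below are placeholders):

    # Build the complete list of games for all pairings first, shuffle it, and
    # only then deal the games out to the cores, so slow drift of the machine
    # is spread evenly over every pairing.
    import itertools, random

    engines = ["A", "B", "C", "D", "E", "F"]
    games_per_pairing = 80            # 40 with each colour

    schedule = [(white, black, game)
                for white, black in itertools.permutations(engines, 2)
                for game in range(games_per_pairing // 2)]
    random.shuffle(schedule)

    n_cores = 8
    for core in range(n_cores):
        jobs = schedule[core::n_cores]
        print("core", core, "plays", len(jobs), "games")  # hand jobs to a worker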

It is all trivial to solve, actually, and even if the noise is intrinsic, the experiment can be set up so that it is insensitive to it. If only one would recognize that there is a problem... :wink:

Note that I don't really see the distinction that you make so clearly. If the game results are correlated, meaning that games within one match are dependent (1) on each other, more than on games of other matches, the setup is by definition broken (2), as this is not supposed to happen. The difference in point of view is more "broken, so we have to repair" versus "broken, so we have to argue that the task is impossible".
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

Kirill Kryukov wrote:As I am one of the "raters", this is an interesting discussion for me. As far as I can see here, Bob's games are clearly not independent. It would be funny to even try arguing against that (as long as we don't take one-in-a-million luck as a working hypothesis). The big question is: considering Bob's expertise and his inability to perform valid testing, how much are the rest of us affected?

The main theories seem to be:
1. (Bob) There are hidden mechanisms (timing, etc.) that make any independent sampling of game results impossible (or very hard).
2. (H.G.M.) Bob's cluster or software is broken.

As much as I hope that "2" is correct, I am worrying about the possibility of "1". Clearly we need more observations from more sources.
Third theory:

Bob's games are independent, but he had some bug in the program that calculated the total result of the games.
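Checking would only be a matter of counting the Result tags in the PGN, something like this (the file name is a placeholder):

    # Recompute the match totals independently of the testing software by
    # counting the Result tags.
    import re
    from collections import Counter

    with open("cluster_run.pgn") as f:
        results = Counter(re.findall(r'\[Result "([^"]*)"\]', f.read()))

    print("white wins:", results["1-0"], "black wins:", results["0-1"],
          "draws:", results["1/2-1/2"], "total:", sum(results.values()))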

We do not have the PGN of more than 50,000 games, so we cannot check whether the result that he reported is correct.

Uri