more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

hgm wrote:
Kirill Kryukov wrote: As much as I hope that "2" is correct, I am worried about the possibility of "1". Clearly we need more observations from more sources.
There are many well-established data-sampling techniques for making an experiment much less sensitive to artifacts. If humans are involved, double-blind tests help a lot. For slowly drifting signals one applies a baseline correction; low-frequency noise can be eliminated by rapid multi-scan. Games of a given match that logically belong together can be randomly interleaved with games from other matches, and randomly assigned to cores or times. And of course careful study of the raw data usually tells you which part of your data-collection machinery is broken.
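
As a rough sketch of that interleaving idea (the match names, game counts and core count below are made up; this is not anyone's actual test harness):

Code: Select all

import random

# Pool the games that logically belong to several matches, shuffle them together,
# and deal them out round-robin over the cores, so that no single match is tied
# to one core or to one stretch of wall-clock time.
matches = {"CraftyA_vs_Fruit": range(100),
           "CraftyB_vs_Fruit": range(100)}

jobs = [(name, game) for name, games in matches.items() for game in games]
random.shuffle(jobs)

CORES = 8
schedule = {core: [] for core in range(CORES)}
for i, job in enumerate(jobs):
    schedule[i % CORES].append(job)

print(schedule[0][:5])   # first few (match, game) assignments for core 0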

It is all trivial to solve, actually, and even if the noise is intrinsic, the experiment can be set up so that it is insensitive to it. If only one would recognize that there is a problem... :wink:

Note that I don't really see the distinction that you make so clearly. If the game results are correlated, meaning that games within one match are dependent (1) on each other, more than on games of other matches, the setup is by definition broken (2), as this is not supposed to happen. The difference in point of view is more "broken, so we have to repair" versus "broken, so we have to argue that the task is impossible".
I believe that games played with a time control at different times are usually correlated, but the correlation may have a small effect of less than 0.1 Elo, so it is unimportant.

I do not believe that the computer runs at constant speed at different times, and if the computer becomes 1% slower after a month of use, and if we assume that there is a difference in rating at different time controls, then it is clear that there is some correlation.

I assume that for a factor of 10 in speed you may get a difference of 20 Elo.
If the factor is only 1.01, you get a difference of less than 20/200 Elo (because 1.01^200 < 10, i.e. two hundred such 1% changes compound to less than a factor of 10, so one of them is worth less than 20/200 = 0.1 Elo).
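
A quick sanity check of that arithmetic (a minimal sketch, assuming the 20-Elo-per-factor-of-10 figure above and scaling it with the logarithm of the speed factor):

Code: Select all

import math

ELO_PER_FACTOR_10 = 20.0   # assumption taken from the estimate above

def elo_gain_from_speedup(speed_factor):
    # scale the Elo gain with the logarithm of the speed factor
    return ELO_PER_FACTOR_10 * math.log10(speed_factor)

print(elo_gain_from_speedup(10.0))   # 20.0 by construction
print(elo_gain_from_speedup(1.01))   # about 0.086, i.e. below the 0.1 Elo bound above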

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
Kirill Kryukov wrote: As much as I hope that "2" is correct, I am worried about the possibility of "1". Clearly we need more observations from more sources.
There are many well-established data-sampling techniques for making an experiment much less sensitive to artifacts. If humans are involved, double-blind tests help a lot. For slowly drifting signals one applies a baseline correction; low-frequency noise can be eliminated by rapid multi-scan. Games of a given match that logically belong together can be randomly interleaved with games from other matches, and randomly assigned to cores or times. And of course careful study of the raw data usually tells you which part of your data-collection machinery is broken.

It is all trivial to solve, actually, and even if the noise is intrinsic, the experiment can be set up so that it is insensitive to it. If only one would recognize that there is a problem... :wink:
And if one would recognize that I have already done as much of the above as is possible, we wouldn't be wasting time. Games are run at random times. They are run on random cores. Since the games last for random amounts of time, things are further scrambled. If you'd like, I can certainly run a couple of tests and list which node each match script runs on, to show that it is different every time due to the way the SGE scheduler works. But don't let details interfere with blindness...

Note that I don't really see the distinction that you make so clearly. If the game results are correlated, meaning that games within one match are dependent (1) on each other, more than on games of other matches, the setup is by definition broken (2), as this is not supposed to happen. The difference in point of view is more "broken, so we have to repair" versus "broken, so we have to argue that the task is impossible".
And there is absolutely no way for that to happen, unless it happens for _everybody_. Each individual game is run separately. Games within a test are randomly scattered over all the available nodes. There is no "continuity" between games, such as might happen if the same engine were loaded into memory and then played N games, leaving hash and such available between games. None of that happens. You continually suggest that I am the only one seeing this "broken" (your term) sort of behavior. Yet neither you nor anyone else has actually tried to prove that by running the test yourself. You just use your "superior intellect" to reach a conclusion. I am running real tests to reach a different conclusion. So far, the data supports _my_ results. All hand-waving and "impossibles" aside...
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: more on engine testing

Post by Sven »

Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated based only on games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
hgm wrote:
Kirill Kryukov wrote: As much as I hope that "2" is correct, I am worried about the possibility of "1". Clearly we need more observations from more sources.
There are many well-established data-sampling techniques for making an experiment much less sensitive to artifacts. If humans are involved, double-blind tests help a lot. For slowly drifting signals one applies a baseline correction; low-frequency noise can be eliminated by rapid multi-scan. Games of a given match that logically belong together can be randomly interleaved with games from other matches, and randomly assigned to cores or times. And of course careful study of the raw data usually tells you which part of your data-collection machinery is broken.

It is all trivial to solve, actually, and even if the noise is intrinsic, the experiment can be set up so that it is insensitive to it. If only one would recognize that there is a problem... :wink:

Note that I don't really see the distinction that you make so clearly. If the game results are correlated, meaning that games within one match are dependent (1) on each other, more than on games of other matches, the setup is by definition broken (2), as this is not supposed to happen. The difference in point of view is more "broken, so we have to repair" versus "broken, so we have to argue that the task is impossible".
I believe that games played with a time control at different times are usually correlated, but the correlation may have a small effect of less than 0.1 Elo, so it is unimportant.

I do not believe that the computer runs at constant speed at different times, and if the computer becomes 1% slower after a month of use, and if we assume that there is a difference in rating at different time controls, then it is clear that there is some correlation.

I assume that for a factor of 10 in speed you may get a difference of 20 Elo.
If the factor is only 1.01, you get a difference of less than 20/200 Elo (because 1.01^200 < 10, i.e. two hundred such 1% changes compound to less than a factor of 10, so one of them is worth less than 20/200 = 0.1 Elo).

Uri
Do you understand how the computer clock is generated? If that is not constant, then you can't receive any TV broadcasts or anything else because the frequencies must be very precise. Yes, you might see a clock vary by .001% as a worst case. That is no more or less significant than the fact that we can't measure time accurately to the sub-microsecond level either. So it doesn't matter in the grand scheme of things...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more...

Post by bob »

MartinBryant wrote:
bob wrote:Is that not a _huge_ change? 10 fewer wins, 10 fewer losses, 20 more draws. Just by adding 1000 nodes.
Yes it is, but it is exactly what you'd expect because of the non-determinacy in mature engines.
It is what _I_ expect, yes. But not "everyone" believes engines are that non-deterministic. That's why I ran the test, to actually measure this effect, which is pronounced, regardless of the opinions of others to the contrary.

I ran a different experiment...
I played Fruit against itself for 100 games with no book at 0.5 seconds / move to see how many games it took before it repeated the first game.
Now if the engine were deterministic we'd get a repeat on the second game.
But anybody care to guess how many games before a repeat occurred?
The answer is not a single one!
In fact, not only did the first game never repeat, there wasn't a single duplicate in the whole match!
(You might even conclude that using a set of opening positions is irrelevant! Just leave the engines to it! That's probably too extreme for most people's tastes though.)
Not surprising to me. But again, others have suggested that this non-deterministic effect I have been seeing is something unique to Crafty, rather than uniformly applicable to nearly all engines today, except for the brain-dead simple ones that are not important to the discussion.
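
A rough sketch of the duplicate check Martin describes (the movetext strings below are made up; in the real test the list would hold the 100 self-play games):

Code: Select all

def count_duplicates(games):
    # games: list of movetext strings, one per game
    seen = {}
    for moves in games:
        key = " ".join(moves.split())   # normalize whitespace
        seen[key] = seen.get(key, 0) + 1
    duplicates = sum(count - 1 for count in seen.values())
    return len(seen), duplicates

games = ["1. e4 e5 2. Nf3 Nc6", "1. e4 c5 2. Nf3 d6", "1. e4 e5 2. Nf3 Nc6"]
print(count_duplicates(games))   # (2, 1): two distinct games, one duplicate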

The fluctuations in timing are accentuated in this kind of test, as the search is terminated mid-flow as opposed to at a couple of well-defined decision points (do I start another iteration? do I terminate this iteration after this root move?), but it demonstrates the problem nicely.

A few different transposition table entries early on rapidly multiply until the root move scores start to change and eventually even the root move choice.
I believe it's analogous to the butterfly effect in chaos theory. An innocent table entry flaps its wings and later on there's a hurricane in the game result.
That could easily have been cut/pasted from several of my past posts on the topic. :)

Your test is effectively just giving a consistent +ve clock wobble every move, as opposed to a random +/- on some moves.

Could it even be that a SINGLE node extra could cause such variation? Fancy trying a 1,000,000 node run and a 1,000,001 node run?
I believe I ran such a thing, let me try to dig up the data. If I can't I can run it. I ran lots of +/- 100 node tests...


The vast majority of this non-determinacy can of course be removed by simply clearing the transposition table before each move.
Of course this is not something you want to do in a real game but might be worth a few experiments when testing? Of course your opponents would have to have this option available to try too! Gut feeling though, I don't think it would stop you getting the fluctuations in the overall results that you see. I believe there's something else going on that we're all not seeing, but I too have no idea what it is.
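
A toy illustration of that option (a sketch only, not any particular engine's code; search() is just a stand-in):

Code: Select all

class ToyEngine:
    def __init__(self, clear_hash_between_moves=False):
        self.tt = {}   # transposition table: position key -> stored entry
        self.clear_hash_between_moves = clear_hash_between_moves

    def think(self, position):
        # either let the table persist between moves (normal play, where leftover
        # entries feed the non-determinism discussed above) or wipe it so that
        # consecutive runs are more repeatable
        if self.clear_hash_between_moves:
            self.tt.clear()
        return self.search(position)

    def search(self, position):
        # a real search would probe and store self.tt here
        return "bestmove"

engine = ToyEngine(clear_hash_between_moves=True)
print(engine.think("startpos"))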

FWIW, I sympathise with your original problem and I can entirely believe your data. (Although my Cambridge mathematician son was incredulous... at least he didn't accuse you of lying! :wink: )
Good to know that some have some common sense and want to test for themselves as opposed to stamping their feet and shouting "can't be... can't be"


I used to play 1000-game runs to try to make a good/bad decision, but I too was getting too many inconsistent results if I re-ran. (Apologies for never posting anything, I didn't realise I was meant to!)
I've now decided that using any arbitrary number of games is incorrect and changed the process as follows.
My GUI graphs the Elo rating as a match progresses. I start an open-ended match and let it go until the graph has flatlined to a +/-1 window for a thousand games. With this method I sometimes get runs of fewer than 2000 games, but runs of more than 4000 games are not uncommon.
I'm not sure that this method will stand the test of time either, but it's what 'gives me confidence' (or keeps me sane!) currently.
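
A sketch of that stopping rule (assuming a list elo_history holding the rating estimate recorded after every game; the window and tolerance are the ones quoted above):

Code: Select all

def flatlined(elo_history, window=1000, tolerance=1.0):
    # True once the running Elo estimate has stayed inside a +/- tolerance band
    # for the last 'window' games
    if len(elo_history) < window:
        return False
    recent = elo_history[-window:]
    return max(recent) - min(recent) <= 2 * tolerance

print(flatlined([100.0] * 1500))      # True: perfectly flat history
print(flatlined(list(range(1500))))   # False: still drifting

# hypothetical driver loop:
# while not flatlined(elo_history):
#     play_one_game()
#     elo_history.append(current_elo_estimate())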

For those that are interested, I achieve these numbers of games in a reasonable time as follows...
I don't have a cluster, but I do have up to 6 test PCs available at different times. Most are old kit that has been retired from real use but is still good for sitting in a corner blindly churning out chess moves. Sometimes I steal the kids' PCs too!
I also play at very fast time controls, which I believe are as good as any other. I think any choice of time control is entirely arbitrary, as nobody ever specifies CPU speeds when they quote such things. Modern CPUs can go much deeper in 100ms than we were achieving at tournament time controls 25 years ago! So why should a 5+5 game be any more 'respectable' than a 0.5+0.5 one?

Anyways, good luck with your testing and please keep us posted on any findings as at least some of us are interested in your results!
Richard Allbert
Posts: 794
Joined: Wed Jul 19, 2006 9:58 am

Re: more on engine testing

Post by Richard Allbert »

Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated based only on games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other over 300 games vs 5 opponents, black and white against each engine in the 30 Noomen starting positions. I also ran a round robin between the five opponents from the 30 starting positions.

BayesElo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated based only on games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
The other games are irrelevant to the rating of Crafty; they can only change the ratings of the opponents.

If the program that calculates ratings is not broken, then the difference between Crafty1 and Crafty2 cannot be changed by games of Glaurung against Fruit (assuming Crafty1 and Crafty2 played an equal number of games against every opponent).

Uri
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

Richard Allbert wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated based only on games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other over 300 games vs 5 opponents, black and white against each engine in the 30 Noomen starting positions. I also ran a round robin between the five opponents from the 30 starting positions.

BayesElo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
I think that in this case your program to calculate Elo has a bug.

A difference of 2 points in 300 games cannot be translated into a difference of 40 Elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against which the results are more than 90% or less than 10%.
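
A back-of-the-envelope check of that claim, using the usual logistic conversion from score fraction to Elo difference (the simple formula only, not what BayesElo actually computes):

Code: Select all

import math

def elo_diff(score):
    # Elo difference implied by a score fraction strictly between 0 and 1
    return -400.0 * math.log10(1.0 / score - 1.0)

# 2 extra points out of 300 games near a 50% score:
print(elo_diff(152 / 300) - elo_diff(150 / 300))   # roughly 4.6 Elo
# the same 2 extra points near a 90% score:
print(elo_diff(272 / 300) - elo_diff(270 / 300))   # roughly 13 Elo, still far from 40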

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more...

Post by bob »

MartinBryant wrote:
bob wrote:Is that not a _huge_ change? 10 fewer wins, 10 fewer losses, 20 more draws. Just by adding 1000 nodes.
Yes it is, but it is exactly what you'd expect because of the non-determinacy in mature engines.
It is what _I_ expect, yes. But not "everyone" believes engines are that non-deterministic. That's why I ran the test, to actually measure this effect, which is pronounced, regardless of the opinions of others to the contrary.

I ran a different experiment...
I played Fruit against itself for 100 games with no book at 0.5 seconds / move to see how many games it took before it repeated the first game.
Now if the engine were deterministic we'd get a repeat on the second game.
But anybody care to guess how many games before a repeat occurred?
The answer is not a single one!
In fact, not only did the first game never repeat, there wasn't a single duplicate in the whole match!
(You might even conclude that using a set of opening positions is irrelevant! Just leave the engines to it! That's probably too extreme for most people's tastes though.)
Not surprising to me. But again, others have suggested that this non-deterministic effect I have been seeing is something unique to Crafty, rather than uniformly applicable to nearly all engines today, except for the brain-dead simple ones that are not important to the discussion.

The fluctuations in timing are accentuated in this kind of test, as the search is terminated mid-flow as opposed to at a couple of well-defined decision points (do I start another iteration? do I terminate this iteration after this root move?), but it demonstrates the problem nicely.

A few different transposition table entries early on rapidly multiply until the root move scores start to change and eventually even the root move choice.
I believe it's analogous to the butterfly effect in chaos theory. An innocent table entry flaps its wings and later on there's a hurricane in the game result.
That could easily have been cut/pasted from several of my past posts on the topic. :)

Your test is effectively just giving a consistent +ve clock wobble every move, as opposed to a random +/- on some moves.
It isn't always +ve. The problem is that the clock sort of "jumps" along in discrete steps, while I can sample at any point between the steps. So I could start something right before one of these jumps, and sample right after and draw the conclusion that I had used enough time when I had actually used almost none. Also using more time on one move means less time is left to use on others later, so it is a sort of oscillation that is set into motion somewhere and it ripples throughout the game after that point.
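
A small illustration of that effect (a sketch assuming a purely illustrative 10 ms timer granularity):

Code: Select all

TICK_MS = 10.0   # assumed timer granularity, for illustration only

def measured_elapsed(true_elapsed_ms, start_offset_ms):
    # the clock only reports whole multiples of TICK_MS, so the measured time
    # depends on where inside a tick the search happened to start
    start_reading = (start_offset_ms // TICK_MS) * TICK_MS
    end_reading = ((start_offset_ms + true_elapsed_ms) // TICK_MS) * TICK_MS
    return end_reading - start_reading

# a 1 ms search started just before a tick boundary looks like a full 10 ms...
print(measured_elapsed(1.0, 9.5))   # 10.0
# ...while the same 1 ms search started just after a boundary looks like 0 ms
print(measured_elapsed(1.0, 0.5))   # 0.0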

Could it even be that a SINGLE node extra could cause such variation? Fancy trying a 1,000,000 node run and a 1,000,001 node run?
I believe I ran such a thing, let me try to dig up the data. If I can't I can run it. I ran lots of +/- 100 node tests...


The vast majority of this non-determinacy can of course be removed by simply clearing the transposition table before each move.
Of course this is not something you want to do in a real game but might be worth a few experiments when testing? Of course your opponents would have to have this option available to try too! Gut feeling though, I don't think it would stop you getting the fluctuations in the overall results that you see. I believe there's something else going on that we're all not seeing, but I too have no idea what it is.

FWIW, I sympathise with your original problem and I can entirely believe your data. (Although my Cambridge mathematician son was incredulous... at least he didn't accuse you of lying! :wink: )
Good to know that some have some common sense and want to test for themselves as opposed to stamping their feet and shouting "can't be... can't be"


I used to play 1000-game runs to try to make a good/bad decision, but I too was getting too many inconsistent results if I re-ran. (Apologies for never posting anything, I didn't realise I was meant to!)
I've now decided that using any arbitrary number of games is incorrect and changed the process as follows.
My GUI graphs the Elo rating as a match progresses. I start an open-ended match and let it go until the graph has flatlined to a +/-1 window for a thousand games. With this method I sometimes get runs of fewer than 2000 games, but runs of more than 4000 games are not uncommon.
I'm not sure that this method will stand the test of time either, but it's what 'gives me confidence' (or keeps me sane!) currently.
However, if you re-run, won't you get _different_ results once again? I have tried that approach, in that at any instant I can grab the current BayesElo output for all games completed so far. My original intent was to play "just enough" games. But when I re-run, I still get different results.

For those that are interested, I achieve these numbers of games in a reasonable time as follows...
I don't have a cluster, but I do have up to 6 test PCs available at different times. Most are old kit that has been retired from real use but is still good for sitting in a corner blindly churning out chess moves. Sometimes I steal the kids' PCs too!
I also play at very fast time controls, which I believe are as good as any other. I think any choice of time control is entirely arbitrary, as nobody ever specifies CPU speeds when they quote such things. Modern CPUs can go much deeper in 100ms than we were achieving at tournament time controls 25 years ago! So why should a 5+5 game be any more 'respectable' than a 0.5+0.5 one?

Anyways, good luck with your testing and please keep us posted on any findings as at least some of us are interested in your results!
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: more on engine testing

Post by Sven »

Uri Blass wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated based only on games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
The other games are irrelevant to the rating of Crafty; they can only change the ratings of the opponents.

If the program that calculates ratings is not broken, then the difference between Crafty1 and Crafty2 cannot be changed by games of Glaurung against Fruit (assuming Crafty1 and Crafty2 played an equal number of games against every opponent).

Uri
I am not sure about this. In the beginning, the rating of each opponent is only calculated from its games against Crafty. If OppA scores 50% vs Crafty and OppB scores 70% vs Crafty, and no other games are considered, then approximately equal ratings are assigned to OppA and Crafty, and OppB gets a rating that is meant to express that OppB is stronger than OppA and Crafty (according to its 70% score, achieved only against Crafty).

Now if you add "enough" games between OppA and OppB, where OppA scores 60% against OppB (instead of the roughly 30% that might have been expected), and then calculate ratings from the whole round-robin set, I would like to know what you think the result will be. Does Crafty still get the same relative rating as before? Personally I don't think so, although I would of course accept it if you proved me wrong.

If my opponents' ratings are determined only by their play against me and not against other opponents, then they are quite unreliable IMO. There is no "playing strength" of my opponents that can serve as something fixed; their strength is only expressed through results against one single opponent, and that is me. So if I have only played against opponents with unreliable ratings, how reliable can my own rating be? And why should my own rating remain unchanged when adding games between my opponents, which makes their ratings more reliable and also changes them?

If I had my best score against some OppX then, without opponent-vs.-opponent games, OppX will get the worst rating of all. But when adding opponent-vs.-opponent games, it may turn out that OppX simply has big difficulties with my playing style but performs perfectly well against all the others, such that OppX is considered the best of all participants. Shouldn't this affect my relative rating, since I scored quite well against OppX?

If Crafty's relative rating can be affected then I think that the ratings of Crafty1 and Crafty2 may also be affected differently.

I may be wrong, of course :-) But that's my current understanding of the Elo system, which requires playing against several opponents to get a reliable rating.

I would be glad if someone could do some sort of simulation with BayesElo, or even a real test, to examine my theory - sorry that I can't do it by myself.
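
A minimal stand-in for such a simulation (it fits ratings by maximum likelihood under the plain logistic model, with no draw or colour modelling, so it is much cruder than the real BayesElo; all strengths, game counts and names below are made up):

Code: Select all

import random

def expected(ra, rb):
    # expected score of a player rated ra against one rated rb (logistic Elo model)
    return 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))

def simulate_score(ra, rb, games, rng):
    # win/loss only, no draws, purely for illustration
    return sum(1 for _ in range(games) if rng.random() < expected(ra, rb))

def fit_ratings(results, players, iters=1000, lr=0.5):
    # results: list of (a, b, score_of_a, games); crude gradient ascent on the
    # log-likelihood, re-centred every iteration so the average rating stays at 0
    r = {p: 0.0 for p in players}
    for _ in range(iters):
        grad = {p: 0.0 for p in players}
        for a, b, score, games in results:
            diff = score - games * expected(r[a], r[b])
            grad[a] += diff
            grad[b] -= diff
        for p in players:
            r[p] += lr * grad[p]
        mean = sum(r.values()) / len(r)
        for p in players:
            r[p] -= mean
    return {p: round(v, 1) for p, v in r.items()}

rng = random.Random(42)
true = {"Crafty": 0.0, "OppA": 0.0, "OppB": 150.0}   # made-up "true" strengths
players = list(true)

crafty_only = [("Crafty", opp, simulate_score(true["Crafty"], true[opp], 500, rng), 500)
               for opp in ("OppA", "OppB")]
with_round_robin = crafty_only + [
    ("OppA", "OppB", simulate_score(true["OppA"], true["OppB"], 500, rng), 500)]

print("Crafty vs opponents only: ", fit_ratings(crafty_only, players))
print("plus opponent round robin:", fit_ratings(with_round_robin, players))

Whether Crafty's number relative to the field moves once the round-robin games are added is exactly the question above, and with different made-up strengths for OppA and OppB the OppX scenario can be probed the same way.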

Sven