more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
Richard Allbert wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated only from games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
I think that in this case the program you use to calculate Elo has a bug.

A difference of 2 points in 300 games cannot be translated to a difference of 40 Elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against whom the results are more than 90% or less than 10%.

Uri
That has become the "designer excuse" of choice, it seems, sort of like "designer drugs". "You have a bug." Never any other possible explanation, such as some inherent randomness that has different characteristics than we believe it should have. He didn't mention the error bar, just the number BayesElo gives, which is a _single_ number. They could easily be exactly the same, with exactly the same error margins, which most (but not me, at least) would conclude means "no difference in the programs".

You need to quit assuming so much, and take what people write at face value. If it says 2500 +/- 40, and the next run says 2500 +/- 40, anyone in their right mind would say "BayesElo gave the same results for both." Amazingly, that is _exactly_ what he said. And that is why these discussions go on and on...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Graham Banks wrote:
Michael Sherwin wrote:All the chess engine raters out there do a fair to good job rating the engines, and their work, taken as a whole and averaged, is rather accurate.
Correct.
Of course this depends on what "rather accurate" means. Certainly not +/- 5 Elo, regardless of how many games are used.
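For a rough sense of scale behind "certainly not +/- 5 Elo", here is a minimal back-of-the-envelope sketch (my own illustration, not anything from the thread), assuming independent games between roughly equal engines, a fixed draw rate of about 33%, and no opponent-rating uncertainty at all:

Code: Select all

import math

def elo_from_score(p):
    """Elo difference implied by an expected score p (0 < p < 1)."""
    return 400.0 * math.log10(p / (1.0 - p))

def elo_error_95(n_games, draw_rate=0.33):
    """Approximate 95% error bar in Elo for n independent games
    between roughly equal engines, given an assumed draw rate."""
    var = (1.0 - draw_rate) * 0.25          # variance of one game scored 0 / 0.5 / 1
    se = math.sqrt(var / n_games)           # standard error of the mean score
    return elo_from_score(0.5 + 1.96 * se)  # Elo distance from an exact 50% score

for n in (300, 1000, 5000, 25000):
    print(n, "games -> +/-", round(elo_error_95(n), 1), "Elo")

Even under these idealized assumptions, a few hundred games leaves an error bar of more than 30 Elo, and getting near +/- 5 takes on the order of ten thousand games per engine; real rating lists, with correlated games and noisy opponent ratings, can only be worse.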
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Sven Schüle wrote:
Uri Blass wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated only from games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
The other games are irrelevant for the rating of Crafty; they can only change the ratings of the opponents.

If the program that calculates the ratings is not broken, then the difference between Crafty1 and Crafty2 cannot be changed by games of Glaurung against Fruit (assuming Crafty1 and Crafty2 played an equal number of games against every opponent).

Uri
I am not sure about this. In the beginning, the rating of each opponent is calculated only from its games against Crafty. If OppA scores 50% vs Crafty and OppB scores 70% vs Crafty, and no other games are considered, then approximately equal ratings are assigned to OppA and Crafty, and OppB gets a rating that expresses that OppB is stronger than OppA and Crafty (according to its 70% score against Crafty alone).

Now if you add "enough" games between OppA and OppB, where OppA scores 60% against OppB (instead of the roughly 30% that might have been expected), and then calculate ratings from the whole round robin set, I would like to know what you think the result will be. Does Crafty still get the same relative rating as before? Personally I don't think so, although I would of course accept it if you proved me wrong.

If my opponents' ratings are determined only by their play against myself and not against other opponents, then they are quite unreliable IMO. There is no "playing strength" of my opponents that can serve as something fixed; their strength is only expressed based on results against one single opponent, and that's myself. So if I have only played against opponents with unreliable ratings, how reliable can my own rating be? And why should my own rating remain unchanged when adding games between my opponents, which makes their ratings more reliable and also changes them?

If I had my best score against some OppX then, without opponent-vs.-opponent games, OppX will get the worst rating of all. But when adding opponent-vs.-opponent games, it may turn out that OppX simply has big difficulties against my playing style but performs perfectly well against all the others, such that OppX is considered the best of all participants. Shouldn't this affect my relative rating, since I scored quite well against OppX?

If Crafty's relative rating can be affected then I think that the ratings of Crafty1 and Crafty2 may also be affected differently.

I may be wrong, of course :-) But that's my current understanding of the Elo system, which requires playing against several opponents to get a reliable rating.

I would be glad if someone could do some sort of simulation with BayesElo, or even a real test, to examine my theory - sorry that I can't do it myself.

Sven
It would not be hard, as you don't need to play real games. BayesElo only considers three PGN tags, White, Black and Result. You could easily write a program to produce dummy PGN files with results that fit any sort of distribution you want, against any opponents you want, and then feed that into BayesElo to see what it thinks...
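As a concrete illustration of that suggestion (the engine names, probabilities and file name below are all made up for the example), a script like this writes dummy games containing only the three tags BayesElo reads, so the same data set can be rated once with and once without an opponent-vs-opponent block:

Code: Select all

import random

def dummy_game(white, black, result):
    """One minimal PGN game; only White, Black and Result matter to BayesElo."""
    return (f'[White "{white}"]\n[Black "{black}"]\n[Result "{result}"]\n\n'
            f'{result}\n\n')

def sample(p_white_win, p_draw, rng):
    """Draw a result from White's point of view."""
    r = rng.random()
    return "1-0" if r < p_white_win else ("1/2-1/2" if r < p_white_win + p_draw else "0-1")

rng = random.Random(42)
with open("dummy.pgn", "w") as f:
    # Crafty vs. two imaginary opponents, 100 games each, alternating colours.
    # p is Crafty's assumed win probability; no colour effect for simplicity.
    for opp, p in (("OppA", 0.35), ("OppB", 0.25)):
        for g in range(100):
            if g % 2 == 0:
                f.write(dummy_game("Crafty", opp, sample(p, 0.30, rng)))
            else:
                f.write(dummy_game(opp, "Crafty", sample(0.70 - p, 0.30, rng)))
    # To examine Sven's theory, append OppA-vs-OppB games here
    # (e.g. OppA scoring 60%) and compare the two BayesElo runs.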
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
Richard Allbert wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated only from games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
I think that in this case the program you use to calculate Elo has a bug.

A difference of 2 points in 300 games cannot be translated to a difference of 40 Elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against whom the results are more than 90% or less than 10%.

Uri
That has become the "designer excuse" of choice, it seems, sort of like "designer drugs". "You have a bug." Never any other possible explanation, such as some inherent randomness that has different characteristics than we believe it should have. He didn't mention the error bar, just the number BayesElo gives, which is a _single_ number. They could easily be exactly the same, with exactly the same error margins, which most (but not me, at least) would conclude means "no difference in the programs".

You need to quit assuming so much, and take what people write at face value. If it says 2500 +/- 40, and the next run says 2500 +/- 40, anyone in their right mind would say "BayesElo gave the same results for both." Amazingly, that is _exactly_ what he said. And that is why these discussions go on and on...
The possible error is not important for this point.

Richard wrote:
"Bayesian Elo rated one version 40 points higher than the other versions"
I understood that to mean that one version got a rating of 2540 based only on 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except for extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.

Uri
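As a reference point for the "less than 10 Elo" claim, this is the standard logistic conversion from a raw score to an Elo difference (a sketch; the numbers are illustrative, not Richard's actual results):

Code: Select all

import math

def elo_diff(points, games):
    """Elo difference implied by scoring `points` out of `games` games."""
    p = points / games
    return 400.0 * math.log10(p / (1.0 - p))

# Two versions on the same 300 games, two points apart around a 50% score:
print(round(elo_diff(152, 300) - elo_diff(150, 300), 1))  # about 4.6 Elo
# Near an extreme score the same two points are worth far more:
print(round(elo_diff(290, 300) - elo_diff(288, 300), 1))  # about 32.9 Elo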
Richard Allbert
Posts: 794
Joined: Wed Jul 19, 2006 9:58 am

Re: more on engine testing

Post by Richard Allbert »

bob wrote:
Uri Blass wrote:
Richard Allbert wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated only from games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
I think that in this case the program you use to calculate Elo has a bug.

A difference of 2 points in 300 games cannot be translated to a difference of 40 Elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against whom the results are more than 90% or less than 10%.

Uri
That has become the "designer excuse" of choice, it seems, sort of like "designer drugs". "You have a bug." Never any other possible explanation, such as some inherent randomness that has different characteristics than we believe it should have. He didn't mention the error bar, just the number BayesElo gives, which is a _single_ number. They could easily be exactly the same, with exactly the same error margins, which most (but not me, at least) would conclude means "no difference in the programs".

You need to quit assuming so much, and take what people write at face value. If it says 2500 +/- 40, and the next run says 2500 +/- 40, anyone in their right mind would say "BayesElo gave the same results for both." Amazingly, that is _exactly_ what he said. And that is why these discussions go on and on...
Exactly - the error bounds indicate that they could all have the same rating.
MartinBryant

Re: more...

Post by MartinBryant »

bob wrote: However, if you re-run, won't you get _different_ results once again? I have tried that approach, in that at any instant I can grab the current BayesElo output for all games completed so far. My original intent was to play "just enough" games. But when I re-run, I still get different results.
Well after you posted your 25000 game run results I have started to worry!

Certainly the 1000 game runs were insufficient and I should have spotted it sooner as sometimes the graph was clearly sloping and still oscillating at the cutoff point.
Hence the change to running until the graph flatlined.
But your results call that into doubt as well, so I have indeed started a re-run on a particularly troublesome change I've been toying with. (Troublesome in that with repeated 1000 game runs I got ELOs of +27, +9, +2 and -7. Don't know if those numbers violate any maths theory but they are all rather unhelpful when you're just trying to make a decision about a change!)
The first long run flatlined from about 2000-3000 games and was stopped at 3034 with a score of 0.
The second long run is in progress and I will let you know the result.

Of course, even if it does flatline at 0, it unfortunately doesn't prove that badly anomalous results can't occur! :cry:

Some thoughts I did have on your 25000 game runs...
1) I wondered if having multiple opponents complicates or compounds the problem somehow and would it be worthwhile repeating the experiment with just a single opponent to see if you get any stability then. (Of course you want to test against a range of opps eventually, but simplifying the experiment for now might reveal something.)
2) This is probably a dumb question but was the A/C on the way out during the second run causing some overheating?
3) In future, would it be worthwhile, at the end, automatically running some stats analysis on the PGNs, looking for things like anomalous average search depths, average NPS, or anything else you can think of, to sanity check the test stability/validity for your own peace of mind?
4) My maths is not strong and I might be completely loony tunes here, but if the persisted transposition table turns the engine into a chaotic system [not even sure it qualifies for all 3 chaos conditions] then I read that you can't use Gaussian stats for chaotic systems but have to use something called Paretian stats. So is the Elo system just the wrong tool for computer vs. computer matches?
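On the +27/+9/+2/-7 spread from the repeated 1000-game runs mentioned above: a quick simulation under purely i.i.d. assumptions (two equal engines, a fixed draw rate, nothing modelled about any real program) already produces swings of roughly that size, so those numbers need not violate any theory on their own.

Code: Select all

import math
import random

def measured_elo(n_games, draw_rate=0.33, seed=None):
    """Simulate n_games between two equal-strength engines and return the
    Elo difference a rating tool would report from the raw score alone."""
    rng = random.Random(seed)
    p_win = (1.0 - draw_rate) / 2.0
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        score += 1.0 if r < p_win else (0.5 if r < p_win + draw_rate else 0.0)
    p = score / n_games
    return 400.0 * math.log10(p / (1.0 - p))

# Four independent 1000-game "runs" of the same two engines:
print([round(measured_elo(1000, seed=s)) for s in range(4)])
# Expect values scattered over a few tens of Elo either side of zero.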
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Richard Allbert wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated only from games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
I think that in this case the program you use to calculate Elo has a bug.

A difference of 2 points in 300 games cannot be translated to a difference of 40 Elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against whom the results are more than 90% or less than 10%.

Uri
That has become the "designer excuse" of choice, it seems, sort of like "designer drugs". "You have a bug." Never any other possible explanation, such as some inherent randomness that has different characteristics than we believe it should have. He didn't mention the error bar, just the number BayesElo gives, which is a _single_ number. They could easily be exactly the same, with exactly the same error margins, which most (but not me, at least) would conclude means "no difference in the programs".

You need to quit assuming so much, and take what people write at face value. If it says 2500 +/- 40, and the next run says 2500 +/- 40, anyone in their right mind would say "BayesElo gave the same results for both." Amazingly, that is _exactly_ what he said. And that is why these discussions go on and on...
The possible error is not important for this point.

Richard wrote:
"Bayesian Elo rated one version 40 points higher than the other versions"
I understood that to mean that one version got a rating of 2540 based only on 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except for extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.

Uri
Here is the complete post he made:

==========================================================
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)
===========================================================

Now where are you drawing your conclusions from? It seems to say 75 games for each of his four versions, against each of 5 opponents. So the logical assumption is 15 games per opponent. 2 points could make a _big_ difference in one of those, depending on how they end up rated themselves...

Again, drawing too many conclusions from assumptions which appear to have nothing to do with the actual conditions being described. "total of 300 games" seems quite clear to me...

Here is a 15 game match, with program A losing the first 5 rounds, drawing the next 5, and winning the final 5. And the rating difference between the two programs is computed at 26 points +/- 68. Want to know why? :) It is all well documented in Remi's writeup: white vs. black. All the games had A as white. Fix that?

Code: Select all

Rank Name        Elo    +    - games score oppo. draws 
   1 program B    13   68   68    15   50%   -13   33% 
   2 program A   -13   68   68    15   50%    13   33% 
Correct it so that black/white is equalized as best possible with an odd number of games and I get this:

Code: Select all

Rank Name        Elo    +    - games score oppo. draws 
   1 program B     1   68   68    15   50%    -1   33% 
   2 program A    -1   68   68    15   50%     1   33% 
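A sketch of why the colour split alone moves the numbers: BayesElo models a white advantage, so a program that had White in every game yet only scored 50% is judged weaker than its opponent by roughly that advantage. The 32-Elo advantage below is an illustrative assumption; BayesElo's own advantage and draw model produce the 26-point figure above, not exactly these values.

Code: Select all

import math

def expected_score(elo_diff):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def implied_diff(score, white_games, black_games, white_adv=32.0):
    """Bisect for the Elo difference A-B that reproduces A's observed score,
    given A's colour split and an assumed white advantage."""
    n = white_games + black_games
    lo, hi = -1000.0, 1000.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        exp = (white_games * expected_score(mid + white_adv) +
               black_games * expected_score(mid - white_adv)) / n
        lo, hi = (mid, hi) if exp < score else (lo, mid)
    return 0.5 * (lo + hi)

print(round(implied_diff(0.5, 15, 0)))  # all games as White: about -32
print(round(implied_diff(0.5, 8, 7)))   # colours nearly even: about -2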
I think _nobody_ is putting enough thought into considering this entire process. Just the usual

stampee foot... can't be.... stampee foot.. bug... stampee foot. cluster broken... stampee foot.. software broken... stampee foot... researcher can't produce good data... stampee foot... statistically invalid... stampee foot... impossible result... stampee foot... cherry-picked data... stampee foot... I could fix this trivially if I wanted to... stampee foot.... stampee foot...

time after time.

It would seem to me that a _logical-thinking_ person would begin to say "Hmmm, we are seeing more of this than we would expect, by multiple different sources, so maybe there is something wrong with our thought processes. For example, of _course_ the games are somewhat dependent. Why? Same damned players. How well would 100 games with me vs Kasparov correlate? 100% since I would lose every game? Data no good. Data dependent... stampee foot...

<sigh>


A beats C by 2 points
A plays even with B.
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

Richard Allbert wrote:
bob wrote:
Uri Blass wrote:
Richard Allbert wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated only from games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
I think that in this case the program you use to calculate Elo has a bug.

A difference of 2 points in 300 games cannot be translated to a difference of 40 Elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against whom the results are more than 90% or less than 10%.

Uri
That has become the "designer excuse" of choice, it seems, sort of like "designer drugs". "You have a bug." Never any other possible explanation, such as some inherent randomness that has different characteristics than we believe it should have. He didn't mention the error bar, just the number BayesElo gives, which is a _single_ number. They could easily be exactly the same, with exactly the same error margins, which most (but not me, at least) would conclude means "no difference in the programs".

You need to quit assuming so much, and take what people write at face value. If it says 2500 +/- 40, and the next run says 2500 +/- 40, anyone in their right mind would say "BayesElo gave the same results for both." Amazingly, that is _exactly_ what he said. And that is why these discussions go on and on...
Exactly - the error bounds indicate that they could all have the same rating.
The error bounds are unimportant, and I did not try to find which program is better or to find the exact rating of the program based on many games.

If you tell me that the difference in performance between the programs is 10 Elo, and the difference in rating based only on this performance (and not on different data) is 40 Elo, then it is clear that the program that calculates the rating has a bug. That is exactly what you did, except that you did not write that the difference in performance is 10 Elo; you only wrote that the difference in points is not more than 2 out of 300. But a difference in points of not more than 2 out of 300 usually translates to a difference in rating of less than 10 Elo, if we do not talk about extreme results.


Uri
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
Richard Allbert wrote:
Sven Schüle wrote:Fourth theory, brought up by Richard Allbert and supported also by me (see also this subthread: http://64.68.157.89/forum/viewtopic.php ... 51&t=22731):

Elo ratings are calculated only from games of Crafty vs. its opponents, while the opponents did not play each other for that calculation (so far the facts), so that the Crafty Elo results are relative to unstable ratings of its opponents and therefore too inaccurate (that's the theory).

Bob is preparing data that could be used to verify this theory. We will see.

Sven
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)

Richard
I think that in this case the program you use to calculate Elo has a bug.

A difference of 2 points in 300 games cannot be translated to a difference of 40 Elo, except when the result is close to 100% or 0%, and I assume you did not choose opponents against whom the results are more than 90% or less than 10%.

Uri
That has become the "designer excuse" of choice, it seems, sort of like "designer drugs". "You have a bug." Never any other possible explanation, such as some inherent randomness that has different characteristics than we believe it should have. He didn't mention the error bar, just the number BayesElo gives, which is a _single_ number. They could easily be exactly the same, with exactly the same error margins, which most (but not me, at least) would conclude means "no difference in the programs".

You need to quit assuming so much, and take what people write at face value. If it says 2500 +/- 40, and the next run says 2500 +/- 40, anyone in their right mind would say "BayesElo gave the same results for both." Amazingly, that is _exactly_ what he said. And that is why these discussions go on and on...
The possible error is not important for this point.

Richard wrote:
"Bayesian Elo rated one version 40 points higher than the other versions"
I understood that to mean that one version got a rating of 2540 based only on 300 games, while the other versions got at most 2500 based on 300 games against the same opponents.

The error bar may be 40 Elo, but the error bar is not important, because the difference in performance is clearly less than 40 Elo if you get a difference of 2 points in 300 games, except for extreme cases that I believe did not happen, like the difference between 298/300 and 300/300.

Uri
Here is the complete post he made:

==========================================================
Just to add uneducated fuel to this fire.... I did a test of four different versions, altering the search in three (checks in qsearch, null move reduction depth made less aggressive, null move bug fixed). They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions. I also ran an RR between the five opponents from the 30 starting positions.

Bayesian Elo rated one version 40 points higher than the other versions, even though the total scores were almost the same. This seemed to be pretty clear...

Interestingly, I've found a Null move bug as a result :)
===========================================================

Now where are you drawing your conclusions from? It seems to say 75 games for each of his four versions, against each of 5 opponents. So the logical assumption is 15 games per opponent. 2 points could make a _big_ difference in one of those, depending on how they end up rated themselves...

Again, drawing too many conclusions from assumptions which appear to have nothing to do with the actual conditions being described. "total of 300 games" seems quite clear to me...

Here is a 15 game match, with program A losing the first 5 rounds, drawing the next 5, and winning the final 5. And the rating difference between the two programs is computed at 26 points +/- 68. Want to know why? :) It is all well documented in Remi's writeup: white vs. black. All the games had A as white. Fix that?

Code: Select all

Rank Name        Elo    +    - games score oppo. draws 
   1 program B    13   68   68    15   50%   -13   33% 
   2 program A   -13   68   68    15   50%    13   33% 
Correct it so that black/white is equalized as best possible with an odd number of games and I get this:

Code: Select all

Rank Name        Elo    +    - games score oppo. draws 
   1 program B     1   68   68    15   50%    -1   33% 
   2 program A    -1   68   68    15   50%     1   33% 
I think _nobody_ is putting enough thought into considering this entire process. Just the usual

stampee foot... can't be.... stampee foot.. bug... stampee foot. cluster broken... stampee foot.. software broken... stampee foot... researcher can't produce good data... stampee foot... statistically invalid... stampee foot... impossible result... stampee foot... cherry-picked data... stampee foot... I could fix this trivially if I wanted to... stampee foot.... stampee foot...

time after time.

It would seem to me that a _logical-thinking_ person would begin to say "Hmmm, we are seeing more of this than we would expect, by multiple different sources, so maybe there is something wrong with our thought processes. For example, of _course_ the games are somewhat dependent. Why? Same damned players. How well would 100 games with me vs Kasparov correlate? 100% since I would lose every game? Data no good. Data dependent... stampee foot...

<sigh>


A beats C by 2 points
A plays even with B.
If you give one program White all the time this is possible, but I assume an equal number of games with White and Black for all programs, because it is not logical to give one program White all the time, and the discussion is about playing the Silver suite or some other test from fixed positions with both White and Black.

Edit: It seems that you do not understand what dependent means.

Your games against Kasparov can be dependent not because you lose against him, but because the expected result changes with more games.

That can happen if you learn to play better after the losses.

Uri
Last edited by Uri Blass on Mon Aug 04, 2008 10:36 pm, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more...

Post by bob »

MartinBryant wrote:
bob wrote: However, if you re-run, won't you get _different_ results once again? I have tried that approach, in that at any instant I can grab the current BayesElo output for all games completed so far. My original intent was to play "just enough" games. But when I re-run, I still get different results.
Well after you posted your 25000 game run results I have started to worry!

Certainly the 1000 game runs were insufficient and I should have spotted it sooner as sometimes the graph was clearly sloping and still oscillating at the cutoff point.
Hence the change to running until the graph flatlined.
But your results call that into doubt as well, so I have indeed started a re-run on a particularly troublesome change I've been toying with. (Troublesome in that with repeated 1000 game runs I got ELOs of +27, +9, +2 and -7. Don't know if those numbers violate any maths theory but they are all rather unhelpful when you're just trying to make a decision about a change!)
The first long run flatlined from about 2000-3000 games and was stopped at 3034 with a score of 0.
The second long run is in progress and I will let you know the result.

Of course, even if it does flatline at 0, it unfortunately doesn't prove that badly anomalous results can't occur! :cry:

Some thoughts I did have on your 25000 game runs...
1) I wondered if having multiple opponents complicates or compounds the problem somehow and would it be worthwhile repeating the experiment with just a single opponent to see if you get any stability then. (Of course you want to test against a range of opps eventually, but simplifying the experiment for now might reveal something.)
2) This is probably a dumb question but was the A/C on the way out during the second run causing some overheating?
3) In future, would it be worthwhile, at the end, automatically running some stats analysis on the PGNs, looking for things like anomalous average search depths, average NPS, or anything else you can think of, to sanity check the test stability/validity for your own peace of mind?
4) My maths is not strong and I might be completely loony tunes here, but if the persisted transposition table turns the engine into a chaotic system [not even sure it qualifies for all 3 chaos conditions] then I read that you can't use Gaussian stats for chaotic systems but have to use something called Paretian stats. So is the Elo system just the wrong tool for computer vs. computer matches?
For your A/C question, no. When it goes out, the cluster goes down within less than 3 minutes if I am running a test. The heat this thing produces is incredible with 128 nodes smoking along on 256 processors. I _always_ have a check on NPS running. At one point it was the only way I could discover that a rogue process was running where it should not be and sucking up CPU cycles. We fixed this on the cluster back in January by developing a new way of assigning jobs and cleaning up after old ones finished before starting new ones, so that A can't run a job and leave a process behind after it finishes, still burning CPU cycles. But I never removed the check, so it is always there. And I can log on to a node and crank up an extra copy of Crafty, and within 4-5 seconds the test has been completely aborted (if anybody reports a funny NPS I just terminate the whole thing, so that I will know something went wrong and can find out where). This has not happened since January, with one exception. I ran a big test on the 540-core cluster, but somehow also had the last CCT ICC setup lying around, and when I started Crafty on ICC, it used the Ferrum node I used (8 cores, 12 gigs, etc.), which blew the test out instantly. I knew it would take 3 days to complete that rather big run, so I was a bit pissed when I checked on it and found it had terminated 3 days earlier due to a low NPS...

As far as your last (3rd) question goes, it is a good question. Something is certainly "up"...