more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Carey wrote:
Bill Rogers wrote:Carey
A chess program without an evaluation routine can not really play chess in the normal sense.
I never said without an evaluator.

I was meaning a relatively simple one compared to something big & complex like Crafty uses.

Since he took out an important chunk of Crafty and still couldn't determine which version was stronger, it suggests that the search itself was the dominant factor. Hence the classic wisdom that, in computer vs. computer play, the deeper-searching program is likely the one to win.

It could only make random moves no matter how fast it could search. A search routine all by itself only generates moves and does not know which move is better. In fact it does not even know when it makes a capture or what the piece is worth.
Actually, the search is the tactical nature of chess.

Chess is pure tactics and that is precisely what the search does.

If you could, for example, search 10 plies deeper in the middle game than your opponent, and you used only material plus maybe a few simple rules or piece-square tables, I'd put the odds on you rather than on the smarter but shallower-searching program.

Extend that even deeper, to 20 plies, and I don't think many people would bet on the shallower-searching program.

This is just in response to you saying 'no matter how fast it could search'. I'm not suggesting it to be even remotely practical.
Even the simplest evaluation subroutine, one which merely distinguishes good captures from bad ones, won't be able to play a decent game of chess. It might beat some beginning chess players, but not anyone with any real chess-playing skill, computer or not.
Bill
That's provably false.... Search 500 plies... all the way to the end of the game.... Will it play well? Obviously yes.

But again, I'm not suggesting it's practical etc. I'm just taking it to extreme to prove the point.


I wasn't meaning a super simplistic evaluator, just a basic one.



I do, however, still think that a 'TECH'-style program is a good idea: a very simplistic evaluator that depends on the search, combined with advanced search techniques.
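For illustration only, here is a rough sketch of what such a bare-bones, TECH-style evaluator might look like. The piece values and the single pawn table are generic textbook-style numbers, not anything taken from Crafty or micro-Max, and the board representation is a made-up toy.

# A minimal material + piece-square evaluator, as a sketch of the "TECH" idea:
# all chess knowledge lives in a handful of numbers, and the search does the rest.
# Piece values are conventional centipawn figures; the pawn table is illustrative.

PIECE_VALUE = {'P': 100, 'N': 320, 'B': 330, 'R': 500, 'Q': 900, 'K': 0}

# Small bonus for advanced/central pawns, indexed by square 0..63 (a1 = 0, h8 = 63),
# from white's point of view.  Purely illustrative numbers.
PAWN_PST = [
     0,  0,  0,  0,  0,  0,  0,  0,
     5,  5,  5, -5, -5,  5,  5,  5,
     5,  0, 10, 15, 15, 10,  0,  5,
     5,  5, 15, 25, 25, 15,  5,  5,
    10, 10, 20, 30, 30, 20, 10, 10,
    20, 20, 30, 40, 40, 30, 20, 20,
    40, 40, 50, 60, 60, 50, 40, 40,
     0,  0,  0,  0,  0,  0,  0,  0,
]

def evaluate(board):
    """board: dict mapping square index (0..63) to a piece letter,
    uppercase for white, lowercase for black.  Returns a score in
    centipawns from white's point of view."""
    score = 0
    for sq, piece in board.items():
        value = PIECE_VALUE[piece.upper()]
        if piece == 'P':
            value += PAWN_PST[sq]
        elif piece == 'p':
            value += PAWN_PST[63 - sq]      # mirror the (file-symmetric) table for black
        score += value if piece.isupper() else -value
    return score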

Many people use Muller's micro-max for that purpose, actually. But it's a little too simplistic and too small and not quite user friendly.
Let me qualify some things I did, just to keep things dead-on factual. I removed things like all passed pawn evaluation, including majorities, outside passers, passed pawn races, etc. This weakened Crafty measurably, but the results were random enough that two different runs might show "that was really bad" and "that was barely a bad change". I removed the draw scoring that recognizes drawn positions or positions where one side appears to be ahead but can't win, or recognizes drawish (but not dead drawn) positions such as bishops of opposite color. This makes a significant difference in the accuracy of evaluation in such positions, yet it makes very little difference in the final results on the Silver test. Why? Unknown. Perhaps most games don't reach deep endgame positions. Or these rather uncommon sorts of draws are just not that likely from the starting positions given. Or any of several other possible explanations.

I was simply looking to see if we had any unknown issues that were hurting us, and the only thing I have actually found was that the draw scoring was problematic... because against good opponents, I would rather go into a KRB vs KR dead-drawn ending with an eval of +3.00 than try to play a KRNPPP vs KRBPP with an eval of +1.00. The latter can be lost, while the former can't be. More food for thought dealing with draws: I want 0.00 in KB+RP of the wrong color. I want 0.00 in KRB or KRN vs KR. Etc. But that becomes problematic when, trying to avoid a KRB vs KR at +3, you go into an ending that eventually ends up not as good as you had thought.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

It does have an accepted bias for white. But it is moot here, since each set of games is equalized between white and black so that the bias washes out, as I interpreted his description. It applies a slight fudge to a single game, knowing that white has a slight winning edge, but if you play a second game with colors reversed, that washes out, I believe.
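As a quick illustration of why paired games wash the color bias out (a toy calculation, not anything from BayesElo itself): if having white adds a fixed bonus to whoever holds the white pieces, the bonus cancels exactly when each position is played once with each color. The numbers below are made up.

# Toy model: engine A's "true" expected score against B is p, and having white
# adds a fixed bonus b (in expected-score terms) to whoever has the white pieces.
p = 0.52   # A's true expected score per game, colors averaged (illustrative)
b = 0.04   # white's advantage (illustrative)

score_A_as_white = p + b
score_A_as_black = p - b

# Over a two-game pair with colors reversed, the bias cancels exactly:
pair_average = (score_A_as_white + score_A_as_black) / 2
print(pair_average)   # 0.52 -- the color bonus b has washed out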
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Dirt wrote:
bob wrote:BTW wouldn't "duplicates" be a good thing here? More consistency? If every position always produced 2 wins and 2 losses, or 2 wins and 2 draws, then the variability would be gone and the issue would not be problematic.
Let's say one of the starting positions always results in one of two games, one a win for white and the other a loss. If there is some small change, maybe the barometric pressure or something, that can change the outcome from one to the other then we would see inconsistent results. By examining the PGN this sort of problem should be easy to spot because there would be an extremely large change in one or at most a few starting positions.

Someone who had a less stable environment, with many random factors affecting the outcome, wouldn't see the same problem.
There are several positions where I have seen this, including the previously mentioned +4, then -4, for the same position. Some positions appear to be balanced on a very fine edge, and a move here or there topples the game to one side or the other. But making it happen in any dependent way seems impossible. Sometimes an extra 100,000 nodes or whatever finds a better move, sometimes it finds a better move that leads to a lost position later.
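For anyone who wants to run the per-position check Dirt describes, here is a rough sketch. It assumes the PGN files tag each game with FEN and Result headers (which is how most referee setups record games from set positions); the file names and the swing threshold are hypothetical.

# Tally per-starting-position scores in two PGN runs and flag large swings.
# This sums White's score per FEN; if you need per-engine numbers, key on the
# White/Black tags as well.
import re
from collections import defaultdict

def per_position_scores(pgn_path):
    scores = defaultdict(float)
    headers = {}

    def flush():
        if "FEN" in headers and "Result" in headers:
            scores[headers["FEN"]] += {"1-0": 1.0, "0-1": 0.0,
                                       "1/2-1/2": 0.5}.get(headers["Result"], 0.0)

    with open(pgn_path) as f:
        for line in f:
            if line.startswith("[Event "):      # start of a new game's header block
                flush()
                headers = {}
            m = re.match(r'\[(\w+) "(.*)"\]', line)
            if m:
                headers[m.group(1)] = m.group(2)
    flush()
    return scores

run1 = per_position_scores("run1.pgn")          # hypothetical file names
run2 = per_position_scores("run2.pgn")

for fen in sorted(set(run1) | set(run2)):
    swing = run1.get(fen, 0.0) - run2.get(fen, 0.0)
    if abs(swing) >= 3.0:                       # arbitrary "large change" threshold
        print(f"{swing:+.1f}  {fen}")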
Uri Blass
Posts: 10807
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

You are right that _any_ result is "possible", but something possible only with very low probability is practically something that you can consider impossible.

If you lose a fair game every time (and I mean something like 30-0 against you, not 5-0 or 10-0, and a game that is based only on luck and not on knowledge), then you can be practically sure that people are cheating you, even if you do not know how they cheat.
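A quick back-of-the-envelope figure for the 30-0 example, as a sketch assuming a pure 50/50 game of luck with no draws:

# Probability of losing 30 independent fair coin-flip games in a row.
p_single_loss = 0.5
p_thirty_losses = p_single_loss ** 30
print(p_thirty_losses)        # ~9.3e-10, i.e. roughly one chance in a billion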

It is not impossible that you are unlucky, but practically I do not believe in this possibility, especially when I know that you lost some data about the games (in the relevant case you do not have the PGN of more than 50,000 games, that is, more than 25,000 games for each of the versions).

It may be possible to convince me that you were unlucky only if I see the full data and investigate it.

Uri
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:
hgm wrote:
bob wrote:..., otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha. opinions rather than facts now.
???

Seems a fact to me that you complained about getting impossible results. These were not my words.
What is "impossible"? _any_ result is "possible". And people do win the lottery every week. Whether this is that kind of unusual event or not we will see as more tests roll out.
Impossible in the practical sense is a relative notion. In principle, it is possible that, when you step out of your office window on the fifth floor, the motion of the air molecules below you experiences a random fluctuation such that you remain floating in the air. Do not try it! You can be certain you will drop like a stone. If you see people floating in the air without any visible support, you can be certain that they have a way of supporting themselves, and that you merely do not see it. Likewise here.
And I have not made that kind of mistake. Again, _any_ result is possible; you are claiming otherwise, which shows a complete misunderstanding of this particular type of statistical result. There are no "correct" or "incorrect" results in this kind of testing, since any result, from all wins, to all losses, to any combination of the two, is possible.
Not in any useful and pragmatic sense. Some results are too unlikely to be acceptable. It all relates to Bayesian likelihood. There are alternative explanations for the observation. You can never exclude all of them with absolute certainty; some of them you might not even be aware of. One of your students might have wanted to pull a prank on you and substituted one of the engine executables or the result file. There might be ways the results were doctored that we cannot even imagine. One can never exclude everything to a level of one in a billion. And if an alternative explanation remains with a prior probability of more than one in a billion, it will outcompete the explanation that everything was as it should have been and the result is only due to a one-in-a-billion fluke, because the Bayesian likelihood of that null hypothesis being correct has just been reduced to the one-in-a-billion level by the occurrence of the result, while the smallness of some of the alternative hypotheses relies solely on the smallness of their prior probability.
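To make the Bayesian argument concrete with made-up round numbers (the priors and likelihoods here are purely illustrative, not estimates of anyone's actual setup):

# Compare two explanations for an "impossible-looking" result using Bayes' rule.
p_result_given_honest = 1e-9   # chance of the result as a pure statistical fluke
p_result_given_glitch = 0.5    # chance of the result if something in the setup is broken
prior_glitch = 1e-4            # prior belief that something in the setup is broken
prior_honest = 1 - prior_glitch

posterior_glitch = (p_result_given_glitch * prior_glitch) / (
    p_result_given_glitch * prior_glitch + p_result_given_honest * prior_honest)
print(posterior_glitch)        # ~0.99998: the "glitch" explanation dominates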
That's the flaw in this kind of thinking. "the results must match the theory because the results must match the theory".
Learn about probability theory, and you might be able to grasp it. It is not that difficult. Every serious scientist in the World uses it.
I gave you the data, as it happened. I will provide more over time. I still have not seen you give one possible way where I can run each game independently and configure things so that the games are somehow dependent.
And none will be forthcoming as long as you don't enable me to do a post-mortem on your data set. Do you really expect that you can tell people just a single large number, and then ask them to explain how you got that number? That does not seem to indicate much realism on your part.
I could show you a single position where the four games (two same opponents) produce things from +4 to -4, but there seems to be no point, after all how could the two programs vary that much in a single position, right?
Are remarks like these meant to illustrate how mathematically inept you are, or what? Between approximately equal opponents the probability of a 0-4 or a 4-0 result is ~2.5% (assuming a 20% draw rate), and your ability to show them relies on selecting them from a set of tries much larger than 40. This remark displays the same lack of understanding as the one about winning the lottery. Of course people win the lottery: the product of the probability that a given individual wins and the number of participants is of order one.
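For reference, the figure quoted above can be reproduced directly; a sketch under the same assumptions (equal opponents, 20% draw rate):

# With equal opponents and a 20% draw rate, each game is won by a given side
# with probability 0.4.  A 4-0 sweep by one specified side is then:
p_win = 0.4
p_sweep_one_side = p_win ** 4
print(p_sweep_one_side)         # 0.0256, i.e. ~2.5% for a sweep by a specified side
print(2 * p_sweep_one_side)     # ~5% for "either side sweeps 4-0"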
One day, if you ever get a chance to run some big tests, you might end up re-thinking your rather rigid ideas about how things ought to be, as opposed to how they really are...
Fat chance, captain... For I have been an experimental physicist all my life, and I know how to collect data, and how to recognize collected data as crap when there was a fault in the data-collection equipment. When things really are as they should not be, it shows that you goofed. Mathematics is indeed a rigid endeavor, and results that violate mathematical bounds are more suspect than any others.

But the main difference between us is that if I had faulty equipment, I would analyze the erroneous data in order to trace the error and repair the equipment. Your abilities seem to stop at trying it again in the same way, in the hope it will work this time...
bob wrote: That is exactly my point. When there is absolutely no difference between runs a, b, c, d, ... etc. except for the timing randomness I have already identified, then that is the cause of the randomness.
I was not asking for the cause of the randomness, but for the cause of the correlation between results of games in the same run that your data contains.
I have already verified that and explained how. Searching to fixed node counts produces the same results each and every time, to the exact number of wins, losses and draws. Now the question would be, how could this test somehow bias the time allocation to favor one program more than another? I have played matches of 1+1, 2+2, 3+3, 4+4 and 5+5 to verify that the results are not that different. So how do I bias time to make the results favor one over another in a correlated way, when they share the _same_ processor, same memory, etc.

Time is an issue. Time is the only issue. And _everybody_ has exactly the _same_ timing issues I have, only more so, as most machines are not nearly as "bare-bones" as our cluster nodes with respect to extra system processes running and stealing cycles.

So, now that I have once again identified the _only_ thing that introduces any variability into the search, how could we _possibly_ manipulate the timing to produce dependent results? You suggest an idea, and I will develop a test to see whether it is possible on our cluster. But first I need an idea of _how_...
It does not seem possible for timing jitter to systematically favor one engine over the other for an extended stretch of games. So it is not an acceptable explanation of the observed correlation. There thus must be another cause for this, implying also that other testers might not be plagued by it at all.

How exactly did you guard against one of the engines in your test playing stronger on even months, and weaker in odd months?
So in a set of 400 games from the same position, where the final result is dead even, I can't see 25 consecutive wins or losses? Or I can, but rather infrequently? You seem to say "impossible". Which is wrong, because it happened.
The probability for such an observation in a single 400-game run is 1 in 12 million (assuming a 20% draw rate). So if you had tried a few million 400-game runs, it would not be very remarkable if you observed that.

If it occurred in just a few hundred 400-game runs, you can be pretty sure something is going on that you don't understand, as the unlikelihood that this was a genuine chance event starts to rival the prior unlikelihood of the alternative hypotheses that you could at best guarantee, no matter how hard you tried. The likelihood of the alternatives would probably not yet exceed the probability of a statistical fluke by so much that you could safely bet your life on it, though.
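The 1-in-12-million figure can be checked with a rough expected-count estimate; a sketch under the same assumptions (equal opponents, 20% draws, 400 independent games):

# Expected number of runs of 25 consecutive wins (or 25 consecutive losses)
# by one side somewhere in 400 independent games, with P(win) = P(loss) = 0.4.
# A simple union-bound style estimate; good enough when the event is very rare.
n_games, run_len, p_decisive_one_side = 400, 25, 0.4

starts = n_games - run_len + 1                                # possible starting points
expected_runs = 2 * starts * p_decisive_one_side ** run_len   # x2: wins or losses
print(expected_runs)            # ~8.5e-08
print(1 / expected_runs)        # ~1.2e+07, i.e. roughly 1 chance in 12 million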
Rambling leads nowhere. How can a processor favor one opponent consistently over several games?
That is for you to figure out. They are your processors, after all. When doing a post-mortem to find cause of death, one usually cuts open the body to look inside. You still expect us to see it from only looking at the outside (= overall score). Not possible...
...

I doubt you could put together a more consistent test platform if you had unlimited funds.
Well, so maybe it is the engines that have a strength that varies on the time-scale of your runs. Who knows? This line of arguing is not productive. You should look at the data, to extract the nature of the correlation (time-wise, core-wise, engine-wise). Staring at the dead body and speculating about what could not have caused its death is no use.
Totally different animal there.
Wrong! Whether your data is noisy or not is never decided by the factory specs and manufacturer's guarantee of your equipment. The data itself is the ultimate proof. If two measurements of the same quantity produce different readings, the difference is by definition noise. What you show us here is low-quality, noisy data. If the test setup producing it promised you low-noise, good-quality data, ask for your money back.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

An important proposal (Re: more on engine testing)

Post by Sven »

Richard Allbert wrote:the engines used vs. Crafty need to play against each other in a round robin, 800 games each, and these results need to be included in the rating calculation.
Gentlemen,

the whole discussion thread might turn out to be close to obsolete if someone carefully examines this proposal of Richard's and possibly proves its correctness (or at least demonstrates it by example).

Personally I think Richard is right. But before commenting on his proposal in detail, I want to cross-check my understanding of the situation.

You have an engine A (here Crafty) in two versions A1 and A2, and k opponents B1 .. Bk. Bob (or any other engine tester, of course) lets A1 play against B1 .. Bk with N games for each match (say N=800), and then he calculates relative Elo ratings, using BayesElo, for all participating engines only from these k*N games (except for very few single games missing for other reasons, unimportant here). For A1 a relative rating R1 results.

He does the same with A2, with identical conditions and identical opponents, and gets a rating R2 for A2. (I assume these identical conditions and opponents to be given since questioning this is not my topic here, and I don't want to discuss it here.)

Then (where the real order of steps is unimportant, too) he also repeats both runs and gets ratings R1' and R2'. Now R1' differs unexpectedly strongly from R1, and R2' differs quite a bit from R2, too.

That's what I understood; I may be wrong, in which case please correct me if I missed some very important detail.

So what should we really expect here in theory? IMHO this is nothing but a gauntlet against "unrated" opponents, since the only information about the playing strength of B1 .. Bk is derived from games between A1/A2 and B1 .. Bk; they did not play each other. My personal feeling is that I could never derive stable ratings from such an event, since the rating calculation system "knows" nearly nothing about the relative strength of the opponents.

If I am unrated and play against two other unrated players X and Y, with 100 games in each match, and I get 70% against X and 50% against Y, then I can say that I am stronger than X by T Elo points with a certain level of confidence, and also that I am of about equal strength compared to Y with another level of confidence, and this can be combined into some relative rating. But can I derive that Y is also stronger than X by about T Elo points, without them even playing each other? What if I let X play Y 100 games, too, and Y wins by only 55%? I definitely think that the latter will give a much better estimate of the "true rating" (whatever this is) than the former.
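The T in this example can be worked out from the standard logistic Elo formula; a sketch that ignores draws and error margins and just converts score fractions into rating differences:

import math

def elo_diff(score_fraction):
    """Rating difference implied by an expected score, per the usual Elo model."""
    return 400 * math.log10(score_fraction / (1 - score_fraction))

print(round(elo_diff(0.70)))   # ~147 Elo: "me" vs X, from a 70% score
print(round(elo_diff(0.50)))   # 0 Elo: "me" vs Y, from a 50% score
print(round(elo_diff(0.55)))   # ~35 Elo: Y vs X measured directly at 55%
# Transitively (via "me"), Y would look ~147 Elo above X; the direct match says ~35.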

Now the goal of the tester is to decide which effect the changes made to A1 and resulting in version A2 did have. Therefore he wants to get good estimates of the "true rating" for both A1 and A2. But I think he can't expect to get such good estimates here.

Richard suggests, and so do I, including these opponent-vs.-opponent games, too, in the rating calculation. I have neither the CPU power nor the theoretical/mathematical/statistical background to prove or show that this will help, but I hope to know enough about the fundamentals of the Elo rating system to state that rating calculation from a round-robin tournament will yield more reliable results than rating calculation from an event where only one player meets all others.

I leave it up to the experts to judge whether BayesElo should return different (higher) error margins than those Bob has reported here, or whether the error margins are o.k. and it is simply a matter of interpretation.

Provided someone can confirm that Richard's proposal is useful, I would then suggest to Bob (and all others applying a similar method to test the effect of changes) to play these additional games once and reuse the same games for the rating calculation every time there is a new version of the engine to be tested, since I expect the Elo results of that opponents' round-robin tourney to be stable enough to act like a "fixed" rating of these opponents, against which different versions of one's own engine can be compared. The number of games between the opponents should of course be high enough to dominate the additional games they play against A1/A2, and thus not be influenced seriously by them.
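A rough sketch of how that could look in practice, driving bayeselo from a script. The file names are hypothetical, and the interactive commands used (readpgn, elo, mm, exactdist, ratings) are the commonly documented ones; adjust them to whatever your bayeselo build actually accepts.

# Feed BayesElo both the Crafty gauntlet games and a fixed, precomputed
# round-robin between the opponents, so opponent ratings are anchored by
# opponent-vs-opponent results rather than only by their games against Crafty.
import subprocess

pgn_files = ["crafty_gauntlet.pgn", "opponents_round_robin.pgn"]   # hypothetical names

commands = "".join(f"readpgn {name}\n" for name in pgn_files)
commands += "elo\nmm\nexactdist\nratings\n"

result = subprocess.run(["bayeselo"], input=commands,
                        text=True, capture_output=True)
print(result.stdout)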

Again: to be proven! Since this is a topic drawing my attention, I would be glad if someone had the time to analyze it in this direction - maybe someone with a cluster? :wink:

I would also expect that the number of games necessary to get "stable" results should turn out to be lower than 800, and definitely much lower than 25000.

Unfortunately I did not find any information about this particular problem on Remi Coulom's pages, maybe I did not investigate enough.

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:You are right that _any_ result is "possible", but something possible only with very low probability is practically something that you can consider impossible.

If you lose a fair game every time (and I mean something like 30-0 against you, not 5-0 or 10-0, and a game that is based only on luck and not on knowledge), then you can be practically sure that people are cheating you, even if you do not know how they cheat.
What about a game where I am favored yet I lose 30 in a row, and 58 out of 60? I play blackjack and use the Hi-Lo card-counting system, which gives me an advantage over the house of something like 1-1.5% long-term. Last summer, while in Vegas, I played a very good game I ran across, and got rolled up. I know professional card counters who have had losing streaks lasting over a year. So what is impossible to you is pretty regular to me.

It is not impossible that you are unlucky, but practically I do not believe in this possibility, especially when I know that you lost some data about the games (in the relevant case you do not have the PGN of more than 50,000 games, that is, more than 25,000 games for each of the versions).
I didn't lose anything. I just do not keep it due to the volume.

It may be possible to convince me that you were unlucky only if I see the full data and investigate it.

Uri
Somehow I doubt that is going to work, because sooner or later you are going to come to the conclusion "All of our testing appears to have something very unusual going on..." I know what causes the randomness in the games; I have clearly explained that in the past. And I have not heard of a single soul that has tried to repeat the experiment and then reported the results publicly. I somehow doubt that nobody tried the experiment. I can certainly imagine how some would like to bury their heads and yell "can't be, can't be" over and over, as opposed to accepting the alternative.

Believe what you want. But that doesn't make it so.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: more on engine testing

Post by Dirt »

bob wrote:Some positions appear to be balanced on a very fine edge, and a move here or there topples the game to one side or the other. But making it happen in any dependent way seems impossible.
It at least seems pretty unlikely to me too, but then again so do your inconsistent rating results. Anyway, I'm convinced that running the same position with the same opponents and relying on noise to give you different games is a bad idea. Especially when you've done everything you can to minimize the noise.

I suggest that you double the number of your starting positions, so that each game in your 800-game matches is unique. For larger matches, increase the initial time by one second for each sub-match. I think that should be enough randomness to stabilize your results.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:
hgm wrote:
bob wrote:..., otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha. opinions rather than facts now.
???

Seems a fact to me that you complained about getting impossible results. These were not my words.
Here is a simple experiment for you to run. Go back to my original post at the top of this thread. If you are using or can use Firefox, then click "Edit", then "Find", type "impossible" in the box, and tell me where that word appears anywhere in my post...

What is "impossible"? _any_ result is "possible". And people do win the lottery every week. Whether this is that kind of unusual event or not we will see as more tests roll out.
Impossible in the practical sense is a relative notion. In principle, it is possible that, when you step out of your office window on the fifth floor, the motion of the air molecules below you experiences a random fluctuation such that you remain floating in the air. Do not try it! You can be certain you will drop like a stone. If you see people floating in the air without any visible support, you can be certain that they have a way of supporting themselves, and that you merely do not see it. Likewise here.
And I have not made that kind of mistake. Again, _any_ result is possible; you are claiming otherwise, which shows a complete misunderstanding of this particular type of statistical result. There are no "correct" or "incorrect" results in this kind of testing, since any result, from all wins, to all losses, to any combination of the two, is possible.
Not in any useful and pragmatic sense. Some results are too unlikely to be acceptable. It all relates to Bayesian likelihood. There are alternative explanations for the observation. You can never exclude all of them with absolute certainty; some of them you might not even be aware of. One of your students might have wanted to pull a prank on you and substituted one of the engine executables or the result file. There might be ways the results were doctored that we cannot even imagine. One can never exclude everything to a level of one in a billion. And if an alternative explanation remains with a prior probability of more than one in a billion, it will outcompete the explanation that everything was as it should have been and the result is only due to a one-in-a-billion fluke, because the Bayesian likelihood of that null hypothesis being correct has just been reduced to the one-in-a-billion level by the occurrence of the result, while the smallness of some of the alternative hypotheses relies solely on the smallness of their prior probability.
That's the flaw in this kind of thinking. "the results must match the theory because the results must match the theory".
Learn about probability theory, and you might be able to grasp it. It is not that difficult. Every serious scientist in the World uses it.
I've been using it for years in other things, including my aforementioned blackjack card-counting exploits. And these streaks are not unusual at all. How many consecutive blackjacks do you believe would be "impossible" in real play? I've gotten seven in a row. I can explain how/why quite easily; I'll bet you will say that is "impossible". I have lost 31 consecutive hands in a game where I have a slight advantage. "Impossible," you say. And I have not played millions of hands, either. I sat at a table at the Tropicana three years ago and played for one hour with three witnesses looking on; I played 85 hands, and excluding three pushes that were spaced out, I lost every last one. And not intentionally, as it was _my_ money.

I gave you the data, as it happened. I will provide more over time. I still have not seen you give one possible way where I can run each game independently and configure things so that the games are somehow dependent.
And none will be forthcoming as long as you don't enable me to do a post-mortem on your data set. Do you really expect that you can tell people just a single large number, and then ask them to explain how you got that number? That does not seem to indicate much realism on your part.
Some of us can do that, yes. I have had to chase plenty of operating system bugs over my career that were presented in exactly that way. Quite often one cannot produce a large log, any more than we can for SMP searches. Sometimes you are left with sitting down, asking yourself "How can this possibly happen?", and then, looking at the code and data structures, listing possible ways the symptoms could occur. Then, by carefully analysing the code, that list is pared down, hopefully until there is just one entry left that can't be excluded. Same approach for parallel search bugs. It is very uncommon to have enough information to find such bugs outright. Perhaps you don't have experience in either of those areas and therefore have not had to debug something that is very difficult to pin down. But some of us do it regularly. And yes, it _is_ possible.

I didn't ask you to tell me "how I ended up with that number..." I clearly asked, "Can you give me any possible (and by possible I do mean within the context of the hardware, operating system, and configuration I have previously clearly laid out) way I could cause dependencies between games?" That has nothing to do with my specific data. It is simply a question that leads to the conclusion that this does not seem possible without doing something so egregious that it would be both intentional and grossly obvious.
I could show you a single position where the four games (same two opponents) produce results from +4 to -4, but there seems to be no point; after all, how could the two programs vary that much in a single position, right?
Are remarks like these meant to illustrate how mathematically inept you are, or what? Between approximately equal opponents the probability of a 0-4 or a 4-0 result is ~2.5% (assuming a 20% draw rate), and your ability to show them relies on selecting them from a set of tries much larger than 40. This remark displays the same lack of understanding as the one about winning the lottery. Of course people win the lottery: the product of the probability that a given individual wins and the number of participants is of order one.
One day, if you ever get a chance to run some big tests, you might end up re-thinking your rather rigid ideas about how things ought to be, as opposed to how they really are...
Fat chance, captain... For I have been an experimental physicist all my life, and I know how to collect data, and how to recognize collected data as crap when there was a fault in the data-collection equipment. When things really are as they should not be, it shows that you goofed. Mathematics is indeed a rigid endeavor, and results that violate mathematical bounds are more suspect than any others.

But the main difference between us is that if I had faulty equipment, I would analyze the erroneous data in order to trace the error and repair the equipment. Your abilities seem to stop at trying it again in the same way, in the hope it will work this time...
Or one can continually shout "can't be, can't be..." and hope that shouting it enough will make it so. But since you want to take the physics road: the last time I saw some odd result reported, rather than everyone shouting "can't be, can't be" and stamping their feet, they attempted to repeat the result. So perhaps you are not quite the physicist you think you are. Have you tried any duplicated tests to see what kind of variability _you_ get in such testing? Didn't think so.

bob wrote: That is exactly my point. When there is absolutely no difference between runs a, b, c, d, ... etc. except for the timing randomness I have already identified, then that is the cause of the randomness.
I was not asking for the cause of the randomness, but for the cause of the correlation between results of games in the same run that your data contains.
And I asked you to postulate a single reasonable explanation of how I could finagle things on the cluster to make that happen, without knowing which program I am supposed to be favoring to somehow make dependent results.
I have already verified that and explained how. Searching to fixed node counts produces the same results each and every time, to the exact number of wins, losses and draws. Now the question would be, how could this test somehow bias the time allocation to favor one program more than another? I have played matches of 1+1, 2+2, 3+3, 4+4 and 5+5 to verify that the results are not that different. So how do I bias time to make the results favor one over another in a correlated way, when they share the _same_ processor, same memory, etc.

Time is an issue. Time is the only issue. And _everybody_ has exactly the _same_ timing issues I have, only more so, as most machines are not nearly as "bare-bones" as our cluster nodes with respect to extra system processes running and stealing cycles.

So, now that I have once again identified the _only_ thing that introduces any variability into the search, how could we _possibly_ manipulate the timing to produce dependent results? You suggest an idea, and I will develop a test to see whether it is possible on our cluster. But first I need an idea of _how_...
It does not seem possible for timing jitter to systematically favor one engine over the other for an extended stretch of games. So it is not an acceptable explanation of the observed correlation. There thus must be another cause for this, implying also that other testers might not be plagued by it at all.

"might not be plagued" is the crux of the issue. Anyone can attempt to verify my results by just playing matches. Then we would know. I know that I have gotten the same results on two different and not connected systems. I have gotten the same results using xboard and my referee. The only common elements are:

(1) 40 starting positions used over and over
(2) same 6 program executables, used over and over
(3) same operating system kernel and configuration used over and over

Everything else has varied, from hardware to referee program. The programs can't do anything to make results dependent, because after each game all files are removed, and then the next game is played with no way to communicate anything, other than perhaps uninitialized hash, which none of the programs I am using rely on. The operating system could somehow "decide" that if A beats B in game 1, then the next time they play it is going to bias the game for either A or B to make the second result dependent on the first. That is a bit of a stretch; if it did that randomly, it would not produce a dependent result. I can't come up with any even remotely plausible way this could happen. I write that off more quickly than you write off the randomness of the results.


How exactly did you guard against one of the engines in your test playing stronger on even months, and weaker in odd months?
Why would I? These matches last under a day for 25,000 games. Of course the engines could play stronger on even hours and weaker on odd hours, but if so, _you_ have exactly the same problem.
So in a set of 400 games from the same position, where the final result is dead even, I can't see 25 consecutive wins or losses? Or I can, but rather infrequently? You seem to say "impossible". Which is wrong, because it happened.
The probability for such an observation in a single 400-game run is 1 in 12 million (assuming a 20% draw rate). So if you had tried a few million 400-game runs, it would not be very remarkable if you observed that.

If it occurred in just a few hundred 400-game runs, you can be pretty sure something is going on that you don't understand, as the unlikelihood that this was a genuine chance event starts to rival the prior unlikelihood of the alternative hypotheses that you could at best guarantee, no matter how hard you tried. The likelihood of the alternatives would probably not yet exceed the probability of a statistical fluke by so much that you could safely bet your life on it, though.
Rambling leads nowhere. How can a processor favor one opponent consistently over several games?
That is for you to figure out. They are your processors, after all. When doing a post-mortem to find cause of death, one usually cuts open the body to look inside. You still expect us to see it from only looking at the outside (= overall score). Not possible...
...

I doubt you could put together a more consistent test platform if you had unlimited funds.
Well, so maybe it is the engines that have a strength that varies on the time-scale of your runs. Who knows? This line of arguing is not productive. You should look at the data, to extract the nature of the correlation (time-wise, core-wise, engine-wise). Staring at the dead body and speculating about what could not have caused its death is no use.
And now you may well have hit on the crux of the problem. Do you test against Fruit? Or Glaurung 1 or 2? I use both. The rating lists use both. If that is a problem in my testing, it is a problem in _all_ testing. Do you get that important point?

Totally different animal there.
Wrong! Whether your data is noisy or not is never decided by the factory specs and manufacturer's guarantee of your equipment. The data itself is the ultimate proof. If two measurements of the same quantity produce different readings, the difference is by definition noise. What you show us here is low-quality, noisy data. If the test setup producing it promised you low-noise, good-quality data, ask for your money back.

Would you like to see the measured clock frequency for all 260 CPUs on olympus and 560 cores on Ferrum? Want to see valgrind output for cache statistics on each? We've run that kind of stuff. The nodes run a typical cluster system called "Rocks" that simply blasts the O/S image/files to each node on a cold start, so that we do not have to take a week to upgrade each node independently. They actually _are_ identical in every way that is important, but even that is irrelevant, as using different machines should not cause problems; otherwise most rating lists would be worthless, since they do this all the time.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: An important proposal (Re: more on engine testing)

Post by Sven »

Sven Schüle wrote:play against B1 .. Bk with N games for each match (say N=800)
[...]
I would also expect that the number of games necessary to get "stable" results should turn out to be lower than 800
"Edit": After reading again I realized that Bob has actually N=160. That does not make a big difference for me, though, I mention it just for completeness ...

Sven