more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:
hgm wrote:
bob wrote:..., otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha. opinions rather than facts now.
???

Seems a fact to me that you complained about getting impossible results. These were not my words.
Here is a simple experiment for you to run. Go back to my original post at the top of this thread. If you are using or can use firefox, then click "edit", then "find", type "impossible" in the box and tell me where that appears in my post anywhere...

What is "impossible"? _any_ result is "possible". And people do win the lottery every week. Whether this is that kind of unusual event or not we will see as more tests roll out.
Impossible in the practical sense is a relative notion. In principle, it is possible that, when you step out of your office window on the fifth floor, the motion of the air molecules below you undergoes a random fluctuation such that you remain floating in the air. Do not try it! You can be certain to drop like a stone. If you see people floating in the air, without any support, you can be certain that they have a way of supporting themselves, and that you merely do not see it. Likewise here.
And I have not made that kind of mistake. Again, _any_ result is possible; you are claiming otherwise, which shows a complete misunderstanding of this particular type of statistical result. There are no "correct" or "incorrect" results in this kind of testing, since any result, from all wins to all losses to any combination of the two, is possible.
Not in any useful and pragmatic sense. Some results are too unlikely to be acceptable. It all relates to Bayesian likelihood. There are alternative explanations for the observation. You can never exclude all of them with absolute certainty; some of them you might not even be aware of. One of your students might want to pull a prank on you and have substituted one of the engine executables or the result file. There might be ways to doctor the results we cannot even imagine. One can never exclude everything to a level of one in a billion. And if an explanation remains with a prior probability of more than one in a billion, it will outcompete the explanation that everything was as it should have been and the result is only due to a one-in-a-billion fluke, because the Bayesian likelihood of that null hypothesis being correct has just been reduced to the one-in-a-billion level by the occurrence of the result, while the unlikelihood of the alternative hypotheses rests solely on the smallness of their prior probability.
That's the flaw in this kind of thinking. "the results must match the theory because the results must match the theory".
Learn about probability theory, and you might be able to grasp it. It is not that difficult. Every serious scientist in the world uses it.
I have been using it for years in other areas, including the blackjack card-counting exploits I mentioned. And these streaks are not unusual at all. How many consecutive blackjacks do you believe would be "impossible" in real play? I've gotten seven in a row. I can explain how and why quite easily. I'll bet you will say that is "impossible". I have lost 31 consecutive hands in a game where I have a slight advantage. "Impossible," you say. And I have not played millions of hands either. I sat at a table at the Tropicana three years ago and played for one hour with three witnesses looking on; I played 85 hands and, excluding three pushes that were spaced out, I lost every last one. And not intentionally, as it was _my_ money.

I gave you the data, as it happened. I will provide more over time. I still have not seen you give one plausible way in which I could run each game independently and yet have things configured so that the games are somehow dependent.
And none will be forthcoming as long as you don't enable me to do a post-mortem on your data set. Do you really expect that you can tell people just a single large number, and then ask them to explain how you got that number? That does not seem to indicate much realism on your part.
Some of us can do that, yes. I have had to chase plenty of operating system bugs over my career that were presented in exactly that way. Quite often one cannot produce a large log, any more than we can do that for SMP searches. Sometimes you are left to sit down and ask yourself "how can this possibly happen?", and then, looking at the code and data structures, start listing possible ways the symptoms could occur. Then, by carefully analysing the code, that list is pared down, hopefully until there is just one entry left that can't be excluded. Same approach for parallel search bugs. It is very uncommon to have enough information to find such bugs outright. Perhaps you don't have experience in either of those areas and therefore have not had to debug what is very difficult to pin down. But some of us do it regularly. And yes, it _is_ possible.

I didn't ask you to tell me "how I ended up with that number..." I clearly asked whether you can give me any possible way (and by possible, I mean within the context of the hardware, operating system and configuration I have previously laid out) that I could cause dependencies between games. That has nothing to do with my specific data. It is simply a question that leads to the conclusion that this does not seem possible without doing something so egregious that it would be both intentional and grossly obvious.
I could show you a single position where the four games (same two opponents) produce everything from +4 to -4, but there seems to be no point; after all, how could the two programs vary that much in a single position, right?
Are remarks like these meant to illustrate how mathematically inept you are, or what? Between approximately equal opponents the probability of a 0-4 or a 4-0 result is ~2.5% (assuming a 20% draw rate), and your ability to show them relies on selecting them from a set of tries much larger than 40. This remark displays the same lack of understanding as the one about winning the lottery. Of course people win the lottery: the product of the probability that a given individual wins and the number of participants is of order one.
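As a quick back-of-envelope check of where a figure of that size comes from (a sketch under the same assumptions: approximately equal opponents and a 20% draw rate, so each decisive outcome has probability 0.4 per game), a specific 4-0 sweep has probability 0.4^4, and a sweep in either direction roughly twice that:

[code]
# Probability of a 4-0 sweep in a 4-game mini-match between roughly equal
# opponents, assuming a 20% draw rate (so P(win) = P(loss) = 0.4 per game).
p_win = 0.4
p_sweep_one_way = p_win ** 4           # a specific sweep, e.g. exactly 4-0
p_sweep_either = 2 * p_sweep_one_way   # 4-0 or 0-4
print(f"4-0:        {p_sweep_one_way:.2%}")
print(f"4-0 or 0-4: {p_sweep_either:.2%}")
[/code]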
One day, if you ever get a chance to run some big tests, you might end up re-thinking your rather rigid ideas about how things ought to be, as opposed to how they really are...
Fat chance, captain... For I have been an experimental physicist all my life, and I know how to collect data, and how to recognize collected data as crap when there was a fault in the data-collection equipment. When things really are as they should not be, it shows that you goofed. Mathematics is indeed a rigid endeavor, and results that violate mathematical bounds are more suspect than any others.

But the main difference between us is that if I had faulty equipment, I would analyze the erroneous data in order to trace the error and repair the equipment. Your abilities seem to stop at trying it again in the same way, in the hope that it will work now...
Or one can continually shout "can't be, can't be..." and hope that shouting it enough will make it so. But since you want to take the physics road, the last time I saw some odd result reported, rather than everyone shouting "can't be, can't be" and stamping their feet, they attempted to repeat the result. So perhaps you are not quite the physicist you think you are. Have you tried any duplicated tests to see what kind of variability _you_ get in such testing? Didn't think so.

bob wrote: That is exactly my point. When there is absolutely no difference between runs a, b, c, d, etc. except for the timing randomness I have already identified, then that is the cause of the randomness.
I was not asking for the cause of the randomness, but for the cause of the correlation between results of games in the same run that your data contains.
And I asked you to postulate a single reasonable explanation of how I could finagle things on the cluster to make that happen, without knowing which program I am supposed to be favoring to somehow make dependent results.
I have already verified that and explained how. Searching to fixed node counts produces the same results each and every time, to the exact number of wins, losses and draws. Now the question would be, how could this test somehow bias the time allocation to favor one program more than another? I have played matches of 1+1, 2+2, 3+3, 4+4 and 5+5 to verify that the results are not that different. So how do I bias time to make the results favor one over another in a correlated way, when they share the _same_ processor, same memory, etc.

Time is an issue. Time is the only issue. And _everybody_ has exactly the _same_ timing issues I have, only more so, as most machines are not nearly as "bare-bones" as our cluster nodes with respect to extra system processes running and stealing cycles.

So, now that I have once again identified the _only_ thing that introduces any variability into the search, how could we _possibly_ manipulate the timing to produce dependent results? You suggest an idea, and I will develop a test to see whether it is possible on our cluster. But first I need an idea of _how_...
It does not seem possible for timing jitter to systematically favor one engine over the other for an extended stretch of games. So it is not an acceptable explanation of the observed correlation. There thus must be another cause for this, implying also that other testers might not be plagued by this at all.

"might not be plagued" is the crux of the issue. Anyone can attempt to verify my results by just playing matches. Then we would know. I know that I have gotten the same results on two different and not connected systems. I have gotten the same results using xboard and my referee. The only common elements are:

(1) 40 starting positions used over and over
(2) same 6 program executables, used over and over
(3) same operating system kernel and configuration used over and over

Everything else has varied, from the hardware to the referee program. The programs can't do anything to make results dependent, because after each game all files are removed and the next game is played with no way to communicate anything, other than perhaps uninitialized hash, a trap none of the programs I am using falls into. The operating system could somehow "decide" that if A beats B in game 1, then the next time they play it will bias the game toward either A or B to make the second result dependent on the first. That is a bit of a stretch. And if it did that randomly, the results would not be dependent. I can't come up with any even remotely plausible way this could happen. I write that off quicker than you write off the randomness of the results.


How exactly did you guard against one of the engines in your test playing stronger on even months, and weaker in odd months?
Why would I? These matches last under a day for 25,000 games. Of course they could play stronger on even hours and weaker on odd hours, and if so, _you_ have exactly the same problem.
So in a set of 400 games from the same position, where the final result is dead even, I can't see 25 consecutive wins or losses? Or I can, but rather infrequently? You seem to say "impossible". Which is wrong, because it happened.
The probability for such an observation in a single 400-game run is 1 in 12 million (assuming a 20% draw rate). So if you have tried a few million 400-game runs, it would not be very remarkable if you observed that.

If it occurred in just a few hundred 400-game runs, you can be pretty sure something is going on that you don't understand, as the unlikelihood that this would be a genuine chance event starts to rival the smallest prior unlikelihood you could guarantee for the alternative hypotheses, no matter how hard you tried. The likelihood of the alternatives would probably not yet exceed the probability of a statistical fluke by so much that you could safely bet your life on it, though.
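A back-of-envelope check of the "1 in 12 million" figure, under the same assumptions (equal opponents, 20% draw rate, independent games), is a union bound over the possible starting points of the streak; the sketch below reproduces the order of magnitude:

[code]
# Rough estimate of the probability that a 400-game match between equal
# opponents (20% draw rate, so P(win) = P(loss) = 0.4) contains a run of 25
# consecutive wins or 25 consecutive losses somewhere.  The union bound
# slightly overcounts overlapping runs, but at these probabilities the
# error is negligible.
p_win, run_len, n_games = 0.4, 25, 400
starts = n_games - run_len + 1            # possible starting points of a run
p_streak = 2 * starts * p_win ** run_len  # a run of wins or a run of losses
print(f"~{p_streak:.1e}, i.e. about 1 in {1 / p_streak:,.0f}")
[/code]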
Rambling leads nowhere. How can a processor favor one opponent consistently over several games?
That is for you to figure out. They are your processors, after all. When doing a post-mortem to find the cause of death, one usually cuts open the body to look inside. You still expect us to see it from only looking at the outside (= the overall score). Not possible...
...

I doubt you could put together a more consistent test platform if you had unlimited funds.
Well, so maybe it is the engines that have a strength that varies on the time-scale of your runs. Who knows? This line of arguing is not productive. You should look at the data, to extract the nature of the correlation (time-wise, core-wise, engine-wise). Staring at the dead body and speculating about what could not have caused its death is of no use.
And now you may well hit on the crux of the problem. Do you test against fruit? Or glaurung 1 or 2? I use both. The rating lists use both. If that is a problem on my testing, it is a problem on _all_ testing. Do you get that important point?

bob wrote: Totally different animal there.
Wrong! Whether your data is noisy or not is never decided by the factory specs and the manufacturer's guarantee of your equipment. The data itself is the ultimate proof. If two measurements of the same quantity produce a different reading, the difference is by definition noise. What you show us here is low-quality, noisy data. If the test setup producing it promised you low-noise, good-quality data, ask for your money back.

Would you like to see the measured clock frequency for all 260 CPUs on Olympus and 560 cores on Ferrum? Want to see valgrind output with cache statistics for each? We've run that kind of stuff. The nodes run a typical cluster system called "Rocks" that simply blasts the O/S image/files to each node on a cold start, so that we do not have to take a week to upgrade each node independently. They actually _are_ identical in every way that is important, but even that is irrelevant, as using different machines should not cause problems; otherwise most rating lists would be worthless, since they do this all the time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An important proposal (Re: more on engine testing)

Post by bob »

Sven Schüle wrote:
Richard Allbert wrote: The engines used vs. crafty need to play each other in a round robin, 800 games each, and these results need to be included in the rating calculation.
Gentlemen,

the whole discussion thread might turn out to be close to obsolete if someone carefully examines this proposal of Richard's and possibly proves its correctness (or at least shows it by example).

Personally I think Richard is right. But before commenting on his proposal in detail, I want to cross-check the situation.

You have an engine A (here Crafty) in two versions A1 and A2, and k opponents B1 .. Bk. Bob (or any other engine tester, of course) lets A1 play against B1 .. Bk with N games for each match (say N=800), and then he calculates relative Elo ratings, using BayesElo, for all participating engines only from these k*N games (except for very few single games missing for other reasons, unimportant here). For A1 a relative rating R1 results.

He does the same with A2, with identical conditions and identical opponents, and gets a rating R2 for A2. (I assume these identical conditions and opponents to be given since questioning this is not my topic here, and I don't want to discuss it here.)

Then (where the real order of steps is unimportant, too) he also repeats both runs and gets ratings R1' and R2'. Now R1' differs unexpectedly strongly from R1, and R2' differs quite a lot from R2, too.

That's what I understood, I may be wrong, in this case please correct me if I missed some very important detail.

So what should we really expect here in theory? IMHO this is nothing but a gauntlet against "unrated" opponents, since the only information about the playing strength of B1 .. Bk is derived from games between A1/A2 and B1 .. Bk; they did not play each other. My personal feeling is that I could never derive any stable ratings from such an event, since the rating calculation system "knows" nearly nothing about the relative strength of the opponents.

If I am unrated and play against two other unrated players X and Y, with 100 games each match, and I get 70% against X and 50% against Y, then I can say that I am stronger than X by T Elo points with a certain level of confidence, and also that I am of about equal strength compared to Y with another level of confidence, and this can be combined into some relative rating. But can I derive that Y is also stronger than X by about T Elo points, without even playing him? What if I let X play Y 100 games, too, and Y wins by 55% only? I definitely think that the latter will give a much better estimate of the "true rating" (whatever this is) than the former.
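For concreteness, the Elo difference T implied by a match score can be read off from the usual logistic rating model; a minimal sketch (my own illustration, not taken from BayesElo), using the standard inversion elo = -400 * log10(1/s - 1) for a score fraction s with draws counted as half points:

[code]
import math

def elo_diff(score):
    """Elo advantage implied by a score fraction (draws count as half a
    point), using the logistic model E = 1 / (1 + 10**(-diff / 400))."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# 70% against X, 50% against Y, and Y's hypothetical 55% against X:
for s in (0.70, 0.55, 0.50):
    print(f"score {s:.0%}  ->  about {elo_diff(s):+.0f} Elo")
[/code]

A 70% score corresponds to roughly +150 Elo and 55% to roughly +35, which is why the indirect estimate of Y vs. X can differ so much from the direct one.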

Now the goal of the tester is to determine what effect the changes made to A1, resulting in version A2, actually had. Therefore he wants to get good estimates of the "true rating" for both A1 and A2. But I think he can't expect to get such good estimates here.

Richard suggests, and so do I, including these opponent-vs.-opponent games, too, in the rating calculation. I have neither the CPU power nor the theoretical/mathematical/statistical background to prove or show that this will help, but I hope to know enough about the fundamentals of the Elo rating system to state that rating calculation from a round robin tournament will yield more reliable results than rating calculation from an event where only one player meets all others.

I leave it up to the experts to judge whether BayesElo should return different (higher) error margins than those Bob has reported here, or whether the error margins are o.k. and it is simply a matter of interpretation.

Provided someone can confirm that Richard's proposal is useful, I would then suggest to Bob (and all others applying a similar method to test the effect of changes) to play these additional games once and reuse them for the rating calculation every time there is a new version of the engine to be tested, since I expect the Elo results of that opponents' round robin tourney to be stable enough to act like a "fixed" rating of these opponents against which different own engine versions can be compared. The number of games between the opponents should of course be high enough to dominate the additional games that they play against A1/A2, and thus not be influenced seriously.

Again: to be proven! Since this is a topic drawing my attention, I would be glad if someone had the time to analyze it in this direction - maybe someone with a cluster? :wink:

I would also expect that the number of games necessary to get "stable" results should turn out to be lower than 800, and definitely much lower than 25000.

Unfortunately I did not find any information about this particular problem on Remi Coulom's pages, maybe I did not investigate enough.

Sven
There are two distinct issues:

(1) what is the probability that A will beat B? If A wins 75 of every 100 games, then the probability is 3/4. Ditto for A vs C and so forth.

(2) What is the numeric strength of A compared to all the others, i.e. a correct Elo rating in the context of chess? Here, the more games you get between _all_ opponents, the more accurate that final number will be.

But in case (1), the relative difference between the ratings should end up being approximately the same as the relative difference between the same two opponents when using (2). The actual ratings might not be the same, but the difference should be, if the Elo system is an accurate predictor. Notice that BayesElo is not producing ratings like 2700, and such, but is assuming a rating of zero and then computing a relative difference between two opponents based on the results obtained.

Whether this is a good approach or not is open for debate. However, it is the approach _everyone_ is using. Except nobody is producing the volume of games I am producing; most are happy with 20 or fewer per opponent. The Rybka group claims to run about 80K games between two versions to see if the new one is better; whether that is enough, and whether the results are more random than expected, is unknown.

I still believe that the basic problem is that the game of chess, as played by computers, is far more random than suspected. Otherwise, how could two opponents play 1000 games and have one win 100 more than the other at a specific node count, and then, using node count +1000, the other wins 100 more? That is a tiny change in programs searching 2-3M nodes per second for a few seconds per move. Yet it makes a significant difference. Unexpected.
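To put a number on how unexpected such a swing is, here is a minimal sketch (my own estimate, assuming roughly equal programs, a 20% draw rate, and independent games) of the normal statistical spread of the win-loss margin in a 1000-game match:

[code]
import math

# Score a win as +1, a loss as -1, a draw as 0.  For equal opponents the
# per-game variance of this margin is P(win) + P(loss) = 1 - draw_rate,
# so the margin of an n-game match has standard deviation sqrt(n * (1 - d)).
n_games, draw_rate = 1000, 0.20
sd_margin = math.sqrt(n_games * (1.0 - draw_rate))
print(f"1 sigma of the margin: ~{sd_margin:.0f} games")
print(f"a 100-game margin:     ~{100 / sd_margin:.1f} sigma")
# Going from +100 for one side to +100 for the other is a 200-game swing;
# for two independent runs its standard deviation is sqrt(2) times larger.
print(f"a 200-game swing:      ~{200 / (sd_margin * math.sqrt(2)):.1f} sigma")
[/code]

Under those assumptions the margin's sigma is about 28 games, so a 100-game edge is roughly 3.5 sigma and the full reversal about 5 sigma, which is why a 1000-node change producing it looks so surprising.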
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: An important proposal (Re: more on engine testing)

Post by Sven »

bob wrote:There are two distinct issues:

(1) what is the probability that A will beat B? If A wins 75 of every 100 games, then the probability is 3/4. Ditto for A vs C and so forth.

(2) What is the numeric strength of A compared to all the others, i.e. a correct Elo rating in the context of chess? Here, the more games you get between _all_ opponents, the more accurate that final number will be.

But in case (1), the relative difference between the ratings should end up being approximately the same as the relative difference between the same two opponents when using (2). The actual ratings might not be the same, but the difference should be, if the Elo system is an accurate predictor. Notice that BayesElo is not producing ratings like 2700, and such, but is assuming a rating of zero and then computing a relative difference between two opponents based on the results obtained.

Whether this is a good approach or not is open for debate. However, it is the approach _everyone_ is using. [...]
Hi Bob,

I agree with most of it, but I don't agree with your statement that we are talking about your case (1). When testing the effect of certain changes, you want to know whether they improve the engine's playing strength, and this is not possible by only checking results against one opponent (as you know). You see too much bias in single matches against single opponents, but I propose always looking at the whole set of results, and not expecting stability of results in single matches of A against B. Therefore, you need to have good quality from the case (2) viewpoint, I think.

The Elo system IMHO does not primarily give you an estimate of the outcome of A against B; it is designed to give an estimate of the outcome of A against a variety of opponents.

Sven
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An important proposal (Re: more on engine testing)

Post by Uri Blass »

bob wrote: I still believe that the basic problem is that the game of chess, as played by computers, is far more random than suspected. Otherwise, how could two opponents play 1000 games and have one win 100 more than the other at a specific node count, and then, using node count +1000, the other wins 100 more? That is a tiny change in programs searching 2-3M nodes per second for a few seconds per move. Yet it makes a significant difference. Unexpected.

I believe that the difference that you mention simply does not happen
unless you have some serious bug.


I would like to see a case where A beats B 550-450 at 10,000,000 nodes per move while B beats A 550-450 at 10,001,000 nodes per move, and where the results can be repeated to get exactly the same games (so the games really depend only on the number of nodes and not on factors like time).

You can intentionally create a bug that causes this behaviour, by telling the program to resign at move 90 if it is asked to search 10,001,000 nodes, but I believe that something like that practically does not happen.

Even if you toss a fair coin 1000 times, you will only very rarely get a result as lopsided as 550-450, and here I expect correlation between results of games from the same position to reduce the variance of the result.

If you give the exact versions and the programs with source code people can try it.

Note that the fact that I do not respond to parts of your posts does not mean that I agree; I simply have no time to respond to every part I disagree with, because in that case I would not expect the discussion ever to be finished.

Uri
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: An important proposal (Re: more on engine testing)

Post by Sven »

bob wrote:But in case (1), the relative difference between the ratings should end up being approximately the same as the relative difference between the same two opponents when using (2).
I have to add that this may be the key point of misunderstanding. If this were the case then we would have some sort of transitivity in chess. Again to my example:

Scenario 1:
A gets 70% against B in a 100 games match.
A gets 50% against C in a 100 games match.
No other games.
Calculate relative ratings for A, B, C (e.g. with BayesElo).

Scenario 2:
same as Scenario 1, but in addition, C gets 55% against B in a 100 games match.

Did I get your point correctly in that you assume the rating differences of A vs. B and A vs. C should be (nearly) the same in both scenarios?

I have not checked with BayesElo (I simply have not installed it, shame on me :-( ) but I'm quite sure about what the result will be. Maybe it gets even more obvious with a different outcome of C against B, say C loses against B with 40%.

By treating an engine against which you get 70% of the points as T points weaker than your engine based only on this single result, you may miss information which you could get from other games of that opponent, and that may lead to a change in the rating difference.
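One way to see this effect without installing BayesElo is to fit ratings by maximum likelihood under the plain logistic Elo model. The sketch below is a simplified stand-in, not BayesElo itself: it ignores draws and priors and simply treats each 100-game match as 100 trials at the observed score fraction, with A anchored at 0:

[code]
import math

def expected(ri, rj):
    """Expected score of i against j under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-(ri - rj) / 400.0))

def fit(matches, n_iter=20000, lr=5.0):
    """Crude maximum-likelihood rating fit by gradient ascent.
    matches: list of (player_i, player_j, score_fraction_of_i, n_games).
    Player "A" is anchored at rating 0."""
    players = sorted({p for m in matches for p in m[:2]})
    r = {p: 0.0 for p in players}
    for _ in range(n_iter):
        grad = {p: 0.0 for p in players}
        for i, j, s, n in matches:
            g = n * (s - expected(r[i], r[j])) * math.log(10) / 400.0
            grad[i] += g
            grad[j] -= g
        for p in players:
            if p != "A":
                r[p] += lr * grad[p]
    return r

scenario1 = [("A", "B", 0.70, 100), ("A", "C", 0.50, 100)]
scenario2 = scenario1 + [("C", "B", 0.55, 100)]   # add the C vs. B match
for name, sc in (("scenario 1", scenario1), ("scenario 2", scenario2)):
    print(name, {p: round(v) for p, v in fit(sc).items()})
[/code]

Adding the C-vs-B games pulls the fitted ratings of B and C toward each other and changes both the A-B and the A-C gaps, which is exactly the non-transitivity described above.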

Sven
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An important proposal (Re: more on engine testing)

Post by bob »

Sven Schüle wrote:
bob wrote:But in case (1), the relative difference between the ratings should end up being approximately the same as the relative difference between the same two opponents when using (2).
I have to add that this may be the key point of misunderstanding. If this were the case then we would have some sort of transitivity in chess. Again to my example:

Scenario 1:
A gets 70% against B in a 100 games match.
A gets 50% against C in a 100 games match.
No other games.
Calculate relative ratings for A, B, C (e.g. with BayesElo).

Scenario 2:
same as Scenario 1, but in addition, C gets 55% against B in a 100 games match.

Did I get your point correctly in that you assume the rating differences of A vs. B and A vs. C should be (nearly) the same in both scenarios?

I have not checked with BayesElo (I simply have not installed it, shame on me :-( ) but I'm quite sure about what the result will be. Maybe it gets even more obvious with a different outcome of C against B, say C loses against B with 40%.

By treating an engine against which you get 70% of the points as T points weaker than your engine based only on this single result, you may miss information which you could get from other games of that opponent, and that may lead to a change in the rating difference.

Sven
There are certainly lots of questions that can be asked. For example, if you look at the results, the broken crafty (22.2) blows out arasan 10, yet in real games this is not the case. But in the silver positions it is a huge margin of victory. why? good question. Against fruit and glaurung 1 the broken version does very well. It doesn't win overall but it is close. Is that representative of what happens in real games? hard to say. The list goes on and on. We know books make a difference. We know opening choices make a difference. Yet this kind of testing is widely accepted to avoid the book issues, and it introduces a different type of issue.

Ideally we would play games using tournament setups, including books, pondering, SMP, etc. But each of those introduces so much additional variability that the number of games required becomes intractable.

BTW, one of the most difficult tasks this testing produces is looking at individual results as well as the overall totals, and then trying to figure out what is causing an imbalance in certain positions. Any position that results in 4 losses and no wins is problematic. And one that results in two losses with black and two draws with black is not a lot better. The quantity of data that is produced is enormous, and for this many games nobody is going to go through them with any degree of detail, as even 1 minute per game (impossibly fast) turns into weeks... if not months.
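Going through the PGN by hand at that volume is hopeless, but flagging the worst cases can be scripted. Below is a rough sketch of the idea (not the actual test harness); it assumes a PGN file, here hypothetically named match.pgn, in which every game carries a [FEN "..."] tag identifying its start position plus the usual White/Black/Result tags, and it prints the positions where the engine under test (assumed to have "Crafty" in its PGN name) scored zero points over four or more games:

[code]
import re
from collections import defaultdict

ENGINE = "Crafty"                    # assumed substring of the engine's name
TAG = re.compile(r'\[(\w+)\s+"([^"]*)"\]')

def read_headers(path):
    """Yield one dict of PGN header tags per game (movetext is skipped)."""
    headers = {}
    with open(path) as f:
        for line in f:
            m = TAG.match(line.strip())
            if m:
                headers[m.group(1)] = m.group(2)
            elif headers and line.strip() and not line.startswith("["):
                yield headers        # first movetext line ends the headers
                headers = {}
    if headers:
        yield headers

scores = defaultdict(list)           # start position -> engine's game scores
for h in read_headers("match.pgn"):  # hypothetical file name
    pos = h.get("FEN", h.get("Round", "?"))
    res = h.get("Result", "*")
    if ENGINE in h.get("White", ""):
        s = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}.get(res)
    elif ENGINE in h.get("Black", ""):
        s = {"1-0": 0.0, "0-1": 1.0, "1/2-1/2": 0.5}.get(res)
    else:
        s = None
    if s is not None:
        scores[pos].append(s)

for pos, ss in scores.items():
    if len(ss) >= 4 and sum(ss) == 0:
        print(f"all losses ({len(ss)} games) from: {pos}")
[/code]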
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:
hgm wrote:
bob wrote:
hgm wrote:
bob wrote:..., otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha. opinions rather than facts now.
???

Seems a fact to me that you complained about getting impossible results. These were not my words.
Here is a simple experiment for you to run. Go back to my original post at the top of this thread. If you are using or can use firefox, then click "edit", then "find", type "impossible" in the box and tell me where that appears in my post anywhere...
Oh, so we should only read your first post, and the rest is just rambling on your part that should not be taken seriously and should not be referred to? In the quote you clearly state that, if mathematics applied to computer Chess, you would be getting _way_ too many impossible results.

Oh boy, have I got bad news for you. Mathematics does apply to computer Chess, as mathematics applies universally and even extra-universally. So you do get way too many impossible results, and that reflects badly on your competence for getting results.

Remarks like the one above show that you are not interested in serious discussion, and are only intent on wasting my time, just like you are wasting the CPU time of your cluster. I am too busy for that at the moment: I still have to convert zillions of bitmaps for the back-porting of WinBoard to xboard. And that currently has priority, as I do habitually produce results and intend to continue doing so. Had it been my cluster, I would have solved the problem already ten times over in the time that you have been playing silly buggers here.

So this is where I stop reading. Just keep muddling on, and in two months, when we are two xboard versions further, we can resume this discussion, as at that time your next irreproducible result of 10,000 CPU-months of calculation will be ready... Have fun! :lol:
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An important proposal (Re: more on engine testing)

Post by bob »

Uri Blass wrote:
bob wrote: I still believe that the basic problem is that the game of chess, as played by computers, is far more random than suspected. Otherwise, how could two opponents play 1000 games and have one win 100 more than the other at a specific node count, and then, using node count +1000, the other wins 100 more? That is a tiny change in programs searching 2-3M nodes per second for a few seconds per move. Yet it makes a significant difference. Unexpected.

I believe that the difference that you mention simply does not happen
unless you have some serious bug.
I believe you ought to actually try an experiment rather than just guessing. Take two programs and play a sn=X match using the 40 positions or whatever ones you choose. Then try a sn=X+1000 match. Compare the results. Then you won't be guessing, you will know.

You actually believe I don't know how to make a program stop searching after some fixed number of nodes? I've been doing this more than long enough to master that complex (for some, apparently) task, and then to verify the results by playing dozens of repeated matches and comparing everything. The only thing that changes is the time used. sn=X does not always take the same exact amount of time because of clock jitter and architectural issues such as cache aliasing and the like. But the moves and scores and PVs are _always_ identical down to the last character in each set of 80 games played like this. Until I slightly change the node count limit.

So believe what you want, but you are wrong.




I would like to see a case where A beats B 550-450 at 10,000,000 nodes per move while B beats A 550-450 at 10,001,000 nodes per move, and where the results can be repeated to get exactly the same games (so the games really depend only on the number of nodes and not on factors like time).
I saved some of that data; let me locate it and I'll provide it here, although I am certain I did not save any log files from crafty. I do try to save interesting PGN from time to time, however...



You can intentionally create a bug that causes this behaviour, by telling the program to resign at move 90 if it is asked to search 10,001,000 nodes, but I believe that something like that practically does not happen.

Even if you toss a fair coin 1000 times, you will only very rarely get a result as lopsided as 550-450, and here I expect correlation between results of games from the same position to reduce the variance of the result.

If you give the exact versions and the programs with source code people can try it.
Already done. The program names from BayesElo are exactly what the programs supply via the "myname" command. Any version of crafty produces this same varying result.

Note that the fact that I do not respond to parts of your posts does not mean that I agree; I simply have no time to respond to every part I disagree with, because in that case I would not expect the discussion ever to be finished.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

more...

Post by bob »

All that I can find at present is a series of crafty vs crafty matches played with sn=x and sn=x+n.

The only issue is that this is the same version of crafty playing both sides. Since the games are played with alternating colors and time is removed, the wins and losses must by definition match exactly, and they do. But here is a quick summary, and you can tell me if you want to see the PGN...

sn=3000000 (3 million nodes per search exactly, no time extension on fail low, no short searches unless a move is forced, as when there is only one way to get out of check)

54/ 52/ 54 ( 0) that is 54-52-54, wins, draws, losses

now sn=3001000 (1,000 more nodes)

44/ 72/ 44 ( 0)

Is that not a _huge_ change? 10 fewer wins, 10 fewer losses, 20 more draws, just by adding 1,000 nodes. I ran shorter initial search limits as well as longer ones. The longest was sn=3100000 (100,000 extra nodes, which is about a 0.03-second change with crafty doing 3M nodes per second on this machine). 30 milliseconds is not a lot of time; clocks aren't accurate enough to run for precisely 30 ms and then move.

sn=3100000

40/ 80/ 40 ( 0)

Does every small change in nodes change the result? Nope. But _most_ do. And even when the totals are the same, if you look at all the games (PGN is easiest), some of the moves change but the game outcome stays the same, or one new game is lost but another new one is won, keeping the totals the same.
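For what it is worth, one crude way to ask whether the 52-draw versus 72-draw shift is bigger than ordinary binomial noise is a two-proportion z-test; this is a sketch of my own, not part of the test harness. The first call treats all 160 games per run as independent; the second assumes each colour-reversed pair of games is fully correlated (the same deterministic engine plays both sides), which halves the effective sample:

[code]
import math

def two_prop_z(k1, n1, k2, n2):
    """Two-sided z-test for the difference between proportions k1/n1 and k2/n2."""
    p1, p2 = k1 / n1, k2 / n2
    pool = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value

# Draw counts from the two fixed-node runs quoted above (out of 160 games each).
z, p = two_prop_z(52, 160, 72, 160)
print(f"games independent: z = {z:.2f}, two-sided p = {p:.3f}")
# Counting each mirrored colour pair as a single independent outcome instead.
z, p = two_prop_z(26, 80, 36, 80)
print(f"pairs independent: z = {z:.2f}, two-sided p = {p:.3f}")
[/code]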

I can run one of the above 1,000,000 times and get the same result each and every time, precisely, since time is removed as an influence.

If that data is interesting, I will scrounge up the PGN for the games so you can look at it. It is not an overwhelming amount of data compared to the 25,000 game matches.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:
hgm wrote:
bob wrote:
hgm wrote:
bob wrote:..., otherwise I am getting _way_ too many impossible results.
Well, what can I say. If you get impossible results, they must be wrong results. And getting many wrong results is a good operational definition for incompetence...
Aha. opinions rather than facts now.
???

Seems a fact to me that you complained about getting impossible results. These were not my words.
Here is a simple experiment for you to run. Go back to my original post at the top of this thread. If you are using or can use firefox, then click "edit", then "find", type "impossible" in the box and tell me where that appears in my post anywhere...
Oh, so we should only read your first post, and the rest is just rambling on your part that should not be taken seriously and should not be referred to? In the quote you clearly state that, if mathematics applied to computer Chess, you would be getting _way_ too many impossible results.

Oh boy, have I got bad news for you. Mathematics does apply to computer Chess, as mathematics applies universally and even extra-universally. So you do get way too many impossible results, and that reflects badly on your competence for getting results.

Remarks like the one above show that you are not interested in serious discussion, and are only intent on wasting my time, just like you are wasting the CPU time of your cluster. I am too busy for that at the moment: I still have to convert zillions of bitmaps for the back-porting of WinBoard to xboard. And that currently has priority, as I do habitually produce results and intend to continue doing so. Had it been my cluster, I would have solved the problem already ten times over in the time that you have been playing silly buggers here.

So this is where I stop reading. Just keep muddling on, and in two months, when we are two xboard versions further, we can resume this discussion, as at that time your next irreproducible result of 10,000 CPU-months of calculation will be ready... Have fun! :lol:
All that rambling shows is that you are willing to look no farther ahead than the tip of your nose, because to do so would be too much work. Glad to know your background in operating systems, computer architecture and parallel issues is so far beyond mine that you would have solved the problem, even though you have not, to date, recognized that you even _have_ a problem. It takes a great mind to solve an undetected problem, indeed. And I would certainly bow to that kind of superior intelligence.

If I had seen it, that is.