more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:not in my context. I have always referred to A and A' where A' is a minor modification of A, which is the way we normally test and develop. Very few truly revolutionary ideas crop up nowadays, most are eval changes or minor search changes that are not going to be huge Elo boosters. In the above, 25 Elo is a _significant_ change, one that will rarely happen on a small change.
Well, your context is not everyone's context. Most people have engines that are 1000 Elo weaker than Crafty, and making 100-Elo jumps there is quite common. That is the stage where they do most of their testing. Plus, if you break something, it is very easy to drop 100 Elo, no matter how strong you were before. Very useful to run 800 games to make sure that didn't happen...
My point was that such a number of games can not be used for what many are using it for. Still don't get that point?
I got that point even before you started programming. To measure something small requires more accuracy. Now tell us something that we didn't know.
For example, adding endgame tables produces an Elo boost that is under that noise level, so does it help or hurt? It takes a huge number of games to answer this. And nobody (except me) has run 'em. I see Elo changes based on SMP results published regularly. After 20 games, etc. And while 20 games might be enough to conclude, somewhat reliably, that 8 processors are stronger than 1, it isn't enough to measure the size of the improvement with any accuracy. And when I see BayesElo numbers of X +/- N where N is very small, and I compare that to the results I just posted, where the two ranges are disjoint, I can only conclude that N has to be taken with a huge grain of salt when dealing with computers, which are certainly subject to outside and random influences all the time...
And this is the point that you don't seem to get. It doesn't matter how random the influences are. Random is good. The numbers quoted by BayesElo (which, in your case, you could also calculate by hand in 15 seconds) assume totally independent random results, and are an upper limit to the difference that you could typically have between repetitions of the same experiment.

Your results violate that upper bound, and hence the assumptions cannot be satisfied. If you cannot pinpoint the source of the dependence between your games, and eliminate it, your large-number-of-games testing will be completely useless. If I had designed a laser ruler to measure distances with sub-nanometer precision, and the typical difference between readings on the same steel needle were more than a millimeter, I would try to repair my laser ruler, not complain about the variability of the length of needles...
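
For concreteness, here is a minimal sketch (not from the thread) of the back-of-the-envelope calculation referred to above: the standard error of the score fraction over N independent games, converted to an approximate Elo interval. The 30% draw rate in the example and the logistic score-to-Elo conversion are assumptions for illustration.

import math

def elo_error_bar(n_games, score_frac, draw_frac):
    """Standard error of the score over n independent games, and the
    corresponding ~2-sigma (95%) half-width in Elo via the logistic
    score-to-Elo conversion."""
    win = score_frac - draw_frac / 2.0
    loss = 1.0 - win - draw_frac
    var = (win * (1.0 - score_frac) ** 2
           + draw_frac * (0.5 - score_frac) ** 2
           + loss * (0.0 - score_frac) ** 2)
    sigma = math.sqrt(var / n_games)

    def to_elo(s):
        s = min(max(s, 1e-6), 1.0 - 1e-6)
        return -400.0 * math.log10(1.0 / s - 1.0)

    return sigma, to_elo(score_frac + 2.0 * sigma) - to_elo(score_frac)

# 25,000 games, 50% score, 30% draws (assumed numbers):
sigma, elo_half_width = elo_error_bar(25000, 0.50, 0.30)
print("sigma(score) = %.4f, ~2-sigma Elo half-width = %.1f" % (sigma, elo_half_width))
# Two *independent* repetitions should rarely differ by much more than
# sqrt(2) times this; disjoint ranges across runs point to correlation.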
hgm
Posts: 28354
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

bob wrote:What do you mean "and divulge your data". I have always done that.
Seems to me my request was quite specific:
hgm wrote:Give us the complete list of game results (all 50,000), with the time they were played, and number of the core they have been running on. So we can slice it up, per engine, per time slice, per core, per position, and check the result histograms of the slices. So that we can do a Fourier transform on results, to make the correlations show up in the power spectrum.
bob wrote:If you want the PGN, I can easily re-run the tests and put the PGN files on my ftp box. If you want to wade thru 50,000 games. Doesn't bother me a bit. But you imply that I don't divulge data when in reality, any time anyone has asked, I've always been happy to make the data available...
Well, if the PGN does contain the time the game was played and the core it was played on, then yes, the PGN would do. The tags alone would actually be enough, and the data I asked for could be compressed to one int per game to make it more manageable.

Your remark surprises me, though: I thought the whole point was that you cannot reliably re-run those tests, and that your attempt to do so will not give you the same scores. And if you produce two 25,000-game runs now, where BayesElo says that the ratings are equal to within 9 Elo, they will of course not be of any use. A defect can only be diagnosed when it manifests itself. If a pathologist needs to do a post-mortem on someone that has died of a mysterious disease, it doesn't help much if you tell him: "oh, sorry, I lost the body. But I can kill a few others for you!"

But you are claiming you typically get result differences of around 6 sigma, and if that is true, I guess you won't have much trouble producing an equally suspect result in the next 50,000 games. But be sure to log the appropriate data (opponent, position+color, time, core/node, result) required for a thorough analysis of the result. "This guy looks dead to me" is no good as a pathologist's report.
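
As an illustration of the slicing and Fourier analysis asked for above, a rough sketch assuming the per-game scores (1/0.5/0) have already been extracted in chronological order; numpy is assumed available.

import numpy as np

def result_power_spectrum(scores):
    """Power spectrum of a chronological sequence of game scores
    (1.0 win, 0.5 draw, 0.0 loss). For independent games the spectrum
    should be roughly flat (white noise); peaks or a rising low-frequency
    end suggest correlations, e.g. per-core or time-of-day effects."""
    x = np.asarray(scores, dtype=float)
    x = x - x.mean()                        # remove the DC component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0)  # in cycles per game
    return freqs, power

# Hypothetical usage, with scores ordered by the time each game finished:
# freqs, power = result_power_spectrum(scores)
# A peak near frequency 1/k would point to a correlation repeating every k games.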
Harald
Posts: 318
Joined: Thu Mar 09, 2006 1:07 am

Re: more on engine testing

Post by Harald »

bob wrote:The only issue with my referee is that one of the opponents has to be crafty, as the referee depends on Crafty to accurately tell it when a game is technically over, which is simpler than having the referee keep up with board position and such here...
Sorry, I did not follow all the arguments and descriptions in this thread,
so I may be wrong, but I have a question.

You have two big 25,000-game tests comparing crafty_a to crafty_b.
The collected test data show an Elo difference that shouldn't be there.
One of the engines is used as a referee. Did you use the same engine as
referee in both tests, that is
run_test referee=crafty_a other=crafty_b
or was one of the tests started like this
run_test referee=crafty_b other=crafty_a ?
If your test results show a strange result and Elo shift, then the fact that
an engine is a referee may cost (or gain) a few Elo points.

This is only one of many possible problems with a test experiment.
With a sample from tests so big (2*25000) I trust the statistics, probability
and mathematics so much that I think there _must_ be a hidden problem.
We want to help you to find it or at least learn to avoid such problems
when we all get a big test cluster. :-)

Harald
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:not in my context. I have always referred to A and A' where A' is a minor modification of A, which is the way we normally test and develop. Very few truly revolutionary ideas crop up nowadays, most are eval changes or minor search changes that are not going to be huge Elo boosters. In the above, 25 Elo is a _significant_ change, one that will rarely happen on a small change.
Well, your context is not everyone's context. Most people have engines that are 1000 Elo weaker than Crafty, and making 100-Elo jumps there is quite common. That is the stage where they do most of their testing. Plus, if you break something, it is very easy to drop 100 Elo, no matter how strong you were before. Very useful to run 800 games to make sure that didn't happen...
Interesting. I start a thread, and I don't get to define the context I am working in, but I have to live in your fairy-land world instead? Give me a break. I _clearly_ defined what I was trying to do. In case you missed it:

============================================================
A while back I mentioned how difficult it is to draw conclusions about relatively modest changes in a chess program, requiring a ton of games to get usable comparisons. Here is a sample to show that in a way that is pretty easy to understand.
============================================================

Now if that is beyond your ability to read and understand, that is hardly my problem. "modest changes" is pretty clearly defined. Even if you need to resort to a copy of Webster's... So I do not care what _your_ goal is. I stated mine, and the difficulty I was seeing. Others have stated the _same_ goal, including the Rybka team, which I quoted as well. So if the conversation doesn't interest you, ignore it. Don't try to turn it into something that you want and then tell me why my arguments no longer apply.




My point was that such a number of games can not be used for what many are using it for. Still don't get that point?
I got that point even before you started programming. To measure something small requires more accuracy. Now tell us something that we didn't know.


That seems to be easy enough, in your case anyway. You, and many others, have quoted short match results when evaluating changes or comparing versions. Apparently you _don't_ understand that the tests being run are worthless.


For example, adding endgame tables produces an Elo boost that is under that noise level, so does it help or hurt? It takes a huge number of games to answer this. And nobody (except me) has run 'em. I see Elo changes based on SMP results published regularly. After 20 games, etc. And while 20 games might be enough to conclude, somewhat reliably, that 8 processors are stronger than 1, it isn't enough to measure the size of the improvement with any accuracy. And when I see BayesElo numbers of X +/- N where N is very small, and I compare that to the results I just posted, where the two ranges are disjoint, I can only conclude that N has to be taken with a huge grain of salt when dealing with computers, which are certainly subject to outside and random influences all the time...
And this is the point that you don't seem to get. It doesn't matter how random the influences are. Random is good. The numbers quoted by BayesElo (which, in your case, you could also calculate by hand in 15 seconds) assume totally independent random results, and are an upper limit to the difference that you could typically have between repetitions of the same experiment.
Apparently not. I just gave two runs that violated this completely. So it is _far_ from absolute. And again, I will say that this apparently doesn't apply to computer games, otherwise I am getting _way_ too many impossible results.
Your results violate that upper bound, and hence the assumptions cannot be satisfied. If you cannot pinpoint the source of the dependence between your games, and eliminate it, your large-number-of-games testing will be completely useless. If I had designed a laser ruler to measure distances with sub-nanometer precision, and the typical difference between readings on the same steel needle were more than a millimeter, I would try to repair my laser ruler, not complain about the variability of the length of needles...
Right. And I have explained exactly why there is no dependency of one game on another. Each is run individually, in a very controlled environment. The same copy of an engine is used each and every time. Whatever they do internally I don't care about, since that is a variable everyone has to deal with. I'm tired of this nonsense about games somehow being dependent when that just can't happen. There are other explanations, if you would just look. One is the inherent randomness and streakiness of chess games. Another is the timing issue that can never be eliminated in normal chess play. That does not produce direct dependencies. If running consecutive games somehow biases that computer to favor one opponent consistently, there is nothing that can be done, because there is no logical way to measure and fix such a thing. It is just more pure randomness thrown in...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

hgm wrote:
bob wrote:What do you mean "and divulge your data". I have always done that.
Seems to me my request was quite specific:
hgm wrote:Give us the complete list of game results (all 50,000), with the time they were played, and number of the core they have been running on. So we can slice it up, per engine, per time slice, per core, per position, and check the result histograms of the slices. So that we can do a Fourier transform on results, to make the correlations show up in the power spectrum.
bob wrote:If you want the PGN, I can easily re-run the tests and put the PGN files on my ftp box. If you want to wade thru 50,000 games. Doesn't bother me a bit. But you imply that I don't divulge data when in reality, any time anyone has asked, I've always been happy to make the data available...
Well, if the PGN does contain the time the game was played and the core it was played on, then yes, the PGN would do. The tags alone would actually be enough, and the data I asked for could be compressed to one int per game to make it more manageable.

Your remark surprises me, though: I thought the whole point was that you cannot reliably re-run those tests, and that your attempt to do so will not give you the same scores. And if you produce two 25,000-game runs now, where BayesElo says that the ratings are equal to within 9 Elo, they will of course not be of any use. A defect can only be diagnosed when it manifests itself. If a pathologist needs to do a post-mortem on someone that has died of a mysterious disease, it doesn't help much if you tell him: "oh, sorry, I lost the body. But I can kill a few others for you!"

But you are claiming you typically get result differences of around 6 sigma, and if that is true, I guess you won't have much trouble producing an equally suspect result in the next 50,000 games. But be sure to log the appropriate data (opponent, position+color, time, core/node, result) required for a thorough analysis of the result. "This guy looks dead to me" is no good as a pathologist's report.
I do not, as a matter of usual behavior, gather the node name. However, I can certainly do that and put it in the "Site" tag as opposed to "Olympus cluster" or "Ferrum cluster". I always include the FEN, the color, the two opponents, and the result as those are standard PGN requirements. I will have to look at the Date tag to see if I include the time, otherwise it is easy to add. But it would also be useless information since the cluster itself is unaware of the time and has no time-based things it does out of the clear blue sky. And even if it did, that would have to be a part of the testing since all machines have fluctuating loads except for controlled systems like this.
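
For illustration only, a small sketch of what such a tag section might look like if the node name went into Site and the wall-clock time into a Time tag; the helper and the tag names beyond the standard ones are hypothetical, not Crafty's actual output code.

import datetime

def pgn_headers(white, black, fen, result, node, start_time):
    """Tag section for one game, carrying the requested metadata: node
    name in Site, wall-clock start in Date/Time. Tag names beyond the
    standard ones are illustrative."""
    tags = [
        ("Event", "cluster test"),
        ("Site", node),                          # e.g. "ferrum-node017"
        ("Date", start_time.strftime("%Y.%m.%d")),
        ("Time", start_time.strftime("%H:%M:%S")),
        ("White", white),
        ("Black", black),
        ("FEN", fen),
        ("SetUp", "1"),
        ("Result", result),
    ]
    return "\n".join('[%s "%s"]' % (k, v) for k, v in tags)

# Hypothetical usage:
# print(pgn_headers("Crafty-23.0", "Opponent-1", some_fen, "1-0",
#                   "node017", datetime.datetime.now()))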

As far as the rest of your stuff goes, it is pure rambling. Be surprised about whatever you want to be surprised about. I did _not_ say the results I posted were "typical". I said that they were just 6 consecutive runs, 4 with 800 game matches, 2 with 25,000 game matches. Whether they are typical or not I do not know. I do know that with 25,000 game matches the Elo from BayesElo varies way too much to draw any conclusions about modest changes. In fact, I removed major parts of Crafty's Evaluation and still had a hard time measuring "good or bad" when I added individual _major_ components back in... using big runs... So again, for the N+1th time, read what I wrote, not what you wanted me to write. Those 6 sets of BayesElo output were just 6 consecutive runs used to test the BayesElo numbers to see if they were more stable than just using raw win/lose/draw match scores to draw conclusions.

I doubt I can make any progress until Monday. The A/C problems have been significant this summer for reasons unknown to me, as our computer room has a brand new 40-ton A/C system that is under one year old and, it appears, is about as reliable as a 50-year-old unit...

As far as the "interesting" comment goes, I only said I can re-run the test and no more. Did not say nor imply that the results would be the same. Nor that they would be different. No way to know until I run the test and see what I get. I only said "it is what it is" and explained _exactly_ "what it is."
henkf

Re: more on engine testing

Post by henkf »

so basically what you are saying is that statistics doesn't apply to computer chess, since the results of computer chess games are too random?

i'm not a statistics expert, but it seems to me that if you typically get 6-sigma differences over consecutive samples of 25,000 games, no number of games will fix this. in that case the clusters, no matter how many cores they have, are lost on you and the only benefit of your testing is the contribution to global warming.

as said before i'm not a statistics expert and i'm not stating whether you are wrong or right, however if you are right my conclusion seems legitimate to me.
krazyken

Re: more on engine testing

Post by krazyken »

If you do have the PGN from the last game run, it might be useful to see how many duplicate games are in there. Perhaps the sample size is significantly smaller than 25000.
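
A hedged sketch of that duplicate check, assuming the games have been concatenated into a single PGN file and that the python-chess library is available; a game's fingerprint here is (players, starting FEN, move list).

from collections import Counter

import chess.pgn   # python-chess, assumed to be available

def duplicate_game_counts(pgn_path):
    """Count exact duplicates: games between the same players that start
    from the same FEN and repeat the identical move sequence. Returns
    (total games, distinct games, the ten most repeated fingerprints)."""
    fingerprints = Counter()
    total = 0
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            total += 1
            key = (game.headers.get("White", "?"),
                   game.headers.get("Black", "?"),
                   game.headers.get("FEN", "startpos"),
                   tuple(move.uci() for move in game.mainline_moves()))
            fingerprints[key] += 1
    return total, len(fingerprints), fingerprints.most_common(10)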
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Harald wrote:
bob wrote:The only issue with my referee is that one of the opponents has to be crafty, as the referee depends on Crafty to accurately tell it when a game is technically over, which is simpler than having the referee keep up with board position and such here...
Sorry, I did not follow all the arguments and descriptions in this thread,
so I may be wrong, but I have a question.

You have two big 25,000-game tests comparing crafty_a to crafty_b.
The collected test data show an Elo difference that shouldn't be there.
One of the engines is used as a referee. Did you use the same engine as
referee in both tests, that is
run_test referee=crafty_a other=crafty_b
or was one of the tests started like this
run_test referee=crafty_b other=crafty_a ?
If your test results show a strange result and Elo shift, then the fact that
an engine is a referee may cost (or gain) a few Elo points.

This is only one of many possible problems with a test experiment.
With a sample from tests so big (2*25000) I trust the statistics, probability
and mathematics so much that I think there _must_ be a hidden problem.
We want to help you to find it or at least learn to avoid such problems
when we all get a big test cluster. :-)

Harald
None of the above. The referee is similar to winboard/xboard. I start the referee, it then uses fork() to create two processes, pipe() to create a connection to each of the two new processes, and then each process uses exec() to execute a copy of one of the two engines being tested. The referee sends both engines the starting FEN position and the initial time control, then instructs the correct opponent to move. It then simply alternates taking the move from the first opponent and sending it, plus the time/otim commands, to the second opponent. This goes back and forth until the game ends. The referee is not a part of either engine; it is a separate program much like xboard, except with no graphical interface, since I don't want to see up to 540 graphical boards at one time. (I can play up to 540 games at a time on the faster cluster; I do not use both clusters for one run since the processors on the two clusters are not identical.)
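
A heavily simplified sketch of that relay loop, written with Python subprocess pipes rather than raw fork()/pipe()/exec(); the engine commands, the protocol handling and the game-over test are rough assumptions, not the actual referee.

import subprocess

def play_game(cmd_a, cmd_b, start_fen, max_moves=500):
    """Very simplified relay loop in the spirit of the referee described
    above. Protocol handling (xboard/CECP) is only roughly sketched: no
    feature negotiation, no time/otim bookkeeping, no timeouts, naive
    end-of-game test."""
    engines = [subprocess.Popen(cmd, stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True, bufsize=1)
               for cmd in (cmd_a, cmd_b)]

    def send(engine, line):
        engine.stdin.write(line + "\n")
        engine.stdin.flush()

    for e in engines:
        send(e, "xboard")
        send(e, "force")                   # accept moves without thinking
        send(e, "setboard " + start_fen)

    mover, waiter = 0, 1
    send(engines[mover], "go")             # first engine plays the side to move
    for _ in range(max_moves):
        line = engines[mover].stdout.readline()
        if not line:
            break                          # engine died or closed its pipe
        if line.startswith("move "):
            mv = line.split()[1]
            send(engines[mover], "force")  # back to passive mode
            send(engines[waiter], mv)      # relay the move to the other side
            send(engines[waiter], "go")
            mover, waiter = waiter, mover
        elif any(tag in line for tag in ("1-0", "0-1", "1/2-1/2", "resign")):
            break                          # a result was announced
    for e in engines:
        e.kill()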

The referee has been sanity-checked (although it is a simple program) by playing a few dozen 160-game matches and comparing the PGN for each game to the log file produced by Crafty, to verify that the PGN result is consistent with Crafty's expectation. Way back we actually found a bug in Crafty's draw detection this way: Crafty thought it was winning, the PGN said "draw by repetition", and when we checked, the PGN was correct and Crafty had overlooked the repetition. But at least we know that the PGN collection code is saving sane information.

I believe that what we are seeing is just an artifact of computers playing the game. In human play you won't find two opponents, one 200 Elo stronger than the other, where the stronger one loses _many_ consecutive games and then wins many _more_ consecutive games, in the kind of streaks we see regularly in comp vs comp matches. Such behavior would give the impression that a group of games were somehow dependent when they are not.
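
That streakiness claim is checkable. A hedged sketch of one way to do it: compare the longest streak in the observed sequence against streaks from random permutations of the same results (a permutation keeps the win/draw/loss frequencies fixed, so only the ordering is tested).

import random

def longest_streak(results, value):
    """Length of the longest run of `value` (e.g. 1.0 for wins)."""
    best = cur = 0
    for r in results:
        cur = cur + 1 if r == value else 0
        best = max(best, cur)
    return best

def streak_p_value(results, value, trials=10000):
    """How often do random permutations of the same results (same
    win/draw/loss frequencies, independent ordering) produce a streak
    at least as long as the one observed? A tiny p-value suggests real
    dependence rather than ordinary streakiness."""
    observed = longest_streak(results, value)
    pool = list(results)
    hits = 0
    for _ in range(trials):
        random.shuffle(pool)
        if longest_streak(pool, value) >= observed:
            hits += 1
    return observed, hits / trials

# Hypothetical usage, on the chronological scores against one opponent:
# streak_length, p = streak_p_value(scores_vs_opponent_3, 0.0)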

So, to recap.

For the tests run, I ran on 128 nodes with nothing else running, and no possibility anything else could run, since I was on a "closed" system. Each node is absolutely identical to every other node. This is a 128-blade cluster in 5 racks. All use the same 3.2GHz Intel Xeon processors (older 64-bit processors), with 4.0 gigs of RAM and a big local hard drive, where the directories being used are wiped clean after each set of 4 games (2 with black and 2 with white) is played.

The referee starts two programs, sets the position and then interfaces between the two programs. No pondering. No SMP search. No learning. No opening books. After a game ends, both engines are "killed", two new instances are started up, and this is repeated for a total of 4 games with the given starting position, alternating colors after each game. That is a single "mini-match" of 4 games. There are 40 starting positions, for a total of 160 games per opponent over the set of mini-matches. 5 opponents gives 800 games. In the big runs, this 800-game set is repeated 32 times, producing 25,600 games.
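
For the record, the arithmetic of that schedule as a quick enumeration (illustrative only, using the counts just given).

OPPONENTS = 5              # engines Crafty is tested against
RUNS = 32                  # repetitions of the 800-game set
POSITIONS = 40             # starting positions, each played as a 4-game mini-match
GAMES_PER_MINI_MATCH = 4   # same position, colors alternated

mini_matches = [(opp, run, pos)
                for opp in range(1, OPPONENTS + 1)
                for run in range(1, RUNS + 1)
                for pos in range(1, POSITIONS + 1)]

total_games = len(mini_matches) * GAMES_PER_MINI_MATCH
print(len(mini_matches), "mini-matches,", total_games, "games")   # 6400 mini-matches, 25600 games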

Each 4-game mini-match is a single "shell script" that is submitted. Each script is scheduled when a node is unused, and runs to completion. When that run completes, another one is scheduled, although there is a 1-2 minute gap while the node "settles down" to appear unused again. The runs use local disk space, and after completion the final PGN results are copied to a multi-terabyte file storage system based on hardware RAID.

The PGN is organized into a bunch of files, named matchN.X.Y where N is the number of the opponent (1-5), X is the run number (1-32) and Y is the position number (1-40).
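
Given that naming scheme, a hedged sketch of the per-slice tally asked for earlier in the thread, here grouped by run number; the naive tag parsing and the assumption that Crafty's name contains "rafty" are illustrative.

import glob
import os
import re
from collections import defaultdict

def score_per_run(pgn_dir):
    """Average Crafty score per run number, read from files named
    matchN.X.Y (N = opponent, X = run, Y = position). Assumes each file
    holds the PGN of one 4-game mini-match; grouping per opponent (N) or
    per position (Y) works the same way."""
    points = defaultdict(float)
    games = defaultdict(int)
    for path in glob.glob(os.path.join(pgn_dir, "match*.*.*")):
        n, x, y = (int(v) for v in
                   os.path.basename(path).split("match")[1].split("."))
        with open(path) as fh:
            text = fh.read()
        for white, result in re.findall(
                r'\[White "([^"]+)"\].*?\[Result "([^"]+)"\]', text, re.S):
            score = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}.get(result)
            if score is None:
                continue
            crafty_is_white = "rafty" in white          # crude name match
            points[x] += score if crafty_is_white else 1.0 - score
            games[x] += 1
    return {run: points[run] / games[run] for run in sorted(points)}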

There is nothing carried between games (engines are re-started, no learning, no book/book learning, etc.) The games are run essentially in order by player number first, then by position, and finally by run number, so that partial results give a better cross-section of what is happening while the big match is still in progress...

I have a test in Crafty that monitors NPS to make sure that it doesn't fluctuate more than a normal amount. This was done because in the past, before some scheduling changes on the cluster, it was possible for a node to get a third (runaway) process which would consume CPU time and interfere with the running programs. I put the check into Crafty to catch this. It no longer happens, although the test is always there as a sanity check. There are no endgame tables (although I could copy the 3-4-5 piece files to every node's local disk, or even use the 6-piece files over the InfiniBand network, which would not be particularly good).

From that, you can see why I believe there is zero potential for any single game to be dependent on any other game result. It is as controlled and precise as I can make it, and it is _far_ more controlled than what most are doing in their basement testing...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

krazyken wrote:If you do have the PGN from the last game run, it might be useful to see how many duplicate games are in there. Perhaps the sample size is significantly smaller than 25000.
I think they were removed when I tried to re-run the test. Then the A/C went south. I will queue up a 25,000 game run and post the PGN as soon as this gets fixed. No idea about duplicates, although there must be quite a few since there are 40 starting positions played from black and white side for 80 different starting points for each pair of opponents.

BTW wouldn't "duplicates" be a good thing here? More consistency? If every position always produced 2 wins and 2 losses, or 2 wins and 2 draws, then the variability would be gone and the issue would not be problematic.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

henkf wrote:so basically what you are saying is that statistics doesn't apply to computer chess, since the results of computer chess games are too random?
No, what I am saying is that perhaps what Elo derived as a good measure for _human_ chess might not apply with equal quality to computer chess games.

i'm not a statistics expert, but it seems to me that if you typically get 6-sigma differences over consecutive samples of 25,000 games, no number of games will fix this. in that case the clusters, no matter how many cores they have, are lost on you and the only benefit of your testing is the contribution to global warming.
Not to mention headaches caused by examining huge quantities of data. But I tend to agree at present.

as said before i'm not a statistics expert and i'm not stating whether you are wrong or right, however if you are right my conclusion seems legitimate to me.