New testing thread

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

hgm wrote:
Dirt wrote: Using the node count instead of time would mean you're not testing the time management code, right? I guess you'd then have to test that separately, which is really starting to make the testing complex.
This would not be true if you use the node-based time-control mode of WinBoard 4.3.14.

Then the time-control code would be fully active. It is just that the engine would be fed, (and uses internally) a virtual time, derived on its node count. But apart from the clock routine, the rest of the engine would never know that it is not dealing with real time.
How can that be? You tell me to search a specific number of nodes, then how can I go over that as I would if I fail low on a position? How can I go over that by variable amounts as I do now, depending on how much I appear to be "losing"???

If you tell me exactly how long, there is going to be node count jitter. If you tell me exactly how many nodes to search, there will be no timing jitter and I can't test my time allocation code at all.

Can't be both ways, at least in my program...
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

Fritzlein wrote:
hgm wrote:Watch out here! What standard deviation are you talking about? A standard deviation is defined as a sum over events. What events do you sum over? If you are talking about the (hypothetical) sampling of all possible starting positions played by all possible opponents, you would be right that a sampling procedure limiting the samples to a single position (or just a few), itself randomly chosen, will drive up the standard deviation of the match results by correlating the games. So procedure A, randomly selecting a position, playing 80 games with that position, and then repeating the full procedure (including selection of a new random position) a number of times, might give you a larger SD of the 80-game match results than procedure B, where you independently select a random position for each individual game.

OTOH, if you would randomly select a position only once (procedure C) and then repeatedly use that random position in several 80-game matches (perhaps differing in opponents, number of nodes, time jitter or whatever), the standard deviation of the 80-game match results will be smaller than when using the totally random sampling, because you correlate games between the different samples you are calculating the SD over. In procedure A, by contrast, you were correlating only the games within a single match, keeping different matches uncorrelated.
I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.

I was not talking about the standard deviation between a test run and an identical or nearly-identical test run. We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision? (Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke that out; to me it is frankly less interesting because the mathematical mystery is gone. What I have explained (to my own satisfaction at least) is why each of the two 25600-game test runs can be expected to have a large error relative to Crafty's true strength, and thus I only need any old random change between the two big runs to complete the less interesting (to me) explanation as well.
Note that the question of why the two tests failed to get the same wrong answer twice with high precision is important.

Maybe the difference is not important, but maybe it is important and can cause a wrong result in a test that is supposed to give the right result.

If the difference is equivalent to being 0.1% slower, then it is not important with the right test, because I expect the rating difference from being 0.1% slower to be less than 1 Elo. But if the difference is that in part of the games one of the opponents ran 50% slower, then the difference is very important and can also spoil the results of future tests.

Bob did not save the PGN of the games, so we cannot know, but in the future it is important to save the PGN so it is possible to find out whether there is a big problem somewhere in the test; otherwise there is no point in discussing the results.

Uri
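
To make the procedure A / procedure B distinction discussed above concrete, here is a toy Monte Carlo sketch in C. It is purely illustrative: the +/-10% position bias, the win/loss-only scoring and all the constants are assumptions, not measurements from any real test. It plays many simulated 80-game matches and prints the standard deviation of the match score for both sampling procedures; the shared-position matches (A) come out noticeably noisier than the fresh-position ones (B).

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define MATCHES 20000   /* number of simulated 80-game matches */
#define GAMES   80

static double rand01(void) { return (rand() + 0.5) / ((double)RAND_MAX + 1.0); }

/* each starting position shifts the 50% base score by up to +/-10% */
static double position_bias(void) { return 0.2 * (rand01() - 0.5); }

static double match_score_sd(int shared_position)
{
    double sum = 0.0, sumsq = 0.0;
    for (int m = 0; m < MATCHES; m++) {
        double score = 0.0;
        double p = 0.5 + position_bias();      /* procedure A: one position per match */
        for (int g = 0; g < GAMES; g++) {
            if (!shared_position)
                p = 0.5 + position_bias();     /* procedure B: fresh position per game */
            if (rand01() < p)
                score += 1.0;                  /* draws ignored for simplicity */
        }
        sum += score;
        sumsq += score * score;
    }
    double mean = sum / MATCHES;
    return sqrt(sumsq / MATCHES - mean * mean);
}

int main(void)
{
    srand(1);
    printf("SD of 80-game score, one shared position per match (A): %.2f\n",
           match_score_sd(1));
    printf("SD of 80-game score, fresh position every game     (B): %.2f\n",
           match_score_sd(0));
    return 0;
}

With these toy numbers, procedure A lands at an SD of roughly 6.4 points per 80-game match against roughly 4.5 for procedure B, which is the inflation hgm describes.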
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:However, would it be possible for either (a) you _do_ follow a specific discussion and post comments related to it or (b) if you choose to not follow context, then also choose to not make comments that have nothing to do with the discussion?
You would have to be more specific. That you perhaps cannot see the relevance of my remarks does not necessarily mean that they have nothing to do with the discussion. Insofar as the discussion has to do with anything in the first place, that is...
Someone asked me to run the test. It _did_ change the results. I ran 4 runs to see if the variability was lower, higher, or the same.
Did it? What change are you talking about? 2 Elo on Crafty's rating, which had an error bar of +/- 18 in the first place? You call that a change? Your statement "It _did_ change the results" is actually very questionable. What is 'it' here? Did that (insignificant) change in the results occur because you included games between the other engines, or simply because you redid the test (or some games of the test)? The Crafty rating would very likely have changed by much more than 2 points (namely 9 points = 1 SD) if you had redone the Crafty games without playing the opponents against each other. And it would likely have changed in another way if you had only redone the games of the opponents against each other. The conclusion that including games between the opponents helps is therefore not justified.
First, look at the data. Second, look at what I said changed. The ratings of all programs were squeezed into a narrower band with the round-robin. That certainly gave a better estimate of each program's rating than just the first set of games. Didn't help me much there, but if you look at the two crafty versions, their "distance apart" collapsed quite a bit, which _was_ significant.

And if someone has done that, then almost all past testing is flawed, since I am using the same unmodified source that was used in the CCRL, SSDF, etc. testing.
A good tester would recognize such engines as variable, and delete their results from the tests. This is why you have to keep track of variance and correlations within the results of each engine.
Har-de-har-har. So a tester is going to decide "hey, program behaves strangely, and plays better in even months than in odd months" and throw it out. I don't think so. Until I brought this up, I'd bet you had _zero_ idea that the results are as volatile as they are. Later you suggested that Crafty was an odd program and went into micro-max/joker and full-iteration discussions. And now we discover that from the same starting position, even Fruit won't play the same game twice in 100 tries, no book or anything. So it is easy to now "notice" something after it has been pointed out. Had I said nothing, the results would have kept on coming in and nobody would have even thought about the randomness that is inherent in computer games, far more so than in human games.

And btw, that would _not_ corrupt the test and make your "dependency" condition show up. A program would simply play better about 1/2 the time. So even that statistical suggestion would be wrong. A binary decision based on even/odd months is not going to produce any sort of dependency in the data.
It most certainly will. The data will be highly correlated timewise, and the stochastic processes producing them will be highly dependent in the mathematical sense. That the causal relationship is in fact that both depend on the month of the year is something the math does not care about.
It would produce variation if your sample interval matched up, but not dependency.

No, because the TOD is always synced to NTP sources. You have suggested a way to break the test. If anyone thought that was _really_ the issue, the only flaw with that reasoning is that _all_ the programs would have to be doing that, and in a way that would get in sync with my testing. Month is too coarse when you run a test in a day, except right at the boundary. However, it is certainly easy enough to verify that this doesn't happen, if someone wants to. The source for all the engines _is_ available. But I suspect no one would consider stomping through source to see if _that_ is actually happening. Because if it was, it would show up in other testing as well.
You might not be able to see it at all by looking at the source, just as you will in general not be able to debug a program by only looking at its source. The behavior I described might be the result of a well-hidden bug. The only reliable way to find bugs is by looking at how they manifest themselves.

This is not far-fetched. It did actually happen to me. At some point I noted that Joker, for a search to a given depth from the opening position, would sometimes produce a different score at the same search depth, even when 'random' was switched off. It turned out that the local array in the evaluation routine that held the backward-most Pawn of each file was only initialized up to the g-file, and the byte for the h-file coincided with a memory address that was used to hold the high-order byte of a clock variable, read to know if the search should be aborted. During weeks when this was large and positive, the backward-most pawn was found without error. When it was negative, the negative value stuck like an off-board ghost Pawn and caused horrendous misjudgements in the Pawn evaluation, leading to disastrous Pawn moves, losing the game.
Even if that were true (and it does not necessarily have to be true, as the problem might be in one of the engines, and other people might not use that engine), the main point is that others do not care. They do not make 25,000-game runs, so their errors are always dominated by sampling statistics. Systematic errors that are 6 sigma in a 25,000-game run are only 0.6 sigma in a 250-game run, and of no consequence. You could not see them, and they would not affect your testing accuracy.
What on earth does that mean? So you are _now_ going to tell me that you would get more accurate results with 250 games than with 25,000? :) Simply because the SD is now much larger? :) Hint: that is _not_ a good definition of accuracy...
Are you incapable of drawing any correct inferences whatsoever? An elephant does not care about the weight of a dog on its back, while that same dog would crush a spider. You conclude from that that elephants are too small to be crushed????
And if you had been reading, you would have noted that I have attempted to do this. No books, no learning, no pondering, no SMP search. A clean directory with just the engine executables that is zapped after each game to avoid hidden files that could contain anything from persistent hash to evaluation weight changes. All we are left with is a pool of engines, using the same time control repeatedly, playing the same positions repeatedly, playing on the same pool of identical processors repeatedly. No, the timing is not exact. But since that is how everyone is testing, it would seem that is something that is going to have to be dealt with as is. Yes, a program can on occasion fail low just because it gets to search a few more nodes due to timing jitter. Yes, that might burn many seconds of CPU time to resolve. Yes, that will change the time per move for the rest of the game. Yes, that may change the outcome.
Yes, all very nice. But it did not reduce the variance to the desired (and designed!) level. And it is results that count. An expensive-looking spotless car is no good if it refuses to start, even if the engine is brand new, still under factory warranty, the gas tank is full, and the battery is charged. You would still have to walk...
However, if I am designing a telescope that is going to be mounted on the surface of planet earth, I am not going to waste much time planning on using a perfect vacuum where there is no air diffraction going on, because that is not going to be possible since only a very few have enough resources to shoot something like Hubble up into orbit. And it is immaterial because we have already decided this will be surface-mounted.
A very educational example, as people are now building huge Earth-based telescopes which correct for air turbulence by using adaptive optics, and which exceed the limits of their mirror size by being grouped as clusters of interferometers, as one does for radio telescopes. If you cannot remove a noise source, you will have to learn to live with it, and outsmart it...
In this case, "live with it" being the operative phrase... I certainly can't do anything with Rybka, or Junior, etc. So there is but one choice left...
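
As an aside on the arithmetic in hgm's 6-sigma versus 0.6-sigma remark above: assuming the score fraction of an N-game match carries roughly binomial noise, the statistical error shrinks as 1/sqrt(N), so in LaTeX notation:

Code: Select all

\sigma_N \approx \sqrt{\frac{p(1-p)}{N}}
\qquad\Longrightarrow\qquad
\frac{\sigma_{250}}{\sigma_{25000}} = \sqrt{\frac{25000}{250}} = 10,
\qquad
b = 6\,\sigma_{25000} \;=\; 0.6\,\sigma_{250}

The same fixed systematic error b is therefore ten times smaller when measured in units of the (ten times larger) statistical error of a 250-game run.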
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

Zach Wegner wrote:
xsadar wrote:
Zach Wegner wrote:
xsadar wrote:However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, after all, the major tradeoff when adding new evaluation terms. Also, in my experience, nps varies drastically from one stage of the game to another, so I don't think it would even work to use an average nps in combination with a node count.
First, I think using a node count proportional to NPS would be fine. The NPS doesn't really change all that much from version to version, and from version to version with respect to game stage. If we had a way to measure time in an exact way, and completely get rid of clock jitter, then I would suggest using a random time. But we don't, so I think nodes is the next best option.

But anyways, I think the tradeoff is not between evaluation size and speed, as it was thought to be for ages, but rather the robustness/generality of the evaluation and how it affects every aspect of the game, not just the part it is evaluating. The NPS difference that a (relatively small) evaluation term causes is very minor WRT strength IMO.
I don't know about other people's engines, but I tend to see a huge variance in nps. After reading your comment I had my engine play a couple quick games to see how much it varied. These are the extremes I saw in a 40 moves in 1 minute game:

1883000 nodes at 0.942M nps
1773748 nodes at 2.612M nps

These are the extremes in a 2 seconds per move game:

1671000 nodes at 0.836M nps
3060000 nodes at 1.530M nps

Standard deviations would be more useful, but there were other values close to all four of those listed here. It's entirely possible that background processes or even a buggy engine causes this, but it shouldn't be too difficult to test on other machines and with other engines. But personally I would expect large variations in nps to be fairly normal, because different node types (PV, null-window, quiescence, etc.) will take different amounts of time.
Yes, NPS varies greatly for the different stages of the game, though the effect is different in each engine. My point is that if you add a term, say "rook on seventh", the main downside might not be the slowdown in NPS, but rather the positions where the rook on seventh rule isn't correct. Absolutely essential reading: http://chessprogramming.wikispaces.com/ ... Philosophy
Read Tord Romstad's quote. Read it again. Live it.
As often as the evaluation function is called, changing it has a high potential of drastically changing the nps, especially if you introduce a bug. Of course there are examples of rejecting changes for other reasons, but that's not the point. Of course you can't focus entirely on nps (would be nice if we could though), but you can't ignore it either. That's my point.

I'm not sure why you referred me to that article. Tord's quote is congruent with what I've said, particularly the part about orthogonality. And the rest of the article only emphasizes my point:
The evaluation is typically a collection of "rules of thumb" collected from hundreds of years of human experience. It would be practically impossible to program all of the rules that humans have come up with over the years. Even if you could code for all of them, it would be inadvisable because of performance reasons. There must be a trade-off between knowledge versus speed. The more time you spend evaluating a position, the less time you have to search, and therefore, the less deep your program can see. Some programs are designed with a very light evaluation function containing only the most basic features, letting the search make up the rest. Others prefer a heavy evaluation with as much knowledge as possible. Most programs have an evaluation somewhere in the middle, with the trend in recent years being towards simpler, lighter evaluation functions, the idea being that these evaluations are the most bug-free and maintainable, which is far more important in practice than having many obscure pieces of knowledge. A big influence in this direction was the advent of Fruit , which has a very minimal evaluation function, yet it is a very solid, strong engine.
Seeing as how performance (nps) is such an important factor, why would you ever want to completely disregard it in your testing? Especially when this thread is about measuring small changes in Elo.
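
As a side note to the NPS figures quoted above: the standard deviation xsadar says would be more useful is easy to compute from a per-move log. A minimal C sketch follows; the sample values simply reuse the reported extremes plus invented in-between figures, so the printed numbers mean nothing by themselves.

Code: Select all

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* hypothetical per-move NPS samples from one game's log */
    double nps[] = { 942000, 2612000, 836000, 1530000, 1200000, 1750000 };
    int n = sizeof nps / sizeof nps[0];

    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < n; i++) {
        sum   += nps[i];
        sumsq += nps[i] * nps[i];
    }
    double mean = sum / n;
    double sd   = sqrt(sumsq / n - mean * mean);

    printf("mean NPS = %.0f, SD = %.0f (%.0f%% of the mean)\n",
           mean, sd, 100.0 * sd / mean);
    return 0;
}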
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

bob wrote:
hgm wrote:
Dirt wrote: Using the node count instead of time would mean you're not testing the time management code, right? I guess you'd then have to test that separately, which is really starting to make the testing complex.
This would not be true if you use the node-based time-control mode of WinBoard 4.3.14.

Then the time-control code would be fully active. It is just that the engine would be fed, (and uses internally) a virtual time, derived on its node count. But apart from the clock routine, the rest of the engine would never know that it is not dealing with real time.
How can that be? You tell me to search a specific number of nodes, then how can I go over that as I would if I fail low on a position? How can I go over that by variable amounts as I do now, depending on how much I appear to be "losing"???

If you tell me exactly how long, there is going to be node count jitter. If you tell me exactly how many nodes to search, there will be no timing jitter and I can't test my time allocation code at all.

Can't be both ways, at least in my program...
Of course it can. You get a given number of nodes for a given number of moves, and the engine will itself determine how it distributes those nodes over the individual moves. Just like it divides its time quota over the moves in a normal N moves / M minutes time control. What you had in mind is the direct equivalent to the 'st' command, that gives you a fixed, never-exceed time for a single move. But there is no reason to restrict node-based time controls to that mode.

So if you want to play 100,000 nodes (max) per move, you start WinBoard with arguments

-st 1 -nps 100000

If you want to play 4,000,000 nodes for 40 moves, you specify

-mps 40 -tc 0:40 -nps 100000

That's all. Of course the engine has to implement the WinBoard nps command. So indeed you could not do it on Crafty.
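
To illustrate what hgm means by a virtual clock, here is a minimal sketch in C. It is hypothetical code, not taken from WinBoard, Crafty, or Joker: once the engine has accepted the nps command, only its clock routine changes, and everything downstream, including the usual time-allocation logic with fail-low extensions, still believes it is reading milliseconds.

Code: Select all

#include <stdio.h>
#include <time.h>

static long long nodes_searched;      /* incremented inside the search */
static long      nps_rate  = 100000;  /* value received with the WinBoard "nps" command */
static int       node_time = 1;       /* nonzero once "nps" has been accepted */

static long wall_clock_ms(void)
{
    return (long)(clock() * 1000 / CLOCKS_PER_SEC);
}

/* the only routine that changes: the rest of the engine keeps asking
   "how many milliseconds have I used?" exactly as before */
static long elapsed_ms(void)
{
    if (node_time)
        return (long)(nodes_searched * 1000 / nps_rate);  /* virtual time */
    return wall_clock_ms();                               /* real time */
}

int main(void)
{
    nodes_searched = 250000;   /* pretend the search just burned this many nodes */
    printf("elapsed (virtual) time: %ld ms\n", elapsed_ms());
    return 0;
}

The point of the scheme is that any overshoot on a fail low now happens in nodes rather than milliseconds, so it is exactly reproducible from run to run.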
MartinBryant

Re: Correlated data discussion

Post by MartinBryant »

Fritzlein wrote:Wow, it's very interesting that everyone disagrees with me here. It really is a money-making (or money-losing) opportunity. :) I quite believe all of you that a tiny change in node count can change the entire course of the game. However, I'm not at all convinced by these anecdotal reports that my bet is a money-loser.
To be fair they're not entirely anecdotal.
I don't know all the conditions of Bob's earlier experiments where he pointed this issue out, but anybody can repeat the experiment I did recently, which I reported here earlier and which you may have missed.
I played Fruit v Fruit 100 games at a 'fixed' time of 0.5 secs/move from the starting position with no book. The small time fluctuations quickly caused the move choices to change and it produced 100 unique games.
Now with your expert mathematical brain this may not strike you as too extreme but I think it looks pretty convincing to the layman.
Could you offer your opinion on that experiment?
Fritzlein wrote:The conclusion that you are arriving at is that because a small change in node count can change the results (and often does) that there is no correlation between repeated plays with similar node counts. Has anyone measured the correlation? Are there any statistics on this?
Fair point.


Fritzlein wrote: If there are no statistics, let me ask how good you think you are at distinguishing a coin that lands heads 55% of the time from one that lands heads 45% of the time on the basis of looking at a few flips.
But I would feel more confident after 100 flips of my own and 'hearing' that another respectable source also considered the coin suspect.
Fritzlein wrote:Martin, let me make sure that you are offering me the same bet I am offering you. Crafty is playing against Fruit from the same starting position 201 times at different fixed node counts. From Bob's posted results we know that Fruit scores about 60% against Crafty. However, in my bet I stipulate that I get to look at one of the results first, let's say the middle one of the 201. If Crafty didn't win that one, then no bet. If Crafty did win that one, then I take Crafty to score over 100 points in the other 200 games, and you take Fruit to score over 100 points in the other 200 games. You think that one win is likely to be a fluke in the middle of a sea of losses, whereas I think that one win is likely to be correlated with all the other games. So you think you would win money from me in this way?
"Sea of losses" was un-scientific. "Random selection of wins/draws/losses with approx 60-40 spread to Fruit" would be better wording. One experimental point missed... would you want Fruit to be playing to a fixed 3,000,000 nodes (favours you) or would it's node count match Crafty's (favours me) ?

Fritzlein wrote:Well, I could be wrong, but I am still offering, until someone actually measures how much correlation is there. Maybe Bob would even take a sporting interest in our disagreement and play out our bet in this way: He takes random positions from his set of 3000+ until he finds one that Crafty beats Fruit at a median fixed node count (corresponding to the time control at which Fruit was generally winning 60%). He plays it at the 200 neighboring node counts, and marks down whether Crafty or Fruit won more. Then pick more random positions until there is another one Crafty wins, etc., repeating 100 times. Well, that's 20,000+ games to play, but if I'm right and Crafty won more than 50 of those 100 bets, it would give us some insight into a very fundamental point.
Well that's moving the goalposts slightly as the node counts he's getting in his matches will be WAY bigger than 3,000,000.
I guess I feel less confident about my conviction the smaller the percentage change in node count. And I guess you would feel the reverse? Or perhaps not? Care to comment?

I'm not a mathematician, so excuse this if it's a dumb question, but as a mathematician, if I gave you two runs of 201 game results, you could presumably apply some formula to measure the correlation of which you speak. If so, does no correlation imply chaotic behaviour? Or would your formula measure at least some correlation between ANY two completely random, unrelated sets of wins/draws/losses?
Fritzlein wrote:It's just a gut feeling with no mathematical calculation. However, there is no mathematical calculation on the other side either, just anecdotal evidence that changing node counts totally changes playouts, and references to the butterfly effect. Let's get some numbers and see whose intuition is accurate!
Absolutely fair point again.
However, in lieu of such an experiment, can you explain why your 'gut' suggests there will be correlation? (I understand why mine doesn't but I'm not sure I could articulate it very well!)
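
On Martin's question about a formula: one simple option is the sample (Pearson) correlation between two aligned result sequences, coded 1 / 0.5 / 0 for win / draw / loss. The C sketch below shows the computation on two invented ten-game runs; for two genuinely unrelated random runs the coefficient merely hovers around zero rather than being exactly zero, which answers his last question.

Code: Select all

#include <stdio.h>
#include <math.h>

/* Pearson correlation of two equally long result sequences */
static double correlation(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];        sy  += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }
    double cov = sxy / n - (sx / n) * (sy / n);
    double vx  = sxx / n - (sx / n) * (sx / n);
    double vy  = syy / n - (sy / n) * (sy / n);
    return cov / sqrt(vx * vy);   /* near 0: unrelated runs, near 1: near-identical runs */
}

int main(void)
{
    /* invented results: 1 = win, 0.5 = draw, 0 = loss */
    double run1[] = { 1, 0, 0.5, 1, 0, 0, 1, 0.5, 0, 1 };
    double run2[] = { 1, 0, 0.5, 0, 0, 1, 1, 0.5, 0, 1 };
    printf("r = %.3f\n", correlation(run1, run2, 10));
    return 0;
}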
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 4 sets of data

Post by hgm »

bob wrote:First, look at the data. Second, look at what I said changed. The ratings of all programs were squeezed into a narrower band with the round-robin.
Aha! So you were posting in the wrong thread, after all! And too thick to notice it even when warned once... So let it be clear to all now who of us is the one that actually does not read the thread he is posting in. :lol: :lol: :lol:
That certainly gave a better estimate of each program's rating than just the first set of games. Didn't help me much there, but if you look at the two crafty versions, their "distance apart" collapsed quite a bit, which _was_ significant.
Oh yeah, big surprise. You give them more games and the rating estimate gets more accurate. And you have selected programs that are close in strength, so their Elo spread is easily dominated by the statistical error. And that gets smaller when they have more games. Btw, it seems that if that is your conclusion (ratings are closer), it is one that is not justified on the basis of this data, and very likely wrong: the program that happens to be strongest (Glaurung) also happened to be fairly lucky, or better against Crafty than the rating difference would suggest (because of playing style). One single run of Crafty vs. the World is not enough to determine that. But whichever of these two explanations is valid, there is no guarantee at all that the best engine you include in the test will always be lucky, or always have a playing style that Crafty handles below average. And if the opposite is true, the ratings would not compress, but expand on inclusion of the World vs. World games. (Yes, and I can know all that without actually doing the test, because it is all so obvious...)

A good tester would recognize such engines as variable, and delete their results from the tests. This is why you have to keep track of variance and correlations within the results of each engine.
Har-de-har-har. So a tester is going to decide "hey, program behaves strangely, and plays better in even months than in odd months" and throw it out. I don't think so.
No, obviously not. This is why I said: good tester...
Until I brought this up, I'd bet you had _zero_ idea that the results are as volatile as they are.
When I play Joker, they are actually maximally volatile, because I always play Joker with randomization on. So I am not dependent on any form of time jitter to create independent games. But I admit that I would have problems with identical or largely identical games if I switched the randomization off. Micro-Max has no randomization, and when repeating the Silver test suite against Eden, many of the games were totally identical; where games did differ, it was mostly because Eden deviated. Micro-Max deviated on average only once every 40 moves. Those are the well-investigated facts. So what? Apparently not all engines display the same natural variability.

But to do accurate testing, variability is essential. So I solved the problem by randomizing.
Later you suggested that Crafty was an odd program and went into micro-max/joker and full-iteration discussions. And now we discover that from the same starting position, even Fruit won't play the same game twice in 100 tries, no book or anything.
Well, perhaps Fruit randomizes too. I have no idea of how Fruit works. Are you sure that you play it with randomization off?
So it is easy to now "notice" something after it has been pointed out. Had I said nothing, the results would have kept on coming in and nobody would have even thought about the randomness that is inherent in computer games, far more so than in human games.
Perhaps not, as it is not really an interesting or relevant subject. If you don't have sufficient randomness, that will be apparent quickly enough from game duplications. The required randomization is almost trivial to program. Others rely on books to randomize, or use a large number of starting positions, or a large number of opponents to create the variability. If you are so lucky that the variability is intrinsic, well, good for you! Not all of us are so lucky. Again, I don't see that as a particular achievement.
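
For reference, the "almost trivial" randomization hgm mentions can be as little as the sketch below, which is hypothetical code and not Joker's actual source: a few centipawns of noise added to the evaluation is enough to let near-equal moves come out in a different order from game to game.

Code: Select all

#include <stdio.h>
#include <stdlib.h>

#define RANDOM_MARGIN 4   /* +/- 4 centipawns: enough to break ties between moves */

/* add a small random offset to the static evaluation when randomization is on */
static int randomized_eval(int static_eval, int randomize)
{
    if (!randomize)
        return static_eval;
    return static_eval + (rand() % (2 * RANDOM_MARGIN + 1)) - RANDOM_MARGIN;
}

int main(void)
{
    srand(42);
    for (int i = 0; i < 5; i++)
        printf("eval = %d\n", randomized_eval(25, 1));   /* 25 +/- 4 centipawns */
    return 0;
}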
MartinBryant

Re: Correlated data discussion

Post by MartinBryant »

Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Uri Blass wrote:
Fritzlein wrote:
hgm wrote:Watch out here! What standard deviation are you talking about? A standard deviation is defined as a sum over events. What events do you sum over? If you are talking about the (hypothetical) sampling of all possible starting positions played by all possible opponents, you would be right that a sampling procedure limiting the samples to a single position (or just a few), itself randomly chosen, will drive up the standard deviation of the match results by correlating the games. So procedure A, randomly selecting a position, playing 80 games with that position, and then repeating the full procedure (including selection of a new random position) a number of times, might give you a larger SD of the 80-game match results than procedure B, where you independently select a random position for each individual game.

OTOH, if you would randomly select a position only once (procedure C) and then repeatedly use that random position in several 80-game matches (perhaps differing in opponents, number of nodes, time jitter or whatever), the standard deviation of the 80-game match results will be smaller than when using the totally random sampling, because you correlate games between the different samples you are calculating the SD over. In procedure A, by contrast, you were correlating only the games within a single match, keeping different matches uncorrelated.
I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.

I was not talking about the standard deviation between a test run and an identical or nearly-identical test run. We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision? (Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke that out; to me it is frankly less interesting because the mathematical mystery is gone. What I have explained (to my own satisfaction at least) is why each of the two 25600-game test runs can be expected to have a large error relative to Crafty's true strength, and thus I only need any old random change between the two big runs to complete the less interesting (to me) explanation as well.
Note that the question of why the two tests failed to get the same wrong answer twice with high precision is important.

Maybe the difference is not important, but maybe it is important and can cause a wrong result in a test that is supposed to give the right result.

If the difference is equivalent to being 0.1% slower, then it is not important with the right test, because I expect the rating difference from being 0.1% slower to be less than 1 Elo. But if the difference is that in part of the games one of the opponents ran 50% slower, then the difference is very important and can also spoil the results of future tests.

Bob did not save the PGN of the games, so we cannot know, but in the future it is important to save the PGN so it is possible to find out whether there is a big problem somewhere in the test; otherwise there is no point in discussing the results.

Uri

I have said this multiple times already, but I do have a sanity check that Crafty does after each and every move, which is to simply verify the NPS value is within sane boundaries. If one fails, the entire run is aborted and I find out about it as soon as I check in.

Now if you want to somehow conjure up a scenario where Crafty's NPS remains in the normal range, but something slows the opponent down by 50%, I'd buy it. Except there is nothing running to cause such a thing, and I can't see how it would be selectively applied to the opponents only. Because the instant it is applied to Crafty, the match stops... I've explained why I used to have to do this, and left it in simply because I never thought to take it out. The hardware speed never varies, the load never varies, at least no more than what you would see if you bought two 3.2GHz machines, set them up side by side, and compared speeds by running chess programs.
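
For concreteness, here is a sketch of the kind of per-move sanity check Bob describes. The code and the numbers in it are assumptions, not Crafty's actual source: after every move the measured NPS is compared against a window around the expected speed of the test hardware, and the whole run is aborted if it falls outside.

Code: Select all

#include <stdio.h>
#include <stdlib.h>

#define NPS_EXPECTED  1500000.0   /* assumed typical speed of the test machines */
#define NPS_TOLERANCE 0.5         /* allow +/- 50% before declaring the run bad */

static void check_nps(double nodes, double seconds)
{
    double nps = nodes / seconds;
    if (nps < NPS_EXPECTED * (1.0 - NPS_TOLERANCE) ||
        nps > NPS_EXPECTED * (1.0 + NPS_TOLERANCE)) {
        fprintf(stderr, "NPS %.0f outside sane range -- aborting run\n", nps);
        exit(1);
    }
}

int main(void)
{
    check_nps(3000000.0, 2.0);   /* 1.5M NPS: inside the window */
    check_nps(1200000.0, 2.0);   /* 0.6M NPS: aborts the run */
    return 0;
}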
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

hgm wrote:
bob wrote:
hgm wrote:
Dirt wrote: Using the node count instead of time would mean you're not testing the time management code, right? I guess you'd then have to test that separately, which is really starting to make the testing complex.
This would not be true if you use the node-based time-control mode of WinBoard 4.3.14.

Then the time-control code would be fully active. It is just that the engine would be fed, (and uses internally) a virtual time, derived on its node count. But apart from the clock routine, the rest of the engine would never know that it is not dealing with real time.
How can that be? You tell me to search a specific number of nodes, then how can I go over that as I would if I fail low on a position? How can I go over that by variable amounts as I do now, depending on how much I appear to be "losing"???

If you tell me exactly how long, there is going to be node count jitter. If you tell me exactly how many nodes to search, there will be no timing jitter and I can't test my time allocation code at all.

Can't be both ways, at least in my program...
Of course it can. You get a given number of nodes for a given number of moves, and the engine will itself determine how it distributes those nodes over the individual moves. Just like it divides its time quota over the moves in a normal N moves / M minutes time control. What you had in mind is the direct equivalent to the 'st' command, that gives you a fixed, never-exceed time for a single move. But there is no reason to restrict node-based time controls to that mode.

So if you want to play 100,000 nodes (max) per move, you start WinBoard with arguments

-st 1 -nps 100000

If you want to play 4,000,000 nodes for 40 moves, you specify

-mps 40 -tc 0:40 -nps 100000

That's all. Of course the engine has to implement the WinBoard nps command. So indeed you could not do it on Crafty.
OK, how does that make a fair test? Program A is fairly constant in NPS throughout the game. Program B varies by a factor of 3x from opening to endgame (Ferret was an example of this). So how many nodes do you tell Ferret to search in comparison to the other program?

So how is this going to give reasonable estimates of skill when it introduces a potential time bias toward one opponent or the other???