New testing thread

Dirt · Post by **Dirt** » Sun Aug 10, 2008 2:49 am

Zach Wegner wrote:First, I think using a node count proportional to NPS would be fine. The NPS doesn't really change all that much from version to version, and from version to version with respect to game stage. If we had a way to measure time in an exact way, and completely get rid of clock jitter, then I would suggest using a random time. But we don't, so I think nodes is the next best option.

Using the node count instead of time would mean you're not testing the time management code, right? I guess you'd then have to test that separately, which is really starting to make the testing complex.

bob · Post by **bob** » Sun Aug 10, 2008 3:03 am

For some reason, Arasan believe that a valid FEN string must have an "id" tag on the end. By adding that, all seems to be well and the thing seems to run now...

xsadar · Post by **xsadar** » Sun Aug 10, 2008 3:41 am

Zach Wegner wrote:
xsadar wrote:However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, afterall, the major tradeoff when adding new evaluation terms. Also, in my experience, nps varies drastically from one stage of the game to another, so I don't think it would even work to use an average nps in combination with a node count.
First, I think using a node count proportional to NPS would be fine. The NPS doesn't really change all that much from version to version, and from version to version with respect to game stage. If we had a way to measure time in an exact way, and completely get rid of clock jitter, then I would suggest using a random time. But we don't, so I think nodes is the next best option.

But anyways, I think the tradeoff is not between evaluation size and speed, as it was thought to be for ages, but rather the robustness/generality of the evaluation and how it affects every aspect of the game, not just the part it is evaluating. The NPS difference that a (relatively small) evaluation term causes is very minor WRT strength IMO.

I don't know about other people's engines, but I tend to see a huge variance in nps. After reading your comment I had my engine play a couple quick games to see how much it varied. These are the extremes I saw in a 40 moves in 1 minute game:

1883000 nodes at 0.942M nps
1773748 nodes at 2.612M nps

These are the extremes in a 2 seconds per move game:

1671000 nodes at 0.836M nps
3060000 nodes at 1.530M nps

Standard deviations would be more useful, but there were other values close to all four of those listed here. It's entirely possible that background processes or even a buggy engine causes this, but it shouldn't be too difficult to test on other machines and with other engines. But personally I would expect large variations in nps to be fairly normal, because different node types (PV, null-window, quiescence, etc.) will take different amounts of time.

bob · Post by **bob** » Sun Aug 10, 2008 3:51 am

krazyken wrote:
bob wrote:there is simply something going _badly_ wrong with communication here.

(1) if we play with X nodes per search we get one result. If we play with X+N nodes per search, we get another result.

(2) if we play with time limit for search, each run gets a different result.

I ran several tests using (2) and got different results. I then showed that the _only_ thing possible to produce these changes are a result of the different nodes caused by using time.

Why is that so hard to explain? Every search of exactly 1 second is not exactly one second. It is 1 second +/- N where N is not precisely known because of hardware differences. Before you can fix something, you have to understand what is happening. I wanted to prove, beyond any doubt, that this variability was purely a result of timing fluctuation, which I quite clearly did. Otherwise I didn't care about the varying node count searches at all with respect to testing. It was just a method of validating a hypothesis.
Honestly, I'm not sure that this has been proven beyond any doubt. My guess here is if time fluctuations is a factor, there is no reason to believe that one run has more fluctuation than the other. The original runs of 25k games are gone, no data to support either side of any argument about why the results were different. That horse is dead. When new data comes in, we'll be able to analyze rather than speculate.

You are taking one test by itself. I have run _millions_ of games the last two years. My first goal over a year ago was to figure out why we were seeing such a fluctuation. We made a list of potential issues. From opening book preparation, to program bugs, and everything in between. One by one we eliminated potential causes. Opening books? Gone. Learning? Gone. games played with same node counts? no problems. games played with same depth? no problem. Those were the only two things that would produce the same games over and over, and I do not believe there is much doubt that time is the issue. We looked at games that produced different results on different runs. And in every case, things simply diverged at some point. Different move with a slightly different score. Sometimes the same result still, but sometimes something completely different. And if we ran that position multiple times, sometimes we could get it to reproduce, the behavior, sometimes not. Because of hash table information carried from move to move. But so long as a program searched a consistent tree, the problem did not show up.

I have now played lots of games (as did others, including the quote about 100 fruit vs fruit games with no duplicates) under extremely controlled circumstances, and the only conclusion left is time. Two different programs have the same bug to produce non-deterministic search, but only when node counts vary? N different programs have the same issues? What was the common, underlying theme? They were all using time to limit the search.

Here is an experiment people should be able to reproduce on their home machines that should be analogous to the situation produced in those 2 runs.
Pick one of the 40 silver positions, and run 5 series of 128 games of their engine against 5 opponents from that 1 position. For a total of 640 games. Make sure to disable learning and opening books and such. Then set up the same run again and see how much different the results are. This should be the simplest case of determining the value of the number of starting positions.

should be easy to do although that is a lot of games for one machine. But others have already reported similar results with fewer games so I doubt that anything new will come up.

bob · Post by **bob** » Sun Aug 10, 2008 3:58 am

xsadar wrote:
bob wrote:
Zach Wegner wrote:
Karl wrote:If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).
Isn't this pretty much exactly what I've been saying?
Yes. but there is a fatal flaw. You already saw where 10,001,000 nodes produces different results from 10,000,000 nodes, correct? Now make one tiny change to your evaluation or search and guess what happens to the tree? It will change by _more_ than 1,000 nodes, and you get a different random result.

There is no way to search the same exact tree with a different evaluation or search, so you are right back to where we started. There will be a random offset between the two sample sets just as there is with random times at present...
The changes in the search tree are part of what we're trying to test, whereas fluctuations in the clock are not.

However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, afterall, the major tradeoff when adding new evaluation terms. Also, in my experience, nps varies drastically from one stage of the game to another, so I don't think it would even work to use an average nps in combination with a node count.

You are completely missing the point. We want to test whatever changes we made, would you agree? And do you follow in how a tiny change in time measurement changes the node count of the tree being search? And how that changes the result? So a change you make could just as easily change the shape of the tree just enough so that you now get a win where before you got a loss, or vice versa. Having absolutely _nothing_ to do with a better or worse evaluation, just one the produces a different value, which then alters how cutoffs occur, which is the same thing that happens when the node count is changed slightly.

The easiest way to understand this is to view the entire population of games we could possibly play using a time limit. There are only so many possible node count variations. So you slice up this total population by individual node counts, and now you have a super-set of my change-the-node-count games. Except your super-set has sets of games from all possible different node counts. But how do the sub-sets compare to the entire set? Ought to be representative, statistically, but there is no guarantee. It would be ugly to just alter the node count slightly and get a better result, which is what happened in the tests I ran sometimes. It would be just as ugly to add a new eval term that was not very important, but since it changes the tree shape just a bit, I get over into a different one of those node-count samples and get a better (or worse) result and draw a conclusion about the change, when in reality any such change will affect the final result.

That is the problem that is difficult to deal with. Any change alters the entire system, and it seems to take a large number of games/positions/opponents to reduce this noise to an acceptable level.

Dirt · Post by **Dirt** » Sun Aug 10, 2008 4:24 am

bob wrote:You are completely missing the point. We want to test whatever changes we made, would you agree? And do you follow in how a tiny change in time measurement changes the node count of the tree being search? And how that changes the result? So a change you make could just as easily change the shape of the tree just enough so that you now get a win where before you got a loss, or vice versa. Having absolutely _nothing_ to do with a better or worse evaluation, just one the produces a different value, which then alters how cutoffs occur, which is the same thing that happens when the node count is changed slightly.

The easiest way to understand this is to view the entire population of games we could possibly play using a time limit. There are only so many possible node count variations. So you slice up this total population by individual node counts, and now you have a super-set of my change-the-node-count games. Except your super-set has sets of games from all possible different node counts. But how do the sub-sets compare to the entire set? Ought to be representative, statistically, but there is no guarantee. It would be ugly to just alter the node count slightly and get a better result, which is what happened in the tests I ran sometimes. It would be just as ugly to add a new eval term that was not very important, but since it changes the tree shape just a bit, I get over into a different one of those node-count samples and get a better (or worse) result and draw a conclusion about the change, when in reality any such change will affect the final result.

That is the problem that is difficult to deal with. Any change alters the entire system, and it seems to take a large number of games/positions/opponents to reduce this noise to an acceptable level.

I found that very hard to follow. I think you are saying that using node counts to control the engines is better than a time control because there are only so many different ways a game could play out at a given time control. But with the change to not reusing the start positions I don't think that should be an issue anymore. Oh, right, you were saying you wanted to see how few start positions it would take to still get good results, but given the problems you just cited I don't see why you would want to do that.

xsadar · Post by **xsadar** » Sun Aug 10, 2008 6:31 am

bob wrote:
xsadar wrote:
bob wrote:
Zach Wegner wrote:
Karl wrote:If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).
Isn't this pretty much exactly what I've been saying?
Yes. but there is a fatal flaw. You already saw where 10,001,000 nodes produces different results from 10,000,000 nodes, correct? Now make one tiny change to your evaluation or search and guess what happens to the tree? It will change by _more_ than 1,000 nodes, and you get a different random result.

There is no way to search the same exact tree with a different evaluation or search, so you are right back to where we started. There will be a random offset between the two sample sets just as there is with random times at present...
The changes in the search tree are part of what we're trying to test, whereas fluctuations in the clock are not.

However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, afterall, the major tradeoff when adding new evaluation terms. Also, in my experience, nps varies drastically from one stage of the game to another, so I don't think it would even work to use an average nps in combination with a node count.
You are completely missing the point.

Could be. I'm having a hard time understanding this post, so I'll go point by point, to see if I can understand any better.

We want to test whatever changes we made, would you agree? And do you follow in how a tiny change in time measurement changes the node count of the tree being search? And how that changes the result? So a change you make could just as easily change the shape of the tree just enough so that you now get a win where before you got a loss, or vice versa.

Yes/correct to all of the above.

Having absolutely _nothing_ to do with a better or worse evaluation, just one the produces a different value,

I'm not sure I agree with this. To me it means that it is better or worse for this particular sample (consisting of the position and number of nodes searched).

which then alters how cutoffs occur, which is the same thing that happens when the node count is changed slightly.

Except when the node count is changed that has nothing to do with changes in the program we're testing, but the size and shape of the tree does. So I don't understand why this is relevant, but perhaps that's because I'm not understanding your point.

The easiest way to understand this is to

No offense, but I think I understood the above a lot better than what follows.

view the entire population of games we could possibly play using a time limit. There are only so many possible node count variations.

Okay so we have all possible games that can be played by your engine from one starting point and with a specific time limit, right? Or is this just all possible games of chess that can be played with that engine? Or all possible games that can be played by anybody?

So you slice up this total population by individual node counts, and now you have a super-set of my change-the-node-count games.

I don't understand what you're doing here. Are you saying that the games resulting from a time limit is a superset of those resulting from varying node counts? If so, isn't the time limit itself just varying the node counts?

Except your super-set has sets of games from all possible different node counts. But how do the sub-sets compare to the entire set?

So the subsets are the set of games we actually play in our tests, I assume.

Ought to be representative, statistically, but there is no guarantee.

Probably depends on how we choose our subsets. So your wondering how we know if our means of selecting subsets is any good?
Just to make sure we're talking about the same thing: of course we can only expect our subset to be representative of the superset under consideration. If the superset only consists of games from one, or a few starting position we can't expect the subset to be representative of games from all possible starting positions.

It would be ugly to just alter the node count slightly and get a better result, which is what happened in the tests I ran sometimes.

But that is where you were testing with only forty start positions, right?

It would be just as ugly to add a new eval term that was not very important, but since it changes the tree shape just a bit, I get over into a different one of those node-count samples and get a better (or worse) result and draw a conclusion about the change, when in reality any such change will affect the final result.

So we don't know if our different result is due to the quality of the change or something similar to the slight-change-in-node-count effect that you've demonstrated in the past? Maybe I understand you now. Personally, I think this effect is mainly due to the correlation caused by playing multiple games from the same positions, which is what you're in the process of testing. But I don't see why that in and of itself would prevent us from testing A and A' with the same node counts in addition to using more positions. The only effect I can see changing to node counts having is to stabilize the results slightly. So, if we were to disregard the issue of any changes also changing nps, I wouldn't see what it would hurt, and there would even be the possibility that it would help. But maybe I'm still not understanding your point.

That is the problem that is difficult to deal with. Any change alters the entire system, and it seems to take a large number of games/positions/opponents to reduce this noise to an acceptable level.

Zach Wegner · Post by **Zach Wegner** » Sun Aug 10, 2008 6:45 am

xsadar wrote:
Zach Wegner wrote:
xsadar wrote:However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, afterall, the major tradeoff when adding new evaluation terms. Also, in my experience, nps varies drastically from one stage of the game to another, so I don't think it would even work to use an average nps in combination with a node count.
First, I think using a node count proportional to NPS would be fine. The NPS doesn't really change all that much from version to version, and from version to version with respect to game stage. If we had a way to measure time in an exact way, and completely get rid of clock jitter, then I would suggest using a random time. But we don't, so I think nodes is the next best option.

But anyways, I think the tradeoff is not between evaluation size and speed, as it was thought to be for ages, but rather the robustness/generality of the evaluation and how it affects every aspect of the game, not just the part it is evaluating. The NPS difference that a (relatively small) evaluation term causes is very minor WRT strength IMO.
I don't know about other people's engines, but I tend to see a huge variance in nps. After reading your comment I had my engine play a couple quick games to see how much it varied. These are the extremes I saw in a 40 moves in 1 minute game:

1883000 nodes at 0.942M nps
1773748 nodes at 2.612M nps

These are the extremes in a 2 seconds per move game:

1671000 nodes at 0.836M nps
3060000 nodes at 1.530M nps

Standard deviations would be more useful, but there were other values close to all four of those listed here. It's entirely possible that background processes or even a buggy engine causes this, but it shouldn't be too difficult to test on other machines and with other engines. But personally I would expect large variations in nps to be fairly normal, because different node types (PV, null-window, quiescence, etc.) will take different amounts of time.

Yes, NPS varies greatly for the different stages of the game, though the effect is different in each engine. My point is that if you add a term, say "rook on seventh", the main downside might not be the slowdown in NPS, but rather the positions where the rook on seventh rule isn't correct. Absolutely essential reading: http://chessprogramming.wikispaces.com/ ... Philosophy
Read Tord Romstad's quote. Read it again. Live it.

oystein · Post by **oystein** » Sun Aug 10, 2008 7:16 am

Fritzlein wrote:... suppose the results of a trial run on 40 positions are

X = 1110010001000011011111000010011111111101.

Then suppose a second trial is run on the same 40 positions with nothing changing except clock jitter, and the results are

Y = 0011011001000110011101000010011001111100.

Then it look like the different positions have different expected mean, hence we can't view the two runs as 2*40 values drawn from two random variables. You should rather use one random variable for each position. (But then I don't think you can show what you wanted to show.)

It is possible to let X be a random variable where a measurement is the result from a randomly chosen test set position. But Bob does not choose positions randomly, but rather go systematically through the test set. So the rest of your calculation not about the experiments Bob is running.

bob · Post by **bob** » Sun Aug 10, 2008 8:23 am

Dirt wrote:
bob wrote:You are completely missing the point. We want to test whatever changes we made, would you agree? And do you follow in how a tiny change in time measurement changes the node count of the tree being search? And how that changes the result? So a change you make could just as easily change the shape of the tree just enough so that you now get a win where before you got a loss, or vice versa. Having absolutely _nothing_ to do with a better or worse evaluation, just one the produces a different value, which then alters how cutoffs occur, which is the same thing that happens when the node count is changed slightly.

The easiest way to understand this is to view the entire population of games we could possibly play using a time limit. There are only so many possible node count variations. So you slice up this total population by individual node counts, and now you have a super-set of my change-the-node-count games. Except your super-set has sets of games from all possible different node counts. But how do the sub-sets compare to the entire set? Ought to be representative, statistically, but there is no guarantee. It would be ugly to just alter the node count slightly and get a better result, which is what happened in the tests I ran sometimes. It would be just as ugly to add a new eval term that was not very important, but since it changes the tree shape just a bit, I get over into a different one of those node-count samples and get a better (or worse) result and draw a conclusion about the change, when in reality any such change will affect the final result.

That is the problem that is difficult to deal with. Any change alters the entire system, and it seems to take a large number of games/positions/opponents to reduce this noise to an acceptable level.
I found that very hard to follow. I think you are saying that using node counts to control the engines is better than a time control because there are only so many different ways a game could play out at a given time control. But with the change to not reusing the start positions I don't think that should be an issue anymore. Oh, right, you were saying you wanted to see how few start positions it would take to still get good results, but given the problems you just cited I don't see why you would want to do that.

Let me try again.

First, let's assume that I am playing 3 seconds per move for the tests. And to approximate that on the hardware I am using, let's again assume that three seconds _exactly_ turns into 10M nodes _exactly_.

There is a sampling error in grabbing the time, and let's again suppose that this error turns into an easy number like 300,000 nodes. That is, with a target time of 3 seconds, which should search exactly 10M nodes, it actually searches 10M +/- 300K nodes, So that gives 600,001 different possible node counts from 9,700,000 to 10,300,000. If you were to run my test games 600,001 times, you would up to 600,001 different results. Some will probably be the same, as 1 node doesn't always make a difference in a single game, but in any case, you have now enumerated _all_ possibilities. With me so far?

Now, when I run using time for the limit, rather than the node counts given above, each game will have one move with a slightly different node count, and each result will be pulled from one of those 600,001 possibilities, Yes, I realize that there would be more than 600,001 possibilities since there are that many for any one node count in all games. But let's keep it simple. By randomly choosing a node count, we randomly choose a different result. Just 1K nodes can change the 800 game output significantly, or the 25,000 game output if you look at the numbers I posted.

So, we know that changing the time changes the nodes searched. But so does changing any component of the search itself, also. That is, change an extension, or hashing, or even a small detail in the evaluation, and the node count won't change since we have set a limit, but the shape of the tree may well change (it almost always does) which means those N nodes are moved to a different overall tree, which can again produce a different overall result. Not because the change itself was good or bad, but simply because the change caused us to search a slightly different shape tree, and we have already seen how a very small change there (1,000 nodes) can change the game.

So a minor, and actually insignificant evaluation change can significantly alter the final result, just as if we had changed the node count by 1,000. And we know 1,000 fewer nodes is not going to make it play better, and it would be _extremely_ rare for that to make it play worse either.

That is why a fixed number of nodes for a small number of games (or even a large number of games) is simply NFG. You can try this yourself by taking a program like crafty and just slightly changing a positional score, or something else, something you know is not going to make any significant difference in how it plays, and play a few hundred games, old vs a group and new vs the same group, and see what happens. I played two identical versions vs the world and got different results based on time. The same would happen based on nodes, for exactly the same reason...

Hope that is clearer, if not, feel free to ask again.

New testing thread

Re: Correlated data discussion

found the problem

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion