OK, let's step back just a minute. Do you believe that searching 100 nodes more or less will have any effect on the results at all? I know it will, and I posted the data to show that the effect is not minor at all. Now, if I search 100 fewer nodes per position, and I get a significantly better result, do you believe that searching a tiny bit less was really a significant improvement? Of course not. And that is the effect I am talking about. A tiny evaluation change changes the shape of the tree, and that is all that is needed to change the result. The really interesting part is that the change in shape is _not_ correlated with the result. One time fewer nodes makes a better result, the next time it makes a worse one...

xsadar wrote:
Could be. I'm having a hard time understanding this post, so I'll go point by point, to see if I can understand any better.

bob wrote:
You are completely missing the point.

xsadar wrote:
The changes in the search tree are part of what we're trying to test, whereas fluctuations in the clock are not.

bob wrote:
Yes, but there is a fatal flaw. You already saw where 10,001,000 nodes produces different results from 10,000,000 nodes, correct? Now make one tiny change to your evaluation or search and guess what happens to the tree? It changes by _more_ than 1,000 nodes, and you get a different random result.

Zach Wegner wrote:
Isn't this pretty much exactly what I've been saying?

Karl wrote:
If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.
I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).
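To make Karl's record-and-replay bookkeeping concrete, here is a minimal Python sketch. The play_game() runner, the file name, and the node ranges are invented stand-ins for illustration, not any real test harness:

Code:

import json, random

# Karl's idea: give the opponent a random node budget per game, save the
# budgets, and reuse them verbatim when A' replays the comparison games.

def play_game(engine, position, opponent_nodes):
    # hypothetical stand-in: run one game, return "1-0", "1/2-1/2" or "0-1"
    return random.choice(("1-0", "1/2-1/2", "0-1"))

def run_match(engine, positions, budgets=None, logfile="budgets.json"):
    if budgets is None:                        # first run: randomize and record
        budgets = [random.randint(9_500_000, 10_500_000) for _ in positions]
        with open(logfile, "w") as f:
            json.dump(budgets, f)
    return [play_game(engine, pos, b) for pos, b in zip(positions, budgets)]

positions = ["startpos"] * 40                  # e.g. forty opening positions

results_A = run_match("A", positions)          # random budgets, recorded
with open("budgets.json") as f:
    saved = json.load(f)
results_A_prime = run_match("A'", positions, budgets=saved)  # same budgets reused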
There is no way to search the same exact tree with a different evaluation or search, so you are right back to where we started. There will be a random offset between the two sample sets just as there is with random times at present...
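That tree-shape sensitivity is easy to reproduce in miniature. In the toy negamax below (tree, values, and the "new term" are all invented), the only difference between the two runs is a one-point bonus on some leaf evaluations, yet that is typically enough to move the cutoffs and change both the node count and the backed-up score:

Code:

import random

def evaluate(state, bonus):
    # deterministic pseudo-evaluation of a leaf; 'bonus' models a tiny
    # new evaluation term that fires on some positions
    random.seed(state)
    score = random.randint(-500, 500)
    return score + (bonus if state % 7 == 0 else 0)

def alphabeta(state, depth, alpha, beta, bonus, counter):
    counter[0] += 1                            # count every node visited
    if depth == 0:
        return evaluate(state, bonus)
    best = -10000
    for move in range(4):                      # four pseudo-moves per node
        child = state * 4 + move + 1
        score = -alphabeta(child, depth - 1, -beta, -alpha, bonus, counter)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:                      # the cutoff point moves when eval moves
            break
    return best

for bonus in (0, 1):                           # identical search, eval differs by one point
    nodes = [0]
    score = alphabeta(1, 8, -10000, 10000, bonus, nodes)
    print(f"bonus={bonus}: score={score}, nodes searched={nodes[0]}")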
xsadar wrote:
However, I will agree with you that a fixed (or random) number of nodes won't work. Any change to the evaluation or search will also change the nodes per second of the search, which is also something we would definitely want included in our test. That is, after all, the major tradeoff when adding new evaluation terms. Also, in my experience, nps varies drastically from one stage of the game to another, so I don't think it would even work to use an average nps in combination with a node count.

Yes/correct to all of the above.

bob wrote:
We want to test whatever changes we made, would you agree? And do you follow how a tiny change in time measurement changes the node count of the tree being searched? And how that changes the result? So a change you make could just as easily change the shape of the tree just enough so that you now get a win where before you got a loss, or vice versa.

xsadar wrote:
I'm not sure I agree with this. To me it means that it is better or worse for this particular sample (consisting of the position and number of nodes searched).

Having absolutely _nothing_ to do with a better or worse evaluation, just one that produces a different value, which then alters how cutoffs occur, which is the same thing that happens when the node count is changed slightly.
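xsadar's nps point, in numbers (figures illustrative only): under a time limit, an evaluation term that slows the search is charged for the nodes it loses; under a fixed node budget that cost is invisible:

Code:

time_per_move = 10.0            # seconds
nps_before = 1_000_000          # nodes per second, old version
nps_after = 900_000             # new eval term costs 10% in speed

nodes_before = nps_before * time_per_move    # 10,000,000 nodes
nodes_after = nps_after * time_per_move      #  9,000,000 nodes

print(f"timed search:      {nodes_before:,.0f} vs {nodes_after:,.0f} nodes")
print("fixed-node search: both versions get the same budget,")
print("                   so the 10% speed cost never shows up in the result")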
xsadar wrote:
Except when the node count is changed, that has nothing to do with changes in the program we're testing, but the size and shape of the tree does. So I don't understand why this is relevant, but perhaps that's because I'm not understanding your point. No offense, but I think I understood the above a lot better than what follows.

bob wrote:
The easiest way to understand this is to view the entire population of games we could possibly play using a time limit. There are only so many possible node count variations.

xsadar wrote:
Okay, so we have all possible games that can be played by your engine from one starting point and with a specific time limit, right? Or is this just all possible games of chess that can be played with that engine? Or all possible games that can be played by anybody?

In this context, all possible games that could be played by my engine in that specific time control. The time control is fixed, but the timing jitter will produce different node counts most of the time. If you play every possible node count, you would enumerate them all.
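A toy model of that population view. The hash-based play_game() below is a hypothetical stand-in for a deterministic engine game; it is meant only to show that each node budget picks out exactly one game, while the results jump around "randomly" as the budget varies:

Code:

import hashlib

def play_game(node_budget):
    # stand-in for a deterministic engine-vs-engine game: a fixed budget
    # always yields the same game, but neighboring budgets may differ
    h = int(hashlib.sha256(str(node_budget).encode()).hexdigest(), 16)
    return ("win", "draw", "loss")[h % 3]

# enumerate the population of games around a 10M-node search
population = {n: play_game(n) for n in range(9_999_000, 10_001_000, 100)}
for budget, result in list(population.items())[:5]:
    print(budget, result)
# a timed game is one draw from this population; clock jitter, not skill,
# decides which member you get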
bob wrote:
So you slice up this total population by individual node counts, and now you have a super-set of my change-the-node-count games.

xsadar wrote:
I don't understand what you're doing here. Are you saying that the games resulting from a time limit are a superset of those resulting from varying node counts?

The opposite. You just enumerated games for every _possible_ node count I could produce given a fixed time per search. If we play enough timed games, I _might_ produce them all, but I might not be able to hit _every_ possible node count for various reasons. So what I could actually reach would be a sub-set of what is actually possible in the set of every possible node count.
xsadar wrote:
If so, isn't the time limit itself just varying the node counts?

Yes. And remember, so far, nothing has said that more positions is going to change the results. I have four big runs queued up; the first is about half-way done, and when all are done we will see how the four runs stack up stability-wise through BayesElo.

bob wrote:
Except your super-set has sets of games from all possible different node counts. But how do the sub-sets compare to the entire set?

xsadar wrote:
So the subsets are the set of games we actually play in our tests, I assume. Probably depends on how we choose our subsets. So you're wondering how we know if our means of selecting subsets is any good?

Ought to be representative, statistically, but there is no guarantee.
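BayesElo does this properly; as a rough back-of-envelope (the standard logistic Elo model, not BayesElo's actual algorithm), the rating difference and its error bars can be estimated from a match score like this:

Code:

import math

def elo_estimate(wins, draws, losses):
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n                  # score fraction
    to_elo = lambda q: -400.0 * math.log10(1.0 / q - 1.0)
    # per-game variance of the score, then a 95% interval on p
    var = (wins * (1 - p) ** 2 + draws * (0.5 - p) ** 2 + losses * p ** 2) / n
    se = math.sqrt(var / n)
    return to_elo(p), to_elo(p - 1.96 * se), to_elo(p + 1.96 * se)

elo, lo, hi = elo_estimate(420, 180, 400)         # 1000 games, score 51%
print(f"Elo {elo:+.1f}, 95% interval [{lo:+.1f}, {hi:+.1f}]")
# roughly +7 Elo, but the interval spans about -13 to +27: even 1000
# games cannot resolve a change this small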
xsadar wrote:
Just to make sure we're talking about the same thing: of course we can only expect our subset to be representative of the superset under consideration. If the superset only consists of games from one, or a few, starting positions, we can't expect the subset to be representative of games from all possible starting positions. But that is where you were testing with only forty start positions, right?

bob wrote:
It would be ugly to just alter the node count slightly and get a better result, which is what happened in the tests I ran sometimes. It would be just as ugly to add a new eval term that was not very important, but since it changes the tree shape just a bit, I get over into a different one of those node-count samples and get a better (or worse) result, and draw a conclusion about the change, when in reality any such change will affect the final result.

xsadar wrote:
So we don't know if our different result is due to the quality of the change, or to something similar to the slight-change-in-node-count effect that you've demonstrated in the past? Maybe I understand you now. Personally, I think this effect is mainly due to the correlation caused by playing multiple games from the same positions, which is what you're in the process of testing. But I don't see why that in and of itself would prevent us from testing A and A' with the same node counts in addition to using more positions. The only effect I can see switching to node counts having is to stabilize the results slightly. So, if we were to disregard the issue of any changes also changing nps, I wouldn't see what it would hurt, and there would even be the possibility that it would help. But maybe I'm still not understanding your point.

Short form: from a given starting position, there are a _huge_ number of possible different games that can be played, some wins, some draws and some losses. The number of nodes searched directly chooses which game we are going to follow. Using time, we follow one of a very large number of different games, but the game is different each time we replay it, because the time changes slightly. If we used fixed nodes, we would just choose one of those games to play; but if you alter how that fixed number is allocated, by changing any aspect of the program, you again change the game path.
The real problem is this: if each game were always a win or a loss, it would be great. But with balanced positions, and millions of possible wins and millions of possible losses, if you use a small set (and maybe even 50,000 games is a small set) you can get unexpected results by picking a win in one test and a loss in the next, where the choice is essentially random, caused by the shape of the tree you search and not by the skill used to play through the game...
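That pick-a-win-here, pick-a-loss-there effect can be simulated directly. In the toy model below (probabilities invented), two literally equal engines play from balanced positions and every result is an effectively random draw; the spread shrinks like 1/sqrt(games), but it is still visible at 50,000 games:

Code:

import random

random.seed(1)

def match_score(games, p_win=0.3, p_draw=0.4):
    # each game is a random pick among win/draw/loss -- the "tree shape
    # chooses the game" effect, with no skill difference at all
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score / games

for games in (100, 1000, 50_000):
    results = [match_score(games) for _ in range(20)]
    print(f"{games:>6} games: score range {min(results):.3f} .. {max(results):.3f}")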
That is the problem that is difficult to deal with. Any change alters the entire system, and it seems to take a large number of games/positions/opponents to reduce this noise to an acceptable level.