nondeterministic testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

MartinBryant wrote:Does anyone have any theories/thoughts why a change would be more/less effective at different time controls?

Apart from the odd exception I would think that most changes would be equally valid at any time control?
The main problem case is where you depend on a combination of things, like search + evaluation. Your search handles tactics; your evaluation handles what is left. In some cases you depend on both, and when that happens, more search (longer games) can change things. We occasionally see this... Some opponents have a better positional understanding than Crafty, and when I run at very short time controls, Crafty occasionally rolls them over because it is fairly fast, and tactics become more important when the percentage difference in search depth is exaggerated (5 ply vs 4 ply is way more important than 15 vs 14, for example).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

hgm wrote:
bob wrote:This is wrong. I can produce thousands of games played that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 and lose 2 of them the second time around. You should test this and you will see what I mean.
The point is that you could do billions of tests on Crafty, and it would still tell you zilch about how Eden or uMax behave....

I did play Eden against uMax, and it is indeed as Nicolai claims: all games are always the same. (Well, it would be strange if he did not know his own engine, wouldn't it? :roll: ) At least in so far as I tried; I did not have the stamina to play the same game over and over again for thousands of times, just in the hope that I finally might get one that is different.

Actually this is expected for engines that always finish an iteration once they start it. The only possible source of randomness is if an iteration finishes extremely close to the time limit where you would start a new one. Then, due to timing variability, you might sometimes decide to do a new iteration, and other times not. The probability for that is extremely small, as timing variability usually is only a few milliseconds, and search times many seconds. And even doing an iteration doesn't mean you will pick another move. Most of the time the move stays the same....
OK, perhaps we are talking about immature programs. Hashing introduces a serious level of non-determinism. So do history move ordering, killer move ordering, history-based LMR (because of the variability in the history counters), etc.

I verified this behavior on Glaurung, Glaurung 2, Fruit 2 and Arasan 9/10, which are some of the opponents I use in testing. Getting the same game twice is _very_ difficult. It is not that uncommon to get the same result, but getting the same moves is quite rare. I can post examples from any of those if you want, but it is easy enough for anyone to reproduce my results. Start from the same position (aka the null/silver positions or any of your own favorites) and play the game several times.

I could see where a program with primitive search, primitive eval, and no history ordering or history-based LMR, and no (or poor) hashing, etc all might produce the same game over and over. But for "real programs" (those at a mature level with reasonable playing strength) I do not see any sort of reproducibility. Commercial or non-commercial included.

And yes, if your program is basically a "fixed depth" version because it always finishes an iteration, then that will make repeatability very high. But it is not the best way to manage time overall, and not many do that...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

nczempin wrote:
hgm wrote:
bob wrote:This is wrong. I can produce thousands of games played that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 and lose 2 of them the second time around. You should test this and you will see what I mean.
The point is that you could do billions of tests on Crafty, and it would still tell you zilch about how Eden or uMax behave....

I did play Eden against uMax, and it is indeed as Nicolai claims: all games are always the same. (Well, it would be strange if he did not know his own engine, wouldn't it? :roll: ) At least in so far as I tried; I did not have the stamina to play the same game over and over again for thousands of times, just in the hope that I finally might get one that is different.

Actually this is expected for engines that always finish an iteration once they start it. The only possible source of randomness is if an iteration finishes extremely close to the time limit where you would start a new one. Then, due to timing variability, you might sometimes decide to do a new iteration, and other times not. The probability for that is extremely small, as timing variability usually is only a few milliseconds, and search times many seconds. And even doing an iteration doesn't mean you will pick another move. Most of the time the move stays the same....
Just a quick note: Eden does NOT finish the iteration. So far there's only a simplistic time control. I know this is a problem, of course, yet another angle for future improvement (but by itself probably not significant for a new version :-) )
You probably are doing it the best way. Always finishing an iteration is not the best way to use time. This has been analyzed for years, and everyone reaches the same conclusion for the same reasons. And it introduces variability that affects testing, as I mentioned...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote:A little perspective on weak engines; I think the situation is different once the strength exceeds about 2000 (in any semi-realistic scale).

My engine is pretty weak, and at this level the variability depends on a few factors, I'll list the ones that come immediately to mind.

1. Engines that have no book, or (like mine) always choose the same line given the same previous moves, when playing each other will play exactly the same game each time. So the variability will be zero if you play two games with swapping colours.
This is wrong. I can produce thousands of games played that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 and lose 2 of them the second time around. You should test this and you will see what I mean.



If you play a weak engine against a strong engine, what you say might actually happen, but then the weaker engine is going to lose no matter what. And there is no way to measure improvement if you can't win some games.

But your basic premise about an opening book being needed to avoid repeated games is simply wrong. If you want data I can send you data from a million games or so run on my cluster here. The variability (non-deterministic behavior) is absolutely remarkable.
I don't think I said that a book is needed to avoid repetitions. After all, Eden has a book (albeit a deterministic one, so to speak). If anything, I said that a non-deterministic book will lead to more variability. Your statement is not logically equivalent.


2. Most engines at this weak level are comparatively stronger in calculation than in evaluation. They regularly play 1. Nh3 or just make more or less random pawn moves as long as there's no material to be gained. If your engine (or let's say, my engine :-) ) plays those engines (that also have no book), I can easily "prepare" my book against them. My engine will seem stronger than it really is if it can get into positions where their superior tactics are no longer sufficient.
A good reason for not using a book. It introduces yet another random aspect into your games that is bad for testing/evaluation purposes. Yes, you need a good book for tournaments. No, you don't need any kind of book to evaluate changes. Just start the games at the same point and see if your changes produce better results. A book adds too much random noise, and when you factor in learning, it becomes useless for evaluating changes to the program itself.
Again, remember what kind of book Eden currently has: For any position there's always just one move, or none. I will change this in the future, and I am certain that my variability will skyrocket. This is one reason why I haven't done it yet.

I also haven't mentioned this (or perhaps I have, in that case I'd like to repeat/stress it): When I update Eden's book, I always do so _after_ I am satisfied that the new version is stronger. So in effect I'm already doing what you're suggesting; starting the games at the same point (at least with those opponents that are deterministic; still the majority, but it seems the percentage is decreasing as Eden moves up the ladder).
3. At this weak level, improvements are likely fairly significant. We are usually concerned with huge bugs, plus techniques that have been proven.


So for my engine, where I explicitly have the goal that the new version should be _significantly_ stronger than the previous one, but not exceedingly so, this is the technique I use (and so far it has been pretty consistent):

I find a group of engines that are reasonably close in strength (this selection gets better with time) and start playing a double-round tournament with them with my old version plus the new version that I consider to be worthy of testing for sufficient strength.

Usually I then decide that my version is sufficiently strong if it has at least one other engine placed between it and the old version. Sometimes it places relatively much higher.
And if you don't play thousands of games, sometimes the results are going to be exactly the opposite of what is really going on...
Well, you're invited to test each version of Eden against the older versions (and other engines, of course, because the opening book's "main line" when it plays itself is likely to be improved with each new version) and see if your results are different from mine. I.e. if you find an older version that is stronger than some newer version, at the significance level you've chosen.

This may actually be the case for the change from 0.0.11 to 0.0.12, where I explicitly mentioned the fact that the newer version may be only "slightly" better than the older one. Some people have told me that 11 is better, or at least that 12 is not stronger. I was willing to risk that, because there was a lot of time and bad things between 11 and 12, and I just had to get something out again. I was actually surprised that the results indeed showed it was better.


So far I have had no reason to take back my claim that for me at least, my method is working.


BTW you should also note that there are many different versions of Eden even for just one release:
1. The basic java version
2. The "server" version, which uses the JIT VM, which even fewer people have installed than just a JRE for 1)
3. The "JA" version, compiled by Jim Ablett. Although it does have really weird behaviour regarding the nps etc. reporting (that I can't explain for now; it's "lying") it is significantly stronger than the basic version. There are theoretically sound reasons for that, plus my few experiments have confirmed the theory.

And for each of these versions there is a WB version and a UCI version. They actually play differently (the UCI version may lose a little bit of time when having to go through an adapter, and the WB version doesn't claim draws, which has actually led to losing a drawn game because I threw an IllegalStateException when the position was repeated for the fourth time :-) )

But for my tests, I always keep the conditions as much the same as before, precisely to reduce variability.

And even when I have received a result which superficially would indicate that the strength has increased, I check the games to see if there are any anomalies. When there's e. g. just one engine in between the old version's placement in the tourney and the new, I am more doubtful than when version A came last out of 15 engines and the new one takes first or even third place.
Based on your comment in another thread, I am amazed you can get the same game over and over.

You said that you don't finish an iteration; you cut the search off when time runs out. That alone should make it impossible to choose the same move every time, unless your eval is _very_ simple. There should be plenty of cases where you would change your mind if you finished an iteration, but if time runs out early, you don't get to finish searching the new best move and you play the first (worse) move.
Uri Blass
Posts: 10297
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: nondeterministic testing

Post by Uri Blass »

I have a theory about it.

There may be knowledge that is not important at long time control because search can replace it.

Here is an example.
Assume that no tablebases are used.

A program may not know how to win KR vs K but know that KR vs K is winning for the rook.

Adding special knowledge for this endgame may be productive at fast time control, because without the knowledge the program may only draw this position, while at slower time control it may always win anyway, for the simple reason that the search is deep enough to see the mate.

More generally, it is possible that there are types of positional advantage that need deeper search to be converted into a win. Without the deeper search you may get the better position and still lose the game because you do not know how to play it, so correct knowledge pushes you into positions that you do not know how to handle at blitz, and pushes you to lose.

Note that I did not talk about the speed factor, and I think that even at the same speed the same knowledge may be counterproductive at fast time control and productive at slow time control, or the opposite.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

Uri Blass wrote:I have a theory about it.

There may be knowledge that is not important at long time control because search can replace it.

Here is an example.
Assume that no tablebases are used.

A program may not know how to win KR vs K but know that KR vs K is winning for the rook.

Adding special knowledge for this endgame may be productive at fast time control, because without the knowledge the program may only draw this position, while at slower time control it may always win anyway, for the simple reason that the search is deep enough to see the mate.

More generally, it is possible that there are types of positional advantage that need deeper search to be converted into a win. Without the deeper search you may get the better position and still lose the game because you do not know how to play it, so correct knowledge pushes you into positions that you do not know how to handle at blitz, and pushes you to lose.

Note that I did not talk about the speed factor, and I think that even at the same speed the same knowledge may be counterproductive at fast time control and productive at slow time control, or the opposite.

Uri
I would agree. I almost lost the 1986 WCCC event due to a piece of "knowledge" added while testing on a very slow machine; it turned Cray Blitz into a passive dog when run on the faster 4-8 processor Crays we used... Removing it returned Cray Blitz to a pretty bloodthirsty program once more...

Yet we played dozens of games on the slow hardware to verify that the change made a huge improvement against the old SuperCon program, the old Fidelity program, etc., during testing.
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

bob wrote:And yes, if your program is basically a "fixed depth" version because it always finishes an iteration, then that will make repeatability very high. But it is not the best way to manage time overall, and not many do that...
I have often wondered about this. The strength of a chess program is only a weak function of the search time (some 50-70 Elo per factor of 2). Suppose this really were a continuous function on a per-move basis, say 100*ln(t), and you played half the moves twice as fast and half the moves 1.5 times slower (same total time). Then the fast moves would be 69 Elo weaker, but the slow ones 41 Elo stronger. On average that makes you only 14 Elo weaker.

But if you have a continuous distribution of times between 0.5*t and 1.5*t (a factor of 3 between the longest and shortest, as you would have when you finish an iteration in an engine with effective branching factor 3), you get 100 * integral from 0.5 to 1.5 of ln(x) dx = 100*((1.5*ln(1.5) - 1.5) - (0.5*ln(0.5) - 0.5)) = -4.5 Elo. That is not a very impressive loss of strength.

But that ignores the fact that the strength of a move is unlikely to be a smooth function of the search time. I expect moves with an interrupted iteration to be weaker than those with a finished iteration, for equal search time. Interrupting makes part of the search wasted.

In particular, searching the first move takes much longer than refuting the others, to the point where 30% of the search time might go into that one move. If you interrupt during that first 30%, the time spent in that iteration is totally wasted, as you play the move of the previous iteration. So it is better not to start an iteration if it is likely to be interrupted during the search of the first move.

Now suppose that your nominal time expires late in the iteration. You have already searched the PV move, and perhaps refuted half of the alternatives. But the only case in which this deeper iteration will produce a better move is if it actually changes the move. And the move that will fail high usually takes much longer to search than the expected fail lows, even in the null-window search. So in these cases you are very likely to interrupt during the search of exactly the move that would have given the improvement. The result is that the improvement never materializes, and again you just wasted time by starting the iteration at all.

I wonder if this effect of now and then wasting part of the search time would not spoil (more than) the measly 4.5 Elo that you might expect to gain from distributing the time between moves evenly. Did you really test how many Elo weaker an engine becomes if you let it finish every iteration it starts (and tune it for the same average time per move)?

I can add that, once you do not allow search aborts at _arbitrary_ points, but only after a complete iteration, or after the first move that is not more than a given number of centipawns below the result of the previous iteration, none of the other effects contribute anything to the randomness: hashing, killers, history... It is all a completely deterministic process.
MartinBryant

Re: nondeterministic testing

Post by MartinBryant »

hgm wrote:I can add that, once you do not allow search aborts at _arbitrary_ points, but only after a complete iteration, or after the first move that is not more than a given number of centipawns below the result of the previous iteration, none of the other effects contribute anything to the randomness: hashing, killers, history... It is all a completely deterministic process.
Not necessarily...

You might have bugs (heaven forbid!) like uninitialised variables picking up different values each time the engine loads... but let's ignore that one.

Even if you don't terminate part way through an iteration, you still have to make a time-based decision to start another iteration or not, and a millisecond more or less on the timer may change that decision the next time you run.

Also, endgame tablebase throttling may be time-based, causing one run to read an accurate value and the next not to, thus changing the entire search.

There may be more...
User avatar
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: nondeterministic testing

Post by hgm »

Well, we already identified the timeout vs decision point as a source of randomness, but that would hit you only once every 1000 moves or so (unless your machine does heavy duty work on other tasks, of course).

As for the tablebases... yes, if you use those, they might be another source of randomness. But it is not an established fact that tablebases buy your engine much strength (or indeed any), so I would certainly not label an engine as 'immature' just because it doesn't use tablebases. I would never make my engines probe precalculated tablebases, although I might have them use retrograde analysis during the game. That is a design decision, and has nothing to do with maturity.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: nondeterministic testing

Post by bob »

hgm wrote:
bob wrote:And yes, if your program is basically a "fixed depth" version because it always finishes an iteration, then that will make repeatability very high. But it is not the best way to manage time overall, and not many do that...
I have often wondered about this. The strength of a chess program is only a weak function of the search time (some 50-70 Elo per factor of 2). Suppose this really were a continuous function on a per-move basis, say 100*ln(t), and you played half the moves twice as fast and half the moves 1.5 times slower (same total time). Then the fast moves would be 69 Elo weaker, but the slow ones 41 Elo stronger. On average that makes you only 14 Elo weaker.

But if you have a continuous distribution of times between 0.5*t and 1.5*t (a factor of 3 between the longest and shortest, as you would have when you finish an iteration in an engine with effective branching factor 3), you get 100 * integral from 0.5 to 1.5 of ln(x) dx = 100*((1.5*ln(1.5) - 1.5) - (0.5*ln(0.5) - 0.5)) = -4.5 Elo. That is not a very impressive loss of strength.

But that ignores the fact that the strength of a move is unlikely to be a smooth function of the search time. I expect moves with an interrupted iteration to be weaker than those with a finished iteration, for equal search time. Interrupting makes part of the search wasted.

In particular, searching the first move takes much longer than refuting the others, to the point where 30% of the search time might go into that one move. If you interrupt during that first 30%, the time spent in that iteration is totally wasted, as you play the move of the previous iteration. So it is better not to start an iteration if it is likely to be interrupted during the search of the first move.

Now suppose that your nominal time expires late in the iteration. You have already searched the PV move, and perhaps refuted half of the alternatives. But the only case in which this deeper iteration will produce a better move is if it actually changes the move. And the move that will fail high usually takes much longer to search than the expected fail lows, even in the null-window search. So in these cases you are very likely to interrupt during the search of exactly the move that would have given the improvement. The result is that the improvement never materializes, and again you just wasted time by starting the iteration at all.

I wonder if this effect of now and then wasting part of the search time would not spoil (more than) the measly 4.5 Elo that you might expect to gain from distributing the time between moves evenly. Did you really test how many Elo weaker an engine becomes if you let it finish every iteration it starts (and tune it for the same average time per move)?

I can add that, once you do not allow search aborts at _arbitrary_ points, but only after a complete iteration, or after the first move that is not more than a given number of centipawns below the result of the previous iteration, none of the other effects contribute anything to the randomness: hashing, killers, history... It is all a completely deterministic process.
I absolutely agree that the randomness is caused by nothing but timing variability for the programs I mentioned, including Crafty. None of them have any other form of randomness included, but they do vary their times. However, there is one potential form of timing randomness you have not recognized. You have some specific threshold value you set; if an iteration finishes just before that limit, you will start a new one, and if it finishes just after it, you will not. The timing in Unix/Windows is not _that_ precise, and you can just barely hit either side of such a threshold from time to time, so that in one game you stop, and in the next you go another ply deeper, using more time, maybe playing a different move, and definitely affecting the time allocation for every search done after that. Even that will add some randomness here and there, though not as much as chopping iterations off in the middle as most do.

Now, let me explain what I do and why.

Before I start a search, I set a target time that I want to use, and an absolute-drop-dead time I can't overstep (to avoid time losses or getting too low to have a chance later). Typically the time limit is some optimistic estimate of how much time I have to use, which factors in the time left on the clock, and a fudge factor that considers those "instant response" ponder hits. The drop-dead limit is at most 5X the basic time limit, but is further restricted to be no more than 1/2 the time left on the clock, with a couple of exceptions.

At the start of the first iteration, I order the root moves with a simple q-search call for each. If one move is clearly better than the rest, and it is a recapture of the opponent's last moved piece, I set the "easy move" flag which will cause me to stop the search when 1/3 of the normal time limit has passed, unless that move has failed low, in which case, normal timing rules apply.

Now I start an iteration and do not stop until either (a) the time limit expires or (b) an irrefutable mate has been found (the mate is short enough that a deeper search can't further reduce the depth).

The time limit expiration is a bit complicated, but goes something like this. If the current search value is within S - delta, where S is the score from the last iteration and delta is a fudge factor representing the largest positional score I am willing to give up without spending more time to try to improve it, then I will stop. If the score is significantly worse, I will use more time, depending on how much worse. I might use 2x the normal time to solve a positional problem, and allow up to 5x to avoid losing a piece.

Now for the first trick. When I decide that I have used enough time, I set a flag, but I never abort the search outright; I wait until the current root move is completed, and then I abort. Why? If you are going to change to a new best move, the tree blows up, just like searching the first move at a new iteration. If time runs out and I am not about to change my mind, the current root move is dismissed quickly, I terminate the search, and I make a move. But if I am about to change my mind, I keep right on searching, and I discover this because it takes longer; eventually I get a new lower bound back, and I will choose that move and stop the search immediately.

The only time I don't complete searching the root move is if I am on a new iteration, searching the first move, and time runs out. In that case, I stop immediately, and announce my move and resume with the ponder search.

So, using the above, I usually won't miss a new best move unless I simply don't have time to even start searching one of the root moves before time runs out. If I start on a move that will be best, I won't stop until I find this out. I don't try to determine if the iteration can likely be completed before starting it, because just searching the first move for a few seconds provides important information. If it fails low, it fails low quickly (easier to search for obvious reasons). And now I know I need to spend more time to find a better move and avoid making a mistake that might cost the game.

The reasons I do things this way are easy to explain.

1. I don't want to terminate a search where, given a little more time, I will find a better move. Stopping only between root moves helps here. Particularly in an SMP search where you have 4 processors, each searching a different root move, and the time-out flag is set: as each finishes searching its own root move, it won't choose a new one, but will help one of the others. Quickly the other three are busy helping the processor lucky enough to have chosen the eventual best move, and we finish that move before we quit, playing a better move than if we had just stopped.

2. If the target is 180 seconds, and I have already used 150 when I finish iteration N, I still start N+1. If nothing else, I verify that the bottom is not going to fall out, or if it does, I then know that I need to use even more time to avoid a costly mistake.

I've been using this for at least 10+ years in Crafty, and for at least 15 years prior to that in Cray Blitz. I have tweaked things from time to time, but I have never looked at a game and thought "dang, If I had just searched another 2 seconds, I would have found the move that would have won, or that would have avoided losing..."

The only thing I don't do in Crafty, that I did do in Cray Blitz, is to try to figure out what to do with the "extra" time I accumulate. If you are supposed to average 180 seconds per move, and when you get to move 20 you have 90 minutes left, you are 30 minutes ahead on the clock. Do you search all remaining moves a little longer, do you save that extra for those annoying fail-low cases, or do you save part of it to carry over to the next time control or use it all in the current one?