New testing thread


krazyken

Re: Correlated data discussion

Post by krazyken »

hgm wrote:
krazyken wrote:That must be a different Xboard than what I've got, I'll look into that.
If you really need xboard as opposed to WinBoard, this is not surprising: the time-odds options are a feature that was first supported in WinBoard 4.3.13, and that release was not accompanied by an xboard version. But 4.3.14, which will be released any day now, will also be available as xboard. (This is in fact the main reason it is not released yet; someone is compiling a Linux version of xboard, and making a release package with it, and we want to release everything together.)

I could help you to a pre-release of the WinBoard executable, or the source code, but I cannot make Linux compiles myself.
Actually, I'm even more barbarian than that: I run Xboard on Mac OS X, so I'll need to grab the source when it's ready.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

hgm wrote:
xsadar wrote:But what if the program you're testing has a highly variable nps rate (as in my case)? It can be expected that changes to the engine will change the nps, but you cannot expect the change to be even across all moves. If your nps rates are highly variable, and you want to compare results for your unmodified engine with the results for your modified engine, but use the nps command instead of time, how can you possibly know that the results are really even comparable?
Well, that is a problem the engine author will have to solve. If the nps is highly variable, it must be because in some cases you do much more time-consuming stuff in some nodes than in others. So you could update a separate counter in that extra code, so that you can reconstruct the number of expensive nodes and cheap nodes and weight them accordingly to construct a time measure.

I don't think the required accuracy is very high here, though. Modulating the speed by +/-10% has only a marginal effect on the engine strength (compared to running at the average speed all the time), as even an overall reduction of the speed by 10% only costs about 10 Elo.
The most common issue is pieces left. Every time a piece comes off, a part of the move generator is executed less frequently. Ditto for evaluation: no bishops, no bishop mobility analysis, etc. NPS rises. Give me positions where there are lots of deep checking lines and my NPS goes down, because the work I do at a node doesn't change, but the number of branches out of a node is smaller, which increases the cost per branch/node. The list goes on and on.
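A minimal C sketch of the weighted-counter idea hgm describes above; the counter names, the lazy/full-eval split, and the 3.5 weight are invented here for illustration and would have to be calibrated against the real clock of a particular engine.

Code:

/* Hypothetical sketch: count cheap and expensive nodes separately and
   combine them into a "virtual time" measure.  Names and weights are
   invented and would need calibration for a real engine. */
#include <stdint.h>

static uint64_t cheap_nodes;      /* nodes that used only the lazy eval */
static uint64_t expensive_nodes;  /* nodes that ran the full evaluation */

#define CHEAP_COST     1.0        /* relative cost of a cheap node      */
#define EXPENSIVE_COST 3.5        /* assumed cost of an expensive node  */

/* Called wherever the search decides between lazy and full evaluation. */
void count_node(int did_full_eval)
{
    if (did_full_eval)
        expensive_nodes++;
    else
        cheap_nodes++;
}

/* Weighted node count standing in for elapsed time. */
double virtual_time_units(void)
{
    return CHEAP_COST * (double)cheap_nodes +
           EXPENSIVE_COST * (double)expensive_nodes;
}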
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

tvrzsky wrote:
Uri Blass wrote:
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration, and of course in the case of a ponder hit I may play in 0 seconds, not at the end of an iteration).

Clearing the hash can clearly help to find bugs, and it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and you have only the game.

You know that the program made some mistake that you simply cannot reproduce in your first attempts at analysis, because you cannot reproduce the hash situation.

You will have a harder time catching the bug that caused the program to make the mistake.

If you need more time to discover bugs, it means that you sacrifice future improvements.

Uri
You can still have reproducibility even if you are not clearing the hash tables!

All you have to do is log, after every search, the number of nodes actually searched or something similar (it depends on the implementation; for example, my engine logs the number of the very node at which the request to break off the search occurred, because it actually often searches a few nodes more). Then you simply replay the game up to the point of interest, using those node counts instead of the time control to break off the search. After that, the hash table should be in exactly the same state as it was in the original game.

Of course it can be quite time consuming if you replay a long-time-control game. I use this method for my Arena tournaments, which usually run at a control of a few seconds per move. I have a Perl script which extracts from the 'arena.debug' file all the commands my engine received during the game, as well as the node counts. It adds commands for breaking off the search, switching tree dumping on/off, etc. So it creates a batch file of xboard / my proprietary commands, which I send to the standard input of my engine when starting it on the command line. The whole process is almost fully automated; otherwise it would surely be too annoying and labor intensive.

Filip
Unfortunately it won't work if the _other_ player also has the potential to vary. And in the games I am playing, both sides vary at will, and I am not about to try to modify the source for the other engines so that they can replay a game exactly...

Your approach might be good for debugging, but not for addressing the instability of matches between different programs...
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

hgm wrote:
xsadar wrote:But what if the program you're testing has a highly variable nps rate (as in my case)? It can be expected that changes to the engine will change the nps, but you cannot expect the change to be even across all moves. If your nps rates are highly variable, and you want to compare results for your unmodified engine with the results for your modified engine, but use the nps command instead of time, how can you possibly know that the results are really even comparable?
Well, that is a problem the engine author will have to solve. If the nps is highly variable, it must be because in some cases you do much more time-consuming stuff in some nodes than in others. So you could update a separate counter in that extra code, so that you can reconstruct the number of expensive nodes and cheap nodes and weight them accordingly to construct a time measure.

I don't think the required accuracy is very high here, though. Modulating the speed by +/-10% has only a marginal effect on the engine strength (compared to running at the average speed all the time), as even an overall reduction of the speed by 10% only costs about 10 Elo.
Like Bob said, there are too many factors, not just node types. It seems highly unlikely to me that computing the average speed of each node type would get me to within a 10% margin of error for a given position.
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: 4 sets of data

Post by tvrzsky »

bob wrote:
tvrzsky wrote:
Uri Blass wrote:
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration, and of course in the case of a ponder hit I may play in 0 seconds, not at the end of an iteration).

Clearing the hash can clearly help to find bugs, and it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and you have only the game.

You know that the program made some mistake that you simply cannot reproduce in your first attempts at analysis, because you cannot reproduce the hash situation.

You will have a harder time catching the bug that caused the program to make the mistake.

If you need more time to discover bugs, it means that you sacrifice future improvements.

Uri
You can still have reproducibility even if you are not clearing the hash tables!

All you have to do is log, after every search, the number of nodes actually searched or something similar (it depends on the implementation; for example, my engine logs the number of the very node at which the request to break off the search occurred, because it actually often searches a few nodes more). Then you simply replay the game up to the point of interest, using those node counts instead of the time control to break off the search. After that, the hash table should be in exactly the same state as it was in the original game.

Of course it can be quite time consuming if you replay a long-time-control game. I use this method for my Arena tournaments, which usually run at a control of a few seconds per move. I have a Perl script which extracts from the 'arena.debug' file all the commands my engine received during the game, as well as the node counts. It adds commands for breaking off the search, switching tree dumping on/off, etc. So it creates a batch file of xboard / my proprietary commands, which I send to the standard input of my engine when starting it on the command line. The whole process is almost fully automated; otherwise it would surely be too annoying and labor intensive.

Filip
Unfortunately it won't work if the _other_ player also has the potential to vary. And in the games I am playing, both sides vary at will, and I am not about to try to modify the source for the other engines so that they can replay a game exactly...

Your approach might be good for debugging, but not for addressing the instability of matches between different programs...
Yes, debugging and search-tree investigation are the only purpose of this method. And in the end it is nothing more than a search controlled by node counts, so indeed it cannot bring anything new to the problem. My point was that using hash tables without clearing them before every move is not necessarily an obstacle to debugging. And it does not create any nondeterminism per se; it only amplifies the nondeterminism caused by timing issues.
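For anyone who wants to try something like Filip's scheme, here is a rough C sketch of the two pieces involved: logging the node count at which each search actually stopped, and later using that logged count instead of the clock as the stop condition. All names are invented; the real hook points depend on the engine.

Code:

/* Rough sketch of node-count based replay (all names invented).
   During the original game: log the node count at which each search
   actually stopped.  During replay: feed that count back in and use it
   instead of the clock as the stop condition, so the search stops at
   exactly the same node and the hash table ends up in the same state. */
#include <stdint.h>
#include <stdio.h>

extern uint64_t nodes_searched;      /* incremented once per node         */
extern int      time_is_up(void);    /* the normal time-control check     */

static uint64_t replay_node_limit;   /* set from the logged value when
                                        replaying; 0 = normal timed search */

/* Written to the log after every completed search in the original game. */
void log_search_end(FILE *logfile)
{
    fprintf(logfile, "stopnodes %llu\n",
            (unsigned long long)nodes_searched);
}

/* Stop test used inside the search. */
int should_stop(void)
{
    if (replay_node_limit)
        return nodes_searched >= replay_node_limit;
    return time_is_up();
}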
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: 4 sets of data

Post by Dirt »

bob wrote:Unfortunately it won't work if the _other_ player also has the potential for varying.
I don't think you even need the other player, just what moves were made and the node counts from the original run.
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

xsadar wrote:Like Bob said, there are too many factors, not just node types. It seems highly unlikely to me that computing the average speed of each node type would get me to within a 10% margin of error for a given position.
Well, then it might be 20%, so what? Have you ever tried to find out how much the Elo of your engine suffers when you give it 20% extra time for half the game, and 20% less for the other half? It is probably less than you can measure.

The important artifact of 'st' type time controls is that you cannot give extra time when the engine really needs it (such as after a disastrous fail low of the PV move in an iteration it is not allowed to complete). That is where you can earn a lot of Elo points with time management. How you globally divide the time over the game has far less effect, and certainly requires far less intelligence on the part of the time management. (Adapting a single parameter is usually enough to tune that behavior.)

So the most important parts of your time management are tested well if you play a nodes-based classical time control, unlike when you would just play a fixed maximum number of nodes per move. So I would never do the latter, and always recommend the former. It is equally easy to do anyway: just use the node count that you already need for the WB thinking output and divide it by the given nps value to convert it to time, instead of reading the clock. You would have to do that in both cases anyway.
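A small C sketch of the conversion hgm describes, assuming the engine already keeps a node counter and has received an nps value from the GUI; the variable and function names are invented.

Code:

/* Sketch of a nodes-based classical time control (names invented):
   instead of reading the wall clock, derive "elapsed time" from the
   node counter and the nps value sent by the GUI, so the normal time
   management (including fail-low extensions) keeps working. */
#include <stdint.h>

extern uint64_t nodes_searched;   /* same counter used for WB thinking output     */
extern int      nps_setting;      /* nodes per second, from the WB "nps" command  */

/* Virtual elapsed time in centiseconds since the search started. */
int virtual_elapsed_cs(uint64_t nodes_at_start)
{
    return (int)((nodes_searched - nodes_at_start) * 100 / nps_setting);
}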
Uri Blass
Posts: 10816
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 4 sets of data

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
The performance penalty from searching only complete iterations is clearly less than 100 Elo (and I guess something near 20 Elo), and I believe that there are programmers who do not care about these small numbers and prefer deterministic results in testing so they can reproduce everything. (My opinion is that it is better to sacrifice 20 Elo for being able to reproduce everything easily.)
I play so many games against other computers on ICC, and I see the analysis they whisper/kibitz when I am watching and we both turn it on. Even in CCT events as well. And _everybody_ that I played was producing output claiming that not all root moves were searched each time a move was played. Sometimes they started a new depth and got nothing back, sometimes they were in the middle, and perhaps very rarely they timed out when the iteration ended. But it is easy to watch 'em. I don't believe _anybody_ designs a program in any way other than to make it as strong as possible. And that includes me. I know how to make a perfectly deterministic parallel search. But with absolutely nowhere near the speedup I now produce. So non-determinism is accepted along with better performance. Ditto for time allocation and each other decision we make. But find me any _serious_ chess programmer that would say "sure, I will give up significant Elo just to make testing more deterministic (although no more accurate)"

You can be sure that not all programs close to Crafty's level exhibit the variability that you talk about.
When I first ran into this, I tried several: both Glaurungs, fruit, gnuchess, crafty, arasan 9/10, and a couple of others that I won't name but which I could get to run under linux. And they _all_ had this issue. Yes, there could certainly be a program out there that terminates searches on iteration boundaries, that clears the hash between moves, that does anything else needed to minimize variability. But there certainly are not very many of 'em, because we all take any rating improvement we can get, and we aren't going to throw away 30 here, 20 there, since enough of that produces a pure patzer.

Not all programmers of programs at that level care much about playing strength, and you can be sure that there are people who do not care whether their program is 500 Elo weaker than Rybka or 480 Elo weaker than Rybka.

I have not worked on Movei lately, but if I come back to it, then at a level more than 100 Elo weaker than Rybka I will certainly not care about this small improvement that makes results non-reproducible and makes it harder to find bugs when I see that the program played a move that I cannot reproduce.

Uri
I'm the opposite, trying to take everything I can get. Most are...
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration, and of course in the case of a ponder hit I may play in 0 seconds, not at the end of an iteration).

Clearing the hash can clearly help to find bugs, and it does not help only with deterministic testing.

Suppose that your program made a mistake in some game and you have only the game.

You know that the program made some mistake that you simply cannot reproduce in your first attempts at analysis, because you cannot reproduce the hash situation.

You will have a harder time catching the bug that caused the program to make the mistake.

If you need more time to discover bugs, it means that you sacrifice future improvements.

Uri
That is an insane approach to designing the strongest possible chess program. Would you then choose not to add major blocks of eval code because they might be harder to debug than if you left them out? Even though the additions are not going to add 100 Elo, or even 20 Elo, do you add them or not? There is only one sane answer there, and it is the same sane answer to any question about sacrificing Elo for (possibly unnecessary) future debugging...
Evaluation is different, because the results after adding evaluation terms can be reproduced.

Uri
Are you following this thread? A tiny eval change alters the shape of the game tree. A single node change can change a move or the game result. So no way is that last statement true. Just run a couple of positions, then make a tiny eval change and re-run 'em. And look at the node counts for the same fixed depth. Then this can be put to rest for all time.
My point was about being able to reproduce moves, not about reproducing results, when I wrote that evaluation is different from my point of view.

I was not talking about testing small changes in many games, but about answering the question of why the program played a specific move in a game.

If I do not clear the hash, it may be harder to reproduce the move (as was explained elsewhere in this thread), so it may be harder to fix bugs.

If I clear the hash after every move and see a move that I do not like, then I can simply run the program again from the same position and get the same move, so it can help to detect bugs.

Uri
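A trivial C sketch of the clear-between-moves approach Uri prefers, assuming a simple flat hash table; the table layout and names are invented.

Code:

/* Sketch: wipe the transposition table before every root search so a
   move can be reproduced later by searching the same position with the
   same limits.  The entry layout and names are invented. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    uint64_t key;
    int      depth, score, move, flags;
} HashEntry;

extern HashEntry *hash_table;
extern size_t     hash_entries;

void clear_hash(void)
{
    memset(hash_table, 0, hash_entries * sizeof(HashEntry));
}

/* Called once per game move, before iterative deepening starts. */
void new_search(void)
{
    clear_hash();
    /* killers/history would have to be reset too for full determinism */
}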
MartinBryant

Re: Correlated data discussion - Another Experiment...

Post by MartinBryant »

Well I'm afraid Glaurung failed the medical! :lol:
Although it appears to support a fixed-nodes type of search, when you tell it to search X nodes it actually searches X plus a few more, where 'a few more' seems to vary each run. So it was non-deterministic in the fixed-nodes game re-runs.

But do not dismay, Twisted Logic has valiantly stepped into the fray and passed all the drugs tests. :)

The first match is underway!
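One plausible (but unverified) explanation for the kind of overshoot observed with Glaurung is that engines commonly poll their stop conditions only every so many nodes rather than at every node; a C sketch of that pattern, offered only as a guess at the cause:

Code:

/* Sketch of the usual "poll every N nodes" pattern: the node limit is
   not checked at every node, so the search runs a little past it.
   Names are invented; this is not Glaurung's actual code. */
#include <stdint.h>

extern uint64_t nodes_searched;
extern uint64_t node_limit;       /* 0 = no node limit       */
extern int      stop_search;      /* picked up by the search */

void check_limits(void)
{
    if ((nodes_searched & 1023) == 0) {   /* poll every 1024 nodes */
        if (node_limit && nodes_searched >= node_limit)
            stop_search = 1;
        /* clock and input polling would typically go here as well */
    }
}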
Fritzlein

Re: Correlated data discussion - Another Experiment...

Post by Fritzlein »

MartinBryant wrote:So would it be possible to have a situation where the moves of a series of games have very little correlation, but the results have very high correlation?
Yes, absolutely. It is reasonable to expect that less variation in playouts goes hand-in-hand with more correlated results, while more variation in playouts goes hand-in-hand with less correlated results. However, it doesn't have to be like that. We can theoretically have one without the other.

I can give two extreme examples. Let's take a hypothesized baseline of Crafty vs. Fruit from a random position being 30% wins for Crafty, 20% draws, and 50% wins for Fruit. Now we run repeated playouts of a single position.

Theoretical Extreme Case A: It turns out that only three distinct playouts ever happen from this starting position. Every game is an exact duplicate of one of these three. Nevertheless, it also just so happens that one of the three playouts is a win for Crafty, and that playout happens with 30% probability. One of the three playouts is a draw, and that happens with 20% probability. The final playout is a win for Fruit, which happens with 50% probability. So although there is almost no variability of playouts, the game results we get are just the same as we would get from testing a random position. We don't need more positions; we can just test this same one over and over, and get as accurate a reading on Crafty vs. Fruit strength as if we used random positions.

Theoretical Extreme Case B: It turns out that we run a million playouts from the same position without ever getting a duplicate game. (This implies that we are picking from a pool on the order of a trillion distinct playouts or more.) However, although the playouts are never duplicates, Crafty wins every one. The results are totally correlated to the position, and we get no new insight on Crafty vs. Fruit strength from any test after the first.

I obviously don't expect either extreme. On the contrary, I expect that the more different playouts there are from a specific position, the more the win/draw/loss results will resemble a playout from a random position. But you get the point that measuring one isn't equivalent to measuring the other.
bob wrote:One note. A vs A is more stable than A vs B, obviously. That's why, when I started testing with different (slightly) node limits, I used crafty vs crafty, so that the results had to be symmetric each run. Same number of wins and losses in a single run, since it was the same program playing both sides twice. Otherwise you get another level of variability that might or might not be good.
I definitely want to measure repeat plays of A vs. B, because I expect the results to be more correlated that way, i.e. to be more divergent from the results of A vs. B on a random position. I expect this greater divergence because the two engines have different understandings of the position, so it seems plausible that the position happens to be more suited to one engine's eval than the other, even if it is a theoretically balanced position. I expect more correlation even if A vs. B creates more variability of playouts than A vs. A, because even if A vs. A has fewer possible playouts, the engines understand the position equally well, so the win/loss/draw results aren't likely to diverge much from A vs. A on a random position.

In other words, this might exactly be a case where more variation in playouts does not go hand-in-hand with less correlation of results to positions. But this is highly speculative on my part; here my intuitions are based on even less than my usual wild theorizing.
MartinBryant wrote:Hmmm...
well I can pick 201 random openings from high level games no problem, but I just wonder if this leaves the experiment open to criticism such as "well the positions you chose obviously weren't balanced enough", when people are struggling to interpret the results.
Yes, if the result is as I expect, namely that the first playout is clearly correlated to the second playout, that correlation can always be explained by your input positions being unbalanced. One can say that on the positions where white won both, that happened because white was better, not because the playouts were un-random, and similarly for the positions where black won both.

If the correlation of your results is sufficiently extreme, then common sense can rule out unbalanced positions as the root cause. For example, if 90% of the time the first playout has the same result as the second playout, you can be confident your input positions weren't that unbalanced. But we probably won't be so fortunate as to have correlation so extreme.

What to do? One option is this: get a third and fourth engine. No need to be fussy now about node counts on those guys, because we want them purely for validating that your input positions were basically balanced. Now make a third play of each position, this time between your extra engines, randomizing which engine gets white. We will have already measured the correlation of the first playout result to the second playout result. If the cause of that correlation was imbalance of position, then the first playout result should be just as correlated to the third playout result as it was to the second. So if we have correlation from the first playout to the second but not from the first to the third, the test is a slam dunk: clock jitter is not sufficiently randomizing results on repeat plays.
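A small C sketch of the comparison proposed above, reduced to simple agreement rates between paired results; the encoding and array sizes are placeholders, not a prescription for the actual experiment.

Code:

/* Sketch: how often does playout 1 match playout 2, versus playout 1
   match playout 3 (the independent engine pair)?  Results are encoded
   as +1 / 0 / -1 for white win / draw / black win; the arrays are
   placeholders to be filled from the game results. */
#include <stdio.h>

static double agreement(const int *a, const int *b, int n)
{
    int same = 0;
    for (int i = 0; i < n; i++)
        if (a[i] == b[i])
            same++;
    return (double)same / n;
}

int main(void)
{
    int r1[201] = {0}, r2[201] = {0}, r3[201] = {0};
    /* ... fill r1, r2, r3 from the three sets of game results ... */
    printf("agreement 1-2: %.3f   agreement 1-3: %.3f\n",
           agreement(r1, r2, 201), agreement(r1, r3, 201));
    return 0;
}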
MartinBryant wrote:OK, it might take a little while to set it up and play it through so I'll get back to you in a day or two!
Thanks again. This is fun!