New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion - Silver Position Result

Post by bob »

Fritzlein wrote:
MartinBryant wrote:OK I've re-run the test on Bob's randomly selected Silver position.
[...]
In 100 games there was ONE duplicate game pair
For comparison, if I had a bag of 7,213 game results, and I picked out of that bag with equal likelihood of getting each result, then in 100 picks I would have a 50% chance of getting no duplicates (birthday paradox). Therefore the actual result of getting 1 duplicate game in 100 playouts is consistent with there being only 7,213 playouts you will ever get, each of them equally likely. Of course, in practice there will probably be more playouts that you could eventually get if you tried long enough, because they won't all be equally likely; some will be more likely than others, and that's where you will get your duplicates. I just wanted to throw out a ballpark number to help guide the intuition of how much variation this is.

Come to think of it, it's pretty much the same as the ballpark number you threw out there yourself, namely 4,950 (pairs of comparisons)...
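(A quick back-of-the-envelope check of that figure, as a minimal C sketch: the 7,213 is simply the pool size at which 100 equally likely, independent draws come out at roughly a 50% chance of containing no duplicate.)

/* Birthday-paradox check: with a pool of N equally likely playouts, the
 * chance that 100 independent draws contain no duplicate is
 * prod_{k=0}^{99} (1 - k/N).  For N = 7213 this comes out near 0.5. */
#include <stdio.h>

int main(void) {
    const int draws = 100;
    const double pool = 7213.0;   /* hypothetical number of distinct playouts */
    double p_no_dup = 1.0;

    for (int k = 0; k < draws; k++)
        p_no_dup *= 1.0 - (double)k / pool;

    printf("P(no duplicate in %d draws from %.0f) = %.3f\n",
           draws, pool, p_no_dup);
    return 0;
}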
MartinBryant wrote:There's still a mass (excuse the un-scientific term) of variation.
Yes indeed, lots of different playouts possible. But if we look at the results of the playouts (win, draw, loss) does the distribution correspond to the distribution for a randomly chosen position? If the average over all positions (or even just over all balanced positions) is 30% white wins, 45% draw, 25% black wins, and looking at the actual playouts of this position (drawn from a pool of 4,950 or 7,213 or whatever) there are 10% white wins, 40% draws, and 50% black wins, then reusing this position is causing strong correlation in scores.

Blame insufficient clock jitter, blame the engine for not understanding black's strategic weaknesses, blame sunspots if you like, but if the results of playouts of this position are not the same shape as results of playouts from a random position, then there is correlation of the results to the position.
All that I can add is that if you don't shut up, my head is going to explode, much like it feels after some of Gerd's bit-wizard stuff.

One interesting experiment I could try later is to determine just how many different games are possible from a single position. I could just start playing those 1/2 second games, and continue until I play a hundred thousand or so without getting a "new" result. That won't be an exact upper bound of course, but it would be a sort of monte-carlo approach to get real close to the right number.
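A minimal sketch of what that experiment loop could look like (play_one_game here is only a stand-in for running one of those 1/2-second games and recording its move list; all names are hypothetical):

/* Monte-Carlo count of distinct games from one position: hash each
 * finished game's move list, and stop once a large number of consecutive
 * games produces nothing new. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 22)          /* ~4M slots, more than enough here */
static uint64_t seen[TABLE_SIZE];      /* 0 marks an empty slot */

/* Stand-in for the real harness: in the actual experiment this would play
 * one 1/2-second engine game and write its move list into 'moves'. */
static void play_one_game(char *moves, size_t maxlen) {
    snprintf(moves, maxlen, "game-%d", rand() % 7213);   /* dummy "games" */
}

static uint64_t hash_game(const char *moves) {
    uint64_t h = 1469598103934665603ULL;                 /* FNV-1a */
    for (const unsigned char *p = (const unsigned char *)moves; *p; p++)
        h = (h ^ *p) * 1099511628211ULL;
    return h ? h : 1;                                     /* 0 is reserved */
}

/* returns 1 if this game was new, 0 if it was seen before */
static int insert_game(uint64_t h) {
    size_t i = h & (TABLE_SIZE - 1);
    while (seen[i]) {
        if (seen[i] == h) return 0;
        i = (i + 1) & (TABLE_SIZE - 1);
    }
    seen[i] = h;
    return 1;
}

int main(void) {
    char moves[4096];
    long distinct = 0, since_new = 0;

    /* "play until 100,000 games in a row produce nothing new" */
    while (since_new < 100000) {
        play_one_game(moves, sizeof moves);
        if (insert_game(hash_game(moves))) { distinct++; since_new = 0; }
        else                                 since_new++;
    }
    printf("estimated number of distinct games: %ld\n", distinct);
    return 0;
}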

I am somewhat interested in the issue, because of the way I measure time. Depending on the system, getting the current time can be an expensive operation. On the old Crays, for example, if you checked the time every 10 nodes or so, the NPS would drop by a factor of 4 or so because of the rather large context-switching time required to get down into the O/S to get that value. So I started using a counter that is set to some fixed value and decremented each time a node is searched. When the counter reaches zero, I check the time.

The only problem is, I was perhaps too clever. You have to answer the question "how often do you want to check?" and the answer is "often enough to not burn too much extra time by sampling too infrequently." If you play long games, 3 minutes per move, and sample once per second, you can never be more than a second over your target time. But for game in 1 minute, where the searches end up taking way less than a second, you would lose on time quickly. So I have a dynamically-adjusted "counter" that is reduced as the target search time is reduced.
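A minimal sketch of that countdown scheme (the names and the "20 samples per search" figure are illustrative, not Crafty's actual code):

/* Countdown-based time check: instead of reading the (possibly expensive)
 * system clock at every node, decrement a counter and only read the clock
 * when it reaches zero.  The reload value shrinks with the target search
 * time so that very short searches are still sampled often enough. */
#include <stdio.h>
#include <time.h>

static long    check_interval;     /* nodes between clock reads        */
static long    nodes_until_check;  /* countdown to the next clock read */
static double  target_seconds;     /* time budget for this search      */
static clock_t search_start;

/* aim for roughly 20 clock reads per search, but never go more than
 * about one second between reads */
static void start_search(double target, long nodes_per_second) {
    target_seconds = target;
    check_interval = (long)(target * nodes_per_second / 20.0);
    if (check_interval > nodes_per_second) check_interval = nodes_per_second;
    if (check_interval < 1)                check_interval = 1;
    nodes_until_check = check_interval;
    search_start = clock();
}

/* called once per node from the search */
static int time_is_up(void) {
    if (--nodes_until_check > 0)
        return 0;                               /* cheap path: no clock read */
    nodes_until_check = check_interval;
    return (double)(clock() - search_start) / CLOCKS_PER_SEC >= target_seconds;
}

int main(void) {
    long nodes = 0;
    start_search(0.5, 1000000);                 /* 1/2 second at ~1M nps */
    while (!time_is_up())
        nodes++;                                /* stand-in for searching a node */
    printf("searched %ld \"nodes\" in roughly half a second\n", nodes);
    return 0;
}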

What all that means is that in longish games, where the sample interval is one second or longer, I would probably see only two different node counts on the same position: one where I search exactly N nodes, sample, and get a time of T-.01 (10 milliseconds short of the actual target time), so I go for another time step and burn more nodes; and one where, after the same N nodes, I sample, get T+.01, and stop. Clock jitter is not going to be measured in seconds except perhaps on supercomputers. But as the games get shorter and shorter, 10 ms might be 10K nodes at 1M nodes per second, or on my 8-core, 200K nodes at 20M nodes per second. So if the clock jitter is no more than 10 ms (it is probably more than that on a PC), and my sample interval is small enough that I take several samples within a "jitter interval", any of those node counts could be the "final" value...

I did run a test with this interval set to 1, so that I checked the time after every node, and the variability did not change significantly or go away. Enough games would probably show how this sample counter affects the number of different games, but it would take a really large number of games to get a real handle on this.

BTW, it has been a breath of fresh air to see real discussions all of a sudden, rather than the old "stampee feet" stuff that has happened each time I brought this up in the past.

I think the realization that this is actually happening, that it is a real problem, and that it affects results is slowly creeping in as more and more people begin to test and join in...
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

hgm wrote:
bob wrote:
hgm wrote:
bob wrote:
hgm wrote:
Dirt wrote: Using the node count instead of time would mean you're not testing the time management code, right? I guess you'd then have to test that separately, which is really starting to make the testing complex.
This would not be true if you use the node-based time-control mode of WinBoard 4.3.14.

Then the time-control code would be fully active. It is just that the engine would be fed (and uses internally) a virtual time, derived from its node count. But apart from the clock routine, the rest of the engine would never know that it is not dealing with real time.
How can that be? You tell me to search a specific number of nodes, then how can I go over that as I would if I fail low on a position? How can I go over that by variable amounts as I do now, depending on how much I appear to be "losing"???

If you tell me exactly how long, there is going to be node count jitter. If you tell me exactly how many nodes to search, there will be no timing jitter and I can't test my time allocation code at all.

Can't be both ways, at least in my program...
Of course it can. You get a given number of nodes for a given number of moves, and the engine will itself determine how it distributes those nodes over the individual moves. Just like it divides its time quota over the moves in a normal N moves / M minutes time control. What you had in mind is the direct equivalent to the 'st' command, that gives you a fixed, never-exceed time for a single move. But there is no reason to restrict node-based time controls to that mode.

So if you want to play 100,000 nodes (max) per move, you start WinBoard with arguments

-st 1 -nps 100000

If you want to play 4,000,000 nodes for 40 moves, you specify

-mps 40 -tc 0:40 -nps 100000

That's all. Of course the engine has to implement the WinBoard nps command. So indeed you could not do it on Crafty.
OK, how does that make a fair test? Program A is fairly constant in NPS throughout the game. Program B varies by a factor of 3x from opening to endgame (Ferret was an example of this). So how many nodes do you tell Ferret to search in comparison to the other program?

So how is this going to give reasonable estimates of skill when it introduces a potential time bias toward one opponent or the other???
The purpose of node-based time controls is not to have a fair test of skill between programs, but to create a reproducible testing environment that is insensitive to CPU loading. I just point out that node-based time controls are not limited to 'st' type time controls, but come in all flavors: classical, incremental, sudden-death, or multi-session combinations of those.

Engines do not count nodes the same way, and do not have the same nps rate anyway. So if a tester wants to use node-based time control without giving the match the character of a time-odds game, it is unavoidable that he adapts the nps parameter for each engine separately so that they use equal time.

To prevent the problem with highly variable nps that you point out, I would encourage any program that shows such behavior to lie about its node count accordingly in nps mode, e.g. counting each tablebase probe as 100 nodes, if that is what it takes to make the nps rate in the endgame equal to that in the middlegame. Use any internal engine statistic to make the reported node count reflect CPU-time use as accurately as possible.
But what if the program you're testing has a highly variable nps rate (as in my case)? It can be expected that changes to the engine will change the nps, but you cannot expect the change to be even across all moves. If your nps rates are highly variable, and you want to compare results for your unmodified engine with the results for your modified engine, but use the nps command instead of time, how can you possibly know that the results are really even comparable?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

That is a problem. The idea of reproducible testing is a good one. But, as the saying goes, "the devil is in the details". My NPS fluctuates by a factor of 2-3 in a game, particularly if using a parallel search, which gets more efficient as the depth increases. But for a pure serial search, it can vary by 2-3x from move one to the endgame.

This idea would work for programs that don't behave like this (i.e. for programs with O(1) complexity based on the number of pieces remaining), while most seem to be O(n), where n is somehow proportional to the remaining material, and a few are the inverse, where they get slower in the endgame. And that seems to be a problem, because I need to test in the same environment I am going to run in, not in a time-handicap or time-advantage game, but heads-up.
krazyken

Re: Correlated data discussion

Post by krazyken »

hgm wrote:
krazyken wrote:I might be stuck there, none of my GUI choices seem to allow a time per move mode for engine matches. Except Xboard which doesn't seem to have granularity below one second.
You can achieve smaller granularity by specifying a time-odds factor for both engines, and playing with "-timeOddsMode 0", so that the times are not normalized to that of the slowest engine. So with

-st 1 -firstTimeOdds 100 -secondTimeOdds 100 -timeOddsMode 0

you would effectively have centi-second resolution.
That must be a different Xboard than what I've got, I'll look into that.
MartinBryant

Re: Correlated data discussion - Another Experiment...

Post by MartinBryant »

Fritzlein wrote:To me it is even more telling that not all twelve games were drawn. There was one win for black in there along with eleven draws. But as long as we are just eyeballing results, eleven draws out of twelve doesn't seem like a highly erratic result.
Now there's a subtle distinction...
To me a different game is a different game, hence erratic.
To you, if it's still a draw, it's not erratic.
It sounds like you can correlate many things from the same data?
So would it be possible to have a situation where the moves of a series of games have very little correlation, but the results have very high correlation?
Fritzlein wrote:Yes indeed. Thanks for offering! Let's try Spike against another bot where node counts can be fixed. Let's pick 201 random starting positions that are roughly balanced. For each position, pick which color Spike plays at random.

Now play out each position twice with Spike playing the same color both times. In the first playout, use 1,000,000 nodes for both engines, and in the second use 1% more, i.e. 1,010,000 nodes for both engines. If the NPS of each bot are vastly different, choose a constant ratio of node counts to make the games roughly even. For example if the opponent is slower/smarter by a factor of two, in the first game use 1,000,000 nodes for Spike and 500,000 for opponent, and in the second game use 1,010,000 nodes for Spike and 505,000 for the opponent. Whatever makes the engines roughly the same strength, since we get the most information from nearly-equal games.
Hmmm...
well I can pick 201 random openings from high level games no problem, but I just wonder if this leaves the experiment open to criticism such as "well the positions you chose obviously weren't balanced enough", when people are struggling to interpret the results.
Glaurung appears to do fixed nodes searches correctly and it is in the same league as Spike, so I'll probably go with that, but I'll check it for determinism first. (If I use a weaker engine then the biasing you suggest is something else I'd have to code into my GUI!)

OK, it might take a little while to set it up and play it through so I'll get back to you in a day or two!
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

xsadar wrote:But what if the program you're testing has a highly variable nps rate (as in my case)? It can be expected that changes to the engine will change the nps, but you cannot expect the change to be even across all moves. If your nps rates are highly variable, and you want to compare results for your unmodified engine with the results for your modified engine, but use the nps command instead of time, how can you possibly know that the results are really even comparable?
Well, that is a problem the engine author will have to solve. If the nps is highly variable, it must be because you do much more time-consuming stuff in some nodes than in others. So you could update a separate counter in that extra code, so that you can reconstruct the number of expensive nodes and cheap nodes, and weight them accordingly to construct a time measure.
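A minimal sketch of such a weighted counter (the factor of 100 per tablebase probe is just the example figure mentioned earlier in the thread; all names are illustrative):

/* Weighted node counting: report a node count that tracks CPU time rather
 * than raw nodes by charging expensive work (here a tablebase probe) as if
 * it were many cheap nodes. */
#include <stdio.h>

#define TB_PROBE_WEIGHT 100   /* one probe "costs" as much as 100 plain nodes */

static unsigned long long plain_nodes;
static unsigned long long tb_probes;

static void count_node(void)     { plain_nodes++; }
static void count_tb_probe(void) { tb_probes++; }

/* the node count reported to the GUI when running in nps mode */
static unsigned long long reported_nodes(void) {
    return plain_nodes + TB_PROBE_WEIGHT * tb_probes;
}

int main(void) {
    /* toy endgame search where 1 node in 50 does a tablebase probe */
    for (int i = 0; i < 1000000; i++) {
        count_node();
        if (i % 50 == 0) count_tb_probe();
    }
    printf("raw nodes %llu, probes %llu, reported nodes %llu\n",
           plain_nodes, tb_probes, reported_nodes());
    return 0;
}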

I don't think the required accuracy is very high here, though. Modulating the speed by +/-10% has only a marginal effect on the engine strength (compared to running at the average speed all the time), as even an overall reduction of the speed by 10% only costs about 10 Elo.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion - Another Experiment...

Post by bob »

MartinBryant wrote:
Fritzlein wrote:To me it is even more telling that not all twelve games were drawn. There was one win for black in there along with eleven draws. But as long as we are just eyeballing results, eleven draws out of twelve doesn't seem like a highly erratic result.
Now there's a subtle distinction...
To me a different game is a different game, hence erratic.
To you, if it's still a draw, it's not erratic.
It sounds like you can correlate many things from the same data?
So would it be possible to have a situation where the moves of a series of games have very little correlation, but the results have very high correlation?
Fritzlein wrote:Yes indeed. Thanks for offering! Let's try Spike against another bot where node counts can be fixed. Let's pick 201 random starting positions that are roughly balanced. For each position, pick which color Spike plays at random.

Now play out each position twice with Spike playing the same color both times. In the first playout, use 1,000,000 nodes for both engines, and in the second use 1% more, i.e. 1,010,000 nodes for both engines. If the NPS of each bot are vastly different, choose a constant ratio of node counts to make the games roughly even. For example if the opponent is slower/smarter by a factor of two, in the first game use 1,000,000 nodes for Spike and 500,000 for opponent, and in the second game use 1,010,000 nodes for Spike and 505,000 for the opponent. Whatever makes the engines roughly the same strength, since we get the most information from nearly-equal games.
Hmmm...
well I can pick 201 random openings from high level games no problem, but I just wonder if this leaves the experiment open to criticism such as "well the positions you chose obviously weren't balanced enough", when people are struggling to interpret the results.
Glaurung appears to do fixed nodes searches correctly and it is in the same league as Spike, so I'll probably go with that, but I'll check it for determinism first. (If I use a weaker engine then the biasing you suggest is something else I'd have to code into my GUI!)

OK, it might take a little while to set it up and play it through so I'll get back to you in a day or two!
One note. A vs A is more stable than A vs B, obviously. That's why, when I started testing with slightly different node limits, I used Crafty vs Crafty, so that the results had to be symmetric each run. Same number of wins and losses in a single run, since it was the same program playing both sides twice. Otherwise you get another level of variability that might or might not be good.
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

krazyken wrote:That must be a different Xboard than what I've got, I'll look into that.
If you really need xboard as opposed to WinBoard, this is not surprising: the time-odds options are a feature that was first supported in WinBoard 4.3.13, and that release was not accompanied by an xboard version. But 4.3.14, which will be released any day now, will also be available as xboard. (This is in fact the main reason it is not released yet; someone is compiling a Linux version of xboard, and making a release package with it, and we want to release everything together.)

I could help you to a pre-release of the WinBoard executable, or the source code, but I cannot make Linux compiles myself.
krazyken

Re: Correlated data discussion - Silver Position Result

Post by krazyken »

bob wrote:
Fritzlein wrote:
MartinBryant wrote:OK I've re-run the test on Bob's randomly selected Silver position.
[...]
In 100 games there was ONE duplicate game pair
For comparison, if I had a bag of 7,213 game results, and I picked out of that bag with equal likelihood of getting each result, then in 100 picks I would have a 50% chance of getting no duplicates (birthday paradox). Therefore the actual result of getting 1 duplicate game in 100 playouts is consistent with there being only 7,213 playouts you will ever get, each of them equally likely. Of course, in practice there will probably be more playouts that you could eventually get if you tried long enough, because they won't all be equally likely; some will be more likely than others, and that's where you will get your duplicates. I just wanted to throw out a ballpark number to help guide the intuition of how much variation this is.

Come to think of it, it's pretty much the same as the ballpark number you threw out there yourself, namely 4,950 (pairs of comparisons)...
MartinBryant wrote:There's still a mass (excuse the un-scientific term) of variation.
Yes indeed, lots of different playouts possible. But if we look at the results of the playouts (win, draw, loss) does the distribution correspond to the distribution for a randomly chosen position? If the average over all positions (or even just over all balanced positions) is 30% white wins, 45% draw, 25% black wins, and looking at the actual playouts of this position (drawn from a pool of 4,950 or 7,213 or whatever) there are 10% white wins, 40% draws, and 50% black wins, then reusing this position is causing strong correlation in scores.

Blame insufficient clock jitter, blame the engine for not understanding black's strategic weaknesses, blame sunspots if you like, but if the results of playouts of this position are not the same shape as results of playouts from a random position, then there is correlation of the results to the position.
All that I can add is that if you don't shut up, my head is going to explode, much like it feels after some of Gerd's bit-wizard stuff.

One interesting experiment I could try later is to determine just how many different games are possible from a single position. I could just start playing those 1/2 second games, and continue until I play a hundred thousand or so without getting a "new" result. That won't be an exact upper bound of course, but it would be a sort of monte-carlo approach to get real close to the right number.
That is an experiment I am very interested in. I suspect the number of games between two engines from one position for a given time control is not as large as people would think. With all the latest techniques to reduce branching factor, and so forth, the number of playable moves should be far below the number of possible moves.
Also it would be nice to be able to factor transpositions into the duplicate count. It would probably be reasonable to terminate the game once it reaches 6 pieces remaining.
tvrzsky
Posts: 128
Joined: Sat Sep 23, 2006 7:10 pm
Location: Prague

Re: 4 sets of data

Post by tvrzsky »

Uri Blass wrote:
Stopping at the end of the iteration may not be important for finding bugs (and in ponder-on games, unlike ponder-off games, I have no rule to stop at the end of the iteration, and of course I may play in 0 seconds, not at the end of the iteration, in case of a ponder hit).

Clearing the hash clearly can help to find bugs and does not help only for deterministic testing.

Suppose that your program made a mistake in some game and you have only the game.

You know that the program made some mistake that you simply cannot reproduce on the first tries in analysis, because you cannot reproduce the hash situation.

You will have a harder time catching the bug that caused the program to make the mistake.

If you need more time to discover bugs then it means that you sacrifice future improvements.

Uri
You can still have reproducibility even if you are not clearing the hash tables!

All you have to do is log the number of actually searched nodes, or something like that (it depends on the implementation; for example, my engine logs the number of the exact node at which the request to break off the search occurred, because it often actually searches a few nodes more), after every search. Then you simply replay the game up to the point of interest, using those node counts instead of the time control to break off each search. After that, the hash table should be in exactly the same state it was in during the original game.
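A minimal sketch of both halves of that scheme, logging and replaying (the search entry point is only a stand-in; everything here is illustrative):

/* Replay-by-node-count: during the original game, log the node count at
 * which each search actually stopped; when replaying, feed those counts
 * back as hard node limits instead of using the clock, so the hash table
 * ends up in the same state move by move. */
#include <stdio.h>

/* --- during the game: call after every completed search --- */
static void log_stop_node(FILE *log, int move_number,
                          unsigned long long stop_node) {
    fprintf(log, "%d %llu\n", move_number, stop_node);
}

/* stand-in for the engine's real "search until exactly N nodes" entry */
static void search_with_node_limit(unsigned long long limit) {
    printf("replaying search, stopping at node %llu\n", limit);
}

/* --- during replay: read the log and redo each search by node count --- */
static void replay_game(FILE *log) {
    int move_number;
    unsigned long long stop_node;
    while (fscanf(log, "%d %llu", &move_number, &stop_node) == 2)
        search_with_node_limit(stop_node);   /* hash table evolves as in the game */
}

int main(void) {
    FILE *log = tmpfile();                   /* stands in for the real log file */
    if (!log) return 1;
    log_stop_node(log, 1, 812345ULL);        /* made-up node counts */
    log_stop_node(log, 2, 790112ULL);
    log_stop_node(log, 3, 845003ULL);
    rewind(log);
    replay_game(log);
    fclose(log);
    return 0;
}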

Of course it can be quite time consuming if you replay a long-time-control game. I use this method for my Arena tournaments, which usually run at a few seconds per move. I have a Perl script which extracts from the 'arena.debug' file all the commands my engine received during the game, as well as the node counts. It adds commands for breaking off the search, switching tree dumping on/off, etc. So it creates a batch file of xboard/my proprietary commands which I send to the standard input of my engine when starting it on the command line. The whole process is almost fully automated; otherwise it would surely be too annoying and labor intensive.

Filip