New testing thread

MartinBryant · Post by **MartinBryant** » Mon Aug 11, 2008 12:32 am

krazyken wrote:
bob wrote:
krazyken wrote:
MartinBryant wrote:
krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
Then 100 different games is not surprising to me. If you did the 100 games from one of the silver positions, I'd expect the number of duplicates to be greater.
Why? they are opening positions as well, and starting at the initial position, you may well encounter one of those along the way any and things should stabilize there if that is your belief. The variation occurs in any position that is either very balanced, or if it is not, where there are multiple nearly equal alternatives at some point (or multiple points) in the game as it progresses. Those are the positions where things will change. And this happens in the SIlver positions quite frequently, as remember those are the positions I have been using to produce all this volatile behavior.
It would seem to me that after going several moves down an opening line the number of equally viable alternatives would diminish the further you go. Also, Openings are frequently handled by special code and evaluations as compared to he rest of the game are they not? By starting further in to the game you are reducing the amount this code effects the outcome of the game. So the further you go into the game you should see less variability in repeated experiments.

Bob, could you provide a single Silver position at random and I will happily re-run the test for interest.

krazyken · Post by **krazyken** » Mon Aug 11, 2008 12:42 am

MartinBryant wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
MartinBryant wrote:
krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
Then 100 different games is not surprising to me. If you did the 100 games from one of the silver positions, I'd expect the number of duplicates to be greater.
Why? they are opening positions as well, and starting at the initial position, you may well encounter one of those along the way any and things should stabilize there if that is your belief. The variation occurs in any position that is either very balanced, or if it is not, where there are multiple nearly equal alternatives at some point (or multiple points) in the game as it progresses. Those are the positions where things will change. And this happens in the SIlver positions quite frequently, as remember those are the positions I have been using to produce all this volatile behavior.
It would seem to me that after going several moves down an opening line the number of equally viable alternatives would diminish the further you go. Also, Openings are frequently handled by special code and evaluations as compared to he rest of the game are they not? By starting further in to the game you are reducing the amount this code effects the outcome of the game. So the further you go into the game you should see less variability in repeated experiments.
Bob, could you provide a single Silver position at random and I will happily re-run the test for interest.

If you let me know the conditions you are testing under, I can also run a parallel test here.

xsadar · Post by **xsadar** » Mon Aug 11, 2008 1:19 am

Ok, from reading both your reply to me and your reply to Greg Simpson, what I'm starting to think you're saying/implying is something like this:
Suppose we take an engine A and modify the evaluation in a way that will not affect playing strength to get an engine A'. Now if we test both A and A' with the same node counts, the results will vary just as much as if we had just tested A twice with slightly varying node counts (like that of time jitter).
I know you're saying the results between A and A' will vary, and I agree; my question is how much? And that seems important. My first inclination was that they wouldn't vary quite as much as using different node counts, but after thinking about it, that could be possible.

Fritzlein · Post by **Fritzlein** » Mon Aug 11, 2008 1:22 am

MartinBryant wrote:To be fair they're not entirely anecdotal.
I don't know all the conditions of Bob's earlier experiments where he pointed this issue out but anybody can repeat the experiment I did recently which I reported here earlier which you may have missed.
I played Fruit v Fruit 100 games at a 'fixed' time of 0.5 secs/move from the starting position with no book. The small time fluctuations quickly caused the move choices to change and it produced 100 unique games.
Now with your expert mathematical brain this may not strike you as too extreme but I think it looks pretty convincing to the layman.
Could you offer your opinion on that experiment?

The fact that there are one hundred unique games does not mean that the game results are uncorrelated. For example, if you used the starting position of chess, there's a fair chance that white won more often than black. It would be a long-term money maker for me to bet on whoever won the first game to win the other ninety-nine. I hear you cry, "That's not the kind of correlation I'm talking about!", but your position has been that there is a complete absence of correlation in playouts. For me to win my bet I don't have to specify what kind of correlation exists, I just need there to be enough correlation to overcome Fruit's greater playing strength.

But also there can be correlations apart from the starting position being unbalanced. Specifically, the games didn't all immediately diverge, did they? Were there games that shared the first ten moves, or the first twenty? If two games have the same first twenty moves, might we expect the same side to have won both of them with higher probability than each side winning one?

How much was the clock jitter in your experiment? In our proposed bet we were going to vary node counts from 0.3% up to 3%. If your jitter was in this range or smaller, then yes, that will make we worry more about my bet.

Before you did your one hundred playouts, did you do a hundred playouts with fixed node counts to verify that those results were all identical, i.e. to verify that there was no other source of randomness than clock jitter? If so, that weakens my confidence yet further. My bet is based on assurances that nothing is affecting results other than clock jitter.

One source of correlation that is guaranteed to be in our bet which I know you didn't eliminate is which engine plays which side. All else being equal, repeated Crafty vs. Fruit playouts are more likely to be correlated than repeated Fruit vs. Fruit playouts. If we stumble on a type of position that happens to be more favorable to Crafty's evaluation/search, then that factor being the same in all playouts might overcome the node count difference between playouts. So no matter what you demonstrate with Fruit vs. Fruit, I still have an ace in the hole.

MartinBryant wrote:
Fritzlein wrote: If there are no statistics, let me ask how good you think you are at distinguishing a coin that lands heads 55% of the time from one that lands heads 45% of the time on the basis of looking at a few flips.
But I would feel more confident after 100 flips of my own and 'hearing' that another respectable source also considered the coin suspect.

The analogy is breaking down. Doing a hundred coin flips and keeping track of what percentage are heads is a measurement of exactly the quantity in question. I respect that measurement. Doing a hundred playouts and saying they are all different isn't measuring correlation of win/loss results. You are observing a phenomenon that is related to correlation, but you are not measuring correlation.

You say the moves "quickly" diverge, which is just eyeballing it, not doing math. I admit, you can get a general sense for how much randomness is there. For example, if lots of the games had happened to have forty moves in common and had only diverged in the endgame, you might have felt that the playouts were strongly correlated and reported that. Since you didn't mention any nearly-identical games, that tells me something. But I submit that it is beyond anyone's powers of observation to conclude from divergent playouts that the win/loss results are "totally random" with "no correlation".

MartinBryant wrote:One experimental point missed... would you want Fruit to be playing to a fixed 3,000,000 nodes (favours you) or would it's node count match Crafty's (favours me) ?

Heheh, if either of us strongly believed the case, they would cede this point to the other. Very well, I'll puff up my chest and say I'm going to win despite giving this one to you: let the node counts vary in tandem. That is to say, in game one both sides get 3,000,000 nodes, in game two both sides get 3,010,000 nodes, etc. (Incidentally, I'm assuming that both engines do roughly the same NPS. If giving both engines the same node count makes Fruit more than a 60% favorite, I'm not willing to give that one away. So if necessary, we should peg the node counts to the same ratio as the ratio of the NPS that each engine does.)

MartinBryant wrote:Well that's moving the goalposts slightly as the node counts he's getting in his matches will be WAY bigger than 3,000,000.

Oh, I'm displaying my ignorance there. I wanted it to mimic what Bob was doing before, but I have no idea what node counts are typical

MartinBryant wrote:I guess I feel less confident about my conviction the smaller the percentage change in node count. And I guess you would feel the reverse?

Yes, exactly. The larger the percentage variation in node counts, the less correlation I expect to be present.

MartinBryant wrote:I'm not a mathematician so excuse this if it's a dumb question, but as a mathematician, if I gave you two runs of 201 game results, you could apply some formula to measure the correlation of which you speak. If so, does no correlation imply chaotic behaviour? Or would your formula measure at least some correlation between ANY two completely random, unrelated sets of wins/draws/losses?

Far from a dumb question, it is exactly the right question. And the answer is tricky.

Yes, the correlation can be measured between any two runs of 201 game results. There is a formula, and it will output a number. Interestingly, the coefficient of correlation can be positive or negative. If the results we are trying to correlate are completely random, then the coefficient of correlation in the sample is as likely to be negative as it is to be positive.

But let's think what might happen. Let's say in the first run white won 90% and in the second run, black won 90%. Then the result is highly correlated to the position. But what if the first position was very unbalanced in white's favor, and the second position was very unbalanced in black's favor? Well, tough darts for you. I will win my Crafty vs. Fruit bet if there exists enough correlation of any kind, no matter what the source, so if you want to win, you had better hope that the input positions we are sampling from are all relatively balanced, i.e. that this obvious source of correlation is not present.

Let's take the heat off of the two particular positions by rotating the experiment ninety degrees. Instead of having two positions played 201 times each, let's have 201 positions played twice each. Then we measure whether the winner of the first play of each position is likely to be the winner of the second play of each position. If the playouts are uncorrelated, then there should be as many pairs of playouts with different winners as there are pairs of playouts with the same winner. If the two repeat playouts are at all correlated, then the winners will be the same more often than they are different. (If it is truly random, then the winners might even be different more often than they are the same, in which case the correlation will look negative, but one must suppose that is a fluke.)

Now if your 201 positions are a representative sample from your opening book, you might suppose that white has a 55% chance to win, ignoring draws for the moment. The chance that the results are the same is 0.55*0.55 + 0.45*0.45 = 50.5%, and the chance they are different is 0.55*0.45 + 0.45*0.55 = 49.5%. Aha! There is positive correlation since more are the same than different. Neener, neener. But if that's all the correlation that exists in our Crafty vs. Fruit bet, I'm going to lose.

You can calculate the sample coefficient of correlation quite easily yourself according to the formula given on Wikipedia, or if you like I will explain how to apply the Wikipedia formula, or if you want to give me the ordered pairs of results of two plays from 201 positions, I'll be happy to calculate it for you. But remember, you can win this battle with winning the war because
1. You might have another source of randomness than clock jitter while Bob doesn't.
2. Your clock jitter may be a greater percentage than the node count variation we proposed.
3. You have eliminated a strong source of correlation, namely the engine playing, by having Fruit play itself.

MartinBryant wrote:However, in lieu of such an experiment, can you explain why your 'gut' suggests there will be correlation? (I understand why mine doesn't but I'm not sure I could articulate it very well!)

Ah, does anyone really know why he thinks what he thinks? Intuition is always the sum of many inputs, including ones we are unaware of. And incidentally, applied mathematicians need good intuitions about the real world way more than pure mathematicians do. Pure mathematicians just need raw brainpower.

But anyway there is one justification I can give explicitly: If it turns out that clock jitter doesn't give correlated playouts, then my partial explanation for Bob's outrageous result that kicked off the thread disappears. With correlation in place, then a tiny systematic change in clock speed between the two runs of 25600 games explains his out of bounds result. Without correlation, then the change between the two trials must not be something that would hurt each bot with equal probability, it must be something that would hurt Crafty specifically. For all the suggestions that people have thrown around regarding what might not have been the same before and after, there has been no plausible explanation of what might have been different that would hurt Crafty more than it hurts the other guys. So I'll assume that the change didn't affect Crafty specifically, which tends strongly to the existence of some kind of correlation.

Of course, I could always be wrong.

bob · Post by **bob** » Mon Aug 11, 2008 1:23 am

krazyken wrote:
bob wrote:
krazyken wrote:
MartinBryant wrote:
krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
Then 100 different games is not surprising to me. If you did the 100 games from one of the silver positions, I'd expect the number of duplicates to be greater.
Why? they are opening positions as well, and starting at the initial position, you may well encounter one of those along the way any and things should stabilize there if that is your belief. The variation occurs in any position that is either very balanced, or if it is not, where there are multiple nearly equal alternatives at some point (or multiple points) in the game as it progresses. Those are the positions where things will change. And this happens in the SIlver positions quite frequently, as remember those are the positions I have been using to produce all this volatile behavior.
It would seem to me that after going several moves down an opening line the number of equally viable alternatives would diminish the further you go. Also, Openings are frequently handled by special code and evaluations as compared to he rest of the game are they not? By starting further in to the game you are reducing the amount this code effects the outcome of the game. So the further you go into the game you should see less variability in repeated experiments.

I can see variability in fine #70 in fact. It is not so much about the evaluation, or the search, as it is about the search space changing very slightly as a result of using a different amount of time by some small fraction of a second. And all you need are positions where there are at least two nearly equal alternatives, and you are set up for variability as the tree shape changes. These can even happen if you are losing badly, when there are two or more nearly equal alternatives at some point.

bob · Post by **bob** » Mon Aug 11, 2008 1:26 am

r1b1kbnr/pp3ppp/1qn1p3/3pP3/2pP4/P1P2N2/1P3PPP/RNBQKB1R w KQkq - fmvn 7; id "Silver Suite - French, Advance : ECO C02";

that is from somewhere near the front, removed with a couple of head -n, tail -1 type commands...

bob · Post by **bob** » Mon Aug 11, 2008 1:27 am

krazyken wrote:
MartinBryant wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
MartinBryant wrote:
krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
Then 100 different games is not surprising to me. If you did the 100 games from one of the silver positions, I'd expect the number of duplicates to be greater.
Why? they are opening positions as well, and starting at the initial position, you may well encounter one of those along the way any and things should stabilize there if that is your belief. The variation occurs in any position that is either very balanced, or if it is not, where there are multiple nearly equal alternatives at some point (or multiple points) in the game as it progresses. Those are the positions where things will change. And this happens in the SIlver positions quite frequently, as remember those are the positions I have been using to produce all this volatile behavior.
It would seem to me that after going several moves down an opening line the number of equally viable alternatives would diminish the further you go. Also, Openings are frequently handled by special code and evaluations as compared to he rest of the game are they not? By starting further in to the game you are reducing the amount this code effects the outcome of the game. So the further you go into the game you should see less variability in repeated experiments.
Bob, could you provide a single Silver position at random and I will happily re-run the test for interest.
If you let me know the conditions you are testing under, I can also run a parallel test here.

what he had said previously was .5 seconds per move, from the opening position. Or in this case, from a position that is still in the opening part of the game, but not the starting position. Fruit vs Fruit. Or X vs X for that matter.

bob · Post by **bob** » Mon Aug 11, 2008 1:34 am

xsadar wrote:Ok, from reading both your reply to me and your reply to Greg Simpson, what I'm starting to think you're saying/implying is something like this:
Suppose we take an engine A and modify the evaluation in a way that will not affect playing strength to get an engine A'. Now if we test both A and A' with the same node counts, the results will vary just as much as if we had just tested A twice with slightly varying node counts (like that of time jitter).
I know you're saying the results between A and A' will vary, and I agree; my question is how much? And that seems important. My first inclination was that they wouldn't vary quite as much as using different node counts, but after thinking about it, that could be possible.

Look at my crafty vs crafty tests. I changed the node count by +/- 100, and 1000, and such. X and X+100 changed the results and the programs were identical. So that 100 nodes was the only difference.

If you do this kind of programming long enough, you will see lots of completely unexpected results. For example, you have a set of early endgame positions, and you find that in one you play a bad move but in the others you find good moves and very quickly. One such position is win at chess 230, where the suggested move is the only possible move that can win, and even that is not a clear win so far. But everything else draws. And some programs get this one, and some get it quickly. So you take your program that gets that one quickly, and add a small endgame eval tweak to fix the one it was blowing. And it does that one much better now. And when you re-run the test, now you don't get WAC230 quickly any more. In fact it takes you 2x longer to get to the same depth than the previous version. Is this one worse on WAC 230? Not really, except for the time. It will see it at the same depth as before, and with the same score, but that new eval term altered other places in the tree where you were getting cutoffs before and now are not, and vice-versa. And by simply slightly altering the shape of the tree at some point, everything changes.

I've had that happen more times than I can count. And the real trick to this stuff is to be able to recognize when you actually broke something related to wac230, as opposed to where your change just had a random effect on the tree that now makes it take longer. That's an interesting issue and is one reason why I don't rely on position tests at all, except for sanity testing when I change the extensions or reductions code.

bob · Post by **bob** » Mon Aug 11, 2008 1:38 am

hgm wrote:
bob wrote:I started the thread, remember? If you want, I can back up a few posts and grab the tit-for-tat points to show where my comment came from. I doubt most would need that guidance, however.
Oh, I know where it came from. The point is that you posted that same comment in the other thread, where it was a totally non-sensical remark... And did not even relize that when you got the hint!
The programs are _hardly_ close together in strength, if you'd just look at the results. A relative rating difference of 300 top to bottom is quite a spread...
But there is no difference of 300, as the 4 full round-robins you did clearly show. Glaurung is about +75, Arasan about -50. That is only 125. With fewer games you have the larger statistical error quoted by BayesElo, which is again augmented by the fact that the 'World' plays only a single opponent in the Crafty vs World match. This causes the larger spread there. But it is pure coincidence, with another engine then Arasan as weakest one you might get the exact opposite.

Let me refresh your memory just a tad:

Code: Select all


760 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115   44   42   153   69%   -31   17%
   2 Fruit 2.1               64   42   41   149   63%   -31   18%
   3 Glaurung 1.1 SMP        48   42   41   152   61%   -31   16%
   4 opponent-21.7           20   38   37   152   58%   -31   41%
   5 Crafty-22.2            -31   19   19   760   45%     6   22%
   6 Arasan 10.0           -216   43   45   154   26%   -31   16%

There I see a range of +115 to -216. Last time I looked that was a range of over 300. Don't know what your problem is, but mine is neither math nor memory.

Perhaps replace "good" with "no" and you will get it right. Ever heard of anyone kicking a program out because it seemed to lose a few where it should not and vice-versa?
Yes, I have heard of that. Someone complaining that testers wouldn't test his engine because it was 'too unpredictable'. Forgot which engine it was, though.

Never seen such a thing reported anywhere I read. I have seen programs excluded because they crash, or they play illegal moves, or they refuse to accept certain types of illegal moves like under-promotions, or... But not because they just lose an odd game here and there for no reason.

Of course you haven't. So, continue trying to act superior. But it is an act, not a fact.

To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it...
Well, we went through that before. Too bad it didn't stick, so let me remind you: The conclusion then was that Joker and Crafty essentially have the same time management. So that cannot be the explanation.

Please. "that cannot be the explanation" just won't cut it. You said you don't do partial iterations. I do. Clearly I can terminate the search after any number of nodes. You have discrete points where you can terminate the search. If you terminate in mid-iteration then you will have this kind of non-uniform behavior, unless your search and evaluation are so basic that minor changes in the nodes is not enough to change the shape of the tree, or you are not carrying hash from move to move as most do, or something along those lines. It would be pretty easy to get rid of this effect. By first clearing the hash after each move. But I'm not willing to give up the performance that costs...

The other thing that surfaced was that you actually had no idea at all how much Elo the finish-all-iterations would actually cost compared to a more sensible scheme, so that the 'reason' you refer to might just as well be described as a 'superstition'.

Actually I did have an idea, but no measurements made over the last 10 years. We were the first to do this approach in the old Blitz program, back when we wrote the original "using time wisely" paper in early in the ICCA journal series. And we did a lot of testing, even way back then, to compare old and new. We only had a few opponents we could automate test with, but we did it. Whether you believe it is better or worse is not an issue, believe what you want. But then wonder why almost everyone else _is_ doing it this way, and you might come to a startling conclusion...

But, fortunately, I could calculate a theoretical estimate for this number, which came to ~7 Elo. Wow, big deal. No wonder that engines that do that are all at the very botom of the rating list...

I'm not impressed with numbers pulled out of that particular bodily orifice. You can't even _measure_ a change that small so there's no point in discussing such, unless we move this to computer chess fiction or something.

Most are "that lucky".
Well, as I said, good for them. But not of any practical importance, as the randomization is trivial to program.

And of no beneficial value, since we want to play the _best_ move possible in every situation, unless you get into playing weaker players where the Crafty "skill command" can certainly introduce randomness. But not in any games I want to observe and analyze.

And one other note. If you want to look even reasonably professional here, you will stop with the over-use of emoticons and net-speak such as lol and such. It makes you look like a ... well it makes you look like _exactly_ what you are, in fact. You will notice most others do _not_ do that.
Well, that is really great! I would never want to look something I am not. But if you rather look like something you are not, do as yoou plese. It is a free world.

Just goes to show not everyone can/will listen to good advice. No surprise there of course.

bob · Post by **bob** » Mon Aug 11, 2008 1:58 am

Uri Blass wrote:You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
Performance penalty from searching only complete iterations is clearly
less than 100 elo(and I guess something near 20 elo) and I believe that there are programmers who do not care about these small numbers and prefer deterministic results in testing so they may be able to reproduce everything(My opinion is that it is better to sacrifice 20 elo for being able to reproduce everything easily).

I play so many games against other computers on ICC, and I see the analysis they whisper/kibitz when I am watching and we both turn it on. Even in CCT events as well. And _everybody_ that I played was producing output claiming that not all root moves were searched each time a move was played. Sometimes they started a new depth and got nothing back, sometimes they were in the middle, and perhaps very rarely they timed out when the iteration ended. But it is easy to watch 'em. I don't believe _anybody_ designs a program in any way other than to make it as strong as possible. And that includes me. I know how to make a perfectly deterministic parallel search. But with absolutely nowhere near the speedup I now produce. So non-determinism is accepted along with better performance. Ditto for time allocation and each other decision we make. But find me any _serious_ chess programmer that would say "sure, I will give up significant Elo just to make testing more deterministic (although no more accurate)"

You can be sure that not all programs at level that is close to Crafty's level exhibit the variability that you talk about.

When I first ran into this, I tried several. Both Glaurungs, fruit, gnuchess, crafty, arasan 9/10, and a couple of others that I won't name but which I could get to run under linux. And they _all_ had this issue. Yes, there could certainly be a program out there that terminates searches on iteration boundaries, that clears the hash between moves, that does anything else needed to minimize variability. But there certainly are not very many of 'em because we all take any rating improvement we can get, and we aren't going to throw away 30 here, 20 there, as much of that produces a pure patzer.

Not all programmers of programs at that level care much about playing strength and you can be sure that there are people who do not care if their program is 500 elo weaker than rybka or 480 elo weaker than rybka.

I did not work lately about Movei but if I come back I will certainly not care at levels that are more than 100 elo weaker than rybka about this small improvement that make result not reproducable and make it harder to find bugs because I see that the program played a move that I cannot reproduce.

Uri

I'm the opposite, trying to take everything I can get. Most are...

New testing thread

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: Correlated data discussion

Re: 4 sets of data

Re: 4 sets of data