New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:First, look at the data. Second, look at what I said changed. The ratings of all programs were squeezed into a narrower band with the round-robin.
Aha! So you were posting in the wrong thread, after all! And too thick to notice it even when warned once... So let it be clear to all now which of us is the one who actually does not read the thread he is posting in. :lol: :lol: :lol:
I started the thread, remember? If you want, I can back up a few posts and grab the tit-for-tat points to show where my comment came from. I doubt most would need that guidance, however.
That certainly gave a better estimate of each program's rating than just the first set of games. Didn't help me much there, but if you look at the two crafty versions, their "distance apart" collapsed quite a bit, which _was_ significant.
Oh yeah, big surprise. You give them more games and the rating estimate gets more accurate. And you have selected programs that are close in strength, so their Elo spread is easily dominated by the statistical error, which gets smaller when they have more games. Btw, if that is your conclusion (that the ratings are closer), it is not justified on the basis of this data, and very likely wrong: the program that happens to be strongest (Glaurung) also happened to be fairly lucky, or better against Crafty than the rating difference would suggest (because of playing style). One single run of Crafty vs World is not enough to determine that. But whichever of these two explanations is valid, there is no guarantee at all that the best engine you include in the test will always be lucky, or always have a playing style that Crafty handles below average. And if the opposite is true, the ratings would not compress, but expand, on inclusion of the World vs World games. (Yes, and I can know all that without actually doing the test, because it is all so obvious...)
The programs are _hardly_ close together in strength, if you'd just look at the results. A relative rating difference of 300 top to bottom is quite a spread...

A good tester would recognize such engines as variable, and delete their results from his tests. This is why you have to keep track of variance and correlations within the results of each engine.
Har-de-har-har. So a tester is going to decide "hey, program behaves strangely, and plays better in even months than in odd months" and throw it out. I don't think so.
No, obviously not. This is why I said: good tester...
Perhaps replace "good" with "no" and you will get it right. Ever heard of anyone kicking a program out because it seemed to lose a few where it should not and vice-versa? Of course you haven't. So, continue trying to act superior. But it is an act, not a fact.
Until I brought this up, I'd bet you had _zero_ idea that the results are as volatile as they are.
When I play Joker, they are actually maximally volatile, because I always play Joker with randomization on. So I am not dependent on any form of time jitter to create independent games. But I admit that I have problems with identical or largely identical games if I switch the randomization off. Micro-Max has no randomization, and when repeating the Silver test suite against Eden, many of the games were completely identical; where they differed, it was mostly because Eden deviated. Micro-Max deviated on average only once every 40 moves. Those are the well-investigated facts. So what? Apparently not all engines display the same natural variability.
To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it...


But to do accurate testing, variability is essential. So I solved the problem by randomizing.
Later you suggested that Crafty was an odd program and went into micro-Max/Joker and full-iteration discussions. And now we discover that from the same starting position, even Fruit won't play the same game twice in 100 tries, no book or anything.
Well, perhaps Fruit randomizes too. I have no idea of how Fruit works. Are you sure that you play it with randomization off?
It has no randomness built in to it. So yes, it is off. Ditto for the others I use and any other serious program I am familiar with....
So it is easy to "notice" something now, after it has been pointed out. Had I said nothing, the results would have kept on coming in and nobody would have even thought about the randomness that is inherent in computer games, far more so than in human games.
Perhaps not, as it is not really an interesting or relevant subject. If you don't have sufficient randomness, that will be apparent quickly enough from game duplications. The required randomization is almost trivial to program. Others rely on books to randomize, or use a large number of starting positions, or a large number of opponents, to create the variability. If you are so lucky that the variability is intrinsic, well, good for you! Not all of us are so lucky. Again, I don't see that as a particular achievement.
Most are "that lucky".

And one other note. If you want to look even reasonably professional here, you will stop with the over-use of emoticons and net-speak such as lol and such. It makes you look like a ... well it makes you look like _exactly_ what you are, in fact. You will notice most others do _not_ do that.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Correlated data discussion

Post by Zach Wegner »

xsadar wrote:As often as the evaluation function is called, changing it has a high potential of drastically changing the nps, especially if you introduce a bug. Of course there are examples of rejecting changes for other reasons, but that's not the point. Of course you can't focus entirely on nps (would be nice if we could though), but you can't ignore it either. That's my point.

I'm not sure why you referred me to that article. Tord's quote is congruent with what I've said, particularly the part about orthogonality. And the rest of the article only emphasizes my point:
The evaluation is typically a collection of "rules of thumb" collected from hundreds of years of human experience. It would be practically impossible to program all of the rules that humans have come up with over the years. Even if you could code for all of them, it would be inadvisable for performance reasons. There must be a trade-off between knowledge and speed. The more time you spend evaluating a position, the less time you have to search, and therefore the less deeply your program can see. Some programs are designed with a very light evaluation function containing only the most basic features, letting the search make up the rest. Others prefer a heavy evaluation with as much knowledge as possible. Most programs have an evaluation somewhere in the middle, with the trend in recent years being towards simpler, lighter evaluation functions, the idea being that these evaluations are the most bug-free and maintainable, which is far more important in practice than having many obscure pieces of knowledge. A big influence in this direction was the advent of Fruit, which has a very minimal evaluation function, yet it is a very solid, strong engine.
Seeing as how performance (nps) is such an important factor, why would you ever want to completely disregard it in your testing? Especially when this thread is about measuring small changes in Elo.
It's funny that you quote that, because I wrote it. I should really extend it, because it doesn't explain the full picture. If you just add some generic term, it might help in most positions where that term is present, but it could hurt overall. That's the kind of effect I'm talking about. The fact that it minimally slows down the search isn't what makes it a bad term; the fact that it's a bad term does.

Anyways, my point is not that NPS doesn't matter at all; it's that if you are testing a minor change, NPS can be safely ignored. The effect of the evaluation itself and its influence on the shape of the tree matter much more.
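As a concrete, purely hypothetical example of the kind of "generic term" being discussed, a rook-on-open-file bonus might look like the sketch below (the board type and helper functions are placeholders for whatever the host engine provides). The added work is a few dozen instructions inside an evaluation that already costs thousands, so the NPS impact disappears in the noise, while the changed scores can reshape the whole tree.

#include <stdint.h>

typedef struct board board_t;                              /* engine's own board type */
extern uint64_t rook_bitboard(const board_t *b, int side); /* placeholder helper      */
extern uint64_t open_files_mask(const board_t *b);         /* files with no pawns     */

/* Hypothetical minor eval term: ~12 centipawns per rook on an open file. */
static int rook_open_file_bonus(const board_t *b, int side)
{
    uint64_t rooks = rook_bitboard(b, side) & open_files_mask(b);
    int count = 0;
    while (rooks) {            /* portable popcount */
        rooks &= rooks - 1;
        count++;
    }
    return 12 * count;
}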

I think one thing can be gained from this discussion though: it is basically impossible to measure differences like 1 Elo. Because chess engines are so deterministic, you just can't get enough "independent" samples to extrapolate to what happens in the real world, where there are clocks, books, parallel searches, etc.
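For a feel of why something like 1 Elo is out of reach, here is a rough back-of-envelope (my own sketch, not taken from BayesElo or from any of the engines discussed; the 0.4 points-per-game spread and the 95% confidence level are assumptions):

#include <math.h>
#include <stdio.h>

/* Roughly how many independent games are needed before the 95% error
   bar on a measured rating difference shrinks to a given size, for two
   nearly equal opponents with a per-game standard deviation of about
   0.4 points (typical once draws count as 0.5). */
int main(void)
{
    const double sigma_game = 0.4;                          /* points per game    */
    const double elo_per_point = 4.0 * 400.0 / log(10.0);   /* dElo/dscore at 50% */
    const double targets[] = { 10.0, 5.0, 2.0, 1.0 };       /* +/- Elo resolution */

    for (int i = 0; i < 4; i++) {
        double se = targets[i] / (1.96 * elo_per_point);    /* needed std. error  */
        double games = (sigma_game / se) * (sigma_game / se);
        printf("+/- %4.1f Elo needs roughly %.0f games\n", targets[i], games);
    }
    return 0;
}

With those assumptions it comes out near 3,000 games for +/- 10 Elo and around 300,000 for +/- 1 Elo, which is why the shortage of truly independent samples bites so hard.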
krazyken

Re: Correlated data discussion

Post by krazyken »

MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
MartinBryant

Re: Correlated data discussion

Post by MartinBryant »

krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
krazyken

Re: Correlated data discussion

Post by krazyken »

MartinBryant wrote:
krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
Then 100 different games is not surprising to me. If you did the 100 games from one of the silver positions, I'd expect the number of duplicates to be greater.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

krazyken wrote:
MartinBryant wrote:
krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
Then 100 different games is not surprising to me. If you did the 100 games from one of the silver positions, I'd expect the number of duplicates to be greater.
Why? They are opening positions as well, and starting at the initial position you may well encounter one of them along the way anyway, and things should stabilize there if that is your belief. The variation occurs in any position that is either very balanced or, if it is not, where there are multiple nearly equal alternatives at some point (or multiple points) in the game as it progresses. Those are the positions where things will change. And this happens in the Silver positions quite frequently; remember, those are the positions I have been using to produce all this volatile behavior.
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 4 sets of data

Post by hgm »

bob wrote:I started the thread, remember? If you want, I can back up a few posts and grab the tit-for-tat points to show where my comment came from. I doubt most would need that guidance, however.
Oh, I know where it came from. The point is that you posted that same comment in the other thread, where it was a totally nonsensical remark... And did not even realize that when you got the hint! :lol:
The programs are _hardly_ close together in strength, if you'd just look at the results. A relative rating difference of 300 top to bottom is quite a spread...
But there is no difference of 300, as the 4 full round-robins you did clearly show. Glaurung is about +75, Arasan about -50. That is only 125. With fewer games you have the larger statistical error quoted by BayesElo, which is further increased by the fact that the 'World' engines each play only a single opponent in the Crafty vs World match. This causes the larger spread there. But it is pure coincidence; with another engine than Arasan as the weakest one you might get the exact opposite.
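To put a number on the "fewer games, larger error" effect (again a back-of-envelope of my own, not BayesElo's actual computation; the 0.4 points-per-game spread is an assumption):

#include <math.h>

/* Approximate 1-sigma error bar, in Elo, on a rating estimated from n
   games between roughly equal opponents, assuming a per-game standard
   deviation of about 0.4 points.  E.g. n = 100 gives ~28 Elo and
   n = 1600 gives ~7 Elo, so a single-opponent gauntlet is bound to show
   a wider, noisier spread than several full round-robins. */
double elo_error_bar(int n)
{
    return 695.0 * 0.4 / sqrt((double)n);    /* 695 ~ dElo/dscore at a 50% score */
}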
Perhaps replace "good" with "no" and you will get it right. Ever heard of anyone kicking a program out because it seemed to lose a few where it should not and vice-versa?
Yes, I have heard of that. Someone complaining that testers wouldn't test his engine because it was 'too unpredictable'. Forgot which engine it was, though.
Of course you haven't. So, continue trying to act superior. But it is an act, not a fact.

To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it...
Well, we went through that before. Too bad it didn't stick, so let me remind you: the conclusion then was that Joker and Crafty essentially have the same time management. So that cannot be the explanation. The other thing that surfaced was that you actually had no idea at all how much Elo the finish-all-iterations approach would actually cost compared to a more sensible scheme, so the 'reason' you refer to might just as well be described as a 'superstition'. But, fortunately, I could calculate a theoretical estimate for this number, which came to ~7 Elo. Wow, big deal. No wonder that engines that do that are all at the very bottom of the rating list...
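For what it is worth, a number of that order drops out of a very crude model (my own reconstruction, not necessarily the calculation referred to above; both constants below are assumptions):

#include <math.h>

/* If finishing the last iteration effectively misallocates a fraction
   `wasted` of the thinking time, and a doubling of thinking time is
   worth about `elo_per_doubling` Elo, the cost is the Elo value of the
   lost fraction.  With wasted = 0.07 and elo_per_doubling = 70 this
   comes out around 7 Elo. */
double finish_iterations_cost(double wasted, double elo_per_doubling)
{
    return -elo_per_doubling * log2(1.0 - wasted);
}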
Most are "that lucky".
Well, as I said, good for them. But not of any practical importance, as the randomization is trivial to program.
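For anyone wondering what that trivial randomization amounts to, something along these lines already does the job (a sketch with made-up names, not Joker's actual code): a few centipawns of noise on each root move score breaks ties differently from game to game without changing the ranking of clearly better moves.

#include <stdlib.h>

#define RANDOM_MARGIN 4    /* +/- up to 4 centipawns of noise */

/* Add a small pseudo-random offset to a root move's score.  Seed the
   generator once per game (e.g. from the clock) so that nominally
   deterministic engines still produce different games from the same
   starting position. */
static int randomize_root_score(int score)
{
    return score + (rand() % (2 * RANDOM_MARGIN + 1)) - RANDOM_MARGIN;
}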
And one other note. If you want to look even reasonably professional here, you will stop with the over-use of emoticons and net-speak such as lol and such. It makes you look like a ... well it makes you look like _exactly_ what you are, in fact. You will notice most others do _not_ do that.
Well, that is really great! I would never want to look like something I am not. But if you would rather look like something you are not, do as you please. It is a free world.
:lol: :lol: :lol:
Uri Blass
Posts: 10820
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 4 sets of data

Post by Uri Blass »

You say:"To use your phraseology, all _good_ engines seem to exhibit this variability. The new ones with simplistic time controls, searching only complete iterations, and such won't. But that is a performance penalty that will hurt in OTB play. So good programs are not being designed like that. There's reasons for it... "

my response:
The performance penalty from searching only complete iterations is clearly less than 100 Elo (I would guess something near 20 Elo), and I believe there are programmers who do not care about these small numbers and prefer deterministic results in testing so that they can reproduce everything. (My opinion is that it is better to sacrifice 20 Elo for being able to reproduce everything easily.)

You can be sure that not all programs at a level close to Crafty's exhibit the variability that you talk about.

Not all programmers of programs at that level care much about playing strength, and you can be sure that there are people who do not care whether their program is 500 Elo weaker than Rybka or 480 Elo weaker.

I have not worked on Movei lately, but if I come back to it, then at levels more than 100 Elo weaker than Rybka I will certainly not care about this small improvement that makes results non-reproducible and makes it harder to find bugs, because I would see that the program played a move that I cannot reproduce.

Uri
krazyken

Re: Correlated data discussion

Post by krazyken »

bob wrote:
krazyken wrote:
MartinBryant wrote:
krazyken wrote:
MartinBryant wrote:Forgot to mention...
I repeated the Fruit experiment with Spike and Colossus too.
Again no duplicates in 100 games.
I somehow missed this experiment, what were the conditions it was run under? Same starting position?
Yes.
The normal starting position for a game of chess.
Then 100 different games is not surprising to me. If you did the 100 games from one of the silver positions, I'd expect the number of duplicates to be greater.
Why? They are opening positions as well, and starting at the initial position you may well encounter one of them along the way anyway, and things should stabilize there if that is your belief. The variation occurs in any position that is either very balanced or, if it is not, where there are multiple nearly equal alternatives at some point (or multiple points) in the game as it progresses. Those are the positions where things will change. And this happens in the Silver positions quite frequently; remember, those are the positions I have been using to produce all this volatile behavior.
It would seem to me that after going several moves down an opening line, the number of equally viable alternatives would diminish the further you go. Also, openings are frequently handled by special code and evaluation terms compared to the rest of the game, are they not? By starting further into the game you are reducing the amount this code affects the outcome of the game. So the further you go into the game, the less variability you should see in repeated experiments.
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

Zach Wegner wrote:
xsadar wrote:As often as the evaluation function is called, changing it has a high potential of drastically changing the nps, especially if you introduce a bug. Of course there are examples of rejecting changes for other reasons, but that's not the point. Of course you can't focus entirely on nps (would be nice if we could though), but you can't ignore it either. That's my point.

I'm not sure why you referred me to that article. Tord's quote is congruent with what I've said, particularly the part about orthogonality. And the rest of the article only emphasizes my point:
The evaluation is typically a collection of "rules of thumb" collected from hundreds of years of human experience. It would be practically impossible to program all of the rules that humans have come up with over the years. Even if you could code for all of them, it would be inadvisable for performance reasons. There must be a trade-off between knowledge and speed. The more time you spend evaluating a position, the less time you have to search, and therefore the less deeply your program can see. Some programs are designed with a very light evaluation function containing only the most basic features, letting the search make up the rest. Others prefer a heavy evaluation with as much knowledge as possible. Most programs have an evaluation somewhere in the middle, with the trend in recent years being towards simpler, lighter evaluation functions, the idea being that these evaluations are the most bug-free and maintainable, which is far more important in practice than having many obscure pieces of knowledge. A big influence in this direction was the advent of Fruit, which has a very minimal evaluation function, yet it is a very solid, strong engine.
Seeing as how performance (nps) is such an important factor, why would you ever want to completely disregard it in your testing? Especially when this thread is about measuring small changes in Elo.
It's funny that you quote that, because I wrote it.
Yes, I had noticed and was amused that you had written the last sentence (but it looks like Pawel_Koziol wrote -- or at least posted -- most of what I quoted). I almost commented on it in my last post, but decided not to.
I should really extend it, because it doesn't explain the full picture. If you just add some generic term, it might help in most positions where that term is present, but it could hurt overall. That's the kind of effect I'm talking about. The fact that it minimally slows down the search isn't what makes it a bad term; the fact that it's a bad term does.

Anyways, my point is not that NPS doesn't matter at all; it's that if you are testing a minor change, NPS can be safely ignored. The effect of the evaluation itself and its influence on the shape of the tree matter much more.
And my point is that I think it's never safe to ignore. While you may think you're not hurting nps, it's entirely possible that you're wrong. For one thing, I think the shape of the tree itself can certainly have an effect on nps as well. And like I said before, you never know when you may have inadvertently introduced a bug that hurts your nps.
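In practice that check can be as cheap as timing a fixed-depth search over a handful of positions before and after the change. The sketch below uses a made-up hook, engine_search(), standing in for whatever benchmark entry point the engine already provides:

#include <time.h>

extern long long engine_search(const char *fen, int depth);  /* placeholder hook, returns nodes */

/* Search the same positions to a fixed depth and report nodes per
   second.  Run it on the old and the new binary; a drop of more than a
   few percent deserves a look before trusting any Elo measurement of
   the change. */
static double bench_nps(const char **fens, int count, int depth)
{
    long long nodes = 0;
    clock_t start = clock();
    for (int i = 0; i < count; i++)
        nodes += engine_search(fens[i], depth);
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return seconds > 0.0 ? (double)nodes / seconds : 0.0;
}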
I think one thing can be gained from this discussion though: it is basically impossible to measure differences like 1 Elo. Because chess engines are so deterministic, you just can't get enough "independent" samples to extrapolate to what happens in the real world, where there are clocks, books, parallel searches, etc.
Perhaps.