more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

Richard Allbert
Posts: 794
Joined: Wed Jul 19, 2006 9:58 am

Re: more on engine testing

Post by Richard Allbert »

Err... no

Crafty beats Fruit, Fruit beats Glaurung, Glaurung beats Crafty, ... etc.

My "best version" in the list of five had by far the best result against the engine that finished at the top of the list (50%, the others 30% or so). It had a worse result against the engine at the bottom of the list compared with the others.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Again you are not reading. 15 games for each of 5 opponents. So of _course_ they didn't play white and black the same number of times against each opponent, since there is an odd number of games per opponent. It doesn't take much of that, with such a small sample, to produce 40-point swings.

As far as the dependent comment goes, tell me: if I play 100 games and lose them all, will the results appear to be dependent? We are not sampling completely random events. There is more going on than that...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Jeez... he didn't use 300 games to produce the ratings for each version. Each version could only play 75 games total, and only 15 games against each of the 5 opponents. So we are talking about +/- 2 games out of 15, not out of 300... I don't see how you could interpret his comments any differently than that, given the exact contents of his post and leaving the assumptions in the toilet where they belong.
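
For a rough sense of scale, here is a minimal back-of-the-envelope sketch (the numbers are purely illustrative, not taken from his results) of how far a swing of just two game points moves an Elo estimate at different sample sizes, using the standard score-to-Elo conversion:

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction (0 < score < 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

# Shift an even score by +2 game points at several sample sizes.
for games in (15, 75, 300):
    shifted = (games * 0.5 + 2) / games
    print(f"{games:3d} games: +2 points moves the estimate by "
          f"{elo_diff(shifted) - elo_diff(0.5):5.1f} Elo")
```

The smaller the per-opponent sample, the more violently a couple of results move the rating.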
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Richard Allbert wrote:Err... no

Crafty beats Fruit, Fruit beats Glaurung, Glaurung beats Crafty, ... etc.

My "best version" in the list of five had by far the best result against the engine that finished at the top of the list (50%, the others 30% or so). It had a worse result against the engine at the bottom of the list compared with the others.
I was only trying to explain to Uri how assumptions lead to poor conclusions. Elo is a strange animal when you consider all the ways the results can be altered by things that appear to be unimportant.
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

bob wrote:Again you are not reading. 15 games for each of 5 opponents. So of _course_ they didn't play white and black the same number of times against each opponent, since there is an odd number of games per opponent. It doesn't take much of that, with such a small sample, to produce 40-point swings.

As far as the dependent comment goes, tell me: if I play 100 games and lose them all, will the results appear to be dependent? We are not sampling completely random events. There is more going on than that...
1) The example I was responding to used an even number of games for every version, not 15 games.

Edit: I will check whether there were 15 games per match, but even in that case I expect the difference to be too small to justify a big difference in the rating.

It does not make sense to play an odd number of games against every opponent.

2) If you lose every game there is no correlation, and if you know that you will lose every game the results are independent.

The point is that the loss in game 1 does not cause you to lose game 2.
Correlation between dependent variables can happen if one of the programs has learning.

Correlation can also happen by luck when the variables are independent, but the data you provided suggest that this was probably not the case here, because it can be shown that the correlation was very high (I am not going to spend time proving it here).
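
To illustrate the distinction with a toy simulation (hypothetical numbers only, not the actual match data): when results are independent, losing one game tells you nothing about the next, so the loss rate observed right after a loss matches the overall loss rate.

```python
import random

random.seed(1)
p_loss = 0.4                                   # hypothetical fixed loss probability
results = [1 if random.random() < p_loss else 0 for _ in range(100_000)]

# Compare P(loss | previous game lost) with the unconditional loss rate.
after_loss = [results[i + 1] for i in range(len(results) - 1) if results[i] == 1]
print("unconditional loss rate:   ", sum(results) / len(results))
print("loss rate right after loss:", sum(after_loss) / len(after_loss))
```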

Uri
Uri Blass
Posts: 10803
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: more on engine testing

Post by Uri Blass »

point 1

"They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions"

300/5 = 60, so we have 60 games against every opponent, 30 with white and 30 with black, so the possibility that you suggested contradicts the data.

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

OK, can we do this _one_ more time? Let me once again, for the umpteenth time, describe what I have found.

1. There is no learning. I can play matches limited by specific node counts, and every last game is a duplicate given the same starting position and starting color. No matter how many games I play, absolutely nothing changes except for some slight time variance. All PVs, all scores, all best moves, all node counts, all _everything_ is identical. With me so far?

2. I have also used fixed-depth games. Same result as in (1) above: the same games repeat as many times as I want, with the same node counts, scores, etc. Any sort of learning would have an influence; the paper Slate and Scherzer wrote on hash learning used this effect, in fact.

3. If I use normal time-per-move constraints, things become wildly variable. To understand why, I tried the node limit and then changed it by as little as 100 nodes, and found that this alone made significant differences in the games and the results. And searching 100 nodes doesn't even take a fraction of a millisecond; on good hardware, I can search 100 nodes in a microsecond. So running matches where the node counts for each move vary by hundreds of thousands produces wildly varying results.

Now please re-read the above. There is no learning. If I eliminate time, things are 100% reproducible no matter how many times I run the test.

The instant I switch to time-limited search rather than depth- or node-limited search, things become variable. What do you conclude from that? It doesn't take much to realize that the time jitter causes each game to search a different number of nodes per move, and that is the source of the variability. That has been demonstrated clearly and can be verified by anyone willing to take the time to run the test, rather than falling back on the old "stampee feet, no good" type of argument...
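
As a toy illustration of that argument (a stand-in sketch only, not Crafty or the actual test harness): a search stopped at an exact node count returns the same answer every run, while the very same search stopped by a wall-clock limit does not, simply because the clock never cuts it off at exactly the same node.

```python
import random
import time

def toy_search(should_stop):
    """A stand-in 'engine': fully deterministic except for where it is stopped."""
    rng = random.Random(12345)          # fixed seed -> same internal sequence every run
    nodes, best_move = 0, None
    while not should_stop(nodes):
        nodes += 1
        if nodes % 10_000 == 0:         # pretend the best move gets revised periodically
            best_move = rng.randrange(30)
    return nodes, best_move

# Node-limited: identical output on every run.
print(toy_search(lambda n: n >= 2_000_000))
print(toy_search(lambda n: n >= 2_000_000))

# Time-limited: the node count (and possibly the chosen move) differs run to run.
def time_limit(ms):
    start = time.perf_counter()
    return lambda n: (time.perf_counter() - start) * 1000.0 >= ms

print(toy_search(time_limit(50)))
print(toy_search(time_limit(50)))
```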

Now, you need to explain to me (a) how anything could possibly know whether more or fewer nodes will be beneficial in a given position, rather than simply acting as a random modifier; and then (b) how the timing could be "flawed" (not my word, BTW) so that it would somehow take advantage of that knowledge to influence the games, even to the point of somehow introducing some sort of correlation between them.

There are no software bugs. There are no PGN bugs. There are no data-analysis bugs. If there were, they would also produce different results in the fixed-node searches. So this is _only_ about variability in time. Absolutely everything else has been excluded: no program bugs producing variable results, no referee bug or interference, no processor bugs. Nothing except time variability produces this particular result.

I have _clearly_ shown that the randomness is coming from time and time alone. Now I am looking for someone to explain how time jitter could possibly produce anything other than purely random correlation, if there is any correlation to be found. And all I am getting is "stampee feet... bugs... incompetent research... stampee feet..."

Stamping feet is not helping. Others have now reported the same sort of variability. So either there is some sort of sub-ethereal conspiracy affecting us all, or there is something else going on. But it would seem that all we are going to get is "stampee feet, bugs, impossible, incompetent, stampee feet..." ad nauseam.

Read the above _carefully_, and then perhaps we might begin to communicate in useful ways to understand the effect, rather than taking the old ostrich approach of burying one's head in the sand and saying "can't happen". It _is_ happening.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Uri Blass wrote:point 1

"They all scored within 2 points of each other from 300 games vs 5 opponents, black and white vs each engine in the 30 Noomen starting positions"

300/5 = 60, so we have 60 games against every opponent, 30 with white and 30 with black, so the possibility that you suggested contradicts the data.

Uri
He said 300 games: 4 versions against each of 5 opponents. That is 15 games per version per opponent, based on what he wrote. Now he may well have done what you said, but his post implies 300 games total, not 300 games each. Even if he played 60 games per player, that is still not a lot, and BayesElo can produce some interesting numbers depending on how the results actually looked.
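
The two readings differ only in what the 300 refers to; a trivial check of the arithmetic (nothing here beyond what is stated in the posts):

```python
versions, opponents = 4, 5

# My reading: 300 games total across all versions.
print(300 // (versions * opponents), "games per version per opponent")

# Uri's reading: 300 games for each version.
print(300 // opponents, "games per opponent, half with each color")
```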

I don't see anything that catches my eye as that unusual. It might well depend on whether the results were run through BayesElo independently or as one batch, which can change things as the ratings intermingle: Elo changes game by game as the ratings of all opponents are adjusted result by result.
Carey
Posts: 313
Joined: Wed Mar 08, 2006 8:18 pm

Re: more on engine testing

Post by Carey »

bob wrote:
Bill Rogers wrote:Carey
A chess program without an evaluation routine can not really play chess in the normal sense. It could only make random moves no matter how fast it could search. A search routine all by itself only generates moves and does not know which move is better or not. In fact it does not even know when it makes a capture or what the piece is worth.
Even the simplest evaluation subroutine, one which understands good captures from bad ones, won't be able to play a decent game of chess. It might beat some beginning chess players, but not anyone with any kind of real chess-playing skills, computer or not.
Bill
This is not quite true. See Don Beal's paper about a simple random evaluation. It played quite reasonable chess. And due to the way the search works, it managed to turn that random evaluation into a strong sense of mobility. I can explain if you want. But it actually does work...
Bob,

If you don't mind, I'd like to hear the explanation.

I don't happen to have that issue of the ICCAJ and I don't see the article on the web.

Who knows, maybe in my next program, I'll just stick a random number generator in there... :P It'd be a heck of a lot easier... :lol:


Seriously though, from what you've been reporting in this and the other tests, ripping out chunks of Crafty's eval, hash errors being irrelevant, etc., it's really starting to sound like a "just get things in the general area, and let the search do the hard work for you" kind of thing.

Just get the evaluator reasonable, add a few special cases for things the search & eval can't handle if it's a leaf, and just let the search do its thing.

Carey
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Carey wrote:
bob wrote:
Bill Rogers wrote:Carey
A chess program without an evaluation routine can not really play chess in the normal sense. It could only make random moves no matter how fast it could search. A search routine all by itself only generates moves and does not know which move is better or not. In fact it does not even know when it makes a capture or what the piece is worth.
Even the simplest evaluation subroutine, one which understands good captures from bad ones, won't be able to play a decent game of chess. It might beat some beginning chess players, but not anyone with any kind of real chess-playing skills, computer or not.
Bill
This is not quite true. See Don Beal's paper about a simple random evaluation. It played quite reasonable chess. And due to the way the search works, it managed to turn that random evaluation into a strong sense of mobility. I can explain if you want. But it actually does work...
Bob,

If you don't mind, I'd like to hear the explanation.

I don't happen to have that issue of the ICCAJ and I don't see the article on the web.

Who knows, maybe in my next program, I'll just stick a random number generator in there... :P It'd be a heck of a lot easier... :lol:


Seriously though, from what you've been reporting in this and the other tests, ripping out chunks of Crafty's eval, hash errors being irrelevant, etc., it's really starting to sound like a "just get things in the general area, and let the search do the hard work for you" kind of thing.

Just get the evaluator reasonable, add a few special cases for things the search & eval can't handle if it's a leaf, and just let the search do its thing.

Carey
Sure. Let's say you produce an evaluation function that is pure material, plus a random number between -1/2 pawn and +1/2 pawn. So your search will at least play sane moves and not throw material away. But how does it play decently?

Here's the idea. At a position P (and for simplicity you can even suppose it is one ply away from the leaf positions), if you have a large number of legal moves, then each time you make one and evaluate the resulting position, you get a random component in the evaluation. Since there are many legal moves in this position, you have a good chance of getting a "good" random number.

In another position at the same depth, you are in check, so you only have one possible move, and one chance to get a good random evaluation.

What Don found was that branches where you have lots of alternatives give you a better chance of drawing good random numbers, while branches where you have few alternatives make you "get what you get". And if you follow that through, you will note that it is actually a "poor man's mobility" approximation.
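
A quick way to see that effect numerically (a simulation sketch of the idea only, not Beal's actual experiment): the best of many random evaluations is, on average, clearly higher than the best of a few, so positions offering more legal moves get systematically better backed-up scores.

```python
import random

random.seed(0)

def expected_best(n_moves, trials=50_000):
    """Average of the max over n_moves random evals in [-0.5, +0.5] pawns."""
    total = 0.0
    for _ in range(trials):
        total += max(random.uniform(-0.5, 0.5) for _ in range(n_moves))
    return total / trials

for moves in (1, 3, 10, 30):
    print(f"{moves:2d} legal moves -> expected best random eval {expected_best(moves):+.3f}")
```

The side with more legal moves effectively gets more draws from the random distribution, which is exactly the mobility signal described above.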

The results were surprising when I first saw them, but after thinking about the explanation, it was one of those "of course..." type realizations...

Hope that helps...