It is known that results based on a few games can be very misleading, and if the two engines are very similar in strength then a single game result is almost random.
I have thought about this and I would like to share these few notes, in case someone wants to add and/or correct something. I think everybody who has been involved in testing engines has thought about this subject at least once.
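To see just how random short matches really are, here is a quick Monte Carlo sketch in Python (the win/draw probabilities are made-up numbers for two equal engines, not measured data):

```python
import random

def simulate_match(n_games, p_win=0.35, p_draw=0.30, seed=0):
    """Score of engine A in a simulated match against an equal opponent.

    p_win and p_draw are made-up per-game probabilities; with equal
    engines the loss probability is the same as p_win, so the expected
    score is exactly 0.50.
    """
    rng = random.Random(seed)
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0          # A wins
        elif r < p_win + p_draw:
            score += 0.5          # draw
    return score / n_games

# Repeat many matches: short matches scatter widely around 0.50,
# long matches barely move from it.
short = [simulate_match(10, seed=s) for s in range(200)]
long_ = [simulate_match(1000, seed=s) for s in range(200)]
spread_short = max(short) - min(short)
spread_long = max(long_) - min(long_)
print(spread_short, spread_long)
```

The spread of the 10-game scores comes out several times larger than the spread of the 1000-game scores, which is exactly why a handful of games tells you almost nothing about two close engines.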
NAMING
For a given position p we call S the finite set of legal moves available to the side to move.
Now we call e() an evaluation function that, for each move m in S, returns a score based on a search at fixed depth d:
Code:
v = e(p, m, d) where p is the position, m belongs to S, d is the search depth
We say that an evaluation function is "valid" if and only if, given two arbitrary moves ms and mw of S such that ms is stronger than mw, e() satisfies (dropping the position argument p when it is clear from context):
Code:
limit e(ms, d) > limit e(mw, d) as d approaches plus infinity
Plainly speaking, any non-broken chess engine can be considered a valid evaluation function for any practical purpose.
We should also define what "stronger" and "weaker" move mean, but we take that for granted here to avoid over-complicating things.
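To make the validity condition concrete, here is a toy sketch in Python (the scores are invented numbers, not the output of a real engine): the "strong" move looks worse at shallow depth and only overtakes the "weak" move once the search is deep enough.

```python
def e(p, m, d):
    """Toy evaluation: score of move m in position p at depth d.

    Invented numbers: the 'strong' move is undervalued at shallow depth
    and only overtakes the 'weak' move from depth 3 onwards, mimicking
    how deeper search reveals the true ordering.
    """
    base = {"strong": 0.10, "weak": 0.40}[m]   # shallow-depth score
    gain = {"strong": 0.12, "weak": 0.00}[m]   # per-ply correction
    return base + gain * d

p = "some position"
assert e(p, "strong", 1) < e(p, "weak", 1)     # shallow search is fooled
assert e(p, "strong", 10) > e(p, "weak", 10)   # deep search gets it right
```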
The MDD (the minimum depth from which the engine's best move no longer changes as the search deepens) is clearly a function of the position. For instance, in a position with a hanging queen the MDD is 1, because any engine will suggest taking the queen already at depth 1 and will keep that best move at every greater depth. On the other hand, in a quiet middle-game position the MDD can be very high.
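Given any engine wrapped behind a best-move-at-depth function, the MDD of a position can be estimated by scanning depths until the best move stops changing. A minimal sketch, assuming a hypothetical best_move(p, d) wrapper and capping the scan at some maximum depth:

```python
def mdd(best_move, p, max_depth=30):
    """Minimum depth from which best_move(p, d) no longer changes.

    best_move is any callable returning the engine's best move in
    position p at fixed depth d (a hypothetical wrapper around a real
    engine).  The scan is capped at max_depth, and the move chosen
    there is treated as the converged 'real' best move.
    """
    moves = [best_move(p, d) for d in range(1, max_depth + 1)]
    final = moves[-1]
    depth = max_depth
    # Walk backwards while the shallower move still agrees with the final one.
    while depth > 1 and moves[depth - 2] == final:
        depth -= 1
    return depth

# Toy engine: wobbles between two moves until depth 4, then settles.
choices = {1: "Qxd5", 2: "Qxd5", 3: "Nf6"}
toy = lambda p, d: choices.get(d, "Qxd5")
print(mdd(toy, "toy position"))                   # 4
print(mdd(lambda p, d: "Qxd5", "hanging queen"))  # 1, as in the example above
```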
ORIGIN OF NOISE
To simplify the setting we will refer to the case of an engine A playing against itself at fixed search depth.
We will see that the general case of two engines A and B playing against each other at varying search depths, as is normal in engine matches, adds even more noise.
So let engine A play against itself at fixed depth fd. At some point during the match A plays its best move bm.
We say that A introduces noise by playing bm when the following occurs:
Code:
1) There exists a real best move rbm such that
limit e(rbm, d) > limit e(bm, d) as d approaches plus infinity
2) e(bm, fd) > e(rbm, fd)
3) e(rbm, fd + 1) > e(bm, fd + 1)
Point 3 means that at the next move the opponent A', which always searches at fixed depth fd, will discover a good refutation, i.e. it will find that rbm would have been better than bm. If the fixed search had been one ply deeper than fd, this noise would not have been introduced, since A would already have discovered by itself that rbm > bm.
Playing bm is noise because it is a weaker move and because we don't need a stronger engine to spot this: the same engine would suffice, if only it had looked one ply deeper. So it _seems_ that A is weaker than A', while in fact it is not; indeed, it is the same engine.
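The three conditions above can be written down as a small predicate. A sketch, where e(move, d) is any fixed-depth score function and deep_limit is a hypothetical stand-in for the limit as d goes to infinity:

```python
def introduces_noise(e, bm, rbm, fd, deep_limit=50):
    """Check the three noise conditions for played move bm vs real best rbm.

    e(move, d) is any fixed-depth score function; deep_limit is a
    hypothetical stand-in for the d -> infinity limit.
    """
    cond1 = e(rbm, deep_limit) > e(bm, deep_limit)  # 1) rbm is really better
    cond2 = e(bm, fd) > e(rbm, fd)                  # 2) but at fd, bm looks better
    cond3 = e(rbm, fd + 1) > e(bm, fd + 1)          # 3) one ply deeper refutes bm
    return cond1 and cond2 and cond3

# Toy scores: rbm only overtakes bm from depth 5 onwards (so MDD = 5).
def score(m, d):
    return 0.5 if m == "bm" else 0.12 * d

print(introduces_noise(score, "bm", "rbm", fd=4))  # True: fd < MDD
print(introduces_noise(score, "bm", "rbm", fd=8))  # False: fd is deep enough
```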
So the origin of the noise is that the horizon effect is allowed to come into play, due to the fact that:
Code:
e(rbm, d) - e(bm, d) > 0 does not hold for every d; in particular it fails at d = fd
In other words, there exist some positions P for which fd < MDD.
GENERALIZATION
In the case of two engines A and B, with evaluation functions ea() and eb(), that reach a position where the search depths happen to be da and db respectively, the conditions for A to introduce noise are the same, but points 2 and 3 become:
Code:
2) ea(bm, da) > ea(rbm, da)
3) eb(rbm, db) > eb(bm, db)
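The generalized conditions just thread the two evaluation functions and the two depths through the same kind of predicate. A sketch along the same lines (ea/eb and da/db as in the text, deep_limit again a hypothetical stand-in for the limit):

```python
def introduces_noise_general(ea, eb, bm, rbm, da, db, deep_limit=50):
    """Noise conditions when A (eval ea, depth da) faces B (eval eb, depth db).

    Point 1 still uses A's own evaluation in the deep limit; points 2
    and 3 become the asymmetric checks from the text.
    """
    cond1 = ea(rbm, deep_limit) > ea(bm, deep_limit)  # 1) rbm is really better
    cond2 = ea(bm, da) > ea(rbm, da)                  # 2) A prefers bm at da
    cond3 = eb(rbm, db) > eb(bm, db)                  # 3) B refutes bm at db
    return cond1 and cond2 and cond3

# Same toy scores for both engines; B just happens to search deeper.
def score(m, d):
    return 0.5 if m == "bm" else 0.12 * d

print(introduces_noise_general(score, score, "bm", "rbm", da=4, db=6))  # True
print(introduces_noise_general(score, score, "bm", "rbm", da=4, db=3))  # False
```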
MEASURING NOISE
All this theory is useless if we are not able to actually measure this noise and analyze how it changes with depth. Measuring the noise distribution can be very useful, not only to understand how all this works, but also to better choose parameters such as the time control and the number of games for each match.
This is normally done by experience, with statements like "I know it is better because it has always been done this way" or "I have run thousands (millions) of tests and this TC is the best". But it has never been based on a quantitative approach to the noise distribution.
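One possible quantitative approach (just a sketch of the idea, not necessarily what I will describe in the next post): take a sample of positions and, for each one, check whether searching one ply deeper overturns the fixed-depth best move. The flip rate is a crude estimate of the noise at that depth.

```python
def noise_rate(best_move, positions, fd):
    """Fraction of positions where depth fd+1 overturns the depth-fd best move.

    best_move(p, d) is any hypothetical engine wrapper.  A flip means
    the depth-fd choice was, by the engine's own deeper judgement, noisy.
    """
    flips = sum(1 for p in positions
                if best_move(p, fd) != best_move(p, fd + 1))
    return flips / len(positions)

# Toy engine whose choice stabilizes exactly at a per-position depth
# (each toy 'position' is encoded directly as its own MDD).
def toy_best_move(p, d):
    return "real_best" if d >= p else "shallow_best"

positions = [1, 3, 5, 7, 9]
print(noise_rate(toy_best_move, positions, fd=2))  # 0.2: only the MDD=3 position flips
```

Run against a real engine over many positions and depths, this gives exactly the kind of noise-vs-depth distribution discussed above.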
How to measure this kind of noise in practice and plot its distribution will be the subject of another post that I am working on...