It is known that results based on a few games can be very misleading, and if the two engines are very similar in strength then a single game result is almost random.
I have thought about this and I would like to share these few notes, in case someone wants to add and/or correct something. I think everybody who has been involved in testing engines has thought about this subject at least once.
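To see just how random short matches really are, here is a quick Monte Carlo sketch in Python (the win/draw probabilities are made-up numbers for two equal engines, not measured data):

```python
import random

def simulate_match(n_games, p_win=0.35, p_draw=0.30, seed=0):
    """Score of engine A in a simulated match against an equal opponent.

    p_win and p_draw are made-up per-game probabilities; with equal
    engines the loss probability is the same as p_win, so the expected
    score is exactly 0.50.
    """
    rng = random.Random(seed)
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0          # A wins
        elif r < p_win + p_draw:
            score += 0.5          # draw
    return score / n_games

# Repeat many matches: short matches scatter widely around 0.50,
# long matches barely move from it.
short = [simulate_match(10, seed=s) for s in range(200)]
long_ = [simulate_match(1000, seed=s) for s in range(200)]
spread_short = max(short) - min(short)
spread_long = max(long_) - min(long_)
print(spread_short, spread_long)
```

The spread of the 10-game scores comes out several times larger than the spread of the 1000-game scores, which is exactly why a handful of games tells you almost nothing about two close engines.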
NAMING
For a given position p we call S the finite set of legal moves available to the side to move.
Now we call e() an evaluation function that, for each move m in S, returns a score based on a search at fixed depth d:
Code:
v = e(p, m, d) where p is the position, m belongs to S, d is the search depth
We say that an evaluation function is "valid" if and only if, given two arbitrary moves ms and mw of S such that ms is stronger than mw, e() satisfies (dropping the position argument p when it is clear from context):
Code:
limit e(ms, d) > limit e(mw, d) as d approaches plus infinity
Plainly speaking, any non-broken chess engine can be considered a valid evaluation function for any practical purpose.
We should also define what "stronger" and "weaker" move mean, but we take that for granted here to avoid over-complicating things.
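To make the validity condition concrete, here is a toy sketch in Python (the scores are invented numbers, not the output of a real engine): the "strong" move looks worse at shallow depth and only overtakes the "weak" move once the search is deep enough.

```python
def e(p, m, d):
    """Toy evaluation: score of move m in position p at depth d.

    Invented numbers: the 'strong' move is undervalued at shallow depth
    and only overtakes the 'weak' move from depth 3 onwards, mimicking
    how deeper search reveals the true ordering.
    """
    base = {"strong": 0.10, "weak": 0.40}[m]   # shallow-depth score
    gain = {"strong": 0.12, "weak": 0.00}[m]   # per-ply correction
    return base + gain * d

p = "some position"
assert e(p, "strong", 1) < e(p, "weak", 1)     # shallow search is fooled
assert e(p, "strong", 10) > e(p, "weak", 10)   # deep search gets it right
```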
The MDD (the minimum depth from which the engine's best move no longer changes as the search deepens) is clearly a function of the position. For instance, in a position with a hanging queen the MDD is 1, because any engine will suggest taking the queen already at depth 1 and will keep that best move at every greater depth. On the other hand, in a quiet middle-game position the MDD can be very high.
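Given any engine wrapped behind a best-move-at-depth function, the MDD of a position can be estimated by scanning depths until the best move stops changing. A minimal sketch, assuming a hypothetical best_move(p, d) wrapper and capping the scan at some maximum depth:

```python
def mdd(best_move, p, max_depth=30):
    """Minimum depth from which best_move(p, d) no longer changes.

    best_move is any callable returning the engine's best move in
    position p at fixed depth d (a hypothetical wrapper around a real
    engine).  The scan is capped at max_depth, and the move chosen
    there is treated as the converged 'real' best move.
    """
    moves = [best_move(p, d) for d in range(1, max_depth + 1)]
    final = moves[-1]
    depth = max_depth
    # Walk backwards while the shallower move still agrees with the final one.
    while depth > 1 and moves[depth - 2] == final:
        depth -= 1
    return depth

# Toy engine: wobbles between two moves until depth 4, then settles.
choices = {1: "Qxd5", 2: "Qxd5", 3: "Nf6"}
toy = lambda p, d: choices.get(d, "Qxd5")
print(mdd(toy, "toy position"))                   # 4
print(mdd(lambda p, d: "Qxd5", "hanging queen"))  # 1, as in the example above
```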
ORIGIN OF NOISE
To simplify the setting we will refer to the case of an engine A playing against itself at fixed search depth.
We will see that the general case of two engines A and B playing against each other at varying search depths, as is normal in engine matches, adds even more noise.
So let engine A play against itself at fixed depth fd. At some point during the match A plays its best move bm.
We say that A introduces noise by playing bm when the following occurs:
Code:
1) There exists a real best move rbm such that
limit e(rbm, d) > limit e(bm, d) as d approaches plus infinity
2) e(bm, fd) > e(rbm, fd)
3) e(rbm, fd + 1) > e(bm, fd + 1)
Point 3 means that at the next move the opponent A', which always searches at fixed depth fd, will discover a good refutation, i.e. it will find that rbm would have been better than bm. If the fixed search had been one ply deeper than fd, this noise would not have been introduced, since A would already have discovered by itself that rbm > bm.
Playing bm is noise because it is a weaker move and because we don't need a stronger engine to spot this: the same engine would suffice, if only it had looked one ply deeper. So it _seems_ that A is weaker than A', while in fact it is not; indeed, it is the same engine.
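The three conditions above can be written down as a small predicate. A sketch, where e(move, d) is any fixed-depth score function and deep_limit is a hypothetical stand-in for the limit as d goes to infinity:

```python
def introduces_noise(e, bm, rbm, fd, deep_limit=50):
    """Check the three noise conditions for played move bm vs real best rbm.

    e(move, d) is any fixed-depth score function; deep_limit is a
    hypothetical stand-in for the d -> infinity limit.
    """
    cond1 = e(rbm, deep_limit) > e(bm, deep_limit)  # 1) rbm is really better
    cond2 = e(bm, fd) > e(rbm, fd)                  # 2) but at fd, bm looks better
    cond3 = e(rbm, fd + 1) > e(bm, fd + 1)          # 3) one ply deeper refutes bm
    return cond1 and cond2 and cond3

# Toy scores: rbm only overtakes bm from depth 5 onwards (so MDD = 5).
def score(m, d):
    return 0.5 if m == "bm" else 0.12 * d

print(introduces_noise(score, "bm", "rbm", fd=4))  # True: fd < MDD
print(introduces_noise(score, "bm", "rbm", fd=8))  # False: fd is deep enough
```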
So the origin of the noise is that the horizon effect is allowed to come into play, due to the fact that:
Code:
e(rbm, d) - e(bm, d) > 0 does not hold for every d; in particular it fails at d = fd
In other words, there exist some positions P for which fd < MDD.
GENERALIZATION
In the case of two engines A and B, with evaluation functions ea() and eb(), that reach a position where the search depths happen to be da and db respectively, the conditions for A to introduce noise are the same, but points 2 and 3 become:
Code:
2) ea(bm, da) > ea(rbm, da)
3) eb(rbm, db) > eb(bm, db)
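The generalized conditions just thread the two evaluation functions and the two depths through the same kind of predicate. A sketch along the same lines (ea/eb and da/db as in the text, deep_limit again a hypothetical stand-in for the limit):

```python
def introduces_noise_general(ea, eb, bm, rbm, da, db, deep_limit=50):
    """Noise conditions when A (eval ea, depth da) faces B (eval eb, depth db).

    Point 1 still uses A's own evaluation in the deep limit; points 2
    and 3 become the asymmetric checks from the text.
    """
    cond1 = ea(rbm, deep_limit) > ea(bm, deep_limit)  # 1) rbm is really better
    cond2 = ea(bm, da) > ea(rbm, da)                  # 2) A prefers bm at da
    cond3 = eb(rbm, db) > eb(bm, db)                  # 3) B refutes bm at db
    return cond1 and cond2 and cond3

# Same toy scores for both engines; B just happens to search deeper.
def score(m, d):
    return 0.5 if m == "bm" else 0.12 * d

print(introduces_noise_general(score, score, "bm", "rbm", da=4, db=6))  # True
print(introduces_noise_general(score, score, "bm", "rbm", da=4, db=3))  # False
```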
MEASURING NOISE
All this theory is useless if we are not able to actually measure this noise and analyze how it changes with depth. Measuring the noise distribution can be very useful, not only to understand how all this works, but also to better choose parameters such as the time control and the number of games for each match.
This is normally done by experience, with statements like "I know it is better because it has always been done this way" or "I have run thousands (millions) of tests and this TC is the best". But it has never been based on a quantitative approach to the noise distribution.
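One possible quantitative approach (just a sketch of the idea, not necessarily what I will describe in the next post): take a sample of positions and, for each one, check whether searching one ply deeper overturns the fixed-depth best move. The flip rate is a crude estimate of the noise at that depth.

```python
def noise_rate(best_move, positions, fd):
    """Fraction of positions where depth fd+1 overturns the depth-fd best move.

    best_move(p, d) is any hypothetical engine wrapper.  A flip means
    the depth-fd choice was, by the engine's own deeper judgement, noisy.
    """
    flips = sum(1 for p in positions
                if best_move(p, fd) != best_move(p, fd + 1))
    return flips / len(positions)

# Toy engine whose choice stabilizes exactly at a per-position depth
# (each toy 'position' is encoded directly as its own MDD).
def toy_best_move(p, d):
    return "real_best" if d >= p else "shallow_best"

positions = [1, 3, 5, 7, 9]
print(noise_rate(toy_best_move, positions, fd=2))  # 0.2: only the MDD=3 position flips
```

Run against a real engine over many positions and depths, this gives exactly the kind of noise-vs-depth distribution discussed above.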
How to measure this kind of noise in practice and plot its distribution will be the subject of another post that I am working on...