Statistics from automated tests

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdm64
Posts: 41
Joined: Thu May 27, 2010 11:32 pm

Statistics from automated tests

Post by jdm64 »

I'm starting to write an automated testing function in my chess engine, but I have a few questions:

(1): Excluding the case of an SMP engine, which I know would add non-deterministic results, isn't the search function deterministic? Hence, if I tested two versions of my engine against each other, wouldn't they always play the same game over and over? So I'd really only have to play 2 games in testing -- one with each engine playing white.

(2): I've seen several mentions of "I tested 5K+ games and here are the results" or something similar. How/what are you testing? Given the assumption that non-SMP engines are deterministic, why so many games? The engines must be playing somewhat randomly then -- right? How would I go about testing my engine against itself?

(3): Currently, my engine selects a random move from the set of moves with the highest score. In other words, from the scores that negascout returns for the root move list, I find the maximum value and return a random move that has that score. Is this how moves are normally selected, if I want some randomness in play? I've seen mention of a PV move, but is there more than one?

(4): I'm trying to calculate some statistics from a set of games. What stats are normally used? Currently I'm reporting: white wins, black wins, draws, and average ply. I've seen LOS (likelihood of superiority) mentioned here, but the explanations aren't clear enough for me to convert it to code. The formula looks like:

Code: Select all

x   = score_difference / N(1 - draw_ratio)
LOS = phi(x) = (erf(x / sqrt(2)) + 1) / 2
I found erf in the C library, but is N a function or a multiplication? It sounds like it might be some normal-distribution function, but if so I don't know how to calculate it.
User avatar
jshriver
Posts: 1342
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: Statistics from automated tests

Post by jshriver »

I would guess that if you had your engine play itself, the games would differ for various reasons.

1. The engine isn't the only thing running on the computer, so even if it's taking up the bulk of the resources at a given time, something might still trigger in the background and rob it of CPU resources, even if only for a short while, thus affecting the outcome of that move.

2. I believe a lot of engines have some form of slight randomness in their opening book. Otherwise they would play the exact same opening move over and over again in every game.

3. I'm sure there are many other reasons, and I may stand corrected on my first two :)

-Josh
jdm64
Posts: 41
Joined: Thu May 27, 2010 11:32 pm

Re: Statistics from automated tests

Post by jdm64 »

jshriver wrote:1. The engine isn't the only thing running on the computer, so even if it's taking up the bulk of the resources at a given time, something might still trigger in the background and rob it of CPU resources, even if only for a short while, thus affecting the outcome of that move.
I'm fairly sure that a single-threaded engine would be deterministic. The scheduling that the operating system does wouldn't matter; a single thread is a single path of execution. I do agree that a multi-threaded engine would give different results because of scheduling timing -- I think the main reason would be the transposition table. What I'm really asking is: what is everyone else doing in their engines to add randomness? I explained my method in (3).
jshriver wrote:2. I believe a lot of engines have some form of slight randomness in their opening book. Otherwise they would play the exact same opening move over and over again in every game.
My engine currently doesn't use an opening book. I guess my question about that would be: without an opening book is there much randomness in the moves of most engines?

But neither answers my main questions:
(1) What automated tests are other people using with their engines, and why do they work?
(2) How do I calculate LOS (likelihood of superiority)? Mainly, I don't get what "N" is. Also, what other statistical calculations are useful when testing an engine against itself or another engine?

And other things I mentioned in my original post.
User avatar
jshriver
Posts: 1342
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: Statistics from automated tests

Post by jshriver »

N is the number of games; it's defined at the very top:
N = wins + draws + losses
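
As a rough illustration (my own sketch, assuming the N(1 - draw_ratio) in your formula is plain multiplication), in C it would come out to something like this:

Code: Select all

/* Rough sketch of the LOS formula quoted above, assuming
   N * (1 - draw_ratio) is plain multiplication.  Link with -lm. */
#include <math.h>

double los(int wins, int losses, int draws)
{
    double n = wins + losses + draws;           /* N = wins + draws + losses */
    double draw_ratio = draws / n;
    double x = (wins - losses) / (n * (1.0 - draw_ratio));
    return (erf(x / sqrt(2.0)) + 1.0) / 2.0;    /* phi(x) via erf */
}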


Also recommend this: http://remi.coulom.free.fr/Bayesian-Elo/
Good stuff :)
jdm64
Posts: 41
Joined: Thu May 27, 2010 11:32 pm

Re: Statistics from automated tests

Post by jdm64 »

jshriver wrote:N is the number of games; it's defined at the very top:
N = wins + draws + losses
Yeah, I thought that too. But if that's the case, why not simplify it? It reduces down to just using wins and losses:

Code: Select all

score_difference = win - loss
N = win + loss + draw
draw_ratio = draw / N

score_difference / (N * (1 - draw_ratio))
= (win - loss) / ((win + loss + draw) * (1 - draw / N))
= (win - loss) / ((win + loss + draw) * (1 - draw / (win + loss + draw)))
= (win - loss) / ((win + loss + draw) - draw * (win + loss + draw) / (win + loss + draw))
= (win - loss) / ((win + loss + draw) - draw)
= (win - loss) / (win + loss)
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Statistics from automated tests

Post by hgm »

It depends on the engine's time management. If you take decisions based on the clock reading, this will make the engine non-deterministic. Engines that implement a book often use time-dependent randomization of the opening moves to create diversity. An external GUI book, or a large selected set of initial moves / positions can create game diversity without non-determinism.

My single-CPU engine Joker is non-deterministic because it adds start-time-derived random numbers to each evaluation. (At least, when randomization is switched on, which is the default when running under WinBoard.) Although this way of randomizing gets less effective at longer TC, it is still quite effective in 40/1 games. Some engines (e.g. XQWLight) have randomization programmed into them by adding a small random score to each move in the root; this will remain effective at any TC.
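
As an illustration of that root-score approach, a minimal sketch (hypothetical; not Joker's or XQWLight's actual code, and the bonus size is an arbitrary assumption):

Code: Select all

/* Hypothetical sketch of root-move randomization: add a small random
   bonus to each root move's score so nearly-equal moves get shuffled,
   while clearly better moves still win.  RANDOM_MARGIN is an assumed,
   tunable value in centipawns. */
#include <stdlib.h>

#define RANDOM_MARGIN 4

void randomize_root_scores(int scores[], int num_moves)
{
    for (int i = 0; i < num_moves; i++)
        scores[i] += rand() % (RANDOM_MARGIN + 1);   /* 0 .. RANDOM_MARGIN */
}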
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Statistics from automated tests

Post by bob »

jdm64 wrote:I'm starting to write an automated testing function in my chess engine, but I have a few questions:

(1): Excluding the case of an SMP engine, which I know would add non-deterministic results, isn't the search function deterministic? Hence, if I tested two versions of my engine against each other, wouldn't they always play the same game over and over? So I'd really only have to play 2 games in testing -- one with each engine playing white.
You won't get the same games because of timing differences, if you use a normal time-limit to control the search. But playing more than two games with the same position is not a good idea, because for many positions, the outcomes of the games (from the same position) will be correlated...


(2): I've seen several mentions of "I tested 5K+ games and here are the results" or something similar. How/what are you testing? Given the assumption that non-SMP engines are deterministic, why so many games? The engines must be playing somewhat randomly then -- right? How would I go about testing my engine against itself?
Just play 10,000 games. Then arrange them in some sort of order (say, based on the time played). Then, for each game, convert the result to W, L or D, so that you end up with 10,000 characters representing the results. Look at them. You will find long strings of W, long strings of L, long strings of D. How do those results compare to the overall result from 10,000 games?

This is a statistical sampling question, not a question of "logic".
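
To make that exercise concrete, here is a minimal sketch (hypothetical, not from any engine or testing tool) that scans such a result string and reports the overall score next to the longest run of identical results:

Code: Select all

/* Hypothetical sketch: scan a string of game results ('W', 'L', 'D'),
   print the overall score and the longest run of identical results,
   to see how streaky the sample is compared with its overall average. */
#include <stdio.h>
#include <string.h>

void analyze_results(const char *results)
{
    int n = (int)strlen(results);
    int wins = 0, draws = 0, run = 0, longest = 0;

    for (int i = 0; i < n; i++) {
        if (results[i] == 'W') wins++;
        else if (results[i] == 'D') draws++;

        run = (i > 0 && results[i] == results[i - 1]) ? run + 1 : 1;
        if (run > longest) longest = run;
    }
    printf("score: %.1f / %d   longest run: %d\n",
           wins + 0.5 * draws, n, longest);
}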


(3): Currently, my engine selects a random move from the set of moves with the highest score. In other words, from the scores that negascout returns for the root move list, I find the maximum value and return a random move that has that score. Is this how moves are normally selected, if I want some randomness in play? I've seen mention of a PV move, but is there more than one?
Surely you are not really doing that? For iteration N, you always want to search the best move from iteration N-1 first. Then there are other ideas to provide good move ordering. Random is not one of them.
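
A minimal sketch of that idea (hypothetical structure and names, not any engine's actual code): before starting iteration N, move the best root move from iteration N-1 to the front of the list.

Code: Select all

/* Hypothetical sketch: promote the previous iteration's best move to the
   front of the root move list before searching the next iteration. */
typedef struct { int move; int score; } RootMove;

void promote_previous_best(RootMove list[], int count, int prev_best)
{
    for (int i = 0; i < count; i++) {
        if (list[i].move == prev_best) {
            RootMove best = list[i];
            for (int j = i; j > 0; j--)     /* shift the others down one slot */
                list[j] = list[j - 1];
            list[0] = best;
            break;
        }
    }
}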


(4): I'm trying to calculate some statistics from a set of games. What stats are normally used? Currently I'm reporting: white wins, black wins, draws, and average ply. I've seen LOS (likelihood of superiority) mentioned here, but the explanations aren't clear enough for me to convert it to code. The formula looks like:

Code: Select all

x   = score_difference / N(1 - draw_ratio)
LOS = phi(x) = (erf(x / sqrt(2)) + 1) / 2
I found erf in the C library, but is N a function or a multiplication? It sounds like it might be some normal-distribution function, but if so I don't know how to calculate it.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Statistics from automated tests

Post by hgm »

jdm64 wrote:(3): Currently, my engine selects a random move from the set of moves with the highest score. In other words, from the scores that negascout returns for the root move list, I find the maximum value and return a random move that has that score. Is this how moves are normally selected, if I want some randomness in play? I've seen mention of a PV move, but is there more than one?
Oh, I missed this point.

With normal negamax this would not work at all, because you will never get two moves with the same exact score. The first time you get the score it will be a PV move with an exact score, but alpha will then be raised to that value, so that any subsequent time that score is returned, it is only an upper limit. So if you get two moves with score +0.50 and happen to pick the second, it could be a move whose true score is -299.97 (as -299.97 < +0.50), and the engine will play a move that gets it mated in one...

Of course you can abandon standard negamax and manipulate the window such that you get more moves with exact scores (i.e., run in multi-PV mode), but that is very costly, and will weaken you significantly in game play.
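
For illustration, here is a minimal fail-hard root loop (a sketch with hypothetical types and helpers, not any particular engine's code): only a move that raises alpha yields an exact score, while a later move that merely returns alpha is an upper bound and cannot be meaningfully compared with it.

Code: Select all

/* Hypothetical fail-hard root loop.  Only a move that raises alpha has an
   exact score (the PV move); a later move returning a value <= alpha is
   only an upper bound, so "equal" scores at the root are not comparable. */
typedef int Move;
typedef struct Position Position;   /* opaque; assumed defined elsewhere */

/* assumed engine helpers: */
void make_move(Position *pos, Move m);
void unmake_move(Position *pos, Move m);
int  negamax(Position *pos, int depth, int alpha, int beta);

int search_root(Position *pos, Move moves[], int count, int depth,
                int alpha, int beta, Move *best)
{
    for (int i = 0; i < count; i++) {
        make_move(pos, moves[i]);
        int score = -negamax(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, moves[i]);

        if (score > alpha) {        /* exact score: new PV move */
            alpha = score;
            *best = moves[i];
        }
        /* else: upper bound only -- the true value may be far lower */
    }
    return alpha;
}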
jdm64
Posts: 41
Joined: Thu May 27, 2010 11:32 pm

Re: Statistics from automated tests

Post by jdm64 »

bob wrote:You won't get the same games because of timing differences, if you use a normal time-limit to control the search. But playing more than two games with the same position is not a good idea, because for many positions, the outcomes of the games (from the same position) will be correlated...
OK, that makes sense. I guess that would still make my current engine deterministic, because I'm currently just searching to a constant depth plus check extensions.
bob wrote:Just play 10,000 games. Then arrange them in some sort of order (say, based on the time played). Then, for each game, convert the result to W, L or D, so that you end up with 10,000 characters representing the results. Look at them. You will find long strings of W, long strings of L, long strings of D. How do those results compare to the overall result from 10,000 games? This is a statistical sampling question, not a question of "logic".
10K games! The computer I'm using to run my tests isn't all that powerful. Currently, at low depth, it can finish a game in 2.5 minutes, so 10K games would take me over two weeks! Is that common? You mentioned doing statistics on the games, and that's my question: I don't know exactly what statistical analysis to do. My current thinking is to use likelihood of superiority in the following way:

1) Run the engine against itself N times. This would be an "internal" match, where there's only one engine, just playing both sides.
2) Use the results from #1 to compute the LOS. This will be the "control group". For example, I ran 256 games and got White: 109 Black: 92 Tie: 55 AvgPly: 223 LOS: 0.5337012.
3) Run an "external" set of N games where two separate and different versions of the engine are used. Engine one should play white half the time and vice-versa. Calculate the LOS. The better engine will be the one that has a higher LOS that is also above the "error" found in #2.

Would this work?
bob wrote:Surely you are not really doing that? For iteration N, you always want to search the best move from iteration N-1 first. Then there are other ideas to provide good move ordering. Random is not one of them.
I guess I didn't explain clearly enough. The random move selection happens only after I've finished iterative deepening and need to select a move to play. For the iterative deepening itself, yes, I do sort the moves (high to low) using the scores from the previous iteration.
jdm64
Posts: 41
Joined: Thu May 27, 2010 11:32 pm

Re: Statistics from automated tests

Post by jdm64 »

hgm wrote:With normal negamax this would not work at all, because you will never get two moves with the same exact score. The first time you get the score it will be a PV move with an exact score, but alpha will then be raised to that value, so that any subsequent time that score is returned, it is only an upper limit. So if you get two moves with score +0.50 and happen to pick the second, it could be a move whose true score is -299.97 (as -299.97 < +0.50), and the engine will play a move that gets it mated in one...

Of course you can abandon standard negamax, and manipulate the window such that you get more moves with exact scores (i.e., run in multi-PV mode), but that is very costly, and will weaken you significantly in game play.
Is the reason the second move is only an upper limit that alpha is always returned, even if none of the searched moves scores higher than alpha? My search returns the "bestScore" at fail-low nodes instead of alpha. I guess this would be called a fail-soft search?

So, I am getting several moves with the same value; and I'm selecting a random move from them.
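
For reference, here is a minimal fail-soft negamax sketch (hypothetical types and helper names, not my actual code): the node returns bestScore instead of clamping to alpha, although a fail-low return is still only an upper bound on the true value.

Code: Select all

/* Hypothetical fail-soft negamax: return the best score actually found
   rather than clamping to alpha.  A value <= alpha is still only an upper
   bound, because cut-offs in the subtrees may hide even lower true values. */
#define INFINITE_SCORE 1000000

typedef int Move;
typedef struct Position Position;   /* opaque; assumed defined elsewhere */
typedef struct { Move moves[256]; int count; } MoveList;

/* assumed engine helpers: */
int  evaluate(Position *pos);
void generate_moves(Position *pos, MoveList *list);
void make_move(Position *pos, Move m);
void unmake_move(Position *pos, Move m);

int negamax(Position *pos, int depth, int alpha, int beta)
{
    if (depth == 0)
        return evaluate(pos);

    int bestScore = -INFINITE_SCORE;
    MoveList list;
    generate_moves(pos, &list);

    for (int i = 0; i < list.count; i++) {
        make_move(pos, list.moves[i]);
        int score = -negamax(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, list.moves[i]);

        if (score > bestScore) {
            bestScore = score;
            if (score > alpha) alpha = score;
            if (score >= beta) break;        /* fail high: cut-off */
        }
    }
    return bestScore;   /* fail-soft: may lie below alpha or above beta */
}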