How does Monte Carlo analysis work?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How does Monte Carlo analysis work?

Post by bob »

ernest wrote:
bob wrote:They might think they are measuring changes as small as one Elo with 80K games, but they aren't...
Theoretically, 80K games, with 1/3 draws, gives a standard deviation (SD) of 0.144%, that is (x 7) 1 Elo.
And 2 SD (95% probability) is 2 Elo
If everything else were equal. But playing 80K games in a few hours means very short search times, which by itself produces a large variance due to timing variations within the operating system and in whatever library routines are used to measure intervals of time. Very short games have significant variance when the time itself is so difficult to measure accurately.

I could certainly produce a couple of 80K-game matches at game in 1 second. And we could analyze the results. The problem is that the Elo statistics will show a very accurate Elo for that sample. But two successive samples won't be within +/- 1 Elo of each other...
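
For reference, the 0.144% figure quoted above follows from standard sampling statistics, assuming equal win and loss probabilities with 1/3 draws ($p_w = p_d = p_l = 1/3$, score $s \in \{1, \tfrac12, 0\}$ per game):

$\mu = E[s] = \tfrac13\cdot 1 + \tfrac13\cdot\tfrac12 = \tfrac12$, $\operatorname{Var}(s) = E[s^2] - \mu^2 = \tfrac{5}{12} - \tfrac14 = \tfrac16$, so $\sigma_{\text{game}} = 1/\sqrt{6} \approx 0.408$.

$\sigma_{\text{mean}} = \sigma_{\text{game}}/\sqrt{n} = 0.408/\sqrt{80000} \approx 0.00144 = 0.144\%$.

Near a 50% score, $\text{Elo}(s) = 400\log_{10}\frac{s}{1-s}$ has slope $1600/\ln 10 \approx 695$ Elo per unit of score, i.e. roughly 7 Elo per percentage point, which is where the "(x 7)" comes from: 1 SD $\approx$ 1 Elo and 2 SD $\approx$ 2 Elo.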
Uri Blass
Posts: 10903
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: How does Monte Carlo analysis work?

Post by Uri Blass »

bob wrote:
ernest wrote:
bob wrote:They might think they are measuring changes as small as one Elo with 80K games, but they aren't...
Theoretically, 80K games, with 1/3 draws, gives a standard deviation (SD) of 0.144%, that is (x 7) 1 Elo.
And 2 SD (95% probability) is 2 Elo
If everything else were equal. But playing 80K games in a few hours means very short search times, which by itself produces a large variance due to timing variations within the operating system and in whatever library routines are used to measure intervals of time. Very short games have significant variance when the time itself is so difficult to measure accurately.

I could certainly produce a couple of 80K-game matches at game in 1 second. And we could analyze the results. The problem is that the Elo statistics will show a very accurate Elo for that sample. But two successive samples won't be within +/- 1 Elo of each other...
I expect 2 successive samples to be within +/- 1 Elo of each other.

If there is a problem with timing in the operating system, then the engines may use the number of nodes that they search to get an estimate of the time.

The only problem is that the 1 Elo is only correct for the very fast time control and may not be correct for slower time controls.

Note that I have no practical idea for doing something better, and I believe that in most cases an improvement at a fast time control is also an improvement at a slower time control.

Uri
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How does Monte Carlo analysis work?

Post by hgm »

Indeed, they will obey the usual statistics. The timing uncertainties might affect the winning probability, but they will do so in the same way in both 80K runs. They just become part of the process under study.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How does Monte Carlo analysis work?

Post by bob »

Uri Blass wrote:
bob wrote:
ernest wrote:
bob wrote:They might think they are measuring changes as small as one Elo with 80K games, but they aren't...
Theoretically, 80K games, with 1/3 draws, gives a standard deviation (SD) of 0.144%, that is (x 7) 1 Elo.
And 2 SD (95% probability) is 2 Elo
If everything else were equal. But playing 80K games in a few hours means very short search times, which by itself produces a large variance due to timing variations within the operating system and in whatever library routines are used to measure intervals of time. Very short games have significant variance when the time itself is so difficult to measure accurately.

I could certainly produce a couple of 80K-game matches at game in 1 second. And we could analyze the results. The problem is that the Elo statistics will show a very accurate Elo for that sample. But two successive samples won't be within +/- 1 Elo of each other...
I expect 2 successive samples to be within +/- 1 Elo of each other.

If there is a problem with timing in the operating system, then the engines may use the number of nodes that they search to get an estimate of the time.
We have to stay with reality. Nobody does this.

The only problem is that the 1 Elo is only correct for the very fast time control and may not be correct for slower time controls.
Much more important is the small and overlooked detail that the 1 Elo is correct for _that_ sample. And only that sample. And since that sample is a tiny part of the total population of possible games, it is just a small sample at that...

Note that I have no practical idea for doing something better, and I believe that in most cases an improvement at a fast time control is also an improvement at a slower time control.

Uri
I don't, but I will reveal more about this later. I've got some data showing that results at very fast time controls are normally significantly different from results at a longer time control. Over a _lot_ of games...

The danger is that you might be comparing a 10 ply search at quick time controls to a 20 ply search at standard time controls. At 10 plies, you will miss some types of threats or positional issues, and adding such stuff to the evaluation will help. But at longer time controls, it is a double penalty that can make the program play worse, because it both gets the penalty and sees the issue in the search as well... I did this to myself in 1986, in a well-documented case of testing at fast time controls on slow hardware and then competing at long time controls on fast hardware...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How does Monte Carlo analysis work?

Post by bob »

hgm wrote:Indeed, they will obey the usual statistics. The timing uncertainties might affect the winning probability, but they will do so in the same way in both 80K runs. They just become part of the process under study.
This assumes that a 1 Elo improvement exists. Or that a program's Elo is actually a precise number. Neither of which I believe to be correct. This is taking a statistical process and trying to derive a precise answer to a question that might not even have a precise answer. If you were to measure the speed of your car with the speedometer set dead on 70 mph, but you measured to 4 decimal places, you might get N different answers, and absolutely none of them would mean a thing because of the accuracy of the original speedometer... There is inherent jitter in some systems. Chess games seem to fall into that category.
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How does Monte Carlo analysis work?

Post by hgm »

bob wrote:Much more important is the small and overlooked detail that the 1 Elo is correct for _that_ sample. And only that sample. And since that sample is a tiny part of the total population of possible games, it is just a small sample at that...
And that is exactly the nice thing about Monte-Carlo sampling: it is totally irrelevant how large the entire population is w.r.t. the sample. The statistical error is solely determined by the size of the sample, even if the population were infinite. The only thing that matters is whether the sampling is representative, i.e. properly distributed over the population.

So it is not relevant that there exist Chess games that did not occur in the sample. It is only important that such games could have been part of the sample, if chance had decided so.

Thus, what obviously would not work is to generate the samples with a defective engine that would never play moves with, say, Knights. That would prevent sampling of all games containing Knight moves.
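
One way to make the population-size point precise: when sampling $n$ games without replacement from a population of $M$ possible games, the standard error of the mean score is

$$\sigma_{\bar s} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{M-n}{M-1}},$$

and the finite-population correction factor $\sqrt{(M-n)/(M-1)}$ tends to 1 as $M \to \infty$. So for any population that is huge compared to the sample, the error depends only on $n$ and the per-game spread $\sigma$, not on how many games were never played.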

As to the "stay with reality" argument: in reality no one tests with a fixed number of nodes. But in reality, no one tests with 1 sec games either. Once they start to do the latter, they will quickly pick up the former habit as well...

All authors of WinBoard engines are encouraged to implement the command

nps N

in their engine. The effect is that the engine should not use the wall clock for its time management, but its own internal node count, converting nodes to seconds according to the factor N given in the command.

I will see to it that the next WinBoard release will support this command, using the node-count reported by the engine in its PV info for decrementing the GUI 'clock'.


(This will be in version 4.3.14, as I just released 4.3.13.)
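
To make the idea concrete, here is a minimal sketch in C of what such node-count based time management could look like inside an engine. The variable and function names are only illustrative; they are not part of the WinBoard protocol or of any particular engine.

#include <stdint.h>

static int64_t virtual_nps = 0;     /* N from the "nps N" command; 0 means: use the wall clock */
static int64_t nodes_searched = 0;  /* incremented by the search at every node */
static int64_t node_budget = 0;     /* nodes allowed for the current move */

/* Placeholder for the normal wall-clock check an engine already has. */
static int wall_clock_expired(void) { return 0; }

/* Called after the time manager has allocated 'msec' milliseconds for this move. */
void start_virtual_clock(int64_t msec)
{
    nodes_searched = 0;
    if (virtual_nps > 0)
        node_budget = virtual_nps * msec / 1000;  /* convert the time budget into a node budget */
}

/* Polled periodically by the search instead of looking at the wall clock. */
int out_of_time(void)
{
    if (virtual_nps > 0)
        return nodes_searched >= node_budget;
    return wall_clock_expired();
}

Because the "clock" now advances only with the node count, timing jitter in the operating system no longer affects how deep the engine searches.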
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How does Monte Carlo analysis work?

Post by bob »

hgm wrote:
bob wrote:Much more important is the small and overlooked detail that the 1 Elo is correct for _that_ sample. And only that sample. And since that sample is a tiny part of the total population of possible games, it is just a small sample at that...
And that is exactly the nice thing about Monte-Carlo sampling: it is totally irrelevant how large the entire population is w.r.t. the sample. The statistical error is solely determined by the size of the sample, even if the population were infinite. The only thing that matters is whether the sampling is representative, i.e. properly distributed over the population.

So it is not relevant that there exist Chess games that did not occur in the sample. It is only important that such games could have been part of the sample, if chance had decided so.

Thus, what obviously would not work is to generate the samples with a defective engine that would never play moves with, say, Knights. That would prevent sampling of all games containing Knight moves.

As to the "stay with reality" argument: in reality no one tests with a fixed number of nodes. But in reality, no one tests with 1 sec games either. Once they start to do the latter, they will quickly pick up the former habit as well...
You did see the discussion about how they are testing Rybka by playing 80K games overnight??? So this _is_ being done. 80,000 games on 8 processors is 10K games per processor overnight, or 10K games in 12 hours. Make it 10 hours for simplicity and that is 1K games per hour, or about 3.6 seconds per game total, which is real close to game in 1 second for each side...


All authors of WinBoard engines are encouraged to implement the command

nps N

in their engine. The effect is that the engine should not use the wall clock for its time management, but its own internal node count, converting nodes to seconds according to the factor N given in the command.

I will see to it that the next WinBoard release will support this command, using the node-count reported by the engine in its PV info for decrementing the GUI 'clock'.


(This will be in version 4.3.14, as I just released 4.3.13.)
That's a solution, but since adding 100 nodes to any single search can change the result significantly, this is just another way to draw incorrect conclusions. I've tried fixed-node searches, but they introduce another issue. You need a bunch of games, and with fixed-node searches you then need a bunch of different starting positions, since the same game will be played each time given the same starting position and the same node limit...
Uri Blass
Posts: 10903
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: How does Monte Carlo analysis work?

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
ernest wrote:
bob wrote:They might think they are measuring changes as small as one Elo with 80K games, but they aren't...
Theoretically, 80K games, with 1/3 draws, gives a standard deviation (SD) of 0.144%, that is (x 7) 1 Elo.
And 2 SD (95% probability) is 2 Elo
If everything else were equal. But playing 80K games in a few hours means very short search times, which by itself produces a large variance due to timing variations within the operating system and in whatever library routines are used to measure intervals of time. Very short games have significant variance when the time itself is so difficult to measure accurately.

I could certainly produce a couple of 80K-game matches at game in 1 second. And we could analyze the results. The problem is that the Elo statistics will show a very accurate Elo for that sample. But two successive samples won't be within +/- 1 Elo of each other...
I expect 2 successive samples to be within +/- 1 Elo of each other.

If there is a problem with timing in the operating system, then the engines may use the number of nodes that they search to get an estimate of the time.
We have to stay with reality. Nobody does this.

The only problem is that the 1 Elo is only correct for the very fast time control and may not be correct for slower time controls.
Much more important is the small and overlooked detail that the 1 Elo is correct for _that_ sample. And only that sample. And since that sample is a tiny part of the total population of possible games, it is just a small sample at that...

Note that I have no practical idea for doing something better, and I believe that in most cases an improvement at a fast time control is also an improvement at a slower time control.

Uri
I don't, but I will reveal more about this later. I've got some data showing that results at very fast time controls are normally significantly different from results at a longer time control. Over a _lot_ of games...

The danger is that you might be comparing a 10 ply search at quick time controls to a 20 ply search at standard time controls. At 10 plies, you will miss some types of threats or positional issues, and adding such stuff to the evaluation will help. But at longer time controls, it is a double penalty that can make the program play worse, because it both gets the penalty and sees the issue in the search as well... I did this to myself in 1986, in a well-documented case of testing at fast time controls on slow hardware and then competing at long time controls on fast hardware...
It may be interesting to see the data.
Note that even if you have data showing that something productive at 10 plies is counterproductive at 20 plies, it does not contradict what I believe, because of the following:

1) I said in most cases and not in all cases (of course, what happens in most cases also depends on what you try).
Of course it is possible that some knowledge helps at blitz and not at a long time control, but I think that you can often be smart enough to guess it in the first place and not try it.

For example, suppose that a program does not know how to win KR vs K at blitz but does know how to win it at a long time control.

It is obvious that knowledge of how to win KR vs K may help at blitz but is going to have no value at a long time control.
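
To illustrate the kind of knowledge meant here, a minimal sketch in C of a KR vs K "mop-up" evaluation term that rewards driving the bare king to the edge and bringing the two kings together; the weights and names are only illustrative, not taken from any particular engine.

#include <stdlib.h>

/* Squares are 0..63; file = sq % 8, rank = sq / 8. */
static int edge_distance(int sq)
{
    int f = sq % 8, r = sq / 8;
    int df = f < 4 ? f : 7 - f;
    int dr = r < 4 ? r : 7 - r;
    return df < dr ? df : dr;   /* 0 on the edge, 3 in the centre */
}

static int king_distance(int a, int b)
{
    int df = abs(a % 8 - b % 8);
    int dr = abs(a / 8 - b / 8);
    return df > dr ? df : dr;   /* Chebyshev (king-move) distance */
}

/* Bonus in centipawns for the side with K+R against a bare K: larger when the
   defending king is near the edge and the kings are close.  The weights 10
   and 4 are arbitrary example values. */
int krk_mopup_bonus(int strong_king, int weak_king)
{
    return 10 * (3 - edge_distance(weak_king))
         + 4 * (7 - king_distance(strong_king, weak_king));
}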

I think that if you change something in the evaluation, the first step before testing with games is testing with positions.

If the change helps the program avoid some mistake that it made at a long time control, then the next step is to test it in fast games, and in most cases, if the change helps the program play better at a very fast time control, then it is also going to play better at a long time control.

Uri
CRoberson
Posts: 2094
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: How does Monte Carlo analysis work?

Post by CRoberson »

hgm wrote:
All authors of WinBoard engines are encouraged to implement the command

nps N

......

I will see to it that the next WinBoard release will support this command, using the node-count reported by the engine in its PV info for decrementing the GUI 'clock'.
I suppose that could be a good idea under special conditions, but not
in general. The wall clock timer is better in several cases as it is a
real timer. What you suggest is not because it would not compensate for the
engine being temporarily suspended due to other job priorities.
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How does Monte Carlo analysis work?

Post by hgm »

CRoberson wrote:What you suggest is not because it would not compensate for the engine being temporarily suspended due to other job priorities.
I don't understand this sentence, and not only because of the double negation. What does 'it' refer to?

I propose this because at speeds of 25 msec per move the wall clock is too unreliable and inaccurate. So it would be a useful feature for those playing sub-second games.

It would also remove any non-determinism due to timing jitter, which could be a reason for people to use it even at long TC.

Programmers, note this: for proper implementation, please also announce that the engine knows this feature by including 'nps=1' in the feature command. You can even do this in protocol version 2.
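
A minimal sketch in C of the two pieces this asks for, assuming a simple line-based command loop; the engine name and the exact feature list shown are placeholders, not a complete protocol handler.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static long long virtual_nps = 0;   /* 0 = keep using the wall clock */

/* Sent once in reply to "protover 2": announce that the engine understands "nps". */
void announce_features(void)
{
    printf("feature myname=\"MyEngine 1.0\" nps=1 done=1\n");
    fflush(stdout);
}

/* Called for every command line received from the GUI. */
void handle_command(const char *line)
{
    if (strncmp(line, "nps ", 4) == 0) {
        /* Switch to node-count based timing, converting nodes to seconds
           with the rate N given after the command. */
        virtual_nps = atoll(line + 4);
        return;
    }
    /* ... handle the other WinBoard commands here ... */
}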