On engine testing again!

Discussion of chess software programming and technical issues.

Moderator: Ras

Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: On engine testing again!

Post by Milos »

zamar wrote:If I can recall correctly H.G.Mueller posted the approximative formula for the error bar of winning percentage: sigma = 40 / sqrt(number of games)
People often use this, but it can be quite wrong if the difference between the opponents is large, or if there are a lot of draws.

The full formula is simple enough:
sigma(in %)=sqrt((win%_a*win%_b - 25*draw%)/num_games)
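In code, the full formula looks like this (a quick sketch of my own, not from the thread; it uses the convention, clarified later in the thread, that win%_a + win%_b = 100, i.e. each side's draws count as half wins):

```python
import math

def sigma_elo(wins, draws, losses):
    """1-sigma error bar in Elo for a head-to-head match.
    win%_a and win%_b follow the win%_a + win%_b = 100 convention
    (each draw counts as half a win for each side)."""
    n = wins + draws + losses
    win_a = 100.0 * (wins + 0.5 * draws) / n
    win_b = 100.0 - win_a
    draw = 100.0 * draws / n
    sigma_pct = math.sqrt((win_a * win_b - 25.0 * draw) / n)
    return 7.0 * sigma_pct  # ~7 Elo per percentage point near a 50% score

# 1000 games, 50% score, half of all games drawn:
print(round(sigma_elo(250, 500, 250), 1))  # prints 7.8
```

Note how a high draw rate shrinks the error bar below the roughly 8.9 Elo that the draw-free 40/sqrt(n) approximation gives for 1000 games.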
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: On engine testing again!

Post by bob »

Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:
bob wrote:
Edsel Apostol wrote:Let's say that due to the limited resources one can only play 1200 games per engine version/settings.

Which testing method is better and why?

A. 120 games against each of the 10 opponents
B. 240 games against each of the 5 opponents
C. 300 games against each of the 4 opponents
D. 400 games against each of the 3 opponents
E. 600 games against each of the 2 opponents
F. 1200 games against a single opponent
Difficult question.

First, one opponent will lead you to tuning against that opponent, which may well hurt you against others.

However, many opponents causes you to reduce the number of starting test positions, which makes the positions critical.

If you are trying to measure 10-20 Elo changes, 1200 games is really hopeless, however. This is a painful issue, no doubt...
That's why I don't target 10 to 20 Elo changes for now; that will be for tuning later, when the engine is already really strong. I'm more interested in trying out ideas that gain or lose at least 30 Elo.

I agree about the few starting positions being critical.

Well, at least 1200 games is better than nothing. If a version/setting is good it will show in the rating list no matter how few the games.
Actually it won't. I've posted lots of results where a new version was way worse after 200-500 games, and then by 40,000 games was better. And vice-versa. The eye-opening test here is to run a 100 game match, then change _nothing_ and rerun it. The difference can be startling. Run it enough times and you develop a real appreciation for why the error bar is so high after 100 games.
For Elo differences of somewhere around 20 I would agree, but for an Elo difference of, say, 100, I doubt that after 200 games the stronger version would still be behind the weaker version.

It really depends on what you are trying to measure. The smaller the differences, the more games you need. In my engine, I don't try much tuning. I think tuning an engine that is not yet strong will only find a local maximum. I'm more interested in trying out new ideas.
The statistics are simple. The error bar says that any rating can be off by that much, given _that_ set of games to go by. So the error bar is not an absolute number; it is derived from the sample you give it (the games). I have played the same version of Crafty against the same set of opponents and seen the Elo vary by well over 100, when neither "version" had changed at all (in fact both were the same program under different names).

Again, play A vs B for 100 games. Then run them again. And a third time. And look at the results independently with BayesElo, and then combine them and compare again. I would draw _no_ conclusion from 100 games, or even 500 games. Unless the score is truly ugly, such as +450, -50 or something equally one-sided. But +300, -200? Not going to trust that.
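Bob's "run the same 100-game match repeatedly" experiment is easy to reproduce in a few lines (a sketch of my own, not from the thread; the 30% win / 40% draw split between two identical engines is an assumed example):

```python
import math
import random

def play_match(p_win, p_draw, n_games, rng):
    """Total score for engine A over n_games (win = 1, draw = 0.5)."""
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    return score

def score_to_elo(score, n_games):
    """Convert a match score to an Elo difference via the logistic model."""
    p = min(max(score / n_games, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / p - 1.0)

rng = random.Random(42)
# Two *identical* engines: expected score 50% (30% wins, 40% draws).
elos = [score_to_elo(play_match(0.30, 0.40, 100, rng), 100) for _ in range(20)]
print(round(min(elos)), round(max(elos)))  # the spread is typically tens of Elo
```

Running the "same" 100-game match twenty times produces measured Elo differences scattered over a wide range even though the true difference is exactly zero, which is Bob's point about trusting short matches.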
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: On engine testing again!

Post by Sven »

zamar wrote:
Sven Schüle wrote:Could someone please post a table (or formula) listing typical error bars we can expect for 500, 1000, 1500, ..., 5000 games (most of us can't play more games within reasonable time), together with an explanation how the number of opponents and possibly other major factors affect the error bars? That would be great.

Sven
If I can recall correctly H.G.Mueller posted the approximative formula for the error bar of winning percentage: sigma = 40 / sqrt(number of games)

To get an approximation in Elo, multiply by 7.

Number of opponents does not affect the error bar. But when comparing gauntlets (of equal number of games) one must multiply the error bar by sqrt(2).

Someone please correct me if I got something wrong. I'm not an expert in statistics :)
Thanks for your reply. Using that formula would result in the following table, which I don't believe yet since it would heavily contradict Bob's results:

Code: Select all

errorBarELO = approx. 7 * 40 / sqrt(nGames) ???

nGames	errorBarELO ???
1	     280.0
2	     198.0
3	     161.7
4	     140.0
5	     125.2
6	     114.3
7	     105.8
8	     99.0
9	     93.3
10	    88.5
20	    62.6
30	    51.1
40	    44.3
50	    39.6
60	    36.1
70	    33.5
80	    31.3
90	    29.5
100	   28.0
200	   19.8
300	   16.2
400	   14.0
500	   12.5
1000	  8.9
1500	  7.2
2000	  6.3
2500	  5.6
3000	  5.1
3500	  4.7
4000	  4.4
4500	  4.2
5000	  4.0
5500	  3.8
6000	  3.6
6500	  3.5
7000	  3.3
7500	  3.2
8000	  3.1
8500	  3.0
9000	  3.0
9500	  2.9
10000	 2.8
15000	 2.3
20000	 2.0
25000	 1.8
30000	 1.6
35000	 1.5
40000	 1.4
45000	 1.3
50000	 1.3
It would also mean that 8 games could already be sufficient to get a result with about +/- 100 Elo error, which is far from what I expected.

So who can answer whether this is correct or not, and if not, what is correct instead?

Sven
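For what it's worth, the table above follows directly from the approximation, so a few rows can be checked with a trivial script (my own sketch):

```python
import math

# Reproduce rows of the table above: errorBarELO = approx. 7 * 40 / sqrt(nGames)
for n in (100, 1000, 10000, 50000):
    print(n, round(7 * 40 / math.sqrt(n), 1))
# prints:
# 100 28.0
# 1000 8.9
# 10000 2.8
# 50000 1.3
```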
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: On engine testing again!

Post by Sven »

Milos wrote:
zamar wrote:If I can recall correctly H.G.Mueller posted the approximative formula for the error bar of winning percentage: sigma = 40 / sqrt(number of games)
People often use this, but it can be quite wrong if the difference between the opponents is large, or if there are a lot of draws.

The full formula is simple enough:
sigma(in %)=sqrt((win%_a*win%_b - 25*draw%)/num_games)
How is this formula extended to gauntlets vs N opponents?

Also it does not work for many possible combinations of win%_a and draw% since (win%_a*win%_b - 25*draw%) can go negative quite easily so that sqrt() becomes undefined. For instance, that formula does not work for draw%=35 and win%_a>45 (46*(100-46-35)-25*35 = 874-875 = -1), and it fails completely for draw%>38.

So maybe I have misunderstood something, or there is a small error in the formula above?

Sven
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: On engine testing again!

Post by Sven »

Sven Schüle wrote:So maybe I have misunderstood something, or there is a small error in the formula above?
I think I found my misunderstanding. It is win%_a + win%_b = 100, but I thought it was win%_a + win%_b + draw% = 100. With the correct interpretation I now see how it works.

My other question remains: how is that formula extended to gauntlets vs N opponents?

Sven
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: On engine testing again!

Post by Milos »

Sven Schüle wrote:I think I found my misunderstanding. It is win%_a + win%_b = 100, but I thought it was win%_a + win%_b + draw% = 100. With the correct interpretation I now see how it works.
Yes, this is the correct interpretation.
Moreover, the mistake in the table you posted above is in how it is used.
The formula gives only 1 sigma, i.e. 68% probability.
So if you measure an Elo difference of N between engines A and B, there is a 68% chance that the real difference lies in the range [N-sigma, N+sigma].
If you want 95% certainty, you take the range [N-2*sigma, N+2*sigma].

For a gauntlet with many opponents, the exact formula becomes very complicated and cannot be calculated by hand (a multidimensional Gaussian distribution approximation). Moreover, the sigma intervals can be non-symmetrical. Still, as a rule of thumb, you can use
sigma = 1.41*40/sqrt(num_games), and this usually works well.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: On engine testing again!

Post by Sven »

Milos wrote:
Sven Schüle wrote:I think I found my misunderstanding. It is win%_a + win%_b = 100, but I thought it was win%_a + win%_b + draw% = 100. With the correct interpretation I now see how it works.
Yes, this is the correct interpretation.
Moreover, the mistake in the table you posted above is in how it is used.
The formula gives only 1 sigma, i.e. 68% probability.
So if you measure an Elo difference of N between engines A and B, there is a 68% chance that the real difference lies in the range [N-sigma, N+sigma].
If you want 95% certainty, you take the range [N-2*sigma, N+2*sigma].

For a gauntlet with many opponents, the exact formula becomes very complicated and cannot be calculated by hand (a multidimensional Gaussian distribution approximation). Moreover, the sigma intervals can be non-symmetrical. Still, as a rule of thumb, you can use
sigma = 1.41*40/sqrt(num_games), and this usually works well.
Thanks very much for clarifying!

So for 95% confidence, i.e. double sigma, the rule of thumb for gauntlets with many opponents would read like "errorBarELO = 7 * 2 * sqrt(2) * 40 / sqrt(nGames)", which is the same as "errorBarELO = 560 * sqrt(2) / sqrt(nGames)", or even shorter "errorBarELO = sqrt(627200 / nGames)", correct? This explains why even 10000 games are not sufficient for anything much better than a +/- 8 Elo error.

So here is my corrected table, still hoping it is that simple:

Code: Select all

errorBarELO (for gauntlets, double sigma/95% confidence) = approx. 560 * sqrt(2) / sqrt(nGames)

nGames   errorBarELO
1	     792.0
2	     560.0
3	     457.2
4	     396.0
5	     354.2
6	     323.3
7	     299.3
8	     280.0
9	     264.0
10	    250.4
20	    177.1
30	    144.6
40	    125.2
50	    112.0
60	    102.2
70	    94.7
80	    88.5
90	    83.5
100	   79.2
200	   56.0
300	   45.7
400	   39.6
500	   35.4
1000	  25.0
1500	  20.4
2000	  17.7
2500	  15.8
3000	  14.5
3500	  13.4
4000	  12.5
4500	  11.8
5000	  11.2
5500	  10.7
6000	  10.2
6500	  9.8
7000	  9.5
7500	  9.1
8000	  8.9
8500	  8.6
9000	  8.3
9500	  8.1
10000	 7.9
15000	 6.5
20000	 5.6
25000	 5.0
30000	 4.6
35000	 4.2
40000	 4.0
45000	 3.7
50000	 3.5
Sven
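The corrected table can be sanity-checked the same way (a sketch; the constants are the thread's rule of thumb):

```python
import math

def gauntlet_error_bar(n_games):
    """95% (2 sigma) Elo error bar for a gauntlet, per the rule of thumb:
    7 Elo/% * 2 (double sigma) * sqrt(2) (gauntlet factor) * 40 / sqrt(n)."""
    return 7 * 2 * math.sqrt(2) * 40 / math.sqrt(n_games)

for n in (100, 1000, 10000, 50000):
    print(n, round(gauntlet_error_bar(n), 1))
# prints:
# 100 79.2
# 1000 25.0
# 10000 7.9
# 50000 3.5
```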