Performance testing and number of opponents

Discussion of chess software programming and technical issues.

Moderator: Ras

CRoberson
Posts: 2095
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Performance testing and number of opponents

Post by CRoberson »

There are several ways to distribute the sparring partners. First, we need a large number of games for statistical significance and we'd like to vary the number of opponents. However, there are lots of options and I think they are not all equal.

I know that several of these are the same as N approaches infinity (where N is the number of games), but I am talking about 200 games specifically.

Let's say we are testing with 200 games. Now, let's choose between 4 methods:
1) 1 opponent and 200 games
2) 20 opponents with 10 games per opponent
3) 200 opponents with 1 game per opponent
4) 100 opponents with 2 games per opponent

In all cases, the openings/start positions are randomly chosen from some unbiased list.

Seems to me, this comes down to the efficiency of estimators. In cases 3 and 4, we have 200 and 100 estimators that will be averaged, but the variance of each individual estimator will be the largest of the 4 options. In case 1, there can be a bias that will skew the results. So, option 2 seems best. Also, 10 opponents at 20 games per opponent is a contender, but the fewer the opponents, the greater the possible bias.

Here I am not talking about thorough testing of a new idea in the code; I normally run thousands of games for that. I am talking about making a quick but reasonable Elo estimate at a long TC.
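The estimator argument above can be checked with a quick Monte Carlo sketch. All the numbers here are assumptions for illustration: each opponent's true expected score against the test engine is drawn with a hypothetical opponent-to-opponent spread of 0.05 score points, and each game is a win/loss Bernoulli trial (draws ignored for simplicity). We then measure how much the overall 200-game score estimate scatters under each of the four schemes:

```python
import random
import statistics

def simulate(n_opps, games_per_opp, trials=2000, base=0.5, opp_sd=0.05):
    """Empirical standard deviation of the measured score for one scheme.

    Hypothetical model: each opponent's true expected score is drawn
    from N(base, opp_sd); each game against that opponent is a win/loss
    Bernoulli trial at that score (draws are ignored for simplicity).
    """
    scores = []
    for _ in range(trials):
        total = 0
        for _ in range(n_opps):
            p = min(max(random.gauss(base, opp_sd), 0.0), 1.0)
            total += sum(random.random() < p for _ in range(games_per_opp))
        scores.append(total / (n_opps * games_per_opp))
    return statistics.stdev(scores)

for n_opps, gpo in [(1, 200), (20, 10), (200, 1), (100, 2)]:
    print(f"{n_opps:3d} opponents x {gpo:3d} games each: "
          f"sd = {simulate(n_opps, gpo):.4f}")
```

Under this model the opponent-bias term shrinks with the number of opponents, so schemes 2 through 4 come out close together and scheme 1 scatters the most; the ranking naturally depends on the assumed opponent spread.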
ZirconiumX
Posts: 1362
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Performance testing and number of opponents

Post by ZirconiumX »

CRoberson wrote:There are several ways to distribute the sparring partners. First, we need a large number of games for statistical significance and we'd like to vary the number of opponents. However, there are lots of options and I think they are not all equal.

I know that several of these are the same as N approaches infinity (where N is the number of games), but I am talking about 200 games specifically.

Let's say we are testing with 200 games. Now, let's choose between 4 methods:
1) 1 opponent and 200 games
2) 20 opponents with 10 games per opponent
3) 200 opponents with 1 game per opponent
4) 100 opponents with 2 games per opponent

In all cases, the openings/start positions are randomly chosen from some unbiased list.

Seems to me, this comes down to the efficiency of estimators. In cases 3 and 4, we have 200 and 100 estimators that will be averaged, but the variance of each individual estimator will be the largest of the 4 options. In case 1, there can be a bias that will skew the results. So, option 2 seems best. Also, 10 opponents at 20 games per opponent is a contender, but the fewer the opponents, the greater the possible bias.

Here I am not talking about thorough testing of a new idea in the code; I normally run thousands of games for that. I am talking about making a quick but reasonable Elo estimate at a long TC.
Option 1 will always be best.

One good reference point is better than 200 bad ones, since any Elo tool will give you pretty high error bounds with the 200 bad ones.

Matthew:out
tu ne cede malis, sed contra audentior ito
hgm
Posts: 28479
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Performance testing and number of opponents

Post by hgm »

It depends on how much difference there is between the results achieved against various opponents. You can see it like this:

From the infinite pool of opponents, and the infinite pool of start positions, you draw a sample. The sample average need not be equal to the pool average; the sampling causes a noise inversely proportional to the square root of the sample size. The total variance (= squared standard error) in your measurement will be the sum of the variances due to the opponent sampling and the position sampling.
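Assuming the noise sources are independent, this decomposition can be written out explicitly. The symbols here are introduced purely for illustration: $n_{\text{opp}}$ sampled opponents, $n_{\text{pos}}$ sampled positions, $N$ total games, pool spreads $\sigma_{\text{opp}}, \sigma_{\text{pos}}$, and per-game luck $\sigma_{\text{game}}$:

```latex
\sigma_{\text{total}}^2 \;=\;
  \underbrace{\frac{\sigma_{\text{opp}}^2}{n_{\text{opp}}}}_{\text{opponent sampling}}
\;+\;
  \underbrace{\frac{\sigma_{\text{pos}}^2}{n_{\text{pos}}}}_{\text{position sampling}}
\;+\;
  \underbrace{\frac{\sigma_{\text{game}}^2}{N}}_{\text{per-game luck}}
```

Only the last term is driven down by simply playing more games; the first two shrink only with the number of distinct opponents and positions.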

Unfortunately there is no a priori way of knowing how the average result for individual opponents, averaged over the entire pool of positions, will differ from opponent to opponent. To get an estimate for that, you could run a round robin between a group of independent engines with many games per pairing, calculate their ratings according to some rating model, use the calculated ratings to predict the result of each pairing with that same model, and look at the variance of the difference between the predicted and actual results of the pairings. The amount by which that measured variance exceeds the variance expected from the number of games would be the variance of the engine pool.
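That calibration procedure can be sketched with simulated data. Everything here is hypothetical: engines get true logistic ratings, each pairing's true score is perturbed by a per-pairing "style" term with sd 0.03 (this is the pool variance we try to recover), and for brevity the true ratings stand in for the fitted ones (in practice you would fit them with a rating tool). The excess of the observed pairing variance over the binomial variance expected from the game count estimates the pool variance:

```python
import random
import statistics

def pool_variance(n_engines=20, games=400, style_sd=0.03, seed=1):
    """Estimate the engine-pool variance from a simulated round robin.

    Each pairing's true score deviates from the rating-model prediction
    by a per-pairing 'style' term (hypothetical, sd = style_sd). The
    excess of the observed variance of (actual - predicted) over the
    binomial variance expected from `games` games recovers style_sd**2.
    """
    rng = random.Random(seed)
    ratings = [rng.gauss(0, 100) for _ in range(n_engines)]
    diffs, binom = [], []
    for i in range(n_engines):
        for j in range(i + 1, n_engines):
            # Logistic rating model prediction for this pairing.
            p_model = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400))
            # True score: model prediction plus a per-pairing style term.
            p_true = min(max(p_model + rng.gauss(0, style_sd), 0.0), 1.0)
            score = sum(rng.random() < p_true for _ in range(games)) / games
            diffs.append(score - p_model)
            binom.append(p_model * (1 - p_model) / games)
    return statistics.pvariance(diffs) - statistics.fmean(binom)

print(f"estimated pool variance: {pool_variance():.5f} (true: {0.03**2:.5f})")
```

With enough pairings the estimate lands near the true 0.0009; the scatter of the estimate itself shrinks with the number of pairings.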

Unless there is no variance within the engine pool at all (i.e. the rating measured against a single opponent would always be the same, no matter what opponent you selected), it will at some point become pointless to do more games, as the Elo error will be dominated by opponent selection. To converge to the true rating, you would have to let both the number of games and the number of positions tend to infinity.
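That error floor can be put in numbers with a small sketch. The magnitudes are assumptions: a per-game score sd of about 0.42 (typical for a mix of wins, draws, and losses) and an opponent-to-opponent spread of 0.05 score points:

```python
import math

GAME_SD = 0.42  # assumed per-game score standard deviation
OPP_SD = 0.05   # assumed opponent-to-opponent spread (score units)

def score_se(n_opps, games_per_opp):
    """Standard error of the measured score: the game-noise term shrinks
    with the total game count, but the opponent-sampling term shrinks
    only with the number of distinct opponents."""
    n_games = n_opps * games_per_opp
    return math.sqrt(GAME_SD**2 / n_games + OPP_SD**2 / n_opps)

for gpo in (10, 100, 1000, 10000):
    print(f"4 opponents, {gpo:5d} games each: se = {score_se(4, gpo):.4f}")
```

With these numbers the standard error never drops below OPP_SD / sqrt(4) = 0.025 no matter how many games you add against the same 4 opponents, which is the "pointless to do more games" regime.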
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Performance testing and number of opponents

Post by michiguel »

CRoberson wrote:There are several ways to distribute the sparring partners. First, we need a large number of games for statistical significance and we'd like to vary the number of opponents. However, there are lots of options and I think they are not all equal.

I know that several of these are the same as N approaches infinity (where N is the number of games), but I am talking about 200 games specifically.

Let's say we are testing with 200 games. Now, let's choose between 4 methods:
1) 1 opponent and 200 games
2) 20 opponents with 10 games per opponent
3) 200 opponents with 1 game per opponent
4) 100 opponents with 2 games per opponent

In all cases, the openings/start positions are randomly chosen from some unbiased list.

Seems to me, this comes down to the efficiency of estimators. In cases 3 and 4, we have 200 and 100 estimators that will be averaged, but the variance of each individual estimator will be the largest of the 4 options. In case 1, there can be a bias that will skew the results. So, option 2 seems best. Also, 10 opponents at 20 games per opponent is a contender, but the fewer the opponents, the greater the possible bias.

Here I am not talking about thorough testing of a new idea in the code; I normally run thousands of games for that. I am talking about making a quick but reasonable Elo estimate at a long TC.
Without any other information, #3.

Miguel
PS: That is the theory; in practice it is hard to get truly 200 different opponents near the level of your particular engine.