Testing: More Opponents Or More Positions

- Posts: 540
- Joined: Thu Mar 09, 2006 3:01 pm
- Full name: Brian Richardson

I currently test Tinker using various "levels", from a few select positions to 1,600-game gauntlets, with various other tests in between.
Considering some of the relatively recent developments regarding testing, I would appreciate comments or suggestions about the number of opponents vs. the number of starting positions (Nunn, Noomen, Silver, more).
Current gauntlet:
20 Nunn2 and 30 Noomen, 50 total
Both white and black, so 100 games per opponent
16 opponents
1,600 games total
Time control: 1 min plus 2 sec increment
(This seems to work well under Arena, minimizes time losses, and still lets searches get deep enough to trigger the various extension/reduction options.)
Alternative 1:
Add 50 Silver positions, so the total is 100
Still do both white and black
Reduce opponent count to 8
Some of the current 16 opponents consistently seem pretty close to each other.
Would still be 1,600 games and the same amount of time.
Alternative 2:
Even more positions.
Add, say, another 100 by randomly picking a fixed subset from the "Hyatt.3891" set.
Reduce the number of opponents to 4.
Alternative 3:
Use even more time, with more positions as in Alternative 1,
and increase the opponent count to, say, 32.
Additional Option:
Only run each position once as either black or white
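For a rough sanity check, the game counts for each setup can be worked out with a few lines of Python (a sketch; it assumes Alternative 2 ends up with about 200 positions in total):

Code: Select all

def games(positions, opponents, both_colors=True):
    # each position is played against each opponent, once per color
    return positions * opponents * (2 if both_colors else 1)

print("Current:        ", games(50, 16))    # 1600
print("Alternative 1:  ", games(100, 8))    # 1600
print("Alternative 2:  ", games(200, 4))    # 1600 (assuming ~200 positions total)
print("Alternative 3:  ", games(100, 32))   # 6400, hence the extra time
print("One color only: ", games(50, 16, both_colors=False))  # 800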
I would prefer to avoid testing at much faster time controls, and I do not want to have to write a minimal-overhead test harness.
Arena is "ok" and simplifies keeping track of engines.
On my Q6600, the current 1,600 games take 2 to 3 days.
My aim is simply to know if Tinker version x+1 is better or worse than version x.
I would also like to incorporate the "Orthogonal Testing" approach (once I fully understand it).
Thanks,
Brian
Re: Testing: More Opponents Or More Positions
Here is another alternative: use 800 positions, played as both black and white, to get your 1,600 games. The number of opponents is arbitrary; more is better. Assign one opponent randomly to each position.
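A minimal sketch of that pairing scheme, with placeholder engine and position names (nothing here is tied to a particular GUI):

Code: Select all

import random

# Hypothetical names; substitute your own engine list and EPD/FEN set.
opponents = ["EngineA", "EngineB", "EngineC", "EngineD"]
positions = [f"pos{i:03d}" for i in range(800)]   # 800 starting positions

random.seed(1)  # fixed seed so the schedule is reproducible
schedule = []
for pos in positions:
    opp = random.choice(opponents)        # one randomly chosen opponent per position
    schedule.append((pos, opp, "white"))  # play the position once with each color
    schedule.append((pos, opp, "black"))

print(len(schedule))  # 1600 games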
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
brianr wrote:
I currently test Tinker using various "levels", from a few select positions to 1,600-game gauntlets, with various other tests in between. [...]
My aim is simply to know if Tinker version x+1 is better or worse than version x.

The problem is the error bar for 1,600 games. It is _way_ too big to catch most X vs X+1 results accurately. If you use BayesElo, it will give you the error bar. You are running 1/20th the number of games I use for these tests most of the time, and with 32,000 games the error bar is in the range +/- 4 to 5. Unless your changes are +10 or better, 32,000 games won't be enough.
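To get a feel for how the error bar shrinks with the number of games, here is a rough back-of-the-envelope sketch. It uses a simple normal approximation with an assumed draw rate, so the absolute numbers will not match BayesElo's output (which also accounts for the uncertainty in the opponents' ratings), but the 1/sqrt(games) behaviour is the point:

Code: Select all

import math

def elo_error_bar(games, draw_rate=0.32, score=0.50, z=1.96):
    """Approximate 95% error bar, in Elo, on the score of a `games`-game match.

    Assumes an overall score near 50% and the given draw rate; this is a
    back-of-the-envelope estimate, not what BayesElo actually computes.
    """
    wins = score - draw_rate / 2.0
    losses = 1.0 - wins - draw_rate
    var = wins * 1.0 + draw_rate * 0.25 + losses * 0.0 - score ** 2   # per-game score variance
    se = math.sqrt(var / games)                                       # standard error of the mean score
    elo_per_point = 400.0 / (math.log(10.0) * score * (1.0 - score))  # Elo per unit of score near 50%
    return z * se * elo_per_point

for n in (1600, 8000, 32000):
    print(f"{n:6d} games: +/- {elo_error_bar(n):.1f} Elo")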
Re: Testing: More Opponents Or More Positions
bob wrote:
The problem is the error bar for 1600 games. It is _way_ too big to catch most X vs X+1 results accurately. [...]

Not entirely accurate; even though the error bars overlap, it's still possible to determine with reasonable probability whether one version is better than the other. This requires looking at the area of overlap of the two distributions. If you use BayesElo you can see these probabilities with the LOS command.
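For a direct head-to-head match, the likelihood of superiority can be estimated straight from the win and loss counts. A small sketch using the usual normal approximation (draws are ignored because they say nothing about which side is stronger):

Code: Select all

import math

def los(wins, losses):
    """Likelihood of superiority from a head-to-head result (normal approximation)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# e.g. a hypothetical 1600-game match with 560 wins, 500 losses, 540 draws
print(f"LOS = {los(560, 500):.3f}")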
- Posts: 540
- Joined: Thu Mar 09, 2006 3:01 pm
- Full name: Brian Richardson
Re: Testing: More Opponents Or More Positions
When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total),
the ratings show an elo difference of about 26 +/- 1,
but the LOS (with the mm 1 option?) shows 98.
Is this saying there is a 98% chance that the version that is 26 elo higher is better?
Exactly what does the mm 1 option do?
I recall an earlier post mentioning it, but I tried it without the mm 1 option and the results are the same.
Re: Testing: More Opponents Or More Positions
brianr wrote:
When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total), the ratings show an elo difference of about 26 +/- 1,

I'm not understanding how you come to this number, but +/- 1 seems a bit small for the number of games you are talking about. Perhaps you want to post the original data?

brianr wrote:
but the LOS (with the mm 1 option?) shows 98.
Is this saying there is a 98% chance that the version that is 26 elo higher is better?

Yes.

brianr wrote:
Exactly what does the mm 1 option do?
I recall an earlier post mentioning it, but I tried it without the mm 1 option and the results are the same.

The 1 tells it to calculate the advantage of going first. Use the advantage command to see what that comes out to.
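For anyone following along, a typical bayeselo session for this kind of check looks roughly like the following. The command names are from memory, so check them against the program's built-in help; the text after each # is just an annotation, not part of the command:

Code: Select all

readpgn combined.pgn   # PGN containing both Tinker versions' games
elo                    # enter the rating-calculation mode
mm 1                   # maximum-likelihood fit; the 1 asks it to fit the first-move advantage too
ratings                # print the rating list with error bars
los                    # print the likelihood-of-superiority matrix
advantage              # show what the fitted advantage of moving first came out to
x                      # leave the rating mode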
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
krazyken wrote:
Not entirely accurate, even though the error bars overlap, it's still possible to determine with reasonable probability if one version is better than the other. [...]

With 32K games, the error bar is roughly +/- 5; with 8K games it is roughly +/- 10; with 2K games, +/- 20.
If you have a 2-3-4 elo change, you are not going to get any confidence that X+1 is better than X; the error bars are going to almost completely overlap. How would you evaluate 2600 +/- 20 and 2602 +/- 20?
The +/- 20 is simply too big. If this were 2600 +/- 20 and 2630 +/- 20 you might be more confident that the change is better. But 2600 +/- 20 and 2605 +/- 20 doesn't fill me with confidence.
Hence my comments...
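Concretely, if the two ratings are treated as independent normal estimates whose +/- figures are 95% intervals, the chance that the nominally higher one is really stronger can be read off the overlap; a rough sketch:

Code: Select all

import math

def prob_better(elo_a, elo_b, bar_a, bar_b, z=1.96):
    """P(A is really stronger than B), treating each rating as an independent
    normal estimate whose +/- bar is a 95% (z = 1.96) interval."""
    sigma_a = bar_a / z
    sigma_b = bar_b / z
    sigma_diff = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return 0.5 * (1.0 + math.erf((elo_a - elo_b) / (sigma_diff * math.sqrt(2.0))))

print(f"2602 +/- 20 vs 2600 +/- 20: {prob_better(2602, 2600, 20, 20):.2f}")
print(f"2630 +/- 20 vs 2600 +/- 20: {prob_better(2630, 2600, 20, 20):.2f}")

Under those assumptions, 2602 +/- 20 vs 2600 +/- 20 comes out barely better than a coin flip, while 2630 +/- 20 vs 2600 +/- 20 is already close to 98%.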
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
brianr wrote:
When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total), the ratings show an elo difference of about 26 +/- 1, but the LOS (with the mm 1 option?) shows 98.
Is this saying there is a 98% chance that the version that is 26 elo higher is better? [...]

Correct. But 26 Elo changes are not very common for the most part. The bigger the difference between the two versions, the fewer games you need to confidently say one is better than the other.

I do not quite follow your 26 +/- 1... the error bar can't possibly be +/- 1 with a total of 3,200 games. You have to get up to the million-game level to get that kind of confidence. Here's one recent example:
Code: Select all
Rank Name           Elo    +    -  games  score  oppo.  draws
   1 Toga2         2672    2    3 194550    60%   2600    25%
   2 Glaurung 2.2  2664    3    2 194550    59%   2600    24%
Re: Testing: More Opponents Or More Positions
bob wrote:
With 32K games, the error bar is roughly +/- 5; with 8K games it is roughly +/- 10; with 2K games, +/- 20. [...]

I understand that of course, but most people are not trying to measure an elo change that small. Given the amount of time required to measure a change that small, it is usually better to declare the two versions equal and move on to something else.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
krazyken wrote:
I understand that of course, but most people are not trying to measure an elo change that small. [...]

There are many programs where changes of +20 are going to be _very_ rare. Brian's program is not brand new. Most changes are evolutionary in nature, not revolutionary. If you can't pick up small changes, you will reach a very quick plateau and never move forward, because _every_ change is small and your idea would cause you to throw out good changes or keep bad ones with equal probability.
Crafty 23.0 was a clear +70 Elo improvement over 22.x. But no single change was even +10. Most were less than +5.
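To put rough numbers on that, the back-of-the-envelope model from earlier in the thread can be inverted to ask how many games are needed before the 95% error bar is smaller than the change being measured. The absolute counts are more optimistic than the BayesElo bars quoted above, because this simple model ignores the uncertainty in the opponents' ratings, but the rapid growth as the change shrinks is the point:

Code: Select all

import math

def games_needed(elo_diff, draw_rate=0.32, z=1.96):
    """Very rough number of games before the 95% error bar on the match score
    is smaller than elo_diff (same simplistic model as the earlier sketch)."""
    var = (1.0 - draw_rate) * 0.25                    # per-game score variance at a 50% score
    elo_per_point = 400.0 / (math.log(10.0) * 0.25)   # Elo per unit of score near 50%
    return math.ceil((z * elo_per_point) ** 2 * var / elo_diff ** 2)

for diff in (20, 10, 5, 2):
    print(f"+{diff} Elo change: about {games_needed(diff)} games")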