Testing: More Opponents Or More Positions

brianr
Posts: 540
Joined: Thu Mar 09, 2006 3:01 pm
Full name: Brian Richardson

Testing: More Opponents Or More Positions

Post by brianr »

I currently test Tinker using various "levels", from a few select positions to 1,600-game gauntlets, with various other tests in between.
Considering some of the relatively recent developments regarding testing, I would appreciate some comments or suggestions about the number of opponents vs. the number of starting positions (Nunn, Noomen, Silver, more).

Current gauntlet:
20 Nunn2 and 30 Noomen, 50 total
Both white and black so 100 games per opponent
16 opponents
1,600 games total
Time control 1min plus 2 sec increment
(seems to work well under Arena, minimizes time losses, and still lets searches get deep enough to trigger various extension/reduction options).

Alternative 1:
Add 50 Silver positions so total is 100
Still do both white and black
Reduce opponent count to 8
Some of the current 16 opponents consistently seem pretty close to each other.
Would still be 1,600 games and same time.

Alternative 2:
Even more positions.
Add, say, another 100 positions by randomly picking a fixed subset from the "Hyatt.3891" set.
Reduce the number of opponents to 4.

Alternative 3:
Use even more time with more positions, as in Alternative 1,
and increase the number of opponents to, say, 32.

Additional Option:
Only run each position once as either black or white

I would prefer to avoid testing at much faster time controls and do not want to have to write a minimal overhead test harness.
Arena is "ok" and simplifies keeping track of engines.
On my Q6600, the current 1,600 games take between 2 and 3 days.

My aim is simply to know if Tinker version x+1 is better or worse than version x.
I would also like to incorporate the "Orthogonal Testing" approach
(once I fully understand it).

Thanks,
Brian
krazyken

Re: Testing: More Opponents Or More Positions

Post by krazyken »

Here is another alternative: use 800 positions and play both black and white to get your 1,600 games. The number of opponents is arbitrary; more is better. You assign one opponent randomly to each position.
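As a rough illustration of that pairing scheme, here is a minimal sketch (the positions file "positions.epd" and the engine names are only placeholders, not real files or engines):

Code:

# Sketch: assign one randomly chosen opponent to each of 800 start
# positions and play each position with both colors, 1,600 games total.
# "positions.epd" and the engine names are placeholders.
import random

OPPONENTS = ["EngineA", "EngineB", "EngineC", "EngineD"]  # any number works

with open("positions.epd") as f:
    positions = [line.strip() for line in f if line.strip()][:800]

schedule = []
for pos in positions:
    opp = random.choice(OPPONENTS)          # one random opponent per position
    schedule.append((pos, "Tinker", opp))   # Tinker has the white pieces
    schedule.append((pos, opp, "Tinker"))   # same position, colors reversed

print(len(schedule), "games scheduled")     # 1,600 if 800 positions were read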
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing: More Opponents Or More Positions

Post by bob »

brianr wrote:I currently test Tinker using various "levels", from a few select positions to 1,600-game gauntlets, with various other tests in between.
Considering some of the relatively recent developments regarding testing, I would appreciate some comments or suggestions about the number of opponents vs. the number of starting positions (Nunn, Noomen, Silver, more).

Current gauntlet:
20 Nunn2 and 30 Noomen, 50 total
Both white and black so 100 games per opponent
16 opponents
1,600 games total
Time control 1min plus 2 sec increment
(seems to work well under Arena, minimizes time losses, and still lets searches get deep enough to trigger various extension/reduction options).

Alternative 1:
Add 50 Silver positions so total is 100
Still do both white and black
Reduce opponent count to 8
Some of the current 16 opponents consistently seem pretty close to each other.
Would still be 1,600 games and same time.

Alternative 2:
Even more positions.
Add, say, another 100 positions by randomly picking a fixed subset from the "Hyatt.3891" set.
Reduce the number of opponents to 4.

Alternative 3:
Use even more time with more positions, as in Alternative 1,
and increase the number of opponents to, say, 32.

Additional Option:
Only run each position once as either black or white

I would prefer to avoid testing at much faster time controls and do not want to have to write a minimal overhead test harness.
Arena is "ok" and simplifies keeping track of engines.
On my Q6600, the current 1,600 games take between 2 and 3 days.

My aim is simply to know if Tinker version x+1 is better or worse than version x.
I would also like to incorporate the "Orthogonal Testing" approach
(once I fully understand it).

Thanks,
Brian
The problem is the error bar for 1,600 games. It is _way_ too big to catch most X vs X+1 results accurately. If you use BayesElo, it will give you the error bar. You are running 1/20th the number of games I use for these tests most of the time, and with 32,000 games the error bar is in the range +/- 4 to 5. Unless your changes are +10 or better, 32,000 games won't be enough.
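As a back-of-the-envelope sketch of how that error bar scales with the number of games (assuming a normal approximation and a per-game score standard deviation of roughly 0.4; the exact figures depend on the draw rate and on how BayesElo computes its intervals, so they will not match its output exactly):

Code:

# Rough 95% error bar (in Elo) for a match of n games.  Uses the slope of
# the Elo curve at a 50% score (~695 Elo per unit of score) and assumes a
# per-game score standard deviation sigma of about 0.4.
import math

def elo_error_bar(n_games, sigma=0.4, z=1.96):
    return z * 695.0 * sigma / math.sqrt(n_games)

for n in (1600, 8000, 32000):
    print(n, round(elo_error_bar(n), 1))
# roughly: 1600 -> 13.6, 8000 -> 6.1, 32000 -> 3.0

The bars BayesElo reports for a gauntlet tend to come out somewhat wider than this simple estimate, but the square-root scaling is the same.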
krazyken

Re: Testing: More Opponents Or More Positions

Post by krazyken »

bob wrote:
The problem is the error bar for 1,600 games. It is _way_ too big to catch most X vs X+1 results accurately. If you use BayesElo, it will give you the error bar. You are running 1/20th the number of games I use for these tests most of the time, and with 32,000 games the error bar is in the range +/- 4 to 5. Unless your changes are +10 or better, 32,000 games won't be enough.
Not entirely accurate. Even though the error bars overlap, it's still possible to determine with reasonable probability whether one version is better than the other. This requires looking at the area of overlap of the two distributions. If you use BayesElo you can see these probabilities with the LOS command.
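As a rough sketch of that calculation, assuming the two rating estimates are approximately normal and independent (BayesElo's los command works from the full joint distribution, so this is only an approximation):

Code:

# Approximate likelihood of superiority (LOS) for two Elo estimates that
# come with 95% error bars, assuming independent normal errors.
import math

def los(elo_a, bar_a, elo_b, bar_b, z=1.96):
    se_a, se_b = bar_a / z, bar_b / z              # 95% bars -> std errors
    diff_se = math.sqrt(se_a ** 2 + se_b ** 2)
    return 0.5 * (1.0 + math.erf((elo_b - elo_a) / (diff_se * math.sqrt(2.0))))

print(round(los(2600, 20, 2605, 20), 2))   # ~0.64: not much better than a coin flip
print(round(los(2600, 20, 2630, 20), 2))   # ~0.98: fairly convincing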
brianr
Posts: 540
Joined: Thu Mar 09, 2006 3:01 pm
Full name: Brian Richardson

Re: Testing: More Opponents Or More Positions

Post by brianr »

When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total),
the ratings show an Elo difference of about 26 +/- 1,
but the LOS (with the mm 1 option?) shows 98.

Is this saying there is a 98% chance that the version that is 26 elo higher is better?

Exactly what does the mm 1 option do?
I recall an earlier post mentioning it, but I tried it without the mm 1 option and the results are the same.
krazyken

Re: Testing: More Opponents Or More Positions

Post by krazyken »

brianr wrote:When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total),
the ratings show an Elo difference of about 26 +/- 1,
I don't see how you arrive at this number, but +/- 1 seems a bit small for the number of games you are talking about. Perhaps you could post the original data?
brianr wrote: but the LOS (with the mm 1 option?) shows 98.

Is this saying there is a 98% chance that the version that is 26 elo higher is better?
Yes
brianr wrote: Exactly what does the mm 1 option do?
I recall an earlier post mentioning it, but I tried it without the mm 1 option and the results are the same.
The 1 tells it to calculate the advantage of going first. Use the advantage command to see what that comes out to.
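For reference, a typical BayesElo session looks roughly like this (from memory, so check the built-in help if a command name differs):

Code:

readpgn combined.pgn
elo
mm 1
ratings
los
advantage

readpgn loads the games, elo enters the rating sub-prompt, mm 1 fits the ratings together with the advantage of moving first, ratings prints the Elo table with error bars, los prints the likelihood-of-superiority matrix, and advantage shows the fitted first-move advantage.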
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing: More Opponents Or More Positions

Post by bob »

krazyken wrote:
bob wrote:
The problem is the error bar for 1,600 games. It is _way_ too big to catch most X vs X+1 results accurately. If you use BayesElo, it will give you the error bar. You are running 1/20th the number of games I use for these tests most of the time, and with 32,000 games the error bar is in the range +/- 4 to 5. Unless your changes are +10 or better, 32,000 games won't be enough.
Not entirely accurate. Even though the error bars overlap, it's still possible to determine with reasonable probability whether one version is better than the other. This requires looking at the area of overlap of the two distributions. If you use BayesElo you can see these probabilities with the LOS command.
With 32K games the error bar is roughly +/- 5, with 8K games roughly +/- 10, and with 2K games +/- 20.

If you have a 2-3-4 Elo change, you are not going to get any confidence that X+1 is better than X; the error bars are going to almost completely overlap. How would you evaluate 2600 +/- 20 and 2602 +/- 20?

The +/- 20 is simply too big. If this were 2600 +/- 20 and 2630 +/- 20 you might be more confident that the change is better. But 2600 +/- 20 and 2605 +/- 20 doesn't fill me with confidence.

Hence my comments...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing: More Opponents Or More Positions

Post by bob »

brianr wrote:When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total),
the ratings show an Elo difference of about 26 +/- 1,
but the LOS (with the mm 1 option?) shows 98.

Is this saying there is a 98% chance that the version that is 26 elo higher is better?

Exactly what does the mm 1 option do?
I recall an earlier post mentioning it, but I tried it without the mm 1 option and the results are the same.
Correct. But 26 Elo changes are not very common. The bigger the difference between the two versions, the fewer games you need to confidently say one is better than the other.

I do not quite follow your 26 +/- 1... the error bar can't possibly be +/- 1 with a total of 3,200 games. You have to get up to the million-game level to get that kind of confidence. Here's one recent example:

Code:

Rank Name              Elo    +    - games score oppo. draws
   1 Toga2             2672    2    3 194550   60%  2600   25%
   2 Glaurung 2.2      2664    3    2 194550   59%  2600   24%

Almost 200,000 games, and a 5 Elo error bar; 800,000 games will drop that to 2.5 or so.
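Turning that around, a rough sketch of how many games it takes before the error bar shrinks below the difference you want to detect (same normal approximation as before, with a per-game score standard deviation around 0.4, so treat these as ballpark figures):

Code:

# Ballpark number of games needed before a 95% error bar (in Elo) drops
# below the difference being measured.  Assumes a normal approximation and
# a per-game score standard deviation of about 0.4 (draw-rate dependent).
import math

def games_needed(elo_diff, sigma=0.4, z=1.96):
    per_game = z * 695.0 * sigma              # ~545 "Elo" of noise per game
    return math.ceil((per_game / elo_diff) ** 2)

for d in (2, 5, 10, 25):
    print(d, games_needed(d))
# roughly: 2 -> ~74,000   5 -> ~12,000   10 -> ~3,000   25 -> ~500

Real gauntlet numbers come out higher than this simple estimate, but the quadratic scaling is the point: halving the Elo difference you want to resolve quadruples the number of games.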
krazyken

Re: Testing: More Opponents Or More Positions

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
The problem is the error bar for 1,600 games. It is _way_ too big to catch most X vs X+1 results accurately. If you use BayesElo, it will give you the error bar. You are running 1/20th the number of games I use for these tests most of the time, and with 32,000 games the error bar is in the range +/- 4 to 5. Unless your changes are +10 or better, 32,000 games won't be enough.
Not entirely accurate. Even though the error bars overlap, it's still possible to determine with reasonable probability whether one version is better than the other. This requires looking at the area of overlap of the two distributions. If you use BayesElo you can see these probabilities with the LOS command.
With 32K games the error bar is roughly +/- 5, with 8K games roughly +/- 10, and with 2K games +/- 20.

If you have a 2-3-4 Elo change, you are not going to get any confidence that X+1 is better than X; the error bars are going to almost completely overlap. How would you evaluate 2600 +/- 20 and 2602 +/- 20?

The +/- 20 is simply too big. If this were 2600 +/- 20 and 2630 +/- 20 you might be more confident that the change is better. But 2600 +/- 20 and 2605 +/- 20 doesn't fill me with confidence.

Hence my comments...
I understand that, of course, but most people are not trying to measure an Elo change that small. Given the amount of time required to measure a change that small, it is usually better to declare the versions equal and move on to something else.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing: More Opponents Or More Positions

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
The problem is the error bar for 1,600 games. It is _way_ too big to catch most X vs X+1 results accurately. If you use BayesElo, it will give you the error bar. You are running 1/20th the number of games I use for these tests most of the time, and with 32,000 games the error bar is in the range +/- 4 to 5. Unless your changes are +10 or better, 32,000 games won't be enough.
Not entirely accurate. Even though the error bars overlap, it's still possible to determine with reasonable probability whether one version is better than the other. This requires looking at the area of overlap of the two distributions. If you use BayesElo you can see these probabilities with the LOS command.
With 32K games the error bar is roughly +/- 5, with 8K games roughly +/- 10, and with 2K games +/- 20.

If you have a 2-3-4 Elo change, you are not going to get any confidence that X+1 is better than X; the error bars are going to almost completely overlap. How would you evaluate 2600 +/- 20 and 2602 +/- 20?

The +/- 20 is simply too big. If this were 2600 +/- 20 and 2630 +/- 20 you might be more confident that the change is better. But 2600 +/- 20 and 2605 +/- 20 doesn't fill me with confidence.

Hence my comments...
I understand that, of course, but most people are not trying to measure an Elo change that small. Given the amount of time required to measure a change that small, it is usually better to declare the versions equal and move on to something else.
There are many programs where changes of +20 are going to be _very_ rare. Brian's program is not brand new. Most changes are evolutionary in nature, not revolutionary. If you can't pick up small changes, you will plateau very quickly and never move forward, because _every_ change is small, and that approach would cause you to throw out good changes or keep bad ones with equal probability.

Crafty 23.0 was a clear +70 Elo improvement over 22.x. But no single change was even +10. Most were less than +5.