Testing: More Opponents Or More Positions

- Posts: 540
- Joined: Thu Mar 09, 2006 3:01 pm
- Full name: Brian Richardson

I currently test Tinker using various "levels", from a few select positions to 1,600-game gauntlets, with various other tests in between.
Considering some of the relatively recent developments regarding testing, I would appreciate comments or suggestions about the number of opponents vs. the number of starting positions (Nunn, Noomen, Silver, more).
Current gauntlet:
20 Nunn2 and 30 Noomen, 50 total
Both white and black, so 100 games per opponent
16 opponents
1,600 games total
Time control: 1 min plus 2 sec increment
(This seems to work well under Arena, minimizes time losses, and still lets searches get deep enough to trigger the various extension/reduction options.)
Alternative 1:
Add 50 Silver positions, so the total is 100
Still do both white and black
Reduce opponent count to 8
Some of the current 16 opponents consistently seem pretty close to each other.
Would still be 1,600 games and the same amount of time.
Alternative 2:
Even more positions.
Add, say, another 100 by randomly picking a fixed subset from the "Hyatt.3891" set.
Reduce the number of opponents to 4.
Alternative 3:
Use even more time, with more positions as in Alternative 1,
and increase the opponent count to, say, 32.
Additional Option:
Only run each position once as either black or white
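For a rough sanity check, the game counts for each setup can be worked out with a few lines of Python (a sketch; it assumes Alternative 2 ends up with about 200 positions in total):

Code: Select all

def games(positions, opponents, both_colors=True):
    # each position is played against each opponent, once per color
    return positions * opponents * (2 if both_colors else 1)

print("Current:        ", games(50, 16))    # 1600
print("Alternative 1:  ", games(100, 8))    # 1600
print("Alternative 2:  ", games(200, 4))    # 1600 (assuming ~200 positions total)
print("Alternative 3:  ", games(100, 32))   # 6400, hence the extra time
print("One color only: ", games(50, 16, both_colors=False))  # 800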
I would prefer to avoid testing at much faster time controls, and I do not want to have to write a minimal-overhead test harness.
Arena is "ok" and simplifies keeping track of engines.
On my Q6600, the current 1,600 games take 2 to 3 days.
My aim is simply to know if Tinker version x+1 is better or worse than version x.
I would also like to incorporate the "Orthogonal Testing" approach (once I fully understand it).
Thanks,
Brian
Re: Testing: More Opponents Or More Positions
Here is another alternative: use 800 positions, played as both black and white, to get your 1,600 games. The number of opponents is arbitrary; more is better. Assign one opponent randomly to each position.
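A minimal sketch of that pairing scheme, with placeholder engine and position names (nothing here is tied to a particular GUI):

Code: Select all

import random

# Hypothetical names; substitute your own engine list and EPD/FEN set.
opponents = ["EngineA", "EngineB", "EngineC", "EngineD"]
positions = [f"pos{i:03d}" for i in range(800)]   # 800 starting positions

random.seed(1)  # fixed seed so the schedule is reproducible
schedule = []
for pos in positions:
    opp = random.choice(opponents)        # one randomly chosen opponent per position
    schedule.append((pos, opp, "white"))  # play the position once with each color
    schedule.append((pos, opp, "black"))

print(len(schedule))  # 1600 games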
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
brianr wrote:
I currently test Tinker using various "levels", from a few select positions to 1,600-game gauntlets, with various other tests in between. [...]
My aim is simply to know if Tinker version x+1 is better or worse than version x.

The problem is the error bar for 1,600 games. It is _way_ too big to catch most X vs X+1 results accurately. If you use BayesElo, it will give you the error bar. You are running 1/20th the number of games I use for these tests most of the time, and with 32,000 games the error bar is in the range +/- 4 to 5. Unless your changes are +10 or better, 32,000 games won't be enough.
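To get a feel for how the error bar shrinks with the number of games, here is a rough back-of-the-envelope sketch. It uses a simple normal approximation with an assumed draw rate, so the absolute numbers will not match BayesElo's output (which also accounts for the uncertainty in the opponents' ratings), but the 1/sqrt(games) behaviour is the point:

Code: Select all

import math

def elo_error_bar(games, draw_rate=0.32, score=0.50, z=1.96):
    """Approximate 95% error bar, in Elo, on the score of a `games`-game match.

    Assumes an overall score near 50% and the given draw rate; this is a
    back-of-the-envelope estimate, not what BayesElo actually computes.
    """
    wins = score - draw_rate / 2.0
    losses = 1.0 - wins - draw_rate
    var = wins * 1.0 + draw_rate * 0.25 + losses * 0.0 - score ** 2   # per-game score variance
    se = math.sqrt(var / games)                                       # standard error of the mean score
    elo_per_point = 400.0 / (math.log(10.0) * score * (1.0 - score))  # Elo per unit of score near 50%
    return z * se * elo_per_point

for n in (1600, 8000, 32000):
    print(f"{n:6d} games: +/- {elo_error_bar(n):.1f} Elo")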
Re: Testing: More Opponents Or More Positions
bob wrote:
The problem is the error bar for 1600 games. It is _way_ too big to catch most X vs X+1 results accurately. [...]

Not entirely accurate; even though the error bars overlap, it's still possible to determine with reasonable probability whether one version is better than the other. This requires looking at the area of overlap of the two distributions. If you use BayesElo you can see these probabilities with the LOS command.
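For a direct head-to-head match, the likelihood of superiority can be estimated straight from the win and loss counts. A small sketch using the usual normal approximation (draws are ignored because they say nothing about which side is stronger):

Code: Select all

import math

def los(wins, losses):
    """Likelihood of superiority from a head-to-head result (normal approximation)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# e.g. a hypothetical 1600-game match with 560 wins, 500 losses, 540 draws
print(f"LOS = {los(560, 500):.3f}")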
- Posts: 540
- Joined: Thu Mar 09, 2006 3:01 pm
- Full name: Brian Richardson
Re: Testing: More Opponents Or More Positions
When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total),
the ratings show an elo difference of about 26 +/- 1,
but the LOS (with the mm 1 option?) shows 98.
Is this saying there is a 98% chance that the version that is 26 elo higher is better?
Exactly what does the mm 1 option do?
I recall an earlier post mentioning it, but I tried it without the mm 1 option and the results are the same.
Re: Testing: More Opponents Or More Positions
brianr wrote:
When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total), the ratings show an elo difference of about 26 +/- 1,

I'm not understanding how you come to this number, but +/- 1 seems a bit small for the number of games you are talking about. Perhaps you want to post the original data?

brianr wrote:
but the LOS (with the mm 1 option?) shows 98.
Is this saying there is a 98% chance that the version that is 26 elo higher is better?

Yes.

brianr wrote:
Exactly what does the mm 1 option do?
I recall an earlier post mentioning it, but I tried it without the mm 1 option and the results are the same.

The 1 tells it to calculate the advantage of going first. Use the advantage command to see what that comes out to.
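For anyone following along, a typical bayeselo session for this kind of check looks roughly like the following. The command names are from memory, so check them against the program's built-in help; the text after each # is just an annotation, not part of the command:

Code: Select all

readpgn combined.pgn   # PGN containing both Tinker versions' games
elo                    # enter the rating-calculation mode
mm 1                   # maximum-likelihood fit; the 1 asks it to fit the first-move advantage too
ratings                # print the rating list with error bars
los                    # print the likelihood-of-superiority matrix
advantage              # show what the fitted advantage of moving first came out to
x                      # leave the rating mode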
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
krazyken wrote:
Not entirely accurate, even though the error bars overlap, it's still possible to determine with reasonable probability if one version is better than the other. [...]

With 32K games, the error bar is roughly +/- 5; with 8K games it is roughly +/- 10; with 2K games, +/- 20.
If you have a 2-3-4 elo change, you are not going to get any confidence that X+1 is better than X; the error bars are going to almost completely overlap. How would you evaluate 2600 +/- 20 and 2602 +/- 20?
The +/- 20 is simply too big. If this were 2600 +/- 20 and 2630 +/- 20 you might be more confident that the change is better. But 2600 +/- 20 and 2605 +/- 20 doesn't fill me with confidence.
Hence my comments...
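Concretely, if the two ratings are treated as independent normal estimates whose +/- figures are 95% intervals, the chance that the nominally higher one is really stronger can be read off the overlap; a rough sketch:

Code: Select all

import math

def prob_better(elo_a, elo_b, bar_a, bar_b, z=1.96):
    """P(A is really stronger than B), treating each rating as an independent
    normal estimate whose +/- bar is a 95% (z = 1.96) interval."""
    sigma_a = bar_a / z
    sigma_b = bar_b / z
    sigma_diff = math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    return 0.5 * (1.0 + math.erf((elo_a - elo_b) / (sigma_diff * math.sqrt(2.0))))

print(f"2602 +/- 20 vs 2600 +/- 20: {prob_better(2602, 2600, 20, 20):.2f}")
print(f"2630 +/- 20 vs 2600 +/- 20: {prob_better(2630, 2600, 20, 20):.2f}")

Under those assumptions, 2602 +/- 20 vs 2600 +/- 20 comes out barely better than a coin flip, while 2630 +/- 20 vs 2600 +/- 20 is already close to 98%.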
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
brianr wrote:
When I run bayeselo on a combined pgn with 1,600 games from each of two Tinker versions (3,200 total), the ratings show an elo difference of about 26 +/- 1, but the LOS (with the mm 1 option?) shows 98.
Is this saying there is a 98% chance that the version that is 26 elo higher is better? [...]

Correct. But 26 Elo changes are not very common for the most part. The bigger the difference between the two versions, the fewer games you need to confidently say one is better than the other.

I do not quite follow your 26 +/- 1... the error bar can't possibly be +/- 1 with a total of 3,200 games. You have to get up to the million-game level to get that kind of confidence. Here's one recent example:
Code: Select all
Rank Name           Elo    +    -  games  score  oppo.  draws
   1 Toga2         2672    2    3 194550    60%   2600    25%
   2 Glaurung 2.2  2664    3    2 194550    59%   2600    24%
Re: Testing: More Opponents Or More Positions
bob wrote:
With 32K games, the error bar is roughly +/- 5; with 8K games it is roughly +/- 10; with 2K games, +/- 20. [...]

I understand that of course, but most people are not trying to measure an elo change that small. Given the amount of time required to measure a change that small, it is usually better to declare the two versions equal and move on to something else.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Testing: More Opponents Or More Positions
krazyken wrote:
I understand that of course, but most people are not trying to measure an elo change that small. [...]

There are many programs where changes of +20 are going to be _very_ rare. Brian's program is not brand new. Most changes are evolutionary in nature, not revolutionary. If you can't pick up small changes, you will reach a very quick plateau and never move forward, because _every_ change is small and your idea would cause you to throw out good changes or keep bad ones with equal probability.
Crafty 23.0 was a clear +70 Elo improvement over 22.x. But no single change was even +10. Most were less than +5.
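To put rough numbers on that, the back-of-the-envelope model from earlier in the thread can be inverted to ask how many games are needed before the 95% error bar is smaller than the change being measured. The absolute counts are more optimistic than the BayesElo bars quoted above, because this simple model ignores the uncertainty in the opponents' ratings, but the rapid growth as the change shrinks is the point:

Code: Select all

import math

def games_needed(elo_diff, draw_rate=0.32, z=1.96):
    """Very rough number of games before the 95% error bar on the match score
    is smaller than elo_diff (same simplistic model as the earlier sketch)."""
    var = (1.0 - draw_rate) * 0.25                    # per-game score variance at a 50% score
    elo_per_point = 400.0 / (math.log(10.0) * 0.25)   # Elo per unit of score near 50%
    return math.ceil((z * elo_per_point) ** 2 * var / elo_diff ** 2)

for diff in (20, 10, 5, 2):
    print(f"+{diff} Elo change: about {games_needed(diff)} games")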