ver_A plays 80-game matches against 5 stronger opponents and the number of distinct positions won is counted. So any number of wins from one side of one position is only counted as one. Pick opponent engines so that anywhere from 40 to 60 points are acquired.
Then ver_B plays the same.
If ver_B scores more points (based on getting at least one win in each position), could that be a good indication that ver_B is better? How many more points are needed?
The idea is to ignore the random accumulation of pure score in favor of seeing if a new version can win positions that the earlier version could not win.
My gut feeling is that fewer games might be needed in a test like this.
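As a rough sanity check on the idea, here is a small Monte Carlo sketch. The concrete numbers are my own assumptions, not from the post: 8 positions per opponent, each played from both sides, one game per position-side (5 * 8 * 2 = 80 games), and a fixed per-game win probability. A position-side earns one point if at least one of its games is won, as described above, and the simulation then estimates how often an identical-strength ver_B would outscore ver_A by chance alone.

```python
import random

def match_points(p_win, opponents=5, positions=8, reps=1, rng=random):
    """Distinct position-sides won in one test run.

    Hypothetical setup: each of `opponents` engines contributes
    `positions` openings, each played from both sides `reps` times
    (5 * 8 * 2 * 1 = 80 games with the defaults).  A position-side
    scores one point if at least one of its games is won.
    """
    points = 0
    for _ in range(opponents * positions * 2):
        if any(rng.random() < p_win for _ in range(reps)):
            points += 1
    return points

# How often does pure luck make an identical-strength ver_B look better?
rng = random.Random(1)
trials = 2000
b_looks_better = sum(
    match_points(0.55, rng=rng) < match_points(0.55, rng=rng)
    for _ in range(trials)
)
print(f"ver_B ahead by chance in {100 * b_looks_better / trials:.0f}% of runs")
```

Note that with one game per position-side (reps=1) the "at least one win" rule reduces to plain game counting; only when reps is raised does the proposed metric start to differ from raw score.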
If testing was done like this ...
-
- Posts: 3196
- Joined: Fri May 26, 2006 3:00 am
- Location: WY, USA
- Full name: Michael Sherwin
If testing was done like this ...
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: If testing was done like this ...
The first thing to try is to run a few of these 80-game matches and measure the variability of the results, which will be _amazingly_ large. That is where I started in late 2006/early 2007.

Michael Sherwin wrote:
ver_A plays 80-game matches against 5 stronger opponents and the number of distinct positions won is counted. So any number of wins from one side of one position is only counted as one. Pick opponent engines so that anywhere from 40 to 60 points are acquired.
Then ver_B plays the same.
If ver_B scores more points (based on getting at least one win in each position), could that be a good indication that ver_B is better? How many more points are needed?
Unfortunately, once you test this, reality sets in, and you will discover you need to add three more zeros or so to the total number of games required unless your changes being tested are _huge_ improvements.
The idea is to ignore the random accumulation of pure score in favor of seeing if a new version can win positions that the earlier version could not win.
My gut feeling is that fewer games might be needed in a test like this.
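The "three more zeros" remark can be sanity-checked with a back-of-envelope normal approximation (my own sketch, not from the thread): treat each game's score as a random variable with a standard deviation of about 0.45 points (an assumed typical figure once draws are accounted for), convert the Elo edge you want to detect into an expected score via the standard logistic model, and ask how many games are needed before the signal clears roughly two standard errors.

```python
import math

def elo_to_score(elo_diff):
    # Expected score under the standard logistic Elo model.
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, sd_per_game=0.45, z=1.96):
    """Games required for the mean score to sit about z standard
    errors above 50% when the true edge is `elo_diff` Elo.

    sd_per_game = 0.45 is an assumed per-game score deviation;
    the result scales with its square.
    """
    edge = elo_to_score(elo_diff) - 0.5
    return math.ceil((z * sd_per_game / edge) ** 2)

for elo in (50, 10, 5, 2):
    print(f"{elo:>3} Elo: ~{games_needed(elo):,} games")
```

Under these assumptions a 50 Elo jump shows up within a few hundred games, but a small tweak of a couple of Elo needs on the order of a hundred thousand, which is roughly the "three more zeros" on top of an 80-game match.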