OK, different topic. I have about 4,000 positions that I have been using recently, chosen randomly from games in a large PGN collection, taken after white and black had each played 10 moves (all positions are currently WTM, i.e. white to move). Each opening line had to occur in at least three different games, so that oddball lines were excluded.
Now, how to reduce this? I can test almost anything, so the issue is to find a good idea; then I can run the test and watch the fur fly...
Possibilities:
1. Play one game per position rather than two, alternating colors to balance black vs. white effects. Easy to do. This reduces the workload by 50%, to about 20,000 games (keeping 5 opponents for new versions of Crafty to play against: 5 x 4,000).
2. Play a black/white pair of games from each position, but have each position played against only one opponent. Opponent 1 would play position 1, opponent 2 would play position 2, etc. This would reduce the workload by 80%: 4,000 positions x 2 games per position x 1 opponent per position = 8,000 games.
3. I could take the first N positions, since they are in a fairly random order (the FEN strings are sorted lexically), or I could randomly choose N positions from the set (a sampling sketch is given after this list). With N=1,000 that reduces the workload by 75%. But this might invite remnants of the old correlation effect back in.
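For option 3, here is a minimal sketch of the two subsampling approaches (every k-th position versus a uniform random sample); the file names are assumptions for illustration, with one FEN/EPD record per line:
Code:
import random

# read the full position set, one FEN/EPD record per line
with open("positions.epd") as f:
    positions = [line.strip() for line in f if line.strip()]

N = 1000                              # target subset size

# option A: take every k-th position, spreading the sample over the
# whole lexically ordered list instead of just the first N
k = max(1, len(positions) // N)
systematic = positions[::k][:N]

# option B: uniform random sample without replacement
random.seed(42)                       # fixed seed so the subset is reproducible
randomized = random.sample(positions, N)

with open("positions_subset.epd", "w") as f:
    f.write("\n".join(randomized) + "\n")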
testing: reducing the computational workload
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Re: testing: reducing the computational workload
Here's an idea that might help:
1) Select about 1,000 distinct positions of interest. The number is roughly the count of one-minute searches that can be performed between the time you leave the office and the time you arrive the next day.
2) For each of these positions, run the position through a very long search. How long? I'd suggest correspondence time control; that's about three days per search. Since this will eat a whole lot of computer time, you can get friends to help. These positions and their deep search results become the baseline for further testing. While it is not claimed that the results are always the best possible move, it's very likely that the overall quality is much better than can be obtained by short, one-minute searches.
3) For each change in the program, run an overnight test and compare the results to the baseline.
4) As the overnight result set approaches the baseline results, recalculate the baseline with the revised program.
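A minimal sketch of the comparison in step 3, assuming both the baseline and the overnight run are stored one record per line as "FEN<tab>bestmove"; the file names and format are assumptions for illustration:
Code:
def load_moves(path):
    # map each position (FEN) to the move chosen for it
    moves = {}
    with open(path) as f:
        for line in f:
            fen, move = line.rstrip("\n").split("\t")
            moves[fen] = move
    return moves

baseline = load_moves("baseline_deep.tsv")     # three-day searches
overnight = load_moves("overnight_run.tsv")    # one-minute searches

common = set(baseline) & set(overnight)
agree = sum(1 for fen in common if baseline[fen] == overnight[fen])
print(f"{agree}/{len(common)} positions match the baseline "
      f"({100.0 * agree / len(common):.1f}%)")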
-
- Posts: 2851
- Joined: Wed Mar 08, 2006 10:01 pm
- Location: Irvine, CA, USA
Re: testing: reducing the computational workload
Rather than taking the first 1,000 positions you could take every fourth position. That way, even if the lexical ordering is not perfectly random, you still sample across the whole set and get a wider variety of positions. Although I think reducing the number of positions is the least attractive option.
bob wrote:
Possibilities:
...
3. I could take the first N positions since they are in a fairly random order (FEN strings in lexical order) or I could randomly choose N positions from the set. With N=1,000 that reduces the workload by 75%. But this might invite remnants of the old correlation effect back in.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: testing: reducing the computational workload
I don't like the concept of using positions to measure progress. That's what I did for several years. But it doesn't say much about how the change works throughout a game, and there is an issue about how to measure the results...
sje wrote:
Here's an idea that might help:
1) Select about 1,000 distinct positions of interest. The number is roughly the count of one-minute searches that can be performed between the time you leave the office and the time you arrive the next day.
2) For each of these positions, run the position through a very long search. How long? I'd suggest correspondence time control; that's about three days per search. Since this will eat a whole lot of computer time, you can get friends to help. These positions and their deep search results become the baseline for further testing. While it is not claimed that the results are always the best possible move, it's very likely that the overall quality is much better than can be obtained by short, one-minute searches.
3) For each change in the program, run an overnight test and compare the results to the baseline.
4) As the overnight result set approaches the baseline results, recalculate the baseline with the revised program.
The game-playing approach has been a lot more straightforward when it comes to evaluating the results.
-
- Posts: 28355
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: testing: reducing the computational workload
Playing fewer games will increase the error bars, as the error-bar (95% confidence) on the Elo difference in a gauntlet comparison, even in the most favorable case (all positions+opponent combinations scoring close to 50%) is 873/sqrt(nrGames) Elo points.
Why do you want to drive up the error? I thought you wanted to measure differences of 1 Elo. You need more games for that (~750,000 instead of 38,000), not fewer.
If you are now looking for a gauntlet comparison with large error bars, which requires only a small number of games, the best approach is to randomly sample from the complete set of (position, opponent) combinations until you have enough for your gauntlet, and then let both engines under test play that same gauntlet. That way you maximize both the number of positions and the number of opponents.
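To make the arithmetic behind the 873/sqrt(nrGames) rule of thumb concrete, here is a small sketch; the 873 constant comes from the post above, and the game counts are just the numbers mentioned in this thread:
Code:
import math

K = 873.0   # 95%-confidence error-bar constant (Elo), from the post above

def error_bar(nr_games):
    # 95% error bar on the measured Elo difference after nr_games games
    return K / math.sqrt(nr_games)

def games_needed(target_elo):
    # number of games needed to shrink the error bar to target_elo
    return math.ceil((K / target_elo) ** 2)

print(error_bar(38000))     # ~4.5 Elo with a 38,000-game gauntlet
print(error_bar(20000))     # ~6.2 Elo after halving the workload
print(games_needed(1.0))    # ~762,000 games to resolve a 1 Elo difference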
-
- Posts: 10815
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: testing: reducing the computational workload
I disagree.
hgm wrote:
Playing fewer games will increase the error bars, as the error-bar (95% confidence) on the Elo difference in a gauntlet comparison, even in the most favorable case (all positions+opponent combinations scoring close to 50%) is 873/sqrt(nrGames) Elo points.
Why do you want to drive up the error? I thought you wanted to measure differences of 1 Elo. You need more games for that (~750,000 instead of 38,000), not fewer.
If you are now looking for a gauntlet comparison with large error bars, which requires only a small number of games, the best approach is to randomly sample from the complete set of (position, opponent) combinations until you have enough for your gauntlet, and then let both engines under test play that same gauntlet. That way you maximize both the number of positions and the number of opponents.
The most favorable case is that you choose good positions, so you can get a smaller error than 873/sqrt(nrGames) Elo points.
Here is a simple and clear example:
If you make a change that is only relevant to the evaluation of pawn endgames, it may be better to have many pawn-endgame positions in your tests.
Not all the positions have to be pawn endgames, because you also want to verify that there are no bugs in other positions.
A result of 60% for version X+1 against 50% for version X in pawn endgames may translate into a 1 Elo improvement in real games.
With changes in middle-game evaluation the task of choosing good positions is harder, but it is not obvious to me that it is impossible to do better than random positions.
Uri
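To put a rough number on the "60% in pawn endgames may translate to 1 Elo" statement above, here is a back-of-the-envelope sketch; the 1.4% figure for how often a game actually hinges on a pawn ending is purely an illustrative assumption, not a measured value:
Code:
import math

def elo_from_score(score):
    # standard logistic Elo model: expected score -> rating difference
    return -400.0 * math.log10(1.0 / score - 1.0)

# Inside the pawn endgames themselves, 60% vs. 50% looks like a big edge...
print(elo_from_score(0.60))            # ~70 Elo within those games

# ...but over all games it is diluted by how rarely a game hinges on a pawn ending.
fraction_pawn_endings = 0.014          # assumed for illustration only
overall_score = 0.50 + fraction_pawn_endings * (0.60 - 0.50)
print(elo_from_score(overall_score))   # ~1 Elo over all games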
-
- Posts: 10815
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: testing: reducing the computational workload
For H.G. Muller
Another example:
Suppose that results depend only on the position and the opponent and there is no noise (this can happen with a fixed number of nodes).
Suppose you have 10,000 games for version A and 10,000 games for version B from the same positions.
Suppose you find that 9,990 of the games are identical, and in the 10 that differ A loses on time while B does not lose on time and scores 5/10.
I think it will be safe to say, with more than 95% confidence, that
1) B is stronger than A
2) The difference in strength between them is less than 1 Elo
Of course this cannot happen with Crafty and its opponents, because they are not deterministic and Bob does not test with a fixed number of nodes, but in spite of that I believe there is some correlation with a positive effect that can reduce the error.
Uri
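A quick check of those two claims under the stated (noise-free, paired-game) assumptions; the sketch reuses the same logistic Elo formula as above and assumes A scores about 50% overall:
Code:
import math

def elo_from_score(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

games = 10000
# The 9,990 identical games score the same for both versions, so only the
# 10 differing games separate them: A scores 0/10 (losses on time), B scores 5/10.
extra_points_for_b = 5.0 - 0.0
score_diff = extra_points_for_b / games        # +0.0005 in score fraction

# B never does worse than A in any paired game, so the sign of the difference
# is certain, while the implied rating edge is tiny:
print(elo_from_score(0.50 + score_diff))       # ~0.35 Elo, well under 1 Elo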
-
- Posts: 28355
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: testing: reducing the computational workload
This is very dangerous, as with most changes it is almost impossible to guarantee that they are only effective in a Pawn ending. Even if you add terms that are explicitly limited to positions with only King + Pawns on the board, it can be dangerous. These terms might, for instance, strongly affect the tendency to convert to a Pawn ending (and more so to lost Pawn endings than to won Pawn endings) when the search is already probing into the Pawn ending at quite low depth (because the branches with other pieces still on the board eat most of the nodes).
But I agree that there might be cases where this technique could work.
Of course playing tree-games would eliminate most of the noise that is intrinsic in game results, and might reduce the required workload to measure 1-Elo differences a thousandfold. So this whole effort of testing by gauntlets could be described as "mopping while the tap is still running", no matter how you do it. I guess I am going to implement that directly in WinBoard after all. It should not be so difficult to add a third chess program (add -tcp and -td options) and have WinBoard alternate between using the second and third engine. The code to start up the engine between matches should already be there (to handle the case where you play a match with -xreuse2). I could simply use that code, using the alternate name for the executable.
Re: testing: reducing the computational workload
Here's another idea:
Step 1: Run a number of games, enough to get a reasonable baseline feel for approximate strength. You will likely get error margins where the two versions overlap.
Step 2: Select a sample of games from the run for each version. The sampling could be random, or biased towards certain positions you are particularly interested in with the latest change.
Step 3: Subject the sampled games to deep analysis by a "trusted source", then measure the correlation of the sampled scores with the trusted scores.
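A minimal sketch of that correlation measurement, assuming each sampled position ends up with two evaluation scores, one from the engine under test and one from the trusted deep analysis (the numbers below are placeholders):
Code:
import math

def pearson(xs, ys):
    # plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# evaluation scores in pawns for the sampled positions (placeholder data)
engine_scores  = [0.35, -0.10, 1.20, 0.05, -0.60]
trusted_scores = [0.30, -0.05, 0.90, 0.15, -0.75]

print(pearson(engine_scores, trusted_scores))   # closer to 1.0 = better agreement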
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: testing: reducing the computational workload
Again, read what I wrote. Three different tests, designed to measure major changes, significant changes, and then the tiny changes. No desire to increase error, just a desire to use the minimum computational requirements to make a decision.
hgm wrote:
Playing fewer games will increase the error bars, as the error-bar (95% confidence) on the Elo difference in a gauntlet comparison, even in the most favorable case (all positions+opponent combinations scoring close to 50%) is 873/sqrt(nrGames) Elo points.
Why do you want to drive up the error? I thought you wanted to measure differences of 1 Elo. You need more games for that (~750,000 instead of 38,000), not fewer.
If you are now looking for a gauntlet comparison with large error bars, which requires only a small number of games, the best approach is to randomly sample from the complete set of (position, opponent) combinations until you have enough for your gauntlet, and then let both engines under test play that same gauntlet. That way you maximize both the number of positions and the number of opponents.