xsadar wrote:Roughly, what's the range of Elo improvement I should get after changing from alpha-beta to PVS? I finally got around to implementing PVS, but see no Elo difference after testing. (My old version actually performed ever so slightly better.) I ran an engine-vs-world test with 480 games each, with no measurable difference. So I'm wondering if my implementation is buggy, my testing is flawed, or maybe I'm expecting too much improvement, in which case I need more games. Or perhaps even some combination of the above problems.
Here are the results from ELOstat (Edit: zilch-0.0.0.3a is my engine with alpha-beta; zilch-0.0.0.4 is with PVS):
Code:
      Program                    Elo    +    -   Games   Score   Av.Op.  Draws
    1 Matheus-2.3            :   180   58   56    160    76.9 %   -28    15.0 %
    2 BikJump v1.8 (64-bit)  :   119   53   52    160    70.0 %   -28    17.5 %
    3 umax4.8                :    70   50   50    160    63.7 %   -28    20.0 %
    4 Heracles 0.6.16        :   -22   49   49    160    50.9 %   -28    18.1 %
    5 zilch-0.0.0.3a         :   -25   28   28    480    43.0 %    24    20.2 %
    6 zilch-0.0.0.4          :   -27   28   28    480    42.8 %    24    21.0 %
    7 Smash 1.0.3            :   -74   50   50    160    43.4 %   -28    15.6 %
    8 cilian-x86_64          :  -117   43   44    160    37.5 %   -28    37.5 %

bob wrote:PVS is a nominal change. It searches slightly faster, but then has to occasionally re-search to get a true score where normal alpha/beta does not. The advantage is that you will "sniff" out a good move quicker, and if that happens just prior to time running out, PVS will fail high while normal alpha/beta would time out and not play the actual best move. I doubt it is worth more than 10 Elo or something, which means you are going to need 20,000+ games to be able to measure the improvement...

xsadar wrote:Hmm... The way people talk about PVS, I thought (and hoped) it would be more of an improvement than that. So, since around 20,000 games is generally impractical, is it best to just assume that if you're not getting too many re-searches and the average time to depth decreases, it's probably implemented properly and is thus an improvement? Or what is the best way to verify your implementation?

This is a problem, for sure. 40K games give an Elo range of +/- 4, roughly. Less than that and the error bar goes up and accuracy drops off. Less than 20K games is not going to be enough to measure such a small change, based on the cluster testing I have been doing.
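As a point of reference for the discussion above, a bare-bones PVS routine (negamax form) looks roughly like the sketch below. It is a generic illustration, not code from zilch or Crafty; Position, Move, MAX_MOVES, gen_moves(), make(), unmake() and evaluate() are hypothetical placeholders for whatever hooks a real engine provides.
Code:
    /* Hypothetical engine hooks -- placeholders, not a real API. */
    typedef struct Position Position;
    typedef int Move;
    enum { MAX_MOVES = 256 };
    int  evaluate(const Position *pos);
    int  gen_moves(Position *pos, Move *moves);
    void make(Position *pos, Move m);
    void unmake(Position *pos, Move m);

    /* Principal variation search, negamax form.  The first move gets the full
     * [alpha, beta] window; later moves get a null window [alpha, alpha+1] and
     * are re-searched with the full window only if they unexpectedly beat alpha
     * (the occasional re-search mentioned above). */
    int pvs(Position *pos, int depth, int alpha, int beta)
    {
        if (depth == 0)
            return evaluate(pos);      /* a real engine would call quiescence here */

        Move moves[MAX_MOVES];
        int n = gen_moves(pos, moves);

        for (int i = 0; i < n; i++) {
            make(pos, moves[i]);
            int score;
            if (i == 0) {
                score = -pvs(pos, depth - 1, -beta, -alpha);
            } else {
                score = -pvs(pos, depth - 1, -alpha - 1, -alpha);   /* null window */
                if (score > alpha && score < beta)                  /* re-search */
                    score = -pvs(pos, depth - 1, -beta, -alpha);
            }
            unmake(pos, moves[i]);

            if (score >= beta)
                return score;          /* fail high */
            if (score > alpha)
                alpha = score;
        }
        return alpha;
    }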
How much improvement from PVS?
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How much improvement from PVS?
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How much improvement from PVS?
xsadar wrote:Or what is the best way to verify your implementation?

Since, for a given search depth, the move is not supposed to depend on whether you use PVS or plain alpha-beta, it would be sufficient to record the time to depth averaged over some 100 representative game positions, and see how much of a speedup one gives w.r.t. the other.
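One way to run the time-to-depth comparison described here is a small harness along the following lines. It is only a sketch; set_position_from_fen() and search_to_depth() are hypothetical stand-ins for the engine's real entry points.
Code:
    /* Sketch of the time-to-depth test: average the CPU time each build needs
     * to finish a fixed-depth search over a file of ~100 test positions.
     * set_position_from_fen() and search_to_depth() are hypothetical hooks. */
    #include <stdio.h>
    #include <time.h>

    void set_position_from_fen(const char *fen);   /* hypothetical */
    void search_to_depth(int depth);               /* hypothetical */

    double average_time_to_depth(const char *fen_file, int depth)
    {
        FILE *f = fopen(fen_file, "r");
        char fen[256];
        double total = 0.0;
        int n = 0;

        while (f && fgets(fen, sizeof fen, f)) {
            set_position_from_fen(fen);
            clock_t start = clock();
            search_to_depth(depth);
            total += (double)(clock() - start) / CLOCKS_PER_SEC;
            n++;
        }
        if (f) fclose(f);
        return n ? total / n : 0.0;
    }

    /* Run this once for the alpha-beta build and once for the PVS build on the
     * same positions and depth; the ratio of the two averages is the speedup. */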
Re: How much improvement from PVS?
Personally I've found the only benefit is seeing the nice ++ or -- symbols on the end of the PV! 
Andy.
- Posts: 313
- Joined: Wed Mar 08, 2006 8:18 pm
Re: How much improvement from PVS?
bob wrote:This is a problem, for sure. 40K games give an Elo range of +/- 4, roughly. Less than that and the error bar goes up and accuracy drops off. Less than 20K games is not going to be enough to measure such a small change, based on the cluster testing I have been doing.

Bob,
Maybe I missed it, but did you ever come up with a list of how many games are needed for an XYZ error rate?
In other words, if 40k is +/- 4, does that mean 20k is +/- 8 and 10k is +/- 16, etc.?
If we are just looking for, say, 50 points, then how many games are needed?
Since not everybody has a cluster, many of us would like to do as few games as possible.
Also, what was the final testing strategy you decided on? If I remember right, it was to use different starting positions for every single game, rather than the limited 40 positions played from both white & black and repeated until done.
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How much improvement from PVS?
The error goes as the inverse square root of the number of games. So if 40K games is +/- 4 points, 10K games is +/- 8 points and 2500 games is +/- 16 points.
The 95% confidence interval is given in Elo points by +/- 550/sqrt(N).
So for 50 points you need only 120 games.
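Put another way, the rule of thumb above inverts to N = (550/E)^2 games for a 95% error bar of +/- E Elo. A small C check, using only that approximation:
Code:
    /* 95% confidence interval ~ +/- 550/sqrt(N) Elo.  Inverting it gives the
     * number of games needed for an error bar of E Elo. */
    #include <math.h>
    #include <stdio.h>

    int games_needed(double elo_error) {
        return (int)ceil((550.0 / elo_error) * (550.0 / elo_error));
    }

    int main(void) {
        double bars[] = { 50.0, 20.0, 10.0, 5.0, 2.0 };
        for (int i = 0; i < 5; i++)
            printf("+/- %4.0f Elo -> about %6d games\n",
                   bars[i], games_needed(bars[i]));
        return 0;    /* e.g. +/- 50 -> 121, +/- 10 -> 3025, +/- 2 -> 75625 games */
    }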
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How much improvement from PVS?
Carey wrote:Bob,
Maybe I missed it, but did you ever come up with a list of how many games are needed for an XYZ error rate?
In other words, if 40k is +/- 4, does that mean 20k is +/- 8 and 10k is +/- 16, etc.?
If we are just looking for, say, 50 points, then how many games are needed?
Since not everybody has a cluster, many of us would like to do as few games as possible.
Also, what was the final testing strategy you decided on? If I remember right, it was to use different starting positions for every single game, rather than the limited 40 positions played from both white & black and repeated until done.

HG gave you the answer. The problem occurs when you want to measure small changes. And perhaps the most surprising detail I have discovered is that many programming features, null-move vs. no-null-move as an example, are not a 100 Elo change. Comparing the adaptive null-move R=2~3 that I used for 10 years in Crafty to pure R=3 everywhere was a 2 Elo change, as an example. That is _very_ difficult to detect and requires a ton of games. Unfortunately, almost everything you do to a chess engine is a "small Elo change", search extensions and the like all included. The raw computer speed appears to be by far the largest contributor to overall playing level; the software tweaks appear to be just that, software "tweaks" that produce small improvements to a basic alpha/beta program.

160 games is not going to show anything useful unless you try to compare something like minimax to alpha/beta or something equally significant.
- Posts: 2047
- Joined: Wed Mar 08, 2006 8:30 pm
Re: How much improvement from PVS?
hgm wrote:The 95% confidence interval is given in Elo points by +/- 550/sqrt(N).

Nice formula.
From my calculations (supposing it corresponds to a "2 sigma" confidence interval, and that a 1% score difference is +7 Elo), it is valid if 38% of these N games are draws.
For a 1/3 draw rate, I get the formula +/- 571/sqrt(N).
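That coefficient follows directly from the stated assumptions (a 2-sigma interval and about 7 Elo per 1% of score near 50%): with draw fraction d and equal win/loss chances, the per-game score variance is (1-d)/4, so the 2-sigma bar is roughly 700*sqrt(1-d)/sqrt(N) Elo. A quick check in C:
Code:
    /* With draw fraction d and equal win/loss probability, the per-game score
     * variance is (1-d)/4, so the 2-sigma error bar is
     * 2 * 700 * sqrt((1-d)/4) / sqrt(N) = 700*sqrt(1-d)/sqrt(N) Elo,
     * using the ~7 Elo per 1% score assumption from the post above. */
    #include <math.h>
    #include <stdio.h>

    double elo_coefficient(double draw_rate) {
        return 700.0 * sqrt(1.0 - draw_rate);   /* multiply by 1/sqrt(N) */
    }

    int main(void) {
        printf("38%% draws : +/- %.0f/sqrt(N)\n", elo_coefficient(0.38));     /* ~551, i.e. the 550 rule */
        printf("1/3 draws : +/- %.0f/sqrt(N)\n", elo_coefficient(1.0 / 3.0)); /* ~572, matching 571 up to rounding */
        return 0;
    }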
- Posts: 313
- Joined: Wed Mar 08, 2006 8:18 pm
Re: How much improvement from PVS?
bob wrote:HG gave you the answer. The problem occurs when you want to measure small changes. And perhaps the most surprising detail I have discovered is that many programming features, null-move vs. no-null-move as an example, are not a 100 Elo change. Comparing the adaptive null-move R=2~3 that I used for 10 years in Crafty to pure R=3 everywhere was a 2 Elo change, as an example. That is _very_ difficult to detect and requires a ton of games. Unfortunately, almost everything you do to a chess engine is a "small Elo change", search extensions and the like all included. The raw computer speed appears to be by far the largest contributor to overall playing level; the software tweaks appear to be just that, software "tweaks" that produce small improvements to a basic alpha/beta program. 160 games is not going to show anything useful unless you try to compare something like minimax to alpha/beta or something equally significant.

So basically, if you have 100 evaluator features, you can expect maybe 200-300 Elo from those in total? (Not exactly, of course... Just not a lot.)
That sure does sound like a good argument against making huge evaluators, like you do...
So that begs the question.... What is the low end of computer chess rating, then?
A brain-dead eval, basically. A basic quiescence search. Trans table. Nothing too sophisticated.
Any guess as to what the rating would be?
I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.
You know Bob, this sounds like a subject for a paper. Start off with a basic program, start adding features and see what kind of ratings improvements they give.
J. Schaeffer did something like that long ago, but of course he didn't have the kind of computing power to do enough reliable testing.
You are about the only one here with the hardware to do those kinds of tests.
And speaking of the tests.... Are you still doing just the 40 starting positions, or are you doing new ones for each game?
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How much improvement from PVS?
Carey wrote:So basically, if you have 100 evaluator features, you can expect maybe 200-300 Elo from those in total? (Not exactly, of course... Just not a lot.)
That sure does sound like a good argument against making huge evaluators, like you do...
So that begs the question.... What is the low end of computer chess rating, then?
A brain-dead eval, basically. A basic quiescence search. Trans table. Nothing too sophisticated.
Any guess as to what the rating would be?
I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.
You know Bob, this sounds like a subject for a paper. Start off with a basic program, start adding features and see what kind of ratings improvements they give.
J. Schaeffer did something like that long ago, but of course he didn't have the kind of computing power to do enough reliable testing.
You are about the only one here with the hardware to do those kinds of tests.
And speaking of the tests.... Are you still doing just the 40 starting positions, or are you doing new ones for each game?

I probably have enough ideas for papers to take 10 years to finish. Wouldn't it be nice to have something published that takes idea by idea and discovers what each is worth in terms of Elo? Null-move R=1, R=2, R=3, for example. No reductions vs reductions, no check extensions vs check extensions, etc. The list goes on and on. One observation I made years ago was that it often seems more important to evaluate a feature at all than the exact weight you assign to it. Just recognizing that something is good or bad is better than not doing anything at all. Adjusting the weight is not as important as one might guess. That could also be proven/disproven with a few games. For reference, over the weekend I ran 32 40,000-game tests.
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How much improvement from PVS?
ernest wrote:Nice formula.
From my calculations (supposing it corresponds to a "2 sigma" confidence interval, and that a 1% score difference is +7 Elo), it is valid if 38% of these N games are draws.
For a 1/3 draw rate, I get the formula +/- 571/sqrt(N).

Yes, it was only approximate. To know the error bars to better than 10%, one had better use BayesElo, as draws have a different weight from wins/losses in the rating model.