xsadar wrote:Roughly, what's the range of Elo improvement I should get after changing from alpha-beta to PVS? I finally got around to implementing PVS, but see no Elo difference after testing. (My old version actually performed ever so slightly better.) I ran an engine-vs-world test with 480 games each, with no measurable difference. So I'm wondering if my implementation is buggy, my testing is flawed, or maybe I'm expecting too much improvement, in which case I need more games. Or perhaps even some combination of the above problems.
Here are the results from ELOstat (Edit: zilch-0.0.0.3a is my engine with alpha-beta; zilch-0.0.0.4 is with PVS):
Code:
      Program                    Elo    +    -   Games   Score   Av.Op.  Draws
    1 Matheus-2.3            :   180   58   56    160    76.9 %   -28    15.0 %
    2 BikJump v1.8 (64-bit)  :   119   53   52    160    70.0 %   -28    17.5 %
    3 umax4.8                :    70   50   50    160    63.7 %   -28    20.0 %
    4 Heracles 0.6.16        :   -22   49   49    160    50.9 %   -28    18.1 %
    5 zilch-0.0.0.3a         :   -25   28   28    480    43.0 %    24    20.2 %
    6 zilch-0.0.0.4          :   -27   28   28    480    42.8 %    24    21.0 %
    7 Smash 1.0.3            :   -74   50   50    160    43.4 %   -28    15.6 %
    8 cilian-x86_64          :  -117   43   44    160    37.5 %   -28    37.5 %

bob wrote:PVS is a nominal change. It searches slightly faster, but then has to occasionally re-search to get a true score where normal alpha/beta does not. The advantage is that you will "sniff" out a good move quicker, and if that happens just prior to time running out, PVS will fail high while normal alpha/beta would time out and not play the actual best move. I doubt it is worth more than 10 Elo or something, which means you are going to need 20,000+ games to be able to measure the improvement...

xsadar wrote:Hmm... The way people talk about PVS, I thought (and hoped) it would be more of an improvement than that. So, since around 20,000 games is generally impractical, is it best to just assume that if you're not getting too many re-searches and the average time to depth decreases, it's probably implemented properly and is thus an improvement? Or what is the best way to verify your implementation?

This is a problem, for sure. 40K games give an Elo range of +/- 4, roughly. Less than that and the error bar goes up and accuracy drops off. Less than 20K games is not going to be enough to measure such a small change, based on the cluster testing I have been doing.
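As a point of reference for the discussion above, a bare-bones PVS routine (negamax form) looks roughly like the sketch below. It is a generic illustration, not code from zilch or Crafty; Position, Move, MAX_MOVES, gen_moves(), make(), unmake() and evaluate() are hypothetical placeholders for whatever hooks a real engine provides.
Code:
    /* Hypothetical engine hooks -- placeholders, not a real API. */
    typedef struct Position Position;
    typedef int Move;
    enum { MAX_MOVES = 256 };
    int  evaluate(const Position *pos);
    int  gen_moves(Position *pos, Move *moves);
    void make(Position *pos, Move m);
    void unmake(Position *pos, Move m);

    /* Principal variation search, negamax form.  The first move gets the full
     * [alpha, beta] window; later moves get a null window [alpha, alpha+1] and
     * are re-searched with the full window only if they unexpectedly beat alpha
     * (the occasional re-search mentioned above). */
    int pvs(Position *pos, int depth, int alpha, int beta)
    {
        if (depth == 0)
            return evaluate(pos);      /* a real engine would call quiescence here */

        Move moves[MAX_MOVES];
        int n = gen_moves(pos, moves);

        for (int i = 0; i < n; i++) {
            make(pos, moves[i]);
            int score;
            if (i == 0) {
                score = -pvs(pos, depth - 1, -beta, -alpha);
            } else {
                score = -pvs(pos, depth - 1, -alpha - 1, -alpha);   /* null window */
                if (score > alpha && score < beta)                  /* re-search */
                    score = -pvs(pos, depth - 1, -beta, -alpha);
            }
            unmake(pos, moves[i]);

            if (score >= beta)
                return score;          /* fail high */
            if (score > alpha)
                alpha = score;
        }
        return alpha;
    }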
How much improvement from PVS?
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How much improvement from PVS?
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How much improvement from PVS?
xsadar wrote:Or what is the best way to verify your implementation?

Since, for a given search depth, the move is not supposed to depend on whether you use PVS or plain alpha-beta, it would be sufficient to record the time to depth averaged over some 100 representative game positions, and see how much of a speedup one gives w.r.t. the other.
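One way to run the time-to-depth comparison described here is a small harness along the following lines. It is only a sketch; set_position_from_fen() and search_to_depth() are hypothetical stand-ins for the engine's real entry points.
Code:
    /* Sketch of the time-to-depth test: average the CPU time each build needs
     * to finish a fixed-depth search over a file of ~100 test positions.
     * set_position_from_fen() and search_to_depth() are hypothetical hooks. */
    #include <stdio.h>
    #include <time.h>

    void set_position_from_fen(const char *fen);   /* hypothetical */
    void search_to_depth(int depth);               /* hypothetical */

    double average_time_to_depth(const char *fen_file, int depth)
    {
        FILE *f = fopen(fen_file, "r");
        char fen[256];
        double total = 0.0;
        int n = 0;

        while (f && fgets(fen, sizeof fen, f)) {
            set_position_from_fen(fen);
            clock_t start = clock();
            search_to_depth(depth);
            total += (double)(clock() - start) / CLOCKS_PER_SEC;
            n++;
        }
        if (f) fclose(f);
        return n ? total / n : 0.0;
    }

    /* Run this once for the alpha-beta build and once for the PVS build on the
     * same positions and depth; the ratio of the two averages is the speedup. */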
Re: How much improvement from PVS?
Personally I've found the only benefit is seeing the nice ++ or -- symbols on the end of the PV! 
Andy.
- Posts: 313
- Joined: Wed Mar 08, 2006 8:18 pm
Re: How much improvement from PVS?
bob wrote:This is a problem, for sure. 40K games give an Elo range of +/- 4, roughly. Less than that and the error bar goes up and accuracy drops off. Less than 20K games is not going to be enough to measure such a small change, based on the cluster testing I have been doing.

Bob,
Maybe I missed it, but did you ever come up with a list of how many games are needed for an XYZ error rate?
In other words, if 40k is +/- 4, does that mean 20k is +/- 8 and 10k is +/- 16, etc.?
If we are just looking for, say, 50 points, then how many games are needed?
Since not everybody has a cluster, many of us would like to do as few games as possible.
Also, what was the final testing strategy you decided on? If I remember right, it was to use different starting positions for every single game, rather than the limited 40 positions played from both white & black and repeated until done.
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How much improvement from PVS?
The error goes as the inverse square root of the number of games. So if 40K games is +/- 4 points, 10K games is +/- 8 points and 2500 games is +/- 16 points.
The 95% confidence interval is given in Elo points by +/- 550/sqrt(N).
So for 50 points you need only 120 games.
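Put another way, the rule of thumb above inverts to N = (550/E)^2 games for a 95% error bar of +/- E Elo. A small C check, using only that approximation:
Code:
    /* 95% confidence interval ~ +/- 550/sqrt(N) Elo.  Inverting it gives the
     * number of games needed for an error bar of E Elo. */
    #include <math.h>
    #include <stdio.h>

    int games_needed(double elo_error) {
        return (int)ceil((550.0 / elo_error) * (550.0 / elo_error));
    }

    int main(void) {
        double bars[] = { 50.0, 20.0, 10.0, 5.0, 2.0 };
        for (int i = 0; i < 5; i++)
            printf("+/- %4.0f Elo -> about %6d games\n",
                   bars[i], games_needed(bars[i]));
        return 0;    /* e.g. +/- 50 -> 121, +/- 10 -> 3025, +/- 2 -> 75625 games */
    }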
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How much improvement from PVS?
Carey wrote:Bob,
Maybe I missed it, but did you ever come up with a list of how many games are needed for an XYZ error rate?
In other words, if 40k is +/- 4, does that mean 20k is +/- 8 and 10k is +/- 16, etc.?
If we are just looking for, say, 50 points, then how many games are needed?
Since not everybody has a cluster, many of us would like to do as few games as possible.
Also, what was the final testing strategy you decided on? If I remember right, it was to use different starting positions for every single game, rather than the limited 40 positions played from both white & black and repeated until done.

HG gave you the answer. The problem occurs when you want to measure small changes. And perhaps the most surprising detail I have discovered is that many programming features, null-move vs. no-null-move as an example, are not a 100 Elo change. Comparing the adaptive null-move R=2~3 that I used for 10 years in Crafty to pure R=3 everywhere was a 2 Elo change, as an example. That is _very_ difficult to detect and requires a ton of games. Unfortunately, almost everything you do to a chess engine is a "small Elo change", search extensions and the like all included. The raw computer speed appears to be by far the largest contributor to overall playing level; the software tweaks appear to be just that, software "tweaks" that produce small improvements to a basic alpha/beta program.

160 games is not going to show anything useful unless you try to compare something like minimax to alpha/beta or something equally significant.
- Posts: 2047
- Joined: Wed Mar 08, 2006 8:30 pm
Re: How much improvement from PVS?
hgm wrote:The 95% confidence interval is given in Elo points by +/- 550/sqrt(N).

Nice formula.
From my calculations (supposing it corresponds to a "2 sigma" confidence interval, and that a 1% score difference is +7 Elo), it is valid if 38% of these N games are draws.
For a 1/3 draw rate, I get the formula +/- 571/sqrt(N).
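That coefficient follows directly from the stated assumptions (a 2-sigma interval and about 7 Elo per 1% of score near 50%): with draw fraction d and equal win/loss chances, the per-game score variance is (1-d)/4, so the 2-sigma bar is roughly 700*sqrt(1-d)/sqrt(N) Elo. A quick check in C:
Code:
    /* With draw fraction d and equal win/loss probability, the per-game score
     * variance is (1-d)/4, so the 2-sigma error bar is
     * 2 * 700 * sqrt((1-d)/4) / sqrt(N) = 700*sqrt(1-d)/sqrt(N) Elo,
     * using the ~7 Elo per 1% score assumption from the post above. */
    #include <math.h>
    #include <stdio.h>

    double elo_coefficient(double draw_rate) {
        return 700.0 * sqrt(1.0 - draw_rate);   /* multiply by 1/sqrt(N) */
    }

    int main(void) {
        printf("38%% draws : +/- %.0f/sqrt(N)\n", elo_coefficient(0.38));     /* ~551, i.e. the 550 rule */
        printf("1/3 draws : +/- %.0f/sqrt(N)\n", elo_coefficient(1.0 / 3.0)); /* ~572, matching 571 up to rounding */
        return 0;
    }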
- Posts: 313
- Joined: Wed Mar 08, 2006 8:18 pm
Re: How much improvement from PVS?
bob wrote:HG gave you the answer. The problem occurs when you want to measure small changes. And perhaps the most surprising detail I have discovered is that many programming features, null-move vs. no-null-move as an example, are not a 100 Elo change. Comparing the adaptive null-move R=2~3 that I used for 10 years in Crafty to pure R=3 everywhere was a 2 Elo change, as an example. That is _very_ difficult to detect and requires a ton of games. Unfortunately, almost everything you do to a chess engine is a "small Elo change", search extensions and the like all included. The raw computer speed appears to be by far the largest contributor to overall playing level; the software tweaks appear to be just that, software "tweaks" that produce small improvements to a basic alpha/beta program. 160 games is not going to show anything useful unless you try to compare something like minimax to alpha/beta or something equally significant.

So basically, if you have 100 evaluator features, you can expect maybe 200-300 Elo from those in total? (Not exactly, of course... Just not a lot.)
That sure does sound like a good argument against making huge evaluators, like you do...
So that begs the question.... What is the low end of computer chess rating, then?
A brain-dead eval, basically. A basic quiescence search. Trans table. Nothing too sophisticated.
Any guess as to what the rating would be?
I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.
You know Bob, this sounds like a subject for a paper. Start off with a basic program, start adding features and see what kind of ratings improvements they give.
J. Schaeffer did something like that long ago, but of course he didn't have the kind of computing power to do enough reliable testing.
You are about the only one here with the hardware to do those kinds of tests.
And speaking of the tests.... Are you still doing just the 40 starting positions, or are you doing new ones for each game?
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How much improvement from PVS?
Carey wrote:So basically, if you have 100 evaluator features, you can expect maybe 200-300 Elo from those in total? (Not exactly, of course... Just not a lot.)
That sure does sound like a good argument against making huge evaluators, like you do...
So that begs the question.... What is the low end of computer chess rating, then?
A brain-dead eval, basically. A basic quiescence search. Trans table. Nothing too sophisticated.
Any guess as to what the rating would be?
I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.
You know Bob, this sounds like a subject for a paper. Start off with a basic program, start adding features and see what kind of ratings improvements they give.
J. Schaeffer did something like that long ago, but of course he didn't have the kind of computing power to do enough reliable testing.
You are about the only one here with the hardware to do those kinds of tests.
And speaking of the tests.... Are you still doing just the 40 starting positions, or are you doing new ones for each game?

I probably have enough ideas for papers to take 10 years to finish. Wouldn't it be nice to have something published that takes idea by idea and discovers what each is worth in terms of Elo? Null-move R=1, R=2, R=3, for example. No reductions vs reductions, no check extensions vs check extensions, etc. The list goes on and on. One observation I made years ago was that it often seems more important to evaluate a feature at all than the exact weight you assign to it. Just recognizing that something is good or bad is better than not doing anything at all. Adjusting the weight is not as important as one might guess. That could also be proven/disproven with a few games. For reference, over the weekend I ran 32 40,000-game tests.
- Posts: 28356
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: How much improvement from PVS?
ernest wrote:Nice formula.
From my calculations (supposing it corresponds to a "2 sigma" confidence interval, and that a 1% score difference is +7 Elo), it is valid if 38% of these N games are draws.
For a 1/3 draw rate, I get the formula +/- 571/sqrt(N).

Yes, it was only approximate. To know the error bars to better than 10%, one had better use BayesElo, as draws have a different weight from wins/losses in the rating model.