How much improvement from PVS?

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How much improvement from PVS?

Post by hgm »

Carey wrote:I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.
From CCRL:

Code: Select all

133 Micro-Max 4.8 2011 +36 −36 32.7% +129.9 21.8% 298 
I think this rating was measured on the gcc compile; there is now a compile by Denis Mendoza which is 1.75x faster on my C2D. That should in theory improve the Elo by about 55 points.
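For the arithmetic: with the commonly quoted 50-70 Elo per doubling of speed (a rule of thumb, not a measurement), a 1.75x speedup works out to roughly 40-56 Elo:

Code: Select all

#include <math.h>
#include <stdio.h>

/* Rough Elo gain from a pure speedup, using the commonly quoted
 * 50-70 Elo per doubling of search speed.  Both the rule of thumb
 * and the 70-per-doubling figure used here are assumptions. */
int main(void)
{
    double speedup = 1.75;          /* Denis Mendoza compile vs. gcc  */
    double per_doubling = 70.0;     /* assumed rule-of-thumb constant */

    printf("estimated gain: %.0f Elo\n", per_doubling * log2(speedup));
    return 0;
}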

Indeed, in the current ChessWar competition, micro-Max is a contender for a promotion spot to the E division! With one (Swiss) round still to go:

Code: Select all

 RANK   ENGINE                                    GAMES  POINTS   
   1.   LEARNINGLEMMING 0.31_X64                    10     9.5    
   2.   SLOPPY 0.2.0_X64_JA                         10     9.0    
   3.   CHESS ONE 2.01_X64_JA                       10     7.0    
   4.   FEUERSTEIN 0.4.6                            10     7.0    
   5.   MATHEUS 2.3                                 10     7.0    
   6.   PORUCZNIK 4                                 10     7.0    
   7.   ETUDE 0.1.2                                 10     7.0   
   8.   NAGASKAKI 5.12                              10     6.5    
   9.   CHEETAH 0.78_JA                             10     6.5    
  10.   ADAM 3.1                                    10     6.5    
  11.   SMALLPOTATO 0.6.1                           10     6.5    
  12.   WARRIOR 1.0.3                               10     6.5    
  13.   ATAK 6.2                                    10     6.5     
  14.   SIERZANT 28                                 10     6.0    
  15.   ZCT 0.3.2479_FB                             10     6.0    
  16.   RODIN 1.14                                  10     6.0   
  17.   URALOCHKA 1.1B                              10     6.0    
  18.   WJCHESS 1.64                                10     6.0   
  19.   MARVIN 1.3.0                                10     6.0   
  20.   MICRO-MAX 4.8_PII                           10     6.0   
------------------------------------------------------------------ 
  21.   TAKTIX 2.23X                                10     6.0    
  22.   BESTIA 0.90                                 10     6.0    
  23.   SHARPER 0.17                                10     6.0    
  24.   GULLYDECKEL 2.16PL2                         10     6.0    
  25.   FREYR 1.068_X64_JA                          10     5.5    
  26.   SMIRF MS 175B                               10     5.5   
  27.   TIMEA 4A18                                  10     5.5    
  29.   ALEX 1.86                                   10     5.5   
  30.   PHILOU 1.1.2                                10     5.5    
....
(first 20 promote)

Micro-Max probably has far less evaluation than most other engines in that list, but what it has is well tuned. Apart from piece values and positional eval that is more or less equivalent to piece-square tables (where the Pawn-push bonus implied by the Pawn table is a function of the total non-pawn material on the board), the only two evaluation terms are a penalty for King moves before the end-game and a penalty for moving a Pawn when there is no own Pawn two squares to the left or right of its From square.

In particular, there is no recognition of the Bishop pair, passed Pawns, isolated Pawns, or doubled Pawns.
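A rough sketch of what those two terms amount to (the names, magnitudes and board layout here are placeholders for illustration, not micro-Max's actual code):

Code: Select all

/* Sketch of the two extra evaluation terms described above, on a
 * plain 0..63 board.  The constants and names are assumptions. */
enum { EMPTY, OWN_PAWN, OWN_KING /* ... other pieces ... */ };

#define ENDGAME_MATERIAL   14   /* assumed threshold on non-pawn material */
#define KING_MOVE_PENALTY  20   /* assumed magnitude, centipawns          */
#define LONE_PAWN_PENALTY  15   /* assumed magnitude, centipawns          */

int move_penalty(const int board[64], int piece, int from, int nonpawn_material)
{
    int penalty = 0;
    int file = from & 7;

    /* Penalty for King moves before the end-game is reached. */
    if (piece == OWN_KING && nonpawn_material > ENDGAME_MATERIAL)
        penalty += KING_MOVE_PENALTY;

    /* Penalty for a Pawn move when there is no own Pawn on the square
     * two to the left or two to the right of the From square.         */
    if (piece == OWN_PAWN &&
        !(file >= 2 && board[from - 2] == OWN_PAWN) &&
        !(file <= 5 && board[from + 2] == OWN_PAWN))
        penalty += LONE_PAWN_PENALTY;

    return penalty;
}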

Note that sometimes big improvements can occur: Joker80 gained nearly 100 Elo points when I changed its piece values from the set taken from normal Chess (with educated guesses for the Archbishop and Chancellor values) to the set especially determined for Gothic Chess. This was mainly due to increasing the Bishop value by half a Pawn.

A nice game of uMax (from the 10th ChessWar F round), where you can see how its aggressive Pawn play and high bonus for 6th-rank Pawns help it salvage the point, is this:

Code: Select all

[Event "ChessWar XIII F 40m/20'"]
[Site "DEVILLE1-5D9C28"]
[Date "2008.09.23"]
[Round "1.17"]
[White "BikJump 1.8_x64"]
[Black "Micro-Max 4.8_PII"]
[Result "0-1"]
[TimeControl "40/1200"]
[Annotator "3. +0.01   1... +0.11"]
[Number "17"]

1. e4 Nc6 {+0.11/9 16} 2. Nf3 Nf6 {+0.05/10 42} 3. Nc3 {+0.01/13 15}
d5 {+0.12/10 18} 4. e5 {+0.01/13 16} Ne4 {-0.16/10 41} 5. Be2 {+0.00/12 16}
e6 {-0.07/9 20} 6. O-O {+0.03/12 13} Be7 {+0.05/10 21} 7. Na4 {+0.10/13 36}
O-O {-0.05/11 15} 8. d3 {+0.13/14 15} Ng5 {-0.13/13 1:18} 9.
Nxg5 {+0.20/13 4} Bxg5 {-0.08/13 24} 10. Bxg5 {+0.15/16 41}
Qxg5 {+0.03/13 53} 11. d4 {+0.14/15 30} f6 {+0.06/11 46} 12.
f4 {+0.09/15 46} Qh6 {+0.10/12 18} 13. c3 {+0.09/13 2} fxe5 {+0.16/13 56}
14. fxe5 {+0.09/14 49} Qe3+ {+0.23/12 16} 15. Kh1 {+0.19/14 23}
Rxf1+ {+0.01/12 18} 16. Qxf1 {+0.25/14 16} b6 {+0.00/11 19} 17.
Rd1 {+0.31/12 14} Bd7 {+0.13/12 17} 18. Rd3 {+0.05/11 3} Qe4 {+0.04/12 13}
19. Rf3 {+0.29/13 30} Qc2 {+0.06/12 15} 20. Bd1 {+0.25/13 22}
Qg6 {+0.03/12 19} 21. Rf2 {+0.22/13 28} a5 {+0.05/12 22} 22.
Bc2 {+0.41/13 24} Qe8 {+0.00/12 23} 23. b3 {+0.38/14 47} Ne7 {-0.09/12 18}
24. Nb2 {+0.42/14 30} Qh5 {-0.08/12 30} 25. Bd3 {+0.54/10 3}
Nf5 {+0.02/11 21} 26. Bxf5 {+0.55/13 19} exf5 {+0.18/14 29} 27.
g3 {+0.53/13 17} Qh6 {+0.25/12 21} 28. Nd3 {+0.54/12 25} Qe3 {+0.35/12 18}
29. Qg2 {+0.48/13 15} c6 {+0.34/12 17} 30. Qf3 {+0.54/14 23}
Qe4 {+0.35/14 17} 31. Nf4 {+0.50/14 18} Rf8 {+0.37/13 28} 32.
Kg2 {+0.58/15 29} Re8 {+0.32/14 32} 33. Re2 {+0.54/16 49}
Qxf3+ {+0.39/13 14} 34. Kxf3 {+0.55/16 15} h6 {+0.24/13 28} 35.
Re1 {+0.56/15 23} Kf7 {+0.32/13 24} 36. Kf2 {+0.53/13 24} g5 {+0.45/14 23}
37. Nh5 {+0.54/16 29} Be6 {+0.34/13 17} 38. Nf6 {+0.51/15 38}
Rh8 {+0.37/13 23} 39. Re2 {+0.52/14 31} b5 {+0.42/14 46} 40.
a3 {+0.54/14 9} h5 {+0.40/14 17} 41. h4 {+0.54/15 25} f4 {+0.31/14 1:11}
42. hxg5 {+1.17/16 23} fxg3+ {+0.43/15 22} 43. Kxg3 {+1.19/18 47}
h4+ {+0.34/16 18} 44. Kh2 {+1.19/17 2} Kg6 {+0.24/16 31} 45.
Rg2 {+0.31/14 3} h3 {+0.40/17 38} 46. Rg1 {+1.04/15 1} Rh4 {+0.44/15 26}
47. Ne8 {+0.22/16 31} Re4 {+0.66/16 44} 48. a4 {+0.22/14 0}
b4 {+1.08/15 17} 49. cxb4 {-0.19/15 24} axb4 {+0.58/15 16} 50.
Rd1 {-0.18/16 35} Re2+ {+0.99/15 18} 51. Kh1 {-0.41/17 25}
Kxg5 {+0.94/16 39} 52. Nf6 {-0.41/15 2} Rb2 {+1.01/16 22} 53.
Rc1 {-0.11/15 10} Rxb3 {+1.55/17 27} 54. Rxc6 {-0.31/14 1}
Bf5 {+1.55/17 45} 55. a5 {-0.19/11 4} Ra3 {+2.99/17 30} 56.
Rc1 {-1.70/12 5} b3 {+4.38/17 19} 57. Rg1+ {-1.58/7 0} Kh4 {+4.46/18 25}
58. Rg8 {-4.21/11 3} b2 {+4.96/18 17} 59. Rh8+ {-4.76/10 0}
Kg5 {+5.91/20 28} 60. Rg8+ {-4.88/13 2} Kf4 {+7.28/19 24} 61.
Nxd5+ {-4.01/10 0} Kf3 {+10.65/19 49} 62. Rg1 {-4.01/9 0}
Kf2 {+14.18/18 27} 63. Kh2 {-7.37/12 0} Ra1 {+79.93/21 24} 64.
Nc3 {-14.85/9 0} Rxg1 {+79.94/26 20} 65. d5 {-15.02/8 0} Rg2+ {+79.95/28 2}
66. Kh1 {-5.91/2 0} b1=Q+ {+79.96/28 0} 67. Nxb1 {-99.94/5 0}
{White resigns} 0-1
The turning point of the game is 41. ... f4, which BikJump underestimates (as evidenced by its score jumping up), because it apparently does not realize the danger of the h-pawn reaching h3, safely defended by B+R. Note that this is a 'ponder on' tournament, which puts uMax at a disadvantage against almost all other engines in this Elo range, as it does not have pondering implemented.
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: How much improvement from PVS?

Post by xsadar »

hgm wrote:
xsadar wrote:Or what is the best way to verify your implementation?
As, for a given-depth search, the move is not supposed to depend on whether you use PVS or plain alpha-beta,
Well, if there's no search instability you should always get the same move, but it seems to me that any search instability would certainly cause the two to give slightly different scores, which would of course lead to different moves in some positions. However, I think my hash is the only thing causing search instability right now, so that may not be a problem if I disable it for testing.
it would be sufficient to record the time-to-depth averaged over some 100 representative game positions, and see how much of a speedup one gives w.r.t. the other.
That's certainly more doable than playing 20,000 games. And it makes sense that if the moves always match and the new version is faster, then it must be better. Thanks.
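A rough sketch of such a harness, with hypothetical set_position() / search_fixed_depth() hooks standing in for the engine interface, and the hash assumed disabled as discussed:

Code: Select all

#include <stdio.h>
#include <time.h>

/* Hypothetical hooks into the engine; stand-ins, not a real API. */
extern void set_position(const char *fen);
extern int  search_fixed_depth(int depth, int use_pvs);  /* returns best move */

/* Compare time-to-depth and best moves for PVS vs. plain alpha-beta
 * over a file of representative positions (one FEN per line). */
int main(void)
{
    char fen[128];
    double t_ab = 0.0, t_pvs = 0.0;
    int mismatches = 0, n = 0;
    FILE *f = fopen("positions.fen", "r");

    if (!f) return 1;
    while (fgets(fen, sizeof fen, f)) {
        clock_t t0;
        int m_ab, m_pvs;

        set_position(fen);
        t0 = clock();
        m_ab = search_fixed_depth(10, 0);                 /* plain alpha-beta */
        t_ab += (double)(clock() - t0) / CLOCKS_PER_SEC;

        set_position(fen);
        t0 = clock();
        m_pvs = search_fixed_depth(10, 1);                /* PVS */
        t_pvs += (double)(clock() - t0) / CLOCKS_PER_SEC;

        if (m_ab != m_pvs)
            mismatches++;                                 /* should stay 0 */
        n++;
    }
    fclose(f);
    if (n == 0) return 1;

    printf("%d positions, %d move mismatches\n", n, mismatches);
    printf("avg time-to-depth: AB %.3fs  PVS %.3fs  (speedup %.2fx)\n",
           t_ab / n, t_pvs / n, t_ab / t_pvs);
    return 0;
}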
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: How much improvement from PVS?

Post by xsadar »

bob wrote:
Carey wrote:
bob wrote:This is a problem, for sure. 40K games give an Elo range of +/- 4, roughly. Less than that and the error bar goes up and accuracy drops off. Less than 20K games is not going to be enough to measure such a small change, based on the cluster testing I have been doing.
Bob,

Maybe I missed it, but did you ever come up with a list of how many games are needed for a XYZ error rate?

In other words, if 40k are for +- 4, does that mean 20k are for +- 8 and 10k for +- 16, etc.?

If we are just looking for, say 50 points, then how many games are needed?

Since not everybody has a cluster, many of us would like to do as few games as possible.

Also, what was the final testing strategy you decided on? If I remember right, it was to use different starting positions for every single game, rather than the limited 40 positions and do both white & black and repeat until done.
HG gave you the answer. The problem occurs when you want to measure small changes. And perhaps the most surprising detail I have discovered is that many programming features (null-move versus no null-move, for example) are not 100-Elo changes. Comparing the adaptive null-move R=2~3 that I used for 10 years in Crafty to pure R=3 everywhere was a 2 Elo change, as an example. That is _very_ difficult to detect and requires a ton of games. Unfortunately, almost everything you do to a chess engine is a "small Elo change", search extensions and the like included. The raw computer speed appears to be by far the largest contributor to overall playing level; the software tweaks appear to be just that, software "tweaks" that produce small improvements to a basic alpha/beta program.

160 games is not going to show anything useful unless you try to compare something like minimax to alpha/beta or something equally significant.
So, any updates on starting positions for testing, Bob? Are you still using the ~4000 positions you came up with before, or did you find a better way to select positions? And what about reusing vs. not reusing a position when playing a different color or against a different engine? Any more information on that?
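On the error-bar question quoted above: with independent games the error shrinks with the square root of the number of games, so halving the games widens the bar by a factor of about 1.4 rather than 2. A quick sketch of that scaling, anchored to the 40K-games / +-4-Elo figure:

Code: Select all

#include <math.h>
#include <stdio.h>

/* Error bars scale as 1/sqrt(N).  The 40,000-games / +-4-Elo anchor is
 * taken from the post above; the independence assumption is mine. */
int main(void)
{
    const double ref_games = 40000.0, ref_error = 4.0;
    const int sizes[] = { 40000, 20000, 10000, 5000, 1000, 160 };
    int count = (int)(sizeof sizes / sizeof sizes[0]);

    for (int i = 0; i < count; i++)
        printf("%6d games: about +/- %.1f Elo\n",
               sizes[i], ref_error * sqrt(ref_games / sizes[i]));
    return 0;
}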
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: How much improvement from PVS?

Post by Tord Romstad »

xsadar wrote:Roughly, what's the range of Elo improvement I should get after changing from alpha-beta to PVS?
Impossible to say, because it depends on a lot of factors (the quality of your move ordering, your time management techniques, whether you use aspiration windows, your extension/reduction/pruning scheme, and so on). In my opinion, the most important advantage of PVS is not that it is more efficient, but that it is easier to implement and reason about.

Tord
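For reference, a minimal PVS sketch over a fail-hard negamax. The move-generation, make/unmake and evaluate() helpers are hypothetical, and there is no hash table, extension or reduction; it only shows the zero-window search plus re-search being discussed:

Code: Select all

typedef int Move;                        /* placeholder move type      */
extern int  generate_moves(Move *list);  /* hypothetical helpers       */
extern void make(Move m);
extern void unmake(Move m);
extern int  evaluate(void);              /* or a quiescence search     */

/* Fail-hard principal variation search (mate/stalemate handling and
 * move ordering omitted for brevity). */
int pvs(int depth, int alpha, int beta)
{
    Move moves[256];
    int n, score;

    if (depth == 0)
        return evaluate();

    n = generate_moves(moves);
    for (int i = 0; i < n; i++) {
        make(moves[i]);
        if (i == 0) {
            /* First move: full window; this becomes the PV candidate. */
            score = -pvs(depth - 1, -beta, -alpha);
        } else {
            /* Later moves: try to prove them worse with a zero window, */
            score = -pvs(depth - 1, -alpha - 1, -alpha);
            /* and re-search with the full window if that fails.        */
            if (score > alpha && score < beta)
                score = -pvs(depth - 1, -beta, -alpha);
        }
        unmake(moves[i]);

        if (score >= beta)
            return beta;                 /* fail-hard cutoff */
        if (score > alpha)
            alpha = score;
    }
    return alpha;
}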
Greg McGlynn

Re: How much improvement from PVS?

Post by Greg McGlynn »

Carey wrote:So that begs the question.... What is the low end of computer chess rating, then?

A brain-dead eval, basically. A basic quiescence search. Trans table. Nothing too sophisticated.

Any guess as to what the rating would be?

I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.


You know Bob, this sounds like a subject for a paper. Start off with a basic program, start adding features and see what kind of ratings improvements they give.
I've been running something like this on an engine I'm working on called "Mallard." I set up 9 versions of it, each one with one additional search heuristic and ran a big round-robin. I also included Micro-max, Mediocre, and an engine I wrote about a year ago called Bird. Here are the results:

Code: Select all

Rank Name        Elo    +    - games score oppo. draws
   1 Mediocre    467   37   37   330   70%   290   13%
   2 Bird        463   37   37   330   70%   291   15%
   3 Mallard9    425   36   36   330   65%   294   16%
   4 Mallard8    376   34   34   330   60%   299   24%
   5 Micro-Max   369   34   34   330   58%   299   21%
   6 Mallard7    354   34   34   330   56%   301   23%
   7 Mallard6    352   34   34   330   56%   301   25%
   8 Mallard5    322   34   34   330   52%   303   25%
   9 Mallard4    275   34   34   330   46%   308   19%
  10 Mallard3    185   36   36   330   34%   316   15%
  11 Mallard2     72   40   40   330   20%   326   17%
  12 Mallard1      0   44   44   330   13%   333   15%


LOS table:
           Me Bi Ma Ma Mi Ma Ma Ma Ma Ma Ma Ma
Mediocre      55 95 99 99 99 99 99 99100100100
Bird       44    93 99 99 99 99 99 99100100100
Mallard9    4  6    98 99 99 99 99 99100100100
Mallard8    0  0  1    61 82 84 98 99 99100100
Micro-Max   0  0  0 38    74 76 97 99 99100100
Mallard7    0  0  0 17 25    51 91 99 99100100
Mallard6    0  0  0 15 23 48    90 99 99100100
Mallard5    0  0  0  1  2  8  9    97 99100100
Mallard4    0  0  0  0  0  0  0  2    99 99100
Mallard3    0  0  0  0  0  0  0  0  0    99 99
Mallard2    0  0  0  0  0  0  0  0  0  0    99
Mallard1    0  0  0  0  0  0  0  0  0  0  0
The basic program (Mallard1) had alpha-beta with a simple evaluation function (material + piece-square tables). Main search move ordering was captures ordered by MVV/LVA, then unordered noncaptures. Quiescence search was captures ordered by MVV/LVA, followed by unordered noncaptures if in check. Mallard2-9 had the following added features:

2-transposition table
3-standard recursive r=2 null-move heuristic
4-killer move and history heuristics
5-principal variation search
6-SEE used to prune losing captures in quiescence search
7-futility pruning in quiescence search (if stand_pat_eval + value[capture] + 100 < alpha => prune move)
8-futility pruning in non-pv nodes (margin of 150 at depth 1;margin of 500 at depth 2)
9-check extensions (1 ply)

Of course, you would obtain different results with a different program and maybe a different number of games, and perhaps I haven't implemented all of the above optimally, but I think the results are interesting.
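To make items 7 and 8 concrete, the two pruning conditions amount to something like the following (the helper names are hypothetical; the margins are the ones listed above):

Code: Select all

/* 7: in quiescence search, skip a capture that cannot raise alpha
 *    even with a 100 cp safety margin on top of the captured value. */
int qs_futile(int stand_pat, int captured_value, int alpha)
{
    return stand_pat + captured_value + 100 < alpha;
}

/* 8: at depth 1 or 2 in non-PV nodes, skip quiet moves when the
 *    static eval plus a depth-dependent margin cannot reach alpha. */
int futile(int depth, int in_pv, int static_eval, int alpha)
{
    if (in_pv || depth > 2)
        return 0;
    return static_eval + (depth == 1 ? 150 : 500) < alpha;
}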
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: How much improvement from PVS?

Post by hgm »

Greg McGlynn wrote:..., but I think the results are interesting.
This is indeed extremely interesting.

Did you do something to randomize the games? Micro-Max has no randomizer and no book, and is very insensitive to time jitter, as it always finishes an iteration. So it tends to reproduce games. Unless the opponent randomizes sufficiently, of course.

The funny thing is that micro-Max lacks many of the features that the engines that ended below it apparently had:

1 - no awareness of check in QS nodes, it just stands pat when in-check.
4 - No killer move or history
5 - no PVS, plain alpha-beta
6 - no SEE, no pruning of losing captures

It does have some others, though:

* It has QS and d=1 futility pruning with zero margin. (As the eval is entirely differential, it is exactly known before the recursive call is made.)
* Internal Iterative Deepening in every node.
* check extension at d>=1.
* Late-Move Reduction of all non-capture, non-pawn moves except the hash move.

Of course the evaluation might be very important too. That you have material + PST does not mean that they are properly tuned. Do you score the Bishop pair? Do you have no Pawn-structure evaluation at all? If you do not recognize passers in the eval, how does the Pawn PST score Pawns on the 6th and 7th rank? Do you have a different PST for the King in the end-game, and how do you recognize game stage? Do you have any King safety at all in the evaluation?
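The zero-margin futility point above is worth making concrete: with a purely differential eval, the stand-pat score the child would return is already known as the current eval plus the gain of the move, so the test costs nothing and can be done before the move is even made. A sketch, with hypothetical names:

Code: Select all

/* Zero-margin futility with a differential evaluation: the child's
 * stand-pat score equals current_eval + move_gain, known before the
 * recursive call, so hopeless moves are skipped before make().
 * Names are hypothetical; this is not micro-Max's actual code. */
int futile_before_make(int depth, int current_eval, int move_gain, int alpha)
{
    if (depth > 1)                  /* only in QS and at d = 1 */
        return 0;
    return current_eval + move_gain <= alpha;   /* zero margin */
}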
Greg McGlynn

Re: How much improvement from PVS?

Post by Greg McGlynn »

hgm wrote:Did you do something to randomize the games? Micro-Max has no randomizer and no book, and is very insensitive to time jitter, as it always finishes an iteration. So it tends to reproduce games. Unless the opponent randomizes sufficiently, of course.
Every pair of engines ran a 30-game match on a fixed set of 15 different starting positions (each engine played white & black from each position).
hgm wrote:The funny thing is that micro-Max lacks many of the features that the engines that ended below it apparently had:

1 - no awareness of check in QS nodes, it just stands pat when in-check.
4 - No killer move or history
5 - no PVS, plain alpha-beta
6 - no SEE, no pruning of losing captures

It does have some others, though:

* It has QS and d=1 futility pruning with zero margin. (As the eval is entirely differential, it is exactly known before the recursive call is made.)
* Internal Iterative Deepening in every node.
* check extension at d>=1.
* Late-Move Reduction of all non-capture, non-pawn moves except the hash move.

Of course the evaluation might be very important too. That you have material + PST does not mean that they are properly tuned. Do you score the Bishop pair? Do you have no Pawn-structure evaluation at all? If you do not recognize passers in the eval, how does the Pawn PST score Pawns on the 6th and 7th rank? Do you have a different PST for the King in the end-game, and how do you recognize game stage? Do you have any King safety at all in the evaluation?
Mallard's evaluation is intentionally extremely basic; I have been focusing on the search. I am sure that the piece-square tables are very badly tuned. Once there are 4 pieces or fewer on the board, all piece-square tables are turned off except for a king endgame table, which encourages centralization, and a pawn endgame table, which gives +5 centipawns for each rank a pawn has advanced. No bishop-pair term, no pawn structure, no passer eval, no 6th/7th-rank bonus, no king safety. The only other aspect is that in pawnless endgames there is a separate eval that either scores a draw (if the material difference is <= a minor piece) or encourages the attacker to force the defender to the edge of the board.
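Roughly, that end-game handling amounts to something like the sketch below; the centralization and edge-driving weights are placeholders, and only the +5 cp per advanced rank and the "draw unless up more than a minor" rule come from the description above:

Code: Select all

#include <stdlib.h>   /* abs() */

#define MINOR_VALUE 300   /* assumed minor-piece value, centipawns */

/* Chebyshev distance from the four central squares. */
static int center_dist(int sq)
{
    int f = sq & 7, r = sq >> 3;
    int df = f < 4 ? 3 - f : f - 4;
    int dr = r < 4 ? 3 - r : r - 4;
    return df > dr ? df : dr;
}

/* End-game evaluation once 4 or fewer pieces remain, from White's
 * point of view.  material_diff is White minus Black in centipawns;
 * pawn ranks are counted 1..8 from White's side of the board.  The
 * pawnless case is shown in the same routine for brevity. */
int endgame_eval(int material_diff, int total_pawns,
                 int wk_sq, int bk_sq,
                 const int w_pawn_ranks[], int n_wp,
                 const int b_pawn_ranks[], int n_bp)
{
    if (total_pawns == 0) {
        /* Pawnless: a draw unless one side is up more than a minor;  */
        if (abs(material_diff) <= MINOR_VALUE)
            return 0;
        /* otherwise reward driving the defender's king to the edge.  */
        if (material_diff > 0)
            return material_diff + 8 * center_dist(bk_sq);
        return material_diff - 8 * center_dist(wk_sq);
    }

    /* King end-game table: encourage centralization for both kings.  */
    int score = material_diff
              - 4 * center_dist(wk_sq) + 4 * center_dist(bk_sq);

    /* Pawn end-game table: +5 cp per rank a pawn has advanced.       */
    for (int i = 0; i < n_wp; i++) score += 5 * (w_pawn_ranks[i] - 2);
    for (int i = 0; i < n_bp; i++) score -= 5 * (7 - b_pawn_ranks[i]);

    return score;
}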

Internal iterative deepening and LMR are things I will try next. I also have switches in the code that allow me to try things like extending/reducing winning, even, and losing captures, noncaptures, and underpromotions in PV or non-PV nodes. I will also try single-reply, mate-threat, and passed-pawn-push extensions, but I imagine stuff like passed pawns and mate threats is best handled in the evaluation; I think Bird, my last program, drowned in a sea of extensions that severely limited its depth.
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: How much improvement from PVS?

Post by xsadar »

Greg McGlynn wrote:
Carey wrote:So that begs the question.... What is the low end of computer chess rating, then?

A brain-dead eval, basically. A basic quiescence search. Trans table. Nothing too sophisticated.

Any guess as to what the rating would be?

I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.


You know Bob, this sounds like a subject for a paper. Start off with a basic program, start adding features and see what kind of ratings improvements they give.
I've been running something like this on an engine I'm working on called "Mallard." I set up 9 versions of it, each one with one additional search heuristic and ran a big round-robin. I also included Micro-max, Mediocre, and an engine I wrote about a year ago called Bird. Here are the results:

Code: Select all

Rank Name        Elo    +    - games score oppo. draws
   1 Mediocre    467   37   37   330   70%   290   13%
   2 Bird        463   37   37   330   70%   291   15%
   3 Mallard9    425   36   36   330   65%   294   16%
   4 Mallard8    376   34   34   330   60%   299   24%
   5 Micro-Max   369   34   34   330   58%   299   21%
   6 Mallard7    354   34   34   330   56%   301   23%
   7 Mallard6    352   34   34   330   56%   301   25%
   8 Mallard5    322   34   34   330   52%   303   25%
   9 Mallard4    275   34   34   330   46%   308   19%
  10 Mallard3    185   36   36   330   34%   316   15%
  11 Mallard2     72   40   40   330   20%   326   17%
  12 Mallard1      0   44   44   330   13%   333   15%


LOS table:
           Me Bi Ma Ma Mi Ma Ma Ma Ma Ma Ma Ma
Mediocre      55 95 99 99 99 99 99 99100100100
Bird       44    93 99 99 99 99 99 99100100100
Mallard9    4  6    98 99 99 99 99 99100100100
Mallard8    0  0  1    61 82 84 98 99 99100100
Micro-Max   0  0  0 38    74 76 97 99 99100100
Mallard7    0  0  0 17 25    51 91 99 99100100
Mallard6    0  0  0 15 23 48    90 99 99100100
Mallard5    0  0  0  1  2  8  9    97 99100100
Mallard4    0  0  0  0  0  0  0  2    99 99100
Mallard3    0  0  0  0  0  0  0  0  0    99 99
Mallard2    0  0  0  0  0  0  0  0  0  0    99
Mallard1    0  0  0  0  0  0  0  0  0  0  0
The basic program (Mallard1) had alpha-beta with a simple evaluation function (material + piece-square tables). Main search move ordering was captures ordered by MVV/LVA, then unordered noncaptures. Quiescence search was captures ordered by MVV/LVA, followed by unordered noncaptures if in check. Mallard2-9 had the following added features:

2-transposition table
3-standard recursive r=2 null-move heuristic
4-killer move and history heuristics
5-principal variation search
6-SEE used to prune losing captures in quiescence search
7-futility pruning in quiescence search (if stand_pat_eval + value[capture] + 100 < alpha => prune move)
8-futility pruning in non-pv nodes (margin of 150 at depth 1;margin of 500 at depth 2)
9-check extensions (1 ply)

Of course, you would obtain different results with a different program and maybe a different number of games, and perhaps I haven't implemented all of the above optimally, but I think the results are interesting.
Hmm... back to the original subject of the thread, I notice you have a difference of 47 Elo for adding PVS. I wonder how much effect self-play may have had on the accuracy of the ratings, and what they would look like if you instead played Mallard4 vs world and Mallard5 vs world (where world consisted of 5 or more engines unrelated to Mallard or each other) and compared the results.
User avatar
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: How much improvement from PVS?

Post by xsadar »

xsadar wrote:
hgm wrote:
xsadar wrote:Or what is the best way to verify your implementation?
As, for a given-depth search, the move is not supposed to depend on whether you use PVS or plain alpha-beta,
Well, if there's no search instability you should always get the same move, but it seems to me that any search instability would certainly cause the two to give slightly different scores, which would of course lead to different moves in some positions. However, I think my hash is the only thing causing search instability right now, so that may not be a problem if I disable it for testing.
it would be sufficient to record the time-to-depth averaged over some 100 representative game positions, and see how much of a speedup one gives w.r.t. the other.
That's certainly more doable than playing 20,000 games. And it makes sense that if the moves always match and the new version is faster, then it must be better. Thanks.
Well, using this testing method, I've found a PVS bug which can cause incorrect scores to be returned in certain circumstances. Hopefully there aren't any more, but I'll have to do more testing to see. Thanks for your help, HGM.
Dann Corbit
Posts: 12778
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: How much improvement from PVS?

Post by Dann Corbit »

xsadar wrote:
Greg McGlynn wrote:
Carey wrote:So that begs the question.... What is the low end of computer chess rating, then?

A brain-dead eval, basically. A basic quiescence search. Trans table. Nothing too sophisticated.

Any guess as to what the rating would be?

I know MicroMax 4.8 is often cited as being the lowest reasonable program but I'm not sure what rating it has.


You know Bob, this sounds like a subject for a paper. Start off with a basic program, start adding features and see what kind of ratings improvements they give.
I've been running something like this on an engine I'm working on called "Mallard." I set up 9 versions of it, each one with one additional search heuristic and ran a big round-robin. I also included Micro-max, Mediocre, and an engine I wrote about a year ago called Bird. Here are the results:

Code: Select all

Rank Name        Elo    +    - games score oppo. draws
   1 Mediocre    467   37   37   330   70%   290   13%
   2 Bird        463   37   37   330   70%   291   15%
   3 Mallard9    425   36   36   330   65%   294   16%
   4 Mallard8    376   34   34   330   60%   299   24%
   5 Micro-Max   369   34   34   330   58%   299   21%
   6 Mallard7    354   34   34   330   56%   301   23%
   7 Mallard6    352   34   34   330   56%   301   25%
   8 Mallard5    322   34   34   330   52%   303   25%
   9 Mallard4    275   34   34   330   46%   308   19%
  10 Mallard3    185   36   36   330   34%   316   15%
  11 Mallard2     72   40   40   330   20%   326   17%
  12 Mallard1      0   44   44   330   13%   333   15%


LOS table:
           Me Bi Ma Ma Mi Ma Ma Ma Ma Ma Ma Ma
Mediocre      55 95 99 99 99 99 99 99100100100
Bird       44    93 99 99 99 99 99 99100100100
Mallard9    4  6    98 99 99 99 99 99100100100
Mallard8    0  0  1    61 82 84 98 99 99100100
Micro-Max   0  0  0 38    74 76 97 99 99100100
Mallard7    0  0  0 17 25    51 91 99 99100100
Mallard6    0  0  0 15 23 48    90 99 99100100
Mallard5    0  0  0  1  2  8  9    97 99100100
Mallard4    0  0  0  0  0  0  0  2    99 99100
Mallard3    0  0  0  0  0  0  0  0  0    99 99
Mallard2    0  0  0  0  0  0  0  0  0  0    99
Mallard1    0  0  0  0  0  0  0  0  0  0  0
The basic program (Mallard1) had alpha-beta with a simple evaluation function (material + piece-square tables). Main search move ordering was captures ordered by MVV/LVA, then unordered noncaptures. Quiescence search was captures ordered by MVV/LVA, followed by unordered noncaptures if in check. Mallard2-9 had the following added features:

2-transposition table
3-standard recursive r=2 null-move heuristic
4-killer move and history heuristics
5-principal variation search
6-SEE used to prune losing captures in quiescence search
7-futility pruning in quiescence search (if stand_pat_eval + value[capture] + 100 < alpha => prune move)
8-futility pruning in non-pv nodes (margin of 150 at depth 1;margin of 500 at depth 2)
9-check extensions (1 ply)

Of course, you would obtain different results with a different program and maybe a different number of games, and perhaps I haven't implemented all of the above optimally, but I think the results are interesting.
Hmm... back to the original subject of the thread, I notice you have a difference of 47 Elo for adding PVS. I wonder how much effect self-play may have had on the accuracy of the ratings, and what they would look like if you instead played Mallard4 vs world and Mallard5 vs world (where world consisted of 5 or more engines unrelated to Mallard or each other) and compared the results.
This looks like a very interesting experiment.
I have never seen such a detailed examination of various search aspects.

I wonder if the results would keep their relative merits as eval improves.
In other words, these experiments are all search-related and hence are measuring knowledge gained through search.

I wonder if added chess knowledge will skew these proportions or if the relative difference caused by adding search features will remain about the same.

It seems to me that some features may be correlated. For instance, adding checks in qsearch could have an impact on king safety, so evaluation terms may have an effect similar to search terms; but will they combine into something more effective than their sum, equal to their sum, or less than their sum?