SF dev.
Why are some patches (up to 6) released in one day, often within less than an hour,
and then 10 or more days pass without a single patch?
Why so many within 1 or 2 hours?
Stockfish patches
Moderators: hgm, Rebel, chrisw
-
- Posts: 582
- Joined: Wed May 10, 2006 7:28 pm
- Location: Birmingham, England
Re: Stockfish patches
I fear that Axtens's "Enable popcount and prefetch for ppc-64" patch could slow down old x64 PCs by 8–10%.
Marek Soszynski
-
- Posts: 4576
- Joined: Sun Mar 12, 2006 2:40 am
- Full name:
Re: Stockfish patches
It just depends on when the maintainer, who is Marco, has the time to decide whether a patch passes and to update the master. Before then, developers and testers can comment, and sometimes there is discussion, so a patch is seldom applied to master on the same day it is published as a pull request. (You can view the pending pull requests under their own tab.) Sometimes it happens quickly, when no discussion is needed and Marco can apply it.
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
-
- Posts: 10413
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish patches
Eelco de Groot wrote: ↑Thu Jul 11, 2019 7:57 pm It just depends on when the maintainer, which is Marco, has the time to make the decision whether a patch passes or not and update the master. [...]
I think that it is bad for Stockfish, because people do not test a patch against the latest version, and testing against the latest version could lead to a different result.
For example, 2 patches passed SPRT[0,4]:
1) Combo of statscore divisor and pawn psqt changes
2) Tweak capture scoring formula
If I understand correctly, it was effectively A+patch1 against A and A+patch2 against A.
Now maybe A+patch1+patch2 does not work against A.
I often see no improvement in the regression tests, and I suspect this may be the reason.
The last regression tests give:
Regression/progression test against SF10 after "More bonus for free passed pawn" of June 20th.
ELO: 24.06 +-1.8 (95%) LOS: 100.0%
Total: 40000 W: 7313 L: 4547 D: 28140
Later
Regression/progression test against SF10 after "Bonus for double attacks on unsupported pawns" of June 27th.
ELO: 22.75 +-1.8 (95%) LOS: 100.0%
Total: 40000 W: 7260 L: 4644 D: 28096
and so far
Regression/progression test against SF10 after "Assorted trivial cleanups June 2019" of July 11th.
ELO: 23.69 +-2.4 (95%) LOS: 100.0%
Total: 24209 W: 4468 L: 2820 D: 16921
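The elo figures and error bars quoted in these regression results can be reproduced from the raw W/L/D counts with the standard logistic-elo formula. The sketch below is my own back-of-envelope calculation (the helper name `elo_from_wld` is mine, not fishtest's): the score fraction maps to elo via the logistic curve, and the error bar comes from the per-game score variance via the delta method.

```python
import math

def elo_from_wld(w, l, d):
    """Logistic elo estimate and 95% error bar from W/L/D counts."""
    n = w + l + d
    s = (w + 0.5 * d) / n                    # score fraction
    elo = -400 * math.log10(1 / s - 1)       # logistic elo
    # per-game score variance, then delta-method error bar on the elo
    var = (w * (1 - s) ** 2 + d * (0.5 - s) ** 2 + l * s ** 2) / n
    err95 = 1.96 * math.sqrt(var / n) * 400 / (math.log(10) * s * (1 - s))
    return elo, err95

elo, err = elo_from_wld(7313, 4547, 28140)   # first regression result above
print(f"ELO: {elo:.2f} +-{err:.1f}")         # ELO: 24.06 +-1.8
```

Plugging in the other quoted results reproduces their figures as well (e.g. W: 7260, L: 4644, D: 28096 gives 22.75 +-1.8), so the numbers in the thread are internally consistent.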
I wonder what the reason for almost no improvement is, and I think maybe it is that Stockfish accepts more than one change at a time, even though the changes are not tested against the latest accepted version.
-
- Posts: 12566
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Stockfish patches
A lot of software development is like that.
I might do a:
pacman -Syu
command and get nothing for a week.
And today,
pacman -Syu
gave me 56 project updates
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 146
- Joined: Wed Aug 27, 2008 3:48 am
Re: Stockfish patches
I was wondering the same thing about the Stockfish development versions: nothing for 5 days, then a group of patches all at once.
Another thing I noticed on Fishtest is that many patches get labeled as improvements when in fact they score more losses than wins. Is somebody trying to ruin the improvements gained? I know they wrote that some patches may pass with a negative score if they bring a speedup. Still, it looks like they would not remove code that is a net gain of elo and gives the program more knowledge.
-
- Posts: 10413
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish patches
Simplifications can pass with a slightly negative score, but the cases when that happens are rare.
The main problem that I see is that the Stockfish team gives wrong information about elo.
They give an elo estimate for every change, and if you add up all the estimates that they publish, the elo improvement should be clearly higher.
Here is the elo estimate for all the improvements if you look only at the Stockfish framework results from Stockfish 10 for December and January (I give only the estimate for the improvement at long time control).
It seems that the Stockfish developers claim more than 20 elo of improvement per month if you ignore their regression tests.
I did not calculate the elo advantage except for December 2018 and January 2019, but I guess the picture for other months may be similar.
1)Remove Overload bonus 1.12.2018 +0.06 elo
2)Penalize refuted killers in continuation history 1.12.2018 +3.06 elo(total improvement +3.12 elo)
3)Introduce concept of double pawn protection. 2.12.2018 +3.95 elo(total improvement +7.07 elo)
4)pseudo_legal() and MOVE_NONE 6.12.2018 Reverted in 7 so I assume 0 elo
5)Simplify time manager in search() 6.12.2018 +0.88 elo at long time control(total improvement +7.95 elo)
6)Simplify Killer Move Penalty 6.12.2018 +0.81 elo at long time control(total improvement 8.76 elo)
7)Revert "pseudo_legal() and MOVE_NONE"
8)simplify opposite_colors 9.12.2018
9)add paren. 9.12.2018
10)remove parenthesis. 9.12.2018
11)remove extra line 9.12.2018.
No results for 8-11 and I assume no functional change and no elo change for them
12)Tweak CMH pruning 9.12.2018 +2.22 elo at long time control(total improvement 10.98 elo)
13)Changes identified in RENAME/REFORMATTING thread (#1861) 11.12.2018 no functional change 0 elo
14)Asymmetrical 8x8 Pawn PSQT 13.12.2018 passed long time control but no estimate, so I estimate 0 elo
15)A combo of parameter tweaks 13.12.2018 +2.31 elo(total improvement 13.29 elo)
16)Remove Null Move Pruning material threshold 16.12.2018 +1.03 elo(total improvement 14.32 elo)
17)Start a TT resize only after search finished 16.12.2018 no functional change 0 elo
18)Fix a segfault. 16.12.2018
19)Refactor king ring calculation no testing at long time control so I assume 0 elo
20)Use stronglyProtected 16.12.2018 +0.89 elo (total improvement 15.21 elo)
21)New voting formula for threads 18.12.2018 no functional change in simple thread mode so I assume 0 elo
22)Tweak main killer penalty 18.12.2018 +2.59 elo(total improvement 17.8 elo)
23)Simplify KBNK endgame implementation 20.12.2018
24)Simplify generate_castling (#1885) 23.12.2018
25)Turn on random access for Syzygy files in Windows (#1840) 23.12.2018
26)Use a bit less code to calculate hashfull() (#1830) 23.12.2018
27)Update our continuous integration machinery (#1889) 23.12.2018
28)Improve endgame KBN vs K (#1877) 24.12.2018 claims no functional change but that is wrong; assume 0 elo
29)Extend stack to ss-5, and remove conditions 24.12.2018 -0.17 elo total improvement 17.63 elo
30)Fix crash in best_group() (#1891) 24.12.2018
31)Simplify SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX loop (#1892) 24.12.2018
32)Always initialize and evaluate king safety 27.12.2018 +3.59 elo(total improvement 21.22 elo)
33)Improve the Readme 29.12.2018
total improvement in december 21.22 elo
34)Remove as useless micro-optimization in pawns generation (#1915) 1.1.2019(not tested at LTC assume 0 elo)
35)Remove "Any" predicate filter (#1914) 1.1.2019
36)Remove openFiles in pawns. (#1917) 1.1.2019(no testing at LTC so assume 0 elo)
37)Assorted trivial cleanups (#1894) 1.1.2019
38)Delay castling legality check 4.1.2019 +5.03 elo (5.03 total improvement in january)
39)Check tablebase files 4.1.2019
40)Introduce Multi-Cut 6.1.2019 +3.20 elo(8.23 elo total)
41)Flag critical search tree in hash table 9.1.2019 3.21 elo(11.44 total)
42)Small improvements to the CI infrastructure 9.1.2019
43)Minor cleanup to recent 'Flag critical search tree in hash table' patch 10.1.2019
44)Remove pvExact 10.1.2019 +1.72 elo(13.16 total)
45)Simplify time management a bit 14.1.2019 +2.51 elo(15.67 elo total)
46)Simplify pawn moves (#1900) 14.1.2019 no testing at long time control
47)Remove AdjacentFiles 17.1.2019 no testing at long time control
48)Tweak initiative and Pawn PSQT (#1957) 20.1.2019 +1.74 elo(17.41 elo total)
49)Clean-up some shifting in space calculation (#1955) 20.1.2019
50)Simplify pvHit (#1953) 20.1.2019
51)Simplify pondering time management (#1899) no testing with ponder so not relevant
52)Simplify TrappedRook 22.1.2019 +0.44 elo(17.85 total)
53)Use int8_t instead of int for SquareDistance[] 29.1.2019 pure speed up so no testing at LTC and I assume 0 elo
54)Change pinning logic in Static Exchange Evaluation (SEE) 29.1.2019 +1.46 elo(19.31 total)
55)Don't update pvHit after IID 29.1.2019 +0.38 elo(19.69 total)
56)Simplify Stat Score bonus 31.1.2019 +1.34 elo(21.03 total)
-
- Posts: 2273
- Joined: Mon Sep 29, 2008 1:50 am
Re: Stockfish patches
Uri wrote: They give elo estimate for every change and if you add all the estimates that they publish the elo improvement should be clearly higher.
You have a basic misunderstanding of statistics. The elo estimate calculated by fishtest is (median) unbiased over all patches, both failing and passing. In other words, it corrects the inherent SPRT bias.
However if you select only the passed test then you have a different kind of bias which is called "selection bias". The elo estimate does not correct for that and it is not at all obvious how to do so. It would be possible if an elo prior were available.
Note: some time ago an elo prior for fishtest was determined as a normal distribution with mu=-1.013 and sigma=1.101 (logistic elo units). This prior was obtained by minimizing the difference between the empirical distribution of the elo estimates (about 6000 patches) and the theoretical one calculated from the prior. See respectively the histogram and the continuous line in the following graph.
As you see the match is visually rather good. No effort has been made however to see how this elo prior evolves over time.
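The selection-bias effect described here can be illustrated with a toy Monte Carlo that uses exactly the quoted prior (mu = -1.013, sigma = 1.101). This is not fishtest's actual SPRT machinery, just a fixed-length test with a crude pass rule; the 20000-game length, the pass threshold, and all function names are my own assumptions for the sketch.

```python
import math
import random

random.seed(42)
MU, SIGMA = -1.013, 1.101   # the elo prior quoted above (logistic elo units)
N_GAMES = 20000             # toy fixed-length test; real fishtest uses SPRT

def simulate_patch():
    """Draw a patch's true elo from the prior, then a noisy measured elo."""
    true_elo = random.gauss(MU, SIGMA)
    p = 1 / (1 + 10 ** (-true_elo / 400))          # expected score per game
    # normal approximation to the average score over N_GAMES games
    s = random.gauss(p, math.sqrt(p * (1 - p) / N_GAMES))
    obs_elo = -400 * math.log10(1 / s - 1)
    return true_elo, obs_elo

patches = [simulate_patch() for _ in range(5000)]
passed = [(t, o) for t, o in patches if o > 0.5]   # crude "pass" rule
mean_true = sum(t for t, _ in passed) / len(passed)
mean_obs = sum(o for _, o in passed) / len(passed)
print(f"{len(passed)} of 5000 passed; mean true elo {mean_true:+.2f}, "
      f"mean measured elo {mean_obs:+.2f}")
```

Even though each individual measurement is unbiased, conditioning on passing makes the measured elo of the survivors systematically higher than their true elo, which is exactly the selection bias described above: summing the published estimates of passed patches overstates the total gain.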
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 10413
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish patches
Michel wrote: ↑Sat Jul 13, 2019 11:12 am You have a basic misunderstanding of statistics. The elo estimate calculated by fishtest is (median) unbiased over all patches (both failing and passing). In other words it corrects the inherent SPRT bias. [...]
I understand that the estimate for the passed patches is biased and wrong, and that is exactly the problem: they should not publish a biased estimate.
If they want to give a good estimate, then the only good way is simply to play a fixed number of games after deciding to accept the patch.
Using an elo prior distribution for the patches relies on assumptions that we do not know are correct.
Playing 40000 games between every 2 consecutive versions, even now, would give an unbiased estimate with a possible error of 2 elo.
People may complain about the hardware time, but I think it is better for Stockfish to spend hardware on a truly unbiased estimate of the value of patches, because that knowledge may later help us understand which patches really made Stockfish better.
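As a rough sanity check on the 40000-games figure, one can invert the error-bar formula and ask how many games a target margin needs. The helper below is my own sketch, assuming a near-even match and a typical ~70% draw rate (roughly what the regression tests above show); the function name and defaults are mine.

```python
import math

def games_for_margin(target_err, draw_rate=0.70, z=1.96):
    """Games needed for the 95% elo error bar to shrink to target_err,
    for a near-even match with the given draw rate (rough model)."""
    var = (1 - draw_rate) * 0.25                 # per-game score variance at s = 0.5
    elo_per_score = 400 / (math.log(10) * 0.25)  # d(elo)/d(score) at s = 0.5
    se_target = target_err / (z * elo_per_score)
    return math.ceil(var / se_target ** 2)

print(games_for_margin(2.0))
```

With these assumptions a +-2 elo bar needs on the order of 35000 games, so 40000 games per accepted patch would indeed pin each version down to about 2 elo, consistent with the +-1.8 bars on the 40000-game regression tests quoted earlier in the thread.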
-
- Posts: 2273
- Joined: Mon Sep 29, 2008 1:50 am
Re: Stockfish patches
Uri wrote: Playing 40000 games for every 2 consecutive versions even now will give an unbiased estimate with a possible mistake of 2 elo.
Getting an accurate elo assessment of patches (or doing any kind of research at all) was never a goal of fishtest. The idea is that the considerable resources needed for such an assessment should instead be used to test more patches, or to run SPRT tests with narrower bounds, which lets smaller-elo patches succeed more easily. This is simply a choice; one cannot argue with it.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.