SF dev.
Why are some patches (up to 6) released in one day, often within less than an hour,
and then 10 or more days pass without a single patch?
Why so many within 1 or 2 hours?
Stockfish patches
Moderators: hgm, Rebel, chrisw
-
- Posts: 582
- Joined: Wed May 10, 2006 7:28 pm
- Location: Birmingham, England
Re: Stockfish patches
I fear that Axtens's "Enable popcount and prefetch for ppc-64" patch could slow down old x64 PCs by 8–10%.
Marek Soszynski
-
- Posts: 4576
- Joined: Sun Mar 12, 2006 2:40 am
- Full name:
Re: Stockfish patches
It just depends on when the maintainer, who is Marco, has the time to decide whether a patch passes and to update the master. Before then, developers and testers can comment, and sometimes there is discussion, so a patch is seldom applied to master on the same day it is published as a pull request. (You can view the pending pull requests under their own tab.) Sometimes it happens quickly, when no discussion is needed and Marco can apply it.
Debugging is twice as hard as writing the code in the first
place. Therefore, if you write the code as cleverly as possible, you
are, by definition, not smart enough to debug it.
-- Brian W. Kernighan
-
- Posts: 10413
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish patches
Eelco de Groot wrote: ↑Thu Jul 11, 2019 7:57 pm It just depends on when the maintainer, which is Marco, has the time to make the decision whether a patch passes or not and update the master. [...]
I think that it is bad for Stockfish, because people do not test a patch against the latest version, and testing against the latest version could lead to a different result.
For example, 2 patches passed SPRT[0,4]:
1) Combo of statscore divisor and pawn psqt changes
2) Tweak capture scoring formula
If I understand correctly, it was effectively A+patch1 against A and A+patch2 against A.
Now maybe A+patch1+patch2 does not work against A.
I often see no improvement in the regression tests, and I suspect this may be the reason.
The last regression tests give:
Regression/progression test against SF10 after "More bonus for free passed pawn" of June 20th.
ELO: 24.06 +-1.8 (95%) LOS: 100.0%
Total: 40000 W: 7313 L: 4547 D: 28140
Later
Regression/progression test against SF10 after "Bonus for double attacks on unsupported pawns" of June 27th.
ELO: 22.75 +-1.8 (95%) LOS: 100.0%
Total: 40000 W: 7260 L: 4644 D: 28096
and so far
Regression/progression test against SF10 after "Assorted trivial cleanups June 2019" of July 11th.
ELO: 23.69 +-2.4 (95%) LOS: 100.0%
Total: 24209 W: 4468 L: 2820 D: 16921
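The elo figures and error bars quoted in these regression results can be reproduced from the raw W/L/D counts with the standard logistic-elo formula. The sketch below is my own back-of-envelope calculation (the helper name `elo_from_wld` is mine, not fishtest's): the score fraction maps to elo via the logistic curve, and the error bar comes from the per-game score variance via the delta method.

```python
import math

def elo_from_wld(w, l, d):
    """Logistic elo estimate and 95% error bar from W/L/D counts."""
    n = w + l + d
    s = (w + 0.5 * d) / n                    # score fraction
    elo = -400 * math.log10(1 / s - 1)       # logistic elo
    # per-game score variance, then delta-method error bar on the elo
    var = (w * (1 - s) ** 2 + d * (0.5 - s) ** 2 + l * s ** 2) / n
    err95 = 1.96 * math.sqrt(var / n) * 400 / (math.log(10) * s * (1 - s))
    return elo, err95

elo, err = elo_from_wld(7313, 4547, 28140)   # first regression result above
print(f"ELO: {elo:.2f} +-{err:.1f}")         # ELO: 24.06 +-1.8
```

Plugging in the other quoted results reproduces their figures as well (e.g. W: 7260, L: 4644, D: 28096 gives 22.75 +-1.8), so the numbers in the thread are internally consistent.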
I wonder what the reason for almost no improvement is, and I think maybe it is that Stockfish accepts more than one change at a time, even though the changes are not tested against the latest accepted version.
-
- Posts: 12566
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Stockfish patches
A lot of software development is like that.
I might do a:
pacman -Syu
command and get nothing for a week.
And today,
pacman -Syu
gave me 56 project updates
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 146
- Joined: Wed Aug 27, 2008 3:48 am
Re: Stockfish patches
I was wondering the same thing about the Stockfish development versions: nothing for 5 days, then a group of patches all at once.
Another thing I noticed on Fishtest is that many patches get labeled as improvements when in fact they score more losses than wins. Is somebody trying to ruin the improvements gained? I know they wrote that some patches may pass with a negative score if they bring a speedup. Still, it looks like they would not remove code that is a net gain of elo and gives the program more knowledge.
-
- Posts: 10413
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish patches
Simplifications can pass with a slightly negative score, but the cases when that happens are rare.
The main problem that I see is that the Stockfish team gives wrong information about elo.
They give an elo estimate for every change, and if you add up all the estimates that they publish, the elo improvement should be clearly higher.
Here is the elo estimate for all the improvements if you look only at the Stockfish framework results from Stockfish 10 for December and January (I give only the estimate for the improvement at long time control).
It seems that the Stockfish developers claim more than 20 elo of improvement per month if you ignore their regression tests.
I did not calculate the elo advantage except for December 2018 and January 2019, but I guess the picture for other months may be similar.
1)Remove Overload bonus 1.12.2018 +0.06 elo
2)Penalize refuted killers in continuation history 1.12.2018 +3.06 elo(total improvement +3.12 elo)
3)Introduce concept of double pawn protection. 2.12.2018 +3.95 elo(total improvement +7.07 elo)
4)pseudo_legal() and MOVE_NONE 6.12.2018 Reverted in 7 so I assume 0 elo
5)Simplify time manager in search() 6.12.2018 +0.88 elo at long time control(total improvement +7.95 elo)
6)Simplify Killer Move Penalty 6.12.2018 +0.81 elo at long time control(total improvement 8.76 elo)
7)Revert "pseudo_legal() and MOVE_NONE"
8)simplify opposite_colors 9.12.2018
9)add paren. 9.12.2018
10)remove parenthesis. 9.12.2018
11)remove extra line 9.12.2018.
No results for 8-11 and I assume no functional change and no elo change for them
12)Tweak CMH pruning 9.12.2018 +2.22 elo at long time control(total improvement 10.98 elo)
13)Changes identified in RENAME/REFORMATTING thread (#1861) 11.12.2018 no functional change 0 elo
14)Asymmetrical 8x8 Pawn PSQT 13.12.2018 passed long time control but no estimate, so I estimate 0 elo
15)A combo of parameter tweaks 13.12.2018 +2.31 elo(total improvement 13.29 elo)
16)Remove Null Move Pruning material threshold 16.12.2018 +1.03 elo(total improvement 14.32 elo)
17)Start a TT resize only after search finished 16.12.2018 no functional change 0 elo
18)Fix a segfault. 16.12.2018
19)Refactor king ring calculation no testing at long time control so I assume 0 elo
20)Use stronglyProtected 16.12.2018 +0.89 elo (total improvement 15.21 elo)
21)New voting formula for threads 18.12.2018 no functional change in simple thread mode so I assume 0 elo
22)Tweak main killer penalty 18.12.2018 +2.59 elo(total improvement 17.8 elo)
23)Simplify KBNK endgame implementation 20.12.2018
24)Simplify generate_castling (#1885) 23.12.2018
25)Turn on random access for Syzygy files in Windows (#1840) 23.12.2018
26)Use a bit less code to calculate hashfull() (#1830) 23.12.2018
27)Update our continuous integration machinery (#1889) 23.12.2018
28)Improve endgame KBN vs K (#1877) 24.12.2018 claims no functional change but that is wrong; assume 0 elo
29)Extend stack to ss-5, and remove conditions 24.12.2018 -0.17 elo total improvement 17.63 elo
30)Fix crash in best_group() (#1891) 24.12.2018
31)Simplify SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX loop (#1892) 24.12.2018
32)Always initialize and evaluate king safety 27.12.2018 +3.59 elo(total improvement 21.22 elo)
33)Improve the Readme 29.12.2018
total improvement in december 21.22 elo
34)Remove as useless micro-optimization in pawns generation (#1915) 1.1.2019(not tested at LTC assume 0 elo)
35)Remove "Any" predicate filter (#1914) 1.1.2019
36)Remove openFiles in pawns. (#1917) 1.1.2019(no testing at LTC so assume 0 elo)
37)Assorted trivial cleanups (#1894) 1.1.2019
38)Delay castling legality check 4.1.2019 +5.03 elo (5.03 total improvement in january)
39)Check tablebase files 4.1.2019
40)Introduce Multi-Cut 6.1.2019 +3.20 elo(8.23 elo total)
41)Flag critical search tree in hash table 9.1.2019 3.21 elo(11.44 total)
42)Small improvements to the CI infrastructure 9.1.2019
43)Minor cleanup to recent 'Flag critical search tree in hash table' patch 10.1.2019
44)Remove pvExact 10.1.2019 +1.72 elo(13.16 total)
45)Simplify time management a bit 14.1.2019 +2.51 elo(15.67 elo total)
46)Simplify pawn moves (#1900) 14.1.2019 no testing at long time control
47)Remove AdjacentFiles 17.1.2019 no testing at long time control
48)Tweak initiative and Pawn PSQT (#1957) 20.1.2019 +1.74 elo(17.41 elo total)
49)Clean-up some shifting in space calculation (#1955) 20.1.2019
50)Simplify pvHit (#1953) 20.1.2019
51)Simplify pondering time management (#1899) no testing with ponder so not relevant
52)Simplify TrappedRook 22.1.2019 +0.44 elo(17.85 total)
53)Use int8_t instead of int for SquareDistance[] 29.1.2019 pure speed up so no testing at LTC and I assume 0 elo
54)Change pinning logic in Static Exchange Evaluation (SEE) 29.1.2019 +1.46 elo(19.31 total)
55)Don't update pvHit after IID 29.1.2019 +0.38 elo(19.69 total)
56)Simplify Stat Score bonus 31.1.2019 +1.34 elo(21.03 total)
-
- Posts: 2273
- Joined: Mon Sep 29, 2008 1:50 am
Re: Stockfish patches
Uri wrote: They give elo estimate for every change and if you add all the estimates that they publish the elo improvement should be clearly higher.
You have a basic misunderstanding of statistics. The elo estimate calculated by fishtest is (median) unbiased over all patches, both failing and passing. In other words, it corrects the inherent SPRT bias.
However if you select only the passed test then you have a different kind of bias which is called "selection bias". The elo estimate does not correct for that and it is not at all obvious how to do so. It would be possible if an elo prior were available.
Note: some time ago an elo prior for fishtest was determined as a normal distribution with mu=-1.013 and sigma=1.101 (logistic elo units). This prior was obtained by minimizing the difference between the empirical distribution of the elo estimates (about 6000 patches) and the theoretical one calculated from the prior. See respectively the histogram and the continuous line in the following graph.
As you see the match is visually rather good. No effort has been made however to see how this elo prior evolves over time.
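The selection-bias effect described here can be illustrated with a toy Monte Carlo that uses exactly the quoted prior (mu = -1.013, sigma = 1.101). This is not fishtest's actual SPRT machinery, just a fixed-length test with a crude pass rule; the 20000-game length, the pass threshold, and all function names are my own assumptions for the sketch.

```python
import math
import random

random.seed(42)
MU, SIGMA = -1.013, 1.101   # the elo prior quoted above (logistic elo units)
N_GAMES = 20000             # toy fixed-length test; real fishtest uses SPRT

def simulate_patch():
    """Draw a patch's true elo from the prior, then a noisy measured elo."""
    true_elo = random.gauss(MU, SIGMA)
    p = 1 / (1 + 10 ** (-true_elo / 400))          # expected score per game
    # normal approximation to the average score over N_GAMES games
    s = random.gauss(p, math.sqrt(p * (1 - p) / N_GAMES))
    obs_elo = -400 * math.log10(1 / s - 1)
    return true_elo, obs_elo

patches = [simulate_patch() for _ in range(5000)]
passed = [(t, o) for t, o in patches if o > 0.5]   # crude "pass" rule
mean_true = sum(t for t, _ in passed) / len(passed)
mean_obs = sum(o for _, o in passed) / len(passed)
print(f"{len(passed)} of 5000 passed; mean true elo {mean_true:+.2f}, "
      f"mean measured elo {mean_obs:+.2f}")
```

Even though each individual measurement is unbiased, conditioning on passing makes the measured elo of the survivors systematically higher than their true elo, which is exactly the selection bias described above: summing the published estimates of passed patches overstates the total gain.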
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
-
- Posts: 10413
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Stockfish patches
Michel wrote: ↑Sat Jul 13, 2019 11:12 am You have a basic misunderstanding of statistics. The elo estimate calculated by fishtest is (median) unbiased over all patches (both failing and passing). In other words it corrects the inherent SPRT bias. [...]
I understand that the estimate for the passed patches is biased and wrong, and that is exactly the problem: they should not publish a biased estimate.
If they want to give a good estimate, then the only good way is simply to play a fixed number of games after deciding to accept the patch.
Using an elo prior distribution for the patches relies on assumptions that we do not know are correct.
Playing 40000 games between every 2 consecutive versions, even now, would give an unbiased estimate with a possible error of 2 elo.
People may complain about the hardware time, but I think it is better for Stockfish to spend hardware on a truly unbiased estimate of the value of patches, because that knowledge may later help us understand which patches really made Stockfish better.
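As a rough sanity check on the 40000-games figure, one can invert the error-bar formula and ask how many games a target margin needs. The helper below is my own sketch, assuming a near-even match and a typical ~70% draw rate (roughly what the regression tests above show); the function name and defaults are mine.

```python
import math

def games_for_margin(target_err, draw_rate=0.70, z=1.96):
    """Games needed for the 95% elo error bar to shrink to target_err,
    for a near-even match with the given draw rate (rough model)."""
    var = (1 - draw_rate) * 0.25                 # per-game score variance at s = 0.5
    elo_per_score = 400 / (math.log(10) * 0.25)  # d(elo)/d(score) at s = 0.5
    se_target = target_err / (z * elo_per_score)
    return math.ceil(var / se_target ** 2)

print(games_for_margin(2.0))
```

With these assumptions a +-2 elo bar needs on the order of 35000 games, so 40000 games per accepted patch would indeed pin each version down to about 2 elo, consistent with the +-1.8 bars on the 40000-game regression tests quoted earlier in the thread.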
-
- Posts: 2273
- Joined: Mon Sep 29, 2008 1:50 am
Re: Stockfish patches
Uri wrote: Playing 40000 games for every 2 consecutive versions even now will give an unbiased estimate with a possible mistake of 2 elo.
Getting an accurate elo assessment of patches (or doing any kind of research at all) was never a goal of fishtest. The idea is that the considerable resources needed for such an assessment should instead be used to test more patches, or to run SPRT tests with narrower bounds, which lets smaller-elo patches succeed more easily. This is simply a choice; one cannot argue with it.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.