Why is SF6 so much stronger?

clumma · Post by **clumma** » Fri Feb 20, 2015 12:45 am

Quick comparison of improvement rates

Elo diff 4-5, 5-6
ipon 101, 39
ccrl40/40 64, 32
cegt40/20 94, 70

Release dates
https://chessprogramming.wikispaces.com ... se%20Dates

4-5 284 days
5-6 241 days

So I was wrong that the improvement rate since SF5 has been higher than usual. If anything, it was lower:

Elo/day
ipon 0.36 0.16
ccrl 0.23 0.13
cegt 0.33 0.29
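The Elo/day figures above follow directly from the quoted Elo differences and release intervals. A minimal sketch of the arithmetic, using only the numbers in this post (the variable and function names are illustrative):

```python
# Elo gains between versions (SF4 -> SF5, SF5 -> SF6), as quoted in the post.
elo_gain = {
    "ipon": (101, 39),
    "ccrl40/40": (64, 32),
    "cegt40/20": (94, 70),
}
# Days between releases (SF4 -> SF5, SF5 -> SF6), as quoted in the post.
days = (284, 241)

def elo_per_day(gains, intervals):
    """Divide each Elo gain by the matching release interval, rounded to 2 places."""
    return tuple(round(g / d, 2) for g, d in zip(gains, intervals))

rates = {name: elo_per_day(gains, days) for name, gains in elo_gain.items()}
for name, (r45, r56) in rates.items():
    print(f"{name:10s} {r45:.2f} {r56:.2f}")
```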

-Carl

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Fri Feb 20, 2015 2:29 am

clumma wrote: How did they make such a leap in one version? The code is available. Is the cause not understood? I looked here and on other forums, and didn't see discussion on this question.

-Carl

I don't think SF is strong.

It barely makes legal moves.

clumma · Post by **clumma** » Fri Feb 20, 2015 7:56 am

Here is the number of commits in each pull request I identified earlier:

2 Add bonuses for minors attacking enemy pieces
? Tune trapped rook penalty
? Double mg bonus and half eg bonus
1 King-pawn threat bonus for endgames
2 Evaluate king safety when no queen is present
1 Change history reduction in LMR to be a full ply
3 Remove use of half-ply reductions
3 Add bonuses for each threat instead of max threat value
1 Be more optimistic in aspiration window
5 Halve StormDanger bonus for blocked pawn on A/H file
6 Avoid searching TT twice for the same key/position...
5 Big King Safety tuning

So something like 31/236 commits could be responsible for 50% of the improvement between versions 5 and 6.

I'll try to put these 12 pull requests into four categories: search, SMP, new heuristic, tuning:

Add bonuses for minors attacking enemy pieces *new heuristic*
Tune trapped rook penalty *tuning*
Double mg bonus and half eg bonus *tuning*
King-pawn threat bonus for endgames *new heuristic*
Evaluate king safety when no queen is present *tuning*
Change history reduction in LMR to be a full ply *search*
Remove use of half-ply reductions *search*
Add bonuses for each threat instead of max threat value *tuning*
Be more optimistic in aspiration window *search*
Halve StormDanger bonus for blocked pawn on A/H file *new heuristic*
Avoid searching TT twice for the same key/position... *search*
Big King Safety tuning *tuning*

The tally is:

* search 4
* SMP 0
* new heuristic 3
* tuning 5

Only in the tuning bucket is the 'reason' for improvements usually unknown.

-Carl

mcostalba · Post by **mcostalba** » Fri Feb 20, 2015 10:02 am

clumma wrote: The tally is:

* search 4
* SMP 0
* new heuristic 3
* tuning 5

To reliably verify that a patch is stronger than default requires a lot of resources; reliably measuring how much stronger it is requires even more resources.

We, in SF, consciously gave up on knowing the second answer and rely only on the first for development. This was done to optimize the use of resources and time and to reduce the queue time for submitted tests, which is important to keep the "momentum" going.

That's the only fact; any other argument may be interesting, but only as discussion.

clumma · Post by **clumma** » Fri Feb 20, 2015 6:17 pm

mcostalba wrote: To reliably verify that a patch is stronger than default requires a lot of resources; reliably measuring how much stronger it is requires even more resources.

Of course. And the verification dynamically stops when a certain LLR is reached, correct? And the sooner this happens, the bigger the Elo improvement is likely to be?

I saw results like these in the commit logs:

LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 7647 W: 1356 L: 1214 D: 5077

and basically just took the number of wins, assuming that the smaller the number, the bigger the Elo difference of the change. I sometimes also estimated the win/loss ratio by eye. Am I making wrong assumptions here?
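For reference, a win/loss/draw record like the one quoted above can be turned into a naive Elo estimate with a rough error bar. This is only a sketch under stated assumptions: it uses the standard logistic Elo formula and a normal approximation for the 95% interval, it ignores the fact that the test was stopped sequentially (which biases the estimate), and the function name is made up for illustration.

```python
import math

def elo_from_wld(wins, losses, draws):
    """Naive Elo estimate and approximate 95% interval from a W/L/D record."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n  # average points per game

    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)

    def to_elo(s):
        # Logistic Elo model: s = 1 / (1 + 10^(-elo/400)).
        return 400.0 * math.log10(s / (1.0 - s))

    return to_elo(score), to_elo(score - 1.96 * stderr), to_elo(score + 1.96 * stderr)

# The record quoted from the commit log.
elo, low, high = elo_from_wld(1356, 1214, 5077)
print(f"Elo estimate {elo:+.1f}, approx. 95% interval [{low:+.1f}, {high:+.1f}]")
```

The interval comes out several Elo wide on either side of the estimate, so the raw number of wins, or a win/loss ratio judged by eye, says little on its own.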

Second question: Is the cause of the Elo regression with version cd065dd known? (top graph at http://tests.stockfishchess.org/regression ) Or is it just measurement error? The changes in that version look innocuous:

https://github.com/zamar/Stockfish/comp ... ...cd065dd

-Carl

mcostalba · Post by **mcostalba** » Fri Feb 20, 2015 6:30 pm

clumma wrote: Am I making wrong assumptions here?

Yes, your assumption can be greatly misleading. Although it is true that a test of a patch with a big Elo advantage stops earlier (on average) than a test of a patch with a small advantage, the SPRT statistics are complex: given two tests, one of which stops earlier than the other, very little can be said about the absolute Elo value of the two patches. So it is better not to make assumptions.
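The point about stopping times can be illustrated with a small simulation. This is a simplified sketch, not the fishtest implementation: it assumes a fixed draw ratio, treats the bounds [-3, 1] from the quoted log line as plain logistic Elo, and uses Wald's thresholds with alpha = beta = 0.05, which reproduce the (-2.94, 2.94) limits shown in that log. All function names are illustrative.

```python
import math
import random

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's lower and upper stopping thresholds for the log-likelihood ratio."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

def win_loss_probs(elo, draw_ratio):
    """Win/loss probabilities under a logistic Elo model with a fixed draw ratio."""
    score = 1.0 / (1.0 + 10 ** (-elo / 400.0))
    return score - draw_ratio / 2, 1 - score - draw_ratio / 2

def run_sprt(true_elo, elo0, elo1, draw_ratio, rng, max_games=200_000):
    """Play simulated games until the LLR crosses a bound; return (games, verdict)."""
    lower, upper = sprt_bounds()
    w0, l0 = win_loss_probs(elo0, draw_ratio)
    w1, l1 = win_loss_probs(elo1, draw_ratio)
    w, l = win_loss_probs(true_elo, draw_ratio)
    win_step, loss_step = math.log(w1 / w0), math.log(l1 / l0)
    llr = 0.0
    for games in range(1, max_games + 1):
        r = rng.random()
        if r < w:
            llr += win_step
        elif r < w + l:
            llr += loss_step
        # Draws are equally likely under both hypotheses here, so they add 0.
        if llr >= upper:
            return games, "pass"
        if llr <= lower:
            return games, "fail"
    return max_games, "undecided"

for true_elo in (2.0, 6.0):
    lengths = [run_sprt(true_elo, -3.0, 1.0, 0.66, random.Random(seed))[0]
               for seed in range(5)]
    print(f"true Elo {true_elo:+.0f}: games until stop = {lengths}")
```

In this toy model the stronger patch tends to stop sooner, but the run lengths for a given true Elo vary considerably, so a single stopping time is a poor basis for ranking two patches.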

clumma wrote:
Or is it just measurement error?

Yes, it is just measurement error; there is nothing in the patch that could regress. Testing patches is _not_ easy or trivial. It took SF a long process to reach a reliable testing methodology, and even today we occasionally get wrong results: a bad patch that looks good (rarely) or a good patch that looks bad (more often; this too is a conscious trade-off).

clumma · Post by **clumma** » Fri Feb 20, 2015 11:15 pm

Thanks. Back to the drawing board I guess.
