Opening testing suites efficiency

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Dariusz Orzechowski wrote:
Laskos wrote:Do you have any idea why the shape of the eval is so curious in the 5ply_v1.epd case? For building an unbalanced suite from this, I cut it to the [-1.1,-0.7] and [0.7,1.1] eval intervals. When people start using unbalanced positions, your dataset will be one of the best to optimize for them.
This is by design. I filtered out dead-even positions around 0.00. My goal was to have mostly around 0.3-0.8 out of the book. In the unbalanced set I mentioned above there should be mostly 0.8-1.3 (I'm not sure I remember the upper limit correctly now; it may be a bit higher, like 1.5).
5ply_v1.epd seems very amenable to improvement. Although by itself it doesn't come out above 2moves_v1.epd, after some experiments I managed to improve on it significantly, and the resulting file shows an almost significantly better Normalized ELO than the other large suites. I used your idea of taking the absolute value of the eval (the openings are screwed up anyway), and it looks like this:

[Images: eval distribution plots]

I tried to find a sweet spot in eval with regard to Normalized ELO, but it's not that easy; there are two competing effects on Normalized ELO: higher ELO difference and higher draw rate. Interestingly, some sort of sweet spot appeared in the abs(eval) range [0.50,0.60], or roughly the 40%-60% percentile counts of your 5ply_v1.epd file, around the median. You seem to have chosen the median of this suite thoughtfully. It shows almost significantly higher sensitivity in this range.
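For reference, here is a minimal sketch of the kind of filtering I mean, assuming a hypothetical annotated EPD where each record carries a 'ce' opcode with the eval in centipawns (the actual files may be annotated differently):

Code: Select all

# Keep only positions whose |eval| (in pawns) falls inside a given band.
# Assumes records like "<fen fields>; ce 55;" -- purely illustrative.
def filter_by_abs_eval(in_path, out_path, lo=0.50, hi=0.60):
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fields = [f.strip() for f in line.strip().split(";")]
            evals = [f for f in fields if f.startswith("ce ")]
            if not evals:
                continue
            cp = int(evals[0].split()[1])            # centipawns
            if lo <= abs(cp) / 100.0 <= hi:
                fout.write(line)
                kept += 1
    return kept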

Stockfish self-games 6s vs 3s:

5ply_v1_40_50.epd (12664)
Score of SF2 vs SF1: 2140 - 247 - 1613 [0.737] 4000
ELO difference: 178.67 +/- 8.46
Finished match
Normalized ELO: 0.775 +/- 0.031

3moves_Elo2200.epd (6533)
Score of SF2 vs SF1: 1966 - 211 - 1823 [0.719] 4000
ELO difference: 163.53 +/- 7.90
Finished match
Normalized ELO: 0.740 +/- 0.031

2moves_v1.epd (40455)
Score of SF2 vs SF1: 2094 - 267 - 1639 [0.728] 4000
ELO difference: 171.35 +/- 8.40
Finished match
Normalized ELO: 0.739 +/- 0.031

3moves_Elo2200_experimental.epd (5848)
Score of SF2 vs SF1: 1974 - 219 - 1807 [0.719] 4000
ELO difference: 163.53 +/- 7.94
Finished match
Normalized ELO: 0.736 +/- 0.031

2moves_v1_experimental.epd (9564)
Score of SF2 vs SF1: 2066 - 263 - 1671 [0.725] 4000
ELO difference: 168.73 +/- 8.31
Finished match
Normalized ELO: 0.732 +/- 0.031
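
For anyone wanting to reproduce the Normalized ELO figures above, this is essentially the trinomial computation I use (a sketch of the calculation, not the exact script I ran); the 95% half-width is just 1.96/sqrt(games), which gives the +/- 0.031 quoted for 4000 games:

Code: Select all

import math

def normalized_elo_trinomial(wins, losses, draws):
    """Normalized ELO = (score - 0.5) / per-game standard deviation."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - score ** 2    # E[x^2] - E[x]^2
    return (score - 0.5) / math.sqrt(var), 1.96 / math.sqrt(n)

# 5ply_v1_40_50.epd match above:
print(normalized_elo_trinomial(2140, 247, 1613))    # ~ (0.775, 0.031)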


Also, by playing games, I managed to build a test suite tuned for Stockfish, which has by far the highest sensitivity, ELO and Normalized ELO with Stockfish, but it has only 1560 positions and is not necessarily good for other engines. It plays normal openings.

3moves_Elo2200_Stockfish.epd (1560)
Score of SF2 vs SF1: 2199 - 173 - 1628 [0.753] 4000
ELO difference: 193.87 +/- 8.39
Finished match
Normalized ELO: 0.873 +/- 0.031

I uploaded 5ply_v1_40_50.epd and 3moves_Elo2200_Stockfish.epd here:
http://s000.tinyupload.com/?file_id=664 ... 3458846938
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

MikeB wrote:
brtzsnr wrote:
Laskos wrote: The suite is uploaded here:
http://s000.tinyupload.com/?file_id=687 ... 2789470066
Great! Thank you for this work, very important for engine testing.

Do you think others can distribute this file?
Good stuff Kai, and thanks for sharing. I understand that the reason they chose to use 2moves_v1.epd was to reduce draws and get test results, yea or nay, back faster. My complaint about using that book is that it misses an opportunity to fine-tune the strengths of Stockfish on more commonly used openings.

They do use a book for regression testing that has the more commonly used lines, which I like.

http://www.talkchess.com/forum/viewtopi ... ht=#720389

I'm not expert enough to say which way is better - but my suspicion is that using lines totally skewed from normal chess openings probably increases the chances of getting a patch approved that does not add elo or, even worse, loses elo, since you are starting game play from what I would call a defective position; defective in the sense that they are using positions the engines would never see in a real game.

An extreme example would be to use opening lines that perhaps look like this:

[d]8/8/8/2p5/1pp5/brpp4/1pprp2P/qnkbK3 w - - 0 1 - obviously one would not - but it's sort of like the old man asking the young woman to go to bed with him for $5,000,000 - and she says "yes". Then he offers her $100 and she says "no, what do you think I am - a prostitute?" and the old man replies, "we already established that, now we're just negotiating price". I believe SF would be even better if they simply used normal-type openings for testing, but there is no way I can prove that, and you cannot knock their success - but using random 2-move openings does little for tuning an engine for opening play.
I believe (and I have had some clear indications) that testing on 2 random moves distorts the opening phase for some 10 opening moves. This might be unimportant for Stockfish provided with a reasonably deep book, but might show up as a deficiency when playing from very short lines or from the standard opening position. Thanks for providing that 8-mover opening set used in Stockfish regression tests (I didn't even know they used a different suite).
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

I posted an update of the document containing the formula for the pentanomial model. The factors of 2 are confusing but I think I got it right.

http://hardy.uhasselt.be/Toga/normalized_elo.pdf
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Dariusz Orzechowski
Posts: 44
Joined: Thu May 02, 2013 5:23 pm

Re: Opening testing suites efficiency

Post by Dariusz Orzechowski »

Laskos wrote:Tested with Komodo in 2000 games each. Not enough games to clearly separate them, but an indication that A is better. Both compare very well to other suites (2moves_v1.epd included); I don't know what you did. I tried without success to significantly optimize 2moves_v1, but everything came out at most at the level of your B. A was pretty much ahead (but again, 2000 games are too few).

Results:

2moves_A.epd
Score of K2 vs K1: 1171 - 120 - 709 [0.763] 2000
ELO difference: 202.87 +/- 12.78
Normalized ELO: 0.865 +/- 0.044

2moves_B.epd
Score of K2 vs K1: 1166 - 134 - 700 [0.758] 2000
ELO difference: 198.34 +/- 12.85
Normalized ELO: 0.833 +/- 0.044
Thanks a lot for the test. Sadly, it seems this is just noise. When I tested with SF, 2moves_B looked much better than 2moves_A. That seems not to hold at all with Komodo. On the other hand, I played only 1000 games. It's quite frustrating.

I've also played 13000 openings (26000 games) from the 2moves_v1.epd book (6+0.06 vs 3+0.03) and split them by pair results (I counted two draws '==' separately from a win plus a loss '1.0'):

Code: Select all

'2.0': 2771, '1.5': 6203, '==': 2689, '1.0': 821, '0.5': 498, '0.0': 18
I'm now wondering whether this kind of statistic can be used to build a better book.

Another result: from my preliminary tests it also seems that for SF the tipping point, where lopsided pair results (here '1.0') become more frequent than drawish ones ('=='), occurs around an eval of 0.75 out of the book.
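
In case anyone wants to reproduce this split, a small sketch of how I bucket pair results, assuming a flat list of per-game scores in which games 2k and 2k+1 share the same opening with colours reversed (names are illustrative):

Code: Select all

from collections import Counter

def pair_buckets(results):
    """results: per-game scores (1.0, 0.5 or 0.0) for one engine, ordered so
    that consecutive games 2k, 2k+1 were played from the same opening."""
    buckets = Counter()
    for a, b in zip(results[0::2], results[1::2]):
        total = a + b
        if total == 1.0:
            # separate two draws ('==') from a win plus a loss ('1.0')
            buckets["==" if a == b else "1.0"] += 1
        else:
            buckets[str(total)] += 1
    return buckets

# e.g. Counter({'1.5': 6203, '2.0': 2771, '==': 2689, '1.0': 821, '0.5': 498, '0.0': 18})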
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

Michel wrote:I posted an update of the document containing the formula for the pentanomial model. The factors of 2 are confusing but I think I got it right.

http://hardy.uhasselt.be/Toga/normalized_elo.pdf
New version with a less clumsy derivation.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Michel wrote:
Michel wrote:I posted an update of the document containing the formula for the pentanomial model. The factors of 2 are confusing but I think I got it right.

http://hardy.uhasselt.be/Toga/normalized_elo.pdf
New version with less clumsy derivation.
Thanks! I checked it for factors of 2 with concrete numbers. Everything seems correct in your latest, neat version. Thanks again.
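
For what it's worth, this is the small concrete check I used. It reflects my reading of the pentanomial definition (game pairs scored 0, 0.25, 0.5, 0.75 or 1, with the per-game deviation taken as sqrt(2) times the per-pair one), so for independent games it collapses to the trinomial formula; the exact derivation is in the PDF:

Code: Select all

import math

def normalized_elo_pentanomial(pair_counts):
    """pair_counts: frequencies of the pair scores [0, 0.25, 0.5, 0.75, 1]
    (i.e. LL, LD, DD or WL, WD, WW from the point of view of one engine)."""
    scores = [0.0, 0.25, 0.5, 0.75, 1.0]
    n = sum(pair_counts)
    mean = sum(s * c for s, c in zip(scores, pair_counts)) / n
    var = sum(s * s * c for s, c in zip(scores, pair_counts)) / n - mean ** 2
    # per-game-equivalent sigma = sqrt(2) * sigma_pair
    return (mean - 0.5) / math.sqrt(2.0 * var)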
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

If we have faith that Normalized ELO is invariant under doubling of the time control in self-games, especially at longer time controls, we can build empirical models.

We take, for small "eps" (it is not that small in the case of doubling, but say we increase the time control by 10% for one opponent in self-games):
(w,d,l)=(a+eps,1-2*a,a-eps)

We look at the dominant term in eps.

The result that accounts both for this expansion and for my past empirical evidence is as follows:

Normalized ELO is proportional to 1 + f(t), with f(t) -> 0 as t -> infinity and f(t) increasing as t -> 0 (assumption and evidence)
ELO is proportional to sqrt(a) * (1+f(t))
WiLo is proportional to 1/sqrt(a) * (1+f(t))

Empirically: d = 1 - 2*c/log(t) => a = c/log(t)


Normalized ELO ~ 1 + f(t)
ELO ~ (1+f(t)) / sqrt(log(t))
WiLo ~ (1+f(t)) * sqrt(log(t))

Take, from the empirical evidence, f(t) = 1/(log(t))**2

Then we have to plot, up to some constants:

Normalized ELO ~ 1 + 1/x**2
ELO ~ (1+1/x**2) / sqrt(x)
WiLo ~ (1+1/x**2) * sqrt(x)

with x ~ log(t+constant)

[Image: Normalized ELO, ELO and WiLo plotted against time control]
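
For reproducibility, a minimal sketch of the three curves in the plot, with c and the additive constant inside the log both set to 1 purely for illustration (the actual plot uses fitted constants):

Code: Select all

import math

def model_curves(t, c=1.0, t0=1.0):
    """Illustrative shapes only, up to overall constants."""
    x = math.log(t + t0)                 # x ~ log(t + constant)
    f = 1.0 / x ** 2                     # f(t) = 1/(log t)^2
    a = c / x                            # from d = 1 - 2*c/log(t)
    normalized_elo = 1.0 + f
    elo  = math.sqrt(a) * (1.0 + f)      # ELO  ~ sqrt(a) * (1 + f)
    wilo = (1.0 + f) / math.sqrt(a)      # WiLo ~ (1 + f) / sqrt(a)
    return normalized_elo, elo, wilo

for t in (3, 6, 12, 24, 48, 96):         # doubling time controls
    print(t, model_curves(t))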

I also played some fairly flimsy Komodo self-play matches at doubled time control, to avoid overfitting on Stockfish. The opening suite was the fairly balanced 3moves_Elo2200.epd, but observe that even in its case the pentanomial gave 5-6% better results than the trinomial.

Code: Select all

6s vs 3s
Score of K2 vs K1: 1054 - 112 - 834  [0.736] 2000
ELO difference: 177.66 +/- 11.75
Win/Loss: 9.41
Normalized ELO trinomial: 0.784 +/- 0.044
Normalized ELO pentanomial: 0.803 +/- 0.044

20s vs 10s
Score of K2 vs K1: 232 - 34 - 334  [0.665] 600
ELO difference: 119.11 +/- 18.04
Win/Loss: 6.82
Normalized ELO trinomial: 0.571 +/- 0.080
Normalized ELO pentanomial: 0.609 +/- 0.080

60s vs 30s
Score of K2 vs K1: 195 - 19 - 386  [0.647] 600
ELO difference: 105.00 +/- 15.81
Win/Loss: 10.26
Normalized ELO trinomial: 0.564 +/- 0.080
Normalized ELO pentanomial: 0.603 +/- 0.080
This seems to confirm the model at the point where WiLo has its minimum and Normalized ELO starts to stabilize under doubling of the time control in self-games. I will soon depart on vacation for a week, and I will leave my home computer running more serious tests of the Normalized ELO stability at longer time controls.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

I updated the normalized elo document

http://hardy.uhasselt.be/Toga/normalized_elo.pdf

with a proof that normalized elo is a measure of the amount of effort it takes to separate two engines (this is section 5).

First of all I had to think about the formal meaning of this statement. So I introduce the notion of a context. A context will typically consist of an opening book and a time control, but it may include other things as well, such as contempt settings. One then defines the relative sensitivity of contexts C, D as the ratio of the normalized elos of two engines X, Y with respect to those contexts. The weak dependency hypothesis then says that the relative sensitivity of C, D does not depend strongly on the engines X, Y used to measure it.

Assuming the weak dependency hypothesis, one then shows (Theorem 5.1.4) that the relative expected duration of two SPRTs, using contexts C and D, that have the same power to separate two engines is inversely proportional to the square of the relative sensitivity of C, D.

Here is a trivial example. There were regression tests of sf9->sf10 using the 2 moves and the 8 moves books. The outcomes were

Code: Select all

W,L,D=9754,3612,26634 # LTC test (sf9->sf10) with 8 moves book.
W,L,D=12041,4583,23376 # LTC test (sf9->sf10) with 2 moves book
A simple computation shows that the relative sensitivity of the 2 moves book versus the 8 moves book is, with 95% confidence, in the interval

Code: Select all

[1.04375629346 1.14932517816]
In other words (assuming the weak dependency hypothesis) the reduction in games achievable by using the 2 moves book, without sacrificing power, would be between 8% and 24%.
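
Concretely, the point estimate behind this interval can be recovered with the trinomial normalized elo as computed earlier in the thread (the confidence interval itself needs the error propagation from the document, which this sketch omits):

Code: Select all

import math

def ne_trinomial(w, l, d):
    n = w + l + d
    s = (w + 0.5 * d) / n
    var = (w + 0.25 * d) / n - s ** 2
    return (s - 0.5) / math.sqrt(var)

ne8 = ne_trinomial(9754, 3612, 26634)   # LTC sf9->sf10, 8 moves book
ne2 = ne_trinomial(12041, 4583, 23376)  # LTC sf9->sf10, 2 moves book
ratio = ne2 / ne8                       # relative sensitivity, about 1.10
print(ratio, 1.0 - 1.0 / ratio ** 2)    # about a 17% reduction in games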

There is one possible caveat, however, in interpreting these results. Since fishtest uses the 2 moves book for testing patches, there may be a form of selection bias going on. Patches that work well with the 2 moves book are more likely to make it into master, possibly inflating normalized elo when measured with the 2 moves book.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
jorose
Posts: 358
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: Opening testing suites efficiency

Post by jorose »

Thanks for reviving this thread, I had missed it =)

The links to the opening suites seem dead. Does anybody know where I could find them now?
-Jonathan
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Michel wrote: Tue Jul 04, 2017 3:58 pm I posted an update of the document containing the formula for the pentanomial model. The factors of 2 are confusing but I think I got it right.

http://hardy.uhasselt.be/Toga/normalized_elo.pdf
Michel, I will have a look; I am on vacation now, on my phone, so it's hard to do anything. And this formal mathematical language is hard for me to decipher into something more pragmatic.