A match between SF12+NNUE and Leele ver.0.26.2

mwyoung · Post by **mwyoung** » Wed Sep 09, 2020 1:20 am

OliverBr wrote: ↑Wed Sep 09, 2020 1:01 am
mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am What are you trying to answer?
1. What is the exact Elo difference between A vs B?
2. Who is better, only between A vs B? This usually takes far fewer games to prove statistically.
Still, you need a couple of thousand games.
Believe me, I can run them on a 32-core computer, and no, a few hundred games are not enough.

That is just false. It can be done with 10 to 20 games. What matters is the Elo difference between A vs B. An example: Engine A scores 10 wins in 10 games.

LOS table.

	10	50	100	200	500	1000
0	50.00%	50.00%	50.00%	50.00%	50.00%	50.00%
1	64.51%	56.77%	54.81%	53.41%	52.16%	51.53%
2	77.22%	63.34%	59.55%	56.80%	54.32%	53.06%
3	86.95%	69.55%	64.16%	60.14%	56.46%	54.58%
4	93.43%	75.24%	68.57%	63.40%	58.58%	56.09%
5	97.14%	80.32%	72.73%	66.57%	60.68%	57.60%
6	98.95%	84.71%	76.60%	69.63%	62.75%	59.10%
7	99.68%	88.40%	80.14%	72.55%	64.78%	60.58%
8	99.93%	91.41%	83.34%	75.33%	66.77%	62.05%
9	99.99%	93.80%	86.19%	77.96%	68.72%	63.50%
10	100.00%	95.64%	88.69%	80.41%	70.61%	64.93%
11		97.02%	90.85%	82.69%	72.45%	66.34%
12		98.01%	92.68%	84.80%	74.23%	67.73%
13		98.71%	94.23%	86.72%	75.95%	69.09%
14		99.19%	95.50%	88.48%	77.60%	70.43%
15		99.50%	96.54%	90.06%	79.19%	71.74%
16		99.71%	97.37%	91.48%	80.71%	73.02%
17		99.83%	98.03%	92.74%	82.16%	74.27%
18		99.91%	98.55%	93.85%	83.54%	75.49%
19		99.95%	98.94%	94.82%	84.85%	76.68%
20		99.97%	99.24%	95.67%	86.08%	77.84%
21		99.99%	99.46%	96.41%	87.25%	78.96%
22		99.99%	99.62%	97.03%	88.35%	80.05%
23		100.00%	99.74%	97.57%	89.38%	81.11%
24			99.82%	98.02%	90.34%	82.12%
25			99.88%	98.40%	91.23%	83.11%
26			99.92%	98.71%	92.07%	84.06%
27			99.95%	98.97%	92.84%	84.97%
28			99.97%	99.18%	93.55%	85.85%
29			99.98%	99.36%	94.21%	86.69%
30			99.99%	99.50%	94.81%	87.50%
31			99.99%	99.61%	95.36%	88.27%
32			100.00%	99.70%	95.86%	89.01%
33				99.77%	96.32%	89.71%
34				99.82%	96.74%	90.38%
35				99.87%	97.11%	91.02%
36				99.90%	97.45%	91.62%
37				99.93%	97.76%	92.20%
38				99.95%	98.03%	92.74%
39				99.96%	98.28%	93.26%
40				99.97%	98.50%	93.74%
41				99.98%	98.69%	94.20%
42				99.98%	98.86%	94.63%
43				99.99%	99.02%	95.04%
44				99.99%	99.15%	95.42%
45				99.99%	99.27%	95.78%
46				100.00%	99.37%	96.11%
47					99.46%	96.42%
48					99.54%	96.72%
49					99.61%	96.99%
50					99.67%	97.24%
51					99.72%	97.47%
52					99.76%	97.69%
53					99.80%	97.89%
54					99.83%	98.08%
55					99.86%	98.25%
56					99.88%	98.41%
57					99.90%	98.56%
58					99.92%	98.69%
59					99.93%	98.82%
60					99.94%	98.93%
61					99.95%	99.03%
62					99.96%	99.13%
63					99.97%	99.22%
64					99.97%	99.29%
65					99.98%	99.37%
66					99.98%	99.43%
67					99.99%	99.49%
68					99.99%	99.54%
69					99.99%	99.59%
70					99.99%	99.64%
71					99.99%	99.68%
72					100.00%	99.71%
73						99.74%
74						99.77%
75						99.80%
76						99.82%
77						99.84%
78						99.86%
79						99.88%
80						99.89%
81						99.91%
82						99.92%
83						99.93%
84						99.94%
85						99.94%
86						99.95%
87						99.96%
88						99.96%
89						99.97%
90						99.97%
91						99.98%
92						99.98%
93						99.98%
94						99.98%
95						99.99%
96						99.99%
97						99.99%
98						99.99%
99						99.99%
100						99.99%
101						99.99%
102						100.00%
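
For readers who want to compute an LOS figure themselves, here is a minimal sketch of one commonly used approximation, LOS = 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))). It ignores draws and assumes a normal approximation. The function name `los` is made up for this illustration, and the table above may have been produced under different assumptions, so its entries need not match this formula exactly.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority under a normal approximation.

    Assumption: draws carry no information about who is stronger,
    so only decisive games enter the formula.
    """
    decisive = wins + losses
    if decisive == 0:
        # No decisive games: no evidence either way.
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


if __name__ == "__main__":
    # Illustrative results only: a 10-0 sweep, a narrow +9 -6 lead, an even split.
    for wins, losses in [(10, 0), (9, 6), (5, 5)]:
        print(f"+{wins} -{losses}: LOS = {los(wins, losses):.2%}")
```

Under this approximation a 10-0 sweep gives an LOS above 99.9%, while a +9 -6 lead gives only about 78%.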

jorose · Post by **jorose** » Wed Sep 09, 2020 5:22 am

mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A vs B?
2. Who is better, only between A vs B? This usually takes far fewer games to prove statistically.

Normally, developers are trying to answer question number 2. Tests on Fishtest often go over 100k games in order to answer that question.

mwyoung · Post by **mwyoung** » Wed Sep 09, 2020 6:49 am

jorose wrote: ↑Wed Sep 09, 2020 5:22 am
mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A vs B?
2. Who is better, only between A vs B? This usually takes far fewer games to prove statistically.
Normally, developers are trying to answer question number 2. Tests on Fishtest often go over 100k games in order to answer that question.

They need to answer both questions, because they must have a high degree of confidence: the pass/fail threshold is very strict when a patch has to be measured at just 1 or 2 Elo to pass.

corres · Post by **corres** » Wed Sep 09, 2020 8:18 am

OliverBr wrote: ↑Wed Sep 09, 2020 1:01 am
mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am What are you trying to answer?
1. What is the exact Elo difference between A vs B?
2. Who is better, only between A vs B? This usually takes far fewer games to prove statistically.
Still, you need a couple of thousand games.
Believe me, I can run them on a 32-core computer, and no, a few hundred games are not enough.

I use all 16 physical cores when I play on my machine.
So I want to know how Stockfish 12 behaves when it uses all 16 physical cores; I am not interested in how it behaves on a single thread. So sorry, but I do not agree with you.
Running tests is a rather time- and power-consuming task, so I am happy if other people also run tests.

corres · Post by **corres** » Fri Sep 11, 2020 9:00 pm

You can download the pgn and Minibook .zip from
http://wikisend.com
File ID 742058
Password nnue

OliverBr · Post by **OliverBr** » Sun Sep 20, 2020 1:13 am

mwyoung wrote: ↑Wed Sep 09, 2020 1:20 am That is just false. It can be done with 10 to 20 games. What matters is the Elo difference between A vs B. An example: Engine A scores 10 wins in 10 games.

Sorry, I have to correct this statement, because it is absolutely wrong. I have seen test series where an engine led after more than 1000 games and was still beaten in the end.

For sure, 20 games are not enough. On statistical grounds alone this makes no sense.

OliverBr · Post by **OliverBr** » Sun Sep 20, 2020 1:14 am

jorose wrote: ↑Wed Sep 09, 2020 5:22 am
mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A vs B?
2. Who is better, only between A vs B? This usually takes far fewer games to prove statistically.
Normally, developers are trying to answer question number 2. Tests on Fishtest often go over 100k games in order to answer that question.

Exactly.

corres · Post by **corres** » Tue Sep 22, 2020 12:58 pm

The number of games needed depends on how big the Elo difference between the engines is.
The smaller the difference, the more games you need to tell the engines apart.
But when the difference is really small, you can say that, practically, the engines are (nearly) equal in strength.
From a practical viewpoint it is pointless to run a test match with 100k games, even if it seems more "scientific", because even the ambient temperature(!) and fluctuations in the power supply influence the Elo difference.
If you run a lot of tests with several different engines, you will often find that after some ten rounds the ranking is already close to the final order of the participants. So if you are not curious about the exact Elo of each participant and are satisfied with a rough ranking by strength, a competition of a few hundred or even a few dozen games is enough.
What I wrote above is only the opinion of a chess engine user, not of a hair-splitting "scientific" developer.
I am not interested in a newer version of Stockfish (for example) if the "official test" cannot show at least a 5 Elo improvement, so I do not wait for 100k test games; for me 20k test games are enough.
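
The point that the number of games needed grows as the Elo difference shrinks can be illustrated with a rough sketch. It assumes a single fixed-length match, the standard logistic Elo model, a normal approximation, and a draw rate you supply yourself; it is not the sequential procedure a testing framework might use. The names `expected_score` and `games_needed` are hypothetical.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger engine under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def games_needed(elo_diff: float, draw_rate: float = 0.0, z: float = 1.96) -> int:
    """Rough number of games until the expected lead is z standard errors wide.

    Assumptions: independent games, fixed draw rate, normal approximation.
    """
    s = expected_score(elo_diff)
    edge = s - 0.5
    if edge == 0.0:
        raise ValueError("equal engines cannot be separated by any number of games")
    if not 0.0 <= draw_rate <= 2.0 * min(s, 1.0 - s):
        raise ValueError("draw rate is incompatible with the expected score")
    # Per-game variance of the score when draws count as half a point.
    variance = s * (1.0 - s) - draw_rate / 4.0
    return math.ceil(z * z * variance / (edge * edge))


if __name__ == "__main__":
    for diff in (200, 100, 50, 20, 5, 2):
        print(f"{diff:>4} Elo: about {games_needed(diff, draw_rate=0.5)} games")
```

The required count scales with the inverse square of the edge over 50%, which is why a 100 Elo gap shows up in a few dozen games while a gap of a few Elo needs many thousands.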

yurikvelo · Post by **yurikvelo** » Tue Sep 22, 2020 6:07 pm

+9 -6 =185 says only that both engines are close in strength (within ±20 Elo).

The next 200 games could be +10 -1 or -1 +10.

A specially crafted microbook (100 positions) can favour any desired contender.
E.g. take 50,000 games between Lc0 and SF12 and pick the 100 openings where one side or the other lost most often.
Then have them replay those 200 games for the kibitzers.
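
The "within ±20 Elo" reading of +9 -6 =185 can be checked with a short sketch. It assumes the logistic Elo model and a normal approximation for the mean score, and the function names are made up for illustration; other rating tools may use different error models.

```python
import math


def score_to_elo(score: float) -> float:
    """Convert a score fraction to an Elo difference (logistic model)."""
    # Clamp to avoid infinities at 0% or 100%.
    score = min(max(score, 1e-9), 1.0 - 1e-9)
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (low, estimate, high) Elo for a match result.

    Assumption: games are independent and the mean score is roughly normal.
    """
    n = wins + losses + draws
    if n == 0:
        raise ValueError("no games played")
    s = (wins + 0.5 * draws) / n
    # Sample variance of the per-game score (1, 0.5 or 0 points).
    variance = (wins * (1.0 - s) ** 2 + losses * s ** 2
                + draws * (0.5 - s) ** 2) / n
    margin = z * math.sqrt(variance / n)
    return score_to_elo(s - margin), score_to_elo(s), score_to_elo(s + margin)


if __name__ == "__main__":
    low, est, high = elo_interval(9, 6, 185)
    print(f"Elo estimate {est:+.1f}, interval [{low:+.1f}, {high:+.1f}]")
```

For +9 -6 =185 this gives roughly +5 Elo with an interval of about -8 to +18, so the result is compatible with either engine being slightly stronger.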

mwyoung · Post by **mwyoung** » Tue Sep 22, 2020 6:29 pm

OliverBr wrote: ↑Sun Sep 20, 2020 1:13 am
mwyoung wrote: ↑Wed Sep 09, 2020 1:20 am That is just false. It can be done with 10 to 20 games. What matters is the Elo difference between A vs B. An example: Engine A scores 10 wins in 10 games.
Sorry, I have to correct this statement, because it is absolutely wrong. I have seen test series where an engine led after more than 1000 games and was still beaten in the end.

For sure, 20 games are not enough. On statistical grounds alone this makes no sense.

Then you need to go back to school. This is statistics.

You need to understand what is being said.

The question is: have you ever seen an engine win every game in a 10-game test and then lose a 1000-game or 10,000-game match to the same engine? No!

LOS statistics let you determine who is better.
