A match between SF12+NNUE and Leele ver.0.26.2


mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by mwyoung »

OliverBr wrote: Wed Sep 09, 2020 1:01 am
mwyoung wrote: Wed Sep 09, 2020 12:52 am What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
Still, you need a couple of thousand games.
Believe me, I can run them on a 32-core computer, and no, a few hundred games are not enough.
That is just false. It can be done with 10 to 20 games. What matters is the Elo difference between A and B. For example, engine A scores 10 wins in 10 games.


LOS (likelihood of superiority) table:

	10	50	100	200	500	1000
0	50.00%	50.00%	50.00%	50.00%	50.00%	50.00%
1	64.51%	56.77%	54.81%	53.41%	52.16%	51.53%
2	77.22%	63.34%	59.55%	56.80%	54.32%	53.06%
3	86.95%	69.55%	64.16%	60.14%	56.46%	54.58%
4	93.43%	75.24%	68.57%	63.40%	58.58%	56.09%
5	97.14%	80.32%	72.73%	66.57%	60.68%	57.60%
6	98.95%	84.71%	76.60%	69.63%	62.75%	59.10%
7	99.68%	88.40%	80.14%	72.55%	64.78%	60.58%
8	99.93%	91.41%	83.34%	75.33%	66.77%	62.05%
9	99.99%	93.80%	86.19%	77.96%	68.72%	63.50%
10	100.00%	95.64%	88.69%	80.41%	70.61%	64.93%
11		97.02%	90.85%	82.69%	72.45%	66.34%
12		98.01%	92.68%	84.80%	74.23%	67.73%
13		98.71%	94.23%	86.72%	75.95%	69.09%
14		99.19%	95.50%	88.48%	77.60%	70.43%
15		99.50%	96.54%	90.06%	79.19%	71.74%
16		99.71%	97.37%	91.48%	80.71%	73.02%
17		99.83%	98.03%	92.74%	82.16%	74.27%
18		99.91%	98.55%	93.85%	83.54%	75.49%
19		99.95%	98.94%	94.82%	84.85%	76.68%
20		99.97%	99.24%	95.67%	86.08%	77.84%
21		99.99%	99.46%	96.41%	87.25%	78.96%
22		99.99%	99.62%	97.03%	88.35%	80.05%
23		100.00%	99.74%	97.57%	89.38%	81.11%
24			99.82%	98.02%	90.34%	82.12%
25			99.88%	98.40%	91.23%	83.11%
26			99.92%	98.71%	92.07%	84.06%
27			99.95%	98.97%	92.84%	84.97%
28			99.97%	99.18%	93.55%	85.85%
29			99.98%	99.36%	94.21%	86.69%
30			99.99%	99.50%	94.81%	87.50%
31			99.99%	99.61%	95.36%	88.27%
32			100.00%	99.70%	95.86%	89.01%
33				99.77%	96.32%	89.71%
34				99.82%	96.74%	90.38%
35				99.87%	97.11%	91.02%
36				99.90%	97.45%	91.62%
37				99.93%	97.76%	92.20%
38				99.95%	98.03%	92.74%
39				99.96%	98.28%	93.26%
40				99.97%	98.50%	93.74%
41				99.98%	98.69%	94.20%
42				99.98%	98.86%	94.63%
43				99.99%	99.02%	95.04%
44				99.99%	99.15%	95.42%
45				99.99%	99.27%	95.78%
46				100.00%	99.37%	96.11%
47					99.46%	96.42%
48					99.54%	96.72%
49					99.61%	96.99%
50					99.67%	97.24%
51					99.72%	97.47%
52					99.76%	97.69%
53					99.80%	97.89%
54					99.83%	98.08%
55					99.86%	98.25%
56					99.88%	98.41%
57					99.90%	98.56%
58					99.92%	98.69%
59					99.93%	98.82%
60					99.94%	98.93%
61					99.95%	99.03%
62					99.96%	99.13%
63					99.97%	99.22%
64					99.97%	99.29%
65					99.98%	99.37%
66					99.98%	99.43%
67					99.99%	99.49%
68					99.99%	99.54%
69					99.99%	99.59%
70					99.99%	99.64%
71					99.99%	99.68%
72					100.00%	99.71%
73						99.74%
74						99.77%
75						99.80%
76						99.82%
77						99.84%
78						99.86%
79						99.88%
80						99.89%
81						99.91%
82						99.92%
83						99.93%
84						99.94%
85						99.94%
86						99.95%
87						99.96%
88						99.96%
89						99.97%
90						99.97%
91						99.98%
92						99.98%
93						99.98%
94						99.98%
95						99.99%
96						99.99%
97						99.99%
98						99.99%
99						99.99%
100						99.99%
101						99.99%
102						100.00%
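
For reference, LOS is often computed with a normal approximation that looks only at wins and losses and ignores draws. The sketch below assumes that formula; the function name `los` is illustrative, and since the table above does not state how it was generated, the numbers need not match it exactly.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority under a normal approximation.

    Assumption: only decisive games carry information, so draws are ignored.
    """
    decisive = wins + losses
    if decisive == 0:
        # No decisive games: no evidence either way.
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


if __name__ == "__main__":
    # Ten wins in ten games versus a narrow +9 -6 lead.
    print(f"10-0: {los(10, 0):.4f}")
    print(f" 9-6: {los(9, 6):.4f}")
```

Under this approximation a clean 10-0 sweep already gives an LOS above 99.9%, while a narrow lead such as 9 wins to 6 losses stays well below the usual 95% bar.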
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.
jorose
Posts: 360
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by jorose »

mwyoung wrote: Wed Sep 09, 2020 12:52 am A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
Normally, developers are trying to answer question number 2. Tests on Fishtest often go over 100k games in order to answer that question.
-Jonathan
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by mwyoung »

jorose wrote: Wed Sep 09, 2020 5:22 am
mwyoung wrote: Wed Sep 09, 2020 12:52 am A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
Normally, developers are trying to answer question number 2. Tests on Fishtest often go over 100k games in order to answer that question.
They need to answer both questions because they must have a high degree of confidence: the pass/fail threshold is very high when a pass is decided on a gain of only 1 or 2 Elo.



corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by corres »

OliverBr wrote: Wed Sep 09, 2020 1:01 am
mwyoung wrote: Wed Sep 09, 2020 12:52 am What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
Still, you need a couple of thousand games.
Believe me, I can run them on a 32-core computer, and no, a few hundred games are not enough.
I use all 16 physical cores when I play on my machine.
So I want to know how Stockfish 12 behaves when using all 16 physical cores, and I am not interested in how it behaves using only one thread. So sorry, but I do not agree with you.
Running tests is a rather time- and power-consuming task, so I am happy if other people also run tests.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by corres »

You can download the pgn and Minibook .zip from
http://wikisend.com
File ID 742058
Password nnue
OliverBr
Posts: 725
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by OliverBr »

mwyoung wrote: Wed Sep 09, 2020 1:20 am That is just false. It can be done with 10 to 20 games. What matters is the Elo difference between A and B. For example, engine A scores 10 wins in 10 games.
Sorry, I have to correct this statement, because it is absolutely wrong. I have seen test series where an engine led after more than 1000 games and was still beaten in the end.

For sure, 20 games are not enough. From a statistical standpoint alone, this makes no sense.
Chess Engine OliThink: http://brausch.org/home/chess
OliThink GitHub:https://github.com/olithink
OliverBr
Posts: 725
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by OliverBr »

jorose wrote: Wed Sep 09, 2020 5:22 am
mwyoung wrote: Wed Sep 09, 2020 12:52 am A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
Normally, developers are trying to answer question number 2. Tests on Fishtest often go over 100k games in order to answer that question.
Exactly.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by corres »

The number of games needed depends on how big the Elo difference between the engines is.
The smaller the difference, the more games you need to tell the engines apart.
But if the difference is really small, you can say that (practically) the strength of the engines is (nearly) equal.
From a practical viewpoint it is senseless to run a test match with 100k games, even if it seems more "scientific", because even the temperature of the environment(!) and fluctuations in the power supply influence the Elo difference.
If you run a lot of tests with different engines, you will find in many cases that the final ranking of the participants is nearly settled after some ten rounds. So if you are not curious about the exact Elo numbers of the participants and are satisfied with a rough ranking by strength, you only need a competition of a few hundred or even a few tens of games.
What I wrote above is only the opinion of a chess engine user, not of a hair-splitting "scientific" developer.
I am not interested in a newer version of Stockfish (for example) if the "official test" cannot show at least a 5 Elo improvement, so I do not wait for 100k test games; for me 20k test games are enough.
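
The relationship between Elo gap and number of games can be put in rough numbers. This is a minimal sketch, assuming a normal approximation, a fixed draw ratio, and the standard logistic Elo expectation formula; `games_needed` and its parameters are hypothetical names, and the variance estimate is only reasonable when the engines are fairly close in strength.

```python
import math


def games_needed(elo_diff: float, draw_ratio: float, z: float = 1.96) -> int:
    """Rough number of games before a gap of elo_diff stands out from noise.

    Assumptions: logistic Elo model, per-game score variance of
    (1 - draw_ratio) / 4 (valid near a 50% score), z = 1.96 for about 95%.
    """
    if elo_diff <= 0:
        raise ValueError("elo_diff must be positive")
    if not 0.0 <= draw_ratio < 1.0:
        raise ValueError("draw_ratio must be in [0, 1)")
    expected = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))  # expected score
    margin = expected - 0.5                               # distance from an even score
    sigma = math.sqrt((1.0 - draw_ratio) / 4.0)           # per-game standard deviation
    return math.ceil((z * sigma / margin) ** 2)


if __name__ == "__main__":
    for diff in (100, 20, 5, 1):
        print(f"{diff:>3} Elo gap, 60% draws: about {games_needed(diff, 0.6)} games")
```

With these assumptions a 100 Elo gap needs only a few dozen games, while a 5 Elo gap at a 60% draw ratio already comes out at several thousand.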
yurikvelo
Posts: 710
Joined: Sat Dec 06, 2014 1:53 pm

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by yurikvelo »

+9 -6 =185 says only that the two engines are close in strength (within ±20 Elo).

The next 200 games could be +10 -1 or +1 -10.

A specially crafted microbook (100 positions) can favour any desired contender.
E.g. take 50,000 games between Lc0 and SF12 and filter out the 100 openings where one or the other lost more.
Then make them replay those 200 games again for kibitzers.
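
For a concrete result such as +9 -6 =185, the Elo estimate and a rough error margin can be worked out directly. The sketch below assumes the logistic Elo model and a normal-approximation 95% interval on the score; the function names are hypothetical, and it does not guard against score bounds reaching 0 or 1.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (low, estimate, high) Elo for a match result.

    Assumption: games are independent and the mean score is roughly normal.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score around its mean (win=1, draw=0.5, loss=0).
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    half_width = z * math.sqrt(var / n)
    return (elo_from_score(score - half_width),
            elo_from_score(score),
            elo_from_score(score + half_width))


if __name__ == "__main__":
    low, mid, high = elo_interval(9, 6, 185)
    print(f"Elo estimate {mid:+.1f}, 95% interval {low:+.1f} to {high:+.1f}")
```

For +9 -6 =185 this gives roughly +5 Elo with an interval of about -8 to +18, which is in line with the "within ±20 Elo" reading above.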
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: A match between SF12+NNUE and Leele ver.0.26.2

Post by mwyoung »

OliverBr wrote: Sun Sep 20, 2020 1:13 am
mwyoung wrote: Wed Sep 09, 2020 1:20 am That is just false. It can be done with 10 to 20 games. What matters is the Elo difference between A and B. For example, engine A scores 10 wins in 10 games.
Sorry, I have to correct this statement, because it is absolutely wrong. I have seen test series where an engine led after more than 1000 games and was still beaten in the end.

For sure, 20 games are not enough. From a statistical standpoint alone, this makes no sense.
Then you need to go back to school. This is statistics.

You need to understand what is being said.

The question is: have you ever seen an engine win all games in a 10-game test, then lose a test match of 1000 or 10,000 games to the same engine? No!

LOS stats let you determine who is better.