And what rank is your engine, @mwyoung?
A match between SF12+NNUE and Leele ver.0.26.2
-
- Posts: 725
- Joined: Tue Dec 18, 2007 9:38 pm
- Location: Munich, Germany
- Full name: Dr. Oliver Brausch
Re: A match between SF12+NNUE and Leele ver.0.26.2
-
- Posts: 1759
- Joined: Tue Apr 19, 2016 6:08 am
- Location: U.S.A
- Full name: Andrew Grant
Re: A match between SF12+NNUE and Leele ver.0.26.2
You don't need to have an engine to be able to test competently.
Conversely, having an engine does not make you able to test competently.
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra
"Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
-
- Posts: 360
- Joined: Thu Jan 22, 2015 3:21 pm
- Location: Zurich, Switzerland
- Full name: Jonathan Rosenthal
Re: A match between SF12+NNUE and Leele ver.0.26.2
Wow! I knew Ethereal progressed a lot, but a perfect score against SF12 is amazing! I always knew TCEC was a scam!
-Jonathan
-
- Posts: 216
- Joined: Sun Jan 22, 2017 8:30 pm
- Location: Russia
Re: A match between SF12+NNUE and Leele ver.0.26.2
Don't blame TCEC - 12.50 was late for the divP submission deadline. The next season (or Cup 7 already?) may feature 12.51+ with the pawn-king NN and even more progress.
Last edited by Tony P. on Thu Sep 24, 2020 12:45 am, edited 1 time in total.
-
- Posts: 2727
- Joined: Wed May 12, 2010 10:00 pm
Re: A match between SF12+NNUE and Leele ver.0.26.2
Andrew Grant master of the strawman! Thank You.
AndrewGrant wrote: ↑Wed Sep 23, 2020 9:04 pm
Only 8 games, not the highly coveted 10, but it looks like I can say with 100% confidence that Ethereal > Houdini
Thanks ! I never knew testing was so easy
You need more games, consult the chart. This is LOS statistics, not Andrew Grant statistics.
Code:
Margin \ Games       10       50      100      200      500     1000
     0           50.00%   50.00%   50.00%   50.00%   50.00%   50.00%
     1           64.51%   56.77%   54.81%   53.41%   52.16%   51.53%
     2           77.22%   63.34%   59.55%   56.80%   54.32%   53.06%
     3           86.95%   69.55%   64.16%   60.14%   56.46%   54.58%
     4           93.43%   75.24%   68.57%   63.40%   58.58%   56.09%
     5           97.14%   80.32%   72.73%   66.57%   60.68%   57.60%
     6           98.95%   84.71%   76.60%   69.63%   62.75%   59.10%
     7           99.68%   88.40%   80.14%   72.55%   64.78%   60.58%
     8           99.93%   91.41%   83.34%   75.33%   66.77%   62.05%
     9           99.99%   93.80%   86.19%   77.96%   68.72%   63.50%
    10          100.00%   95.64%   88.69%   80.41%   70.61%   64.93%
    11                    97.02%   90.85%   82.69%   72.45%   66.34%
    12                    98.01%   92.68%   84.80%   74.23%   67.73%
    13                    98.71%   94.23%   86.72%   75.95%   69.09%
    14                    99.19%   95.50%   88.48%   77.60%   70.43%
    15                    99.50%   96.54%   90.06%   79.19%   71.74%
    16                    99.71%   97.37%   91.48%   80.71%   73.02%
    17                    99.83%   98.03%   92.74%   82.16%   74.27%
    18                    99.91%   98.55%   93.85%   83.54%   75.49%
    19                    99.95%   98.94%   94.82%   84.85%   76.68%
    20                    99.97%   99.24%   95.67%   86.08%   77.84%
    21                    99.99%   99.46%   96.41%   87.25%   78.96%
    22                    99.99%   99.62%   97.03%   88.35%   80.05%
    23                   100.00%   99.74%   97.57%   89.38%   81.11%
    24                             99.82%   98.02%   90.34%   82.12%
    25                             99.88%   98.40%   91.23%   83.11%
    26                             99.92%   98.71%   92.07%   84.06%
    27                             99.95%   98.97%   92.84%   84.97%
    28                             99.97%   99.18%   93.55%   85.85%
    29                             99.98%   99.36%   94.21%   86.69%
    30                             99.99%   99.50%   94.81%   87.50%
    31                             99.99%   99.61%   95.36%   88.27%
    32                            100.00%   99.70%   95.86%   89.01%
    33                                      99.77%   96.32%   89.71%
    34                                      99.82%   96.74%   90.38%
    35                                      99.87%   97.11%   91.02%
    36                                      99.90%   97.45%   91.62%
    37                                      99.93%   97.76%   92.20%
    38                                      99.95%   98.03%   92.74%
    39                                      99.96%   98.28%   93.26%
    40                                      99.97%   98.50%   93.74%
    41                                      99.98%   98.69%   94.20%
    42                                      99.98%   98.86%   94.63%
    43                                      99.99%   99.02%   95.04%
    44                                      99.99%   99.15%   95.42%
    45                                      99.99%   99.27%   95.78%
    46                                     100.00%   99.37%   96.11%
    47                                               99.46%   96.42%
    48                                               99.54%   96.72%
    49                                               99.61%   96.99%
    50                                               99.67%   97.24%
    51                                               99.72%   97.47%
    52                                               99.76%   97.69%
    53                                               99.80%   97.89%
    54                                               99.83%   98.08%
    55                                               99.86%   98.25%
    56                                               99.88%   98.41%
    57                                               99.90%   98.56%
    58                                               99.92%   98.69%
    59                                               99.93%   98.82%
    60                                               99.94%   98.93%
    61                                               99.95%   99.03%
    62                                               99.96%   99.13%
    63                                               99.97%   99.22%
    64                                               99.97%   99.29%
    65                                               99.98%   99.37%
    66                                               99.98%   99.43%
    67                                               99.99%   99.49%
    68                                               99.99%   99.54%
    69                                               99.99%   99.59%
    70                                               99.99%   99.64%
    71                                               99.99%   99.68%
    72                                              100.00%   99.71%
    73                                                        99.74%
    74                                                        99.77%
    75                                                        99.80%
    76                                                        99.82%
    77                                                        99.84%
    78                                                        99.86%
    79                                                        99.88%
    80                                                        99.89%
    81                                                        99.91%
    82                                                        99.92%
    83                                                        99.93%
    84                                                        99.94%
    85                                                        99.94%
    86                                                        99.95%
    87                                                        99.96%
    88                                                        99.96%
    89                                                        99.97%
    90                                                        99.97%
    91                                                        99.98%
    92                                                        99.98%
    93                                                        99.98%
    94                                                        99.98%
    95                                                        99.99%
    96                                                        99.99%
    97                                                        99.99%
    98                                                        99.99%
    99                                                        99.99%
   100                                                        99.99%
   101                                                        99.99%
   102                                                       100.00%
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.
-
- Posts: 2727
- Joined: Wed May 12, 2010 10:00 pm
Re: A match between SF12+NNUE and Leele ver.0.26.2
This chunk would be a non-starter for LOS. Draws do not count, and a win for engine A and a win for engine B cancel out to a draw. So for this chunk, supposing this is how you would apply the LOS test (and it is not), the 10-game chunk result would be +2 in 10 games, or LOS = 77.22%, a meaningless result that would need more games.
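For anyone following along, here is a minimal sketch of one common way to compute LOS from decisive games only, using the usual normal approximation LOS = 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))). The function name `los` is hypothetical, and this is not necessarily how the chart in this thread was generated (its derivation is not given here), so the numbers need not match the chart exactly.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from decisive games only (draws are ignored)."""
    decisive = wins + losses
    if decisive == 0:
        # No decisive games at all: no evidence either way.
        return 0.5
    # Normal approximation: the win/loss margin scaled by its standard deviation.
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))


# A chunk with 6 wins and 4 losses; any draws do not enter the formula.
print(f"LOS for +6-4: {los(6, 4):.2%}")
# Equal wins and losses carry no evidence of superiority.
print(f"LOS for +5-5: {los(5, 5):.2%}")
```

Note that only wins and losses appear in this formula, which is the sense in which "draws do not count".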
-
- Posts: 2727
- Joined: Wed May 12, 2010 10:00 pm
Re: A match between SF12+NNUE and Leele ver.0.26.2
For the third time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine.
In case you ask again, this will save us both time.
For the 4th time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine.
For the 5th time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine.
For the 6th time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine..............
-
- Posts: 550
- Joined: Tue Nov 19, 2019 8:48 pm
- Full name: Alayan Feh
Re: A match between SF12+NNUE and Leele ver.0.26.2
Oh, really ?
mwyoung wrote: ↑Thu Sep 24, 2020 1:00 am
This chunk would be a non-starter for LOS. Draws do not count, and a win for engine A and a win for engine B cancel out to a draw. So for this chunk, supposing this is how you would apply the LOS test (and it is not), the 10-game chunk result would be +2 in 10 games, or LOS = 77.22%, a meaningless result that would need more games.
Let me remind you of what happened earlier in this thread :
If +6-4 is "a meaningless result" that "would need more games", +9-6 isn't going to be much better.
mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am
A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.
What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
See the issue ? Of course the +6-4=2 doesn't allow to conclude with confidence on superiority, but neither does the +9-6=185 result.
-
- Posts: 2727
- Joined: Wed May 12, 2010 10:00 pm
Re: A match between SF12+NNUE and Leele ver.0.26.2
"See the issue ? Of course the +6-4=2 doesn't allow to conclude with confidence on superiority, but neither does the +9-6=185 result."
Alayan wrote: ↑Thu Sep 24, 2020 4:23 am
Oh, really ?
mwyoung wrote: ↑Thu Sep 24, 2020 1:00 am
This chunk would be a non-starter for LOS. Draws do not count, and a win for engine A and a win for engine B cancel out to a draw. So for this chunk, supposing this is how you would apply the LOS test (and it is not), the 10-game chunk result would be +2 in 10 games, or LOS = 77.22%, a meaningless result that would need more games.
Let me remind you of what happened earlier in this thread :
If +6-4 is "a meaningless result" that "would need more games", +9-6 isn't going to be much better.
mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am
A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.
What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
See the issue ? Of course the +6-4=2 doesn't allow to conclude with confidence on superiority, but neither does the +9-6=185 result.
And I never said it did, and neither did LOS for any of your straw man scores. The LOS for +9-6=185 is 60.14%, a meaningless result. This is your fantasy, or STRAW MAN.
And why is this inconsistent? I never said it did. STRAW MAN! I am working with LOS stats. So do you know how LOS works?
Remember, Elo is statistics, SPRT is statistics, and LOS is statistics. "And statistics is a game of probability, and it cannot be known for certain whether statistical conclusions are correct. Whenever there is uncertainty, there is the possibility of making an error. Considering this nature of statistics science, all statistical hypothesis tests have a probability of making type I and type II errors."
But that probability is low, and statistics is a tool to be used.
Again, I level with the LOS table.
-
- Posts: 2727
- Joined: Wed May 12, 2010 10:00 pm
Re: A match between SF12+NNUE and Leele ver.0.26.2
This test is over, and after the result the answer is definitely NO!
AndrewGrant wrote: ↑Tue Sep 22, 2020 6:45 pm
If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10 game chunk where Ethereal beats Stockfish handily?
Your engine could not win a single game in 200 games, let alone post a winning 100% LOS score by taking chunks out of 100,000 games. STRAWMAN! That is an improper way to use LOS. At 200 games Stockfish only needed to beat you by +46, and it won by +92, double that margin.
Code:
Result:
-------------------------------------------------------------------------------------
#  name                      games  wins  draws  losses  score   los%   elo+/-
1. Stockfish 210920            200    92    108       0  146.0  100.0    172.8
2. Ethereal 12.50 (POPCNT)     200     0    108      92   54.0    0.0   -172.8
Cross table:
-------------------------------------------------------------------------------------
# name score games 1 2
1. Stockfish 210920 146.0 200 x ===1==11=11===1=11=11=======1====1==1=1==11==11=1==11==1=111=1111111=1==1==1==11=1=======111=1=111=1==1111=1=111==111111111=========1=111==1=1=1==111=111=1===1=1====1=1====111=====1=1==1=====11=111===
2. Ethereal 12.50 (POPCNT) 54.0 200 ===0==00=00===0=00=00=======0====0==0=0==00==00=0==00==0=000=0000000=0==0==0==00=0=======000=0=000=0==0000=0=000==000000000=========0=000==0=0=0==000=000=0===0=0====0=0====000=====0=0==0=====00=000=== x
Tech:
-------------------------------------------------------------------------------------
Tech (average nodes, depths, time/m per move, others per game), counted for computing moves only, ignored moves with zero nodes:
#   name                     nodes/m       NPS  depth/m  time/m  moves   time
1.  Stockfish 210920         140752K  29802503     36.5     4.7   56.3  265.8
2.  Ethereal 12.50 (POPCNT)  191699K  38994410     28.5     4.9   56.5  277.9
all ---                      162382K  34500376     32.5     4.8   56.4  271.9
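The elo+/- column in the result table above can be reproduced from the win/draw/loss counts with the usual logistic Elo model. This is a minimal sketch, assuming the tool uses the standard formula elo = 400 * log10(score / (1 - score)); the helper names `score_fraction` and `elo_difference` are hypothetical.

```python
import math


def score_fraction(wins: int, draws: int, losses: int) -> float:
    """Fraction of points scored: a win counts 1, a draw counts 0.5."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games


def elo_difference(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# Stockfish's line from the table: 92 wins, 108 draws, 0 losses in 200 games.
stockfish_score = score_fraction(92, 108, 0)   # 146 / 200 = 0.73
print(round(elo_difference(stockfish_score), 1))        # 172.8
# The same match seen from Ethereal's side gives the mirrored value.
print(round(elo_difference(score_fraction(0, 108, 92)), 1))  # -172.8
```

With 92 wins, 108 draws and 0 losses the score is 146/200 = 0.73, which gives about +172.8, in line with the table.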
Last edited by mwyoung on Thu Sep 24, 2020 6:10 am, edited 3 times in total.