A match between SF12+NNUE and Leele ver.0.26.2

OliverBr · Post by **OliverBr** » Wed Sep 23, 2020 11:34 pm

mwyoung wrote: ↑Wed Sep 23, 2020 11:12 pm That is because you are not very bright.

And what rank is your engine, @mwyoung?

AndrewGrant · Post by **AndrewGrant** » Wed Sep 23, 2020 11:39 pm

OliverBr wrote: ↑Wed Sep 23, 2020 11:34 pm
mwyoung wrote: ↑Wed Sep 23, 2020 11:12 pm That is because you are not very bright.
And what rank is your engine, @mwyoung?

You don't need to have an engine to be able to test competently.

Conversely, having an engine does not make you able to test competently.

jorose · Post by **jorose** » Wed Sep 23, 2020 11:45 pm

AndrewGrant wrote: ↑Wed Sep 23, 2020 9:07 pm
Alayan wrote: ↑Tue Sep 22, 2020 8:13 pm Ethereal 12.50 vs Stockfish 12 CCRL FRC testing has this chunk of games right at the very end of the 300 games they played :
[redacted for brevity]1 1 1
[redacted for brevity]

Overall results .....
[redacted for brevity]

Wow! I knew Ethereal progressed a lot, but a perfect score against SF12 is amazing! I always knew TCEC was a scam!

Tony P. · Post by **Tony P.** » Thu Sep 24, 2020 12:38 am

jorose wrote: ↑Wed Sep 23, 2020 11:45 pm I always knew TCEC was a scam!

Don't blame TCEC - 12.50 was late for the divP submission deadline. The next season (or Cup 7 already?) may feature 12.51+ with the pawn-king NN and even more progress.

mwyoung · Post by **mwyoung** » Thu Sep 24, 2020 12:44 am

AndrewGrant wrote: ↑Wed Sep 23, 2020 9:04 pm Only 8 games, not the highly coveted 10, but it looks like I can say with 100% confidence that Ethereal > Houdini

Thanks ! I never knew testing was so easy

Andrew Grant, master of the strawman! Thank you.

You need more games; consult the chart. This is LOS statistics, not Andrew Grant statistics.

Code:

Margin	10	50	100	200	500	1000
0	50.00%	50.00%	50.00%	50.00%	50.00%	50.00%
1	64.51%	56.77%	54.81%	53.41%	52.16%	51.53%
2	77.22%	63.34%	59.55%	56.80%	54.32%	53.06%
3	86.95%	69.55%	64.16%	60.14%	56.46%	54.58%
4	93.43%	75.24%	68.57%	63.40%	58.58%	56.09%
5	97.14%	80.32%	72.73%	66.57%	60.68%	57.60%
6	98.95%	84.71%	76.60%	69.63%	62.75%	59.10%
7	99.68%	88.40%	80.14%	72.55%	64.78%	60.58%
8	99.93%	91.41%	83.34%	75.33%	66.77%	62.05%
9	99.99%	93.80%	86.19%	77.96%	68.72%	63.50%
10	100.00%	95.64%	88.69%	80.41%	70.61%	64.93%
11		97.02%	90.85%	82.69%	72.45%	66.34%
12		98.01%	92.68%	84.80%	74.23%	67.73%
13		98.71%	94.23%	86.72%	75.95%	69.09%
14		99.19%	95.50%	88.48%	77.60%	70.43%
15		99.50%	96.54%	90.06%	79.19%	71.74%
16		99.71%	97.37%	91.48%	80.71%	73.02%
17		99.83%	98.03%	92.74%	82.16%	74.27%
18		99.91%	98.55%	93.85%	83.54%	75.49%
19		99.95%	98.94%	94.82%	84.85%	76.68%
20		99.97%	99.24%	95.67%	86.08%	77.84%
21		99.99%	99.46%	96.41%	87.25%	78.96%
22		99.99%	99.62%	97.03%	88.35%	80.05%
23		100.00%	99.74%	97.57%	89.38%	81.11%
24			99.82%	98.02%	90.34%	82.12%
25			99.88%	98.40%	91.23%	83.11%
26			99.92%	98.71%	92.07%	84.06%
27			99.95%	98.97%	92.84%	84.97%
28			99.97%	99.18%	93.55%	85.85%
29			99.98%	99.36%	94.21%	86.69%
30			99.99%	99.50%	94.81%	87.50%
31			99.99%	99.61%	95.36%	88.27%
32			100.00%	99.70%	95.86%	89.01%
33				99.77%	96.32%	89.71%
34				99.82%	96.74%	90.38%
35				99.87%	97.11%	91.02%
36				99.90%	97.45%	91.62%
37				99.93%	97.76%	92.20%
38				99.95%	98.03%	92.74%
39				99.96%	98.28%	93.26%
40				99.97%	98.50%	93.74%
41				99.98%	98.69%	94.20%
42				99.98%	98.86%	94.63%
43				99.99%	99.02%	95.04%
44				99.99%	99.15%	95.42%
45				99.99%	99.27%	95.78%
46				100.00%	99.37%	96.11%
47					99.46%	96.42%
48					99.54%	96.72%
49					99.61%	96.99%
50					99.67%	97.24%
51					99.72%	97.47%
52					99.76%	97.69%
53					99.80%	97.89%
54					99.83%	98.08%
55					99.86%	98.25%
56					99.88%	98.41%
57					99.90%	98.56%
58					99.92%	98.69%
59					99.93%	98.82%
60					99.94%	98.93%
61					99.95%	99.03%
62					99.96%	99.13%
63					99.97%	99.22%
64					99.97%	99.29%
65					99.98%	99.37%
66					99.98%	99.43%
67					99.99%	99.49%
68					99.99%	99.54%
69					99.99%	99.59%
70					99.99%	99.64%
71					99.99%	99.68%
72					100.00%	99.71%
73						99.74%
74						99.77%
75						99.80%
76						99.82%
77						99.84%
78						99.86%
79						99.88%
80						99.89%
81						99.91%
82						99.92%
83						99.93%
84						99.94%
85						99.94%
86						99.95%
87						99.96%
88						99.96%
89						99.97%
90						99.97%
91						99.98%
92						99.98%
93						99.98%
94						99.98%
95						99.99%
96						99.99%
97						99.99%
98						99.99%
99						99.99%
100						99.99%
101						99.99%
102						100.00%

mwyoung · Post by **mwyoung** » Thu Sep 24, 2020 1:00 am

Alayan wrote: ↑Tue Sep 22, 2020 8:13 pm Ethereal 12.50 vs Stockfish 12 CCRL FRC testing has this chunk of games right at the very end of the 300 games they played :
1 1 0 = 0 0 = 1 0 1 1 1
+6-4=2 for Ethereal. +4-1 in the last 5 games.

This chunk would be a non-starter for LOS. Draws do not count, and wins for engine A and engine B cancel each other out. So for this chunk, supposing this is how you would apply an LOS test (and it is not), the result would be +2 in 10 decisive games, or LOS = 77.22%: a meaningless result that would need more games.

Code:

Margin	10	50	100	200	500	1000
0	50.00%	50.00%	50.00%	50.00%	50.00%	50.00%
1	64.51%	56.77%	54.81%	53.41%	52.16%	51.53%
2	77.22%	63.34%	59.55%	56.80%	54.32%	53.06%
3	86.95%	69.55%	64.16%	60.14%	56.46%	54.58%
4	93.43%	75.24%	68.57%	63.40%	58.58%	56.09%
5	97.14%	80.32%	72.73%	66.57%	60.68%	57.60%
6	98.95%	84.71%	76.60%	69.63%	62.75%	59.10%
7	99.68%	88.40%	80.14%	72.55%	64.78%	60.58%
8	99.93%	91.41%	83.34%	75.33%	66.77%	62.05%
9	99.99%	93.80%	86.19%	77.96%	68.72%	63.50%
10	100.00%	95.64%	88.69%	80.41%	70.61%	64.93%
11		97.02%	90.85%	82.69%	72.45%	66.34%
12		98.01%	92.68%	84.80%	74.23%	67.73%
13		98.71%	94.23%	86.72%	75.95%	69.09%
14		99.19%	95.50%	88.48%	77.60%	70.43%
15		99.50%	96.54%	90.06%	79.19%	71.74%
16		99.71%	97.37%	91.48%	80.71%	73.02%
17		99.83%	98.03%	92.74%	82.16%	74.27%
18		99.91%	98.55%	93.85%	83.54%	75.49%
19		99.95%	98.94%	94.82%	84.85%	76.68%
20		99.97%	99.24%	95.67%	86.08%	77.84%
21		99.99%	99.46%	96.41%	87.25%	78.96%
22		99.99%	99.62%	97.03%	88.35%	80.05%
23		100.00%	99.74%	97.57%	89.38%	81.11%
24			99.82%	98.02%	90.34%	82.12%
25			99.88%	98.40%	91.23%	83.11%
26			99.92%	98.71%	92.07%	84.06%
27			99.95%	98.97%	92.84%	84.97%
28			99.97%	99.18%	93.55%	85.85%
29			99.98%	99.36%	94.21%	86.69%
30			99.99%	99.50%	94.81%	87.50%
31			99.99%	99.61%	95.36%	88.27%
32			100.00%	99.70%	95.86%	89.01%
33				99.77%	96.32%	89.71%
34				99.82%	96.74%	90.38%
35				99.87%	97.11%	91.02%
36				99.90%	97.45%	91.62%
37				99.93%	97.76%	92.20%
38				99.95%	98.03%	92.74%
39				99.96%	98.28%	93.26%
40				99.97%	98.50%	93.74%
41				99.98%	98.69%	94.20%
42				99.98%	98.86%	94.63%
43				99.99%	99.02%	95.04%
44				99.99%	99.15%	95.42%
45				99.99%	99.27%	95.78%
46				100.00%	99.37%	96.11%
47					99.46%	96.42%
48					99.54%	96.72%
49					99.61%	96.99%
50					99.67%	97.24%
51					99.72%	97.47%
52					99.76%	97.69%
53					99.80%	97.89%
54					99.83%	98.08%
55					99.86%	98.25%
56					99.88%	98.41%
57					99.90%	98.56%
58					99.92%	98.69%
59					99.93%	98.82%
60					99.94%	98.93%
61					99.95%	99.03%
62					99.96%	99.13%
63					99.97%	99.22%
64					99.97%	99.29%
65					99.98%	99.37%
66					99.98%	99.43%
67					99.99%	99.49%
68					99.99%	99.54%
69					99.99%	99.59%
70					99.99%	99.64%
71					99.99%	99.68%
72					100.00%	99.71%
73						99.74%
74						99.77%
75						99.80%
76						99.82%
77						99.84%
78						99.86%
79						99.88%
80						99.89%
81						99.91%
82						99.92%
83						99.93%
84						99.94%
85						99.94%
86						99.95%
87						99.96%
88						99.96%
89						99.97%
90						99.97%
91						99.98%
92						99.98%
93						99.98%
94						99.98%
95						99.99%
96						99.99%
97						99.99%
98						99.99%
99						99.99%
100						99.99%
101						99.99%
102						100.00%

mwyoung · Post by **mwyoung** » Thu Sep 24, 2020 1:08 am

OliverBr wrote: ↑Wed Sep 23, 2020 11:34 pm
mwyoung wrote: ↑Wed Sep 23, 2020 11:12 pm That is because you are not very bright.
And what rank is your engine, @mwyoung?

For the third time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine.

In case you ask again, this will save us both time.

For the 4th time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine.

For the 5th time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine.

For the 6th time: I am an engine tester. I do not have an engine. So say my engine has a rating of 0. And that would rank it very close to your engine..............

Alayan · Post by **Alayan** » Thu Sep 24, 2020 4:23 am

mwyoung wrote: ↑Thu Sep 24, 2020 1:00 am
Alayan wrote: ↑Tue Sep 22, 2020 8:13 pm Ethereal 12.50 vs Stockfish 12 CCRL FRC testing has this chunk of games right at the very end of the 300 games they played :
1 1 0 = 0 0 = 1 0 1 1 1
+6-4=2 for Ethereal. +4-1 in the last 5 games.
This chunk would be a non-starter for LOS. Draws do not count, and wins for engine A and engine B cancel each other out. So for this chunk, supposing this is how you would apply an LOS test (and it is not), the result would be +2 in 10 decisive games, or LOS = 77.22%: a meaningless result that would need more games.

Oh, really ?

Let me remind you of what happened earlier in this thread :

mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am
corres wrote: ↑Wed Sep 09, 2020 12:36 am
OliverBr wrote: ↑Wed Sep 09, 2020 12:29 am
corres wrote: ↑Tue Sep 08, 2020 10:06 pm Result is
SF12+NNUE : kTRIAD = 9 : 6 (185 draw) 200 games
This is no result at all, you need at least 5000 games in order to announce a result.
[snip]
A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.

If +6-4 is "a meaningless result" that "would need more games", +9-6 isn't going to be much better.

See the issue? Of course the +6-4=2 result doesn't allow one to conclude superiority with confidence, but neither does the +9-6=185 result.
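For reference, the two results being compared here, +6-4=2 and +9-6=185, can be run through the LOS formula commonly used in engine testing, LOS = 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))), which ignores draws. This is only a minimal sketch: the function name `los` is made up for illustration, and the thread does not state how the table posted above was generated, so its figures need not match what this normal approximation gives.

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority under the usual normal approximation.

    Draws are ignored: only decisive games carry information about
    which engine is stronger.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# The two results discussed in the thread (draw counts play no role here).
print(f"+6-4=2   -> LOS {los(6, 4):.2%}")
print(f"+9-6=185 -> LOS {los(9, 6):.2%}")
```

Under this approximation the number of draws has no effect on either figure, and both results stay well below a 95% threshold.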

mwyoung · Post by **mwyoung** » Thu Sep 24, 2020 4:45 am

Alayan wrote: ↑Thu Sep 24, 2020 4:23 am
mwyoung wrote: ↑Thu Sep 24, 2020 1:00 am
Alayan wrote: ↑Tue Sep 22, 2020 8:13 pm Ethereal 12.50 vs Stockfish 12 CCRL FRC testing has this chunk of games right at the very end of the 300 games they played :
1 1 0 = 0 0 = 1 0 1 1 1
+6-4=2 for Ethereal. +4-1 in the last 5 games.
This chunk would be a non-starter for LOS. Draws do not count, and wins for engine A and engine B cancel each other out. So for this chunk, supposing this is how you would apply an LOS test (and it is not), the result would be +2 in 10 decisive games, or LOS = 77.22%: a meaningless result that would need more games.
Oh, really ?

Let me remind you of what happened earlier in this thread :

mwyoung wrote: ↑Wed Sep 09, 2020 12:52 am
corres wrote: ↑Wed Sep 09, 2020 12:36 am
OliverBr wrote: ↑Wed Sep 09, 2020 12:29 am
corres wrote: ↑Tue Sep 08, 2020 10:06 pm Result is
SF12+NNUE : kTRIAD = 9 : 6 (185 draw) 200 games
This is no result at all, you need at least 5000 games in order to announce a result.
[snip]
A tester does not need 5000 games, unless the goal is only to maximize the Elo precision. A useful tool for engine developers.

What are you trying to answer?
1. What is the exact Elo difference between A and B?
2. Which of A and B is better? This usually takes far fewer games to prove statistically.
If +6-4 is "a meaningless result" that "would need more games", +9-6 isn't going to be much better.

See the issue? Of course the +6-4=2 result doesn't allow one to conclude superiority with confidence, but neither does the +9-6=185 result.

"See the issue? Of course the +6-4=2 result doesn't allow one to conclude superiority with confidence, but neither does the +9-6=185 result."

And I never said it did, and neither did LOS for any of your straw man scores. The LOS for +9-6=185 is 60.14%, a meaningless result.

This is your fantasy, or STRAW MAN.

And why is this inconsistent? I never said it did. STRAW MAN! I am working with LOS stats. So do you know how LOS works?

Remember Elo is statistics, and SPRT is statistics, LOS is statistics. "And statistics is a game of probability, and it cannot be known for certain whether statistical conclusions are correct. Whenever there is uncertainty, there is the possibility of making an error. Considering this nature of statistics science, all statistical hypothesis tests have a probability of making type I and type II errors."

But the probability is low, and it is a tool to be used.

Again, I leave you with the LOS table.

Code:

Margin	10	50	100	200	500	1000
0	50.00%	50.00%	50.00%	50.00%	50.00%	50.00%
1	64.51%	56.77%	54.81%	53.41%	52.16%	51.53%
2	77.22%	63.34%	59.55%	56.80%	54.32%	53.06%
3	86.95%	69.55%	64.16%	60.14%	56.46%	54.58%
4	93.43%	75.24%	68.57%	63.40%	58.58%	56.09%
5	97.14%	80.32%	72.73%	66.57%	60.68%	57.60%
6	98.95%	84.71%	76.60%	69.63%	62.75%	59.10%
7	99.68%	88.40%	80.14%	72.55%	64.78%	60.58%
8	99.93%	91.41%	83.34%	75.33%	66.77%	62.05%
9	99.99%	93.80%	86.19%	77.96%	68.72%	63.50%
10	100.00%	95.64%	88.69%	80.41%	70.61%	64.93%
11		97.02%	90.85%	82.69%	72.45%	66.34%
12		98.01%	92.68%	84.80%	74.23%	67.73%
13		98.71%	94.23%	86.72%	75.95%	69.09%
14		99.19%	95.50%	88.48%	77.60%	70.43%
15		99.50%	96.54%	90.06%	79.19%	71.74%
16		99.71%	97.37%	91.48%	80.71%	73.02%
17		99.83%	98.03%	92.74%	82.16%	74.27%
18		99.91%	98.55%	93.85%	83.54%	75.49%
19		99.95%	98.94%	94.82%	84.85%	76.68%
20		99.97%	99.24%	95.67%	86.08%	77.84%
21		99.99%	99.46%	96.41%	87.25%	78.96%
22		99.99%	99.62%	97.03%	88.35%	80.05%
23		100.00%	99.74%	97.57%	89.38%	81.11%
24			99.82%	98.02%	90.34%	82.12%
25			99.88%	98.40%	91.23%	83.11%
26			99.92%	98.71%	92.07%	84.06%
27			99.95%	98.97%	92.84%	84.97%
28			99.97%	99.18%	93.55%	85.85%
29			99.98%	99.36%	94.21%	86.69%
30			99.99%	99.50%	94.81%	87.50%
31			99.99%	99.61%	95.36%	88.27%
32			100.00%	99.70%	95.86%	89.01%
33				99.77%	96.32%	89.71%
34				99.82%	96.74%	90.38%
35				99.87%	97.11%	91.02%
36				99.90%	97.45%	91.62%
37				99.93%	97.76%	92.20%
38				99.95%	98.03%	92.74%
39				99.96%	98.28%	93.26%
40				99.97%	98.50%	93.74%
41				99.98%	98.69%	94.20%
42				99.98%	98.86%	94.63%
43				99.99%	99.02%	95.04%
44				99.99%	99.15%	95.42%
45				99.99%	99.27%	95.78%
46				100.00%	99.37%	96.11%
47					99.46%	96.42%
48					99.54%	96.72%
49					99.61%	96.99%
50					99.67%	97.24%
51					99.72%	97.47%
52					99.76%	97.69%
53					99.80%	97.89%
54					99.83%	98.08%
55					99.86%	98.25%
56					99.88%	98.41%
57					99.90%	98.56%
58					99.92%	98.69%
59					99.93%	98.82%
60					99.94%	98.93%
61					99.95%	99.03%
62					99.96%	99.13%
63					99.97%	99.22%
64					99.97%	99.29%
65					99.98%	99.37%
66					99.98%	99.43%
67					99.99%	99.49%
68					99.99%	99.54%
69					99.99%	99.59%
70					99.99%	99.64%
71					99.99%	99.68%
72					100.00%	99.71%
73						99.74%
74						99.77%
75						99.80%
76						99.82%
77						99.84%
78						99.86%
79						99.88%
80						99.89%
81						99.91%
82						99.92%
83						99.93%
84						99.94%
85						99.94%
86						99.95%
87						99.96%
88						99.96%
89						99.97%
90						99.97%
91						99.98%
92						99.98%
93						99.98%
94						99.98%
95						99.99%
96						99.99%
97						99.99%
98						99.99%
99						99.99%
100						99.99%
101						99.99%
102						100.00%

mwyoung · Post by **mwyoung** » Thu Sep 24, 2020 5:46 am

AndrewGrant wrote: ↑Tue Sep 22, 2020 6:45 pm
mwyoung wrote: ↑Tue Sep 22, 2020 6:29 pm The question is: have you ever seen an engine win all games in a 10-game test, then lose a test match of 1000 or 10,000 games to the same engine? No!
If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?

This test is over, and after seeing the result the answer is definitely NO!

Your engine could not win a single game in 200 games, let alone produce a winning LOS 100% score by taking chunks out of 100,000 games. STRAWMAN! That is an improper way to use LOS. At 200 games Stockfish only needed to beat you by +46, and it won by +92, double that margin.
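The cherry-picking argument quoted above can be explored with a small simulation. This is a rough sketch under stated assumptions: the per-game win and draw probabilities for the weaker engine are invented for illustration (they are not measured from any match in this thread), games are treated as independent, and `best_chunk_margin` and `simulate` are hypothetical helper names.

```python
import random

def best_chunk_margin(results, size=10):
    """Best (wins - losses) over any run of `size` consecutive games.

    `results` holds +1 for a win, 0 for a draw, -1 for a loss,
    all from the weaker engine's point of view.
    """
    if len(results) < size:
        raise ValueError("need at least `size` games")
    window = sum(results[:size])
    best = window
    # Slide the window one game at a time, updating the sum in O(1).
    for i in range(size, len(results)):
        window += results[i] - results[i - size]
        best = max(best, window)
    return best

def simulate(n_games, p_win, p_draw, seed=0):
    """Draw independent game results with fixed, hypothetical probabilities."""
    rng = random.Random(seed)
    p_loss = 1.0 - p_win - p_draw
    return rng.choices([1, 0, -1], weights=[p_win, p_draw, p_loss], k=n_games)

if __name__ == "__main__":
    # Hypothetical numbers: the weaker engine wins 5% and draws 60% of games.
    games = simulate(100_000, p_win=0.05, p_draw=0.60)
    print("overall margin:", sum(games))
    print("best 10-game chunk margin:", best_chunk_margin(games))
```

How lopsided the best chunk looks depends entirely on the assumed probabilities. With a win probability of zero, as in a match where the weaker engine never wins a game, no chunk can have a positive margin.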

Code:

Result:
-------------------------------------------------------------------------------------
  #  name                     games    wins   draws  losses   score    los%  elo+/-
  1. Stockfish 210920           200      92     108       0   146.0   100.0   172.8
  2. Ethereal 12.50 (POPCNT)    200       0     108      92    54.0     0.0  -172.8

Cross table:
-------------------------------------------------------------------------------------
  #  name                        score   games                                                                                                                                                                                                        1                                                                                                                                                                                                        2
  1. Stockfish 210920            146.0     200                                                                                                                                                                                                        x ===1==11=11===1=11=11=======1====1==1=1==11==11=1==11==1=111=1111111=1==1==1==11=1=======111=1=111=1==1111=1=111==111111111=========1=111==1=1=1==111=111=1===1=1====1=1====111=====1=1==1=====11=111===
  2. Ethereal 12.50 (POPCNT)      54.0     200 ===0==00=00===0=00=00=======0====0==0=0==00==00=0==00==0=000=0000000=0==0==0==00=0=======000=0=000=0==0000=0=000==000000000=========0=000==0=0=0==000=000=0===0=0====0=0====000=====0=0==0=====00=000===                                                                                                                                                                                                        x

Tech:
-------------------------------------------------------------------------------------

Tech (average nodes, depths, time/m per move, others per game), counted for computing moves only, ignored moves with zero nodes:
  #  name                       nodes/m         NPS  depth/m   time/m    moves     time
  1. Stockfish 210920           140752K    29802503     36.5      4.7     56.3    265.8
  2. Ethereal 12.50 (POPCNT)    191699K    38994410     28.5      4.9     56.5    277.9
     all ---                    162382K    34500376     32.5      4.8     56.4    271.9
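The elo+/- column in the result table is consistent with the standard logistic Elo formula applied to the score fraction. A minimal sketch, assuming that is how the testing tool derived the figure (the thread does not say), with `elo_diff` as an illustrative name:

```python
import math

def elo_diff(score: float, games: int) -> float:
    """Elo difference implied by a match score under the logistic model."""
    fraction = score / games
    if not 0.0 < fraction < 1.0:
        raise ValueError("Elo difference is unbounded for 0% or 100% scores")
    return -400.0 * math.log10(1.0 / fraction - 1.0)

# Stockfish scored 146.0 points out of 200 in the table above.
print(f"{elo_diff(146.0, 200):+.1f}")
```

The same function applied to Ethereal's 54.0 points gives the mirrored negative value, since the two score fractions sum to 1.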
