A match between SF12+NNUE and Leele ver.0.26.2

AndrewGrant · Post by **AndrewGrant** » Tue Sep 22, 2020 6:45 pm

mwyoung wrote: ↑Tue Sep 22, 2020 6:29 pm The question is: have you ever seen an engine win all games in a 10-game test, then lose a test match of 1,000 or 10,000 games to the same engine? No!

Uh, no. I see Ethereal win 10 games straight and then fail a 1,000 game test once every few days.

I can never tell if users are trolling. Do you _really_ think a sample size of 10 games means _anything_ ?

If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?

Alayan · Post by **Alayan** » Tue Sep 22, 2020 8:13 pm

Ethereal 12.50 vs Stockfish 12 CCRL FRC testing has this chunk of games right at the very end of the 300 games they played :

1 1 0 = 0 0 = 1 0 1 1 1

+6-4=2 for Ethereal. +4-1 in the last 5 games.

Overall results are +22-185=93.
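How often such a chunk shows up by chance can be estimated with a small Monte Carlo run. The following is only a sketch: it assumes every game is independent and uses the overall +22-185=93 result above as fixed win/draw/loss probabilities for the weaker engine. The function name and the "scores more than 50% in the window" criterion are illustrative choices, not anything taken from the CCRL data.

```python
import random

def prob_window_with_plus_score(n_games, window, p_win, p_draw, trials=2000, seed=0):
    """Estimate the chance that a match of n_games contains at least one
    run of `window` consecutive games where the weaker side scores > 50%.
    Assumption: games are independent with fixed W/D/L probabilities."""
    rng = random.Random(seed)
    p_loss = 1.0 - p_win - p_draw
    hits = 0
    for _ in range(trials):
        # Per-game score for the weaker engine: 1 = win, 0.5 = draw, 0 = loss.
        scores = rng.choices([1.0, 0.5, 0.0], weights=[p_win, p_draw, p_loss], k=n_games)
        running = sum(scores[:window])
        found = running > window / 2
        # Slide the window one game at a time, updating the running score.
        for i in range(window, n_games):
            if found:
                break
            running += scores[i] - scores[i - window]
            found = running > window / 2
        hits += found
    return hits / trials

# Rates taken from the overall +22-185=93 result over 300 games.
estimate = prob_window_with_plus_score(300, 10, 22 / 300, 93 / 300)
print(f"estimated probability of a winning 10-game chunk: {estimate:.3f}")
```

Raising `n_games` to 100,000 would answer the question posed earlier in the thread under the same assumptions, at the cost of a longer run.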

One thing that should be said is that error bars are directly correlated to the results variance. Play e.g. 100 games N times and measure the distribution of results. When the draw rate is higher (as it is with longer TC and more threads), the standard deviation will go down, while the sample size is unchanged. But I don't know if intrinsic WDL is enough or if you need experimental data to build a good prior for error bar computations.
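The draw-rate effect can be made concrete with the usual per-game variance of a win/draw/loss outcome. This is a minimal sketch assuming independent games with fixed W/D/L probabilities, which is exactly the "intrinsic WDL" model being questioned here; `score_std_error` is a name made up for the example.

```python
import math

def score_std_error(p_win, p_draw, n_games):
    """Standard error of the mean score over n_games, assuming independent
    games with fixed win/draw/loss probabilities (an assumption)."""
    p_loss = 1.0 - p_win - p_draw
    mean = p_win + 0.5 * p_draw
    # Variance of a single game's score (1, 0.5 or 0) around the mean.
    var = (p_win * (1.0 - mean) ** 2
           + p_draw * (0.5 - mean) ** 2
           + p_loss * (0.0 - mean) ** 2)
    return math.sqrt(var / n_games)

# Two equal engines over 100 games, with a low and a high draw rate.
low_draw = score_std_error(0.40, 0.20, 100)
high_draw = score_std_error(0.10, 0.80, 100)
print(low_draw, high_draw)
```

With the sample size fixed at 100 games, the high-draw-rate case gives the smaller standard error, which is the effect described above.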

A simple thought experiment involves an engine X playing a drawless game. In self-play playing both sides in equal amount, the intrinsic WDL will always be 50/0/50. However, at very long TC the engine might achieve 100% win with the strong side and no variance at all, while at very short TC the weak side at the start of the game could snatch wins, creating variance. So, it seems intrinsic WDL by itself isn't enough to properly assess variance. And of course, measured WDL will be off compared to the intrinsic WDL, so this increases unreliability.
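The thought experiment can be worked through numerically. The sketch below assumes a drawless game in which the strong side wins with probability `p_strong`, and that one copy of the engine plays each side once per pair, with the two games independent. Both function names are hypothetical.

```python
def pair_variance(p_strong):
    """Variance of one engine copy's total score over a pair of games
    (one as the strong side, one as the weak side) in a drawless game.
    As strong side the score is Bernoulli(p_strong); as weak side it is
    Bernoulli(1 - p_strong). Both have variance p_strong * (1 - p_strong)."""
    return 2.0 * p_strong * (1.0 - p_strong)

def wdl_pair_variance(p_win, p_draw):
    """Variance of the same pair score if one only looks at the overall
    W/D/L rates and treats every game as an identical draw from them."""
    p_loss = 1.0 - p_win - p_draw
    mean = p_win + 0.5 * p_draw
    per_game = (p_win * (1.0 - mean) ** 2
                + p_draw * (0.5 - mean) ** 2
                + p_loss * mean ** 2)
    return 2.0 * per_game

# The overall WDL is 50/0/50 in every case, so the WDL-only estimate never changes.
print(wdl_pair_variance(0.5, 0.0))   # 0.5
print(pair_variance(1.0))            # 0.0: strong side always wins, no variance
print(pair_variance(0.75))           # 0.375: weak side sometimes snatches a win
```

The WDL-only figure stays at 0.5 while the actual pair variance ranges from 0 to 0.5 depending on `p_strong`, which is why the overall WDL alone does not pin down the variance.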

In the end, the fact that you might get away with fewer games for the same confidence doesn't mean multiple orders of magnitude fewer.

corres · Post by **corres** » Tue Sep 22, 2020 8:21 pm

AndrewGrant wrote: ↑Tue Sep 22, 2020 6:45 pm
mwyoung wrote: ↑Tue Sep 22, 2020 6:29 pm The question is: have you ever seen an engine win all games in a 10-game test, then lose a test match of 1,000 or 10,000 games to the same engine? No!
Uh, no. I see Ethereal win 10 games straight and then fail a 1,000 game test once every few days.
I can never tell if users are trolling. Do you _really_ think a sample size of 10 games means _anything_ ?
If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?

What is the probability that Ethereal wins only the first 10 games, or only the last 10 games, out of 1000 and then loses all the other 990?
It is very, very small.
Note:
The statistics generally used to estimate the accuracy of Elo are appropriate for many thousands of games, because the Gaussian curve is a continuous curve that is approximated only with a lot of measurement points.
For a data set with only a few elements, Gaussian statistics are not appropriate. But this fact is not an obstacle to drawing conclusions about the strength order of engines. Obviously, the lower the number of games and the strength difference, the higher the uncertainty.
In general, I publish my test method and the results, and everybody can use them as they want.
That is all.

mwyoung · Post by **mwyoung** » Tue Sep 22, 2020 8:30 pm

AndrewGrant wrote: ↑Tue Sep 22, 2020 6:45 pm
mwyoung wrote: ↑Tue Sep 22, 2020 6:29 pm The question is: have you ever seen an engine win all games in a 10-game test, then lose a test match of 1,000 or 10,000 games to the same engine? No!
Uh, no. I see Ethereal win 10 games straight and then fail a 1,000 game test once every few days.

I can never tell if users are trolling. Do you _really_ think a sample size of 10 games means _anything_ ?

If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?

I know you have many issues.

What engine! What testing conditions! And I can do this test also.

You're always full of B.S.

This is LOS statistics. This is not Mark's statistics.

"If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?"

Very unlikely, and that is the point. But you might be able to cherry pick 10 games. And say this was a 10 game test. I put nothing past you with your record.

Milos · Post by **Milos** » Tue Sep 22, 2020 9:34 pm

mwyoung wrote: ↑Tue Sep 22, 2020 8:30 pm I know you have many issues.

What engine! What testing conditions! And I can do this test also. You're always full of B.S.

This is LOS statistics. This is not Mark's statistics.

"If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?"

Very unlikely, and that is the point. But you might be able to cherry pick 10 games. And say this was a 10 game test. I put nothing past you with your record.

Oh shut up you clueless troll. You are ignorant beyond comprehension. I wonder why they even try to explain anything to you.

mwyoung · Post by **mwyoung** » Tue Sep 22, 2020 9:43 pm

Milos wrote: ↑Tue Sep 22, 2020 9:34 pm
mwyoung wrote: ↑Tue Sep 22, 2020 8:30 pm I know you have many issues.

What engine! What testing conditions! And I can do this test also. You're always full of B.S.

This is LOS statistics. This is not Mark's statistics.

"If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?"

Very unlikely, and that is the point. But you might be able to cherry pick 10 games. And say this was a 10 game test. I put nothing past you with your record.
Oh shut up you clueless troll. You are ignorant beyond comprehension. I wonder why they even try to explain anything to you.

Funny, when I was not the one posting here first. You came to me.

You have been called out. Now prove LOS stats wrong and invalid.

mwyoung · Post by **mwyoung** » Tue Sep 22, 2020 10:03 pm

mwyoung wrote: ↑Tue Sep 22, 2020 9:43 pm
Milos wrote: ↑Tue Sep 22, 2020 9:34 pm
mwyoung wrote: ↑Tue Sep 22, 2020 8:30 pm I know you have many issues.

What engine! What testing conditions! And I can do this test also. You're always full of B.S.

This is LOS statistics. This is not Mark's statistics.

"If I play 100,000 games against Stockfish with Ethereal, do you not think I can find a 10-game chunk where Ethereal beats Stockfish handily?"

Very unlikely, and that is the point. But you might be able to cherry pick 10 games. And say this was a 10 game test. I put nothing past you with your record.
Oh shut up you clueless troll. You are ignorant beyond comprehension. I wonder why they even try to explain anything to you.
Funny, when I was not the one posting here first. You came to me.

You have been called out. Now prove LOS stats wrong and invalid.

LOS practical testing: how LOS can be used. Here we have 2 new engines, Stockfish 210920 and Ethereal 12.50. I have never tested them, and I have no clue which engine is better! How many games does it take with these two engines, with correct application of LOS, to determine which engine is stronger?

Let us find out.

Live Stream:

Live Stockfish 12 (210920) vs Ethereal 12.50 (3m+2s) LOS testing.

Testing conditions.

Hardware 2950x, RTX 2080 Ti

Ethereal 12.50
Stockfish 12 (210920)
Ponder off.
TC=3m+2s
200 Games
32 threads.
4 GB hash.
6 man TB, and the top ten 7 man TB.
Opening book 6 moves random.

OliverBr · Post by **OliverBr** » Tue Sep 22, 2020 10:37 pm

mwyoung wrote: ↑Tue Sep 22, 2020 10:03 pm 200 Games

Why are you playing 200 games? I thought 10 games are more than enough...

mwyoung · Post by **mwyoung** » Tue Sep 22, 2020 10:41 pm

OliverBr wrote: ↑Tue Sep 22, 2020 10:37 pm
mwyoung wrote: ↑Tue Sep 22, 2020 10:03 pm 200 Games
Why are you playing 200 games? I thought 10 games is more than enough...

Again your stupidity shows. And you have no clue what or how to apply LOS testing. I do not know what engine is better, or if one engine will win 10 games in the first 10 games.

But right now after 8 games only. It is likely that SF is the better engine.
LOS score after 8 games is SF 95.8% and Ethereal 4.2%.
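For readers who want to check a figure like this, here is a minimal sketch of the commonly used LOS (likelihood of superiority) approximation, which depends only on wins and losses and ignores draws. Whether the GUI used in this match computes LOS exactly this way is an assumption, and the post does not give the win/loss counts. The 3-0 split below is just one hypothetical count, with the other 5 games drawn, that is consistent with the quoted 95.8% under this formula.

```python
import math

def los(wins, losses):
    """Likelihood of superiority from decisive games only, using the
    normal approximation LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))).
    Draws do not enter the formula."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Hypothetical split: 3 wins, 0 losses, 5 draws after 8 games.
print(f"{los(3, 0):.3f}")  # about 0.958
print(f"{los(0, 3):.3f}")  # about 0.042, the other engine's share
```

Under this formula the two engines' values always sum to 1, which matches the 95.8% and 4.2% split quoted above.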

The match continues until we know for sure....100%

OliverBr · Post by **OliverBr** » Tue Sep 22, 2020 10:49 pm

mwyoung wrote: ↑Tue Sep 22, 2020 10:41 pm Again your stupidity shows.

What, again, is your engine? Could you please post the git link?
Thank you very much.
