I waited for you, chrisw, together with your bad dream, till.
The new NNUE-net (nn-308..) seems being weaker
Moderator: Ras
-
corres
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: The new NNUE-net (nn-308..) seems being weaker
Last edited by corres on Sun Sep 06, 2020 1:22 am, edited 1 time in total.
-
corres
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: The new NNUE-net (nn-308..) seems being weaker
Maybe you cannot read English?
The score was 3 : 1 in favor of the nn-82215.. net in a test that consisted of 100 games. There were 96 draws (obviously).
-
mehmet123
- Posts: 699
- Joined: Sun Jan 26, 2020 10:38 pm
- Location: Turkey
- Full name: Mehmet Karaman
Re: The new NNUE-net (nn-308..) seems being weaker
This is your first answer to my question. It was this strange answer that bothered me.
A few days ago I claimed that the SV 1705 net was stronger than the default net (SV 2257) according to my tests. In my tests SV 1705 was +2 Elo stronger than the default net. But the SV 1705 net failed at Fishtest (-3 Elo) in the 10 sec + 0.6 sec test. Then a new test was run, and in that test the SV 1705 net beat the default net at 60 sec + 0.6 tc on Fishtest (+1 Elo).
http://talkchess.com/forum3/viewtopic.p ... 8&start=90
-
corres
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: The new NNUE-net (nn-308..) seems being weaker
mehmet123 wrote: ↑Sun Sep 06, 2020 1:24 am
This is your first answer to my question. It was this strange answer that bothered me.
A few days ago I claimed that the SV 1705 net was stronger than the default net (SV 2257) according to my tests. In my tests SV 1705 was +2 Elo stronger than the default net. But the SV 1705 net failed at Fishtest (-3 Elo) in the 10 sec + 0.6 sec test. Then a new test was run, and in that test the SV 1705 net beat the default net at 60 sec + 0.6 tc on Fishtest (+1 Elo).
http://talkchess.com/forum3/viewtopic.p ... 8&start=90
A question:
From how many times 10,000 games can you calculate such a 1-2 Elo difference?
I know only the labels of the nets from the Stockfish developers, not those from Sergio.
So what are the SV-1705 and SV-2257 nets?
-
Alayan
- Posts: 550
- Joined: Tue Nov 19, 2019 8:48 pm
- Full name: Alayan Feh
Re: The new NNUE-net (nn-308..) seems being weaker
Choosing heads or tails and flipping a fair coin 4 times, there is a 31.25% probability to lose 0-4 or 1-3.
For an engine A having a long-term probability of winning X% of the decisive (non-draw) games against an engine B, the odds of losing 0-4 or 1-3 in a random sample of 4 decisive games are:
- 30% decisive games won, 70% lost => 65.2% of losing a 4-decisive games sample, 34.8% of winning or drawing.
- 40% decisive games won, 60% lost => 47.5% of losing a 4-decisive games sample, 52.5% of winning or drawing.
- 45% decisive games won, 55% lost => 39.1% of losing a 4-decisive games sample, 60.9% of winning or drawing.
- 50% decisive games won, 50% lost => 31.2% of losing a 4-decisive games sample, 68.8% of winning or drawing.
- 55% decisive games won, 45% lost => 24.1% of losing a 4-decisive games sample, 75.9% of winning or drawing.
- 60% decisive games won, 40% lost => 17.9% of losing a 4-decisive games sample, 82.1% of winning or drawing.
- 2 to 1 decisive game win ratio (~66.7% to ~33.3%) => 11.1% of losing a 4-decisive games sample, 88.9% of winning or drawing.
- 70% decisive games won, 30% lost => 8.4% of losing a 4-decisive games sample, 91.6% of winning or drawing.
- 80% decisive games won, 20% lost => 2.7% of losing a 4-decisive games sample, 97.3% of winning or drawing.
- 90% decisive games won, 10% lost => 0.4% of losing a 4-decisive games sample, 99.6% of winning or drawing.
From your 3-1 results, you can conclude with high confidence that the old net has a double-digit percent chance of winning a decisive game against the new net, and that's it.
Claiming that because it's your test it's enough for you to believe the new net is weaker is missing the point.
You can believe all the BS you want, but as soon as you share it in a forum thread, you open yourself to criticism. Others just skimming through thread titles or not well-versed in statistics might give credit to an outright falsehood. It's not acceptable to spread disinformation even if you didn't mean to harm.
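For anyone who wants to check the percentages in the list above, here is a minimal Python sketch. It assumes the decisive games are independent and that A wins each one with the same fixed probability, so the count of A's wins follows a binomial distribution. The function name `prob_lose_sample` is made up for this illustration.

```python
from math import comb


def prob_lose_sample(p_win: float, n: int = 4) -> float:
    """Probability that engine A wins fewer than half of n decisive games.

    Assumes independent games with a fixed per-game win probability p_win
    (a plain binomial model). For n = 4 this is P(0 wins) + P(1 win),
    i.e. losing the sample 0-4 or 1-3.
    """
    q = 1.0 - p_win
    return sum(
        comb(n, k) * p_win**k * q ** (n - k)
        for k in range(n + 1)
        if k < n - k  # A has fewer wins than B
    )


if __name__ == "__main__":
    for p in (0.30, 0.40, 0.45, 0.50, 0.55, 0.60, 2 / 3, 0.70, 0.80, 0.90):
        lose = prob_lose_sample(p)
        print(f"{p:6.1%} won => {lose:6.2%} losing the sample, "
              f"{1 - lose:6.2%} winning or drawing")
```

With `p_win = 0.5` this gives 5/16 = 31.25%, the coin-flip figure quoted at the start of the post.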
-
MikeB
- Posts: 4889
- Joined: Thu Mar 09, 2006 6:34 am
- Location: Pen Argyl, Pennsylvania
Re: The new NNUE-net (nn-308..) seems being weaker
Time to play nice - let’s hit the pause button. Thanks.
-
corres
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: The new NNUE-net (nn-308..) seems being weaker
Alayan wrote: ↑Sun Sep 06, 2020 1:51 am ...
You can believe all the BS you want, but as soon as you share it in a forum thread, you open yourself to criticism. Others just skimming through thread titles or not well-versed in statistics might give credit to an outright falsehood. It's not acceptable to spread disinformation even if you didn't mean to harm.
I take every criticism kindly, but since I was the one who stated that my 100-game test is too small to prove that nn-308... is the weaker net, every "criticism" is no more than an ill-intentioned attack against me. A typical example of this is the post by Terje, who mixes a political site (TCF) into a kind of technical forum.
Maybe he does not like my sentence "I am not a polcorrect man. I used to say the sincere". Yes, Terje, I am not a politically correct man, and I speak sincerely, whether you like it, or me, or not.
Chrisw knows this about me, and this is why he feels the need to bring up wind hire.
Last edited by corres on Sun Sep 06, 2020 10:02 am, edited 2 times in total.
-
yurikvelo
- Posts: 710
- Joined: Sat Dec 06, 2014 1:53 pm
Re: The new NNUE-net (nn-308..) seems being weaker
The measured result is +1.06 Elo @ STC and +4.23 Elo @ LTC.
To measure a difference as small as 1.06 Elo, 108328 games were played.
Fishtest plays games in batches of 200; 554 batches of 200 games each were played.
In 238 of the 554 batches (200 games each), the older (weaker) net had MORE wins.
238/554 = 43% = the expected probability that the weaker version gets more wins in your particular 200-game run.
32 batches (200 games each) had [Loss-Wins > 10]:
the weaker net won 32 series (200 games each) by more than 10 net wins!
In 2 runs (200 games each) the weaker net won by a margin of 20 games:
-20 +40 =140
-12 +32 =156
Impressive -35 Elo regression?!
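The "-35" figure for those two batches can be reproduced with the usual logistic Elo formula. Below is a small Python sketch. It assumes each batch line is read as 20 (or 12) wins, 40 (or 32) losses and the rest draws from the newer net's side, and the helper name `elo_diff` is only illustrative.

```python
import math


def elo_diff(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a match result under the logistic Elo model.

    The score is (wins + draws / 2) / games, and the Elo difference is
    -400 * log10(1 / score - 1). Undefined for a 0% or 100% score.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    # The two outlier batches, seen from the newer net's side (assumed reading).
    for wins, losses, draws in ((20, 40, 140), (12, 32, 156)):
        print(f"+{wins} -{losses} ={draws}: {elo_diff(wins, losses, draws):+.1f} Elo")
    # Share of 200-game batches in which the older net had more wins.
    print(f"238/554 = {238 / 554:.1%}")
```

Both batches come out at a 45% score, so each one taken alone would suggest a regression of about 35 Elo, even though the full run measured roughly +1 Elo.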
-
corres
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: The new NNUE-net (nn-308..) seems being weaker
yurikvelo wrote: ↑Sun Sep 06, 2020 9:44 am
The measured result is +1.06 Elo @ STC and +4.23 Elo @ LTC.
To measure a difference as small as 1.06 Elo, 108328 games were played.
Fishtest plays games in batches of 200; 554 batches of 200 games each were played.
In 238 of the 554 batches (200 games each), the older (weaker) net had MORE wins.
238/554 = 43% = the expected probability that the weaker version gets more wins in your particular 200-game run.
32 batches (200 games each) had [Loss-Wins > 10]:
the weaker net won 32 series (200 games each) by more than 10 net wins!
In 2 runs (200 games each) the weaker net won by a margin of 20 games:
-20 +40 =140
-12 +32 =156
OK, but all these test games were not played on my machine, not with the starting positions I used, and not at my time control.
And, mainly, no one can decide which net I must use. If somebody cannot agree with my statement about that net, he can ask for whatever opinion he likes. That is all.
From my side this stupid debate is ended. Period.
