I'm disappointed with Stockfish dev.

connor_mcmonigle · Post by **connor_mcmonigle** » Sun Mar 12, 2023 7:23 pm

CornfedForever wrote: ↑Sun Mar 12, 2023 6:59 pm
DrEinstein wrote: ↑Sun Mar 12, 2023 11:12 am So we are all patiently waiting for the next big jump to the bigger hill. I believe, or want to believe, that Stockfish is not yet standing on top of the highest mountain.
I wonder how one might define the next 'big jump'. All the 'big jumps' have likely come and gone as engine strength gets closer to topping out. What is left are likely 'little jumps'. The issue I (and I think others, but I do not speak for them) see is that those are harder to find...and, under the traditional testing framework, it is probably harder these days to actually know which tweaks are responsible for those really very little jumps, if only because they fall closer to the 'margin of error'. You get a '+' and presume you 'have it' when it is part of multiple 'patches' working together...then later we find something in the tweaks/patches being disregarded or at least changed. And some people...do not seem to want to admit to seeing this 2 steps forward, 1 step back / 1 step forward, 2 steps back thing happening. But it is a viable 'blind approach' that can work over time.

I (like to think I) know a little about quantum physics. There, reality is just so 'odd' that no one currently fully understands it...you just "follow the math" into the darkness. Chess, though, is a different animal, as we know there are 'only' 10 to the 40 legal moves possible in a game, you play it on only 64 squares, and Knights do not move like Bishops...etc.

Sure, you can see VERY slow, incremental progress with the path being taken (and steps backward...). However, being at a bit of a loss for exactly what tweak 'works' means it resembles more 'wishcraft' than science - throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what or why it stuck). It's almost like blindly taking herbs to combat Covid-19 until you eventually find in your testing a statistical 'hit' that seems to indicate 'something' in one of those herbs resulted in a tiny number of people not dying who otherwise might have...vs identifying 'what' specific thing in a given herb actually is responsible and using that...or looking at things differently and finding a spike protein and using it to alert the body's immune system to respond to something that looks like it...or viral vector technologies for dealing with other diseases, etc. Wishcraft vs Science. Both can work...but with one you tend to know 'why' it is working...which in theory should mean 'fewer steps back'.

That's a lot of words to say very little. Most search patches tested on fishtest are motivated by some understanding of how Stockfish's search works. The patches aren't written by a bunch of monkeys at typewriters. Parameter tweaks are usually the result of SPSA tuning. There's not much to be learned from the fact that 482 is better than 480, but it's certainly not a random/unmotivated change.

In any case, it's entirely unclear what your proposed alternative to the current testing methodology is, which makes any substantive conversation impossible. You seem to claim the current testing methodology is no longer viable (without any evidence beyond the anecdotal), but you haven't even proposed an alternative.

syzygy · Post by **syzygy** » Sun Mar 12, 2023 9:52 pm

CornfedForever wrote: ↑Sun Mar 12, 2023 6:59 pm However, being at a bit of a loss for exactly what tweak 'works' means it resembles more 'wishcraft' than science - throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what or why it stuck).

So tell us how we can establish with 100% certainty whether a +0.5 Elo patch is indeed an improvement.

Rebel · Post by **Rebel** » Mon Mar 13, 2023 12:04 am

syzygy wrote: ↑Sun Mar 12, 2023 9:52 pm
CornfedForever wrote: ↑Sun Mar 12, 2023 6:59 pm However, being at a bit of a loss for exactly what tweak 'works' means it resembles more 'wishcraft' than science - throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what or why it stuck).
So tell us how we can establish with 100% certainty whether a +0.5 Elo patch is indeed an improvement.

http://rebel13.nl/text/example1.html

100K games looks pretty reliable.

syzygy · Post by **syzygy** » Mon Mar 13, 2023 12:31 am

Rebel wrote: ↑Mon Mar 13, 2023 12:04 am
syzygy wrote: ↑Sun Mar 12, 2023 9:52 pm
CornfedForever wrote: ↑Sun Mar 12, 2023 6:59 pm However, being at a bit of a loss for exactly what tweak 'works' means it resembles more 'wishcraft' than science - throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what or why it stuck).
So tell us how we can establish with 100% certainty whether a +0.5 Elo patch is indeed an improvement.
http://rebel13.nl/text/example1.html

100K games looks pretty reliable.

Not really. Still "wishcraft" according to that guy's definition.

Sopel · Post by **Sopel** » Mon Mar 13, 2023 2:51 pm

Rebel wrote: ↑Mon Mar 13, 2023 12:04 am
syzygy wrote: ↑Sun Mar 12, 2023 9:52 pm
CornfedForever wrote: ↑Sun Mar 12, 2023 6:59 pm However, being at a bit of a loss for exactly what tweak 'works' means it resembles more 'wishcraft' than science - throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what or why it stuck).
So tell us how we can establish with 100% certainty whether a +0.5 Elo patch is indeed an improvement.
http://rebel13.nl/text/example1.html

100K games looks pretty reliable.

With 100k games, +0.5 Elo will likely mean around 90% chance that it's positive Elo (depends on draw rate). That's worse than current Stockfish practice that's being debated.
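A rough way to check that figure is sketched below. It is only a back-of-the-envelope estimate, not the statistics fishtest actually uses: it assumes independent games, the logistic Elo model, and a normal approximation, and the function names (`expected_score`, `los_for_measured_elo`) are made up for illustration.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score per game under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def los_for_measured_elo(elo_diff: float, games: int, draw_rate: float) -> float:
    """Normal-approximation chance that a measured Elo gain is really positive.

    Assumption: games are independent and scored 1 / 0.5 / 0.
    """
    mean = expected_score(elo_diff)
    win_rate = mean - draw_rate / 2.0
    # Var[X] = E[X^2] - E[X]^2, with E[X^2] = win_rate * 1 + draw_rate * 0.25
    variance = win_rate + draw_rate / 4.0 - mean ** 2
    std_err = math.sqrt(variance / games)
    z = (mean - 0.5) / std_err
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


for draw_rate in (0.0, 0.5, 0.8, 0.9):
    p = los_for_measured_elo(0.5, 100_000, draw_rate)
    print(f"draw rate {draw_rate:.0%}: {p:.3f}")
```

Under these assumptions the estimate rises with the draw rate, and only reaches roughly 90% when about nine games in ten are drawn.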

Rebel · Post by **Rebel** » Mon Mar 13, 2023 5:15 pm

Sopel wrote: ↑Mon Mar 13, 2023 2:51 pm
Rebel wrote: ↑Mon Mar 13, 2023 12:04 am
syzygy wrote: ↑Sun Mar 12, 2023 9:52 pm
CornfedForever wrote: ↑Sun Mar 12, 2023 6:59 pmHowever, being at a bit of a loss for exactly what tweak 'works' means it resembles more 'wishcraft' than science - throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what or why it stuck).
So tell us how we can establish with 100% certainty whether a +0.5 Elo patch is indeed an improvement.
http://rebel13.nl/text/example1.html

100K games looks pretty reliable.
With 100k games, +0.5 Elo will likely mean around 90% chance that it's positive Elo (depends on draw rate). That's worse than current Stockfish practice that's being debated.

With today's draw rates you can evaluate differently.

A hypothetical 100K match may end in 50137-49863 (50.1%), traditionally meaning less than 1 Elo of progress, but couldn't you also look at the 137 more won games and consider the version better?
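The "less than 1 Elo" figure can be checked with the usual logistic conversion from score to Elo. This is a minimal sketch that treats 50137-49863 as points scored (wins plus half the draws); the helper name `elo_from_score` is made up for illustration.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))


points_a, points_b = 50137, 49863
elo = elo_from_score(points_a / (points_a + points_b))
print(f"{elo:.2f} Elo")
```

The conversion alone says nothing about how reliable the result is; that depends on how many of the 100K games were decisive.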

syzygy · Post by **syzygy** » Mon Mar 13, 2023 9:04 pm

Rebel wrote: ↑Mon Mar 13, 2023 5:15 pm
Sopel wrote: ↑Mon Mar 13, 2023 2:51 pm
Rebel wrote: ↑Mon Mar 13, 2023 12:04 am
syzygy wrote: ↑Sun Mar 12, 2023 9:52 pm
CornfedForever wrote: ↑Sun Mar 12, 2023 6:59 pmHowever, being at a bit of a loss for exactly what tweak 'works' means it resembles more 'wishcraft' than science - throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what or why it stuck).
So tell us how we can establish with 100% certainty whether a +0.5 Elo patch is indeed an improvement.
http://rebel13.nl/text/example1.html

100K games looks pretty reliable.
With 100k games, +0.5 Elo will likely mean around 90% chance that it's positive Elo (depends on draw rate). That's worse than current Stockfish practice that's being debated.
With today's draw rates you can evaluate differently.

A hypothetical 100K match may end in 50137-49863 (50.1%), traditionally meaning less than 1 Elo of progress, but couldn't you also look at the 137 more won games and consider the version better?

You are certainly right that, with a high draw rate, a smaller W-L is enough to reach the same confidence that version A is better than version B than would be needed with a low draw rate.
274 wins, 99726 draws, 0 losses -> high confidence that A>B
50137 wins, 0 draws, 49863 losses -> low confidence that A>B

But there is probably quite a bit of noise in fishtest (leading to wins and losses for both sides, so lower draw rate but W/L remaining the same -> less confidence at same number of games, more games needed). Both SPRT and fixed number of games should suffer from this.
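The two examples above, plus the effect of noise, can be illustrated with the common normal-approximation formula for likelihood of superiority (LOS), in which draws drop out and only wins and losses matter. This is a sketch, not fishtest's actual model; the function name `los` and the third, "noisy" result are invented for illustration.

```python
import math


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority (normal approximation); draws do not enter."""
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Same W-L of 274 in every case, but more and more decisive games.
print(los(274, 0))          # 274 wins, 99726 draws, 0 losses
print(los(10274, 10000))    # hypothetical noisy result: extra wins AND losses
print(los(50137, 49863))    # 50137 wins, 0 draws, 49863 losses
```

With W-L held fixed, every extra pair of decisive games enlarges `wins + losses` and so lowers the confidence, which is the noise effect described above.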

I assume the various statistical models being used take this into account, but I'm not sure (I never broke my head on it, as we say in Dutch).

(Of course the point remains that absolute certainty does not exist. But let's wait for Cornfed to enlighten us on how to make the mathematically impossible possible.)

CornfedForever · Post by **CornfedForever** » Mon Mar 13, 2023 10:59 pm

syzygy wrote: ↑Mon Mar 13, 2023 9:04 pm
(Of course the point remains that absolute certainty does not exist. But let's wait for Cornfed to enlighten us on how to make the mathematically impossible possible.)

Dude, nothing is 100% certain. Just stop it with the straw-man.

syzygy · Post by **syzygy** » Mon Mar 13, 2023 11:30 pm

CornfedForever wrote: ↑Mon Mar 13, 2023 10:59 pm
syzygy wrote: ↑Mon Mar 13, 2023 9:04 pm
(Of course the point remains that absolute certainty does not exist. But let's wait for Cornfed to enlighten us on how to make the mathematically impossible possible.)
Dude, nothing is 100% certain. Just stop it with the straw-man.

Hey, you were the wishcraft guy... Did you lose it?

AlexChess · Post by **AlexChess** » Tue Mar 14, 2023 10:10 am

syzygy wrote: ↑Mon Mar 13, 2023 11:30 pm
CornfedForever wrote: ↑Mon Mar 13, 2023 10:59 pm
syzygy wrote: ↑Mon Mar 13, 2023 9:04 pm
(Of course the point remains that absolute certainty does not exist. But let's wait for Cornfed to enlighten us on how to make the mathematically impossible possible.)
Dude, nothing is 100% certain. Just stop it with the straw-man.
Hey, you were the wishcraft guy... Did you lose it?

What about simply reintroducing a much-improved Contempt option, so that SF on a 128-thread Threadripper hitting 75 MN/s no longer so often draws against SF on a Raspberry Pi 3 calculating only 79 kN/s?
