How helpful would it be to have +0.7 Elo +-0.5 for every patch?
I'm disappointed with Stockfish dev.
Moderator: Ras
-
- Posts: 391
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: I'm disappointed with Stockfish dev.
dangi12012 wrote: No one wants to touch anything you have posted. That proves you now have a negative reputation, since everyone already knows you are a forum troll.
Maybe you copied your Stockfish commits from someone else too?
I will look into that.
-
- Posts: 5780
- Joined: Tue Feb 28, 2012 11:56 pm
Re: I'm disappointed with Stockfish dev.
CornfedForever wrote: ↑Tue Mar 07, 2023 4:02 am And they wonder why I question how they can know which changes actually resulted in a positive change and which result in a negative change.
syzygy wrote: ↑Wed Mar 08, 2023 9:34 pm No, they see Dunning-Kruger at work.
CornfedForever wrote: ↑Wed Mar 08, 2023 10:06 pm Enough with what is essentially name calling rather than an argument. I'm talking about the data and not knowing with any real certainty how you get to a +Elo or a -Elo (outside of tolerance) because so much is tested together. I mean...if every patch was a positive...SF would be increasing in strength every week. It is not.
syzygy wrote: ↑Fri Mar 10, 2023 10:19 pm Who says SF is not increasing in strength? SF has increased hundreds of Elo because its development process works. Of course ultimately there is a ceiling to what can be achieved.
CornfedForever wrote: ↑Fri Mar 10, 2023 11:02 pm Once again...intentionally misinterpreting my words...but we have come to expect that from you. No one is saying it is 'not increasing in strength'. But there is a series of patches released each and every week...if every one of them was a positive, it would be increasing each week and clearly it is not. Sometimes it is one step forward, two steps back...not a straight linear progression. Maybe some remedial English for you is in order?
syzygy wrote: ↑Fri Mar 10, 2023 11:13 pm Ok, so you mean all is going perfectly fine with SF development. Then we can close the thread.
forum3/viewtopic.php?p=943314#p943314
syzygy wrote: The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.
You can NEVER be 100% sure that a patch that seems to gain 1 Elo really does not lose Elo.
You can be 99.99% sure if you want, but it would be a waste of resources.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
Chess engine development is a game of statistics.
chrisw wrote: ↑Sat Mar 11, 2023 12:35 am That's a circular argument. If you only have a hammer then everything is hammering.
What argument?
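The point made above, that a patch which seems to gain 1 Elo may in truth lose Elo, can be illustrated with a small calculation. The following is only a sketch under stated assumptions: it uses the plain logistic Elo model and a normal approximation for the error bar, and the function names and the example game counts are hypothetical, not taken from the thread or from Stockfish's testing framework.

```python
import math

def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_estimate(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, elo_low, elo_high) for a match result.

    Assumption: games are independent and the mean score is roughly normal,
    so score +/- z * standard_error gives an approximate confidence interval.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game scores 1, 0.5 or 0.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - z * se),
            elo_from_score(score + z * se))

# Hypothetical 50,000-game match with a small edge for the patch.
elo, low, high = elo_estimate(wins=10100, draws=30000, losses=9900)
print(f"{elo:+.2f} Elo, interval [{low:+.2f}, {high:+.2f}]")
```

With these made-up counts the estimate is slightly positive, yet the interval still includes zero, which is the sense in which certainty is never 100%. Narrowing the interval further costs more games.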
-
- Posts: 648
- Joined: Mon Jun 20, 2022 4:08 am
- Full name: Brian D. Smith
-
- Posts: 5780
- Joined: Tue Feb 28, 2012 11:56 pm
Re: I'm disappointed with Stockfish dev.
CornfedForever wrote: ↑Sat Mar 11, 2023 9:01 pm If I may...just google 'circular argument' or 'circular reasoning'...it's a logical fallacy...and you will see what he means.
Oh boy. You must be a fun person.
What argument did I make? Before an argument can be circular, there has to be an argument.
-
- Posts: 4648
- Joined: Tue Apr 03, 2012 4:28 pm
- Location: Midi-Pyrénées
- Full name: Christopher Whittington
Re: I'm disappointed with Stockfish dev.
chrisw wrote: ↑Sat Mar 11, 2023 12:35 am That's a circular argument. If you only have a hammer then everything is hammering.
syzygy wrote: ↑Sat Mar 11, 2023 8:55 pm What argument?
It's been reduced to a statistical game by reducing the game itself to 1, 0 and -1. If, however, you were to regard the chess engine/game/author/development thing as having features beyond 1, 0, -1, it would be not just a "game of statistics" but have other properties too. It must have those other properties, otherwise you wouldn't be doing it and the general interest would sink to zero. One assumes.
-
- Posts: 544
- Joined: Sun Sep 06, 2020 4:40 am
- Full name: Connor McMonigle
Re: I'm disappointed with Stockfish dev.
It's immediately telling that the only individuals criticizing the Stockfish project's testing methodology are those who are not remotely experienced when it comes to actual chess engine development. SPRT is both statistically principled and has been empirically demonstrated to be effective across a wide variety of disciplines. It is true that results at short time control (STC) aren't guaranteed to translate perfectly to results at long time control (LTC), but Stockfish's incredible progress in terms of Elo at a wide range of time controls over the last several years is a testament to the fact that the Stockfish project's testing methodology is effective.
This is a classic case of Dunning-Kruger at work. To have any ground to stand on in criticizing Stockfish's testing methodology, you have to both propose and demonstrate an effective alternative. Go develop your own engine from scratch or start from a much weaker engine. Implement your own alternative testing methodology and see what kind of progress you make. You'll quickly find that practical considerations prevent SPRT testing at very long time control (VLTC). You'll quickly find that just using test positions (as was suggested earlier in this thread for some reason) as a proxy for engine strength will get you nowhere. Or you could try the approach which Eduard here has seemingly taken: make a few random changes, watch the engine play a handful of games on some random server and call it good enough. To the great surprise of no one, this also won't get you anywhere.
Starting with Stockfish as a base for experimenting with alternative testing methodologies is incredibly daft as Stockfish is so incredibly strong that random garbage changes usually won't significantly harm its LTC performance. If you've weakened Stockfish by tens of Elo in STC testing and your changes seem roughly neutral in limited VLTC testing, that doesn't mean your changes are brilliant and will continue to scale better at increasingly longer time controls. Rather, it just means chess is pretty close to a draw for an engine as strong as Stockfish and, with sufficient time, even garbage patches won't significantly harm its performance.
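For readers unfamiliar with SPRT, the idea referred to above can be sketched as follows. This is a simplified illustration, not Stockfish's actual test code: it assumes the plain logistic Elo model, a win/draw/loss (trinomial) result count, and a normal approximation to the log-likelihood ratio. The function names, the Elo bounds and the error rates are assumptions chosen for the example.

```python
import math

def expected_score(elo: float) -> float:
    """Expected score for a given Elo advantage under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Assumption: the mean score is treated as normal with the variance
    estimated from the observed results.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean ** 2
    if var <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)

def sprt_decision(wins, draws, losses, elo0=0.0, elo1=2.0, alpha=0.05, beta=0.05):
    """Stop as soon as the LLR crosses one of Wald's two bounds."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"   # evidence favours the patch gaining about elo1
    if llr <= lower:
        return "accept H0"   # evidence favours the patch gaining about elo0
    return "continue"        # not enough evidence yet, keep playing games

print(sprt_decision(wins=10, draws=30, losses=10))
```

The design point is that the test stops as soon as the evidence is strong enough in either direction, so clearly good or clearly bad patches use few games and only borderline patches run long.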
-
- Posts: 4648
- Joined: Tue Apr 03, 2012 4:28 pm
- Location: Midi-Pyrénées
- Full name: Christopher Whittington
Re: I'm disappointed with Stockfish dev.
connor_mcmonigle wrote: ↑Sun Mar 12, 2023 12:24 am It's immediately telling that the only individuals criticizing the Stockfish project's testing methodology are those who are not remotely experienced when it comes to actual chess engine development. SPRT both is statistically principled and has been empirically demonstrated to be effective across a wide variety of disciplines. It is true that results at STC aren't necessarily guaranteed to translate to results at LTC perfectly, but Stockfish's incredible progress in terms of Elo at a wide range of time controls over the last several years is a testament to the fact that the Stockfish project's testing methodology is effective.
This is true if there is only one hill to climb.
-
- Posts: 75
- Joined: Wed Sep 15, 2021 8:50 pm
- Full name: Albert Einstein
Re: I'm disappointed with Stockfish dev.
So we are all patiently waiting for the next big jump to the bigger hill. I believe, or want to believe, that Stockfish is not yet standing on top of the highest mountain.
-
- Posts: 5780
- Joined: Tue Feb 28, 2012 11:56 pm
Re: I'm disappointed with Stockfish dev.
chrisw wrote: ↑Sun Mar 12, 2023 12:03 am It's been reduced to a statistical game by reducing the game itself to 1, 0 and -1. If however, you were to regard the chess engine/game/author/development thing with features more than 1,0,-1 it would be not just more than a "game of statistics" but have other properties too. It must have those other properties otherwise you wouldn't be doing it and the general interest would sink to zero. One assumes.
I am not making an argument but stating a fact.
Sure, there is engine functionality not related to chess-playing strength, and I am not talking about that kind of functionality.
-
- Posts: 648
- Joined: Mon Jun 20, 2022 4:08 am
- Full name: Brian D. Smith
Re: I'm disappointed with Stockfish dev.
DrEinstein wrote: ↑Sun Mar 12, 2023 11:12 am So we are all patiently waiting for the next big jump to the bigger hill. I believe, or want to believe, that Stockfish is not yet standing on top of the highest mountain.
I wonder how one might define the next "big jump". All the 'big jumps' have likely come and gone as engine strength gets closer to topping out. What is left are likely 'little jumps'. The issue I (and I think others, but I do not speak for them) see is that those are harder to find...and, under the traditional testing framework, it is probably harder these days to actually know 'what tweaks' are responsible for those really very little jumps, if only because they fall closer to the 'margin of error'. You get a '+' and presume you 'have it' when it is part of multiple 'patches' working together...then later we find something in the tweaks/patches being disregarded or at least changed. And some people...do not seem to want to admit to seeing this 2 steps forward, 1 step back / 1 step forward, 2 steps back thing happening. But it is a viable 'blind approach' that can work over time.
I (like to think I) know a little about quantum physics. There, reality is just so 'odd' that no one currently fully understands it...you just "follow the math" into the darkness. Chess, though, is a different animal, as we know there are 'only' 10 to the 40 legal moves possible in a game, you play it on only 64 squares and Knights do not move like Bishops...etc.
Sure, you can see VERY slow, incremental progress with the path being taken (and steps backward...). However, being at a bit of a loss for exactly what tweak 'works' means it resembles 'wishcraft' more than science: throwing things against the wall and hoping 'something' sticks (and often not knowing exactly what stuck or why). It's almost like blindly taking herbs to combat Covid-19 until you eventually find in your testing a statistical 'hit' that seems to indicate 'something' in one of those herbs resulted in a tiny number of people not dying who otherwise might have...vs identifying 'what' specific thing in a given herb is actually responsible and using that...or looking at things differently and finding a spike protein and using it to alert the body's immune system to respond to something that looks like it...or viral vector technologies for dealing with other diseases, etc. Wishcraft vs Science. Both can work...but with one you tend to know 'why' it is working...which in theory should mean 'fewer steps back'.
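The "steps forward, steps back" pattern discussed in this post can be explored with a toy simulation. Everything here is an assumption made for illustration: the distribution of true patch values, the size of the measurement noise, the acceptance threshold and all names are hypothetical and are not measurements of Stockfish.

```python
import random
from dataclasses import dataclass

@dataclass
class Outcome:
    accepted: int      # patches that passed the noisy test
    regressions: int   # accepted patches whose true value was negative
    net_elo: float     # sum of the true Elo values of accepted patches

def simulate(n_patches: int = 1000, true_mean: float = -1.0, true_sd: float = 2.0,
             noise_sd: float = 1.0, threshold: float = 0.0, seed: int = 0) -> Outcome:
    """Accept every candidate patch whose noisy measurement exceeds the threshold."""
    rng = random.Random(seed)
    accepted = regressions = 0
    net = 0.0
    for _ in range(n_patches):
        true_gain = rng.gauss(true_mean, true_sd)        # unknown to the tester
        measured = true_gain + rng.gauss(0.0, noise_sd)  # what the test reports
        if measured > threshold:
            accepted += 1
            net += true_gain
            if true_gain < 0.0:
                regressions += 1
    return Outcome(accepted, regressions, net)

result = simulate()
print(result)
```

Under these assumed numbers some accepted patches are true regressions, yet the net of all accepted patches is positive, so both observations in the thread can hold at once. The sketch treats patches as independent and additive, which sidesteps the interaction concern raised above.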