I'm disappointed with Stockfish dev.

CornfedForever
Posts: 648
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

Re: I'm disappointed with Stockfish dev.

Post by CornfedForever »

syzygy wrote: Sat Feb 18, 2023 6:50 pm
Eduard wrote: Tue Feb 14, 2023 8:12 amBut: Can someone show me where the progress is based on practical positions? What can the new network do better than the old one (before Linrock)?
It has been known since forever that engine progress cannot be measured on single positions. You need to play games, MANY MANY games.
While what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.

Picking a BIG, DIVERSE, FIXED set of middlegame 'control' positions of various kinds (how you vet this set, I'm not sure...) to test an engine's evaluation of those positions across nets (and thus the move/path forward it would choose) would at least take all the 'game play' aspects out of the mix. Perhaps the constant release of new developmental versions, with engine and NNUE changes intertwined, results in 'bad control': it becomes harder to tell whether the net or the engine change(s), or sometimes even how they interact with each net, produced the little + or - in the Elo used to judge the worth of each developmental release.

I realize though that I may be wrong and Team SF has sufficient controls in place.
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: I'm disappointed with Stockfish dev.

Post by Sopel »

The improvements of Stockfish are precise and reproducible given a significant sample size. You see issues because you expect Stockfish to be better on all positions and selectively choose ones that it regressed on (and recently you also have to selectively choose a time control).
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: I'm disappointed with Stockfish dev.

Post by syzygy »

Uri Blass wrote: Sat Feb 18, 2023 7:32 pm
syzygy wrote: Sat Feb 18, 2023 6:50 pm
Eduard wrote: Tue Feb 14, 2023 8:12 amBut: Can someone show me where the progress is based on practical positions? What can the new network do better than the old one (before Linrock)?
It has been known since forever that engine progress cannot be measured on single positions. You need to play games, MANY MANY games.
I agree that you need to play many games to measure progress, but if the engine plays better, then there are positions where it plays better moves, and Eduard asked for these positions.
Sure, but you cannot pick a position and expect that an improved engine will do better on that position.
And given an improved engine, you can always cherry pick a position on which it happens to do worse.

Individual positions are not interesting if we talk about the strength of an engine.
Tests are only at bullet time control and usually with biased book UHO_XXL_+0.90_+1.19.epd
There is no other way.
We cannot run 100,000 games at 40 moves per 2 hours to test a change.
But it is possible to complain about this until the heat death of the universe, and so people will do that.
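As a back-of-the-envelope illustration of why 100,000 games at 40 moves per 2 hours is out of reach, here is a minimal sketch. The per-game durations are assumptions picked for illustration (roughly 4 hours per long game, roughly 40 seconds per game at 10+0.1), not measured values, and `total_hours` is a hypothetical helper.

```python
def total_hours(games: int, hours_per_game: float, concurrency: int = 1) -> float:
    """Wall-clock hours needed to play `games` games, `concurrency` at a time."""
    return games * hours_per_game / concurrency


# Assumed averages, for illustration only.
LONG_TC_HOURS_PER_GAME = 4.0       # assumption: 40 moves per 2 hours, per side
BULLET_HOURS_PER_GAME = 40 / 3600  # assumption: about 40 seconds at 10+0.1

long_tc = total_hours(100_000, LONG_TC_HOURS_PER_GAME)
bullet = total_hours(100_000, BULLET_HOURS_PER_GAME)

print(f"long time control: {long_tc / (24 * 365):.1f} machine-years")
print(f"bullet:            {bullet / 24:.1f} machine-days")
```

Under these assumptions the same number of games costs several hundred times more compute at the long time control, which is the trade-off being argued about.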
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: I'm disappointed with Stockfish dev.

Post by syzygy »

CornfedForever wrote: Sat Feb 18, 2023 7:51 pmWhile what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.
A new net is a patch like any other patch.
abgursu
Posts: 92
Joined: Thu May 14, 2020 3:34 pm
Full name: A. B. Gursu

Re: I'm disappointed with Stockfish dev.

Post by abgursu »

syzygy wrote: Sat Feb 18, 2023 10:09 pm
CornfedForever wrote: Sat Feb 18, 2023 7:51 pmWhile what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.
A new net is a patch like any other patch.
Agreed. You guys have to understand that new nets only gain something like 2-3 Elo. So the strength is almost the same, and most nets will agree on a quiet position 99% of the time. The positions where they disagree won't be solved completely, and you can't really say that a position where they disagree is the one that decides the game. There are only two ways to test nets, and this whole topic started from the inconsistency between them. One is the way the SF team uses: play many, many games and see whether there is any improvement. The other is testing on tactical positions. There are obvious inconsistencies between the two, and you have to choose your path on your own.
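To put a gain of 2-3 Elo in perspective, here is a rough sketch of how many games it takes before such a gain stands out from noise. It assumes the usual logistic Elo model, a 60% draw rate, and a 95% normal-approximation margin. The draw rate and the function names are illustrative assumptions, not Stockfish's actual procedure.

```python
import math


def score_from_elo(elo: float) -> float:
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def elo_from_score(score: float) -> float:
    """Inverse of score_from_elo; score must lie strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def games_needed(elo_diff: float, draw_ratio: float = 0.6, z: float = 1.96) -> int:
    """Games required so the z-sigma margin on the mean score
    shrinks to the score edge implied by elo_diff."""
    edge = score_from_elo(elo_diff) - 0.5
    # Near equality a game scores 1, 0.5 or 0; only decisive games add
    # variance, each contributing (0.5)^2 = 0.25.
    variance = (1.0 - draw_ratio) * 0.25
    return math.ceil((z * math.sqrt(variance) / edge) ** 2)


for elo in (2, 3, 10, 30):
    print(f"{elo:>2} Elo: about {games_needed(elo)} games")
```

With these assumptions a 2-3 Elo gain needs tens of thousands of games to resolve, while a single position contributes at most one data point.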
Uri Blass
Posts: 10790
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: I'm disappointed with Stockfish dev.

Post by Uri Blass »

syzygy wrote: Sat Feb 18, 2023 10:04 pm
Uri Blass wrote: Sat Feb 18, 2023 7:32 pm
syzygy wrote: Sat Feb 18, 2023 6:50 pm
Eduard wrote: Tue Feb 14, 2023 8:12 amBut: Can someone show me where the progress is based on practical positions? What can the new network do better than the old one (before Linrock)?
It has been known since forever that engine progress cannot be measured on single positions. You need to play games, MANY MANY games.
I agree that you need to play many games to measure progress, but if the engine plays better, then there are positions where it plays better moves, and Eduard asked for these positions.
Sure, but you cannot pick a position and expect that an improved engine will do better on that position.
And given an improved engine, you can always cherry pick a position on which it happens to do worse.

Individual positions are not interesting if we talk about the strength of an engine.
Tests are only at bullet time control and usually with biased book UHO_XXL_+0.90_+1.19.epd
There is no other way.
We cannot run 100,000 games at 40 moves per 2 hours to test a change.
But it is possible to complain about this until the heat death of the universe, and so people will do that.
1) I agree that there are positions where a better engine does worse, but there are also positions where a better engine does better.
I think the individual positions where it does better are clearly interesting for understanding in what type of positions the engine does better.

2) I did not ask for running 100,000 games at 40 moves per 2 hours.

Here again is what I wrote:
"I think that if they test only patches that pass for no regression with normal book and 8 cores for engine and 60+0.6 time control with 8 cores then it may be better for Stockfish's development."
It means:
a) I suggested a third step, a non-regression test at a time control of 60+0.6 with 8 cores, not at 40 moves per 2 hours.
b) I suggested a normal book rather than UHO_XXL_+0.90_+1.19.epd in these tests.
c) I suggested testing in the third step only patches that passed stage 1 and stage 2, which are a clear minority of patches (most patches fail at 10+0.1 or 60+0.6 with a single core, and I did not suggest changing how those are tested).
CornfedForever
Posts: 648
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

Re: I'm disappointed with Stockfish dev.

Post by CornfedForever »

abgursu wrote: Sat Feb 18, 2023 11:11 pm
syzygy wrote: Sat Feb 18, 2023 10:09 pm
CornfedForever wrote: Sat Feb 18, 2023 7:51 pmWhile what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.
A new net is a patch like any other patch.
Agreed. You guys have to understand that new nets only gain something like 2-3 Elo. So the strength is almost the same, and most nets will agree on a quiet position 99% of the time. The positions where they disagree won't be solved completely, and you can't really say that a position where they disagree is the one that decides the game. There are only two ways to test nets, and this whole topic started from the inconsistency between them. One is the way the SF team uses: play many, many games and see whether there is any improvement. The other is testing on tactical positions. There are obvious inconsistencies between the two, and you have to choose your path on your own.
You mention that "New nets only gain like 2-3 elo". Okay, but a new net can also lose elo, right? So when you couple a 'new net' and a 'tweaked engine' (other than the net), how do you know if it's really the net that has lost the elo or the tweak in the engine? I guess that's what I was getting at with my earlier question. As so many new development versions come with new nets...it seems like it would be difficult to tell which (or if both) resulted in the lost/gained elo.
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: I'm disappointed with Stockfish dev.

Post by syzygy »

Uri Blass wrote: Sat Feb 18, 2023 11:14 pm 1) I agree that there are positions where a better engine does worse, but there are also positions where a better engine does better.
I think the individual positions where it does better are clearly interesting for understanding in what type of positions the engine does better.
Have fun picking cherries until the heat death of the universe.
2) I did not ask for running 100,000 games at 40 moves per 2 hours.

Here again is what I wrote:
"I think that if they test only patches that pass for no regression with normal book and 8 cores for engine and 60+0.6 time control with 8 cores then it may be better for Stockfish's development."
It means:
a) I suggested a third step, a non-regression test at a time control of 60+0.6 with 8 cores, not at 40 moves per 2 hours.
b) I suggested a normal book rather than UHO_XXL_+0.90_+1.19.epd in these tests.
c) I suggested testing in the third step only patches that passed stage 1 and stage 2, which are a clear minority of patches (most patches fail at 10+0.1 or 60+0.6 with a single core, and I did not suggest changing how those are tested).
The SF team has to try to make optimal use of its resources. In the end, engine development is just a statistics game.
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: I'm disappointed with Stockfish dev.

Post by syzygy »

CornfedForever wrote: Sat Feb 18, 2023 11:33 pmYou mention that "New nets only gain like 2-3 elo". Okay, but a new net can also lose elo, right? So when you couple a 'new net' and a 'tweaked engine' (other than the net), how do you know if it's really the net that has lost the elo or the tweak in the engine?
Just like any other patch.
I guess that's what I was getting at with my earlier question. As so many new development versions come with new nets...it seems like it would be difficult to tell which (or if both) resulted in the lost/gained elo.
Just like any other patch.

The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.
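One way to picture "a game of statistics" is a sequential probability ratio test (SPRT), which accepts or rejects a patch as soon as the accumulated evidence crosses a threshold, accepting a small, controlled error rate. The sketch below uses a simple normal approximation on win/draw/loss counts. The Elo bounds (0 and 2), the 5% error rates, and all function names are illustrative assumptions, not the exact test the SF team runs.

```python
import math


def expected_score(elo: float) -> float:
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score (1 for a win, 0.5 for a draw, 0 for a loss).
    variance = (wins + 0.25 * draws) / n - mean * mean
    if variance <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    # Normal approximation: n * [(mean - s0)^2 - (mean - s1)^2] / (2 * variance).
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * variance)


def sprt_decision(wins: int, draws: int, losses: int,
                  elo0: float = 0.0, elo1: float = 2.0,
                  alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return 'accept', 'reject', or 'continue' (keep playing games)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept"
    if llr <= lower:
        return "reject"
    return "continue"


print(sprt_decision(wins=10, draws=20, losses=10))
print(sprt_decision(wins=6000, draws=12000, losses=5000))
```

The point of the design is that no single game or position decides anything: the test only stops once the whole sample is convincing at the chosen error rates, and with 5% error rates some accepted patches will still be neutral or slightly negative.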
CornfedForever
Posts: 648
Joined: Mon Jun 20, 2022 4:08 am
Full name: Brian D. Smith

Re: I'm disappointed with Stockfish dev.

Post by CornfedForever »

syzygy wrote: Sat Feb 18, 2023 11:34 pm
CornfedForever wrote: Sat Feb 18, 2023 11:33 pmYou mention that "New nets only gain like 2-3 elo". Okay, but a new net can also lose elo, right? So when you couple a 'new net' and a 'tweaked engine' (other than the net), how do you know if it's really the net that has lost the elo or the tweak in the engine?
Just like any other patch.
I guess that's what I was getting at with my earlier question. As so many new development versions come with new nets...it seems like it would be difficult to tell which (or if both) resulted in the lost/gained elo.
Just like any other patch.

The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.