I'm disappointed with Stockfish dev.

CornfedForever · Post by **CornfedForever** » Sat Feb 18, 2023 7:51 pm

syzygy wrote: ↑Sat Feb 18, 2023 6:50 pm
Eduard wrote: ↑Tue Feb 14, 2023 8:12 amBut: Can someone show me where the progress is based on practical positions? What can the new network do better than the old one (before Linrock)?
It has been known since forever that engine progress cannot be measured on single positions. You need to play games, MANY MANY games.

While what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.

Picking a BIG, DIVERSE, FIXED set of middlegame 'control' positions of various kinds (how you vet this set, I'm not sure...) to test an engine's evaluation of those positions across nets (and thus the move/path forward it would choose) would at least take all the 'game play' aspects out of the mix. Perhaps the constant release of new developmental versions with both engine and NNUE changes intertwined results in 'bad control': it would seem harder to tell whether it's the net or the engine change(s), or sometimes even how they interact with each net, that produces the small + or - in the Elo used to judge the worth of each developmental release.

I realize though that I may be wrong and Team SF has sufficient controls in place.

Sopel · Post by **Sopel** » Sat Feb 18, 2023 9:23 pm

The improvements of Stockfish are precise and reproducible given a significant sample size. You see issues because you expect Stockfish to be better on all positions, and you selectively choose the ones it regressed on (and recently you also have to selectively choose a time control).

syzygy · Post by **syzygy** » Sat Feb 18, 2023 10:04 pm

Uri Blass wrote: ↑Sat Feb 18, 2023 7:32 pm
syzygy wrote: ↑Sat Feb 18, 2023 6:50 pm
Eduard wrote: ↑Tue Feb 14, 2023 8:12 amBut: Can someone show me where the progress is based on practical positions? What can the new network do better than the old one (before Linrock)?
It has been known since forever that engine progress cannot be measured on single positions. You need to play games, MANY MANY games.
I agree that you need to play many games to measure progress, but
if the engine plays better, then there are positions in which it plays better moves, and Eduard asked for these positions.

Sure, but you cannot pick a position and expect that an improved engine will do better on that position.
And given an improved engine, you can always cherry pick a position on which it happens to do worse.

Individual positions are not interesting if we talk about the strength of an engine.

Tests are only at bullet time control and usually with biased book UHO_XXL_+0.90_+1.19.epd

There is no other way.
We cannot run 100,000 games at 40 moves per 2 hours to test a change.
But it is possible to complain about this until the heat death of the universe, and so people will do that.
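For a sense of scale, here is a rough back-of-the-envelope sketch in Python. The per-game durations are assumptions, not measurements: a 40 moves per 2 hours game is taken to last about 4 hours (both clocks fully used up to move 40), and a 10+0.1 game is taken to last 60 moves per side with both sides using all their time. `concurrent_games` is a hypothetical count of games running in parallel.

```python
SECONDS_PER_DAY = 86_400


def game_seconds(base_s, inc_s, moves_per_side):
    """Upper bound on game length if both sides use all of their time."""
    return 2 * (base_s + inc_s * moves_per_side)


def wall_clock_days(games, seconds_per_game, concurrent_games=1):
    """Days needed to finish `games` when `concurrent_games` run in parallel."""
    return games * seconds_per_game / concurrent_games / SECONDS_PER_DAY


# Assumed durations (illustrative only).
CLASSICAL_S = game_seconds(2 * 3600, 0, 40)  # 40 moves per 2 hours, no increment
BULLET_S = game_seconds(10, 0.1, 60)         # 10+0.1, assumed 60 moves per side

if __name__ == "__main__":
    for label, secs in (("40 moves/2h", CLASSICAL_S), ("10+0.1", BULLET_S)):
        days = wall_clock_days(100_000, secs, concurrent_games=1_000)
        print(f"{label}: about {days:.2f} days for 100,000 games on 1,000 parallel games")
```

Under these assumptions a single long-time-control test costs several hundred times as much machine time as the same test at bullet, which is the trade-off being described.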

syzygy · Post by **syzygy** » Sat Feb 18, 2023 10:09 pm

CornfedForever wrote: ↑Sat Feb 18, 2023 7:51 pmWhile what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.

A new net is a patch like any other patch.

abgursu · Post by **abgursu** » Sat Feb 18, 2023 11:11 pm

syzygy wrote: ↑Sat Feb 18, 2023 10:09 pm
CornfedForever wrote: ↑Sat Feb 18, 2023 7:51 pmWhile what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.
A new net is a patch like any other patch.

Agreed. You guys have to understand that new nets only gain something like 2-3 Elo. So the strength is almost the same, and most nets will agree on a quiet position 99% of the time. The positions on which they disagree won't be solved completely, and you can't really say a position is the deciding one of the game just because they disagree on it. There are only two ways to test nets, and this whole topic started from the inconsistency between them. One is the way the SF team uses: playing many, many games and seeing whether there is any improvement. The other is testing on tactics. There are obvious inconsistencies between the two, and you have to choose your own path.
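To illustrate why a gain of 2-3 Elo needs so many games, here is a minimal sketch of the usual error-bar estimate. It assumes independent games between nearly equal engines and a normal approximation; the 60% draw rate is a made-up illustrative figure, and the function names are hypothetical.

```python
import math

# Slope of the Elo curve at a 50% score: d/ds [400 * log10(s / (1 - s))] at s = 0.5.
ELO_PER_SCORE = 1600 / math.log(10)


def score_sigma(draw_rate):
    """Per-game standard deviation of the score (1, 0.5, 0) for equal engines."""
    return 0.5 * math.sqrt(1 - draw_rate)


def elo_margin(games, draw_rate, z=1.96):
    """Approximate half-width of the confidence interval, in Elo, after `games` games."""
    return z * ELO_PER_SCORE * score_sigma(draw_rate) / math.sqrt(games)


def games_needed(margin_elo, draw_rate, z=1.96):
    """Games required for the confidence interval half-width to shrink to `margin_elo`."""
    return math.ceil((z * ELO_PER_SCORE * score_sigma(draw_rate) / margin_elo) ** 2)


if __name__ == "__main__":
    for margin in (3.0, 2.0, 1.0):
        print(f"+/- {margin} Elo needs about {games_needed(margin, 0.6):,} games")
```

Because the margin shrinks only with the square root of the number of games, halving the margin costs four times as many games. With these assumptions, resolving a 2 Elo difference takes tens of thousands of games, and a handful of test positions cannot come close.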

Uri Blass · Post by **Uri Blass** » Sat Feb 18, 2023 11:14 pm

syzygy wrote: ↑Sat Feb 18, 2023 10:04 pm
Uri Blass wrote: ↑Sat Feb 18, 2023 7:32 pm
syzygy wrote: ↑Sat Feb 18, 2023 6:50 pm
Eduard wrote: ↑Tue Feb 14, 2023 8:12 amBut: Can someone show me where the progress is based on practical positions? What can the new network do better than the old one (before Linrock)?
It has been known since forever that engine progress cannot be measured on single positions. You need to play games, MANY MANY games.
I agree that you need to play many games to measure progress, but
if the engine plays better, then there are positions in which it plays better moves, and Eduard asked for these positions.
Sure, but you cannot pick a position and expect that an improved engine will do better on that position.
And given an improved engine, you can always cherry pick a position on which it happens to do worse.

Individual positions are not interesting if we talk about the strength of an engine.

Tests are only at bullet time control and usually with biased book UHO_XXL_+0.90_+1.19.epd
There is no other way.
We cannot run 100,000 games at 40 moves per 2 hours to test a change.
But it is possible to complain about this until the heat death of the universe, and so people will do that.

1) I agree that there are positions where a better engine does worse, but
there are also positions where a better engine does better.
I think the individual positions where it does better are clearly interesting for understanding in what type of positions the engine does better.

2) I did not ask for running 100,000 games at 40 moves per 2 hours.

Here again is what I wrote:
"I think that if they test only patches that pass for no regression with normal book and 8 cores for engine and 60+0.6 time control with 8 cores then it may be better for Stockfish's development."
It means:
a) I suggested a third step for no regression at a time control of 60+0.6 with 8 cores, not 40 moves per 2 hours.
b) I suggested a normal book, not UHO_XXL_+0.90_+1.19.epd, in these tests.
c) I suggested testing in the third step only patches that passed stage 1 and stage 2, which are a clear minority of patches (most patches fail at 10+0.1 or 60+0.6 with a single core, and I did not suggest changes in testing them).

CornfedForever · Post by **CornfedForever** » Sat Feb 18, 2023 11:33 pm

abgursu wrote: ↑Sat Feb 18, 2023 11:11 pm
syzygy wrote: ↑Sat Feb 18, 2023 10:09 pm
CornfedForever wrote: ↑Sat Feb 18, 2023 7:51 pmWhile what you say has traditionally made sense, remember that it's not the playability of the engine (all the developmental tweaks TO THE ENGINE each week) people 'seem' to be complaining about but rather the nets. I may not agree with a lot of what Eduard says, but here perhaps he has a point.
A new net is a patch like any other patch.
Agreed. You guys have to understand that new nets only gain something like 2-3 Elo. So the strength is almost the same, and most nets will agree on a quiet position 99% of the time. The positions on which they disagree won't be solved completely, and you can't really say a position is the deciding one of the game just because they disagree on it. There are only two ways to test nets, and this whole topic started from the inconsistency between them. One is the way the SF team uses: playing many, many games and seeing whether there is any improvement. The other is testing on tactics. There are obvious inconsistencies between the two, and you have to choose your own path.

You mention that "New nets only gain like 2-3 elo". Okay, but a new net can also lose elo, right? So when you couple a 'new net' and a 'tweaked engine' (other than the net), how do you know if it's really the net that has lost the elo or the tweak in the engine? I guess that's what I was getting at with my earlier question. As so many new development versions come with new nets...it seems like it would be difficult to tell which (or if both) resulted in the lost/gained elo.

syzygy · Post by **syzygy** » Sat Feb 18, 2023 11:33 pm

Uri Blass wrote: ↑Sat Feb 18, 2023 11:14 pm 1) I agree that there are positions where a better engine does worse, but
there are also positions where a better engine does better.
I think the individual positions where it does better are clearly interesting for understanding in what type of positions the engine does better.

Have fun picking cherries until the heat death of the universe.

2) I did not ask for running 100,000 games at 40 moves per 2 hours.

Here again is what I wrote:
"I think that if they test only patches that pass for no regression with normal book and 8 cores for engine and 60+0.6 time control with 8 cores then it may be better for Stockfish's development."
It means:
a) I suggested a third step for no regression at a time control of 60+0.6 with 8 cores, not 40 moves per 2 hours.
b) I suggested a normal book, not UHO_XXL_+0.90_+1.19.epd, in these tests.
c) I suggested testing in the third step only patches that passed stage 1 and stage 2, which are a clear minority of patches (most patches fail at 10+0.1 or 60+0.6 with a single core, and I did not suggest changes in testing them).

The SF team has to try to make optimal use of its resources. In the end, engine development is just a statistics game.

syzygy · Post by **syzygy** » Sat Feb 18, 2023 11:34 pm

CornfedForever wrote: ↑Sat Feb 18, 2023 11:33 pmYou mention that "New nets only gain like 2-3 elo". Okay, but a new net can also lose elo, right? So when you couple a 'new net' and a 'tweaked engine' (other than the net), how do you know if it's really the net that has lost the elo or the tweak in the engine?

Just like any other patch.

I guess that's what I was getting at with my earlier question. As so many new development versions come with new nets...it seems like it would be difficult to tell which (or if both) resulted in the lost/gained elo.

Just like any other patch.

The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.
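One common way to make "a game of statistics" concrete is a sequential probability ratio test (SPRT): keep playing games until the evidence favours either "no gain" or "a small gain" at chosen error rates. The sketch below is not taken from the Stockfish testing code. The hypotheses (`elo0=0`, `elo1=2`), the 5% error rates, and the normal approximation of the log-likelihood ratio are all assumptions chosen for illustration.

```python
import math


def expected_score(elo):
    """Expected score for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) versus H0 (elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    # Sample variance of the per-game score (1 for a win, 0.5 for a draw, 0 for a loss).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    if var == 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(wins, draws, losses, elo0=0.0, elo1=2.0, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0', or 'continue' for the results so far."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    print(sprt_decision(wins=100, draws=200, losses=100))
```

The error rates `alpha` and `beta` are the point: a test like this accepts a small, known chance of passing a bad patch or rejecting a good one, in exchange for stopping as early as the data allow.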

CornfedForever · Post by **CornfedForever** » Sun Feb 19, 2023 12:40 am

syzygy wrote: ↑Sat Feb 18, 2023 11:34 pm
CornfedForever wrote: ↑Sat Feb 18, 2023 11:33 pmYou mention that "New nets only gain like 2-3 elo". Okay, but a new net can also lose elo, right? So when you couple a 'new net' and a 'tweaked engine' (other than the net), how do you know if it's really the net that has lost the elo or the tweak in the engine?
Just like any other patch.

I guess that's what I was getting at with my earlier question. As so many new development versions come with new nets...it seems like it would be difficult to tell which (or if both) resulted in the lost/gained elo.
Just like any other patch.

The SF development process does not require 100% certainty that a patch gains Elo. It is a game of statistics.
