syzygy wrote: ↑Sat Feb 18, 2023 10:04 pm
Uri Blass wrote: ↑Sat Feb 18, 2023 7:32 pm
syzygy wrote: ↑Sat Feb 18, 2023 6:50 pm
Eduard wrote: ↑Tue Feb 14, 2023 8:12 am
But: Can someone show me where the progress is based on practical positions? What can the new network do better than the old one (before Linrock)?
It has been known since forever that engine progress cannot be measured on single positions. You need to play games, MANY MANY games.
I agree that you need to play many games to measure progress, but if the engine plays better, then there are positions in which it plays better moves, and Eduard asked for these positions.
Sure, but you cannot pick a position and expect that an improved engine will do better on that position.
And given an improved engine, you can always cherry pick a position on which it happens to do worse.
Individual positions are not interesting if we talk about the strength of an engine.
Tests are run only at bullet time controls, and usually with the biased book UHO_XXL_+0.90_+1.19.epd.
There is no other way.
We cannot run 100,000 games at 40 moves per 2 hours to test a change.
But it is possible to complain about this until the heat death of the universe, and so people will do that.
1) I agree that there are positions where a better engine does worse, but there are also positions where a better engine does better.
I think the individual positions where it does better are clearly interesting, because they help us understand in what types of positions the engine does better.
2) I did not ask for running 100,000 games at 40 moves per 2 hours.
Here again is what I wrote:
"I think that if they test only patches that pass for no regression with a normal book, 8 cores per engine, and a 60+0.6 time control, then it may be better for Stockfish's development."
It means:
a) I suggested a third, no-regression step at a 60+0.6 time control with 8 cores, not at 40 moves per 2 hours.
b) I suggested a normal book, not UHO_XXL_+0.90_+1.19.epd, in these tests.
c) I suggested testing in the third step only patches that passed stage 1 and stage 2, which are a clear minority of patches (most patches fail at 10+0.1 or at 60+0.6 with a single core, and I did not suggest changing how those are tested).
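For anyone wondering why the game counts in this thread get so large, here is a rough sketch of how many games a fixed-length match needs before its 95% error margin shrinks to the size of the Elo gain being measured. It uses a plain normal approximation, not the sequential test (SPRT) that Stockfish's testing framework actually uses, and the 2 Elo gain and 60% draw ratio are my own assumed inputs, not numbers from any real test.

```python
import math


def expected_score(elo_diff):
    """Expected score per game under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_margin(n_games, elo_diff, draw_ratio, z=1.96):
    """Approximate error margin (in Elo) after n_games.

    Assumes independent games and a fixed draw ratio (both simplifications).
    """
    mu = expected_score(elo_diff)
    win = mu - draw_ratio / 2.0
    loss = 1.0 - win - draw_ratio
    if win < 0.0 or loss < 0.0:
        raise ValueError("draw_ratio is too high for this Elo difference")
    # Variance of a single game score, which is 1, 0.5 or 0.
    variance = win + draw_ratio / 4.0 - mu * mu
    # Delta method: slope of Elo as a function of the mean score.
    slope = 400.0 / (math.log(10.0) * mu * (1.0 - mu))
    return z * math.sqrt(variance / n_games) * slope


def games_needed(elo_diff, draw_ratio, z=1.96):
    """Games needed so that the error margin is no larger than elo_diff."""
    per_game = elo_margin(1, elo_diff, draw_ratio, z)
    return math.ceil((per_game / elo_diff) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: a 2 Elo gain and a 60% draw ratio.
    print(games_needed(2.0, 0.6))
    print(elo_margin(1000, 2.0, 0.6))
```

With these assumed inputs the count comes out in the tens of thousands of games, and it grows with the inverse square of the Elo gain, so halving the gain roughly quadruples the games. That is the arithmetic behind both the bullet time controls and the idea of sending only the minority of already-passing patches to a slower third step.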