Stockfish 2.3.1 weaker than 2.2.2?

Matthias Gemuh · Post by **Matthias Gemuh** » Wed Sep 26, 2012 7:58 pm

mcostalba wrote:...
Marco

Marco, can you change SF time management so that it does not use the increment until after it has played its move?

See http://74.220.23.57/forum/viewtopic.php?p=484664#484664

Matthias.

mcostalba · Post by **mcostalba** » Wed Sep 26, 2012 8:04 pm

Matthias Gemuh wrote: See http://74.220.23.57/forum/viewtopic.php?p=484664#484664

This is interesting. Thanks for reporting. I will look at this issue.

Modern Times · Post by **Modern Times** » Wed Sep 26, 2012 8:48 pm

mcostalba wrote: I would like also to thank Ingo, Werner and the CEGT, Ray and all the other people that are testing this release: I know I made your job a tad difficult due to the small ELO increase and the different releases. I promise, also to myself, that the next one will be better prepared.

Thanks
Marco

No problem, blitz isn't a huge effort.

After some more games, 40/4 standard chess shows +9 Elo: better than nothing, but within the error bars, as you pointed out.

zamar · Post by **zamar** » Wed Sep 26, 2012 11:12 pm

Matthias Gemuh wrote:
mcostalba wrote:...
Marco
Marco, can you change SF time management so that it does not use the increment until after it has played its move?

See http://74.220.23.57/forum/viewtopic.php?p=484664#484664

Matthias.

Hi Matthias,

I've played >1000 1+1' test blitz games with SF using XBoard. SF often gets really low on time (0.3 seconds), but it has never overstepped the time limit.

The current time management code doesn't use the increment before the move is played (unless there is a bug).

My first thought is that the cause is slow interaction between the GUI and the engine. The default Emergency Base Time is 300 ms; you may want to increase that...
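To make the role of the emergency reserve concrete, here is a minimal sketch of a per-move budget on an increment clock. It is not Stockfish's actual time management code: the function name, the `horizon_moves` parameter and the formula are assumptions made for illustration, and only the 300 ms default comes from the post above.

```python
def safe_move_budget_ms(remaining_ms, increment_ms,
                        emergency_base_ms=300, horizon_moves=30):
    """Hypothetical per-move time budget for a Fischer (increment) clock."""
    # Only time already on the clock can be spent: the increment is credited
    # after the move is played, so it must not be borrowed in advance.
    # The emergency reserve absorbs GUI/engine communication lag.
    spendable_ms = max(remaining_ms - emergency_base_ms, 0)

    # Planned share (an assumed heuristic): spread the current clock over a
    # fixed number of future moves and count on one increment per move.
    planned_ms = remaining_ms / horizon_moves + increment_ms

    # Never plan to use more than what is safely spendable right now.
    return int(min(planned_ms, spendable_ms))


if __name__ == "__main__":
    # 1.3 s left on the clock, 1 s increment, two different reserves.
    for reserve_ms in (300, 600):
        budget = safe_move_budget_ms(1300, 1000, emergency_base_ms=reserve_ms)
        print(reserve_ms, budget)
```

With 1.3 s on the clock and a 1 s increment, raising the reserve from 300 ms to 600 ms lowers the budget from 1000 ms to 700 ms, which leaves more slack for a slow GUI.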

Matthias Gemuh · Post by **Matthias Gemuh** » Wed Sep 26, 2012 11:51 pm

zamar wrote: Hi Matthias,

I've played >1000 1+1' test blitz games with SF using XBoard. SF often gets really low on time (0.3 seconds), but it has never overstepped the time limit.

The current time management code doesn't use the increment before the move is played (unless there is a bug).

My first thought is that the cause is slow interaction between the GUI and the engine. The default Emergency Base Time is 300 ms; you may want to increase that...

ChessGUI has a near-perfect solution against GUI time lag.
Stockfish and RedQueen are the only two engines above 1800 Elo that lose on time under ChessGUI (if "lag-free Timing" is used) in Fischer TC.

Anyway, I shall henceforth use a larger Emergency Base Time.
Thanks (to Ray also) for the hint.

Matthias.

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Thu Sep 27, 2012 12:55 pm

Hi Gary,
I think you have made some changes, tested them against 2.2.2, and then tuned the values until 2.3 was clearly superior to 2.2.2. In this case the reasonable thing to do would be to tune the changes against a wider range of opponents, the way you did with 2.2.2. The changes might be good, but not tuned adequately.
It would also be interesting to know how 2.3 fares in blitz against 2.2.2.

Best regards,
Ludmil

gerold · Post by **gerold** » Thu Sep 27, 2012 3:47 pm

Lyudmil Tsvetkov wrote: Hi Gary,
I think you have made some changes, tested them against 2.2.2, and then tuned the values until 2.3 was clearly superior to 2.2.2. In this case the reasonable thing to do would be to tune the changes against a wider range of opponents, the way you did with 2.2.2. The changes might be good, but not tuned adequately.
It would also be interesting to know how 2.3 fares in blitz against 2.2.2.

Best regards,
Ludmil

After 100 games (which is far too few) at 1/5 TC, the two engines scored equally.
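A rough calculation shows why 100 games cannot resolve differences of a few Elo. The sketch below uses the usual logistic Elo curve and a normal approximation; the function name and the draw ratios are assumptions for illustration, not figures from this thread.

```python
import math


def elo_margin_95(games, draw_ratio=0.0, score=0.5):
    """Approximate 95% margin, in Elo, for a match of `games` games."""
    win_ratio = score - draw_ratio / 2.0
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = win_ratio + draw_ratio / 4.0 - score ** 2
    std_error = math.sqrt(variance / games)
    # Slope of the logistic Elo curve at the observed score (delta method).
    elo_per_score = 400.0 / (math.log(10) * score * (1.0 - score))
    return 1.96 * std_error * elo_per_score


if __name__ == "__main__":
    for games in (100, 1000, 10000):
        # Assumed draw ratio of 50%; real ratios depend on engines and TC.
        print(games, round(elo_margin_95(games, draw_ratio=0.5), 1))
```

Under these assumptions, the margin for an even 100-game match is several tens of Elo, far wider than the single-digit differences reported by the rating lists. It shrinks only with the square root of the number of games.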

zamar · Post by **zamar** » Fri Sep 28, 2012 12:59 am

gladius wrote:
lkaufman wrote:
gladius wrote:Agreed, the results are very disappointing. The improvements were tested against Stockfish 2.2.2. It seems while they were good in heads up matches, they made things worse against weaker opponents, and didn't help against stronger ones.
My tests indicate that 2.3.1 is a clear improvement even against foreign opponents at hyperspeed levels. So the problem, if there is one, is not the choice of opponents but the time control of the tests. My guess is that the change involving lateral attacks on pawns, being a tactical term, is great at speeds like game/10" but pretty useless at IPON levels.
Interesting, thanks Larry. A few of the evaluation changes were more tactical terms (pinned piece penalty, undefended pieces, and rook-pawn-rank bonus). So, that could be an explanation.

I'm going back now and applying each eval change to 2.2.2, and testing against a wider set of opponents (still at hyperblitz, 4s+0.05). It will be interesting to see how the changes do there. If they do the same, I guess testing at longer TC is the only way to go.

Seriously, I don't think that there is anything wrong with the current testing method.

Looking at the results now (CCRL FRC: +9 Elo, CCRL 40/4: +9 Elo, IPON: -7 Elo, CEGT 40/20: -1 Elo), it's entirely possible, and I'd say even likely, that SF 2.3.1 is ~5 Elo stronger than SF 2.2.2.

If you look at the history of recent SF releases, in almost every release the new version has done worse than the previous one on some rating list. Still, in the long run it has been going up on all rating lists.

This is completely natural, and one just needs to learn to live with it and have confidence in slow long-term progress.

gladius · Post by **gladius** » Fri Sep 28, 2012 3:46 am

zamar wrote: Seriously, I don't think that there is anything wrong with the current testing method.

Looking at the results now (CCRL FRC: +9 Elo, CCRL 40/4: +9 Elo, IPON: -7 Elo, CEGT 40/20: -1 Elo), it's entirely possible, and I'd say even likely, that SF 2.3.1 is ~5 Elo stronger than SF 2.2.2.

If you look at the history of recent SF releases, in almost every release the new version has done worse than the previous one on some rating list. Still, in the long run it has been going up on all rating lists.

This is completely natural, and one just needs to learn to live with it and have confidence in slow long-term progress.

Yes, the progress on Stockfish has been great! However, with each change we made for 2.3.1, things looked quite positive. The sum of all those changes seems to be that 2.3.1 is about equal, and maybe a bit stronger. So, something is definitely amiss.

If it's easy to tweak things a bit and improve the ratio of Elo gained per patch, then we should definitely try. For example, the Komodo folks seem not to rely on self-play tests as much.

zamar · Post by **zamar** » Fri Sep 28, 2012 9:18 am

gladius wrote: Yes, the progress on Stockfish has been great! However, with each change we made for 2.3.1, things looked quite positive. The sum of all those changes seems to be that 2.3.1 is about equal, and maybe a bit stronger. So, something is definitely amiss.

Not necessarily. Two things must be kept in mind:

- Self-play always exaggerates things. Against other engines the actual change is about half of the improvement measured in self-play.

- Selection bias. You cannot sum up Elo gains from separate tests; the result will be too high.
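The selection-bias point can be illustrated with a small simulation. Every number below (spread of true gains, measurement noise, acceptance threshold) is an assumption chosen for illustration and is not taken from Stockfish testing. Patches are accepted only when their noisy measured gain clears a threshold, and the sum of the measured gains of the accepted patches is then compared with the sum of their true gains.

```python
import random


def simulate_selection_bias(n_patches=2000, true_sd=2.0, noise_sd=5.0,
                            accept_above=3.0, seed=0):
    """Return (accepted, measured_sum, true_sum) for hypothetical patches."""
    rng = random.Random(seed)
    accepted = 0
    measured_sum = 0.0
    true_sum = 0.0
    for _ in range(n_patches):
        true_gain = rng.gauss(0.0, true_sd)              # real Elo effect
        measured = true_gain + rng.gauss(0.0, noise_sd)  # noisy test result
        if measured > accept_above:                      # only "winners" are kept
            accepted += 1
            measured_sum += measured
            true_sum += true_gain
    return accepted, measured_sum, true_sum


if __name__ == "__main__":
    accepted, measured_sum, true_sum = simulate_selection_bias()
    print("accepted patches:", accepted)
    print("sum of measured gains:", round(measured_sum, 1))
    print("sum of true gains:", round(true_sum, 1))
```

Because a patch is kept only when its measurement happens to come out high, the accepted set is enriched with lucky results, so the measured total overstates the true total. The halving for self-play mentioned in the first point would come on top of this.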
