Stockfish 2.3.1 weaker than 2.2.2?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: bob, hgm, Harvey Williamson

Matthias Gemuh
Posts: 3238
Joined: Thu Mar 09, 2006 8:10 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by Matthias Gemuh » Wed Sep 26, 2012 5:58 pm

mcostalba wrote:...
Marco
Marco, can you change SF time management so that it does not use the increment until after it has played its move?

See http://74.220.23.57/forum/viewtopic.php?p=484664#484664


Matthias.
My engine was quite strong till I added knowledge to it.
http://www.chess.hylogic.de

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 7:17 pm

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by mcostalba » Wed Sep 26, 2012 6:04 pm

This is interesting. Thanks for reporting. I will look at this issue.

Modern Times
Posts: 2417
Joined: Thu Jun 07, 2012 9:02 pm

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by Modern Times » Wed Sep 26, 2012 6:48 pm

mcostalba wrote: I would like also to thank Ingo, Werner and the CEGT, Ray and all the other people that are testing this release: I know I made your job a tad difficult due to the small ELO increase and the different releases. I promise, also to myself, that the next one will be better prepared.

Thanks
Marco
No problem, blitz isn't a huge effort.

After some more games, 40/4 standard chess shows +9, better than nothing but within the error bars as you pointed out.

zamar
Posts: 613
Joined: Sun Jan 18, 2009 6:03 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by zamar » Wed Sep 26, 2012 9:12 pm

Matthias Gemuh wrote:
mcostalba wrote:...
Marco
Marco, can you change SF time management so that it does not use the increment until after it has played its move?

See http://74.220.23.57/forum/viewtopic.php?p=484664#484664


Matthias.
Hi Matthias,

I've played >1000 test blitz games (1+1') with SF using XBoard. SF often goes really low on time (down to 0.3 seconds), but it has never overstepped.

The current time management code doesn't use increment before move is played (unless there is a bug).

My first thought is that the cause is slow interaction between the GUI and the engine. The default Emergency Base Time is 300 ms; you may want to increase it...
Joona Kiiski
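The scheme Joona describes can be illustrated with a toy allocator. This is only an illustrative sketch, not Stockfish's actual time-management code; the function name, the 30-move horizon, and the parameter names are invented for the example:

```python
def allocate_time(remaining_ms, increment_ms, emergency_base_ms=300,
                  horizon_moves=30):
    """Budget for the current move, counting the increment only for
    future moves: the increment for *this* move is credited only after
    it is played, so it never enters the hard cap."""
    usable = max(remaining_ms - emergency_base_ms, 0)
    # Each of the later horizon moves will earn an increment, so
    # increments raise the average budget but not this move's cap.
    budget = (usable + increment_ms * (horizon_moves - 1)) // horizon_moves
    return min(budget, usable)  # never spend time not yet on the clock
```

For example, with 1000 ms on the clock and a 1000 ms increment, the allocator spends at most the 700 ms actually available, rather than borrowing against the increment it has not yet received.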

Matthias Gemuh
Posts: 3238
Joined: Thu Mar 09, 2006 8:10 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by Matthias Gemuh » Wed Sep 26, 2012 9:51 pm

zamar wrote: Hi Matthias,

I've played >1000 test blitz games (1+1') with SF using XBoard. SF often goes really low on time (down to 0.3 seconds), but it has never overstepped.

The current time management code doesn't use increment before move is played (unless there is a bug).

My first thought is that the cause is slow interaction between the GUI and the engine. The default Emergency Base Time is 300 ms; you may want to increase it...
ChessGUI has a near-perfect solution to GUI time lag.
Stockfish and RedQueen are the only two engines above 1800 Elo that lose on time under ChessGUI (when "lag-free Timing" is used) at Fischer time controls.

Anyway, I shall henceforth use more Emergency Base Time.
Thanks (to Ray also) for the hint.

Matthias.
My engine was quite strong till I added knowledge to it.
http://www.chess.hylogic.de

Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 10:41 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by Lyudmil Tsvetkov » Thu Sep 27, 2012 10:55 am

Hi Gary,
I think you made some changes and tested them against 2.2.2, then tuned the values until 2.3 was clearly superior to 2.2.2. In that case the reasonable thing to do would be to tune the changes with respect to a wider range of opponents, the way you did with 2.2.2. The changes might be good, but not adequately tuned.
It would also be interesting to know how 2.3 fares in blitz against 2.2.2.

Best regards,
Ludmil

gerold
Posts: 10121
Joined: Wed Mar 08, 2006 11:57 pm
Location: van buren,missouri

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by gerold » Thu Sep 27, 2012 1:47 pm

Lyudmil Tsvetkov wrote:Hi Gary,
I think you made some changes and tested them against 2.2.2, then tuned the values until 2.3 was clearly superior to 2.2.2. In that case the reasonable thing to do would be to tune the changes with respect to a wider range of opponents, the way you did with 2.2.2. The changes might be good, but not adequately tuned.
It would also be interesting to know how 2.3 fares in blitz against 2.2.2.

Best regards,
Ludmil
After 100 games (which is far too few) at a 1/5 TC, the two engines played equal.
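For scale, the 95% error bar on a 100-game match is enormous. A back-of-the-envelope calculation (assuming a score near 50%, a normal approximation, and a guessed 40% draw ratio, all assumptions of this sketch rather than anything from the tests above):

```python
import math

def elo_error_bar(games, draw_ratio=0.4, z=1.96):
    """Approximate 95% confidence half-width, in Elo, for a match
    whose score is near 50%."""
    # Per-game score variance: a win or loss sits 0.5 away from the
    # mean score, a draw sits exactly on it.
    variance = 0.25 * (1.0 - draw_ratio)
    sigma_score = math.sqrt(variance / games)
    # Slope of elo(score) = -400*log10(1/score - 1) at score = 0.5.
    slope = 400.0 / (math.log(10.0) * 0.25)
    return z * slope * sigma_score
```

At 100 games this gives roughly ±53 Elo, so "played equal" over 100 games says very little about a difference expected to be only a handful of Elo.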

zamar
Posts: 613
Joined: Sun Jan 18, 2009 6:03 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by zamar » Thu Sep 27, 2012 10:59 pm

gladius wrote:
lkaufman wrote:
gladius wrote:Agreed, the results are very disappointing. The improvements were tested against Stockfish 2.2.2. It seems that while they were good in heads-up matches, they made things worse against weaker opponents and didn't help against stronger ones.
My tests indicate that 2.3.1 is a clear improvement even against foreign opponents at hyperspeed levels. So the problem, if there is one, is not the choice of opponents but the time control of the tests. My guess is that the change involving lateral attacks on pawns, being a tactical term, is great at speeds like game/10" but pretty useless at IPON levels.
Interesting, thanks Larry. A few of the evaluation changes were more tactical terms (pinned piece penalty, undefended pieces, and rook-pawn-rank bonus). So, that could be an explanation.

I'm going back now and applying each eval change to 2.2.2, and testing against a wider set of opponents (still at hyperblitz, 4s+0.05). It will be interesting to see how the changes do there. If they do the same, I guess testing at longer TC is the only way to go.
Seriously I don't think that there is anything wrong with the current testing method.

Looking at the results now (CCRL FRC: +9 Elo, CCRL 40/4: +9 Elo, IPON: -7 Elo, CEGT 40/20: -1 Elo), it's fully possible, and I'd say even likely, that SF 2.3.1 is ~5 Elo stronger than SF 2.2.2.

If you look at the history of recent SF releases, almost every release has done worse on some rating list than the previous version. Still, in the long run Stockfish has been going up on all rating lists.

This is completely natural; one just needs to learn to live with it and have confidence in slow long-term progress.
Joona Kiiski

gladius
Posts: 538
Joined: Tue Dec 12, 2006 9:10 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by gladius » Fri Sep 28, 2012 1:46 am

zamar wrote:Seriously I don't think that there is anything wrong with the current testing method.

Looking at the results now (CCRL FRC: +9 Elo, CCRL 40/4: +9 Elo, IPON: -7 Elo, CEGT 40/20: -1 Elo), it's fully possible, and I'd say even likely, that SF 2.3.1 is ~5 Elo stronger than SF 2.2.2.

If you look at the history of recent SF releases, almost every release has done worse on some rating list than the previous version. Still, in the long run Stockfish has been going up on all rating lists.

This is completely natural; one just needs to learn to live with it and have confidence in slow long-term progress.
Yes, the progress on Stockfish has been great! However, each change we made for 2.3.1 looked quite positive on its own, yet the sum of all those changes seems to be that 2.3.1 is about equal, maybe a bit stronger. So, something is definitely amiss.

If it's easy to tweak things a bit and improve the patch-to-Elo-gain ratio, then we should definitely try. For example, the Komodo folks seem not to rely on self-play tests as much.

zamar
Posts: 613
Joined: Sun Jan 18, 2009 6:03 am

Re: Stockfish 2.3.1 weaker than 2.2.2?

Post by zamar » Fri Sep 28, 2012 7:18 am

gladius wrote: Yes, the progress on Stockfish has been great! However, each change we made for 2.3.1 looked quite positive on its own, yet the sum of all those changes seems to be that 2.3.1 is about equal, maybe a bit stronger. So, something is definitely amiss.
Not necessarily. Two things must be kept in mind:

- Self-play always exaggerates differences. Against other engines the actual gain is about half of the improvement measured in self-play.

- Selection bias. You cannot sum the Elo gains of separately accepted tests; the total will come out too high.
Joona Kiiski
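The selection-bias point can be demonstrated with a small Monte-Carlo sketch (illustrative only; the trial counts and acceptance rule are invented, not drawn from Stockfish's testing): even patches with zero true gain show a positive average measured gain once you keep only the ones that pass their test.

```python
import random

def mean_accepted_gain(trials=2000, games=1000, true_elo=0.0, seed=42):
    """Simulate testing `trials` candidate patches for `games` games
    each, accept those scoring above 50%, and return the mean measured
    score excess of the accepted patches."""
    rng = random.Random(seed)
    p_win = 1.0 / (1.0 + 10.0 ** (-true_elo / 400.0))
    accepted = []
    for _ in range(trials):
        score = sum(rng.random() < p_win for _ in range(games)) / games
        if score > 0.5:  # the patch "passes" and its gain is recorded
            accepted.append(score - 0.5)
    return sum(accepted) / len(accepted)
```

With `true_elo=0` every accepted patch's measured gain is pure noise, yet the average comes out clearly positive (around +0.012 score excess per 1000-game test, roughly +8 Elo); summing such measurements across patches therefore overstates the total.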
