testing: Elo gain out of two changes

PK
Posts: 908
Joined: Mon Jan 15, 2007 11:23 am
Location: Warsza

testing: Elo gain out of two changes

Post by PK »

hi,

Suppose you have made two changes to a chess-playing program, each of them showing, say, a 20 Elo increase in strength. Now you test the version with both changes accepted against the version without them. What result, in terms of Elo, do you expect? So far I have never got anything near the sum of the gains.
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: testing: Elo gain out of two changes

Post by Mincho Georgiev »

Bizarre as it may seem, combining two changes can sometimes end up worse than either of them separately.
This is still not an exact science.
Albert Silver
Posts: 3026
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: testing: Elo gain out of two changes

Post by Albert Silver »

Mincho Georgiev wrote:Bizarre as it may seem, combining two changes can sometimes end up worse than either of them separately.
This is still not an exact science.
Unless they overlap, meaning one interferes with another, I don't see how that is possible.
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: testing: Elo gain out of two changes

Post by Mincho Georgiev »

Albert Silver wrote:
Mincho Georgiev wrote:Bizarre as it may seem, combining two changes can sometimes end up worse than either of them separately.
This is still not an exact science.
Unless they overlap, meaning one interferes with another, I don't see how that is possible.
Exactly!
The way the question was asked just doesn't exclude that worst-case scenario, i.e. almost no information was given.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing: Elo gain out of two changes

Post by bob »

PK wrote:hi,

Suppose you have made two changes to a chess-playing program, each of them showing, say, a 20 Elo increase in strength. Now you test the version with both changes accepted against the version without them. What result, in terms of Elo, do you expect? So far I have never got anything near the sum of the gains.
Depends. As an example, last year I measured the gain for Null-move and LMR. I took a baseline version of Crafty with both disabled, then enabled Null-move only and got +80. I then disabled NM again and enabled LMR and also got +80. For the final test I enabled both, and got +120. In thinking about it, the two methodologies have significant overlap, since both are using reduced-depth searches to speed things up.

If you have two changes, A and B, that are totally independent, and each produces +20, you should get +40 when you test both together. However, quite often, there is unexpected interaction between the two that will cause you to get less than this.

We have seen both extremes in our cluster testing, where A+B is _more_ than what we get if we test A and then B and add the improvements. We have also gotten less. And if you are not careful, you can add either A or B and get +20, and when you add them both you still get just +20 (or even less).

It depends on how "connected" the two terms happen to be. I always try to test just one change at a time, and if good, that becomes the new baseline for testing the next change.
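To put numbers on that overlap, here is a minimal sketch. It is not Crafty code; the function names are made up for illustration, and it only assumes the standard logistic Elo formula. It converts between Elo difference and expected score, and computes how far a combined result falls from the simple sum of the two individual gains.

```python
import math

def expected_score(elo_diff):
    """Expected score (0..1) for a given Elo advantage, standard logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_from_score(score):
    """Inverse of expected_score; score must be strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def interaction(gain_a, gain_b, gain_both):
    """Elo lost (negative) or gained (positive) relative to pure additivity."""
    return gain_both - (gain_a + gain_b)

# The null-move / LMR measurement quoted above: +80, +80, and +120 together.
overlap = interaction(80, 80, 120)   # -40: the two changes overlap by 40 Elo
independent = interaction(20, 20, 40)  # 0: fully independent changes
```

A negative value means the changes partly cover the same ground, zero means they behave as independent, and a positive value means they reinforce each other.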
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing: Elo gain out of two changes

Post by bob »

Albert Silver wrote:
Mincho Georgiev wrote:Bizarre as it may seem, combining two changes can sometimes end up worse than either of them separately.
This is still not an exact science.
Unless they overlap, meaning one interferes with another, I don't see how that is possible.
One term can pull you toward king attacks. The other term could make you more defense-minded by favoring better pawn structure. Added together, they can make you way too aggressive or way too passive. It is not that uncommon.
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: testing: Elo gain out of two changes

Post by Mincho Georgiev »

bob wrote:
PK wrote:hi,

Suppose you have made two changes to a chess-playing program, each of them showing, say, a 20 Elo increase in strength. Now you test the version with both changes accepted against the version without them. What result, in terms of Elo, do you expect? So far I have never got anything near the sum of the gains.
Depends. As an example, last year I measured the gain for Null-move and LMR. I took a baseline version of Crafty with both disabled, then enabled Null-move only and got +80. I then disabled NM again and enabled LMR and also got +80. For the final test I enabled both, and got +120. In thinking about it, the two methodologies have significant overlap, since both are using reduced-depth searches to speed things up.

If you have two changes, A and B, that are totally independent, and each produces +20, you should get +40 when you test both together. However, quite often, there is unexpected interaction between the two that will cause you to get less than this.

We have seen both extremes in our cluster testing, where A+B is _more_ than what we get if we test A and then B and add the improvements. We have also gotten less. And if you are not careful, you can add either A or B and get +20, and when you add them both you still get just +20 (or even less).

It depends on how "connected" the two terms happen to be. I always try to test just one change at a time, and if good, that becomes the new baseline for testing the next change.
Besides, let's not forget something else. Very few people are able to run tests with as many games as yours. For the rest of us, who lack the hardware, a measured +20 is not really +20 (it could just as well be +5 or -5) in probably 75% of the tests, so it is no wonder if the result turns out highly unexpected after combining the two changes.
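To illustrate how wide the uncertainty on a small test can be, here is a rough sketch. It is only an assumption-laden approximation: it treats games as independent, uses a normal approximation for the 95% interval, and the win/draw/loss counts are invented for the example. The function name elo_interval is hypothetical, not part of any testing tool.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses, z=1.96):
    """Return (estimate, low, high) in Elo using a normal approximation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    # Clamp so the Elo conversion stays defined for extreme results.
    eps = 1e-9
    low = min(max(score - z * se, eps), 1.0 - eps)
    high = min(max(score + z * se, eps), 1.0 - eps)
    return elo_from_score(score), elo_from_score(low), elo_from_score(high)

# Made-up result of a 200-game match that looks like roughly +19 Elo.
small = elo_interval(70, 71, 59)
# The same proportions over 20000 games.
large = elo_interval(7000, 7100, 5900)
```

With the invented 200-game result the interval includes zero, so the apparent gain cannot be told apart from no gain at all; with the same proportions over 20000 games the lower bound is above zero.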