testing: Elo gain out of two changes

PK · Post by PK » Sat Jun 12, 2010 5:15 pm

hi,

Suppose you have made two changes to a chess-playing program, each of them showing, say, a 20 Elo increase in strength. Now you test the version with both changes accepted against the version without them. What result, in terms of Elo, do you expect? So far I have never gotten anything near the sum of the gains.

Mincho Georgiev · Post by **Mincho Georgiev** » Sat Jun 12, 2010 5:54 pm

Although you may consider it a bizarre situation, combining two changes can sometimes end up worse than either of them separately.
This is not an exact science (still).

Albert Silver · Post by **Albert Silver** » Sat Jun 12, 2010 5:58 pm

Mincho Georgiev wrote: Although you may consider it a bizarre situation, combining two changes can sometimes end up worse than either of them separately.
This is not an exact science (still).

Unless they overlap, meaning one interferes with another, I don't see how that is possible.

Mincho Georgiev · Post by **Mincho Georgiev** » Sat Jun 12, 2010 6:04 pm

Albert Silver wrote:
Mincho Georgiev wrote: Although you may consider it a bizarre situation, combining two changes can sometimes end up worse than either of them separately.
This is not an exact science (still).
Unless they overlap, meaning one interferes with another, I don't see how that is possible.

Exactly!
The way the question was asked just doesn't exclude that worst-case scenario, i.e. almost no information was given.

bob · Post by **bob** » Sat Jun 12, 2010 7:37 pm

PK wrote:hi,

Suppose you have made two changes to a chess-playing program, each of them showing, say, a 20 Elo increase in strength. Now you test the version with both changes accepted against the version without them. What result, in terms of Elo, do you expect? So far I have never gotten anything near the sum of the gains.

Depends. As an example, last year I measured the gain for Null-move and LMR. I took a baseline version of Crafty with both disabled, then enabled Null-move only and got +80. I then disabled NM again and enabled LMR and also got +80. For the final test I enabled both, and got +120. In thinking about it, the two methodologies have significant overlap, since both are using reduced-depth searches to speed things up.

If you have two changes, A and B, that are totally independent, and each produces +20, you should get +40 when you test both together. However, quite often, there is unexpected interaction between the two that will cause you to get less than this.

We have seen both extremes in our cluster testing, where A+B is _more_ than what we get if we test A and then B and add the improvements. We have also gotten less. And if you are not careful, you can add either A or B and get +20, and when you add them both you still get just +20 (or even less).

It depends on how "connected" the two terms happen to be. I always try to test just one change at a time, and if good, that becomes the new baseline for testing the next change.
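
The additivity argument above can be put into numbers. The following is only a minimal sketch, assuming the standard logistic Elo model; the helper names `expected_score` and `interaction` are made up for illustration and are not part of Crafty or any testing tool.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) against an opponent rated elo_diff lower,
    under the standard logistic Elo model (an assumption of this sketch)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def interaction(gain_a: float, gain_b: float, gain_both: float) -> float:
    """Elo left over after subtracting the two individual gains from the
    combined gain: 0 = purely additive, negative = overlap, positive = synergy."""
    return gain_both - (gain_a + gain_b)


# Null-move / LMR numbers quoted in the post above: +80, +80, combined +120.
print(interaction(80, 80, 120))        # -40, i.e. 40 Elo lost to overlap

# What +20 and a fully additive +40 would mean as match scores.
print(round(expected_score(20), 3))    # 0.529
print(round(expected_score(40), 3))    # 0.557
```

A negative interaction term, as in the Null-move/LMR case, is what "connected" changes look like in the results; two truly independent +20 changes would give an interaction near zero.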

bob · Post by **bob** » Sat Jun 12, 2010 7:39 pm

Albert Silver wrote:
Mincho Georgiev wrote: Although you may consider it a bizarre situation, combining two changes can sometimes end up worse than either of them separately.
This is not an exact science (still).
Unless they overlap, meaning one interferes with another, I don't see how that is possible.

One term can pull you toward king attacks. The other term could make you more defense-minded by having better pawn structure. Added together you can become way too aggressive or way too passive. It is not that uncommon.

Mincho Georgiev · Post by **Mincho Georgiev** » Sat Jun 12, 2010 8:50 pm

bob wrote:
PK wrote:hi,

Suppose you have made two changes to a chess-playing program, each of them showing, say, a 20 Elo increase in strength. Now you test the version with both changes accepted against the version without them. What result, in terms of Elo, do you expect? So far I have never gotten anything near the sum of the gains.
Depends. As an example, last year I measured the gain for Null-move and LMR. I took a baseline version of Crafty with both disabled, then enabled Null-move only and got +80. I then disabled NM again and enabled LMR and also got +80. For the final test I enabled both, and got +120. In thinking about it, the two methodologies have significant overlap, since both are using reduced-depth searches to speed things up.

If you have two changes, A and B, that are totally independent, and each produces +20, you should get +40 when you test both together. However, quite often, there is unexpected interaction between the two that will cause you to get less than this.

We have seen both extremes in our cluster testing, where A+B is _more_ than what we get if we test A and then B and add the improvements. We have also gotten less. And if you are not careful, you can add either A or B and get +20, and when you add them both you still get just +20 (or even less).

It depends on how "connected" the two terms happen to be. I always try to test just one change at a time, and if good, that becomes the new baseline for testing the next change.

Besides, let's not forget something else. Very few people are able to run tests with the density of yours. For the rest of us, who lack the hardware, +20 is not really +20 (it could even be +5 or -5) in probably 75% of the tests, so it is no wonder if the result of combining the two changes is highly unexpected.
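
The point about small test runs can be illustrated with a rough error-bar calculation. This is a sketch under stated assumptions: it uses the logistic Elo model and a normal approximation for the mean score, and the function names and the win/draw/loss counts are hypothetical, not results from any actual test.

```python
import math


def elo_from_score(score: float) -> float:
    """Elo difference implied by a mean score, logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (low, estimate, high) in Elo for a match result.

    Uses the sample variance of the per-game score (1, 0.5 or 0) and a
    normal approximation; z = 1.96 gives roughly a 95% interval.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(variance / n)
    return (elo_from_score(score - margin),
            elo_from_score(score),
            elo_from_score(score + margin))


# Hypothetical results with identical proportions but different game counts.
print(elo_interval(80, 51, 69))        # 200 games
print(elo_interval(8000, 5100, 6900))  # 20,000 games
```

With the 200-game result the estimate is close to +20, but the interval still includes zero, so the "gain" could easily be noise. The same proportions over 20,000 games give an interval that stays above zero.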
