feature1              feature1              feature1
feature2              feature2              + new feature
+                     + new feature
------------------    ------------------    ------------------
+50 ELO               +10 ELO not again     +80 wow
If this is true, then the number of combinations to test explodes, right?
Yes, I think this is possible. Certain testers can probably give practical examples. I could imagine, for instance, that overlapping positional evaluation criteria, like mobility and piece-square tables, could lead to such results in principle, although perhaps not of that magnitude.
As to your example: as long as you have no proof that feature1 is really an improvement independent from feature2 and "new feature", it would indeed be necessary IMO to test all combinations. Otherwise you don't know whether it is really the combination of feature2 with "new feature" that is unproductive, or whether in fact feature1 and "new feature" are those that overlap or even contradict somehow.
In any case, I think you at least need a statement about the improvement produced by feature1 alone.
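To make the "explosion" concrete, here is a rough sketch of what "test all combinations" means. It is only an illustration, not anyone's actual test harness; the function name `feature_subsets` is made up, and the feature names are just the placeholders from the example above. With n on/off features there are 2^n configurations, each of which would need its own match run.

```python
from itertools import combinations


def feature_subsets(features):
    """Return every subset of `features`, from the empty baseline up to all of them."""
    subsets = []
    for size in range(len(features) + 1):
        for combo in combinations(features, size):
            subsets.append(combo)
    return subsets


# Placeholder feature names taken from the example in this thread.
features = ["feature1", "feature2", "new feature"]
configs = feature_subsets(features)

for config in configs:
    print(" + ".join(config) if config else "(baseline, no features)")
print(len(configs), "configurations to test")
```

Three features already give 8 configurations, and each additional feature doubles the count, which is why independent evidence for a single feature (such as feature1 alone) is so valuable.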
feature1              feature1              feature1
feature2              feature2              + new feature
+                     + new feature
------------------    ------------------    ------------------
+50 ELO               +10 ELO not again     +80 wow
If this is true then the testing combination explode, right?
- Jarkko
Yes, I think it is. For example, if feature2 is some extension, it is quite likely that it works with your program at the beginning but hurts once you make your program more efficient, so that it searches to greater depths. In that case you may need to reduce or remove the extension for your program to gain additional ELO.
Something similar could also happen with some kind of king safety code that is good at medium search depth but not at very high depths.
It happens. I can't say how frequently, but it definitely happens.
I have not seen that happen except for the case where +50 and +10 come from testing at (say) 1 sec / move, and the -80 comes from testing at 60 secs / move...
It should be pointed out that although *old* research (such as Schaeffer's thesis) can suggest cases such as that, those were all done with a very limited number of games. (Bob has shown such limited testing can't be considered valid. Which makes everybody wonder just how people managed without clusters for the last 30+ years....)
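As a rough illustration of why a small number of games proves little, here is a sketch that turns a win/draw/loss record into an Elo difference with an approximate 95% interval. This is my own example, not Bob's methodology: it assumes independent games and a normal approximation, the function names are hypothetical, and the game counts below are invented purely for illustration.

```python
import math


def elo_from_score(score):
    """Convert an expected score in (0, 1) to an Elo difference (logistic model)."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)  # avoid log of zero or division by zero
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, lower, upper) using a normal approximation on the mean score.

    Assumes games are independent; z=1.96 corresponds to roughly 95% confidence.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where win=1, draw=0.5, loss=0.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    std_err = math.sqrt(variance / n)
    return (
        elo_from_score(score),
        elo_from_score(score - z * std_err),
        elo_from_score(score + z * std_err),
    )


# Invented records with the same 55% score, differing only in the number of games.
small = elo_with_margin(45, 20, 35)
large = elo_with_margin(4500, 2000, 3500)
print("100 games:    %+.1f Elo, interval [%+.1f, %+.1f]" % small)
print("10000 games:  %+.1f Elo, interval [%+.1f, %+.1f]" % large)
```

With the 100-game record the interval still includes zero, so an apparent gain of a few dozen Elo could be pure noise; only the much longer match pins the difference down to a usefully narrow range.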
All of Bob's tests have been done with a fairly complete and sophisticated program, and the results for a simpler program, or one where the evaluator is still in the early stages, might show all sorts of weird behavior until things stabilize a bit.
Hmm I'd think that'd be possible with, say, three forward pruning schemes... Say feature2 and "new feature" overlap on the nodes they reduce/prune, so having both could be too aggressive, but "new feature" + feature1 could prune smarter than feature1 + feature2....