lkaufman wrote: ↑Wed Jun 24, 2020 5:24 pm
Rebel wrote: ↑Wed Jun 24, 2020 12:18 pm
Finished the Elo 2900 pool.
Stockfish gauntlet, knight-odds, tc=40/10
Code: Select all
#  ENGINE        :  RATING  POINTS  PLAYED    (%)
1  cheng4_4.39   :  3273.3   128.0     200  64.0%
2  Bobcat_8      :  3250.9   122.0     200  61.0%
3  Stockfish_11  :  3172.5   269.5     600  44.9%
4  Crafty_25.6   :  3103.3    80.5     200  40.3%
tc=40/20
Code: Select all
#  ENGINE        :  RATING  POINTS  PLAYED    (%)
1  cheng4_4.39   :  3315.9   137.5     200  68.8%
2  Bobcat_8      :  3294.0   132.0     200  66.0%
3  Stockfish_11  :  3177.8   274.5     600  45.8%
4  Crafty_25.6   :  3012.3    56.0     200  28.0%
tc=40/40
Code: Select all
#  ENGINE        :  RATING  POINTS  PLAYED    (%)
1  cheng4_4.39   :  3318.8   146.0     200  73.0%
2  Bobcat_8      :  3295.0   140.5     200  70.3%
3  Stockfish_11  :  3144.5   242.0     600  40.3%
4  Crafty_25.6   :  3041.7    71.5     200  35.8%
Next, 2800 pool.
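(Aside, for reading the tables above: the rating gaps look consistent with the standard logistic Elo model, where a score fraction p implies a gap of 400*log10(p/(1-p)); e.g. cheng4's 64.0% maps to roughly +100 over Stockfish's 3172.5, matching the 3273.3 shown. A minimal sketch, assuming that model is the one in use here:)
Code: Select all
import math

def elo_gap(p):
    """Rating difference implied by a score fraction p under the
    standard logistic Elo model: gap = 400 * log10(p / (1 - p))."""
    return 400 * math.log10(p / (1 - p))

# cheng4's 64.0% against Stockfish maps to roughly +100 Elo,
# matching 3273.3 - 3172.5 = 100.8 in the first table above.
print(round(elo_gap(0.640), 1))  # ~100.0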
So results improved steadily with more time, as expected, for cheng and bobcat, but not for crafty (a regression between 40/10 and 40/20); I wonder why? Two questions: How were the positions chosen from the ChrisW set? I'm finding that taking them from the middle (pruning an equal number from each end) is the fairest and closest simulation to real knight odds.
We can’t go cherry-picking positions according to subjective criteria. And this concept of “real knight odds” is about as subjective as it gets; it isn’t reached by asking an engine to evaluate at the root and using that as the definition. Imagine defining “real chess odds” by asking an engine to search from the root and give the answer. 42?
There are no “real knight odds”. All we can do is take positions without the knight and see how the results work out over *many* tests. We can try to use “natural” positions where neither side has an apparent head start, e.g. by removing the outliers.
Nor are we trying to determine what knight odds are worth in some numerical sense; we’re trying to determine how modern engines do against strong oldies with various handicaps, the first handicap being minus a knight.
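To make that concrete: “taking them from the middle” and “removing the outliers” come down to the same operation, i.e. sort the positions by some evaluation, prune an equal number from each end, and keep the middle. A minimal sketch, assuming each EPD comes paired with a numeric score (the names and the trim fraction are placeholders):
Code: Select all
def middle_slice(epds, scores, trim_frac=0.1):
    # Sort positions by their score and prune an equal number of
    # outliers from each end, keeping the middle of the distribution.
    # `scores` is assumed to be parallel to `epds`; both are placeholders.
    ranked = [epd for _, epd in sorted(zip(scores, epds))]
    k = int(len(ranked) * trim_frac)  # positions to prune per end
    return ranked[k:len(ranked) - k] if k else ranked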
Also, did Stockfish use default Contempt, or 0, or max (100)? It would do best with 100, I'm sure.
It’s better just to use the defaults; too much parameter fiddling confuses everything.
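For what it’s worth, if someone did want to rerun the gauntlet at max contempt, it is a single UCI option; a minimal sketch using the python-chess library (the binary path is a placeholder):
Code: Select all
import chess.engine

# Path to the Stockfish 11 binary is a placeholder.
engine = chess.engine.SimpleEngine.popen_uci("./stockfish_11")
engine.configure({"Contempt": 100})  # Stockfish 11 defaults to 24; 100 is the maximum
engine.quit()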
Anyway, I prepared suites of 25, 100, 250 and 1000 EPDs. Each is a randomly selected subset of about 1200 EPDs, taken from a range of roughly 370 to 420 (I forget exactly; it says in the github readme). That selection is probably in line with your desires, actually.
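The subsetting itself is nothing fancy; a minimal sketch of that kind of random selection (file names and the seed are placeholders; one EPD per line assumed):
Code: Select all
import random

with open("full_set.epd") as f:     # placeholder file name
    pool = [line.strip() for line in f if line.strip()]

random.seed(1)                      # fixed seed so the suites are reproducible
for n in (25, 100, 250, 1000):
    with open(f"suite_{n}.epd", "w") as out:
        out.write("\n".join(random.sample(pool, n)) + "\n")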
A posit from me: the most sensible course would be to use only those sets for a while; we’ll soon see if the 25 suite gives very different results from the 1000 suite, and then we can start worrying about whether small subsets, and the positions in general, are too noisy. For example, we don’t know right now whether the anomalous(?) results of Crafty are down to unlucky position selection.
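To put a rough number on “too noisy”: the standard error of a score fraction p over n games is about sqrt(p*(1-p)/n) if each game is treated as an independent Bernoulli trial (draws shrink it a little, so this is conservative). A sketch:
Code: Select all
import math

def score_stderr(p, n):
    # Approximate standard error of a score fraction p over n games,
    # treating each game as a Bernoulli trial; draws make the true
    # error a bit smaller, so this is a conservative bound.
    return math.sqrt(p * (1 - p) / n)

# Crafty's 40.3% over 200 games carries roughly +/-3.5% per standard error.
print(round(score_stderr(0.403, 200), 4))  # ~0.0347
By that measure Crafty’s 40.3% at 40/10 and 28.0% at 40/20 sit roughly two and a half combined standard errors apart, so it is borderline whether variance alone explains the regression.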