Komodo CEGT 40/20 and 'Strength'

Discussion of computer chess matches and engine tournaments.

Moderators: hgm, Rebel, chrisw

leavenfish
Posts: 282
Joined: Mon Sep 02, 2013 8:23 am

Komodo CEGT 40/20 and 'Strength'

Post by leavenfish »

Hopefully I am not barking up the wrong tree or wasting anyone's time... this is mostly addressed to Team Komodo, Team CEGT, or anyone who really has something to say.

I am going to hope my formatting stays true for this cut/paste.
These are the Komodo versions since 9 from the just-released CEGT 40/20 list.
I have sorted by CPU count and then by rating (Category = the Komodo version family, 9 through 13).

Now, I am no slave to ratings, as I use engines only for game analysis (not engine-vs-engine play), but I realize they are an easy way to gauge playing "strength" over a bunch of quick games. For my purposes, I would rather see a large sample of tactical and positional positions fed into an engine, to determine how accurately, and to a lesser extent how quickly, the proper 'answer' is found. How does one build such a suite? The thought comes to mind that LCZero, or perhaps better, AlphaZero, could be put to work on the test suite first: lots of games run and the 'proper idea' found/verified for each position (even where it contradicts conventional wisdom, provided it truly offers better chances)... then all the other engines run through the gauntlet. Just my suggestion. Anyway...
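A minimal sketch of such a suite runner, assuming the python-chess library, a hypothetical engine binary, and a hypothetical EPD file whose records carry "bm" (best move) operations; it just counts how often the engine's top move matches the suite's answer within a fixed think time:

# Hypothetical suite runner: engine path, suite path, and think time
# are placeholders, not real artifacts.
import chess
import chess.engine

ENGINE_PATH = "komodo"    # hypothetical engine binary on PATH
SUITE_PATH = "suite.epd"  # hypothetical EPD file with "bm" records
THINK_TIME = 10.0         # seconds per position

def run_suite() -> None:
    engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    solved = total = 0
    with open(SUITE_PATH) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            board = chess.Board()
            ops = board.set_epd(line)       # sets the position, returns EPD opcodes
            best_moves = ops.get("bm", [])  # expected best move(s), as Move objects
            if not best_moves:
                continue
            info = engine.analyse(board, chess.engine.Limit(time=THINK_TIME))
            pv = info.get("pv")
            if not pv:
                continue
            total += 1
            if pv[0] in best_moves:
                solved += 1
    engine.quit()
    print(f"solved {solved}/{total} positions")

if __name__ == "__main__":
    run_suite()

Scoring could of course be refined (partial credit for speed, "am" avoid-move records, and so on); the point is only that the mechanics are simple.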

Question 1:
When I check the individual results against other engines, I see the introduction of the LCZero engine starting with... I believe... Komodo 13.01. Does the fact that it and Stockfish (multiple recent iterations) do so well against Komodo in any way account for the ratings leveling off in the recent versions? I do not see LCZero or Stockfish 10 being run against previous versions, and perhaps the gap between Komodo and the earlier versions of Stockfish was smaller(?).

Question 2:
This ties into my comment above... do you think non-game-play testing could in any way be used to provide a different view of "strength" (positional judgement) in more critical positions? Say you evaluate (making something up here) move 24 of a Kasparov - Karpov game that is in our hypothetical test: a position with multiple reasonable choices, where maybe one leads to more of an advantage after many logical moves. If you had engines playing engines starting at, say, move 16 of that same game, some elite engines might actually follow the game to move 24, while other elite engines might have deviated somewhere along the line (and those choices could be perfectly good as well) and never get to chew on that 'test' position. That would be a 'downside' of using game-play ratings to determine 'strength', whereas with a test suite you would have them all chewing on the same position.

Or, in regard to Q2: is it just generally thought that engine-vs-engine play produces its own unique set of critical positions throughout a game (never the same for each engine, of course, as they push the game along), and that this makes up for the value one might find in a good, extensive, and properly vetted test suite as I propose above?


Overall Cat Program CPU Elo + - Games Score Av.Op. Draws
4 CPU
10 1 Komodo 12.1.1 x64 4CPU 4 3419 13 13 2249 62.00% 3318 56.30%
13 2 Komodo 12.3 x64 4CPU 4 3402 21 21 734 65.10% 3285 56.40%
8 1 Komodo 11.2 x64 4CPU 4 3421 13 13 2052 68.70% 3261 49.80%
12 2 Komodo 11.3 x64 4CPU 4 3406 15 15 1799 65.70% 3280 55.10%
14 3 Komodo 11.01 x64 4CPU 4 3395 15 15 1855 70.00% 3222 47.70%
15 1 Komodo 10.4 x64 4CPU 4 3395 13 13 2276 75.00% 3181 42.00%
20 2 Komodo 10.2 x64 4CPU 4 3355 15 15 1885 68.70% 3203 45.40%
21 3 Komodo 10.3 x64 4CPU 4 3354 18 18 1102 66.70% 3224 52.70%
24 4 Komodo 10.1 x64 4CPU 4 3340 18 18 1061 71.80% 3168 46.10%
25 5 Komodo 10 x64 4CPU 4 3340 12 12 2716 71.80% 3158 42.90%
27 1 Komodo 9.42 x64 4CPU 4 3333 12 12 2406 70.40% 3162 44.40%
28 2 Komodo 9.2 x64 4CPU 4 3331 12 12 2177 72.90% 3141 42.50%
34 3 Komodo 9.3 x64 4CPU 4 3319 15 15 1497 72.10% 3143 46.40%
37 4 Komodo 9.1 x64 4CPU 4 3308 14 14 2049 71.40% 3124 45.00%
40 5 Komodo 9.0 x64 4CPU 4 3298 14 14 2023 73.20% 3108 43.60%
51 1 Komodo 8.0 x64 4CPU 4 3250 14 14 1629 60.20% 3171 53.70%
64 1 Komodo 7.0 x64 4CPU 4 3211 15 15 1408 59.30% 3138 55.80%
1 CPU
23 1 Komodo 13.01 x64 1CPU 1 3340 18 18 945 58.60% 3276 56.70%
30 1 Komodo 12.2.2 x64 1CPU 1 3327 15 15 1601 65.40% 3207 47.10%
32 1 Komodo 11.3 x64 1CPU 1 3324 16 16 1710 71.80% 3142 45.10%
33 2 Komodo 11.2 x64 1CPU 1 3320 11 11 2899 71.10% 3145 45.60%
38 3 Komodo 11.01 x64 1CPU 1 3303 15 15 1949 72.60% 3114 44.60%
36 1 Komodo 10.4 x64 1CPU 1 3310 15 15 1711 72.00% 3128 43.80%
44 2 Komodo 10.3 x64 1CPU 1 3282 16 16 1315 61.10% 3194 54.00%
46 3 Komodo 10.2 x64 1CPU 1 3258 15 15 1658 67.10% 3122 47.90%
47 4 Komodo 10.1 x64 1CPU 1 3257 15 15 1500 67.00% 3122 45.70%
50 5 Komodo 10 x64 1CPU 1 3251 17 17 1498 75.90% 3037 36.50%
56 1 Komodo 9.42 x64 1CPU 1 3233 15 15 1396 74.50% 3035 39.40%
58 2 Komodo 9.3 x64 1CPU 1 3230 14 14 2038 67.00% 3092 44.80%
62 3 Komodo 9.2 x64 1CPU 1 3217 14 14 1750 71.70% 3044 41.70%
70 4 Komodo 9.1 x64 1CPU 1 3196 14 14 1637 70.00% 3038 46.20%
71 5 Komodo 9.0 x64 1CPU 1 3193 16 16 1299 71.60% 3023 43.00%
MCTS
48 1 Komodo 13.01 x64 1CPU (MCTS) 1 3256 16 16 1035 49.70% 3259 56.90%
74 1 Komodo 12.3 x64 1CPU (MCTS) 1 3189 15 15 1900 53.00% 3165 50.20%
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Komodo CEGT 40/20 and 'Strength'

Post by mjlef »

leavenfish wrote: Mon May 20, 2019 7:51 pm [full post quoted above]
Question 1: When testing chess engines, you quickly find that getting enough resources together to drive the error margins down is hard, especially at longer time controls. The testing groups try hard, but even getting 2000 games is difficult, and the error margins at that sample size are still big. It is better to look at progress over longer time spans. That said, I do not see "ratings leveling off" for Komodo 13.01. We have been spending most of our time improving Komodo MCTS, and you can see that in the data, but regular Komodo 13.01 shows a good gain over its predecessors.
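As a rough illustration of how big those margins are, here is a sketch of the standard trinomial model for a 95% Elo confidence interval (an assumption for illustration, not necessarily CEGT's exact method; their published margins are somewhat wider, i.e. more conservative):

# Rough 95% Elo error margin from games played, score, and draw rate.
import math

def elo_margin(games: int, score: float, draw_rate: float, z: float = 1.96) -> float:
    win = score - draw_rate / 2                 # score = wins + draws/2
    var = win + draw_rate * 0.25 - score ** 2   # per-game score variance
    se = math.sqrt(var / games)                 # standard error of the mean score
    # Convert score error to Elo via the slope of the logistic curve.
    d_elo = (400 / math.log(10)) / (score * (1 - score))
    return z * se * d_elo

# Komodo 12.1.1's 4CPU line: 2249 games, 62% score, 56.3% draws
print(round(elo_margin(2249, 0.62, 0.563), 1))  # about +/- 9.4 Elo

So even at a couple of thousand games, a difference of a few Elo between adjacent versions is well inside the noise.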

Question 2: Although using test positions to estimate ratings was done long ago (much of that work by Komodo's GM Larry Kaufman), such suites can only test limited features, and so they are not used much now. We have found it better to play whole games and test the whole engine, instead of just the parts that might get triggered in a limited set of positions.
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Komodo CEGT 40/20 and 'Strength'

Post by lkaufman »

mjlef wrote: Mon May 20, 2019 8:13 pm [reply quoted in full above]
The 1 CPU ratings look about right to me; they show progress roughly in line with our estimates. The 4 CPU ones are a bit puzzling, but Komodo 13 isn't on there yet. To get the answer to your question, just look up the performance of recent Komodos vs Lc0 and see if it is noticeably below their actual ratings. I don't have time to do this myself, but I would be interested if someone does. Regarding rating by problem sets, I think it has been shown that Lc0 behaves very differently on problem sets than similarly rated versions of Stockfish. You could make any engine look good or bad simply by tilting the percentage of tactical vs. "positional" problems one way or the other. Thirty years ago it worked pretty well, since tactics were pretty much all that decided engine games back then.
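To run that check, the performance rating implied by a head-to-head score follows directly from the logistic Elo relation; a minimal sketch (the example numbers are illustrative, not taken from the CEGT data):

# Performance rating implied by a score against known opposition.
import math

def performance(avg_opp_elo: float, score: float) -> float:
    score = min(max(score, 1e-6), 1 - 1e-6)  # clamp to keep the log finite
    return avg_opp_elo + 400 * math.log10(score / (1 - score))

# e.g. scoring 45% against a pool of Lc0 versions averaging 3400
print(round(performance(3400, 0.45)))        # about 3365

If recent Komodos' performance against Lc0 computed this way sits noticeably below their list ratings, that would support the idea that the new opponent pool explains the apparent leveling off.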
Komodo rules!