Stockfish Handicap Matches

lkaufman
Posts: 4228
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA
Contact:

Re: Stockfish Handicap Matches

Post by lkaufman » Wed Jun 24, 2020 9:18 pm

chrisw wrote:
Wed Jun 24, 2020 9:04 pm
lkaufman wrote:
Wed Jun 24, 2020 7:31 pm
chrisw wrote:
Wed Jun 24, 2020 6:43 pm
lkaufman wrote:
Wed Jun 24, 2020 6:08 pm
chrisw wrote:
Wed Jun 24, 2020 5:30 pm
lkaufman wrote:
Wed Jun 24, 2020 3:24 pm
Rebel wrote:
Wed Jun 24, 2020 10:18 am
Finished the elo 2900 pool.

Stockfish gauntlet, knight-odds, tc=40/10

   # ENGINE          : RATING    POINTS  PLAYED    (%)
   1 cheng4_4.39     : 3273.3     128.0     200   64.0%
   2 Bobcat_8        : 3250.9     122.0     200   61.0%
   3 Stockfish_11    : 3172.5     269.5     600   44.9%
   4 Crafty_25.6     : 3103.3      80.5     200   40.3%
tc=40/20

   # ENGINE          : RATING    POINTS  PLAYED    (%)
   1 cheng4_4.39     : 3315.9     137.5     200   68.8%
   2 Bobcat_8        : 3294.0     132.0     200   66.0%
   3 Stockfish_11    : 3177.8     274.5     600   45.8%
   4 Crafty_25.6     : 3012.3      56.0     200   28.0%
tc=40/40

   # ENGINE          : RATING    POINTS  PLAYED    (%)
   1 cheng4_4.39     : 3318.8     146.0     200   73.0%
   2 Bobcat_8        : 3295.0     140.5     200   70.3%
   3 Stockfish_11    : 3144.5     242.0     600   40.3%
   4 Crafty_25.6     : 3041.7      71.5     200   35.8%
Next, 2800 pool.
So results improved steadily with more time, as expected, for cheng and bobcat, but not for crafty (it regressed between 40/10 and 40/20); I wonder why? Two questions: How were the positions chosen from the ChrisW set? I'm finding that taking them from the middle (pruning an equal number from each end) is the fairest and closest simulation of real knight odds.
We can’t go cherry picking positions according to subjective criteria. And this concept of “real knight odds” is about as subjective as it gets, and it isn’t reached by asking an engine to evaluate at the root and using that as the definition. Imagine defining “real chess odds” by asking an engine to search from the root and give the answer. 42?
There are no “real knight odds”; all there is are positions without the knight, and we see how the results work out from *many* tests. We can try to use “natural” positions without either side having an apparent head start, e.g. by removing the outliers.
Nor are we trying to determine what knight odds are in some numerical sense; we are trying to determine how modern engines do against strong oldies with various handicaps, the first handicap being minus a knight.

Also, did Stockfish use default Contempt, or 0, or max (100)? It would do best with 100 I'm sure.
It’s better to just use defaults; too much parameter fiddling just confuses everything.

Anyway, I prepared suites of 25, 100, 250 and 1000 epds. Each is a randomly selected subset of about 1200 epds taken from, I forget (it says in the github readme), roughly 370 to 420 I think. That selection is probably in line with your desires, actually.

Posit from me: the most sensible course would be to use those sets only for a while, we’ll soon see if the 25 suite gives very different results from the 1000 suite, and then we can start worrying if small subsets and the positions in general are too noisy. For example, we don’t know right now if the anomalous(?) results of Crafty are down to unlucky position selection.
First result using your database. We took your 5000 knight odds set, which you had already pruned to 3870 positions, and removed 1435 positions from each end, producing a list of 1000 positions exactly in the center of your list, and put it in our tester. I hope you will agree that this is fair and unbiased. The score range was -4.30 to -4.11, quite narrow, and just by chance the worst score was the same score I got from the root position at 10 seconds for both positions. For the first test, I just had Komodo 14 play against itself at the very fast time control of 10 seconds + 0.1" increment, and the result was that White won one game, one game was drawn, and Black won 998, so a 1139 elo advantage. I'm sure that at a more normal time control the result would have been even more lopsided, probably just 100%. But the results of the tests between unrelated engines aren't showing a knight handicap to be worth a thousand elo. I suppose it's just a lot harder to give a handicap to someone who knows everything you know than it is to someone with very different skills.
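For readers who want to check the arithmetic, here is a minimal sketch of the usual logistic conversion from a match score to an Elo difference. It assumes the standard 400 * log10(s / (1 - s)) formula; the function name is made up for illustration, and a given tester may use a different formula or rounding, so the output only needs to land in the same region as the figure quoted above. The loop shows how much the estimate moves when a single half point changes hands this far out in the tail.

```python
import math

def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a score fraction strictly between 0 and 1."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))

# Result quoted above, from Black's side: 998 wins, 1 draw, 1 loss in 1000 games.
games = 1000
black_points = 998 + 0.5 * 1
print(round(elo_from_score(black_points / games)))

# Tail sensitivity: move the total by half a point in either direction.
for points in (998.0, 998.5, 999.0):
    print(points, round(elo_from_score(points / games)))
```

A 100% score has no finite Elo under this formula, which is why the function rejects 0 and 1.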
You should be using the 25, 100, 250 or 1000 knight odds databases, depending on how many games are in the gauntlet, posted to github last night. Then everybody is using the same base data.

Personally I am not interested in what an engine that would cost me a hundred euros to use does, so, again, because of free widespread access, it’s more interesting to stay with Stockfish (or LC0). Unsurprising that your program trounces itself when given knight odds.

You should, btw, know better than to ascribe 1000 Elo to a 99% result, let alone extrapolate from it. The Elo scale is neither able nor meant to deal with tail results of that nature.
I made the book from your data before I knew you were going to post subsets yourself. Regarding Elo, I know that the elo estimate for 99.85% is subject to a large margin of error, for multiple reasons, but the point was to show that your set of positions is completely winning for Black, as it should be, and that it is not easy to explain why 2750 rated engines only break even from these positions vs SF.
My main goal with this is to find an engine that will perform just as well taking knight odds as would a strong human player of the same Elo, so that we could reasonably predict results of engine vs GM handicap matches by simulation.
trying hard to decode this.
Failed.
You probably haven't followed all the Komodo vs. master/GM handicap matches, so to make a long story short, we know that at Rapid time controls (let's say 15' + 10" to standardize), Komodo (on a 32-core machine) does well giving knight odds to players below about 2300 FIDE, and poorly vs. players above that. If I say that Komodo performs at about 2300 FIDE giving knight odds at this time control, I won't be far off from the truth. So I'd like to find an engine that would be equal with a 2300 FIDE player in standard chess at 15' + 10" and which would also be even with Komodo at this TC at knight odds. It's obvious that the normal engines are hundreds of Elo points away from this; they need to be something like 2750 CCRL rapid, and even those ratings are unrealistically low compared to human FIDE ratings at Rapid. I'm looking for some engine below 2300 CCRL that can hold its own with Komodo (or Stockfish, doesn't matter that much) at knight odds. The closest I've come is the weakened Stockfish levels, but it looks like even they need to be much stronger than 2300 CCRL to have a chance at knight odds, although I don't really know what SF levels would be rated on that list at Rapid. I can determine this, but it will take a lot of time.
Komodo rules!

Post by lkaufman » Thu Jun 25, 2020 2:52 am

Rebel wrote:
Wed Jun 24, 2020 6:30 pm
lkaufman wrote:
Wed Jun 24, 2020 5:57 pm
Rebel wrote:
Wed Jun 24, 2020 5:31 pm
Stockfish gauntlet, knight-odds, ccrl elo pool <2800

tc=40/40 only.

Rank Name                          Elo     +/-   Games   Score    Draw 
   0 Stockfish_11                   52      20    1000   57.4%   16.2%  
   1 Benjamin                       26      43     200   53.8%   19.5%
   2 ProDeo                         17      45     200   52.5%   14.0%
   3 Fruit_2.3                     -54      45     200   42.3%   16.5%
   4 Fruit_2.1                    -123      47     200   33.0%   15.0%  
   5 Ruffian_2                    -135      47     200   31.5%   16.0%   
Most of these engines don't have an exactly named copy in CCRL 40/15 (Benjamin and ProDeo have version numbers, two others aren't identical), but it looks like roughly 2750 on that list is the break-even point for SF 11 at that TC. I'll be curious to see if Komodo 14 can score as well against the same opponents. It would score much better with high Contempt, but so would Stockfish, so I guess it's fair enough to compare.
Have it running, takes about 2 hours.

Regarding elo's:

Rank Name                          Elo     +/-   Games   Score    Draw  CCRL
   0 Stockfish_11                   52      20    1000   57.4%   16.2%  3537
   1 Benjamin                       26      43     200   53.8%   19.5%  2646
   2 ProDeo 2.2                     17      45     200   52.5%   14.0%  2770
   3 Fruit_2.3                     -54      45     200   42.3%   16.5%  2783
   4 Fruit_2.1                    -123      47     200   33.0%   15.0%  2684
   5 Ruffian_2                    -135      47     200   31.5%   16.0%  2608
Interestingly, Benjamin, while listed 124 elo below ProDeo, scores better. I think that much of this kind of testing has to do with the killer instinct of an engine. And Benjamin is the gambit version of ProDeo. Style decides?
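As a side note on reading the Elo and +/- columns in the table above, the sketch below recomputes them from the Score, Draw and Games columns. It assumes a logistic Elo conversion and a 95% normal-approximation interval on the score; whether the tool that produced the table works exactly this way is an assumption, and the function names are made up for illustration.

```python
import math

def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_margin(score: float, draw_rate: float, games: int, z: float = 1.96):
    """Return (elo, half-width of the interval in Elo) for a match result."""
    wins = score - draw_rate / 2.0
    losses = 1.0 - wins - draw_rate
    # Per-game variance of the result around the mean score.
    variance = (wins * (1.0 - score) ** 2
                + draw_rate * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2)
    stderr = math.sqrt(variance / games)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return elo_from_score(score), (high - low) / 2.0

# Stockfish_11 row: 57.4% score, 16.2% draws, 1000 games.
elo, margin = elo_with_margin(0.574, 0.162, 1000)
print(f"{elo:.1f} +/- {margin:.1f}")
```

With only 200 games per opponent the margin roughly doubles, which is worth keeping in mind when comparing Benjamin and ProDeo.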
I think I've figured out a big part of the mystery of why both Stockfish and Komodo can give knight odds to such strong engines in your tests. When I first read that you were testing at 40/10, I assumed that the 10 meant ten minutes, since we are constantly referring to CCRL tests at 40/15 meaning 15 minutes. But when you started giving results at 40/40 and even 40/80, I realized that you must mean 40/x SECONDS, not minutes, since you are unlikely to have enough hardware to play so many games so quickly (correct me if I'm wrong here)! So even these 40/40 results are (if I'm right now) bullet games, not rapid games. At bullet chess even the top human GMs would have trouble beating Stockfish or Komodo at knight odds. So the question is, what level engine can Komodo and Stockfish give knight odds to in rapid chess (15' + 10" being the standard now)? Obviously, it will be a weaker engine than these 2700+ engines, but how much weaker? I do have one data point: at 3' + 2" (blitz, roughly midway between bullet and rapid) I got a +94 elo result for Komodo 14 vs. Arasan 14 64 bit, about 2640 on CCRL 40/15 list (est. based on versions just before and after), which is a 2734 performance. But I used Contempt 150, which helps a lot at knight odds; I'll have to redo the test without Contempt (or with the default of just 4). My best guess is that with default Contempt, Komodo 14 and SF need opponents in the mid 2600s in blitz, and in the mid 2500s in Rapid (15' + 10"). I can run Komodo vs. Arasan 14 overnight at Rapid with default Contempt; if I'm correct Komodo will lose but not too badly. I run 63 games at once, so the thousand games won't finish in 8 or 9 hours, but maybe in 15 hours or so.

Laskos
Posts: 10240
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Stockfish Handicap Matches

Post by Laskos » Thu Jun 25, 2020 5:08 am

We hyper-analyzed these issues several years ago. IIRC, at bullet, engine-engine Knight odds are worth some 600 (logistic) engine Elo points, at rapid 45min + 15s some 1200 logistic engine Elo points, and by continuation at tournament TC maybe 1400 engine Elo points (the last one was never really tested). The handicap is heavily TC dependent, and with humans it could be even more dependent (humans are weak at bullet and blitz). I don't think that even a perfect engine can give Knight odds to Carlsen at tournament TC. Currently SF and Komodo in whatever configuration are no stronger than 2100 FIDE Elo points when giving Knight odds at tournament TC against a human.
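To put those logistic Elo figures into more familiar terms, here is a small sketch that converts an Elo advantage into an expected score with the standard logistic curve. The three values are the rough estimates from this post, and the function name is made up for illustration.

```python
def expected_score(elo_advantage: float) -> float:
    """Expected score fraction for the side with the given logistic Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_advantage / 400.0))

# Rough knight-odds values from the post, taken as the advantage of the side receiving odds.
for label, elo in (("bullet", 600), ("rapid 45min + 15s", 1200), ("tournament TC", 1400)):
    print(f"{label}: {100 * expected_score(elo):.2f}%")
```

Read the other way round, an engine has to be about that many Elo points stronger than its opponent just to break even when giving the knight.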

Post by Laskos » Thu Jun 25, 2020 5:47 am

My estimate for a good "human" sparring partner at tournament TC is Lc0:

11248 at 1000 nodes --- 2750 FIDE
11248 at 100 nodes --- 2450 FIDE
11248 at 10 nodes --- 2100 FIDE.

Again, tournament TC. At Blitz 5min + 3s add some 300 Elo points to those to get FIDE Blitz rating. All this is rough estimation, but it can be improved by testing.

Post by lkaufman » Thu Jun 25, 2020 3:56 pm

Yes, we reached those conclusions a few years ago based on self-testing of the same engine at different time controls, IIRC. Whether they apply to games involving humans or to games with NN or NNUE engines that didn't exist then is unknown. Your estimate of 2100 FIDE Elo for Knight odds at tournament TC vs. a human is pretty consistent with my estimate of 2300 for the same at 15' + 10". In this new world of online-only chess it seems that 5 hour games may be mostly a thing of the past, and that Rapid has become the main form of chess, like it or not. Lc0 is probably a better substitute for a human than a conventional program for these tests. Note that the networks starting with 70xxx are now the best for such purposes, as they reverted to the no-resign training that made 11248 so good at playing lost positions; they now play quite well down a piece or more, and they are stronger and play more sensibly than the old network. For testing against CPU engines there is the problem that you can't make proper use of a machine with many CPUs and just one GPU; you could use the Lc0 cpu version I suppose. Perhaps the best "human" for such tests might be an NNUE like Stockfish NNUE, but it's a bit early to say yet.
My overnight test of Komodo 14 giving knight odds to Arasan 14 at 15' + 10" on one thread is down 8.6 elo after 686 games, implying an elo of about 2570 CCRL Rapid at knight odds. This would probably mean something like 2670 on 32 Threads (each doubling is worth much less giving knight odds than in normal chess, so maybe 20 elo per doubling), which in turn means something like 2900 FIDE Rapid based on some estimates I made of the likely human equivalence in Rapid of CCRL rapid ratings, and this was without even setting Contempt. So it appears that using conventional engines to predict the rating needed to beat Komodo at knight odds in Rapid overstates the reality by something like 600 elo! Exactly why this is so is a bit of a mystery, even allowing for the fact that the engines don't fully appreciate the circumstances and aren't optimized for winning when up a piece. Perhaps this won't be the case if we substitute an NN (or NNUE) engine for the human.
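The thread-scaling step in that estimate is simple arithmetic, sketched here for clarity. The 20 elo per doubling is the guess stated above, not a measured value, and the function name is made up for illustration.

```python
import math

def scaled_rating(base_rating: float, threads: int, elo_per_doubling: float) -> float:
    """Extrapolate a one-thread rating to more threads at a fixed gain per doubling."""
    doublings = math.log2(threads)  # 32 threads is 5 doublings from 1 thread
    return base_rating + elo_per_doubling * doublings

# About 2570 on one thread, guessed 20 elo per doubling at knight odds.
print(scaled_rating(2570, 32, 20))  # 2670.0
```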

Rebel
Posts: 5265
Joined: Thu Aug 18, 2011 10:04 am

Re: Stockfish Handicap Matches

Post by Rebel » Thu Jun 25, 2020 7:56 pm

From the SF11 - Crafty knight-odds match.


{White wins by adjudication} 1-0

Wrong adjudication by cute-chess, never seen that before.

And another one.


And there are many more.
90% of coding is debugging, the other 10% is writing bugs.

chrisw
Posts: 3234
Joined: Tue Apr 03, 2012 2:28 pm

Re: Stockfish Handicap Matches

Post by chrisw » Thu Jun 25, 2020 8:34 pm

Crafty sends out weirdo texts in the middle of search, often right in the middle of a PV stream. A common one is “Trojan check” or something. What if a UI is not expecting random non-UCI output mixed in with the UCI stream?

Post by Rebel » Thu Jun 25, 2020 9:34 pm

Sorry for the time control confusion, I used the cute-chess format.
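To make the notation concrete, here is a rough sketch of how a time control string of the form moves/time+increment can be read, with the time given either in seconds or as minutes:seconds. That this matches the cute-chess format in every detail is an assumption based on my reading of its documentation, and the helper names are made up for illustration.

```python
def parse_tc(tc: str):
    """Parse 'moves/time+increment' into (moves or None, base seconds, increment seconds)."""
    moves = None
    increment = 0.0
    if "/" in tc:
        moves_part, tc = tc.split("/", 1)
        moves = int(moves_part)
    if "+" in tc:
        tc, inc_part = tc.split("+", 1)
        increment = float(inc_part)
    if ":" in tc:
        minutes, seconds = tc.split(":", 1)
        base = 60.0 * int(minutes) + float(seconds)
    else:
        base = float(tc)  # a bare number is taken as seconds
    return moves, base, increment

def average_seconds_per_move(tc: str) -> float:
    """Average thinking time per move for a moves/time control."""
    moves, base, increment = parse_tc(tc)
    if moves is None:
        raise ValueError("no move count in time control, so no fixed average")
    return base / moves + increment

for tc in ("40/10", "40/20", "40/40"):
    print(tc, average_seconds_per_move(tc), "seconds per move")
```

Under this reading, 40/40 averages one second per move, while a CCRL-style 40/15 in minutes would be written 40/15:00.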

It won't be easy to find engines fit for your purpose. A short test:

Komodo 14  1 sec  2 sec  4 sec  8 sec  CCRL
Fruit 2.1  29.3%  30.5%  41.0%  47.5%  2684
Benjamin   46.0%  56.5%                2646
Although Benjamin is rated lower on CCRL, it needs an average of only 2 seconds to beat Komodo.

Post by lkaufman » Fri Jun 26, 2020 12:22 am

Rebel wrote:
Thu Jun 25, 2020 9:34 pm
lkaufman wrote:
Thu Jun 25, 2020 2:52 am
Rebel wrote:
Wed Jun 24, 2020 6:30 pm
Interesting is that Benjamin while listed 124 elo less than ProDeo scores better. I think that much of this kind of testing has to do with the killer instinct of an engine. And Benjamin is the gambit version of ProDeo. Style decides?
I think I've figured out a big part of the mystery of why both Stockfish and Komodo can give knight odds to such strong engines in your tests. When I first read that you were testing at 40/10, I assumed that the 10 meant ten minutes, since we are constantly referring to CCRL tests at 40/15 meaning 15 minutes. But when you starting giving results at 40/40 and even 40/80, I realized that you must mean 40/x SECONDS, not minutes, since you are unlikely to have enough hardware to play so many games so quickly (correct me if I'm wrong here)! So even these 40/40 results are (if I'm right now) bullet games, not rapid games. At bullet chess even the top human GMs would have trouble beating Stockfish or Komodo at knight odds. So the question is, what level engine can Komodo and Stockfish give knight odds to in rapid chess (15' + 10" being the standard now)? Obviously, it will be a weaker engine than these 2700+ engines, but how much weaker? I do have one data point: at 3' + 2" (blitz, roughly midway between bullet and rapid) I got a +94 elo result for Komodo 14 vs. Arasan 14 64 bit, about 2640 on CCRL 40/15 list (est. based on versions just before and after), which is a 2734 performance. But I used Contempt 150, which helps a lot at knight odds; I'll have to redo the test without Contempt (or with the default of just 4). My best guess is that with default Contempt, Komodo 14 and SF need opponents in the mid 2600s in blitz, and in the mid 2500s in Rapid (15' + 10"). I can run Komodo vs. Arasan 14 overnight at Rapid with default Contempt; if I'm correct Komodo will lose but not too badly. I run 63 games at once, so the thousand games won't finish in 8 or 9 hours, but maybe in 15 hours or so.
Sorry for the time control confusion, I used the cute-chess format.
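
To make the notation concrete: as far as I understand the cute-chess documentation, a time control is written as moves/time+increment, where time is in seconds unless given as minutes:seconds. A small sketch of a parser under that assumption (the helper names and the 40-move default for controls without a move count are hypothetical):

```python
from typing import Optional, Tuple


def parse_tc(tc: str) -> Tuple[Optional[int], float, float]:
    """Parse 'moves/time+increment' into (moves, base_seconds, increment_seconds).

    'time' is seconds or minutes:seconds; 'moves' and 'increment' are optional.
    """
    moves: Optional[int] = None
    inc = 0.0
    if "/" in tc:
        moves_part, tc = tc.split("/", 1)
        moves = int(moves_part)
    if "+" in tc:
        tc, inc_part = tc.split("+", 1)
        inc = float(inc_part)
    if ":" in tc:
        minutes, seconds = tc.split(":", 1)
        base = int(minutes) * 60 + float(seconds)
    else:
        base = float(tc)
    return moves, base, inc


def avg_seconds_per_move(tc: str, assumed_game_moves: int = 40) -> float:
    """Rough average thinking time per move for the given time control."""
    moves, base, inc = parse_tc(tc)
    n = moves if moves is not None else assumed_game_moves
    return base / n + inc


if __name__ == "__main__":
    for tc in ("40/10", "40/40", "40/15:00", "3:00+2"):
        print(tc, parse_tc(tc), round(avg_seconds_per_move(tc), 2))
```

Read this way, 40/40 is 40 moves in 40 seconds (one second per move on average), while the CCRL-style 40/15 would be written 40/15:00.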

It won't be easy to find engines fit for the purpose you have in mind. A short test:

Code: Select all

Komodo 14   1 sec   2 sec   4 sec   8 sec   CCRL
Fruit 2.1   29.3%   30.5%   41.0%   47.5%   2684
Benjamin    46.0%   56.5%                   2646
While Benjamin is rated lower on CCRL, it only needs 2 seconds on average to beat Komodo.
Very good, it looks like Benjamin is much closer to what I'm looking for than most (if not all) other engines. Based on CCRL ratings and my estimates for converting to FIDE rapid, Benjamin on one core of a modern i7 should be roughly a fair match for Magnus Carlsen at standard chess, Rapid (15' + 10") time control. By extrapolating to about 25" per move it should score somewhere in the 80 to 90% range at knight odds vs Komodo. That's probably still less than Carlsen would score, but it's not ridiculous. So, some questions about Benjamin.
1. Is it currently available? If so, how?
2. Is there a Linux version? I use Windows myself, but our Komodo tester uses Linux. I can test either way, but much faster on our tester.
3. Does it have a way to reduce the level of play (moderately), other than just giving it less time? If not, shortening the time should work fine.
4. Any insights into why it is so much better at this than Fruit? You mentioned gambit-style play, but that's not normally the sensible way to exploit an extra piece. More like the way to play when down a piece!
5. Final question: When you say "1 sec" (for example) above, is that movetime = 1 second, or 40 moves in 40 seconds so average one second? It's not a huge difference, but obviously quality of play is higher in the second case.
The idea here is that when and if we find a way to make Komodo play much better down a piece than it does now, we need a way to prove this before going to the trouble and expense of a GM match, so that we have some idea that we might do well. While it is unrealistic to expect to beat an active GM at knight odds in a Rapid match, if we specify Armageddon knight odds, meaning that draws count as wins for White (which is logical, he's down a piece), then it becomes a realistic goal.
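
As a small illustration of the Armageddon rule described above (draws count as wins for White, the side giving the odds), here is a sketch of how the match score would be counted. The function name and the sample numbers are invented for the example, not results from any match.

```python
def white_score(wins: int, draws: int, losses: int, armageddon: bool = False) -> float:
    """White's score fraction; under Armageddon rules a draw counts as a White win."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    draw_value = 1.0 if armageddon else 0.5
    return (wins + draw_value * draws) / games


if __name__ == "__main__":
    # Hypothetical 10-game match: White wins 2, draws 4, loses 4.
    print(white_score(2, 4, 4))                   # normal scoring
    print(white_score(2, 4, 4, armageddon=True))  # draws counted as White wins
```
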
Komodo rules!

Rebel
Posts: 5265
Joined: Thu Aug 18, 2011 10:04 am

Re: Stockfish Handicap Matches

Post by Rebel » Fri Jun 26, 2020 8:52 am

I wrote a util that checks whether the result tag written by cute-chess is right. For the 200 SF11-Crafty games the util reported 52 errors: games won by Crafty that cute-chess reported as a win for Stockfish ??

Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -62.66/34 1.8s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -12.54/13 0.048s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -64.15/20 0.50s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -17.17/17 0.29s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -11.32/16 0.12s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -9.49/14 0.096s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M26/27 0.69s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -16.27/21 4.0s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -14.00/20 0.64s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -9.35/12 0.11s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -9.00/21 1.9s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -56.28/26 4.6s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -10.17/16 0.33s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -9.45/16 0.25s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -16.37/22 0.97s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M20/32 0.11s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -15.38/19 1.3s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -20.86/15 0.094s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M12/30 0.053s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -68.32/23 1.7s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M46/31 0.82s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -12.16/14 0.029s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M16/37 0.27s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -65.97/27 0.30s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -56.09/1 0s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M18/23 0.032s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -26.43/17 0.51s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M18/49 2.2s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -22.86/15 0.12s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -68.94/29 0.26s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -11.33/17 0.42s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -9.55/18 0.47s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -9.41/22 0.091s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M20/27 0.43s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -70.33/1 0s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M22/48 1.5s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M6/245 0.10s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -13.62/16 0.17s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M4/245 0.008s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -57.16/17 0.13s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -65.53/26 1.2s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M26/38 1.5s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M28/42 0.63s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M20/33 0.13s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M16/50 1.1s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M16/34 0.19s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -63.28/24 0.64s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M16/40 0.32s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M26/35 0.30s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M14/31 0.048s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -11.89/20 1.2s
Stockfish_11 - Crafty_25.6 [1-0] Stockfish_11 last move -M54/21 0.26s
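
The kind of check behind the list above can be sketched roughly as follows. This is not the actual util, just a minimal illustration that assumes the move comments look like the ones listed (score/depth followed by time), that the score is from the engine's own point of view, and that a 5-pawn threshold is a reasonable cut-off; all names are hypothetical.

```python
import re
from typing import Optional

MATE_SCORE = 1000.0  # arbitrary large value standing in for mate scores


def parse_eval(comment: str) -> Optional[float]:
    """Parse the score from a comment such as '-62.66/34 1.8s' or '-M26/27 0.69s'.

    Returns the score in pawns, with mate scores mapped to +/-MATE_SCORE,
    or None if the comment does not match the expected shape.
    """
    m = re.match(r"\s*([+-]?)(M\d+|\d+(?:\.\d+)?)/\d+", comment)
    if m is None:
        return None
    sign = -1.0 if m.group(1) == "-" else 1.0
    body = m.group(2)
    value = MATE_SCORE if body.startswith("M") else float(body)
    return sign * value


def result_contradicts_eval(result: str, engine_is_white: bool,
                            last_eval: str, threshold: float = 5.0) -> bool:
    """True if the result tag credits a win to the side whose last eval says it is lost
    (or a loss to the side whose last eval says it is winning)."""
    score = parse_eval(last_eval)
    if score is None or result not in ("1-0", "0-1"):
        return False
    engine_won = (result == "1-0") == engine_is_white
    return (engine_won and score <= -threshold) or (not engine_won and score >= threshold)


if __name__ == "__main__":
    # Stockfish as White, result tag 1-0, but its last eval is hopeless.
    print(result_contradicts_eval("1-0", True, "-62.66/34 1.8s"))
    print(result_contradicts_eval("1-0", True, "-M26/27 0.69s"))
```

Each of the 52 lines above would be flagged by a check like this, since the smallest final score in the list is -9.00.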

Definitely a cute-chess bug.

Found out that adding the option "whitepov" fixes the problem and Crafty scored as one would expect.

Code: Select all

   # ENGINE          : RATING    POINTS  PLAYED    (%)
   1 Crafty_25.6     : 3034.6     119.5     200   59.8%
   2 Stockfish_11    : 2965.4      80.5     200   40.3%
90% of coding is debugging, the other 10% is writing bugs.
