Stockfish vs. Lc0: IMHO disappointing result for Lc0

Guenther · Post by **Guenther** » Sun Jul 28, 2019 9:20 am

mvanthoor wrote: ↑Sat Jul 27, 2019 10:06 pm Today I ran a short 20-game match between Stockfish 10 and Lc0. Specs of the match:

Stockfish 10 x64 BMI2 on Intel i7-6700K, 4 threads, 8GB hashtable
Lc0 0.21.3, w42850 on GTX 1070. 4 threads, everything else default.
Syzygy 5 men tablebase, 8 move Performance.bin opening book.
Adjudication by GUI on overwhelming material advantage, or by Syzygy on win/draw/loss in the endgame.

...

In some games, Leela makes exceedingly weird moves, and lost game 1 in 21 moves because of a blunder

...

Are the games available somewhere and also the other 20 games mentioned later in this thread?
(If possible old pgn format w/o gimmicks just eval/depth time)
I would like to look at them.

M ANSARI · Post by **M ANSARI** » Sun Jul 28, 2019 10:05 am

You have to realize that graphics cards with the ability to do AI are in their first generation. In CPU terms, this is sort of like running SF on a single-core 386. Of course a card like the 2080 Ti is expensive today, but so was a 386 when it first came out. My guess is that GPUs that can do AI will quickly get much more powerful and much cheaper. There is no doubt that AI will transform everything in our lives, and maybe it will be a transformation similar to when humanity discovered electricity. Lc0 is only competitive once you have reasonably good hardware. I don't think you need a 2080 Ti for that, and most likely the new 2070 Super cards are very competitive with SF. The pricing of hardware that can do AI will probably change quickly, with much more powerful cards coming out at a fraction of today's prices. Also, Lc0 will probably patch many of its weaknesses (tactical and endgame weakness) via software … remember Lc0 is only a little over a year old.

Ovyron · Post by **Ovyron** » Sun Jul 28, 2019 12:49 pm

mvanthoor wrote: ↑Sun Jul 28, 2019 2:43 am If I'd run 100 or a 1000 games, I'd roughly expect the result to be "+30 -5 =65" or "+300 -50 =650" in favor of Stockfish 10 on this computer, give or take a few points.

If you managed to create an algorithm that could extrapolate the results of 100 or 1000 games from just 20, you'd be our hero.

Currently the biggest problem of Computer Chess is testing time, and how some things require hundreds of thousands of games to measure them. If you could play a fraction of the games and extrapolate the rest accurately, it would have a big impact.

It'd probably require some algorithm that measures the expected Elo performance of the individual moves of the games, and then projects what the end result of the match would be after an arbitrary number of games played with that performance. It's not clear that such a system would produce results much faster than playing the actual games, though. The main benefit would be that you wouldn't need a pricey GPU to test: someone who already has the GPU runs 20 games, and you use those to extrapolate the final score of the match with the system on your hardware.

I wonder if a NN could be created for this (play 1000 games, feed the first 20 to the NN, make it try to predict the actual results of the 1000 games, and eventually you'd have a NN that is very accurate in its predictions). Otherwise, there's no way to know the result of a 100-game match from just 20 games, or even 40. And it's a difficult problem: you can cherry-pick 20 games from the 1000 where one side wins them all, so how could you extrapolate the outcome of the 1000 games from those?
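To put a rough number on how little 20 games pin down, here is a minimal sketch of the error margin on the +4 -1 =15 result from the opening post. It assumes the usual logistic Elo formula and a normal approximation for the 95% interval (z = 1.96), which is crude for so few games; the function names are mine, and the 1000-game line simply reuses the same proportions as a hypothetical comparison, not as real data.

```python
import math

def score_to_elo(score: float) -> float:
    """Convert a score fraction (0 < score < 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (low, estimate, high) Elo using a normal approximation.

    Assumes the interval bounds stay strictly between 0 and 1,
    which holds for the examples below.
    """
    games = wins + losses + draws
    mean = (wins + 0.5 * draws) / games
    # Per-game variance of the result (1, 0.5 or 0 points) around the mean.
    variance = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    ) / games
    margin = z * math.sqrt(variance / games)
    return (
        score_to_elo(mean - margin),
        score_to_elo(mean),
        score_to_elo(mean + margin),
    )

# The actual 20-game result reported in this thread.
low, estimate, high = elo_interval(wins=4, losses=1, draws=15)
print(f"20 games:   {low:+.0f} .. {estimate:+.0f} .. {high:+.0f} Elo")

# Hypothetical: the same proportions over 1000 games, for comparison only.
low_1000, est_1000, high_1000 = elo_interval(wins=200, losses=50, draws=750)
print(f"1000 games: {low_1000:+.0f} .. {est_1000:+.0f} .. {high_1000:+.0f} Elo")
```

Under these assumptions the 20-game interval is wide enough to include zero, so the result alone can't even say which engine is stronger on this hardware.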

S.Taylor · Post by **S.Taylor** » Sun Jul 28, 2019 1:17 pm

mvanthoor wrote: ↑Sat Jul 27, 2019 10:06 pm Today I ran a short 20-game match between Stockfish 10 and Lc0. Specs of the match:

Stockfish 10 x64 BMI2 on Intel i7-6700K, 4 threads, 8GB hashtable
Lc0 0.21.3, w42850 on GTX 1070. 4 threads, everything else default.
Syzygy 5 men tablebase, 8 move Performance.bin opening book.
Adjudication by GUI on overwhelming material advantage, or by Syzygy on win/draw/loss in the endgame.

The result was +4 -1 =15 in favor of Stockfish 10.

To be honest, after all the hype surrounding Lc0, I find the result disappointing; I'd expected the reverse. When looking into networks, I found https://www.sp-cc.de/lc0-testing.htm, and the network I used is stronger than the ones used there (+60 Elo).

I haven't looked into things such as Leela Ratio or anything yet. I'm not trying to match one engine against another on the same hardware or anything: I wanted to know: how much stronger or weaker is Lc0, running on a GTX 1070, compared to Stockfish running to the specifications of CCRL 40/4?

I ran the match at a time control of 40 moves in 85 seconds because, on my computer, that is the setting equivalent to CCRL 40/4. I wanted to know where a full-power Lc0 on a GTX 1070 would fall in the CCRL 40/4 list. Stockfish has a rating of 3547 there, and the result of +4 -1 =15 shows a rating advantage of +52 for Stockfish over Lc0, putting Lc0 at 3495. That is only 9 points above the rating of 3486 which Lc0 attains in the CCRL 40/4 list (albeit with a different network), despite the GTX 1070 being a much more powerful card than the GTX 1050. That seems disappointing.
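For reference, the +52 figure follows from the match score. A minimal sketch, assuming the standard logistic Elo formula (the function name is just for illustration):

```python
import math

def elo_diff(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a match score, using the logistic Elo model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # fraction of points scored
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)

STOCKFISH_CCRL_RATING = 3547  # CCRL 40/4 rating quoted in this post

diff = elo_diff(wins=4, losses=1, draws=15)  # 11.5 points out of 20
print(f"Elo difference for Stockfish: {diff:+.1f}")
print(f"Implied Lc0 rating: about {STOCKFISH_CCRL_RATING - diff:.0f}")
```

This only converts the score into a rating gap; it says nothing about how reliable a 20-game sample is.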

Also, the games are not very interesting. Often, after 30-35 moves or so, everything has been traded down to an endgame. Also, it's often Stockfish preventing a draw by threefold repetition (because of the default contempt probably), and even so, many games ended in threefold repetition. In some games, Leela makes exceedingly weird moves, and lost game 1 in 21 moves because of a blunder. With regard to Stockfish, I can mostly understand what it's trying to do with a move, but with Lc0, I'm often left guessing. Because Lc0 "only" searches 10K nodes or so in the endgame, while Stockfish is often already into the 10+ million, Stockfish reaches the endgame database much faster. I often see Leela struggling to look beyond 12 ply or so, while Stockfish is soaring into the 40 ply range, reaching the endgame database from the late middle game.

Of course, my expectation wasn't for Lc0 to blow Stockfish out of the water with a 20-0 result, but I did expect it to win with a +2 score or so. Could/should I be using a different network (I've seen some networks that were smaller, faster, and had a higher ELO-rating than the 42850 I used)? Are my expectations wrong, and is a GTX 1070 just not powerful enough?

I don't play a lot of games. I always pick a midrange card; in this case I picked the GTX 1070 in 2016, because of The Witcher 3, but if I don't acquire a newer game that needs a lot more power, this card is likely to also be in my next computer. I do need/use a lot of CPU power for some of my tasks, so the 6700K will probably be replaced by a 12-core machine, at least. If Stockfish already wins by +4 -1, running on an old i7-6700K against Lc0 on a GTX 1070, I shudder to think how it would decimate Lc0 @ GTX 1070 when running on one of the new Zen3 CPUs with 12 or 16 cores if I should get a new computer (but not a new graphics card).

PS: I found the JH.T6.532a net used for the CCRL rating. I'll rerun a longer test. The match will be run at 40 moves in 85 seconds repeating to comply with the CCRL 40/4 list, and Lc0 0.21.3 JH.T6.532 will run full-out on a GTX 1070. That should give an approximation of Elo-difference between the GTX 1050 and 1070, at least for this particular net.

PS2: I have also put the current list into a spreadsheet with a filter on the Elo field. The strongest network I've found is 10968, from August 22, 2018. So it's an old one... how can it be so strong? Were all the other networks much weaker, meaning that old network Elo can't be related to new networks? I'll run a test using that network as well.

Imagine what it would look like if the human world championship was 40 moves per 85 seconds.
After all that preparation and sponsors and venue and press, 12 games would be over in about half an hour.
And then, all the celebrations and books on it, until 2 years later.

mvanthoor · Post by **mvanthoor** » Sun Jul 28, 2019 3:46 pm

Guenther wrote: ↑Sun Jul 28, 2019 9:20 am Are the games available somewhere and also the other 20 games mentioned later in this thread?
(If possible old pgn format w/o gimmicks just eval/depth time)
I would like to look at them.

The games are attached to this post.

mvanthoor · Post by **mvanthoor** » Sun Jul 28, 2019 4:05 pm

S.Taylor wrote: ↑Sun Jul 28, 2019 1:17 pm
Imagine what it would look like if the human world championship was 40 moves per 85 seconds.
After all that preparation and sponsors and venue and press, 12 games would be over in about half an hour.
And then, all the celebrations and books on it, until 2 years later.

As I said, I wanted to test Lc0 against Stockfish on CCRL terms, so I'd be able to see where Lc0 would fall in the CCRL list.

This isn't really possible, due to the things I described before.

If you have an engine E, running on CPU A, using a time control of 40/4, this engine might have 3400 Elo.
If you run Lc0 on a GTX 1070, also at a time control of 40/4, it might also have 3400 Elo.

Now, get a new computer, with a CPU B, exactly twice as fast as CPU A, and keep everything else the same.
CCRL specifications require that the engine on the new CPU has the same power as it had on the old CPU; thus you have to cut the TC in half.

Now the time control will be 40/2. The engine E will still be 3400 Elo.
However, Lc0, running on the same GTX 1070 as before (old card in the new computer), will now also run at 40/2.

The engine has twice the speed, but half the thinking time. So, if it did X nodes in 60 seconds, it now does the same X nodes in 30 seconds, which comes out exactly even; but Lc0 lost half its thinking time with no compensation. Testing Lc0 under the CCRL specifications for alpha-beta (A/B) CPU engines benefits the A/B engines: the faster the CPU in the computer becomes, the shorter the time control will be, and the weaker Lc0 seems to be in comparison. It's even logical, as Elo is not a fixed value; it's a statistical comparison. Cutting the time in half, and then speeding up the CPU to compensate but not speeding up the GPU, *DOES* indeed weaken Lc0 as compared to the CPU engine.
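The arithmetic can be sketched in a few lines. The node rates below are made-up round numbers, chosen only to make the effect visible; they are not measurements from either engine.

```python
def nodes_searched(nodes_per_second: float, seconds: float) -> float:
    """Total nodes an engine gets through in the given thinking time."""
    return nodes_per_second * seconds

# Hypothetical speeds, for illustration only.
CPU_A_NPS = 1_000_000   # engine E on CPU A
GPU_NPS = 10_000        # Lc0 on the unchanged GTX 1070
CPU_SPEEDUP = 2.0       # CPU B is exactly twice as fast as CPU A

OLD_TC_SECONDS = 60.0
NEW_TC_SECONDS = OLD_TC_SECONDS / CPU_SPEEDUP  # CCRL-style compensation

cpu_before = nodes_searched(CPU_A_NPS, OLD_TC_SECONDS)
cpu_after = nodes_searched(CPU_A_NPS * CPU_SPEEDUP, NEW_TC_SECONDS)
gpu_before = nodes_searched(GPU_NPS, OLD_TC_SECONDS)
gpu_after = nodes_searched(GPU_NPS, NEW_TC_SECONDS)

print(f"CPU engine: {cpu_before:,.0f} -> {cpu_after:,.0f} nodes")
print(f"Lc0 on GPU: {gpu_before:,.0f} -> {gpu_after:,.0f} nodes")
```

The CPU engine searches the same number of nodes before and after, while Lc0's node count halves, because only the CPU got faster.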

I can't readily see a way around this, to be honest.

Because of the small sample size and the above caveat with regard to CCRL specifications, the only thing that can be tentatively said is:

"At a time control of 40m/85s, Stockfish 10 x64 4T on a 6700K is about 80 Elo stronger than Lc0's currently strongest network on a GTX 1070."

I could run a test with a series of 100 games at 40m/4m+5s, shorten the opening book to 6 moves instead of 8, and remove the EGTB (except for adjudication) to make the engines play as much of the game themselves as possible. Then also play each opening twice, to have each engine play out the opening from both sides. That'd probably give a better account of Lc0's playing strength... but the ratings will not be compatible with CCRL 40/4, obviously.

mwyoung · Post by **mwyoung** » Sun Jul 28, 2019 6:14 pm

mvanthoor wrote: ↑Sun Jul 28, 2019 4:05 pm
S.Taylor wrote: ↑Sun Jul 28, 2019 1:17 pm
Imagine what it would look like if the human world championship was 40 moves per 85 seconds.
After all that preparation and sponsors and venue and press, 12 games would be over in about half an hour.
And then, all the celebrations and books on it, until 2 years later.
As I said, I wanted to test Lc0 against Stockfish on CCRL terms, so I'd be able to see where Lc0 would fall in the CCRL list.

This isn't really possible, due to the things I described before.

If you have an engine E, running on CPU A, using a time control of 40/4, this engine might have 3400 Elo.
If you run Lc0 on a GTX 1070, also at a time control of 40/4, it might also have 3400 Elo.

Now, get a new computer, with a CPU B, exactly twice as fast as CPU A, and keep everything else the same.
CCRL specifications require that the engine on the new CPU has the same power as it had on the old CPU; thus you have to cut the TC in half.

Now the time control will be 40/2. The engine E will still be 3400 Elo.
However, Lc0, running on the same GTX 1070 as before (old card in the new computer), will now also run at 40/2.

The engine has twice the speed, but half the thinking time. So, if it did X nodes in 60 seconds, it now does the same X nodes in 30 seconds, which comes out exactly even; but Lc0 lost half its thinking time with no compensation. Testing Lc0 under the CCRL specifications for alpha-beta (A/B) CPU engines benefits the A/B engines: the faster the CPU in the computer becomes, the shorter the time control will be, and the weaker Lc0 seems to be in comparison. It's even logical, as Elo is not a fixed value; it's a statistical comparison. Cutting the time in half, and then speeding up the CPU to compensate but not speeding up the GPU, *DOES* indeed weaken Lc0 as compared to the CPU engine.

I can't readily see a way around this, to be honest.

Because of the small sample size and the above caveat with regard to CCRL specifications, the only thing that can be tentatively said is:

"At a time control of 40m/85s, Stockfish 10 x64 4T on a 6700K is about 80 Elo stronger than Lc0's currently strongest network on a GTX 1070."

I could run a test with a series of 100 games at 40m/4m+5s, shorten the opening book to 6 moves instead of 8, and remove the EGTB (except for adjudication) to make the engines play as much of the game themselves as possible. Then also play each opening twice, to have each engine play out the opening from both sides. That'd probably give a better account of Lc0's playing strength... but the ratings will not be compatible with CCRL 40/4, obviously.

CCRL testing is broken in regards to testing NN engines. Their method is bad and does not show the true strength of the NN engines. For the reason you have mentioned.

Scaling works both ways: hardware or time. Here are the results for the latest Lc0 network, 42845, at 1m+1s, 2950X (16 CPUs) vs. 2080 Ti. So you will need to test at a longer TC, or buy faster hardware. As my TC increases, Lc0 overtakes Stockfish.

[Attached image: 1m+1s.jpg]

mvanthoor · Post by **mvanthoor** » Sun Jul 28, 2019 8:08 pm

mwyoung wrote: ↑Sun Jul 28, 2019 6:14 pm CCRL testing is broken in regards to testing NN engines. Their method is bad and does not show the true strength of the NN engines. For the reason you have mentioned.

Stockfish's current CCRL playing strength is basically locked to a 12 year old computer. When CCRL started, the games were (probably) run at a real 40 moves in 4 minutes time control on a specific computer. To be able to use faster hardware, the faster hardware uses shorter time controls. To obtain the speed of the hardware, Crafty 19.17 is used. Engines got stronger, so Elo ratings rose because of software improvements, but in effect, the list still shows ratings as they would have been on that older computer.
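As a rough illustration of that scaling: on my machine the real 40 moves in 4 minutes becomes 40 moves in 85 seconds, which implies a speed factor relative to the reference computer. A small sketch, assuming the time control is simply divided by that factor (how CCRL derives the factor from the Crafty benchmark is not modelled here, and the names are mine):

```python
REFERENCE_TC_SECONDS = 4 * 60  # "40 moves in 4 minutes" on the reference machine
LOCAL_TC_SECONDS = 85          # the equivalent setting on the i7-6700K

# Speed factor implied by the two time controls.
speed_factor = REFERENCE_TC_SECONDS / LOCAL_TC_SECONDS

def scaled_time_control(reference_seconds: float, factor: float) -> float:
    """Thinking time on a machine that is `factor` times faster than the reference."""
    return reference_seconds / factor

print(f"Implied speed factor: {speed_factor:.2f}x")
print(f"40 moves in {scaled_time_control(REFERENCE_TC_SECONDS, speed_factor):.0f} s")
# A hypothetical machine twice as fast again would get an even shorter control.
print(f"40 moves in {scaled_time_control(REFERENCE_TC_SECONDS, 2 * speed_factor):.1f} s")
```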

Is there an engine match program that can set different time controls for each engine?

Then I could do a chain match to "reconvert" back from the current (shortened) time control to a real 40m/4m TC:

Match 1: Stockfish A, 40m/85sec (CCRL-spec, 3547) vs. Stockfish B, 40m/4m ("full power" for that time control on the 6700K, Rating X as compared to the CCRL-spec stockfish)

Match 2: Stockfish B (Rating X) vs. Lc0, both at 40m/4m (Rating Y for Lc0).

This would give you the CCRL-compatible rating at which Lc0 would have played, if you could have put the GTX 1070 into that 12 year old computer and have it run at full speed.
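The chain itself is just two Elo differences added on top of the 3547 anchor. A minimal sketch, assuming the logistic Elo formula; the two match results below are placeholders I made up to show the mechanics, not results of any match that has been played.

```python
import math

def elo_diff(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a match score (logistic Elo model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)

CCRL_ANCHOR = 3547  # Stockfish A at 40m/85sec, CCRL-spec

# HYPOTHETICAL results, for illustration only.
# Match 1: Stockfish B (40m/4m) scored against Stockfish A (40m/85sec).
match1 = dict(wins=30, losses=10, draws=60)
# Match 2: Lc0 scored against Stockfish B, both at 40m/4m.
match2 = dict(wins=25, losses=15, draws=60)

rating_x = CCRL_ANCHOR + elo_diff(**match1)  # Stockfish B, "Rating X"
rating_y = rating_x + elo_diff(**match2)     # Lc0, "Rating Y"

print(f"Rating X (Stockfish B at 40m/4m): {rating_x:.0f}")
print(f"Rating Y (Lc0 at 40m/4m):         {rating_y:.0f}")
```

Each link in the chain carries its own error margin, so the uncertainty on Rating Y is larger than that of either match alone.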

Such an in-between tournament, if large enough, would also be useful to recalibrate the entire CCRL-list against a new time control, such as a real 40m/5m+10sec, as the 40/4 matches are becoming hard to run. In the near future, we'll be seeing 40/4 equivalents such as 40 moves in 30 seconds, or even faster... (But that would be another topic.)

Ovyron · Post by **Ovyron** » Sun Jul 28, 2019 8:41 pm

mvanthoor wrote: ↑Sun Jul 28, 2019 8:08 pm Is there an engine match program that can set different time controls for each engine?

I think Aquarium can do that.

A free alternative would be using different fixed time per move for each engine in the Lucas Chess GUI. If it worked for Alpha Zero, it can work for Leela.

(I like fixed time per move better than "n moves in 40 minutes" anyway, since in repeating time control engines tend to play their moves inconsistently, usually faster as they approach the next phase, so "1 minute for 1 move" is better than 40/40. You can take a look at their move quality regardless of how they manage their time)

mvanthoor · Post by **mvanthoor** » Mon Jul 29, 2019 12:39 am

Ovyron wrote: ↑Sun Jul 28, 2019 8:41 pm
mvanthoor wrote: ↑Sun Jul 28, 2019 8:08 pm Is there an engine match program that can set different time controls for each engine?
I think Aquarium can do that.

Thanks for the option. I'll look into it if I can't find anything else.

It seems Little Blitzer can do it:

http://www.kimiensoftware.com/software/ ... tleblitzer

"Can allocate per-engine time with sub-millisecond accuracy."

For some reason (can't remember) I chose Cute Chess over Little Blitzer some time ago. (When testing engines on one core with ponder off, you can run 4 games in parallel on a quad core CPU with Cute Chess or Little Blitzer. That's why I don't use the Fritz11 GUI.) I'll have a look into LB again, because Cute Chess can't stop a tournament and then resume (something which the Fritz11 GUI can). Maybe LB can stop tournaments.

If I can make this different time control per engine work somehow, then for certain I'm going to try what I posted above:

Run a 100-game match with SF10 @ 40/85sec against SF10 @ 40/4, so I can discover the SF10 rating for real 40/4 games on this computer.
Then I'll run an Lc0 vs. SF10 match with 40/4 for both of them, so I'll be able to discover the real 40/4 rating of Lc0 on a GTX 1070. (For the used network.)
