LCzero sacs a knight for nothing

Daniel Shawul · Post by **Daniel Shawul** » Fri Apr 20, 2018 6:28 pm

jkiliani wrote:
mirek wrote:
Daniel Shawul wrote: A GTX 1080 Ti is 11 TFlops, and 64 cores is 1 TFlops so that is an 11X hardware advantage. Why not use the same 64 CPU cores for it and see if it will beat Stockfish ?
You have elevated the minimum hardware requirement for A0 to be competitive with Stockfish to 11 x 64 ≈ 700 CPU cores, each thinking for 1 min per move. So if you had used one core, which is the standard in chess rating lists, you would have to use a time control of over 11 hours per move.
Ah, this again? So comparing performance per $ or performance per Watt obviously doesn't concern you, right? I can see a team of scientists deciding whether to run their simulation on 10x1080Ti or 7000 CPU cores and in the end going for the CPUs, because while the GPUs would be cheaper, they would also provide an unfair advantage over the CPUs.

We will see what HW and engines e.g. correspondence players will use once LC0 gets to A0 level. If you want to insist that 1 CPU core is the only correct metric on which to measure engine strength, then have it your way. But don't be surprised if in a correspondence game you get completely smashed by an engine which, according to your "correct 1 core rating", is some 500 Elo weaker than your alpha-beta searcher.

Daniel Shawul wrote: In any case you are backing off from your "bet" that the net is going to improve the tactics. Now you insist on some form of hardware advantage to cover for tactical weakness. Which is it?
I am not backing off. The test is independent of hardware: we can just let it calculate roughly as many nodes as were used in the game before the blunder was played and see how it goes with new weights, e.g. at weekly intervals, until the net gets resized again.
To accommodate Daniel's valid concerns, we should run both Alpha-Beta engines and neural net engines on the abacus

Seriously, equivalent power use is a fair metric. Otherwise what would be the point of improving hardware at all? If Alpha-Beta crunchers find a good way to use modern GPUs they should implement these by all means...

I have no objection to that; in fact, I have mentioned several times that the only metric that would make sense to me is a per-dollar or per-watt comparison. However, I would like to know whether the success of A0 came from a hardware or a software improvement. DeepBlue was a hardware success story and they used FPGA to accelerate their eval.
In the case of A0, the eval is a bulky NN that is hugely accelerated with "specialty" hardware.

Comparing scorpio-mcts-min with L0, which use more or less similar algorithms, I get results that are largely in favour of Scorpio, at least up to a 320+2 time control. Scorpio has a 100x faster eval than L0's 10x128 NN eval, and yet Scorpio beats it. L0 chose a bulky eval, so it has to pay for its slowness on the same hardware; otherwise it would be an unfair comparison. When CCLS ran L0 vs scorpio-mcts-min, Leela beat it, but L0 was running on a 1080.

20+0.1

Score of lczero vs scorpio-mcts-min: 2 - 40 - 3  [0.078] 45
Elo difference: -429.59 +/- 252.08
SPRT: llr -3.01, lbound -2.94, ubound 2.94 - H0 was accepted
Finished match

40+0.1

Score of lczero vs scorpio-mcts-min: 3 - 35 - 8  [0.152] 46
Elo difference: -298.39 +/- 125.93
SPRT: llr -3.03, lbound -2.94, ubound 2.94 - H0 was accepted
Finished match

80+0.2

Finished game 36 (scorpio-mcts-min vs lczero): * {No result}
Score of lczero vs scorpio-mcts-min: 0 - 30 - 5  [0.071] 35
Elo difference: -445.58 +/- 206.80
Finished match

320+2

scorpio-mcts-min     91    183    183     15    73.3%   -90    13.3% 
          lczero    -90    183    183     15    26.7%    91    13.3%

Daniel
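
The "Elo difference" figures in the 20+0.1, 40+0.1 and 80+0.2 logs above follow from the score fraction under the usual logistic Elo model. A minimal sketch that recomputes them from the win/loss/draw counts; the function name is illustrative and not taken from any tool used in this thread:

```python
import math

def elo_from_wld(wins: int, losses: int, draws: int) -> float:
    """Elo difference implied by a match result under the logistic Elo model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # fraction of points scored
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return -400.0 * math.log10(1.0 / score - 1.0)

# lczero vs scorpio-mcts-min, counted from lczero's side as (wins, losses, draws)
matches = {"20+0.1": (2, 40, 3), "40+0.1": (3, 35, 8), "80+0.2": (0, 30, 5)}
for time_control, wld in matches.items():
    print(f"{time_control}: {elo_from_wld(*wld):+.2f}")
```

The point estimates agree with the logged values to two decimals; the "+/-" margins in the logs are a separate calculation and are not reproduced here.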

Werewolf · Post by **Werewolf** » Fri Apr 20, 2018 7:09 pm

It is extraordinary that the latest version, 154, despite being quite good now in terms of Elo, cannot find the winning move below even after 1.5 million rollouts.

It took nearly 20 minutes on my Geforce 1060 to do this.

The 1990 Mephisto Lyon 68020 gets this in a second...

[pgn] 1.e4 e5 2.Nf3 d6 3.Nc3 g6 4.Bc4 Bg4 5.Nxe5 [/pgn]

Werewolf · Post by **Werewolf** » Fri Apr 20, 2018 7:39 pm

Not as simple as I thought. It seems LCZero IS learning SOME tactical patterns.

The one below was solved in a mere 3 seconds (same version etc. as above)

[pgn] 1. e4 e6 2. d4 d5 3. e5 Bb4+ 4. c3 Ba5 5. Bd3 Ne7 6. Nf3 O-O 7. b4 Bb6 8. a3 a6 [/pgn]

The winning move is of course 9.Bh7+!

By contrast the Lyon 68020 takes a few minutes...

Guenther · Post by **Guenther** » Fri Apr 20, 2018 7:42 pm

Werewolf wrote:Not as simple as I thought. It seems LCZero IS learning SOME tactical patterns.

The one below was solved in a mere 3 seconds (same version etc. as above)

[pgn] 1. e4 e6 2. d4 d5 3. e5 Bb4+ 4. c3 Ba5 5. Bd3 Ne7 6. Nf3 O-O 7. b4 Bb6 8. a3 a6 [/pgn]

The winning move is of course 9.Bh7+!

By contrast the Lyon 68020 takes a few minutes...

Just for the record, 7. Bxh7+ is already winning here.

Albert Silver · Post by **Albert Silver** » Fri Apr 20, 2018 7:43 pm

jdart wrote: Still, if on current TCEC hardware it is dropping a piece, and if you gave L0 10x the CPU power it has on the TCEC system, it still seems likely to me that it would fail to find tactics that Stockfish does find, especially on that hardware.

A0 on the other hand outplayed Stockfish rather convincingly in the games that I saw. It was on big custom hardware, but I have to think it must have had a different/better algorithm too.

--Jon

I was under the impression that the algorithm used here was the one the DeepMind team published in that preliminary paper.

"Instead of an alpha-beta search with domain-specific enhancements, AlphaZero uses a general purpose Monte-Carlo tree search (MCTS) algorithm. Each search consists of a series of simulated games of self-play that traverse a tree from root s_root to leaf. Each simulation proceeds by selecting in each state s a move a with low visit count, high move probability and high value (averaged over the leaf states of simulations that selected a from s) according to the current neural network f_θ. The search returns a vector π representing a probability distribution over moves, either proportionally or greedily with respect to the visit counts at the root state."

Or was that incomplete or unusable for some reason?
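
For reference, the selection step in that quote is usually written as the PUCT rule, and the returned vector is built from the root visit counts. The sketch below is a minimal illustration under assumptions: the `Child` class, the function names, the move labels and the exploration constant `c_puct = 1.5` are placeholders chosen here, not values from the paper or from LCZero's source.

```python
import math
from dataclasses import dataclass
from typing import Dict

@dataclass
class Child:
    prior: float            # move probability from the policy head
    visits: int = 0         # how often this move was selected
    value_sum: float = 0.0  # sum of backed-up leaf values

    @property
    def q(self) -> float:
        # Mean value over simulations that selected this move (0 if unvisited).
        return self.value_sum / self.visits if self.visits else 0.0

def select_move(children: Dict[str, Child], c_puct: float = 1.5) -> str:
    """Pick the move with the best Q + U; U favours high prior and low visits."""
    total = sum(c.visits for c in children.values())

    def score(c: Child) -> float:
        return c.q + c_puct * c.prior * math.sqrt(total) / (1 + c.visits)

    return max(children, key=lambda move: score(children[move]))

def root_policy(children: Dict[str, Child], greedy: bool = False) -> Dict[str, float]:
    """Move distribution from root visit counts: proportional or greedy."""
    total = sum(c.visits for c in children.values())
    if total == 0:
        raise ValueError("no simulations have been run")
    if greedy:
        best = max(children, key=lambda move: children[move].visits)
        return {move: 1.0 if move == best else 0.0 for move in children}
    return {move: c.visits / total for move, c in children.items()}

# Hypothetical root with three candidate moves.
children = {
    "Nf3": Child(prior=0.5, visits=10, value_sum=6.0),
    "e4": Child(prior=0.3, visits=2, value_sum=1.4),
    "h4": Child(prior=0.2),
}
print(select_move(children))
print(root_policy(children))
```

In this toy root the next simulation goes to "e4": it has a slightly higher mean value than "Nf3" and far fewer visits, so its exploration bonus is larger.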

Werewolf · Post by **Werewolf** » Fri Apr 20, 2018 7:48 pm

Guenther wrote:
Werewolf wrote:Not as simple as I thought. It seems LCZero IS learning SOME tactical patterns.

The one below was solved in a mere 3 seconds (same version etc. as above)

[pgn] 1. e4 e6 2. d4 d5 3. e5 Bb4+ 4. c3 Ba5 5. Bd3 Ne7 6. Nf3 O-O 7. b4 Bb6 8. a3 a6 [/pgn]

The winning move is of course 9.Bh7+!

By contrast the Lyon 68020 takes a few minutes...
Just for the record, already 7. Bxh7+ is winning here

Yes, but 7.b4 is also good: if Black prevents Bh7+, he loses a piece.
It seems LCZero finds it harder than normal engines to find the best move when there is a good alternative.

Laskos · Post by **Laskos** » Fri Apr 20, 2018 9:15 pm

George Tsavdaris wrote:
Laskos wrote: Can somehow confirm with LittleBlitzer and InBetween, from 3-mover balanced book:
Games Completed = 30 of 100 (Avg game length = 2.370 sec)
Settings = RR/64MB/1000ms per move/M 9000cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 195 sec elapsed, 455 sec remaining
 1.  LCZero CPU ID153 p=1     	12.0/30	7-13-10  	(L: m=13 t=0 i=0 a=0)	(D: r=7 i=0 f=2 s=1 a=0)	(tpm=33.3 d=6.09 nps=35)
 2.  SF9 depth=1              	18.0/30	13-7-10  	(L: m=7 t=0 i=0 a=0)	(D: r=7 i=0 f=2 s=1 a=0)	(tpm=10.9 d=1.00 nps=43940)
Can you repeat EXACTLY the same but with ID154 that in selfplay it gives +50 ELO compared to 153?
I want to see how the +50 ELO of selfplay are translated even in this short match.

PS: In order to do this you set Stockfish do a 1 ply search(how? i forgot about all these things) and for LC0 you just put in the parameters the "-p 1" or something else is needed also?

I don't see much meaning in such a test. Anyway, I use a different methodology from theirs, which is self-play, no-book, fixed-playouts matches and seems seriously flawed from many points of view. I play games against one different engine of comparable strength under my conditions (1s/move on 1 core), at fixed time, with a 3-mover balanced book for diversity.

My progress (probably more real) is visible here, where ID124 still stands as the strongest. The training after that, during the bug, is clearly visible as a regression of 120+ Elo points; the recovery of ID154 to the ID124 level (just several hours ago) is also visible. Their misleading graph shows an ever-optimistic increase:
http://lczero.org/

ID154 is 280 Elo points stronger than ID124 in their plot, while my results are consistent with the two being of similar strength: the net has just recovered from the serious bug.

Games Completed = 200 of 200 (Avg game length = 98.619 sec)
Settings = Gauntlet/64MB/1000ms per move/M 2500cp for 3 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 5478 sec elapsed, 0 sec remaining
 1.  LCZero CPU ID124         	65.0/200	44-114-42  	(L: m=114 t=0 i=0 a=0)	(D: r=31 i=4 f=4 s=2 a=1)	(tpm=947.8 d=12.50 nps=202)
 2.  Jabba 1.0                	135.0/200	114-44-42  	(L: m=44 t=0 i=0 a=0)	(D: r=31 i=4 f=4 s=2 a=1)	(tpm=802.9 d=8.98 nps=0)

Games Completed = 200 of 200 (Avg game length = 88.613 sec)
Settings = Gauntlet/64MB/1000ms per move/M 9000cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 5056 sec elapsed, 0 sec remaining
 1.  LCZero CPU ID131         	41.5/200	27-144-29  	(L: m=144 t=0 i=0 a=0)	(D: r=27 i=0 f=1 s=0 a=1)	(tpm=947.0 d=12.50 nps=126)
 2.  Jabba 1.0                	158.5/200	144-27-29  	(L: m=27 t=0 i=0 a=0)	(D: r=27 i=0 f=1 s=0 a=1)	(tpm=803.9 d=8.82 nps=0)

Games Completed = 200 of 200 (Avg game length = 92.903 sec)
Settings = Gauntlet/64MB/1000ms per move/M 5500cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 5264 sec elapsed, 0 sec remaining
 1.  LCZero CPU ID139         	39.5/200	22-143-35  	(L: m=143 t=0 i=0 a=0)	(D: r=25 i=6 f=3 s=1 a=0)	(tpm=948.6 d=12.52 nps=175)
 2.  Jabba 1.0                	160.5/200	143-22-35  	(L: m=22 t=0 i=0 a=0)	(D: r=25 i=6 f=3 s=1 a=0)	(tpm=803.7 d=8.73 nps=0)

Games Completed = 200 of 200 (Avg game length = 91.517 sec)
Settings = Gauntlet/64MB/1000ms per move/M 9000cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 4855 sec elapsed, 0 sec remaining
 1.  LCZero CPU ID147         	45.0/200	35-145-20  	(L: m=145 t=0 i=0 a=0)	(D: r=14 i=4 f=2 s=0 a=0)	(tpm=945.0 d=12.48 nps=178)
 2.  Jabba 1.0                	155.0/200	145-35-20  	(L: m=35 t=0 i=0 a=0)	(D: r=14 i=4 f=2 s=0 a=0)	(tpm=804.0 d=8.87 nps=0)

Games Completed = 200 of 200 (Avg game length = 97.840 sec)
Settings = Gauntlet/64MB/1000ms per move/M 9000cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 5223 sec elapsed, 0 sec remaining
 1.  LCZero CPU ID152         	58.0/200	42-126-32  	(L: m=126 t=0 i=0 a=0)	(D: r=24 i=3 f=3 s=1 a=1)	(tpm=948.7 d=12.53 nps=242)
 2.  Jabba 1.0                	142.0/200	126-42-32  	(L: m=42 t=0 i=0 a=0)	(D: r=24 i=3 f=3 s=1 a=1)	(tpm=803.2 d=9.03 nps=0)

Games Completed = 200 of 200 (Avg game length = 103.383 sec)
Settings = Gauntlet/64MB/1000ms per move/M 9000cp for 30 moves, D 150 moves/EPD:C:\LittleBlitzer\3moves_GM_04.epd(817)
Time = 5473 sec elapsed, 0 sec remaining
 1.  LCZero CPU ID154         	63.0/200	40-114-46  	(L: m=114 t=0 i=0 a=0)	(D: r=34 i=4 f=6 s=0 a=2)	(tpm=952.3 d=12.49 nps=282)
 2.  Jabba 1.0                	137.0/200	114-40-46  	(L: m=40 t=0 i=0 a=0)	(D: r=34 i=4 f=6 s=0 a=2)	(tpm=804.0 d=9.15 nps=0)

One issue might be that different nets scale a bit differently with TC and hardware, but I haven't really observed anything like this. They all surely scale better than standard A/B engines.
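
To put rough error bars on the gauntlet results above, each W-L-D line can be converted to an Elo difference with an approximate 95% interval. The sketch below uses a normal approximation on the per-game score, which is one common way to do this and an assumption here; it is not necessarily what LittleBlitzer or any rating tool computes, and the function names are illustrative.

```python
import math

def elo(score: float) -> float:
    """Elo difference for a score fraction under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (low, point, high) Elo using a normal approximation on the score."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + 0.5 * d
    # Per-game variance of the score, which takes values 1, 0.5 or 0.
    var = w * (1.0 - mu) ** 2 + d * (0.5 - mu) ** 2 + l * mu ** 2
    half = z * math.sqrt(var / n)
    low = max(mu - half, 1e-6)        # clamp to keep the logarithm finite
    high = min(mu + half, 1.0 - 1e-6)
    return elo(low), elo(mu), elo(high)

# (wins, losses, draws) of each net against Jabba 1.0, from the tables above
results = {
    "ID124": (44, 114, 42),
    "ID131": (27, 144, 29),
    "ID139": (22, 143, 35),
    "ID147": (35, 145, 20),
    "ID152": (42, 126, 32),
    "ID154": (40, 114, 46),
}
for net, wld in results.items():
    low, point, high = elo_interval(*wld)
    print(f"{net}: {point:+.0f} Elo vs Jabba 1.0 (approx. 95%: {low:+.0f} .. {high:+.0f})")
```

Under this approximation the ID124 and ID154 intervals overlap almost entirely, while the ID131 interval lies below that of ID124, which fits the reading of a regression followed by a recovery.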

CMCanavessi · Post by **CMCanavessi** » Fri Apr 20, 2018 10:22 pm

My tests arrive at very similar numbers to yours, Kai (though I've tested 150 as the strongest, and have not tested any newer network... maybe 156 will be the next one). And yes, the self-play match games between networks on the main site are terrible and misleading.

My gauntlet numbers:

The calculated Elo:

Michel · Post by **Michel** » Fri Apr 20, 2018 10:44 pm

CMCanavessi wrote: And yes, the self-play match games between networks on the main site are terrible and misleading.

I wonder why that is. Now that the matches are no longer used for gating and there is much more opening variety, the graph should in principle be correct on average.

So it seems that Elo is not additive in this case.

One possible explanation might be that buggy engines do not satisfy the Elo model. This was an observation by HGM in a slightly different context. Of course it is a bit unclear how to define a buggy engine...

MonteCarlo · Post by **MonteCarlo** » Fri Apr 20, 2018 10:50 pm

Daniel Shawul wrote: I suspect the averaging of scores is responsible for this blunder. When a position has a few good moves and the policy network fails to pick them, these things can happen.

It's the second thing you mentioned, not the first.

I ran the same position a few times to visit counts similar to those seen in TCEC for that position (with the correct history), and each time the crucial Re1+ had either 0 or 1 visits.

The probabilities output by the policy head for some of those moves are just too low for the search to explore them enough for averaging to even be a problem.

With 0 visits to the correct line it obviously doesn't matter what backup operator you use, while with 1 visit (and no quiescence, of course) you still haven't resolved the tactic to the end (and the value from the value head for Re1+ is still incorrectly favorable for white).

The crucial line was just both essentially unexplored and (when it was minimally explored) misevaluated. It wouldn't have mattered in this particular case what kind of backup was used.

It's an interesting question in the general case how best to back up values, but in this specific case it was just a failure of the net through and through.
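
The starvation effect described above can be reproduced with a toy one-level PUCT search. Everything in this sketch is an assumption made for illustration: the priors, the fixed move values, `c_puct = 1.5`, and the choice to score unvisited moves with Q = 0. "Re1+" is only a label here, not the actual TCEC position, and unlike in the real game the toy gives it a favourable value so that the effect of the prior is isolated.

```python
import math

def run_puct(priors, values, simulations, c_puct=1.5):
    """Toy one-level PUCT: each visit to a move backs up that move's fixed value."""
    visits = {move: 0 for move in priors}
    value_sum = {move: 0.0 for move in priors}
    for _ in range(simulations):
        total = sum(visits.values())

        def score(move):
            # Assumed: an unvisited move is scored with Q = 0.
            q = value_sum[move] / visits[move] if visits[move] else 0.0
            u = c_puct * priors[move] * math.sqrt(total) / (1 + visits[move])
            return q + u

        best = max(priors, key=score)
        visits[best] += 1
        value_sum[best] += values[best]
    return visits

# Made-up numbers: the best move gets a tiny prior from the policy head.
priors = {"quiet move A": 0.600, "quiet move B": 0.399, "Re1+": 0.001}
values = {"quiet move A": 0.55, "quiet move B": 0.50, "Re1+": 0.90}
print(run_puct(priors, values, simulations=800))
```

With these numbers the low-prior move receives no visits in 800 simulations even though its value is the highest, so the choice of backup operator never comes into play.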
