Laskos wrote: ↑Fri May 03, 2019 2:10 pm
Sure, "quiet", or having multiple possible almost equivalent moves according to, say, Stockfish. Such a position still might have a unique best move according to perfect chess. Still, as (positional) strength goes up, the similarity will increase, although not all the moves are unique according to perfect chess (tablebases).
Say, 60% of positions have a unique best move according to TBs, and 40% have 2 equivalent moves according to TBs.
Two different perfect players, each picking at random among the equivalent moves, will have
60 + 40/2 = 80% similarity on average.
The similarity with a perfect player of a non-perfect player, say one solving 70% of the unique-move positions and playing one of the two equivalent moves in 80% of the two-move positions, is
0.7*60 + 0.8*40/2 = 58% on average.
So, as an engine gains (positional) strength, its similarity with a perfect player increases, but never comes anywhere close to 100%.
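A minimal sketch of this expected-similarity arithmetic, using the illustrative numbers above:

[code]
# Toy model from above: 60% of positions have a unique best move,
# 40% have two TB-equivalent moves, chosen at random when equivalent.
UNIQUE, TWO_EQ = 0.60, 0.40

# Two perfect players always agree on unique moves, coin-flip on the rest.
print(UNIQUE * 1.0 + TWO_EQ * 0.5)          # 0.80

# A player finding the unique move 70% of the time and one of the two
# equivalent moves 80% of the time, matched against a perfect player:
print(0.70 * UNIQUE + 0.80 * TWO_EQ * 0.5)  # 0.58
[/code]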
What I see with SF_dev is that its similarity to Lc0 seems to climb with time control, as though Lc0 were some guru pretending to be a perfect player. I am not sure what that means, or what Lc0 is up to with its high similarity across the runs.
Observe also from the data that SF_dev distances itself from the other "normal" engines when given much more time per position.
Uri Blass wrote: ↑Fri May 03, 2019 4:20 pm
I believe that many chess positions (and maybe most chess positions that you get in games) have more than 2 equivalent best moves in the sense of the theoretical result, so maybe an engine that plays a random perfect move is not going to have even 50% similarity with Stockfish or Lc0.
Playing random best moves is not a good strategy, because your drawing moves may allow even weak players to get a draw easily.
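In the same toy model as above: if a fraction f_k of positions has k equally good moves, two independent random perfect players agree on sum(f_k / k) of the positions, which quickly drops below 50% once most positions have 3-4 equivalent moves. A quick sketch, with the fractions made up purely for illustration:

[code]
# Two random perfect players agree on a k-equivalent-move position with
# probability 1/k; f_k is the fraction of positions with k such moves.
fractions = {1: 0.10, 2: 0.30, 3: 0.30, 4: 0.30}  # made-up for illustration

print(sum(f / k for k, f in fractions.items()))   # 0.425 -- below 50%
[/code]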
Guenther wrote: ↑Fri May 03, 2019 4:52 pm
I agree with Uri. I think it would be better to improve the positions instead of trying to interpret the lc0 stats on the 10(?) year old original simtest positions.
There should be bazillions of late opening and middlegame positions with at least 2-5 equally good moves.
Laskos wrote: ↑Fri May 03, 2019 5:39 pm
"Equally good moves" according to whom? Several months ago I did the following: from 2moves_v1.epd I selected the positions where SF10, at some 1-2s/position, considered the eval to be within [-0.05, 0.05]. Then I analyzed the selection with a good Lc0 net for a similar time per position. It resulted in a new Gaussian of evals (for SF it was a narrow rectangle around 0.00), with a standard deviation of about 40cp (0.40) IIRC. Almost as bad as the Gaussian of the untrimmed 2moves_v1.epd itself.
Ferdy had shown that you and Uri are surely partly right, with 20-40% (at most) of positions unfit to qualify as having multiple "equally good moves", but this is controllable noise. It shows up mostly as unrelated top engines matching on some 35-45% of moves instead of a probably more desirable 20% or so. The sensitivity of the Sim tester is affected only mildly: the issue slightly shifts the numbers, but the meaning of the outcomes and the conclusions should hold up well.
If someone comes up with a new Sim having some 10,000 better positions, I would be glad to check the speculations I wrote above.
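For illustration, a bare-bones python-chess sketch of the kind of filtering and re-scoring described above; the engine paths, file name and time limit are placeholders, not the exact setup used:

[code]
import statistics

import chess
import chess.engine

def evals(engine_path, epd_file, seconds):
    """Centipawn eval of each EPD position, from the side to move's view."""
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    out = []
    with open(epd_file) as f:
        for line in f:
            board, _ = chess.Board.from_epd(line)
            info = engine.analyse(board, chess.engine.Limit(time=seconds))
            out.append(info["score"].pov(board.turn).score(mate_score=10000))
    engine.quit()
    return out

# Keep only the positions Stockfish calls dead equal (within +/- 5cp)...
sf = evals("./stockfish", "2moves_v1.epd", 1.5)
kept = [i for i, cp in enumerate(sf) if abs(cp) <= 5]

# ...then see how wide the Lc0 eval distribution still is on exactly those.
lc0 = evals("./lc0", "2moves_v1.epd", 1.5)
print(statistics.pstdev(lc0[i] for i in kept))
[/code]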
Guenther wrote: ↑Fri May 03, 2019 6:09 pm
I want to add that it is still possible that you get similar results somehow regarding LC0 nets. (It is just a bit disappointing that the positions in the test are not as quiet as expected nowadays, with so many of them having only one or two best moves.)
OTOH, someone on the LC0 Discord chat said it could also be that the NNs, which are built by a kind of generic process, might always converge to similar results if enough training steps are done, and that at 100ms Lc0 practically plays the move the policy head favours.
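A speculative way to check that last claim, treating lc0 as a plain UCI engine and assuming (which is not established here) that a 1-node search returns essentially the policy-favoured move:

[code]
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("./lc0")
agree = total = 0
with open("positions.epd") as f:
    for line in f:
        board, _ = chess.Board.from_epd(line)
        # 1 node: roughly the policy head's favourite (an assumption)
        policy_move = engine.play(board, chess.engine.Limit(nodes=1)).move
        timed_move = engine.play(board, chess.engine.Limit(time=0.1)).move
        agree += policy_move == timed_move
        total += 1
engine.quit()
print(agree / total)  # close to 1.0 would support the Discord claim
[/code]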
I think it's still not that bad: as Ferdy had shown, some 60% of the positions have at least 2 moves close in value by today's SF_dev standards (and another 20% are debatable). The rest can be treated as a baseline consisting mostly of very easy unique solutions found by most engines. The sensitivity of Sim isn't seriously harmed. But if one comes up with a better set of positions, I would be glad to redo the experiment.
Yes, as I don't know much about how the NNs are built, whether there are random drifts in the weights and in the policy and value head landscapes, or whether the optima landscape is a simple one, I don't know if it is normal that all the runs converge to similar move choices. Do the people on Discord agree that all the runs should converge to very similar nets as far as move choices go? A late t35 10b net also converged to the same move choices.
100ms/position was only the first run; afterwards I used 300ms/position or 2000-2500 nodes/position, and the similarity only _increased_ by 1-2%. 2000-2500 nodes per position are enough to exercise both the policy and value heads, and it is a higher number of nodes than during the training games. I will probably also check the similarity of the late t30 and t40 nets at a longer 1000ms/position, or some 15-20,000 nodes per position.
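The similarity number itself is just the fraction of positions on which two engines choose the same move; a minimal sketch of such a measurement at a fixed node budget (engine paths and the file name are placeholders):

[code]
import chess
import chess.engine

def similarity(path_a, path_b, epd_file, nodes=2500):
    """Fraction of EPD positions on which two UCI engines pick the same move."""
    a = chess.engine.SimpleEngine.popen_uci(path_a)
    b = chess.engine.SimpleEngine.popen_uci(path_b)
    limit = chess.engine.Limit(nodes=nodes)
    same = total = 0
    with open(epd_file) as f:
        for line in f:
            board, _ = chess.Board.from_epd(line)
            same += a.play(board, limit).move == b.play(board, limit).move
            total += 1
    a.quit()
    b.quit()
    return same / total

print(similarity("./lc0", "./stockfish", "positions.epd"))
[/code]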