Houston: We have lift off ...

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Houston: We have lift off ...

Post by jp »

chrisw wrote: Tue Nov 20, 2018 2:20 am You can’t just pick one net out of a series and claim the series is “best”, you need to show stability in the series in general. I don’t think that’s been shown, for any net series actually. Occasional headline glitches don’t mean anything, if they can’t be held, they’re not there.
But do we care which series is "best"? Probably all users care about is which NN ID is "best", because they then use that one. And it could be just luck that that ID is best & again users won't care.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Houston: We have lift off ...

Post by chrisw »

jp wrote: Tue Nov 20, 2018 6:35 am
chrisw wrote: Tue Nov 20, 2018 2:20 am You can’t just pick one net out of a series and claim the series is “best”, you need to show stability in the series in general. I don’t think that’s been shown, for any net series actually. Occasional headline glitches don’t mean anything, if they can’t be held, they’re not there.
But do we care which series is "best"? Probably all users care about is which NN ID is "best", because they then use that one. And it could be just luck that that ID is best & again users won't care.
Maybe so, but I’m not a “user”, and I’m treating this thread by its title: whether or not there is “liftoff”, and analysing that, whether the Elo strengths exist and/or are stable. Isn’t that what the development is about? I think I demonstrated that nobody really knows, because the testing and the conclusions drawn are shot through with problems that are being overlooked. But by all means use what you think is best from the results you can see.
jp
Posts: 1470
Joined: Mon Apr 23, 2018 7:54 am

Re: Houston: We have lift off ...

Post by jp »

chrisw wrote: Tue Nov 20, 2018 10:40 am Maybe so, but I’m not a “user”, and I’m treating this thread by its title: whether or not there is “liftoff”, and analysing that, whether the Elo strengths exist and/or are stable. Isn’t that what the development is about? I think I demonstrated that nobody really knows, because the testing and the conclusions drawn are shot through with problems that are being overlooked. But by all means use what you think is best from the results you can see.
I think everyone in this thread already agrees there's no liftoff, so you may be a bit late with your conclusions. Just look at Kai's posts. No one takes the self-play testing too seriously, but that's now not lifting off either. Have you seen the non-self-play estimates? No liftoff there.
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: Houston: We have lift off ...

Post by chrisw »

jp wrote: Tue Nov 20, 2018 11:44 am
chrisw wrote: Tue Nov 20, 2018 10:40 am Maybe so, but I’m not a “user”, and I’m treating this thread by its title: whether or not there is “liftoff”, and analysing that, whether the Elo strengths exist and/or are stable. Isn’t that what the development is about? I think I demonstrated that nobody really knows, because the testing and the conclusions drawn are shot through with problems that are being overlooked. But by all means use what you think is best from the results you can see.
I think everyone in this thread already agrees there's no liftoff, so you may be a bit late with your conclusions. Just look at Kai's posts. No one takes the self-play testing too seriously, but that's now not lifting off either. Have you seen the non-self-play estimates? No liftoff there.
Thanks.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Houston: We have lift off ...

Post by Laskos »

Leto wrote: Mon Nov 19, 2018 7:22 pm
Laskos wrote: Mon Nov 19, 2018 6:39 am
Leto wrote: Mon Nov 19, 2018 1:18 am
Laskos wrote: Fri Nov 16, 2018 3:56 pm I don't know why you are so enthusiastic. Runs 20xxx and 30xxx are pretty pathetic, especially considering how many resources they have eaten up. Some folks there have overdone something. Just a quick check with the latest engine (rc4) and one of the latest nets:

TC: 60'' + 1''

Rank Name                          Elo     +/-   Games   Score   Draws
     SF8                           120      68      60   66.7%   43.3%
   
   1 lc0_v19_11261                   0     111      20   50.0%   50.0%
   2 lc0_v19_31214                -147     128      20   30.0%   40.0%
   3 lc0_v19_9155                 -241     127      20   20.0%   40.0%
Finished match
So, run 30xxx is still ~150 Elo points below run 10xxx, and barely ~100 Elo points above the 6x64 net 9155 (run 9xxx). Given that the games with the 6x64 net were 10-12 times faster, and given the hardware resources allocated, the whole of run 9xxx could have been completed in less than a day. Lame runs, these newest ones. But I still hope that they will improve some 200 real Elo points over the current level, although this is not guaranteed at all.
I don't think Test30 is this close to Test10 in strength, I still think it's several hundred elo weaker. What's 60" + 1", is that game in 1 minute with an extra second per move?
Yes, 1m + 1s. It is close, Test30 is about 100 Elo points weaker than Test10.
I highly doubt that. I have it at just slightly stronger than Stockfish 5 1CPU at 1 minute blitz and my tests run 200 games each. The best I've gotten from a test30 net is a 59% score which would put it about 60 elo stronger than SF5 1CPU on my machine (Ryzen 5 2600 with Nvidia 1080). If I had a 2080ti maybe it would perform about 100 elo higher on my machine than SF5 but that would still not put it anywhere near the best Test10 net 11250 which is about as strong as Stockfish 9.

See this chart, it has 11250 between 200 and 300 elo stronger than the current test30 networks: https://docs.google.com/spreadsheets/d/ ... =952456918
I don't know what they are measuring, I measure the strength against SF8. With my new RTX2070, I have re-done some tests at short time control, and got the following succession:

Score of SF8 vs lc0_v19_31255: 24 - 22 - 54 [0.510] 100
Elo difference: 6.95 +/- 46.39
Finished match

Score of SF8 vs lc0_v19_31366: 16 - 25 - 59 [0.455] 100
Elo difference: -31.35 +/- 43.69
Finished match

Score of SF8 vs lc0_v19_31405: 21 - 33 - 46 [0.440] 100
Elo difference: -41.89 +/- 50.38
Finished match

Score of SF8 vs lc0_v19_31440: 16 - 35 - 49 [0.405] 100
Elo difference: -66.82 +/- 48.92
Finished match

===========================
Score of SF8 vs lc0_v19_11261: 10 - 43 - 47 [0.335] 100
Elo difference: -119.11 +/- 49.97
Finished match
===========================

In almost 6 million games, from ID 31255 to ID 31440, there is some progress, of about 70 real Elo points, while in their self-play ratings, they show 1700 self-Elo improvement. And the latest 30xxx nets are only about 50 real Elo points weaker than the best nets of 10xxx run (overall, still the best run). Again, I don't know what those charts are, these are my results against SF8. I might decrease the time control to have more games (200 instead of 100 per match), as I see the nets scaling almost identically against SF8. But I will check rarely, as the progress now in real-Elo is still extremely slow.
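For readers who want to check the figures quoted above, here is a minimal sketch of how an Elo difference and an approximate 95% margin can be derived from a win-loss-draw line such as "24 - 22 - 54". It assumes the standard logistic Elo formula and a normal approximation on the per-game score; the function names `elo_diff` and `match_elo` are made up for illustration, and the margin it prints may differ slightly from what the match tool reports.

```python
import math


def elo_diff(score: float) -> float:
    """Logistic Elo difference implied by a score fraction strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins: int, losses: int, draws: int, z: float = 1.959964):
    """Return (score, elo, margin) from the first engine's point of view.

    Assumption: the margin is half the width of the Elo interval obtained by
    shifting the mean score by z standard errors (z = 1.96 for about 95%).
    This sketch does not guard against scores so lopsided that the
    shifted score leaves the open interval (0, 1).
    """
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    mu = w + 0.5 * d  # mean score per game
    # Variance of a single game result (win = 1, draw = 0.5, loss = 0)
    var = w * (1.0 - mu) ** 2 + l * (0.0 - mu) ** 2 + d * (0.5 - mu) ** 2
    se = math.sqrt(var / n)  # standard error of the mean score
    lo, hi = mu - z * se, mu + z * se
    margin = (elo_diff(hi) - elo_diff(lo)) / 2.0
    return mu, elo_diff(mu), margin


if __name__ == "__main__":
    # Win-loss-draw counts copied from the match lines above (SF8's perspective)
    for name, (w, l, d) in {
        "lc0_v19_31255": (24, 22, 54),
        "lc0_v19_31440": (16, 35, 49),
        "lc0_v19_11261": (10, 43, 47),
    }.items():
        score, elo, margin = match_elo(w, l, d)
        print(f"SF8 vs {name}: score {score:.3f}, Elo {elo:+.2f} +/- {margin:.2f}")
```

With margins in the 45 to 50 Elo range at 100 games per match, the intervals for neighbouring 30xxx nets overlap heavily, which is why only the trend over several nets, not any single pairing, says much.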
Ozymandias
Posts: 1532
Joined: Sun Oct 25, 2009 2:30 am

Re: Houston: We have lift off ...

Post by Ozymandias »

Laskos wrote: Thu Nov 22, 2018 8:27 am I don't know what those charts are
Someone has proposed that they reflect Leela's ego; he doesn't seem to be wrong.