best way to determine elos of a group


Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: best way to determine elos of a group

Post by Michel »

Laskos wrote: Sun Aug 04, 2019 10:12 am To add:

The effort in the first, "naive" case is 5000 games. In the second, uniformly spaced case --- 8000 games. In the third, k^2-spaced case --- 8000 games again.
With hundreds of nets the effort will be only slightly larger, say 10%, than in the "naive" case, and the extra effort goes to "firming up" the anchors. The more nets there are, the sparser the anchors become (I chose the square-root sparseness somewhat arbitrarily).
It seems you have come up with a pretty nice scheme!

I still like my own idea of a self-organizing tree (rather than shaping the tree beforehand as you are doing), but I have not actually checked that it would work satisfactorily. The advantage of shaping the tree beforehand is that you can allocate an optimal number of games to the edges in advance (more to the stem, less to the side branches).
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: best way to determine elos of a group

Post by Daniel Shawul »

Thanks Kai. I am using your method now where the anchor IDs are (n^2-1) since my IDs start from 0.
I am also playing 4x more games when switching anchors. You can see the initial plot here with the standard errors.

http://scorpiozero.ddns.net/

The anchors ID-0 and ID-3 have smaller standard errors.
But it looks like I may have a bug, as performance dipped heavily on the 4th net... argh!
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: best way to determine elos of a group

Post by Laskos »

Michel wrote: Sun Aug 04, 2019 11:11 am
Laskos wrote: Sun Aug 04, 2019 10:12 am To add:

The effort in the first, "naive" case is 5000 games. In the second, uniformly spaced case --- 8000 games. In the third, k^2-spaced case --- 8000 games again.
With hundreds of nets the effort will be only slightly larger, say 10%, than in the "naive" case, and the extra effort goes to "firming up" the anchors. The more nets there are, the sparser the anchors become (I chose the square-root sparseness somewhat arbitrarily).
It seems you have come up with a pretty nice scheme!

I still like my own idea of a self-organizing tree (rather than shaping the tree beforehand as you are doing), but I have not actually checked that it would work satisfactorily. The advantage of shaping the tree beforehand is that you can allocate an optimal number of games to the edges in advance (more to the stem, less to the side branches).
I think I can propose a less ad-hoc methodology. One important thing is the shape of the derivative of the logistic. We want anchors placed in the run so that the current net is never far Elo-wise from the last anchor, i.e. the Elo difference is such that the derivative of the logistic at that difference is still high. Here is the shape of the derivative of the usual logistic:
[Attachment: D_logistic.jpg --- plot of the derivative of the usual logistic]
It is pretty obvious that at a 200 Elo point difference the derivative has already deviated enough to warrant a new anchor. A new anchor added to N previous anchors adds a factor of 1/N to the new variance. The proposal is this: the first anchor is the true anchor ID-0. Play successive nets against this anchor for an established number of games, say 200. As soon as you find an Elo difference above 200, set this net as the new anchor and play 4 or 8 times more games, say 1600 ("firming up"). Do the same with successive nets against this new anchor, and so on. An ad-hoc rule for diversity of opponents: estimate the length of the whole run, planned to contain very roughly N IDs. If after 2 * sqrt(N) nets no net has come 200 Elo above (or below, in fact) the last anchor, set the new anchor at this 2 * sqrt(N) distance from the last anchor. A small sketch of the procedure follows below.

A silly subtlety here, looking at the plot, is whether for long runs (say 1000 IDs) the threshold should be 200 or 150 Elo points, since with long runs and many anchors the cost of adding a new anchor diminishes. Full runs from a random player to some good strength should span 3000-4000 absolute Elo relative to the fixed 0 of the random net. So, about 20 anchors according to the threshold, and for a run of some 1000 IDs maybe 10-15 more anchors due to the ad-hoc "diversity criterion" (2 * sqrt(N) spacing). This way no large Elo differences will occur, and that without inflating the number of anchor nets. No prior knowledge of the run's behavior is needed aside from a rough estimate of its length in IDs. The ad-hoc spacing might go to higher values if the end of the run is really protracted.
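
To make the procedure concrete, here is a minimal sketch in Python. The get_elo_vs helper is hypothetical, standing in for whatever match runner one uses, and the default parameters are simply the numbers from the description above (200 probe games, 1600 firming-up games, 200 Elo threshold, 2 * sqrt(N) spacing).

Code: Select all

import math

def choose_anchors(num_nets, get_elo_vs,
                   probe_games=200, firm_up_games=1600, elo_threshold=200):
    """Walk through a run of nets (IDs 0..num_nets-1) and pick the anchors.

    get_elo_vs(net_id, anchor_id, games) is a hypothetical helper that plays
    a match and returns the measured Elo of net_id relative to anchor_id.
    """
    max_gap = int(2 * math.sqrt(num_nets))   # ad-hoc "diversity" spacing
    anchors = [0]                            # ID-0 is the true, fixed anchor

    for net in range(1, num_nets):
        anchor = anchors[-1]
        elo = get_elo_vs(net, anchor, probe_games)     # 200-game probe

        # New anchor when the Elo gap passes the threshold, or when 2*sqrt(N)
        # IDs have gone by without one (the diversity rule).
        if abs(elo) > elo_threshold or (net - anchor) >= max_gap:
            get_elo_vs(net, anchor, firm_up_games)     # "firm up" with more games
            anchors.append(net)

    return anchors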

Here are the Ordo results for the best-behaved scheme so far (the specific k^2 spacing) and for this more general proposal:

k^2 spacing of anchors:

Code: Select all

   # PLAYER    : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 SF24      : 1459.15  62.99     141.5     200    70.8      70    
   2 SF25      : 1446.17  46.11    1107.0    1600    69.2      84    
   3 SF23      : 1422.71  62.29     132.5     200    66.3      55    
   4 SF22      : 1418.82  62.31     131.5     200    65.8      52    
   5 SF20      : 1416.88  61.29     131.0     200    65.5      91    
   6 SF21      : 1373.93  61.94     119.5     200    59.8      72    
   7 SF18      : 1355.94  60.41     114.5     200    57.3      83    
   8 SF17      : 1327.67  60.51     106.5     200    53.3      64    
   9 SF19      : 1317.17  61.23     103.5     200    51.8      72    
  10 SF16      : 1304.94  43.59    2410.5    4800    50.2      90    
  11 SF15      : 1271.38  63.79     156.0     200    78.0      88    
  12 SF13      : 1230.81  61.51     147.5     200    73.8      63    
  13 SF14      : 1219.71  60.87     145.0     200    72.5      96    
  14 SF11      : 1162.42  59.04     131.0     200    65.5      52    
  15 SF12      : 1160.49  59.42     130.5     200    65.3     100    
  16 SF10      : 1066.20  56.91     104.5     200    52.3      78    
  17 SF9       : 1050.47  39.30    2149.5    4400    48.9      97    
  18 SF8       : 986.44  69.00     176.0     200    88.0      99    
  19 SF7       : 894.80  60.91     162.5     200    81.3      90    
  20 SF6       : 849.82  57.11     154.0     200    77.0     100    
  21 SF5       : 707.91  52.96     119.5     200    59.8     100    
  22 SF4       : 638.91  30.14    1770.5    4000    44.3     100    
  23 SF3       : 536.33  58.23     167.5     200    83.8     100    
  24 SF2       : 400.24  48.72     140.5     200    70.3     100    
  25 SF1       : 250.23  18.42    1539.5    3600    42.8     100    
  26 SF0       :   0.00   ----     308.0    1600    19.3     ---            


The more general proposal. The anchors are determined after the usual 200 games per net and turn out to be 1, 3, 6, 9, 14, 23 (200 Elo points threshold):

Code: Select all

   # PLAYER    : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 SF24      : 1432.17  62.81     104.5     200    52.3      64    
   2 SF25      : 1421.72  62.36     101.5     200    50.8      60    
   3 SF23      : 1416.50  45.58    1393.0    2000    69.7      60    
   4 SF22      : 1410.19  63.55     148.5     200    74.3      68    
   5 SF20      : 1394.61  63.79     145.0     200    72.5      76    
   6 SF21      : 1371.30  61.40     139.5     200    69.8      99    
   7 SF18      : 1296.51  59.89     120.0     200    60.0      55    
   8 SF19      : 1292.89  61.64     119.0     200    59.5      52    
   9 SF17      : 1291.09  60.04     118.5     200    59.3      55    
  10 SF16      : 1287.49  59.08     117.5     200    58.8      68    
  11 SF15      : 1273.23  59.95     113.5     200    56.8      99    
  12 SF14      : 1225.97  41.86    2184.0    4800    45.5      64    
  13 SF13      : 1216.46  61.14     148.5     200    74.3      95    
  14 SF12      : 1161.37  59.39     135.5     200    67.8      79    
  15 SF11      : 1136.12  58.26     129.0     200    64.5      99    
  16 SF10      : 1065.39  56.88     109.5     200    54.8      94    
  17 SF9       : 1032.24  37.54    1891.5    4000    47.3      97    
  18 SF8       : 985.99  56.71     142.0     200    71.0      97    
  19 SF7       : 926.56  55.50     127.0     200    63.5     100    
  20 SF6       : 830.23  33.86    1845.5    3600    51.3     100    
  21 SF5       : 734.77  55.26     148.5     200    74.3      99    
  22 SF4       : 658.25  51.61     130.0     200    65.0     100    
  23 SF3       : 550.55  27.36    1735.0    3600    48.2     100    
  24 SF2       : 420.11  50.94     143.0     200    71.5     100    
  25 SF1       : 260.10  18.90    1617.5    3400    47.6     100    
  26 SF0       :   0.00   ----     293.0    1600    18.3     --- 
We see that this more general case behaves as well as the best previous k^2 ad-hoc procedure (compare the errors of ID25 in the first case and ID23 in the general case). Some discussion remains about whether the threshold should be 200 or 150 Elo points for long runs, and whether the ad-hoc diversity spacing should be 2, 3, or 4 times sqrt(N) if the runs are protracted.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: best way to determine elos of a group

Post by Laskos »

Daniel Shawul wrote: Sun Aug 04, 2019 4:14 pm Thanks Kai. I am using your method now where the anchor IDs are (n^2-1) since my IDs start from 0.
I am also playing 4x more games when switching anchors. You can see the initial plot here with the standard errors.

http://scorpiozero.ddns.net/

The anchors ID-0 and ID-3 have smaller standard errors.
But it looks like I may have a bug, as performance dipped heavily on the 4th net... argh!
Good. That dip is unpleasant, but with the k^2 procedure it will be treated as just a detail. In the more general procedure I just proposed it would have done much more harm (it would have been taken as an anchor :( ).
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: best way to determine elos of a group

Post by Daniel Shawul »

Things are going pretty smoothly now after adding "gating" to the training.
Networks that fail to equal or beat the previous version are not used for generating new games.
In just 18k games it gained 900 Elo gradually and is surely learning piece values. It currently knows to capture when it has the chance, but it hasn't yet learned not to place its pieces on attacked squares or leave them hanging.

Plot is shown here.

I have temporarily gone back to training a new net every 512 games, with a sliding window of 15 nets' data, training for 8 minibatches (4096 positions each).
I am going to let this run continue for a while to see where it ends up.

Daniel
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: best way to determine elos of a group

Post by Michel »

Daniel Shawul wrote: Tue Aug 06, 2019 12:44 am Things are going pretty smoothly now after adding "gating" to the training.
Networks that fail to equal or beat the previous version are not used for generating new games.
In just 18k games it gained 900 Elo gradually and is surely learning piece values. It currently knows to capture when it has the chance, but it hasn't yet learned not to place its pieces on attacked squares or leave them hanging.

Plot is shown here.

I have temporarily gone back to training a new net every 512 games, with a sliding window of 15 nets' data, training for 8 minibatches (4096 positions each).
I am going to let this run continue for a while to see where it ends up.

Daniel
You have to be careful, because gating may cause bias in the Elo measurements. For example, assume that all networks are in fact equally strong and that in a linear tournament you keep only those that score >50% against the previous one (which will happen about 50% of the time). You would measure a steady "progress" in Elo when in fact there is none.

I do not know what your current procedure is, but ideally the games used for gating should be different from the games used for measuring.

Of course this is if you want an unbiased estimate.
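
A quick Monte Carlo sketch of this effect (an illustration only: it assumes 200-game matches between truly equal players, coin-flip games with draws ignored, and a naive chained Elo estimate rather than a BayesElo/Ordo fit):

Code: Select all

import math
import random

def elo_from_score(score):
    # Logistic Elo implied by a match score in (0, 1)
    return -400.0 * math.log10(1.0 / score - 1.0)

def simulate_gated_run(num_candidates=1000, games=200, seed=1):
    rng = random.Random(seed)
    apparent_elo = 0.0
    accepted = 0
    for _ in range(num_candidates):
        # every net is equally strong: each game is a 50/50 coin flip
        score = sum(rng.random() < 0.5 for _ in range(games)) / games
        if score > 0.5:                      # gating: keep only the "winners"
            apparent_elo += elo_from_score(score)
            accepted += 1
    return accepted, apparent_elo

accepted, elo = simulate_gated_run()
print(f"accepted {accepted} of 1000 equal nets, apparent gain {elo:.0f} Elo (true gain 0)")
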
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: best way to determine elos of a group

Post by Laskos »

Daniel Shawul wrote: Tue Aug 06, 2019 12:44 am Things are going pretty smoothly now after adding "gating" to the training.
Networks that fail to equal or beat the previous version are not used for generating new games.
In just 18k games it gained 900 Elo gradually and is surely learning piece values. It currently knows to capture when it has the chance, but it hasn't yet learned not to place its pieces on attacked squares or leave them hanging.

Plot is shown here.

I have temporarily gone back to training a new net every 512 games, with a sliding window of 15 nets' data, training for 8 minibatches (4096 positions each).
I am going to let this run continue for a while to see where it ends up.

Daniel
Ah, back to the "Ego" ratings (I think you called them that) of the early Lc0 runs? Take seriously what Michel says. Gating will cause bias, and in your case, if I am not wrong, a linear inflation in Elo with the number of nets, which is very bad, hard to estimate, and has a large estimation error itself. Do you no longer use sparse "anchors", just playing each net versus the previous net? Then I don't understand the Elo margins in the plot, or did you go back to BayesElo's wrong defaults for Elo margins? If you play net versus previous net, your Elo margins should grow as N^(1/2) with the ID number N, and I don't see that in the plot.

Otherwise, the progression looks nice. But the Elo values there are not any sort of absolute Elo. If you play 200-game matches, you might end up after 1000 nets with an Elo progress of, say, 6000-7000 Elo, of which the linear (order N) inflation due to gating might be 3000 +/- 1500 Elo points, and the correct statistical error margins, growing as N^(1/2), another 1000 Elo points. If you want absolute Elo, do as Michel said: separate matches for it, and only for the pre-determined "anchor" IDs. With sparse k^2 "anchors" and 4x more games per "anchor", the effort for 1000 IDs will be in the 10%-15% range of your gating effort, but you will get an unbiased absolute Elo estimate with controllable error margins growing as a fraction of N^(1/4), which is reasonable for, say, 1000 IDs total.
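
To illustrate the N^(1/2) versus N^(1/4) growth, a back-of-the-envelope sketch: independent chained links add their errors in quadrature, and the 60 Elo per 200-game match used below is an assumed, typical 1-sigma value, not a measured number from this thread.

Code: Select all

import math

ERR_200 = 60.0             # assumed 1-sigma Elo error of a single 200-game match
ERR_800 = ERR_200 / 2.0    # 4x the games -> half the error (1/sqrt(4))

def chained_error_linear(n):
    """Net vs previous net for every ID: n independent 200-game links to ID-0."""
    return ERR_200 * math.sqrt(n)

def chained_error_k2_anchors(n):
    """Anchors at 1, 4, 9, ..., each firmed up with 4x games: about sqrt(n)
    links from ID-0 to the nearest anchor, plus one ordinary 200-game link
    from that anchor to net n."""
    links = int(math.sqrt(n))
    return math.sqrt(links * ERR_800**2 + ERR_200**2)

for n in (100, 400, 1000):
    print(f"N = {n:5d}:  linear chain ~ {chained_error_linear(n):6.0f} Elo,  "
          f"k^2 anchors ~ {chained_error_k2_anchors(n):5.0f} Elo")
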
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: best way to determine elos of a group

Post by Daniel Shawul »

Michel wrote: Tue Aug 06, 2019 5:35 am
Daniel Shawul wrote: Tue Aug 06, 2019 12:44 am Things are going pretty smoothly now after adding "gating" to the training.
Networks that fail to equal or beat the previous version are not used for generating new games.
In just 18k games it gained 900 Elo gradually and is surely learning piece values. It currently knows to capture when it has the chance, but it hasn't yet learned not to place its pieces on attacked squares or leave them hanging.

Plot is shown here.

I have temporarily gone back to training a new net every 512 games, with a sliding window of 15 nets' data, training for 8 minibatches (4096 positions each).
I am going to let this run continue for a while to see where it ends up.

Daniel
You have to be careful, because gating may cause bias in the Elo measurements. For example, assume that all networks are in fact equally strong and that in a linear tournament you keep only those that score >50% against the previous one (which will happen about 50% of the time). You would measure a steady "progress" in Elo when in fact there is none.

I do not know what your current procedure is, but ideally the games used for gating should be different from the games used for measuring.

Of course this is if you want an unbiased estimate.
A0 used a margin of 55%, but I used 50% just as a precaution. Note that the plot shows both the failed and the passed nets, and a network is always trained starting from the previous net even if that one failed. So gating is used only to control the production of good-quality games. I don't think Elo will simply add up in this situation. I think what actually adds up the Elo is when every network is used as an anchor. In your example of equal-strength networks, the way I do gating will not add up Elo if only the first net is used as the anchor.

Due to a bug I do not keep the best-performing net so far (alpha*) around for long. I keep it just for one training step, so mine is more like the best of the 2 previous nets. I am not sure if I should go full-blown gating due to the issues you have raised.

Question: given enough resources and time, is it better to train a net as frequently as possible? E.g. I am producing nets every 512 games instead of every 32k games, and it seemed to speed up training a lot. Of course I am almost doubling the training time by training and evaluating nets at this high frequency, but I am pretty sure that the 32k-game training will not produce a 1000-Elo-better net in just one step, which the 512-game training did.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: best way to determine elos of a group

Post by Daniel Shawul »

Laskos wrote: Tue Aug 06, 2019 6:54 am Ah, back to the "Ego" ratings (I think you called them that) of the early Lc0 runs? Take seriously what Michel says. Gating will cause bias, and in your case, if I am not wrong, a linear inflation in Elo with the number of nets, which is very bad, hard to estimate, and has a large estimation error itself. Do you no longer use sparse "anchors", just playing each net versus the previous net?
I use sparse anchors; every (n^2 - 1)th net is an anchor. So the Elos don't add up like Lc0's. See my previous reply as to why I think that is the case.
Then I don't understand the Elo margins in the plot, or did you go back to BayesElo's wrong defaults for Elo margins? If you play net versus previous net, your Elo margins should grow as N^(1/2) with the ID number N, and I don't see that in the plot.
I am actually now using BayesElo with the covariance option, which gives higher deltas when all nets are chained.
For a really chained test run, look at the first test run (select "Test run nets-1" from the combo box on the left side) and you will see how big the errors were!
Otherwise, the progression looks nice. But the Elo values there are not any sort of absolute Elo. If you play 200-game matches, you might end up after 1000 nets with an Elo progress of, say, 6000-7000 Elo, of which the linear (order N) inflation due to gating might be 3000 +/- 1500 Elo points, and the correct statistical error margins, growing as N^(1/2), another 1000 Elo points. If you want absolute Elo, do as Michel said: separate matches for it, and only for the pre-determined "anchor" IDs. With sparse k^2 "anchors" and 4x more games per "anchor", the effort for 1000 IDs will be in the 10%-15% range of your gating effort, but you will get an unbiased absolute Elo estimate with controllable error margins growing as a fraction of N^(1/4), which is reasonable for, say, 1000 IDs total.
See my previous reply. Of course we can't totally remove Elo inflation unless we do full round-robin tournaments. I am producing nets more frequently for now, which means more anchors, but they are still sparse (0, 3, 9, 16, 25, 36, etc.) and the spacing will reach 150 by the 1000th net.
I think training a new net every 512 games results in faster training, but Elo might swing a lot due to decreased stability, as has started to show now around the 50k-games mark. If you look at the match games, they have become much nicer now: a piece will be captured if left hanging.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: best way to determine elos of a group

Post by Michel »

Of course we can't totally remove Elo inflation unless we do full round-robin tournaments.
Just a tiny bit of nitpicking, since I think we basically agree.

Elo inflation is not caused by the type of tournament. Elo inflation happens if the structure of the tournament is not fixed beforehand but depends on the results during the tournament.

To come back to my example of a tournament where all players are equally strong: in a linear tournament there will be no Elo inflation (no bias, but large error bars). There will be Elo inflation, however, if every new game is played against the "strongest" (by result) player so far.
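
And a crude sketch of that last case, under similar assumptions as before (equal players, 200-game coin-flip matches, and a naive chained estimate in which each newcomer is rated relative to the current leader; no proper BayesElo/Ordo joint fit is attempted):

Code: Select all

import math
import random

def elo_from_score(score):
    # Logistic Elo implied by a match score in (0, 1)
    return -400.0 * math.log10(1.0 / score - 1.0)

def top_rating(num_players=500, games=200, seed=1):
    rng = random.Random(seed)
    ratings = [0.0]                     # the first player is fixed at 0
    for _ in range(num_players - 1):
        leader = max(ratings)           # newcomer always plays the current leader
        score = sum(rng.random() < 0.5 for _ in range(games)) / games
        score = min(max(score, 1.0 / games), 1.0 - 1.0 / games)   # guard 0/1
        ratings.append(leader + elo_from_score(score))
    return max(ratings)

print(f"top 'rating' among 500 truly equal players: {top_rating():.0f} Elo")
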

The models underlying BayesElo and Ordo do not account for this type of tournament, and I do not think it is possible unless one has a true prior for the Elo distribution (i.e. not a fictitious prior as in Bayesian statistics).
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.