I don't think it's fair that all test suites in STS get _equal_ weight. Some test suites are obviously more important than others; I just don't know how to assign proper, balanced weights. Therefore, I gave all the test suites equal weight just to get a rough idea of the strength ratings of new engines, or of the rough strength improvement of one version over another.
In cases where two engines are nearly equal in strength, there is a need to place emphasis on each engine's knowledge of certain test suites.
Do this calculation with all the results you have, always replacing the STS_scores. This way you can find the set of weights that minimizes the sum of differences between the STS_elo and the real engine_elo over all engines.
If you lay out the data properly in a table (all the STS_scores and Elos of the engines), any statistics program could calculate the weights in no time.
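If I understand the suggestion, the weight fit could be sketched like this with NumPy's least-squares solver. This is a minimal sketch, assuming ordinary least squares is the intended fit; the engine scores and Elo numbers below are invented placeholders, not real STS results.

```python
import numpy as np

# Rows = engines, columns = STS test suites.
# All numbers are made up purely to illustrate the layout.
sts_scores = np.array([
    [88.0, 75.0, 90.0],
    [82.0, 71.0, 86.0],
    [74.0, 66.0, 80.0],
    [68.0, 60.0, 72.0],
    [61.0, 52.0, 65.0],
    [55.0, 47.0, 58.0],
])
elo = np.array([3100.0, 2950.0, 2780.0, 2630.0, 2480.0, 2350.0])  # hypothetical rating-list Elos

# One column per suite weight, plus a column of ones for the intercept;
# lstsq minimises sum((X @ w - elo)**2) over the weight vector w.
X = np.column_stack([sts_scores, np.ones(len(elo))])
w, *_ = np.linalg.lstsq(X, elo, rcond=None)

sts_elo = X @ w  # the "STS Elo" predicted from the weighted suite scores
print(np.round(w, 2))
print(np.round(sts_elo))
```

The fitted weights then give each suite its relative importance, instead of the equal weighting used so far.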
Thank you! This looks like a brilliant suggestion.
I will experiment with this for a while with the set of ten engines, but I think it might be best to wait until I release 7 more test suites over the next 6 months: there are more important strategic ideas yet to be put into practice via new test suites, and the weights may vary greatly once I release the next suite, and so on. STS is still at an early stage.
Swami:
Pay attention to the overfitting problem when doing the regression!
If you have results from n test suites, you need to estimate (n+1) parameters, so you must test a number of engines much bigger than that...
Right now you have 10 engines and 8 (+1, if you allow for a nonzero intercept) parameters, which leaves you only 2 degrees of freedom at most...
It boils down to a very good descriptive model, but one with poor predictive power.
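The warning can be illustrated with synthetic numbers. Everything below is invented: with 9 engines and 9 parameters (8 suite weights plus an intercept) the regression reproduces the "training" Elos essentially exactly, yet the weights it learns by fitting noise can still mispredict an engine it has not seen.

```python
import numpy as np

rng = np.random.default_rng(0)
n_suites = 8   # 8 suite weights + 1 intercept = 9 parameters
n_train = 9    # with only 9 engines, zero residual degrees of freedom

# Invented data: the "true" Elo depends on only two suites, plus noise.
scores = rng.uniform(40.0, 95.0, size=(n_train + 1, n_suites))
elo = 20.0 * scores[:, 0] + 15.0 * scores[:, 1] + 500.0
elo = elo + rng.normal(0.0, 30.0, size=n_train + 1)

# Fit on the first 9 engines, hold the last one out.
X = np.column_stack([scores[:n_train], np.ones(n_train)])
w, *_ = np.linalg.lstsq(X, elo[:n_train], rcond=None)

in_sample_err = np.abs(X @ w - elo[:n_train]).max()   # ~0: a "perfect" description
held_out = np.append(scores[n_train], 1.0)
out_of_sample_err = abs(held_out @ w - elo[n_train])  # typically far larger
print(in_sample_err, out_of_sample_err)
```

The in-sample error is essentially zero by construction, which is exactly why it says nothing about predictive power on new engines.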
swami wrote:
I will experiment with this for a while with the set of ten engines, but I think it might be best to wait until I release 7 more test suites over the next 6 months: there are more important strategic ideas yet to be put into practice via new test suites, and the weights may vary greatly once I release the next suite, and so on. STS is still at an early stage.
Swami, I suggest you post a (long) list with all the engines in the first column, then the score of each test suite, one suite per column, and at the end the CEGT/CCRL Elo estimation, something like this:
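Perhaps in a shape like the following, which any statistics package can read directly. The engine names and all numbers here are placeholders of my own, not actual STS results:

```
engine,    STS1, STS2, STS3, ..., CCRL_elo
Engine A,    88,   75,   90, ...,     3100
Engine B,    82,   71,   86, ...,     2950
Engine C,    74,   66,   80, ...,     2780
```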
I might do it. I would also like to try some other, less classical methodologies on these data.
It's my academic area.
However, the basic problem, as I said, is having data on enough engines in order to have a good predictive model.
noctiferus wrote: I might do it. I would also like to try some other, less classical methodologies on these data.
It's my academic area.
However, the basic problem, as I said, is having data on enough engines in order to have a good predictive model.
Hi Enrico,
That sounds great! I'd like to know the total number of engines you need data for. Does 100 sound good? I'd be willing to test 150 or more if that satisfies the requirements for processing the data set.
I would suggest only engines for which a reliable Elo estimation by CCRL and CEGT exists; otherwise they are useless.