chesskobra wrote: ↑Wed Sep 18, 2024 12:19 am
How are the numbers corresponding to different moves obtained? Is it by normalizing the evaluation of the top move to 100 and adjusting the evaluations of other moves in proportion?

You can design your own test in your own way. Mine is based on human chess understanding: how much more one move does for the side to move compared to the alternatives, whether one candidate (or more than one) deserves to be called a real game changer, and how "hard" the position is, hardware- and time-wise, for the engines to find one solution or the other. The basis of such an evaluation is interactive analysis with engines and their output evals, mainly SF dev, but not only that one; where SF has blind spots I also use branches like Crystal and others, and now and then Lc0.
Those comparisons of evaluations give me the relation between the points to be earned per position and move. The calibration of the numeric scale has to consider how many positions of which kind are in each suite, and which hardware and time control are planned for which engine pool.
Ferdinand Mosca's way, when I last looked, was this one:
https://github.com/fsmosca/STS-Rating