fkarger wrote: ↑Sat Jun 14, 2025 12:19 pm
you could calculate
average values and the like which might be a nice extension.
Pure comparison of the numbers of solutions alone doesn't show the random noise that matters for such tests, Frank. Much more relevant than how many solutions are found in the one run and in the other is the question of how often the same positions have been solved in both runs, you see?
EloStatTS (Frank Schubert) compares position by position and run by run, and it does so each time a new run is added to the ones already stored, comparing all of them against each other.
That's why the error bars get lower and lower the more runs are stored and compared: with each new run, all the old ones get a new rating and ranking. But the biggest lowering of the error bar comes from high-performing runs that are close to each other, because within those the same positions are solved again and again more or less exactly; the time indices only count for positions solved by two runs in common.
That the points are converted to Elo too (of course not to be compared to Elo performances of other tests) isn't the important point to me. I like to use this tool especially because it shows which positions have been solved or not solved run by run, not only how many in sum, but how many times the same positions. That the exact time to solution enters the WDL measurement of the runs is another fine feature. If you want to know exactly how the tool works, read about it here:
https://glarean-magazin.ch/wp-content/u ... bert-1.pdf
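The pairing idea can be sketched roughly like this (a minimal illustration of the principle only, not Frank Schubert's actual implementation; the function names and the exact scoring rule are my assumptions):
Code: Select all
# Rough sketch of per-position pairwise scoring between two runs.
# NOT EloStatTS itself: the scoring rule is only an assumption made
# to illustrate the position-by-position comparison described above.

def score_position(time_a, time_b):
    """Score for run A on one position (1 = win, 0.5 = draw, 0 = loss).
    Each time is the solution time in seconds, or None if unsolved."""
    if time_a is None and time_b is None:
        return None          # unsolved by both: no pairwise information
    if time_b is None:
        return 1.0           # only A solved it
    if time_a is None:
        return 0.0           # only B solved it
    if time_a < time_b:      # both solved: the faster solution "wins"
        return 1.0
    if time_a > time_b:
        return 0.0
    return 0.5               # identical times: draw

def compare_runs(times_a, times_b):
    """Pair two runs position by position; return (points_a, matches)."""
    results = [score_position(a, b) for a, b in zip(times_a, times_b)]
    decided = [r for r in results if r is not None]
    return sum(decided), len(decided)
Under such a rule the "Matches" column would count only positions solved by at least one of the two runs, which would explain why it reads 24 below although the suite has 80 positions (19 + 18 solutions with 13 in common gives 24).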
So the error bars the tool gives when comparing two runs of the same engine with the same settings (hardware, threads, hash and TC) show much more than the mere difference in the numbers of solved positions would.
Here I let SF dev. 250602 run the second 80 positions of the 160 (nr. 81-160) twice, with 8 threads of the 16x3.5GHz CPU, 8G hash and 3'/pos., and let EloStatTS compare the two runs (R1 and R2):
Code: Select all
   Program                  Elo  +/-  Matches   Score  Av.Op.  S.Pos.   MST1    MST2  RIndex
 1 Stockfish250602-8t-R2 : 3503   72       24  50.9 %    3497  19/ 80  45.5s  148.1s    0.68
 2 Stockfish250602-8t-R1 : 3497   71       24  49.1 %    3503  18/ 80  33.3s  147.0s    0.70
MST1 : Mean solution time (solved positions only)
MST2 : Mean solution time (solved and unsolved positions)
RIndex: Score according to solution time ranking for each position
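As a reading aid, the two mean solution times can be reproduced like this (a sketch under my assumption that unsolved positions are charged the full 3' budget in MST2, which matches the numbers above):
Code: Select all
# Sketch of the MST1/MST2 definitions as I read them; it reproduces
# the table above when unsolved positions are charged the full budget.

TC = 180.0  # time budget per position in seconds (3'/pos.)

def mean_solution_times(times):
    """times: solution time in seconds per position, None if unsolved."""
    solved = [t for t in times if t is not None]
    mst1 = sum(solved) / len(solved)               # solved positions only
    mst2 = (sum(solved) + TC * (len(times) - len(solved))) / len(times)
    return mst1, mst2

# Check against R2: 19 solved positions averaging 45.5s, 61 unsolved:
# (19*45.5 + 61*180) / 80 = 148.1s, matching MST2 in the table.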
The point is, there's a difference of only 1 single solution (19 vs. 18), but look at the error bars of 72 and 71! That doesn't come from any big difference in the time indices, which only count for positions solved in both runs; it comes from the lack of positions solved twice, in both runs.
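A rough back-of-the-envelope shows why so few pairwise matches give such a wide bar (assuming, which the tool's documentation would have to confirm, that the bar is about one standard error of the score converted to Elo):
Code: Select all
# Back-of-the-envelope: Elo error bar from only 24 pairwise matches.
# Assumes the bar is roughly one standard error of the score fraction
# propagated through the Elo formula (my assumption, not from the docs).
import math

n, s = 24, 0.5                        # matches and score fraction (~50 %)
sigma_s = math.sqrt(s * (1 - s) / n)  # standard error of the score: ~0.102
# Slope of Elo = 400*log10(s/(1-s)) at s = 0.5:
slope = 400.0 / math.log(10) / (s * (1 - s))  # ~695 Elo per unit of score
print(round(slope * sigma_s))         # ~71 Elo, matching the bars above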
Again I stored the complete per-position solution files for both runs and can send them on demand per mail; I don't want to make this already big post even bigger with them, and uploading wasn't worth it to me.
I hope that makes it clearer what I mean by random noise: it's an intrinsic value connected to the positions of the suite, and of course also to the hardware-TC and the engines, but even comparing the same engine (same version and settings) against itself you get such a high error bar. Compare that to other, not much bigger suites of positions that are hard (for the hardware-TC and engines used):
Code: Select all
   Program               Elo  +/-  Matches   Score  Av.Op.  S.Pos.  MST1   MST2  RIndex
...
31 Monty-250119-6t     : 3357    8     2328  28.7 %    3516  22/128  4.8s  25.7s    0.14
32 Dragon3.3-MCTS-6t   : 3356    8     2333  28.6 %    3515  27/128  9.4s  25.6s    0.09
MST1 : Mean solution time (solved positions only)
MST2 : Mean solution time (solved and unsolved positions)
RIndex: Score according to solution time ranking for each position
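For comparison, the same back-of-the-envelope at the bottom of this list: with 2328 matches at a 28.7 % score, one standard error of the score is about 0.0094, and the slope of the Elo curve at that score is about 849 Elo per unit of score, giving roughly 8 Elo, again matching the bar shown (still under my assumption about how the bar is computed).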
The comparison isn't quite fair, because these two engines are 2 out of 32 runs in the list, which lowers the error bars through the number of runs compared; but it's the end of the list (weakest performances measured), and the two engines still perform rather differently from each other. At the top of this list, the best engines and settings with 8 threads solve 96 and 97 positions, and their error bars are 5 and 6.
I hope I have finally made my point clear enough as to why I wouldn't use this second set of 80 positions out of the 160, at least not together with quite different kinds of positions. They are so special in character that I wouldn't know which others, apart from more just like them, I could usefully combine them with. More of exactly that kind, ok, but it would still be a question of how much hardware time one wants to spend on results so special and so hard to compare to any others.
Not only would they need a very special hardware-TC of their own, depending heavily on the engines and hardware used; their character and the demands they put on the engines are of such a special kind that they simply should not be mixed up with very different (kinds of) other positions, at least not from my personal point of view on positional testing. Regards,
Peter.