You cannot have an Elo system across different platforms. If you have a rating list for M1 devices (ARM architecture), then it stays a list for M1 devices. NNUE performs terribly on ARM, for example. If it is for Android, then your rating list is only for Android. For Windows on 64-bit arch you get a separate list. To be fair, even on Windows, different instruction sets produce different numbers. NNUE engines are almost twice as fast on AVX2 CPUs, while HCE engines don't benefit that much from AVX.
AlexChess wrote: ↑Fri Dec 31, 2021 4:26 pm
Thank you for your comment Sopel.
Sopel wrote: ↑Fri Dec 31, 2021 1:22 pm
AlexChess wrote: ↑Fri Dec 31, 2021 7:48 am
Even trusted rating lists aren't so reliable, even when they play thousands of games (but only between a few engines, not all of those available)
connor_mcmonigle wrote: ↑Thu Dec 30, 2021 3:31 pm
You literally said you value style over Elo in your previous message. However, you'd almost certainly be unable to tell apart Kayra from SF were the names hidden as they play effectively identically... what exactly do you think you're measuring? There are so many manipulated variables and your sample size is so small that your testing is meaningless. We have rating lists to perform third party, "real world" tests already.
I have nothing against what you're doing. Have fun, but don't pretend your playing with chess engines has any value for engine developers.
1. Mixed results obtained using 1, 2, 4, or 8 CPUs
2. The same engine (e.g. Stockfish 14.1) is rated anywhere from 3500 up to 3900 Elo!!
3. Different time controls
4. Different hardware (somebody states that every time you double the hardware speed, you gain 80 Elo)
Best regards, Alex
1. Usually attributed to small sample size. I'm not aware of any significant results with more than one thread (apart from fishtest), as they are quite costly. Also, some engines just scale worse with the number of threads.
2. This starts making sense as soon as you learn that Elo is a relative metric
3. It's very rare that an engine's strength changes drastically compared to others depending on time control. Usually explained by small sample size.
4. What? Apart from testing Lc0 I don't know anyone who would use different hardware for different engines being tested in a single match. The relative performance across hardware is usually pretty comparable between engines.
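To make points 1 to 3 concrete, here is a minimal sketch, assuming the standard logistic Elo model and a simple normal approximation for the error margin. It is not what any particular rating list runs (those use their own tools and models), and the function names are made up for illustration. It shows that only rating differences matter, and how wide the uncertainty is for a small match.

```python
import math


def expected_score(rating_a, rating_b):
    """Expected score of A against B under the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_diff_from_score(score):
    """Invert the logistic model: Elo difference implied by a score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (low, estimate, high) Elo difference from a W/D/L record.

    Assumption: a normal approximation of the mean score, with z=1.96
    giving a rough 95% interval. Real rating tools are more careful.
    """
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game scores 1, 0.5 or 0.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    stderr = math.sqrt(variance / n)

    def clamp(p, eps=1e-6):
        # Keep the score strictly inside (0, 1) so the logarithm is defined.
        return min(max(p, eps), 1.0 - eps)

    low = elo_diff_from_score(clamp(score - z * stderr))
    mid = elo_diff_from_score(clamp(score))
    high = elo_diff_from_score(clamp(score + z * stderr))
    return low, mid, high


if __name__ == "__main__":
    # Same 100-point gap, different absolute numbers: identical prediction.
    print(expected_score(3500, 3400), expected_score(3900, 3800))
    # Same 55% score, 20 games versus 2000 games.
    print(elo_with_margin(11, 0, 9))
    print(elo_with_margin(1100, 0, 900))
```

The first print illustrates why the same engine can sit at 3500 on one list and 3900 on another without any contradiction: each list fixes its own anchor, and only the gaps between engines carry information. The last two prints show the same 55% score with very different interval widths, which is the small sample size problem in a nutshell.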
4. I mean, how can I obtain a trustworthy Elo using a Ryzen 9 5850 and my Android Snap 626 smartphone!
Note: I'm trying to understand how to make my tests more meaningful.
Best regards and happy New year 2022!
What do you want from an engine on Android? Do you want it to be the best? That is something nobody cares about. If you want the best engine (whatever "best" means), you go for a massive data center running the latest SF dev version, LC0, or Dragon. If you want to enjoy yourself, then toss Elo and rating lists away and enjoy the chess itself.