This is BS.
hgm wrote: Sun Apr 28, 2024 8:46 am
The most important difference is that in those days there existed many chess engines, and if you did not test your ideas by playing gauntlets against a variety of opponents, the results were often meaningless. That required 4 times as many games as self-play to get the type I and type II errors below the desired thresholds, and of course computers were several orders of magnitude slower, so that it took a lot of resources to generate the games in the first place.
Now that all engines are basically the same, playing gauntlets still amounts to self play, and makes no sense anymore. It is much easier to design an engine for just beating one particular opponent, than it would be to develop an engine that is generally good.
Stockfish has tested with self-play only since the start of fishtest, which launched in 2013, and it quickly became a top 1-2 engine. Until NNs came along it was disgustingly dominant: SF9 HCE is still stronger than any other HCE engine ever produced, and the latest dev version with HCE was about 130 Elo stronger than SF9. And the self-play gains were confirmed by rating-list gains at more or less every turn.
"Gauntlet testing" is just bad, and that is all there is to it. It has nothing to do with whether the engines in the gauntlet are different or not.
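
For reference, the "4 times as many games" figure in the quote is the usual variance argument, and it can be checked with a rough sketch. The assumptions here are mine, not from either post: the per-game score has the same standard deviation `sigma` in both setups (in practice draw rates differ), the gauntlet games are split evenly between the old and the new version, and the function names are purely illustrative.

```python
import math

def se_selfplay(n_games, sigma=0.5):
    """Standard error of the match score when the new version plays
    the old version directly for n_games (sigma is an assumed value)."""
    return sigma / math.sqrt(n_games)

def se_gauntlet(n_games, sigma=0.5):
    """Standard error of (score_new - score_old) when n_games are split
    evenly between the two versions' gauntlets against third parties."""
    per_version = n_games / 2
    se_one = sigma / math.sqrt(per_version)  # error of one gauntlet score
    return math.sqrt(2) * se_one             # difference of two independent scores

def games_ratio(sigma=0.5):
    """Factor by which the gauntlet needs more games for the same error."""
    n = 10_000  # arbitrary; the ratio does not depend on n
    return (se_gauntlet(n, sigma) / se_selfplay(n, sigma)) ** 2

print(round(games_ratio(), 6))
```

Under these assumptions the gauntlet's error is twice the self-play error at the same total game count, so matching the error takes four times the games. That says nothing about whether the gauntlet result generalizes better, only about what it costs.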