Ethereal 12.50 vs Stockfish 12 in CCRL FRC testing has this run of games right at the very end of the 300 games they played:
1 1 0 = 0 0 = 1 0 1 1 1
+6-4=2 for Ethereal. +4-1 in the last 5 games.
Overall results are +22-185=93.
One thing that should be said is that error bars are directly tied to the variance of the results. Play e.g. 100 games N times and measure the distribution of results. When the draw rate is higher (as it is with longer TC and more threads), the standard deviation goes down even though the sample size is unchanged. But I don't know whether intrinsic WDL is enough, or whether you need experimental data to build a good prior for error bar computations.
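To put rough numbers on the draw-rate effect, here is a minimal Python sketch that computes the score and its standard error from a W/D/L tally. It assumes every game is an independent draw from the same W/D/L distribution, which is exactly the assumption questioned here, so treat it as an illustration rather than a proper error bar. The function name `score_and_stderr` is made up for this post, and the 22/93/185 tally is read as being from Ethereal's side.

```python
import math

def score_and_stderr(wins, draws, losses):
    """Return (mean score, standard error of the mean score).

    Assumes independent, identically distributed games scored 1 / 0.5 / 0.
    """
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score around its mean.
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    return mean, math.sqrt(var / n)

# The 300-game result quoted above, taken as Ethereal's W/D/L (assumption).
ethereal_mean, ethereal_se = score_and_stderr(22, 93, 185)

# Same 50% score over 100 games, with and without draws.
decisive_mean, decisive_se = score_and_stderr(50, 0, 50)
drawish_mean, drawish_se = score_and_stderr(25, 50, 25)

print(f"Ethereal: score {ethereal_mean:.3f} +/- {ethereal_se:.3f} (1 sigma)")
print(f"No draws:  +/- {decisive_se:.4f}")
print(f"50% draws: +/- {drawish_se:.4f}")
```

With half the games drawn, the per-game variance is halved, so the standard error shrinks by a factor of sqrt(2) at the same sample size. That is a real saving, but nowhere near an order of magnitude.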
A simple thought experiment involves an engine X playing a drawless game. In self-play, playing both sides equally often, the intrinsic WDL will always be 50/0/50. However, at very long TC the engine might win 100% of the time with the strong side, giving no variance at all, while at very short TC the side that starts out weaker could snatch wins, creating variance. So it seems intrinsic WDL by itself isn't enough to properly assess variance. And of course, the measured WDL will be off compared to the intrinsic WDL, which adds further unreliability.
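The thought experiment can be simulated in a few lines of Python. This is only a sketch under my own assumptions: games are played in pairs with colours reversed, the strong side wins each game independently with a hypothetical probability `p_strong`, and we track the score of one copy of the engine. The names `simulate_match` and `score_spread` are invented for the illustration.

```python
import math
import random
import statistics

def simulate_match(p_strong, n_pairs, rng):
    """Score fraction of one engine copy over n_pairs reversed-colour pairs."""
    points = 0
    for _ in range(n_pairs):
        # Game 1: our copy has the strong side, wins with probability p_strong.
        points += rng.random() < p_strong
        # Game 2: our copy has the weak side, wins with probability 1 - p_strong.
        points += rng.random() >= p_strong
    return points / (2 * n_pairs)

def score_spread(p_strong, n_pairs=50, n_matches=2000, seed=1):
    """Standard deviation of the match score over many repeated matches."""
    rng = random.Random(seed)
    scores = [simulate_match(p_strong, n_pairs, rng) for _ in range(n_matches)]
    return statistics.pstdev(scores)

# What a plain 50/0/50 WDL would predict for 100 independent games.
naive_sd = math.sqrt(0.25 / 100)

for p in (1.0, 0.9, 0.7, 0.5):
    # Under the pairing assumption the spread is sqrt(p * (1 - p) / (2 * n_pairs)).
    expected_sd = math.sqrt(p * (1 - p) / 100)
    print(f"p_strong={p}: simulated {score_spread(p):.4f}, "
          f"paired formula {expected_sd:.4f}, naive WDL {naive_sd:.4f}")
```

In every case the expected WDL is 50/0/50, yet the spread of the match score runs from zero (strong side always wins) up to the naive value (coin flip), which is the point of the thought experiment.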
In the end, the fact that you might get away with fewer games for the same confidence doesn't mean multiple orders of magnitude fewer.