I also use cutechess-cli for automated testing, but I haven't tried its SPRT facility.
Some thoughts...
I do runs of 20,000 games, which I occasionally terminate early if the change gives a value consistently well outside the error window.
I used to use a normal openings book like you, but changed a couple of months ago when I noticed that with so many games I was getting repeated openings. I now use the openings-8ply-10k.pgn file, which gives you 10,000 unique openings (and playing each one as both white and black gives you 20,000 unique games).
These days, sadly, it is rare to find an improvement that is outside the error margins (which are still +/-3.6 after 20,000 games). But I usually adopt the change if the result is positive, even when it is small. I have also often seen a change look promising after 'only' 5,000 games, only for the gain to dissipate to nothing after 20,000.
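For anyone wondering where a figure like +/-3.6 comes from, here is a rough Python sketch of the usual calculation. It assumes the margin is an Elo difference at roughly 95% confidence (z = 1.96) and uses a made-up win/draw/loss split with about 44% draws; neither of those is taken from my actual runs, so treat the numbers as illustrative only.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert an average score (strictly between 0 and 1) to an Elo difference."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, margin), where margin is the approximate half-width
    of the confidence interval around the Elo estimate."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    std_error = math.sqrt(variance / n)
    low = elo_from_score(score - z * std_error)
    high = elo_from_score(score + z * std_error)
    return elo_from_score(score), (high - low) / 2.0


# Hypothetical results: an even match with about 44% draws.
elo_20k, margin_20k = elo_with_margin(5600, 8800, 5600)  # 20,000 games
elo_5k, margin_5k = elo_with_margin(1400, 2200, 1400)    # 5,000 games
print(f"20,000 games: {elo_20k:+.1f} +/- {margin_20k:.1f}")
print(f" 5,000 games: {elo_5k:+.1f} +/- {margin_5k:.1f}")
```

With that assumed draw rate the margin comes out at about 3.6 for 20,000 games and about twice that for 5,000, since it shrinks with the square root of the game count. That is why a result that looks good at 5,000 games can easily be noise.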
I also set the concurrency to one fewer than the number of cores, to give cutechess-cli and the OS room to breathe.
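As a rough illustration of the setup described above, this Python sketch builds a cutechess-cli command line with the concurrency set to one fewer than the core count and each opening played twice with colours swapped. The engine paths, engine names, protocol and time control are placeholders I made up for the example, not my actual settings, and the helper name build_cutechess_command is hypothetical.

```python
import os


def build_cutechess_command(
    new_engine: str,
    base_engine: str,
    openings: str = "openings-8ply-10k.pgn",
    rounds: int = 10000,
    tc: str = "10+0.1",  # placeholder time control, not from the post
):
    """Build an argument list for cutechess-cli (illustrative only)."""
    cores = os.cpu_count() or 2
    # Leave one core free for cutechess-cli itself and the OS.
    concurrency = max(1, cores - 1)
    return [
        "cutechess-cli",
        "-engine", f"cmd={new_engine}", "name=new",
        "-engine", f"cmd={base_engine}", "name=base",
        "-each", "proto=uci", f"tc={tc}",
        "-openings", f"file={openings}", "format=pgn", "order=sequential",
        # Two games per round with -repeat: each opening is played
        # once with each colour, so 10,000 rounds give 20,000 games.
        "-games", "2", "-rounds", str(rounds), "-repeat",
        "-concurrency", str(concurrency),
    ]


cmd = build_cutechess_command("./engine_new", "./engine_base")
print(" ".join(cmd))
```

The list can be passed to subprocess.run as is, or printed and pasted into a shell.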
