In the first iterations, if you are using well-known techniques (like implementing a hash table, LMR, or null move), a match of a few hundred games between the _dev and master versions is enough to confirm that your code is right, and it gives you a rough estimate of how much Elo you gained from it.
But later in development (let's say starting at ~2000-2200 CCRL Elo strength) a proper testing system with statistical estimation of the gain is a must. I would strongly warn you against any small-sample tests, as even patches that look like +10 Elo after 1000 games can turn out to be -10 when proper testing is finished.
I think the most common tool for this is cutechess-cli (you can run it on both Windows and Linux). With it you can run multiple games at once on your PC, significantly speeding up testing (which for some changes takes >20000 games).
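To get a feel for how noisy 1000 games are, here is a minimal sketch that turns a win/loss/draw count into an Elo estimate with an approximate 95% interval. It assumes the usual logistic Elo model and a normal approximation for the score; the function names and the example counts are made up for illustration.

```python
import math

def elo_from_score(score):
    """Convert a score fraction in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins, losses, draws, z=1.96):
    """Return (elo, low, high): point estimate and an approximate 95% interval.

    Assumes the interval stays inside (0, 1) in score terms, which holds
    for reasonably balanced matches.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))

# Hypothetical 1000-game result that looks like roughly +10 Elo.
elo, low, high = elo_with_margin(wins=290, losses=261, draws=449)
print(f"Elo {elo:+.1f}, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With these made-up counts the interval still includes negative values, so the result on its own does not show that the patch is a gain.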
cutechess-cli -engine conf=Drofa_dev tc=0/10+0.1 -engine conf=Drofa_master tc=0/10+0.1 -tournament round-robin -rounds 30000 -sprt elo0=0 elo1=5 alpha=0.05 beta=0.05 -resign movecount=6 score=1000 -draw movenumber=45 movecount=5 score=15 -concurrency 8 -repeat -openings file=8moves_OPENBENCH.pgn format=pgn order=random plies=30 -pgnout eval_.pgn
The most common bounds (elo0, elo1) are [0, 5] for improvements and [-5, 0] for simplifications. As I am yet to study tuning, I also use [-1, 4] if I'm sure that the term is useful and can be tuned further.
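For anyone wondering what the bounds and alpha/beta actually do, here is a rough sketch of an SPRT check. It uses a common normal approximation of the log-likelihood ratio (LLR), which is an assumption on my part and not necessarily the exact formula cutechess-cli uses internally; the function names are invented for this example.

```python
import math

def expected_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_bounds(alpha=0.05, beta=0.05):
    """LLR thresholds: stop and reject at `lower`, stop and accept at `upper`."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper

def sprt_llr(wins, losses, draws, elo0, elo1):
    """Approximate LLR of H1 (elo = elo1) against H0 (elo = elo0)."""
    n = wins + losses + draws
    if n == 0:
        return 0.0
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins + 0.25 * draws) / n - score ** 2
    if var <= 0.0:
        return 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * score - s0 - s1) / (2.0 * var)

lower, upper = sprt_bounds(alpha=0.05, beta=0.05)
llr = sprt_llr(wins=2900, losses=2750, draws=4350, elo0=0, elo1=5)  # made-up counts
if llr >= upper:
    print("accept: patch passes")
elif llr <= lower:
    print("reject: patch fails")
else:
    print("keep playing")
```

With alpha = beta = 0.05 the thresholds come out at about -2.94 and +2.94, and the test keeps running until the LLR crosses one of them. That is why the number of games is not fixed in advance.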
I personally test every change I make at 10s + 0.1s per move, using concurrency = number of threads. Then, when I have some amount of changes (let's say 4-5 patches, or 1-2 if I'm not sure about the short-TC result), I do a re-test at 60s + 1s per move against the version that has none of those patches. After I have a decent amount of patches (10-12), I test against 1-2 engines with a CCRL Elo that I guess would be close to the patched version.
Another helpful feature to implement is a "bench" command. It gives the engine a set of diverse positions to search at a fixed depth (let's say 15) and outputs the nodes searched and an nps estimate. Very convenient for estimating the effect of pruning (I could be wrong about bench, because I have never implemented it myself).
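If you script your tests, the two stages differ only in the time control. Below is a small sketch that builds just that part of the cutechess-cli command line, using only options that appear in the command above. The helper name is hypothetical, "0/60+1" is my assumption for how 60s + 1s is written in the same tc format, and I assume os.cpu_count() (logical CPUs) is what you want for concurrency. The remaining options (-sprt, -openings, -resign and so on) would be appended unchanged.

```python
import os

def build_match_command(dev, master, tc, concurrency=None):
    """Build the engine/time-control part of a cutechess-cli command line."""
    if concurrency is None:
        # One game per logical CPU; falls back to 1 if the count is unknown.
        concurrency = os.cpu_count() or 1
    return [
        "cutechess-cli",
        "-engine", f"conf={dev}", f"tc={tc}",
        "-engine", f"conf={master}", f"tc={tc}",
        "-concurrency", str(concurrency),
        "-repeat",
    ]

# Stage 1: every single change, short time control.
short_tc = build_match_command("Drofa_dev", "Drofa_master", "0/10+0.1")
# Stage 2: a batch of patches against the version without them, long time control.
long_tc = build_match_command("Drofa_dev", "Drofa_master", "0/60+1")
print(" ".join(short_tc))
print(" ".join(long_tc))
```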
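The idea can be sketched in a few lines. This is only an outline of the bookkeeping, not how any particular engine does it: `search` stands for your engine's fixed-depth search returning a node count, and `fake_search` is a made-up stand-in so the example runs on its own.

```python
import time

def run_bench(positions, search, depth=15):
    """Search each position to a fixed depth; return (total_nodes, nps)."""
    total_nodes = 0
    start = time.perf_counter()
    for fen in positions:
        total_nodes += search(fen, depth)
    elapsed = time.perf_counter() - start
    nps = int(total_nodes / elapsed) if elapsed > 0 else 0
    return total_nodes, nps

def fake_search(fen, depth):
    """Deterministic placeholder for a real search; returns a fake node count."""
    return len(fen) * depth

# Two well-known positions; a real bench set would be larger and more varied.
positions = [
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1",
]
nodes, nps = run_bench(positions, fake_search, depth=15)
print(f"nodes {nodes} nps {nps}")
```

Because the depth is fixed, the node total should stay the same between runs of the same build (at least single-threaded), so comparing it before and after a pruning patch shows how much the tree shrank.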