Testing strategies for my engine's playing strength

lithander · Post by **lithander** » Mon Jan 04, 2021 1:38 pm

I've started developing my first chess engine in C#. My goal with it is to learn and to maximize the ratio of ELO to complexity, so to speak. As it is my first attempt, I follow the KISS principle. For example, instead of the 0x88 trick or bitboards I just use an array of 64 squares. So for now the engine is both slow and weak, but simple to understand!

What I really liked was how discovering 'perft' and 'divide' allowed me to make sure my move generator is working correctly. I can just type "perft 5 testsuite.txt" after code changes and I know whether I'm moving in the right direction or not.
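For readers who have not met the two commands: perft counts the leaf nodes of the legal-move tree down to a fixed depth, and divide splits that count per root move, so a mismatch against known reference counts can be narrowed down to one subtree. A minimal sketch of the idea follows, in Python rather than C#. The `Position` interface (`legal_moves`, `play`) is hypothetical, and the `Nim` class is only a toy stand-in for a chess board so that the example runs on its own.

```python
from typing import Dict, Hashable, Iterable, Protocol


class Position(Protocol):
    """Hypothetical interface; a real engine would use its own board type."""

    def legal_moves(self) -> Iterable[Hashable]: ...

    def play(self, move) -> "Position": ...


def perft(pos: Position, depth: int) -> int:
    """Count the leaf nodes reachable in exactly `depth` plies."""
    if depth == 0:
        return 1
    return sum(perft(pos.play(move), depth - 1) for move in pos.legal_moves())


def divide(pos: Position, depth: int) -> Dict[Hashable, int]:
    """Perft count below each root move; the values sum to perft(pos, depth)."""
    if depth < 1:
        raise ValueError("divide needs depth >= 1")
    return {move: perft(pos.play(move), depth - 1) for move in pos.legal_moves()}


class Nim:
    """Toy game standing in for chess: take 1 or 2 stones from a heap."""

    def __init__(self, stones: int) -> None:
        self.stones = stones

    def legal_moves(self):
        return [n for n in (1, 2) if n <= self.stones]

    def play(self, move: int) -> "Nim":
        return Nim(self.stones - move)
```

With a real board type, comparing the output of `divide` move by move against a trusted engine shows which root move hides the move-generator bug; repeating the comparison one ply deeper on that move isolates the faulty position.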

Now I've begun with the eval & search. Is there a similar strategy to make sure my engine is gaining strength? Are tournaments the best way or is there something else? If tournaments are the way to go, can you suggest a tool that can run games in parallel? I dabbled with Arena Chess GUI a bit, but it takes forever while my 6-core CPU stays mostly idle, and the built-in engines are winning every game against my current version.

Sorry, if the question is dumb. It's my first post here and I'm still learning a lot every day!

Nomis · Post by **Nomis** » Mon Jan 04, 2021 2:05 pm

Hello,

I'm in the same boat as you are (well mostly, I kissed KISS goodbye a while ago...) and indeed making sure your engine is improving is definitely one of the difficult aspects of writing a chess engine.

In the first stages of development, even simple changes will result in huge ELO gains and playing a few games versus the previous version of your engine is often enough to see if it's getting better or worse. Later, it's apparently unavoidable, you'll have to run several thousand games to see if you win or lose 5 ELO.
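To put rough numbers on "a few games" versus "several thousand games", here is a small sketch that turns a win/draw/loss record into an ELO difference with an approximate 95% interval. It assumes the usual logistic rating model and a normal approximation with independent games; the function names are illustrative, not taken from any particular tool.

```python
import math


def elo_from_score(score: float) -> float:
    """ELO difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, elo_low, elo_high) for a match result.

    z = 1.96 gives roughly a 95% interval under a normal approximation.
    """
    n = wins + draws + losses
    if n == 0:
        raise ValueError("no games played")
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score, from the observed W/D/L frequencies.
    var = (
        wins * (1.0 - mean) ** 2
        + draws * (0.5 - mean) ** 2
        + losses * (0.0 - mean) ** 2
    ) / n
    stderr = math.sqrt(var / n)

    def clamp(s: float) -> float:
        # Keep the score strictly inside (0, 1) so the logarithm is defined.
        return min(max(s, 1e-6), 1.0 - 1e-6)

    return (
        elo_from_score(clamp(mean)),
        elo_from_score(clamp(mean - z * stderr)),
        elo_from_score(clamp(mean + z * stderr)),
    )
```

For example, `match_elo(40, 30, 30)` reports about +35 ELO, but with an interval of roughly -22 to +94, so 100 games cannot distinguish a real gain from noise. With the same ratios over 10000 games the lower end of the interval is clearly above zero.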

Testing versus other engines makes sense of course. I'm almost convinced only testing your engine vs itself will produce weird results. At some point I introduced a change in my search that would make my engine lose 100% versus the previous version and yet perform slightly better against other engines.

When other engines are too strong, I use these strategies:
- use the difficulty setting on other engines. I originally made my engine play against Stockfish level 3, now I'm using level 8
- use different (and very short) time controls for the other engine
- use a weaker engine. CCRL publishes a nice ELO-ordered rating list.

Sadly I can't recommend a test environment as I'm using Linux.

brianr · Post by **brianr** » Mon Jan 04, 2021 2:39 pm

Testing engines is its own specialty, however we all have to figure out some methodology to know if progress is being made.
I cannot tell you how many times I have messed this up. That said, some suggestions:

Learn to use cutechess-cli with an engines.json file for configurations.
GUIs strongly not recommended for tournaments.
Learn to use Ordo to see the results.
Learn about picking books.
When engines are very close and strong (you may not be there yet), using a slightly imbalanced book can help reduce the number of draws, which otherwise increases a lot.
Remember to use the -repeat option.
Find the most recent versions of cutechess-cli and Ordo.
Use tablebases for adjudication, unless you specifically want to test your engine's endgame code.
Do not use weakened or "strength" settings for other engines as this is quite inconsistent.
Instead, pick an engine of appropriate strength from the CCRL or CEGT lists.
Play "most of the time" with your engine against itself old v new.
Try very hard to only change one thing at a time.
From "time to time" test against a pool of other engines.
If things seem "weird", run two identical copies of your engine A v B; the results should stay VERY close to 50/50.
Run until at least 95% (99 or 100 better) CFS.
For minor eval changes you might get away with fixed nodes per move games if your engine supports that.
For everything else, time per move games (with an increment) is better.
Avoid super fast increments, depending on your CPU speed.
I suggest no faster than 0.02 seconds increment.
There is a lot of automatic over-clocking in CPUs these days and IMHO a slightly longer TC can mitigate that.
I still try not to use more threads than cores for timed games (ok for fixed nodes).
Once you are up to speed, then learn about SPRT.
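As a starting point for the first item in the list, cutechess-cli can read engine definitions from an engines.json file, which is a JSON array with one object per engine; an `-engine conf=NAME` argument then refers to an entry by its `name`. The sketch below generates such a file from Python. The engine names and paths are hypothetical placeholders, and only the basic keys (`name`, `command`, `protocol`, `workingDirectory`) are shown.

```python
import json
from pathlib import Path
from typing import Dict, List


def engine_entry(name: str, command: str, protocol: str = "uci",
                 working_dir: str = "") -> Dict[str, str]:
    """One engine definition in the layout cutechess uses for engines.json."""
    return {
        "name": name,
        "command": command,
        "protocol": protocol,
        "workingDirectory": working_dir,
    }


def engines_json_text(entries: List[Dict[str, str]]) -> str:
    """Serialize the engine list as a JSON array."""
    return json.dumps(entries, indent=2) + "\n"


def write_engines_json(entries: List[Dict[str, str]],
                       path: str = "engines.json") -> None:
    """Write the file next to where cutechess-cli will be started."""
    Path(path).write_text(engines_json_text(entries), encoding="utf-8")


# Hypothetical old-vs-new pair for self-play testing.
ENTRIES = [
    engine_entry("MyEngine_dev", "./builds/dev/MyEngine"),
    engine_entry("MyEngine_master", "./builds/master/MyEngine"),
]
```

Calling `write_engines_json(ENTRIES)` produces the file; keeping one entry per build makes it easy to follow the "old v new" and "change one thing at a time" advice, because each tested binary has a stable name in the resulting PGN.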

Testing engines is a very exacting process and it is easy to lose patience.
Avoid SSS (Small Sample Size) results.

No4b · Post by **No4b** » Mon Jan 04, 2021 2:58 pm

In the first iterations, if you are using well-known techniques (like a hash table, LMR, null move), a match of a few hundred games between the _dev and master versions is enough to confirm that your code is right, and it gives you a rough estimate of how much ELO you gained from it.

But later in development (let's say starting at ~2000-2200 CCRL ELO strength) a proper testing system with statistical estimates of the gain is a must. I would strongly warn you against any small-sample-size tests, as even patches that look like +10 ELO after 1000 games can turn out to be -10 once proper testing is finished.

I think the most common tool for this is cutechess-cli (you can run it on both Windows and Linux). With it you can run multiple games at once on your PC, significantly speeding up testing (which for some changes takes >20000 games).

cutechess-cli -engine conf=Drofa_dev tc=0/10+0.1 -engine conf=Drofa_master tc=0/10+0.1 -tournament round-robin -rounds 30000 -sprt elo0=0 elo1=5 alpha=0.05 beta=0.05 -resign movecount=6 score=1000 -draw movenumber=45 movecount=5 score=15 -concurrency 8 -repeat -openings file=8moves_OPENBENCH.pgn format=pgn order=random plies=30 -pgnout eval_.pgn

The most common bounds (elo0, elo1) are [0, 5] for improvements and [-5, 0] for simplifications. As I have yet to study tuning, I also use [-1, 4] if I'm sure that the term is useful and can be tuned further.
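To illustrate what the `-sprt elo0=0 elo1=5 alpha=0.05 beta=0.05` part of the command decides, here is a sketch of a sequential probability ratio test on a running win/draw/loss count. It uses a common normal approximation of the log-likelihood ratio and the logistic ELO model; this is an assumption made for illustration, and cutechess-cli's internal model may differ in detail.

```python
import math


def expected_score(elo: float) -> float:
    """Expected score for a given ELO advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_bounds(alpha: float, beta: float):
    """Lower and upper stopping thresholds for the log-likelihood ratio."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)


def sprt_llr(wins: int, draws: int, losses: int,
             elo0: float, elo1: float) -> float:
    """Approximate LLR of H1 (elo = elo1) against H0 (elo = elo0)."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean ** 2  # per-game score variance
    if var <= 0.0:
        return 0.0  # degenerate sample (all results identical): no evidence yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_status(wins: int, draws: int, losses: int,
                elo0: float = 0.0, elo1: float = 5.0,
                alpha: float = 0.05, beta: float = 0.05) -> str:
    """Return 'accept H1', 'accept H0' or 'continue' for the current counts."""
    lower, upper = sprt_bounds(alpha, beta)
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

The point of the sequential test is that it stops as soon as the evidence crosses one of the two thresholds, so clearly good or clearly bad patches finish early, and only the borderline ones consume tens of thousands of games.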

I personally test every change I make at 10s + 0.1s per move, using concurrency = number of threads. Then, when I have accumulated some changes (let's say 4-5 patches, or 1-2 if I'm not sure about the short-TC result), I do a re-test at 60s + 1s per move against the version that has none of those patches. After I have a decent number of patches (10-12), I test against 1-2 engines whose CCRL ELO I guess to be close to that of the patched version.

Another helpful feature to implement is a "bench" command. It gives the engine a diverse set of positions to evaluate with a fixed-depth search (let's say 15) and outputs the nodes searched and an nps estimate. Very convenient for estimating the effect of pruning (I could be wrong about bench, because I never implemented it myself).
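A bench command along those lines could be sketched as follows. The `search` callback is a hypothetical hook into the engine that searches one position to a fixed depth and returns the number of nodes it visited; the clock is injectable only so that the function can be exercised deterministically.

```python
import time
from typing import Callable, Iterable, Tuple

# Hypothetical engine hook: search(fen, depth) -> nodes visited.
SearchFn = Callable[[str, int], int]


def bench(positions: Iterable[str], search: SearchFn, depth: int = 15,
          clock: Callable[[], float] = time.perf_counter) -> Tuple[int, int]:
    """Search every position to a fixed depth; return (total_nodes, nps)."""
    total_nodes = 0
    start = clock()
    for fen in positions:
        total_nodes += search(fen, depth)
    elapsed = clock() - start
    # Guard against a zero-length measurement on very fast runs.
    nps = int(total_nodes / elapsed) if elapsed > 0 else 0
    return total_nodes, nps
```

If the search is deterministic, the total node count acts as a signature of the search behaviour: a refactoring that is supposed to be purely cosmetic should leave it unchanged, while a pruning change should move it, and the nps figure shows whether raw speed changed.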

brianr · Post by **brianr** » Mon Jan 04, 2021 3:12 pm

I forgot a few more things.

When you think your engine is stable, enter a tournament, like the monthly tourneys announced here.
NOTHING will find bugs in an engine like a tournament.
But please make an effort to get your engine stable.
It is a pain when an engine has major problems for the tourney manager and unfair to the others.
Also, ask the testing sites to give it a go (CCRL, CEGT, etc).

IanKennedy · Post by **IanKennedy** » Mon Jan 04, 2021 5:02 pm

If you aren't at the level to win against the Arena bundled engines there are plenty knocking around - TSCP is a good one to start with as is CDrill (based on TSCP but slightly stronger). Those can bridge the gap. If you aren't even that strong yet TarraschToyEngine or Toledo Nanochess... and so on.

Not sure what your point about Arena is, but there are GUIs which let you run concurrent tournaments if your engine is single-threaded and you want to use all the cores; I believe Banksia does this.

fabianVDW · Post by **fabianVDW** » Mon Jan 04, 2021 8:56 pm

Learn to use OpenBench: https://github.com/AndyGrant/OpenBench

It might seem like quite a task to set it up, and it is, but it will pay off in the long run. It tracks all your tests, is well funded, automatically distributed (for instance if you have multiple machines) and not too hard to set up. See http://chess.grantnet.us/ for some running instances of OB. When your engine is strong enough and stable, you can even ask Andrew Grant to join the OpenBench instance at http://chess.grantnet.us/.

For more questions ping the OpenBench people on Discord or here.

lithander · Post by **lithander** » Mon Jan 04, 2021 10:12 pm

Wow, thanks for all the detailed replies. That's a lot of information to dig into and will keep me busy for a while!

lithander · Post by **lithander** » Wed Jan 13, 2021 9:39 pm

I've got a follow up question:

How are opening books typically handled in tournaments?

Can each engine use an opening book of its choice? Of arbitrarily large size and in a custom format?

Or are they configured to all use the same one? If so, and I wanted to have my engine take part in such tournaments, what file formats need to be supported? What are standard opening books and where do I find them?

And if no opening books are allowed and an engine has some opening lines encoded in its executable, is it banned or will this go undetected?

lithander · Post by **lithander** » Wed Jan 27, 2021 11:38 am

(I've found the answer to my above question)

I've been thoroughly impressed by how useful perft was for developing a bug-free move generator. I found a testsuite.txt and now I can validate the move generator whenever I feel I made a risky change. Very handy.

Are there similar strategies for testing the search? I'm thinking of a list of interesting positions where you track the eval result, the time it took to find it, and the nodes explored. Maybe include reference values from Stockfish in that data to compare your engine's moves against and automatically find positions where it blunders. If you change the pruning, nodes and duration should improve but the eval shouldn't change. If you improve the eval or the search depth, you expect the mean accuracy vs Stockfish to improve.
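One way to sketch the position-list idea is with EPD lines, where each record holds a position plus operations such as `bm` (best move) and `id`. The harness below parses such lines and reports the fraction of positions where the engine's choice matches a listed best move. The `best_move` callback is a hypothetical engine hook that is assumed to return the move in the same notation the file uses (SAN), and the parser is deliberately simplified: it does not handle semicolons inside quoted strings.

```python
from typing import Callable, Dict, Iterable, Tuple


def parse_epd(line: str) -> Tuple[str, Dict[str, str]]:
    """Split an EPD record into its 4-field position and its operations."""
    fields = line.strip().split(None, 4)
    position = " ".join(fields[:4])
    operations: Dict[str, str] = {}
    if len(fields) == 5:
        for op in fields[4].split(";"):
            op = op.strip()
            if op:
                opcode, _, operand = op.partition(" ")
                operations[opcode] = operand.strip().strip('"')
    return position, operations


def solved_fraction(epd_lines: Iterable[str],
                    best_move: Callable[[str], str]) -> float:
    """Fraction of positions with a 'bm' operation that the engine solves."""
    solved = total = 0
    for line in epd_lines:
        if not line.strip():
            continue  # skip blank lines in the suite file
        position, operations = parse_epd(line)
        if "bm" not in operations:
            continue  # no reference move to compare against
        total += 1
        if best_move(position) in operations["bm"].split():
            solved += 1
    return solved / total if total else 0.0
```

The reference moves could come from a published test suite or from a stronger engine's analysis. Logging nodes and time per position alongside the solved fraction would give the three quantities mentioned above, although such suites measure tactical accuracy rather than playing strength, so they complement matches instead of replacing them.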

Are there well-established testing procedures like that? Data sets? Did you do something like that with your own engine? (Apart from running tournaments between different versions, of course, which has already been discussed.)
