Hi,
I decided to start working on my chess engine project again after a little break from it and am having a mental lapse on the following:
When I make evaluation changes (adding features, tuning parameters), I have my engines play each other at a fixed depth.
When I make search changes, I have my engines play a timed game.
In my evaluation, I am adding new features and tuning parameters. I observed that at a fixed depth of 4, the new engine is dominant (100-5-1), but at depth 6 the engines are just about equal.
Does anyone have an idea why this may be? Maybe the new features are good but the parameter values need more tuning? If the latter is true, then since I use a tapered eval, maybe the EG values in particular need the tuning?
Thank you
-Cheney
Testing Search and Evaluation
Moderators: hgm, Rebel, chrisw
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: Testing Search and Evaluation
Fixed-depth testing is not really an established method for eval tuning.
The standard way to tune eval parameters is the "Texel" tuning method, more properly termed logistic regression. Lots of discussion of that in this forum, if you search.
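To make the "Texel"/logistic-regression idea concrete, here is a minimal sketch of the error function that method minimizes. This is not from the post above; the function name, the data layout (a list of `(eval_cp, result)` pairs), and the default scaling constant `k` are illustrative assumptions.

```python
def texel_error(positions, k=1.2):
    """Mean squared error between game results and a sigmoid of the
    static eval, as in "Texel" tuning (logistic regression).

    `positions` is a list of (eval_cp, result) pairs: eval_cp is the
    static evaluation in centipawns from White's point of view, and
    result is 1.0 (White win), 0.5 (draw), or 0.0 (White loss).
    `k` is a scaling constant, normally fitted to the data first.
    """
    total = 0.0
    for eval_cp, result in positions:
        # Map the eval to an expected score in [0, 1].
        expected = 1.0 / (1.0 + 10.0 ** (-k * eval_cp / 400.0))
        total += (result - expected) ** 2
    return total / len(positions)
```

Tuning then means adjusting the eval parameters so this error drops over a large set of quiet positions labeled with game results; an eval of 0 maps to an expected score of exactly 0.5.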
--Jon
-
- Posts: 178
- Joined: Wed Nov 13, 2019 1:36 am
- Full name: Jonathan Kreuzer
Re: Testing Search and Evaluation
Because that extreme a difference is unlikely in my experience, it would be good to look at the games and verify the process is working. A few times in testing I've noticed weird results and then found that either the openings weren't set at all or the opening randomization wasn't working, so that's the first thing that comes to mind to check. (If there is hash or history data you don't clear, games might vary even at fixed depth from the same start position.)
I did do a little fixed-depth testing at one point. It wasn't the best or most accurate way to tune, especially compared to my later automatic tuning of eval values, but it did show at least some correlation for larger changes. For smaller changes I couldn't run enough longer games to know how accurate any correspondence was. (E.g., turning off tactical eval features might be around 40 Elo weaker over a large number of games, and turning off passed pawns or king safety might be around 100 Elo each; I don't remember exactly, but that was the general ballpark.)
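For context on how scores map to Elo figures like those above, here is a sketch of the standard logistic back-of-envelope conversion. It is not from this thread, and the function name is an illustrative assumption.

```python
import math

def elo_from_score(wins, losses, draws):
    """Rough Elo difference implied by a match score, using the
    standard logistic rating model (no error bars, no draw model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Clamp away from 0 and 1: a perfect score implies unbounded Elo.
    score = min(max(score, 1e-6), 1 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)
```

By this formula the original poster's 100-5-1 result at depth 4 implies roughly a 500 Elo gap, which is why a score that lopsided from a parameter tweak is worth double-checking.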
-
- Posts: 104
- Joined: Thu Sep 27, 2012 2:24 am
Re: Testing Search and Evaluation
My hiatus has been a few years, and one reason I took a break was, sorry to say, Texel tuning; I spent a long time trying to understand it and get it to work. I have read and participated in a few posts here. In the end, I believe my code was working, but my tuned engine always lost to my base engine. That being said, I am currently looking to get the juices flowing again and revisit Texel.
I had a process, similar to someone else's I read about online (I think it was Ed Schroder's, but I cannot recall). This is where the idea of using a set depth came into play. Since I do not have notes on the exact process, and since I cannot find the page I read those years ago, I am left trying to reverse engineer it.
I would like to continue adding a few more evaluation features and tune them manually. Maybe you are aware of a link or post discussing a process that could help me out? But I will take another shot at Texel in the near future.
-
- Posts: 536
- Joined: Thu Mar 09, 2006 3:01 pm
Re: Testing Search and Evaluation
The "Texel type" tuning gained about 100-150 Elo for Tinker, IIRC.
While that was great, it was a one time improvement.
Tinker's eval went from very poor to mediocre.
After that, smaller changes exceeded my patience to measure given my hardware at the time.
If your eval is already pretty good, improvements will get smaller and harder to measure.
For eval changes, fixed nodes (not depth) testing seemed fine at the time.
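A fixed-nodes match like the one suggested above can be run with cutechess-cli's per-engine `nodes` option. This is a sketch, not from the thread: the engine paths, opening book filename, and game count are placeholders, and the option names assume a reasonably recent cutechess-cli.

```shell
# Hypothetical paths; fixed-node match between two engine builds.
# nodes= caps the search per move; tc=inf disables the clock.
cutechess-cli \
  -engine cmd=./engine_new proto=xboard \
  -engine cmd=./engine_base proto=xboard \
  -each tc=inf nodes=500000 \
  -openings file=openings.pgn order=random \
  -games 1000 -repeat -pgnout results.pgn
```

`-repeat` plays each opening twice with colors reversed, which removes opening bias from the score.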
-
- Posts: 104
- Joined: Thu Sep 27, 2012 2:24 am
Re: Testing Search and Evaluation
Fixed nodes? I know I have recently seen some mention of that in another post, but I have never really looked into its value. My engine uses WinBoard, which I see has an NPS command/parameter. Is this what you are referring to?
Also, I use cutechess-cli to perform all testing. It looks like cutechess has a node-count parameter but not an NPS one. I will look into this further and test it out, but on the surface I do not know if cutechess's nodes parameter is the same as WinBoard's nps parameter.
-
- Posts: 104
- Joined: Thu Sep 27, 2012 2:24 am
Re: Testing Search and Evaluation
FYI … it appears CuteChess's "nodes" parameter is used for NPS and is sent via the winboard NPS command.
I will try this out, thank you for the help!
When I get back to trying Texel, I will review all the latest posts and probably ask some questions.