Engine testing: search vs eval

Discussion of chess software programming and technical issues.

Moderator: Ras

sedicla
Posts: 182
Joined: Sat Jan 08, 2011 12:51 am
Location: USA
Full name: Alcides Schulz

Engine testing: search vs eval

Post by sedicla »

My engine is now at a stage where it is hard to make progress. I would like to make sure it is free of bugs, or at least most of them.
Also, I'm not sure what to focus on, search or eval. From what I've heard, a wrong eval will hurt a good search and vice versa ...
So I was thinking of using this strategy:
Disable all search enhancements (futility, razoring, null move, etc.) and then run against an EPD test suite of about 250 tactical positions. First analyze changes only in the eval, then start turning the search enhancements back on one by one.
Has anyone tried that? Right now I am more interested in making sure it is correct.
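A minimal sketch of the kind of harness this strategy needs: load an EPD tactics suite, switch a search feature off through a UCI option, and count solved positions. It assumes the python-chess library is installed; "./myengine", "wac250.epd" and the option name "NullMove" are placeholders for your own engine, suite and toggles.

import chess
import chess.engine

def run_suite(engine_path, epd_file, movetime=1.0, options=None):
    solved = total = 0
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        if options:
            engine.configure(options)              # e.g. {"NullMove": False}
        for line in open(epd_file):
            line = line.strip()
            if not line:
                continue
            board, ops = chess.Board.from_epd(line)
            if "bm" not in ops:                    # only score best-move tests
                continue
            result = engine.play(board, chess.engine.Limit(time=movetime))
            total += 1
            if result.move in ops["bm"]:           # "bm" parses to a list of moves
                solved += 1
    return solved, total

# e.g. run_suite("./myengine", "wac250.epd", options={"NullMove": False})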
jdart
Posts: 4420
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Engine testing: search vs eval

Post by jdart »

Testing against a 250-position EPD test suite is almost useless. Based on my experience, your results will not correlate well with performance in games. You need to test with actual games, preferably a very large number of them.

Personally I have gotten more improvement from search changes than from eval changes.

--Jon
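To put "a very large number of them" in perspective, here is a small self-contained calculation (standard logistic Elo model, nothing engine-specific assumed) that turns a win/draw/loss record into an Elo difference with a rough 95% error bar:

import math

def elo(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error(wins, draws, losses):
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n                        # mean score per game
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2
           + losses * (0.0 - s) ** 2) / n               # per-game variance
    margin = 1.96 * math.sqrt(var / n)                  # ~95% interval on the mean
    return elo(s), elo(min(s + margin, 0.9999)) - elo(s)

# Even 4000 games only pin the difference down to roughly +/- 9 Elo:
# elo_with_error(1300, 1400, 1300) -> (0.0, ~8.7)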
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Engine testing: search vs eval

Post by diep »

jdart wrote:Testing against a 250-position EPD test suite is almost useless. Based on my experience, your results will not correlate well with performance in games. You need to test with actual games, preferably a very large number of them.

Personally I have gotten more improvement from search changes than from eval changes.

--Jon
Other than SMP search, I tend to disagree.

If I equip today's Diep evaluation with Diep's 1998 search, which was null move with R=3, hardly any check extensions, 8 probes in the hash table and that's it, then that is probably going to be 700 Elo stronger than if I used Diep's 1998 evaluation with today's search.

Searching even 30 plies is totally useless with a crap evaluation.

If I equip today's Diep eval with Diep's 1999 search, which already used what you nowadays call LMR, I'm not sure how many Elo points it would be away from today's Diep.

It won't be that much Elo-wise.

I would be amazed if it were more than 100 Elo points.

The real difference is that the search back then had a few inefficiencies in qsearch (I was doing more in it back then) and used a selective search that doesn't make much sense today.

Of course the only really big difference is that Diep's 1999 SMP search had one bug that sometimes crashed it (fixed in October 2000), and it didn't scale as well as today's SMP search.

If we look at Deep Fritz 6 from 1999, the first parallel Fritz version, it was already easily reaching 17 plies on a 500 MHz box with 4 CPUs.

My guess is that if you put today's evaluation into a search from back then, you would see a similar difference to what I see with Diep.

Several programmers who reach 25+ ply search depths report that the super-aggressive pruning which takes them from, say, a search depth of 19 plies to 28 plies, a jump of nearly 10 plies achieved in a dubious manner, delivers them 50-70 Elo points.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine testing: search vs eval

Post by Don »

sedicla wrote:My engine is now at a stage where it is hard to make progress. I would like to make sure it is free of bugs, or at least most of them.
Also, I'm not sure what to focus on, search or eval. From what I've heard, a wrong eval will hurt a good search and vice versa ...
So I was thinking of using this strategy:
Disable all search enhancements (futility, razoring, null move, etc.) and then run against an EPD test suite of about 250 tactical positions. First analyze changes only in the eval, then start turning the search enhancements back on one by one.
Has anyone tried that? Right now I am more interested in making sure it is correct.
You will get most of your improvement from evaluation, so I strongly recommend you put most of your effort there. Your search must be bug-free of course, and you should use the standard techniques such as null move pruning, LMR and so on, and pay particular attention to move ordering, using a modern history-style approach. But beyond that, evaluation is super critical. In Komodo, a program that I believe has the best evaluation of all programs, I was not able to remove ANY of it without immediately noticing a reduction in strength, even picking the terms I thought might be relatively minor.

There is one thing everyone should be aware of with evaluation: the power of small weights. When creating a new evaluation term for a program, the immediate temptation is to set it to a value that is too high. But the higher you set it, the more disruptive it is likely to be; it could do more damage than good. So when tuning, start with a conservative estimate; if the idea is good, you should see a gain even if the weight is not ideal. You can generally tune evaluation features with very fast games, otherwise it would be an almost impossible task, because you need a LOT of games to get it right. Probably billions of games have been devoted to tuning Komodo's evaluation weights over the years. Few people have the patience to build a strong evaluation function, but that is one of the secrets.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
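A hedged sketch of how the "start with a conservative weight, test with very fast games" advice might be run in practice with cutechess-cli (the tool mentioned later in this thread). The engine paths "./dev" and "./base", the option name "NewTermWeight", the book "openings.epd" and the time control are all made-up placeholders:

import re
import subprocess

def score_weight(weight, games=2000):
    """Play a fast match with the candidate weight and return the score fraction."""
    cmd = [
        "cutechess-cli",
        "-engine", "cmd=./dev", "name=dev", f"option.NewTermWeight={weight}",
        "-engine", "cmd=./base", "name=base",
        "-each", "proto=uci", "tc=6+0.05",               # very fast games
        "-openings", "file=openings.epd", "format=epd", "order=random",
        "-rounds", str(games // 2), "-games", "2", "-repeat",
        "-concurrency", "4",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    # cutechess-cli reports lines like "Score of dev vs base: 510 - 490 - 1000"
    w, l, d = map(int, re.findall(r"Score of dev vs base: (\d+) - (\d+) - (\d+)", out)[-1])
    return (w + 0.5 * d) / (w + l + d)

# Start with small candidate values and only widen the range if the
# conservative weights already show a gain over the baseline.
for weight in (2, 4, 8):
    print(weight, score_weight(weight))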
Richard Allbert
Posts: 795
Joined: Wed Jul 19, 2006 9:58 am

Re: Engine testing: search vs eval

Post by Richard Allbert »

Hi Don,

Can you give some pointers to this?

I've finally decided to start taking a disciplined approach to testing, have written a program to set up testing games using Cutechess.

The problem is knowing where to start.

For example: strip the eval function down and tune just one parameter.

When the next parameter is tuned, it will be tuned against the first value, and so on.

The values you end up with then depend on the order in which they are introduced? That doesn't seem a good way to do things.

If you get a result where version A is 2000 Elo +/- 10 and version B is 2020 +/- 10, do you treat that as equal? It's unclear, as both results fall within the error margins.

Can you give some tips on starting the testing, please :) ?

I've tested a group of opponents at 20s+0.2s for stability, all was OK, and I've used Bob Hyatt's openings.epd as starting positions.

Any help is appreciated. I don't have a huge cluster, unfortunately, just 4 spare cores :)

Regards

Richard
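On the "2000 +/- 10 vs 2020 +/- 10" question above: overlapping (or touching) error bars from gauntlet ratings don't settle it. One common approach is to play the two versions head to head and compute the likelihood of superiority (LOS) from that match; a small self-contained sketch using the usual erf-based formula (draws are ignored):

import math

def los(wins, losses):
    """Probability that the side with more wins is genuinely stronger."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# e.g. 560 wins vs 520 losses head to head: los(560, 520) ~= 0.89,
# suggestive of an improvement but not yet conclusive.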
ZirconiumX
Posts: 1361
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Engine testing: search vs eval

Post by ZirconiumX »

Richard,

The work has been done for you.

http://remi.coulom.free.fr/CLOP/

It can tune 12 values at the same time.

Matthew:out
tu ne cede malis, sed contra audentior ito
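For reference, CLOP drives the engine through a small "connection script" that plays one game per call and reports the result. Below is a rough sketch of such a script, modelled on the DummyScript.py example shipped with CLOP (check the bundled examples for the exact calling convention). The engine paths, option names, opening book and time control are placeholders; it also assumes cutechess-cli for actually playing the game.

#!/usr/bin/env python3
# CLOP calls the script roughly as:
#   script <processor> <seed> <name1> <value1> <name2> <value2> ...
# and expects a single "W", "L" or "D" on stdout for the game that was played.
import re
import subprocess
import sys

def main():
    seed = sys.argv[2]
    params = dict(zip(sys.argv[3::2], sys.argv[4::2]))    # e.g. {"KnightValue": "315"}
    options = [f"option.{name}={value}" for name, value in params.items()]
    cmd = ["cutechess-cli",
           "-engine", "cmd=./dev", "name=dev", *options,
           "-engine", "cmd=./base", "name=base",
           "-each", "proto=uci", "tc=10+0.1",
           "-openings", "file=openings.epd", "format=epd", "order=random",
           "-srand", seed, "-games", "1"]
    out = subprocess.run(cmd, capture_output=True, text=True).stdout
    w, l, d = map(int, re.findall(r"Score of dev vs base: (\d+) - (\d+) - (\d+)", out)[-1])
    print("W" if w else "L" if l else "D")

if __name__ == "__main__":
    main()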
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine testing: search vs eval

Post by Don »

I am actually planning to write a tutorial of sorts on this, one for dummies as they say. It will be simple to understand and very readable, and it will try to address the myths and misconceptions that you hear a lot. It will also talk about the tradeoffs that you must make.
Richard Allbert wrote:Hi Don,

Can you give some pointers to this?

I've finally decided to start taking a disciplined approach to testing, have written a program to set up testing games using Cutechess.

The problem is knowing where to start.

For example: strip the eval function down and tune just one parameter.

When the next parameter is tuned, it will be tuned against the first value, and so on.

The values you end up with then depend on the order in which they are introduced? That doesn't seem a good way to do things.

If you get a result where version A is 2000 Elo +/- 10 and version B is 2020 +/- 10, do you treat that as equal? It's unclear, as both results fall within the error margins.

Can you give some tips on starting the testing, please :) ?

I've tested a group of opponents at 20s+0.2s for stability, all was OK, and I've used Bob Hyatt's openings.epd as starting positions.

Any help is appreciated. I don't have a huge cluster, unfortunately, just 4 spare cores :)

Regards

Richard
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Richard Allbert
Posts: 795
Joined: Wed Jul 19, 2006 9:58 am

Re: Engine testing: search vs eval

Post by Richard Allbert »

Hi Matthew,

Yes, I've seen that and downloaded it, but my question was a different one.

I'm more bothered about what to start with, and what road to take in deciding what to tune, rather than the method of verifying improvements.

Thanks for your reply, though! :)

How's insane FF today?
Richard Allbert
Posts: 795
Joined: Wed Jul 19, 2006 9:58 am

Re: Engine testing: search vs eval

Post by Richard Allbert »

This would be very useful :)
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Engine testing: search vs eval

Post by lucasart »

ZirconiumX wrote:Richard,

The work has been done for you.

http://remi.coulom.free.fr/CLOP/

It can tune 12 values at the same time.

Matthew:out
Have you ever tried it?

I spent quite a lot of time trying to get CLOP to converge, but it never did, and the QLR max never stayed inside the window. All I wanted to do was optimize 5 parameters (the 5 piece values). I tried all sorts of combinations to reduce the dimensionality (like forcing knight = bishop, or even fixing everything except a single value x with N = B = x and optimizing only x), but I never got any sign of convergence, even after 10,000 games. And I was using pretty large windows too.

I wonder if anyone has managed to get it working, at least for piece value tuning (5 dimensions).