Eric Stock wrote:
Bob, Don and Vincent, thank you for the replies.
Here is an update on my testing. I am currently testing MTD Crafty against regular Crafty with only null-move disabled. I modified lazy eval and futility pruning to work in a fail-soft environment, and I increased the value of the "lazy-cutoff-score" in Crafty. After 366 games, MTD has scored 188 and regular Crafty 178. I will play 1000 games. The time control is 1 second per move. I am using the littlemainbook that comes with Arena to randomize the starting positions. Both programs are searching 9-11 plies in the middlegame.
When I test with WAC at 1 second per move, regular Crafty solves 290 and MTD solves 287. Any test position where Crafty's score changes drastically between iterations favours regular Crafty, as MTD must make many re-searches to reach the new score. However, these test-suite results don't seem to translate when testing with games (at least the way I am doing it).
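(For context, the re-search behaviour comes from the MTD(f) driver itself. Below is a minimal sketch of the standard driver, not Crafty's actual code; ZeroWindowSearch, POSITION and INFINITY_SCORE are placeholder names. Every pass through the loop is one zero-window probe, so a large score swing between iterations forces many probes before the bounds meet.)

/* Minimal MTD(f) driver sketch (illustrative only, not Crafty's code).
 * ZeroWindowSearch() stands for a fail-soft search called with the
 * minimal window (beta - 1, beta); it returns a bound on the true score. */
int MTDf(POSITION *pos, int first_guess, int depth) {
  int guess = first_guess;
  int lower = -INFINITY_SCORE;   /* lower bound on the true score */
  int upper = +INFINITY_SCORE;   /* upper bound on the true score */

  while (lower < upper) {
    /* Probe just above the lower bound if the guess sits on it,
       otherwise probe at the guess itself. */
    int beta = (guess == lower) ? guess + 1 : guess;

    guess = ZeroWindowSearch(pos, beta - 1, beta, depth);

    if (guess < beta)
      upper = guess;   /* failed low: true score <= guess  */
    else
      lower = guess;   /* failed high: true score >= guess */
    /* Each pass through this loop is one re-search; a large jump in the
       score between iterations means many passes before lower == upper. */
  }
  return guess;
}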
Vincent, I am very interested in some of the things you mention.
Perhaps my testing needs to be improved? Could you suggest some test suites I can use? I am aware that Win At Chess is far too easy for a program of Crafty's strength. What is the name of the test suite you use that has 213 positions?
Also, do you have any suggestions for my test matches? Should I increase the time controls?
As far as null-move pruning is concerned, my tests indicate that it works poorly with MTD(f) compared to PVS. When playing 1000 games to a fixed depth of 7, MTD Crafty scored 420 vs. 580 for PVS. At 1 second per move it is a little closer, but regular Crafty is still clearly stronger.
I have an idea for making null-move work with MTD, and I am going to test it next. It makes sense to me, but I try things all the time and they usually don't work; still, I have a good feeling about this one, so we'll see.
To summarize,
1. The only problem I see with MTD is that null-move doesn't work properly with it.
2. I am aware of the problem of MTD(f) starting from a bad guess and having to do many re-searches (several hundred). There are a couple of WAC positions where this happens. However, my test matches do not seem to indicate that this is a problem in real games.
3. My techniques for testing need to be improved.
Eric Stock
Hi Eric,
It is very good that you discuss what you do in public; this is very valuable. How you measure is very important.
Of course, what GCP mentions is most important: playing a lot of games.
Limiting the games by depth is, I feel, not a good idea.
A clear way to start is to run a batch of positions, preferably NOT test-set positions. Those positions are all fail-high positions, so not very relevant. The big problem with MTD is precisely the positions where the program is in doubt, doubts everything, and the score drops a little. Fail-low positions therefore matter more than fail-high positions.
I'm not using test-set positions, but a bunch of positions taken from games.
I'll email them to you if you drop me an email or a message with your address. My email address is easy to find: diep at xs4all dot nl.
Just give each engine a few minutes.
Please don't do tests of X seconds per move. Crafty has built-in commands to run a batch of positions fully automatically. You can run this while you're celebrating Christmas.
For a master's thesis I would find it more than acceptable if you run these batches of positions on the full-blown Crafty 23.0, the normal Crafty 23.1 version, and your MTD version.
So turn on everything that benefits Crafty 23.0's playing strength, and then run the same thing on your modified version. A single-core measurement is enough.
Use the same settings everywhere, and use a time limit of, say, 3 minutes or 7 minutes per position. I'd prefer 7 minutes; that makes it exactly a 24-hour run, and the run then also has significance for the future.
Then, for each position, write down in an Excel sheet the biggest depth that was fully FINISHED. Merely *started* is not enough.
With those two 24-hour runs you can then do the math.
You calculate, fully automatically, two different things:
a) average depth difference
b) worst case depth difference
And from that you can also calculate things like error estimates and so on.
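To make the math concrete, here is a minimal sketch of the two calculations, assuming the finished depths from the two runs have been collected into two arrays; this is hypothetical helper code, not anything taken from Crafty.

/* Sketch of the depth-difference statistics (hypothetical helper, not Crafty code).
 * depth_pvs[i] and depth_mtd[i] hold the deepest fully finished iteration
 * for position i in the PVS run and in the MTD run, respectively. */
#include <stdio.h>

void DepthDifferenceStats(const int *depth_pvs, const int *depth_mtd, int n) {
  int worst = 0;          /* biggest depth deficit of MTD versus PVS */
  long total = 0;         /* sum of per-position depth differences  */

  for (int i = 0; i < n; i++) {
    int diff = depth_pvs[i] - depth_mtd[i];   /* positive: MTD finished shallower */
    total += diff;
    if (diff > worst)
      worst = diff;
  }

  printf("average depth difference:    %.2f\n", (double)total / n);
  printf("worst-case depth difference: %d\n", worst);
}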
All this is a first step. Setting up a method for playing games is much tougher and requires far more time; of course it is much better.
However, my assumption is that the (b) worst-case depth difference alone will already show such a huge worst case for MTD versus PVS that testing with games doesn't even make sense anymore. My experience is that it can quickly run up to a 5-ply depth difference for MTD, which in competitive computer chess is complete suicide.
Vincent