A request to engine authors

michiguel · Post by **michiguel** » Mon Apr 28, 2008 12:44 am

bob wrote:
michiguel wrote:
bob wrote:
Tord Romstad wrote:Hi all,

I like to run test games at a fixed number of nodes per move, because it improves reproducibility, and because it enables me to run other CPU-intensive tasks on the computer without influencing the results of the running tests.

Unfortunately, it seems that almost no engine authors bother to support the UCI "go nodes ..." command, which means that I can currently only test against older versions of my own program.

UCI engine authors, could you please implement "go nodes ..." in your next version? It's only a minute of work, after all. Every time you increment your node counter, check if it exceeds the maximum limit, and stop the search if it does.

Tord
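
As an illustration of the check described above, here is a minimal Python sketch of a node-limited search. It is not taken from any engine in this thread: the `Searcher` class, the `NodeLimitReached` exception, and the toy dictionary tree are all hypothetical, and a real engine would run the same test inside its own search after parsing the limit from `go nodes <n>`.

```python
class NodeLimitReached(Exception):
    """Raised to unwind the search once the node budget is spent."""


class Searcher:
    def __init__(self, max_nodes):
        self.max_nodes = max_nodes  # value taken from "go nodes <max_nodes>"
        self.nodes = 0              # nodes visited so far in this search

    def negamax(self, node, depth):
        # The whole feature: test the counter at every node visit.
        if self.nodes >= self.max_nodes:
            raise NodeLimitReached
        self.nodes += 1
        children = node["children"]
        if depth == 0 or not children:
            return node["score"]  # static score from the side to move's view
        return max(-self.negamax(child, depth - 1) for child in children)

    def search(self, root, max_depth):
        """Iterative deepening; returns the score of the last completed depth."""
        best = None
        try:
            for depth in range(1, max_depth + 1):
                best = self.negamax(root, depth)
        except NodeLimitReached:
            pass  # keep the result of the last fully searched depth
        return best


def leaf(score):
    return {"score": score, "children": []}


# Hypothetical toy tree standing in for a real position.
root = {"score": 0, "children": [
    {"score": 1, "children": [leaf(3), leaf(-2)]},
    {"score": -1, "children": [leaf(5), leaf(4)]},
]}

small = Searcher(max_nodes=5)
large = Searcher(max_nodes=10)
print(small.search(root, 2), small.nodes)
print(large.search(root, 2), large.nodes)
```

The same budget always visits the same nodes and returns the same score, while a different budget can cut the search off at a different depth and return a different result.
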
Make a note that this can give highly biased results. A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...

I've played with this idea extensively and discarded it as being unusable...
I am not sure I understand what you mean. Nodes are proportional to time, given a certain NPS, which is not influenced by changes in the tree shape. Playing games at X seconds/move is not terribly different from playing at Y nodes/move.

Miguel
Here's what I am saying. First a question. Suppose you play a game between program A and B, with nodes set to 2,000,000 nodes each. You would expect that if you replay the game, the moves will be identical the second time around, correct? And they would, and I have verified this for several programs. But now you replay the same starting position, but set the nodes to 2,001,000. That is, one thousand more nodes. Would you expect the same moves and game? You won't get it. Also verified with Fruit, Glaurung 1 and 2, Arasan, Crafty, and a couple of others.

What this means is that if you set the nodes to X, and then you change the evaluation even a tiny bit, you will search a _different_ X nodes. And you will get a different game, with possibly a different result, even though the two engines are playing at exactly the same level as before. This is why it is so difficult to get repeatability over a test like the Silver test, because we use a fixed time limit, which is not the same as a fixed number of nodes due to system timer fluctuations...

I have the feeling that there is some miscommunication. You have noise no matter what system you choose. The difference is that using nodes you can come back and reproduce everything and I do not see any big drawback compared to using time. The problems you mentioned are present there too.

Miguel

bob · Post by **bob** » Mon Apr 28, 2008 2:53 am

michiguel wrote:
bob wrote:
Tord Romstad wrote:
bob wrote:Make a note that this can give highly biased results.
Of course there is always a risk of that. It is always necessary to be careful when interpreting results, and to run other kinds of tests in addition.

In a perfect world, I would have had hundreds of computers dedicated to running test matches, and would have been running thousands of slow ponder-on games against a number of different opponents every day.

My world is far from perfect. I have a single dual-core computer, and I am rarely able to dedicate more than 5-10 hours of CPU time per week to running engine vs engine matches. I have no other choice than to rely on various testing techniques which are all flawed in some way, and to be aware of those flaws when interpreting the results. Frequently, I have to rely on intuition rather than statistics. Does it happen that I am wrong? Without a doubt, but as long as I am right more often than wrong, the result in the long run is net progress. I still find it very easy to improve my engine, so clearly my approach is not completely hopeless.

A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...

I've played with this idea extensively and discarded it as being unusable...
Works well enough for me.

Tord
If you care which is better, program A or program B, it will provide a good answer. If you want to know which is better, version A.20 or A.21, this is not a good test... You still need a _lot_ of games.
The same number of games as when playing at time/move. I do not understand this point.

If you play x nodes/move rather than y time/move, what you gain is that if you see a blunder or something weird, you will be able to reproduce the whole game and debug quicker. Not only that, but you can also use the computer while a match is running and nothing will be affected, or play faster games without worrying that the OS is messing things up. As I mentioned before, it would even be possible to run slower debug versions in matches.

Miguel
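
To make the reproducibility point concrete, here is a hedged Python sketch of the command stream a test harness could send to one UCI engine to replay a recorded game at a fixed node budget. The `position startpos moves ...`, `go nodes ...`, `ucinewgame`, and `isready` commands are standard UCI; the helper `replay_script` and its parameters are hypothetical names used only for illustration, and a real driver would also have to wait for `bestmove` after each `go` before sending the next position.

```python
def replay_script(moves, nodes, engine_plays_white=True):
    """Return the UCI commands that replay one side of a recorded game.

    moves: the game as a list of UCI move strings, e.g. ["e2e4", "e7e5"].
    nodes: the fixed node budget used for every move.
    """
    # Start from a clean state once; hash contents then carry over from
    # move to move exactly as they did in the original game.
    commands = ["ucinewgame", "isready"]
    first_ply = 0 if engine_plays_white else 1
    for ply in range(first_ply, len(moves), 2):
        position = "position startpos"
        if ply > 0:
            position += " moves " + " ".join(moves[:ply])
        commands.append(position)
        commands.append(f"go nodes {nodes}")
    return commands


for line in replay_script(["e2e4", "e7e5", "g1f3"], 2000000):
    print(line)
```

Replaying from the first move, rather than jumping straight to the suspicious position, is a deliberate choice in this sketch: an engine's move can depend on hash table entries left over from earlier searches, so a single position searched in isolation may not reproduce the move played in the game.
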

I didn't say it wouldn't work. With the proviso that it is necessary to play the same number of games overall that you would play using time. Hence the "this has been tried and found not usable" as a method of playing fewer games while reducing the variability to zero for a single starting position...

It always has its plusses, particularly as you mentioned dealing with debugging. But most that do this kind of testing are beyond the debugging stage and want to evaluate changes on a "good/no-good" type basis, and this offers nothing useful in that regard because you need even more starting positions to get enough unique games to evaluate changes.

Tord Romstad · Post by **Tord Romstad** » Mon Apr 28, 2008 8:40 am

michiguel wrote: If you play x nodes/move rather than y time/move, what you gain is that if you see a blunder or something weird, you will be able to reproduce the whole game and debug quicker. Not only that, but you can also use the computer while a match is running and nothing will be affected, or play faster games without worrying that the OS is messing things up. As I mentioned before, it would even be possible to run slower debug versions in matches.

These are exactly the reasons why I prefer to use fixed-node test games, too.

bob wrote:I didn't say it wouldn't work. With the proviso that it is necessary to play the same number of games overall that you would play using time. Hence the "this has been tried and found not usable" as a method of playing fewer games while reducing the variability to zero for a single starting position...

Absolutely, no disagreement here. Fixed-node searches don't reduce the number of games you need. However, in my case, they increase the number of games I can play, because I can always have a match running in the background with low priority while I use the computer for other CPU-intensive tasks.

bob wrote:It always has its plusses, particularly as you mentioned dealing with debugging. But most that do this kind of testing are beyond the debugging stage and want to evaluate changes on a "good/no-good" type basis,

Not quite. I, and I think most others, mostly evaluate changes on a "surely this must be an improvement?" basis, which is subtly different. My program is still very young, and basic things are still missing. When I add something new and important (most recently king safety), I run some tests to verify that it works. Typically I am 95% sure the new version is stronger even before I run any tests, so I don't need to test quite as thoroughly as when I have no idea which of two versions is the stronger. A few hundred fast fixed-node games are usually sufficient.
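
For readers who want to put rough numbers on "how many games", the following Python sketch estimates the Elo difference and an approximate 95% interval from a match result. It is a generic textbook calculation, not anything the posters describe using: it assumes the usual logistic Elo model, independent games, and a normal approximation, and the function names and the 120/90/90 example result are made up for illustration.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score, logistic model."""
    score = min(max(score, 1e-6), 1 - 1e-6)  # avoid log of 0 or infinity
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_summary(wins, draws, losses, z=1.96):
    """Return (score, elo, elo_low, elo_high) for a match result."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, which is 1, 0.5, or 0 points.
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(variance / n)  # normal approximation
    return (score,
            elo_from_score(score),
            elo_from_score(score - margin),
            elo_from_score(score + margin))


# Hypothetical result of a 300-game match.
score, elo, low, high = match_summary(wins=120, draws=90, losses=90)
print(f"score {score:.3f}, Elo {elo:+.0f}, interval {low:+.0f} to {high:+.0f}")
```

The interval narrows only with the square root of the number of games, which is one way to see why a large expected gain can be confirmed in a few hundred games while a small one cannot. The independence assumption also matters for fixed-node testing, since repeated games from the same starting position are identical and add no information.
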

Of course, things are very different for a mature and highly polished program like yours, but I think most chess programs are closer to mine than to yours in this respect.

Tord

Uri Blass · Post by **Uri Blass** » Mon Apr 28, 2008 12:12 pm

Tord Romstad wrote:
michiguel wrote: If you play x nodes/move rather than y time/move, what you gain is that if you see a blunder or something weird, you will be able to reproduce the whole game and debug quicker. Not only that, but you can also use the computer while a match is running and nothing will be affected, or play faster games without worrying that the OS is messing things up. As I mentioned before, it would even be possible to run slower debug versions in matches.
These are exactly the reasons why I prefer to use fixed-node test games, too.

bob wrote:I didn't say it wouldn't work. With the proviso that it is necessary to play the same number of games overall that you would play using time. Hence the "this has been tried and found not usable" as a method of playing fewer games while reducing the variability to zero for a single starting position...
Absolutely, no disagreement here. Fixed-node searches don't reduce the number of games you need. However, in my case, they increase the number of games I can play, because I can always have a match running in the background with low priority while I use the computer for other CPU-intensive tasks.

bob wrote:It always has its plusses, particularly as you mentioned dealing with debugging. But most that do this kind of testing are beyond the debugging stage and want to evaluate changes on a "good/no-good" type basis,
Not quite. I, and I think most others, mostly evaluate changes on a "surely this must be an improvement?" basis, which is subtly different. My program is still very young, and basic things are still missing. When I add something new and important (most recently king safety), I run some tests to verify that it works. Typically I am 95% sure the new version is stronger even before I run any tests, so I don't need to test quite as thoroughly as when I have no idea which of two versions is the stronger. A few hundred fast fixed-node games are usually sufficient.

Of course, things are very different for a mature and highly polished program like yours, but I think most chess programs are closer to mine than to yours in this respect.

Tord

I wonder how you get 95% confidence that the new version is stronger even before running tests.

There are several possible reasons why adding king safety code can be counterproductive.

Reason 1: You have a bug in your code.
Reason 2: Your weights are too high and the program overvalues king safety.
Reason 3: Your code is too slow.

Uri
