A request to engine authors

Discussion of chess software programming and technical issues.

Moderator: Ras

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A request to engine authors

Post by michiguel »

bob wrote:
Tord Romstad wrote:Hi all,

I like to run test games at a fixed number of nodes per move, because it improves reproducibility, and because it enables me to run other CPU-intensive tasks on the computer without influencing the results of the running tests.

Unfortunately, it seems that almost no engine authors bother to support the UCI "go nodes ..." command, which means that I can currently only test against older versions of my own program. :-(

UCI engine authors, could you please implement "go nodes ..." in your next version? It's only a minute of work, after all. Every time you increment your node counter, check if it exceeds the maximum limit, and stop the search if it does.

Tord
Make a note that this can give highly biased results. A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...

I've played with this idea extensively and discarded it as being unusable...
I am not sure I understand what you mean. Nodes are proportional to time, given a certain NPS, which is not influenced by changes in the tree shape. Playing games at X seconds/move is not terribly different from playing at Y nodes/move.

Miguel
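The one-line check Tord asks for can be sketched like this: a toy negamax on a hypothetical nested-list game tree (leaves are static scores), not real engine code, that aborts the search the moment the node budget is exhausted.

```python
# Sketch of a node-limited search, as Tord describes: increment the node
# counter at every node, and abort once it exceeds the "go nodes" budget.
# The "tree" here is a hypothetical nested list; leaves are int scores.

class SearchAborted(Exception):
    """Raised when the node budget is exhausted."""

class NodeLimitedSearch:
    def __init__(self, node_limit):
        self.node_limit = node_limit
        self.nodes = 0

    def negamax(self, tree):
        self.nodes += 1
        if self.nodes > self.node_limit:   # the one-line check
            raise SearchAborted
        if isinstance(tree, int):          # leaf: static evaluation
            return tree
        return max(-self.negamax(child) for child in tree)

def search(tree, node_limit):
    """Return (score, nodes searched); score is None if the budget ran out."""
    s = NodeLimitedSearch(node_limit)
    try:
        score = s.negamax(tree)
    except SearchAborted:
        score = None   # a real engine would return the best move found so far
    return score, s.nodes
```

A real engine would of course unwind gracefully and report the best move from the last completed iteration rather than discard the search.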
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A request to engine authors

Post by hgm »

This is because Bob expects something different from it than we do, namely reducing the sensitivity of gauntlet results to small changes in the program.

We only want to use it to make gauntlet results insensitive to the workload of the computer.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: A request to engine authors

Post by Edsel Apostol »

Hi Tord,

Twisted Logic supports the UCI "go nodes" command, though it is only accurate to within 10% of the NPS, since I only check for node overflow every 100 milliseconds. There is no Linux or Mac version, though, so I think you couldn't use it. It also supports "searchmoves". I just want to inform anyone who might read this, in case they didn't know yet.
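Checking the limit only periodically, as Edsel describes, bounds the overshoot by the polling interval. A minimal sketch (the interval of 4096 nodes below is an assumption for illustration; Edsel's engine polls on a 100 ms timer instead):

```python
# Sketch of the cheap "poll every K nodes" pattern: instead of comparing
# against the node limit at every node, the search loop checks only every
# CHECK_INTERVAL nodes, so it can overshoot by at most CHECK_INTERVAL - 1.

CHECK_INTERVAL = 4096  # hypothetical polling interval

def run_search(node_limit):
    """Simulate a search loop that expands one node per iteration and
    polls the node limit only at multiples of CHECK_INTERVAL."""
    nodes = 0
    while True:
        nodes += 1                                  # expand one node
        if nodes % CHECK_INTERVAL == 0 and nodes >= node_limit:
            return nodes                            # stop at this poll
```

For example, with a limit of 10,000 nodes the loop stops at 12,288 (the first multiple of 4096 at or past the limit), so the error is bounded even though the exact limit is never hit.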
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: A request to engine authors

Post by Tord Romstad »

Bill Rogers wrote:Being as not everyone counts nodes in exactly the same way
That's completely irrelevant. The point is not to test whether engine X is better than engine Y when computing the same number of nodes, but to test which of two similar versions of my engine (which are assumed to be equally fast) is stronger against a given set of opponents. It doesn't matter if other engines count nodes in different ways, as long as I keep counting nodes in the same way.
why don't you just play them all to a fixed ply depth.
Because that would give me the choice between setting the depth so low that searches in the endgame would be ridiculously shallow and inaccurate, or setting the depth so high that searches in the middle game would take far too long.
It seems to be basically the same thing and everyone counts plys in the same way.
This is once again irrelevant, for the reasons described above, but it's not even close to true that everyone counts plies in the same way. In fact, I am quite sure the variability here is far bigger than in the ways nodes are counted.

Tord
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: A request to engine authors

Post by Tord Romstad »

bob wrote:Make a note that this can give highly biased results.
Of course there is always a risk of that. It is always necessary to be careful when interpreting results, and to run other kinds of tests in addition.

In a perfect world, I would have hundreds of computers dedicated to running test matches, and would be running thousands of slow ponder-on games against a number of different opponents every day.

My world is far from perfect. I have a single dual-core computer, and I am rarely able to dedicate more than 5-10 hours of CPU time per week to running engine vs engine matches. I have no choice but to use various testing techniques which are all flawed in some way, and to be aware of those flaws when interpreting the results. Frequently, I have to rely on intuition rather than statistics. Does it happen that I am wrong? Without a doubt, but as long as I am right more often than wrong, the result in the long run is net progress. I still find it very easy to improve my engine, so clearly my approach is not completely hopeless.

A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...
I've played with this idea extensively and discarded it as being unusable...
Works well enough for me. :)

Tord
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: A request to engine authors

Post by Zach Wegner »

bob wrote:Make a note that this can give highly biased results. A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...

I've played with this idea extensively and discarded it as being unusable...
How about this idea: from the GUI, you tell the engine each move how many nodes to search. But instead of an exact number, you send a random number that is +/- X% of the desired number. This way, the node counting technique still simulates time in that playing many games will get a good sampling over the "strength/search time" distribution, i.e. reduce bias. And of course, you still get an answer that is independent of the CPU time used.

Now, keep in mind that some engines (like mine) only check to stop every N nodes (or every time they reach the root node). But they also only check the time every N nodes, so as long as N is small enough that there is a good amount of variance within the range of node counts, it should be fine.

This idea does take away the "reproducibility" aspect that Tord wanted, but I think that reproducibility is the root of the problem Bob described, so all the better.

P.S. I don't know anything about statistics.
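Zach's jitter idea can be sketched in a few lines: the GUI draws each move's node budget uniformly from +/- X% around the nominal limit. Seeding the RNG keeps a whole game replayable even though individual budgets vary (the 10% width and the seed below are illustrative assumptions):

```python
import random

# Sketch of a jittered node budget: each move gets a limit drawn
# uniformly from [nominal*(1-jitter), nominal*(1+jitter)]. A fixed
# RNG seed makes the sequence of budgets, and hence the game,
# reproducible while still sampling over "search effort".

def jittered_limit(rng, nominal, jitter=0.10):
    lo = int(nominal * (1 - jitter))
    hi = int(nominal * (1 + jitter))
    return rng.randint(lo, hi)

rng = random.Random(42)                          # fixed seed => replayable
budgets = [jittered_limit(rng, 2_000_000) for _ in range(3)]
```

Reproducibility is recovered simply by reusing the seed, so this arguably keeps the property Tord wanted while adding the variance Zach wants.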
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A request to engine authors

Post by bob »

michiguel wrote:
bob wrote:
Tord Romstad wrote:Hi all,

I like to run test games at a fixed number of nodes per move, because it improves reproducibility, and because it enables me to run other CPU-intensive tasks on the computer without influencing the results of the running tests.

Unfortunately, it seems that almost no engine authors bother to support the UCI "go nodes ..." command, which means that I can currently only test against older versions of my own program. :-(

UCI engine authors, could you please implement "go nodes ..." in your next version? It's only a minute of work, after all. Every time you increment your node counter, check if it exceeds the maximum limit, and stop the search if it does.

Tord
Make a note that this can give highly biased results. A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...

I've played with this idea extensively and discarded it as being unusable...
I am not sure I understand what you mean. Nodes are proportional to time, given a certain NPS, which is not influenced by changes in the tree shape. Playing games at X seconds/move is not terribly different from playing at Y nodes/move.

Miguel
Here's what I am saying. First, a question. Suppose you play a game between programs A and B, with the node limit set to 2,000,000 for each. You would expect that if you replay the game, the moves will be identical the second time around, correct? And they will; I have verified this for several programs. Now replay from the same starting position, but set the limit to 2,001,000 nodes, that is, one thousand more. Would you expect the same moves and game? You won't get them. Also verified with Fruit, Glaurung 1 and 2, Arasan, Crafty, and a couple of others.

What this means is that if you set the nodes to X, and then you change the evaluation even a tiny bit, you will search a _different_ X nodes. And you will get a different game, with possibly a different result, even though the two engines are playing at exactly the same level as before. This is why it is so difficult to get repeatability over a test like the Silver test, because we use a fixed time limit, which is not the same as a fixed number of nodes due to system timer fluctuations...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A request to engine authors

Post by bob »

Tord Romstad wrote:
bob wrote:Make a note that this can give highly biased results.
Of course there is always a risk of that. It is always necessary to be careful when interpreting results, and to run other kinds of tests in addition.

In a perfect world, I would have hundreds of computers dedicated to running test matches, and would be running thousands of slow ponder-on games against a number of different opponents every day.

My world is far from perfect. I have a single dual-core computer, and I am rarely able to dedicate more than 5-10 hours of CPU time per week to running engine vs engine matches. I have no choice but to use various testing techniques which are all flawed in some way, and to be aware of those flaws when interpreting the results. Frequently, I have to rely on intuition rather than statistics. Does it happen that I am wrong? Without a doubt, but as long as I am right more often than wrong, the result in the long run is net progress. I still find it very easy to improve my engine, so clearly my approach is not completely hopeless.

A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...
I've played with this idea extensively and discarded it as being unusable...
Works well enough for me. :)

Tord
If you care which is better, program A or program B, it will provide a good answer. If you want to know which is better, version A.20 or A.21, this is not a good test... You still need a _lot_ of games.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A request to engine authors

Post by bob »

Zach Wegner wrote:
bob wrote:Make a note that this can give highly biased results. A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...

I've played with this idea extensively and discarded it as being unusable...
How about this idea: from the GUI, you tell the engine each move how many nodes to search. But instead of an exact number, you send a random number that is +/- X% of the desired number. This way, the node counting technique still simulates time in that playing many games will get a good sampling over the "strength/search time" distribution, i.e. reduce bias. And of course, you still get an answer that is independent of the CPU time used.

Now, keep in mind that some engines (like mine) only check to stop every N nodes (or every time they reach the root node). But they also only check the time every N nodes, so as long as N is small enough that there is a good amount of variance within the range of node counts, it should be fine.

This idea does take away the "reproducibility" aspect that Tord wanted, but I think that reproducibility is the root of the problem Bob described, so all the better.
P.S. I don't know anything about statistics.
Nothing wrong with that, and you could tune it for each program. Run 10 different positions for a fixed time and repeat several times. Measure the variance in the nodes searched, and use that to seed the distribution range for the random alteration...
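Bob's calibration idea above can be sketched directly: run the same positions repeatedly at a fixed time, record the node counts, and size the jitter range from the observed relative spread. The sample counts below are made up for illustration, and the "cover roughly two standard deviations" rule is an assumption, not something from the thread:

```python
from statistics import mean, stdev

# Sketch: derive the +/- jitter fraction for the node-limit randomization
# from node counts observed over repeated fixed-time runs of the same
# positions. The sample data is hypothetical.

def jitter_from_samples(node_counts):
    """Return a +/- fraction covering ~2 standard deviations of the
    observed node counts, relative to their mean."""
    m = mean(node_counts)
    s = stdev(node_counts)
    return 2 * s / m

# hypothetical node counts from five fixed-time runs of one position
samples = [1_980_000, 2_050_000, 1_930_000, 2_010_000, 2_070_000]
x = jitter_from_samples(samples)   # e.g. jitter the nominal limit by +/- x
```

For these sample counts the derived jitter is around 5-6%, i.e. the randomized limit would be drawn from roughly +/- 6% of the nominal node count.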
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A request to engine authors

Post by michiguel »

bob wrote:
Tord Romstad wrote:
bob wrote:Make a note that this can give highly biased results.
Of course there is always a risk of that. It is always necessary to be careful when interpreting results, and to run other kinds of tests in addition.

In a perfect world, I would have hundreds of computers dedicated to running test matches, and would be running thousands of slow ponder-on games against a number of different opponents every day.

My world is far from perfect. I have a single dual-core computer, and I am rarely able to dedicate more than 5-10 hours of CPU time per week to running engine vs engine matches. I have no choice but to use various testing techniques which are all flawed in some way, and to be aware of those flaws when interpreting the results. Frequently, I have to rely on intuition rather than statistics. Does it happen that I am wrong? Without a doubt, but as long as I am right more often than wrong, the result in the long run is net progress. I still find it very easy to improve my engine, so clearly my approach is not completely hopeless.

A simple change to your eval changes the shape of the tree ever-so-slightly, but more than enough to change the game outcome even if the change is no better or worse than without...
I've played with this idea extensively and discarded it as being unusable...
Works well enough for me. :)

Tord
If you care which is better, program A or program B, it will provide a good answer. If you want to know which is better, version A.20 or A.21, this is not a good test... You still need a _lot_ of games.
The same number of games as when playing at time/move. I do not understand this point.

If you play x nodes/move rather than y time/move, what you gain is that if you see a blunder or something weird, you will be able to reproduce the whole game and debug more quickly. On top of that, you can use the computer while a match is running and nothing will be affected, or play faster games without worrying that the OS is messing things up. As I mentioned before, it would even be possible to run slower debug versions in matches.

Miguel