fixed nodes testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

fixed nodes testing

Post by Don »

Larry and I are experimenting with doing some of our testing at fixed nodes. It seems that time testing is not giving us very consistent results. Fixed depth testing is very consistent but has its own problems and issues, and is not how programs naturally play.

I fixed Komodo up to support fixed node levels and now it does this perfectly, searching EXACTLY the number of nodes that the user sets and stopping.

One problem is that the strong programs that we need for testing have lousy support for this. Some don't have this support at all, and others such as Stockfish may exceed the node count by an order of magnitude or more, or else abort much earlier. Because of this we seem to be reduced to self-testing, even though we prefer testing against a variety of opponents.

The UCI standard says, "search x nodes only", so I would like to suggest that the authors implement this level correctly. The next release of Komodo will have this working correctly, of course. Komodo does require at least a 1-ply search to complete before it will abort, primarily to ensure that an actual move is returned and the result is not totally random.
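
To be concrete, the exchange I have in mind is just the standard UCI one. With the budget below (an arbitrary example, as is the reply move) a conforming engine should search exactly 100000 nodes and then answer:

    position startpos
    go nodes 100000
    ... engine searches exactly 100000 nodes ...
    bestmove e2e4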

Does anyone have any experience with this kind of testing? For the most part I think I am aware of the pros and cons but I would like to hear from others on this.
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: fixed nodes testing

Post by Zach Wegner »

It's pretty easy to modify Stockfish and the IPP family to do this, which I did a while back...

I typically tested with a fixed node count that was randomized per move. I can't say I'd recommend it, but I don't know of anything much better :)
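
The randomization itself was trivial, something along these lines (a sketch; the 25% jitter is just illustrative, not necessarily what I used):

    #include <random>

    // Pick this move's node budget: the base count +/- 25% jitter, so that
    // games vary even when they start from identical positions.
    long long node_budget(long long base_nodes)
    {
        static std::mt19937_64 rng{std::random_device{}()};
        long long jitter = base_nodes / 4;
        std::uniform_int_distribution<long long> dist(base_nodes - jitter,
                                                      base_nodes + jitter);
        return dist(rng);
    }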
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: fixed nodes testing

Post by BubbaTough »

I do exact fixed node testing for automatic self-tuning via self-play, but we don't support it in the released version just because of the minor performance hit.

If you want a private version that supports that I would be happy to send it to you, but I suspect Hannibal is not nearly strong enough for you to test against. As Zach said, modifying Stockfish is probably 5 minutes' work, and would give you a good play partner. If you continue to improve at your current rate, you are going to risk running out of playing partners anyway, so you may be reduced to self-testing no matter what you do.
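
The kind of change I mean is only a few lines in the search, something like this sketch (the identifiers are illustrative, not Stockfish's actual names):

    // In the node-entry code of the search:
    if (limits.nodes && nodes_searched >= limits.nodes)
        stop_search = true;   // unwind and play the best move found so far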

-Sam
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: fixed nodes testing

Post by marcelk »

Don wrote:Larry and I are experimenting with doing some of our testing at fixed nodes. It seems that time testing is not giving us very consistent results.
That will remove one source of variation, but not one that influences the accuracy of the measurement, unless you can somehow determine the NPS variation between positions; in that case the node count would become a better alternative for measuring time than using a clock. I think that is quite hypothetical; just use a better clock in the first place, then. Once you assume an average NPS, the node counts will still be an imprecise estimate of the actual run time, and therefore won't help improve the end result, only the repeatability of the experiment (which can then trick you into believing that you have a more precise measurement than you actually have, because it is so non-intuitive...). It is the same as doing a long Monte Carlo simulation run with one fixed random number sequence, which doesn't provide a better or worse result compared to one with a truly random seed.

This is the reason I stopped worrying about doing fixed node tests. I want to see the inconsistencies: that way they constantly remind me that the measurements are inherently noisy and that elo results should be taken with a much bigger grain of salt than the confidence ranges suggest.

It does provide repeatability of course if that is what you're after.
Last edited by marcelk on Wed Jul 20, 2011 9:35 pm, edited 1 time in total.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: fixed nodes testing

Post by Don »

BubbaTough wrote:I do exact fixed node testing for automatic self-tuning via self-play, but we don't support it in the released version just because of the minor performance hit.

If you want a private version that supports that I would be happy to send it to you, but I suspect Hannibal is not nearly strong enough for you to test against. As Zach said, modifying Stockfish is probably 5 minutes' work, and would give you a good play partner. If you continue to improve at your current rate, you are going to risk running out of playing partners anyway, so you may be reduced to self-testing no matter what you do.

-Sam
If we are more than about 70 ELO weaker, we will generally handicap the other program. It's much more efficient to play a much stronger program and handicap it so that the testing goes faster, but at the moment there is no program that we have to handicap. But you never know when that will change.

So it may happen that we will have to start handicapping Komodo at some point, but even if Komodo improves faster than the other top programs, it will take some time, if ever, before we can put that much distance between ourselves and the top programs we test against. We are not the only ones improving our program.

I do not take a performance hit in Komodo for this. I poll for UCI input and check time usage against the goal every few hundred nodes, but Komodo increments the node counter every time a move is made, so I check in the move-make routine after every single node whether we have reached our node goal. If we have, I set the abort flag, which stops the search. Technically the search may consume a few more nodes before unwinding, but no information is updated once the abort flag is set: not the node count, PV, score or anything else of relevance. I tested this and there is no measurable speed hit.
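
Schematically it looks like this (a simplified sketch; the names are made up, not Komodo's actual identifiers):

    long long nodes = 0;            // bumped once per make_move, i.e. per node
    long long node_limit = 0;       // 0 means no "go nodes" limit is active
    int  iterations_done = 0;       // must be >= 1 so a real move is returned
    bool abort_search = false;

    void make_move(/* position, move */)
    {
        ++nodes;
        if (node_limit && nodes >= node_limit && iterations_done >= 1)
            abort_search = true;    // search unwinds from here; node count,
                                    // PV and score are no longer updated
        // ... the normal make-move work follows ...
    }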
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: fixed nodes testing

Post by BubbaTough »

marcelk wrote:
Don wrote:Larry and I are experimenting with doing some of our testing at fixed nodes. It seems that time testing is not giving us very consistent results.
That will remove one source of variation, but not one that influences the accuracy of the measurement, unless you can somehow determine the NPS variation between positions; in that case the node count would become a better alternative for measuring time than using a clock. I think that is quite hypothetical; just use a better clock in the first place, then. Once you assume an average NPS, the node counts will still be an imprecise estimate of the actual run time, and therefore won't help improve the end result, only the repeatability of the experiment (which can then trick you into believing that you have a more precise measurement than you actually have, because it is so non-intuitive...). It is the same as doing a long Monte Carlo simulation run with one fixed random number sequence, which doesn't provide a better or worse result compared to one with a truly random seed.

This is the reason I stopped worrying about doing fixed node tests. I want to see the inconsistencies: that way they constantly remind me that the measurements are inherently noisy and that elo results should be taken with a much bigger grain of salt than the confidence ranges suggest.

It does provide repeatability of course if that is what you're after.
I like fixed node testing because it makes it easy to do testing on lots of different computers and combine results.

-Sam
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: fixed nodes testing

Post by Don »

marcelk wrote:
Don wrote:Larry and I are experimenting with doing some of our testing at fixed nodes. It seems that time testing is not giving us very consistent results.
That will remove one source of variation, but not one that influences the accuracy of the measurement, unless you can somehow determine the NPS variation between positions; in that case the node count would become a better alternative for measuring time than using a clock. I think that is quite hypothetical; just use a better clock in the first place, then. Once you assume an average NPS, the node counts will still be an imprecise estimate of the actual run time, and therefore won't help improve the end result, only the repeatability of the experiment (which can then trick you into believing that you have a more precise measurement than you actually have, because it is so non-intuitive...). It is the same as doing a long Monte Carlo simulation run with one fixed random number sequence, which doesn't provide a better or worse result compared to one with a truly random seed.

This is the reason I stopped worrying about doing fixed node tests. I want to see the inconsistencies: that way they constantly remind me that the measurements are inherently noisy and that elo results should be taken with a much bigger grain of salt than the confidence ranges suggest.

It does provide repeatability of course if that is what you're after.
We do not confine our testing to a single method or time control. We use fixed depth as well as Fischer time controls, and now we might also use fixed nodes, after we gain some experience with it and determine that it is another good tool in our tool bag.

A big plus is that our program is deterministic when using this kind of testing. Our tester also takes detailed time measurements, which can be off by +/- 1 percent or more depending on what day of the week you compile the program :-) When trying to measure a 2 or 3 ELO improvement this is a lot of noise. Fixed node testing does not solve that problem, but at least we have data on the timings. For example, if we measure 3 ELO but the average time per move increases by 4 or 5 percent, we know how to make an ELO adjustment.
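
To give a sense of scale: with the usual rule of thumb of very roughly 70 ELO per doubling of time, a 5 percent increase in time used costs on the order of 70 * log2(1.05) ≈ 5 ELO, which would more than wipe out a measured 3 ELO gain.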

Different search algorithms and even evaluation affect the nodes per second in minor ways so we don't pretend that this is a workaround for that - we know it's still an issue.

What I suspect is happening is that our time control algorithm is producing a lot of noise. We can run 100,000 games at two different but very similar time controls and get results on opposite sides of the error margin. So this is also an experiment to see if we can get more consistency in that regard. I don't know of any obvious reason why this might be so; it's just a hunch.
marcelk
Posts: 348
Joined: Sat Feb 27, 2010 12:21 am

Re: fixed nodes testing

Post by marcelk »

Don wrote:Our tester also takes detailed time measurements, which can be off by +/- 1 percent or more depending on what day of the week you compile the program :-)
I also have such variation from run to run with the same binary. When I do a speed test, I always run it 10 times and take the fastest result. Not because it is representative, but because the minimum reproduces better than the average. But it is still noisy and the next day it can be better without any obvious reason.
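
The procedure is trivial; a sketch of it (speed_test() stands for whatever fixed benchmark your engine exposes):

    #include <algorithm>
    #include <chrono>

    void speed_test();   // hypothetical: your engine's benchmark run

    // Run the benchmark n times and keep the fastest run: the minimum
    // reproduces much better than the average.
    double best_of(int n)
    {
        double best = 1e30;
        for (int i = 0; i < n; ++i) {
            auto t0 = std::chrono::steady_clock::now();
            speed_test();
            auto t1 = std::chrono::steady_clock::now();
            best = std::min(best,
                std::chrono::duration<double>(t1 - t0).count());
        }
        return best;
    }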

It doesn't help that modern OSes randomize their memory allocation either.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: fixed nodes testing

Post by Laskos »

Don wrote:
We do not confine our testing to a single method or time control
I think the first rule of thumb is self-testing with identical time use. Then all others.

Kai
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: fixed nodes testing

Post by hgm »

Some UCI engines report in info strings when they start a new iteration. In the PV just before that (the final one of the last iteration), they will have reported a node count.

You could implement standard time management in the interface based on these node counts rather than on time. I.e., send the engine a stop command if its node count for a completed iteration exceeds a certain limit, and use the first PV move of the previous (completed) iteration as the bestmove, so that effectively it is as if the engine had refrained from starting a new iteration. You would then subtract the nodes actually used from the node budget for the total game.

That way you could even use engines that do not implement go nodes at all. An additional advantage would be that you use the node budget for the game much more efficiently, because you would never have to interrupt iterations. The UCI go nodes is a crappy mode, equivalent to a fixed maximum time per move, which wastes lots of time by interrupting iterations at inopportune moments.
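
In sketch form (the names are invented for illustration), the decision the interface takes each time the engine reports a finished iteration:

    // 'iter_nodes' is the node count the engine reported for the iteration
    // it just completed, 'budget' the remaining node budget for the game,
    // and 'moves_to_go' the assumed number of moves left, exactly as in
    // ordinary time management.
    bool should_stop(long long iter_nodes, long long budget, int moves_to_go)
    {
        return iter_nodes >= budget / moves_to_go;
    }

On a stop you send "stop", play the first PV move of the completed iteration, and deduct the nodes actually consumed from the game budget.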