Hardware oblivious testing

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw, Ras

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Hardware oblivious testing

Post by Don »

A major problem in testing chess programs is that they perform differently under different conditions. Larry and I knew this, but we were surprised at the size of the difference. On one of his machines, for example, we learned that Komodo does not perform as well as the programs we test against, as measured by a few percent lower nodes per second. In other words, on that machine we drop more nodes per second than the foreign programs we test against.

But it's extremely useful to be able to combine our results, and in fact I went to some trouble to construct a distributed automated tester. It's a beautiful thing that allows us to test on Linux and Windows by distributing a single binary "client program" to our testers. We configure the tests, and the clients run them as directed by the server.

Unfortunately, the results depend more on WHO happens to be running tests at the time. It's no good for measuring small changes.

We don't care how we stand relative to other programs when we are simply measuring our own progress; we just need stable and consistent testing conditions. This is an important concept to understand in order to appreciate what follows.

Now the most obvious idea is to run fixed depth or fixed node testing. These have their place, but they both have serious problems. Any change that speeds up or slows down the program (and most changes have some impact in this regard) cannot easily be reconciled. Also, fixed nodes plays horrible chess: for the same investment in time the quality of the games is reduced enormously, probably because when you start a search it makes sense to try to finish the iteration, and fixed nodes does not do that. Also, many foreign programs do not honor a fixed node limit, or else do not implement it correctly. But even so, as I said, the results cannot be reconciled. Add an evaluation feature that takes significant time to compute and it will automatically look good in fixed node testing, because we don't have to accept the nodes per second hit.

Cut to the chase

So what is to be done? Here is a solution. Most programs report the total nodes spent on the search. We need to implement a test that is based on nodes searched but handled like any normal time control. Additionally, we would like to not have to modify each program to use this system - so we need to trick each program into doing this even though it does not have that capability. You can do this using the following trick:

1. Pick some reference hardware and get good measurement on the nodes per second for each program being tested.

2. Use what is learned in step 1 to produce an adjustment factor.

The tester basically ignores the time clock and makes decisions based on the nodes reported by the program. For obvious reasons, pondering must be turned off. Let's say we have 2 programs that play the same strength, but one does 1 million nodes per second and the other does 2 million nodes per second. Let's say the tester notices that each program has 1 (pseudo) second left on its clock in a sudden death game. To the fast program it reports that it has 1/2 second left, and to the slow program it reports that it has 1 second left. What you should get is consistent play that is independent of hardware. When a program reports a move, the tester converts the nodes it reports to time and debits its clock based on that.
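The debit step described above could be sketched roughly as follows in Python. This is only an illustration, not the actual tester: the class name `NodeClock`, its fields, and the exchange rate of 1,000,000 nodes per pseudo-second are assumptions made for the example, and how that rate would be derived from the step 1 measurements is left open.

```python
from dataclasses import dataclass


@dataclass
class NodeClock:
    """Clock for one engine, kept in pseudo-seconds instead of wall time."""
    remaining: float                 # pseudo-seconds left on the clock
    nodes_per_pseudo_second: float   # assumed exchange rate (hypothetical)

    def debit(self, nodes_reported: int) -> float:
        """Convert the nodes reported with a move to pseudo-time and charge it."""
        spent = nodes_reported / self.nodes_per_pseudo_second
        self.remaining -= spent
        return spent

    def flag_fallen(self) -> bool:
        """True once the engine has used more node-time than it had left."""
        return self.remaining < 0


# One pseudo-second left; the engine reports 250,000 nodes with its move.
clock = NodeClock(remaining=1.0, nodes_per_pseudo_second=1_000_000)
spent = clock.debit(250_000)
```

Because the charge depends only on the reported node count, the same game played on a faster or slower machine would be debited identically, which is the point of the proposal.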

Unfortunately, there are still a couple of problems with this idea. The nodes per second for any given program is not consistent from move to move, but I wonder how much difference that will make in practice. The goal is not to nail the relative differences in foreign programs but to provide a consistent test. Still, time and nodes are not the same and I would expect to get some gnarly side-effects, perhaps time losses and other things.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
hgm
Posts: 28268
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hardware oblivious testing

Post by hgm »

Don wrote:So what is to be done? Here is a solution. Most programs report the total nodes spent on the search. We need to implement a test that is based on nodes searched but handled like any normal time control.

Additionally, we would like to not have to modify each program to use this system - so we need to trick each program into doing this even though it does not have that capability. You can do this using the following trick:

1. Pick some reference hardware and get good measurement on the nodes per second for each program being tested.

2. Use what is learned in step 1 to produce an adjustment factor.

The tester basically ignores the time clock and makes decisions based on the nodes reported by the program.
Sounds like you are reinventing XBoard nps mode! :lol:
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Hardware oblivious testing

Post by Don »

hgm wrote:
Don wrote:So what is to be done? Here is a solution. Most programs report the total nodes spent on the search. We need to implement a test that is based on nodes searched but handled like any normal time control.

Additionally, we would like to not have to modify each program to use this system - so we need to trick each program into doing this even though it does not have that capability. You can do this using the following trick:

1. Pick some reference hardware and get good measurement on the nodes per second for each program being tested.

2. Use what is learned in step 1 to produce an adjustment factor.

The tester basically ignores the time clock and makes decisions based on the nodes reported by the program.
Sounds like you are reinventing XBoard nps mode! :lol:
How does that work?
brianr
Posts: 540
Joined: Thu Mar 09, 2006 3:01 pm
Full name: Brian Richardson

Re: Hardware oblivious testing

Post by brianr »

I suggest taking a step back to look at why your engine seems to slow down more than the others before anything else. I am somewhat surprised it is at all measurable in elo terms (including -/+ uncertainty).

However, if it is a reasonably constant speed difference, just add that to your time control adjustments rather than tinkering with nodes, depth, etc.

I do adjust the time controls based on the speed of my systems, but nothing else. I typically only use dual quad cores. Obviously, this limits my test accuracy to more than just a few elo with overnight runs of about 10K games.

In any case, your automated distributed tester is a nice idea as I have another quad and two more dual core systems available.

Any hints for Windows-based systems?

Thanks.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Hardware oblivious testing

Post by Don »

brianr wrote:I suggest taking a step back to look at why your engine seems to slow down more than the others before anything else. I am somewhat surprised it is at all measurable in elo terms (including -/+ uncertainty).
Remember, we are not talking about some massive slowdown, just a minor one. Even 1 percent is pretty odd and is clearly measurable in ELO. We are not talking about 10% or anything near that.

hgm
Posts: 28268
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hardware oblivious testing

Post by hgm »

Don wrote:How does that work?
XBoard sends an nps command to the engine, like

nps 2000

meaning that the engine should run at a virtual speed of 2000 nps. That means that the engine routine that reads the clock must not return the wall-clock time, but instead nodeCount/2000 (in seconds).

XBoard will then decrease the engine's clock by the number of nodes the engine reports in its thinking output.

Any time control can be used with this (classical, incremental, fixed time per move), but all involved times are translated to nodes at the specified 'exchange rate'. The feature is enabled by the command-line options -firstNPS N, -secondNPS N.

The engine must support this mode, which it announces by sending feature nps=1 at startup. Polyglot supports this, but I think that in TC modes other than fixed time per move it must manage the time for the engine by itself, and use the UCI go-nodes to run the engine, as it seems UCI engines do not combine movestogo with nodes. I am not sure whether Polyglot attempts to send a stop command when an iteration finishes. (Probably not.)
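On the engine side, the clock substitution described above might look something like this Python sketch. The names `SearchClock`, `elapsed_seconds` and `node_budget` are made up for illustration; only the rule itself (report nodeCount/nps instead of wall-clock time, and translate times to nodes at that rate) comes from the description.

```python
import time


class SearchClock:
    """Elapsed-time source for an engine's time management (hypothetical)."""

    def __init__(self, nps=0):
        self.nps = nps                  # 0 means ordinary wall-clock mode
        self.node_count = 0             # the search would increment this
        self._start = time.monotonic()

    def elapsed_seconds(self):
        if self.nps > 0:
            # nps mode: time is defined by nodes searched, not by the wall clock
            return self.node_count / self.nps
        return time.monotonic() - self._start


def node_budget(seconds, nps):
    """Translate a time allowance into nodes at the given exchange rate."""
    return int(seconds * nps)


clock = SearchClock(nps=2000)       # as after receiving "nps 2000"
clock.node_count = 5000             # pretend the search has visited 5000 nodes
virtual = clock.elapsed_seconds()   # virtual seconds used so far
budget = node_budget(60, 2000)      # nodes available in a 60-second allowance
```

With everything else in the time management left unchanged, the engine would then behave the same on any hardware, since its notion of elapsed time no longer depends on how fast the machine is.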
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Hardware oblivious testing

Post by Don »

hgm wrote:
Don wrote:How does that work?
XBoard sends an nps command to the engine, like

nps 2000

meaning that the engine should run at a virtual speed of 2000 nps. That means that the engine routine that reads the clock must not return the wall-clock time, but instead nodeCount/2000 (in seconds).

XBoard will then decrease the engine's clock by the number of nodes the engine reports in its thinking output.

Any time control can be used with this (classical, incremental, fixed time per move), but all involved times are translated to nodes at the specified 'exchange rate'. The feature is enabled by the command-line options -firstNPS N, -secondNPS N.

The engine must support this mode, which it announces by sending feature nps=1 at startup. Polyglot supports this, but I think that in TC modes other than fixed time per move it must manage the time for the engine by itself, and use the UCI go-nodes to run the engine, as it seems UCI engines do not combine movestogo with nodes. I am not sure whether Polyglot attempts to send a stop command when an iteration finishes. (Probably not.)
That is not quite the same as my proposal, but it's good. In my proposal the idea is to fake out a program that doesn't support this. But I can see a lot of problems with this unfortunately.

I am not sure all UCI programs send their node counts when they make a move - so that could be an issue. But presumably, if a UCI engine were to support this it would take care to also send the nodes immediately before making a move.
ZirconiumX
Posts: 1339
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Hardware oblivious testing

Post by ZirconiumX »

Stockfish does not do what you want it to do. When Signals.stop is raised it moves instantly, and does not dump any info about nodes etc.

Matthew:out
hgm
Posts: 28268
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Hardware oblivious testing

Post by hgm »

To simulate it for engines that don't have it, you will have to bypass the engine's time control. That means it will be hard to do tricks like granting extra time on a fail low. You would probably be limited to a simple time management that always finishes an iteration, because only at the end of an iteration can you expect the engine to report nodes.

But that can be done. You would just specify generous time, and use UCI 'stop' or WB move-now commands to terminate the search when the node count goes over the calculated limit.

Perhaps I should switch handling of nps mode in Polyglot to use this system. All UCI engines should support 'stop'. (In WB protocol move-now support is rare.) Polyglot has to take over time management of the engine in this mode anyway, but the way I do it now is just calculate FUDGEFACTOR*remainingNodes/remainingMoves, and feed that to the engine as 'go nodes'. Because I assumed that UCI engines would be smart enough in go-nodes or go-movetime mode to produce a move when they see that the remaining nodes/time budget is so low that starting a new iteration is a waste of time. But it seems they stupidly waste the entire budget (basically granting their opponent free ponder time...)

So it might be better to just send them 'go infinite', and let the reported nodes that come with a new PV determine when to send 'stop'.
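A rough Python sketch of that adapter-side logic, assuming the engine prints standard UCI `info` lines containing both `nodes` and `pv`. The function names are invented for the example and the `FUDGE_FACTOR` value is a placeholder, since no value is given here; this is not Polyglot's actual code.

```python
FUDGE_FACTOR = 1.0  # placeholder; the real value is not stated in the thread


def move_budget(remaining_nodes, remaining_moves, fudge=FUDGE_FACTOR):
    """Node budget for one move: FUDGEFACTOR * remainingNodes / remainingMoves."""
    return int(fudge * remaining_nodes / remaining_moves)


def nodes_from_info(line):
    """Return the node count from a UCI 'info' line that carries a PV, else None."""
    tokens = line.split()
    if not tokens or tokens[0] != "info":
        return None
    if "pv" not in tokens or "nodes" not in tokens:
        return None
    idx = tokens.index("nodes")
    if idx + 1 >= len(tokens) or not tokens[idx + 1].isdigit():
        return None
    return int(tokens[idx + 1])


def should_stop(line, budget):
    """True if this line reports a new PV and the node count is over budget."""
    nodes = nodes_from_info(line)
    return nodes is not None and nodes > budget


budget = move_budget(40_000_000, 40)
pv_line = "info depth 12 score cp 23 nodes 1500000 nps 2000000 time 750 pv e2e4 e7e5"
progress_line = "info depth 12 currmove g1f3 currmovenumber 2"
stop_now = should_stop(pv_line, budget)
```

The adapter would send 'go infinite', feed each line of engine output to `should_stop`, and send 'stop' the first time it returns true. Lines without a PV are ignored, so the search is only ever interrupted right after the engine has reported its node count.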

Like you remark, all this can only work for engines that faithfully and concisely report nodes.
ZirconiumX wrote:Stockfish does not do what you want it to do. When Signals.stop is raised it moves instantly, and does not dump any info about nodes etc.
This doesn't present too much of a problem if you do it as I described above, waiting for the engine to spontaneously report nodes, and only stopping it immediately after it does. You might miss a few nodes done in the round-trip delay, but the engine pretty much would have wasted those anyway, so it is of no import.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Hardware oblivious testing

Post by diep »

brianr wrote:I suggest taking a step back to look at why your engine seems to slow down more than the others before anything else. I am somewhat surprised it is at all measurable in elo terms (including -/+ uncertainty).
Komodo is tactically weak.

So at small search depths, when Komodo isn't through the tactical barrier yet, it of course gets slaughtered by engines that also get a high nps yet miss little tactically.