I think you are still walking around the edge of a trap. Do you mix time controls in your testing? I don't. Yet mixing hardware is doing EXACTLY that. When playing timed matches, faster hardware searches deeper. Fortunately I don't have to deal with this myself, as I confine my testing to a specific cluster that has uniform hardware on every node. Using mixed hardware is going to produce mixed results that have yet another degree of freedom besides just the engine changes you made. Now you play some games at a deeper depth, which changes the engine's search behavior (an engine may play relatively better at shallower depths, or vice versa).

Don wrote:
A major problem in testing chess programs is that they perform differently under different conditions. Larry and I knew this, but we were surprised at the difference. On one of his machines, for example, we learned that Komodo does not fare as well as the programs we test against, losing a few percent more nodes per second. In other words, on that machine we drop more nodes per second than the foreign programs we test against.
But it's extremely useful to be able to combine our results and in fact I went to some trouble to construct a distributed automated tester. A beautiful thing that allows us to test on Linux and Windows by distributing a single binary "client program" to our testers. We configure the tests, the clients run them as directed by the server.
Unfortunately, the results depend more on WHO happens to be running tests at the time than on the change being measured. It's no good for measuring small changes.
We don't care how we stand relative to other programs when we are simply measuring our own progress, we just need stable and consistent testing conditions. This is an important concept to understand in order to appreciate what follows.
Now the most obvious idea is to run fixed-depth or fixed-node testing. These have their place, but both have serious problems. Any change that speeds up or slows down the program (and most changes have some impact in this regard) cannot easily be reconciled. Also, fixed-node search plays horrible chess: for the same investment in time, the quality of the games is reduced enormously, probably because once a search iteration is started it makes sense to try to finish it, and fixed-node search does not do that. Also, many foreign programs do not honor a fixed-node command, or do not implement it correctly. But even so, as I said, the results cannot be reconciled. Add an evaluation feature that takes significant time to compute and it will automatically look good under fixed-node testing, because we never have to accept the nodes per second hit.
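To make that last point concrete, here is a toy calculation in Python (all numbers are invented, purely for illustration):

    # Invented numbers, purely to illustrate the fixed-node bias.
    NPS_OLD = 1_000_000   # speed before adding an expensive eval term
    NPS_NEW = 800_000     # speed after: the term costs 20% of our nps

    # Fixed-node test: both versions get the same budget, so the
    # slowdown is invisible and the term is judged only on its accuracy.
    FIXED_NODES = 10_000_000
    nodes_old_fixed = FIXED_NODES
    nodes_new_fixed = FIXED_NODES   # identical search effort

    # Timed test at 10 seconds per move: the new version now searches
    # 20% fewer nodes, so the term must gain more strength than that
    # loss of depth costs before it is actually worth keeping.
    SECONDS = 10
    nodes_old_timed = NPS_OLD * SECONDS   # 10,000,000 nodes
    nodes_new_timed = NPS_NEW * SECONDS   #  8,000,000 nodes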
Cut to the chase
So what is to be done? Here is a solution. Most programs report the total nodes spent on the search. We need a test that is based on nodes searched but handled like any normal time control. Additionally, we would like not to have to modify each program to use this system, so we need to trick each program into behaving this way even though it does not have that capability. You can do this with the following trick:
1. Pick some reference hardware and get a good measurement of the nodes per second of each program being tested.
2. Use what is learned in step 1 to produce an adjustment factor.
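As a rough illustration of steps 1 and 2, the Python sketch below (my own naming throughout, not taken from any actual tester) measures a UCI engine's speed with a single timed search on the reference box and turns it into an adjustment factor against a chosen reference rate; in practice you would average over many positions:

    import subprocess

    REFERENCE_NPS = 1_000_000   # arbitrary: defines 1 pseudo-second as 1M nodes

    def measure_nps(engine_path, movetime_ms=10_000):
        # Step 1: run one timed search on the reference hardware and keep
        # the last "info ... nps ..." value the engine prints (UCI output).
        proc = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)
        proc.stdin.write("uci\nisready\nposition startpos\n"
                         f"go movetime {movetime_ms}\n")
        proc.stdin.flush()
        nps = None
        for line in proc.stdout:
            tokens = line.split()
            if "nps" in tokens:
                nps = int(tokens[tokens.index("nps") + 1])
            if line.startswith("bestmove"):
                break
        proc.stdin.write("quit\n")
        proc.stdin.flush()
        return nps

    def adjustment_factor(engine_nps):
        # Step 2: how much faster (or slower) this engine runs than the
        # reference rate; used to convert its nodes into pseudo-time.
        return engine_nps / REFERENCE_NPS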
The tester basically ignores the wall clock and makes decisions based on the nodes reported by the program. For obvious reasons, pondering must be turned off. Let's say we have two programs that play at the same strength, but one does 1 million nodes per second and the other does 2 million. Let's say the tester notices that each program has 1 (pseudo) second left on its clock in a sudden death game. To the fast program it reports that it has 1/2 second left, and to the slow program it reports that it has 1 second left. What you should get is consistent play that is independent of hardware. When a program reports a move, the tester converts the nodes it reports into time and debits the program's clock by that amount.
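Wiring that into the tester's clock handling could look something like the sketch below (again Python, again my own toy naming, not Don's actual implementation). The clock is kept in nodes, the time reported to each engine is scaled by its calibrated speed, and the debit after each move uses only the node count the engine reports:

    REFERENCE_NPS = 1_000_000   # same choice as in the calibration sketch above

    class NodeClock:
        # Sudden-death pseudo-clock denominated in nodes, not wall time.

        def __init__(self, base_seconds, engine_nps):
            # The game budget, converted once into reference nodes.
            self.remaining_nodes = base_seconds * REFERENCE_NPS
            self.engine_nps = engine_nps   # step-1 value, reference hardware

        def seconds_to_report(self):
            # What the tester tells the engine it has left. With 1M nodes
            # remaining, a 2M nps engine is told 0.5 s and a 1M nps engine
            # 1.0 s, so both aim at roughly the same node budget.
            return self.remaining_nodes / self.engine_nps

        def debit(self, nodes_reported):
            # Charge only the nodes the engine reports for the move; the
            # real wall clock never enters the accounting.
            self.remaining_nodes -= nodes_reported
            return self.remaining_nodes > 0   # False: flag fell in pseudo-time

Because only reported nodes are ever debited, replaying the same match on faster or slower hardware burns both clocks identically, which is exactly the hardware independence being claimed.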
Unfortunately, there are still a couple of problems with this idea. The nodes per second of any given program is not consistent from move to move, though I wonder how much difference that will make in practice. The goal is not to nail the relative differences between foreign programs but to provide a consistent test. Still, time and nodes are not the same, and I would expect some gnarly side effects, perhaps time losses and other things.
Seems like a can of worms that will always add significant noise to your testing, with no reliable way to tell which part of the results is noise and which part matters...