I think you are still walking around the edge of a trap. Do you mix time controls in your testing? I don't. Yet mixing hardware is doing EXACTLY that. When playing timed matches, faster hardware searches deeper. Fortunately I don't have to deal with this myself, as I confine my testing to a specific cluster that has uniform hardware on every node. Using mixed hardware is going to produce mixed results that have yet another degree of freedom besides just the engine changes you made. Now you play some games at a deeper depth, which changes the engine's search behavior (an engine may play relatively better at shallower depths, or vice versa).

Don wrote:
A major problem in testing chess programs is that they perform differently under different conditions. Larry and I knew this, but we were surprised at the difference. On one of his machines, for example, we learned that Komodo does not fare as well as the programs we test against, losing a few percent more nodes per second. In other words, on that machine we drop more nodes per second than the foreign programs we test against.
But it's extremely useful to be able to combine our results and in fact I went to some trouble to construct a distributed automated tester. A beautiful thing that allows us to test on Linux and Windows by distributing a single binary "client program" to our testers. We configure the tests, the clients run them as directed by the server.
Unfortunately, the results depend more on WHO happens to be running tests at the time than on the change being measured. It's no good for measuring small changes.
We don't care how we stand relative to other programs when we are simply measuring our own progress, we just need stable and consistent testing conditions. This is an important concept to understand in order to appreciate what follows.
Now the most obvious idea is to run fixed-depth or fixed-node testing. These have their place, but both have serious problems. Any change that speeds up or slows down the program (and most changes have some impact in this regard) cannot easily be reconciled. Also, fixed-node search plays horrible chess: for the same investment in time, the quality of the games is reduced enormously, probably because once a search iteration is started it makes sense to try to finish it, and fixed-node search does not do that. Also, many foreign programs do not honor a fixed-node command, or do not implement it correctly. But even so, as I said, the results cannot be reconciled. Add an evaluation feature that takes significant time to compute and it will automatically look good under fixed-node testing, because we never have to accept the nodes per second hit.
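To make that last point concrete, here is a toy calculation in Python (all numbers are invented, purely for illustration):

    # Invented numbers, purely to illustrate the fixed-node bias.
    NPS_OLD = 1_000_000   # speed before adding an expensive eval term
    NPS_NEW = 800_000     # speed after: the term costs 20% of our nps

    # Fixed-node test: both versions get the same budget, so the
    # slowdown is invisible and the term is judged only on its accuracy.
    FIXED_NODES = 10_000_000
    nodes_old_fixed = FIXED_NODES
    nodes_new_fixed = FIXED_NODES   # identical search effort

    # Timed test at 10 seconds per move: the new version now searches
    # 20% fewer nodes, so the term must gain more strength than that
    # loss of depth costs before it is actually worth keeping.
    SECONDS = 10
    nodes_old_timed = NPS_OLD * SECONDS   # 10,000,000 nodes
    nodes_new_timed = NPS_NEW * SECONDS   #  8,000,000 nodes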
Cut to the chase
So what is to be done? Here is a solution. Most programs report the total nodes spent on the search. We need a test that is based on nodes searched but handled like any normal time control. Additionally, we would like not to have to modify each program to use this system, so we need to trick each program into behaving this way even though it does not have that capability. You can do this with the following trick:
1. Pick some reference hardware and get a good measurement of the nodes per second of each program being tested.
2. Use what is learned in step 1 to produce an adjustment factor.
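As a rough illustration of steps 1 and 2, the Python sketch below (my own naming throughout, not taken from any actual tester) measures a UCI engine's speed with a single timed search on the reference box and turns it into an adjustment factor against a chosen reference rate; in practice you would average over many positions:

    import subprocess

    REFERENCE_NPS = 1_000_000   # arbitrary: defines 1 pseudo-second as 1M nodes

    def measure_nps(engine_path, movetime_ms=10_000):
        # Step 1: run one timed search on the reference hardware and keep
        # the last "info ... nps ..." value the engine prints (UCI output).
        proc = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE, text=True)
        proc.stdin.write("uci\nisready\nposition startpos\n"
                         f"go movetime {movetime_ms}\n")
        proc.stdin.flush()
        nps = None
        for line in proc.stdout:
            tokens = line.split()
            if "nps" in tokens:
                nps = int(tokens[tokens.index("nps") + 1])
            if line.startswith("bestmove"):
                break
        proc.stdin.write("quit\n")
        proc.stdin.flush()
        return nps

    def adjustment_factor(engine_nps):
        # Step 2: how much faster (or slower) this engine runs than the
        # reference rate; used to convert its nodes into pseudo-time.
        return engine_nps / REFERENCE_NPS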
The tester basically ignores the wall clock and makes decisions based on the nodes reported by the program. For obvious reasons, pondering must be turned off. Let's say we have two programs that play at the same strength, but one does 1 million nodes per second and the other does 2 million. Let's say the tester notices that each program has 1 (pseudo) second left on its clock in a sudden death game. To the fast program it reports that it has 1/2 second left, and to the slow program it reports that it has 1 second left. What you should get is consistent play that is independent of hardware. When a program reports a move, the tester converts the nodes it reports into time and debits the program's clock by that amount.
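Wiring that into the tester's clock handling could look something like the sketch below (again Python, again my own toy naming, not Don's actual implementation). The clock is kept in nodes, the time reported to each engine is scaled by its calibrated speed, and the debit after each move uses only the node count the engine reports:

    REFERENCE_NPS = 1_000_000   # same choice as in the calibration sketch above

    class NodeClock:
        # Sudden-death pseudo-clock denominated in nodes, not wall time.

        def __init__(self, base_seconds, engine_nps):
            # The game budget, converted once into reference nodes.
            self.remaining_nodes = base_seconds * REFERENCE_NPS
            self.engine_nps = engine_nps   # step-1 value, reference hardware

        def seconds_to_report(self):
            # What the tester tells the engine it has left. With 1M nodes
            # remaining, a 2M nps engine is told 0.5 s and a 1M nps engine
            # 1.0 s, so both aim at roughly the same node budget.
            return self.remaining_nodes / self.engine_nps

        def debit(self, nodes_reported):
            # Charge only the nodes the engine reports for the move; the
            # real wall clock never enters the accounting.
            self.remaining_nodes -= nodes_reported
            return self.remaining_nodes > 0   # False: flag fell in pseudo-time

Because only reported nodes are ever debited, replaying the same match on faster or slower hardware burns both clocks identically, which is exactly the hardware independence being claimed.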
Unfortunately, there are still a couple of problems with this idea. The nodes per second of any given program is not consistent from move to move, though I wonder how much difference that will make in practice. The goal is not to nail the relative differences between foreign programs but to provide a consistent test. Still, time and nodes are not the same, and I would expect some gnarly side effects, perhaps time losses and other things.
Seems like a can of worms that will always add significant noise to your testing, with no reliable way to tell which part of the results is noise and which part matters...