gingell wrote:
> I've been doing my testing along the following lines: Make a change, compile with and without that change, and have the new and the old versions play maybe 100 games against each other. I understand that the error bars are still pretty wide with this number of games, but I've found it's a good enough starting point to see whether a change should be pursued further.
> ... other stuff
> Thanks for any comment.

100 games is plenty if your improvement is worth several hundred ELO. At that scale, 100 games will reveal the improvement with fair certainty.
But the bad news is that most changes are worth less than 10 ELO, unless your program is pretty raw, in which case most changes are still going to be worth far less than 100 ELO. That means 100 games is almost worthless.
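To see why 100 games is almost worthless, here is a rough sketch (helper names are my own, not from any library) that converts a match score into an ELO estimate with a ~95% error bar. Draws are counted as half a win; treating the variance this crudely slightly overstates the uncertainty, but the order of magnitude is right:

```python
import math

def elo_from_score(p):
    """Convert a score fraction (0 < p < 1) to an ELO difference."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins, losses, draws):
    """ELO estimate with a rough ~95% (2-sigma) interval from match results."""
    n = wins + losses + draws
    p = (wins + 0.5 * draws) / n           # score fraction, draws = half a point
    sigma = math.sqrt(p * (1.0 - p) / n)   # std error of the score fraction
    lo, hi = p - 2.0 * sigma, p + 2.0 * sigma
    return elo_from_score(lo), elo_from_score(p), elo_from_score(hi)

# A 55% score out of 100 games: the estimate is about +35 ELO, but the
# 2-sigma interval runs from roughly -35 to over +100 ELO.
print(elo_interval(45, 35, 20))
```

With only 100 games, the interval spans well over 100 ELO, so a 55% result is statistically indistinguishable from no improvement at all.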
The REALLY bad news is that measuring typical changes with certainty takes many thousands of games, usually something like 50 thousand. If you decide what to keep and what to throw away from samples much smaller than, say, 50,000 games, you will find that you are effectively making random changes to your program. Of course this does not apply if some of your changes are worth 40 or 50 ELO; in that case you might get away with just a few thousand games.
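You can estimate the required sample size directly. This sketch (again my own helper names, assuming worst-case variance and no draw adjustment) finds how many games it takes before a 2-sigma error bar on the score is smaller than the score shift a given ELO difference produces:

```python
import math

def score_from_elo(d):
    """Expected score fraction for an ELO advantage of d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def games_needed(elo_diff, z=2.0):
    """Games needed so a z-sigma error bar on the score is smaller than
    the shift produced by elo_diff (worst-case sigma = 0.5/sqrt(n))."""
    shift = score_from_elo(elo_diff) - 0.5
    # need z * 0.5 / sqrt(n) < shift  =>  n > (0.5 * z / shift)^2
    return math.ceil((0.5 * z / shift) ** 2)

print(games_needed(10))  # roughly 5,000 games just to resolve 10 ELO
print(games_needed(2))   # well over 100,000 games for a 2-ELO change
```

Note that this is the bare minimum to tell the change apart from zero at one error bar; to *confirm* a gain with real confidence you want the error bar well under the effect size, which is how you land in the tens of thousands of games for typical small improvements.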
More bad news is that you cannot get a meaningful sample unless you have enormous CPU power at your disposal or you test fast. I am forced to test fast, even though I don't believe in fast testing.
I tend to test using fixed-depth searches, and occasionally super-fast Fischer time controls when I want to check out major versions. The slowest fixed-depth search I can get away with is 9 ply. My tester lets me play 18 games per minute on my quad at 9-ply searches.
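The arithmetic behind that throughput requirement is simple enough to do back-of-the-envelope:

```python
# How long a 50,000-game run takes at 18 games/minute (the quad,
# 9-ply figure quoted above).
games, per_minute = 50_000, 18
hours = games / per_minute / 60
print(f"{hours:.0f} hours")  # about 46 hours
```

So even at 18 games per minute, a statistically meaningful run ties up the machine for roughly two days, which is why slower time controls are simply out of reach without a cluster.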
Sometimes I want to find out something really fast. I can play many 3-ply games per second, and I sometimes use that to quick-test a concept, keeping in mind that the result is not to be trusted. It's just data I take with a grain of salt.