Testing procedure

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Testing procedure

Post by Don »

Don wrote:
mcostalba wrote:I think that one of the most valuable things that a knowledge contributor can give to the community is a paper with the title "The most resource-efficient way of testing a chess engine"
That would be an excellent paper if done well.

In the past I have experimented with learning algorithms such as PBIL, which is related to genetic algorithms. The key requirement of PBIL is that you produce a set of players and pick the most "fit" from among them. In some domains fitness is directly measurable, in others it has to be estimated and chess skill of course is one of those.

So given a set of random players, what is the most efficient way to determine which program is strongest? I relaxed the requirement slightly and recognized that the problem is really how to find a GOOD player with the least amount of resources.

I'm sure there are probabilistic ways to do this. But the solution I came to is a divide and conquer approach. Given a nearly infinite pool of players (which I can generate at will) a knockout competition will very quickly find a reasonable opponent. The round robin approach is probably orders of magnitude less efficient. In a knockout, there is a very good chance that the winner will be one of the top players and the effort required to find this player is pretty small. Early rounds quickly separate the men from the boys. It also makes sense to play longer matches in later rounds, since a late round is a tiny fraction of the total effort.

So I wonder if some variation of this method might be used for testing in general - although it's doubtful since I am usually testing only 1 change at a time.
I don't think this by itself is going to work well, especially when the players are close together in strength. One could add rounds to the tournament or increase the number of games per individual match, and perhaps this should take error margins into account.

Someone better than myself at math might be able to come up with the proper tournament schedule (or other method) for "finding" the strongest player from a set of random players with the least amount of effort.
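As a concrete starting point for such a schedule, here is a minimal sketch of the knockout-with-growing-matches idea from the quote above. The play_match stand-in and the logistic rating model are my own illustrative assumptions, not anything from the thread, and it assumes a power-of-two pool for simplicity.

Code:

import random

def play_match(a, b, games):
    # Stand-in for a real match: a and b are hidden "true" ratings,
    # and each game is won with the usual logistic probability.
    wins_a = sum(random.random() < 1 / (1 + 10 ** ((b - a) / 400))
                 for _ in range(games))
    return a if wins_a * 2 >= games else b

def knockout(players, base_games=8):
    # Assumes len(players) is a power of two (no byes).
    games = base_games
    while len(players) > 1:
        random.shuffle(players)
        players = [play_match(players[i], players[i + 1], games)
                   for i in range(0, len(players), 2)]
        games *= 2  # longer matches in later rounds, as suggested above
    return players[0]

pool = [random.gauss(2400, 100) for _ in range(64)]
print(knockout(pool))  # the winner is usually near the top of the pool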

Whatever is done should be based on trying to recognize quickly where effort is being wasted evaluating a clearly inferior player.
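One standard way to recognize quickly that effort is being wasted, not proposed in the thread but well known from sequential statistics, is a sequential probability ratio test: stop a match as soon as the win/loss record strongly favours one hypothesis about the score. A minimal sketch, ignoring draws for simplicity:

Code:

import math

def sprt_stop(wins, losses, p0=0.45, p1=0.55, alpha=0.05, beta=0.05):
    # Log-likelihood ratio of "true score is p1" vs "true score is p0".
    if wins + losses == 0:
        return None
    llr = (wins * math.log(p1 / p0)
           + losses * math.log((1 - p1) / (1 - p0)))
    if llr >= math.log((1 - beta) / alpha):
        return 'stronger'   # accept p1: keep this player
    if llr <= math.log(beta / (1 - alpha)):
        return 'weaker'     # accept p0: eliminate without more games
    return None             # verdict still open: keep playing

print(sprt_stop(wins=30, losses=12))  # prints 'stronger'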
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Testing procedure

Post by hgm »

bob wrote:If two of the features somehow interact, however, the test is not as useful, so you really need to be sure that the 4 changes don't influence each other in unexpected ways.
If the features interact, testing them one by one is also no good.

But the nice thing about the orthogonal multi-testing is that it gives you an idea if the features interact, from the same data set. E.g. with two features A and B, you make test runs with the 4 combinations {--}, {A-}, {-B} and {AB}. To get the effect of A, you take

{A-} - {--} + {AB} - {-B}.

To get the effect of B, you take

{-B} - {--} + {AB} - {A-}.

To get an idea of their interaction, you take

{AB} + {--} - {A-} - {-B}.

All these quantities have the same statistical error, corresponding to the total number of games. If the effect of A and B together is the sum of the effect of A alone and B alone, the last quantity should be zero. If it is significantly positive or negative, you know A and B interact.
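As a worked example, here is the arithmetic in a short Python sketch. The scores are made-up illustrative numbers for the four runs; halving each sum gives the per-run average, since each estimate combines two on/off differences.

Code:

# Made-up scores (e.g. Elo vs a fixed gauntlet) for the four runs.
s = {'--': 0.0, 'A-': 10.0, '-B': 6.0, 'AB': 20.0}

effect_A = s['A-'] - s['--'] + s['AB'] - s['-B']      # 24, i.e. ~12 per pair
effect_B = s['-B'] - s['--'] + s['AB'] - s['A-']      # 16, i.e. ~8 per pair
interaction = s['AB'] + s['--'] - s['A-'] - s['-B']   # 4: A and B reinforce

print(effect_A, effect_B, interaction)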
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Testing procedure

Post by Don »

Don wrote:
hgm wrote:In principle, orthogonal multi-testing would allow you to evaluate 10 changes in 1024 games.
You simply must expand on that one! I'm interested. How would you go about such a thing?
Whoops, I missed your explanation on this. Your idea here is excellent!

The only issue I have after thinking about it is that you do miss the diversity of having foreign players. I often test against programs not authored by myself, as it sometimes happens that a big "improvement" is almost worthless against other players. However, the economy of effort here is huge and this is a great idea.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Testing procedure

Post by mcostalba »

Don wrote:
Don wrote:
hgm wrote:In principle, orthogonal multi-testing would allow you to evaluate 10 changes in 1024 games.
You simply must expand on that one! I'm interested. How would you go about such a thing?
Whoops, I missed your explanation on this. Your idea here is excellent!

The only issue I have after thinking about it is that you do miss the diversity of having foreign players. I often test against programs not authored by myself, as it sometimes happens that a big "improvement" is almost worthless against other players. However, the economy of effort here is huge and this is a great idea.
Yes, I agree. A digression, possibly by the author, on generalizing the method to the multi-opponent case would be very interesting.
krazyken

Re: Testing procedure

Post by krazyken »

mcostalba wrote:
Don wrote:
Don wrote:
hgm wrote:In principle, orthogonal multi-testing would allow you to evaluate 10 changes in 1024 games.
You simply must expand on that one! I'm interested. How would you go about such a thing?
Whoops, I missed your explanation on this. Your idea here is excellent!

The only issue I have after thinking about it is that you do miss the diversity of having foreign players. I often test against programs not authored by myself, as it sometimes happens that a big "improvement" is almost worthless against other players. However, the economy of effort here is huge and this is a great idea.
Yes, I agree. A digression, possibly by the author, on generalizing the method to the multi-opponent case would be very interesting.
Where did you get the idea that you aren't playing a variety of opponents?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing procedure

Post by bob »

Don wrote:
mcostalba wrote:I think that one of the most valuable things that a knowledge contributor can give to the community is a paper with the title "The most resource-efficient way of testing a chess engine"
That would be an excellent paper if done well.

In the past I have experimented with learning algorithms such as PBIL, which is related to genetic algorithms. The key requirement of PBIL is that you produce a set of players and pick the most "fit" from among them. In some domains fitness is directly measurable, in others it has to be estimated and chess skill of course is one of those.

So given a set of random players, what is the most efficient way to determine which program is strongest? I relaxed the requirement slightly and recognized that the problem is really how to find a GOOD player with the least amount of resources.

I'm sure there are probabilistic ways to do this. But the solution I came to is a divide and conquer approach. Given a nearly infinite pool of players (which I can generate at will) a knockout competition will very quickly find a reasonable opponent. The round robin approach is probably orders of magnitude less efficient. In a knockout, there is a very good chance that the winner will be one of the top players and the effort required to find this player is pretty small. Early rounds quickly separate the men from the boys. It also makes sense to play longer matches in later rounds, since a late round is a tiny fraction of the total effort.


I don't think that question is _nearly_ as interesting as the question "given two program versions that are very close to each other, what is the most efficient way to determine with reasonable reliability which one is best???"


So I wonder if some variation of this method might be used for testing in general - although it's doubtful since I am usually testing only 1 change at a time.
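The question bob raises is demanding because small Elo differences take enormous numbers of games to resolve. A rough back-of-the-envelope calculation, assuming the usual logistic Elo model and a per-game score standard deviation of about 0.5:

Code:

import math

def games_needed(elo_diff, z=1.96, sigma=0.5):
    # Expected score of the stronger side under the logistic model.
    p = 1 / (1 + 10 ** (-elo_diff / 400))
    # Require the 95% error bar z*sigma/sqrt(n) to fit inside p - 0.5.
    return math.ceil((z * sigma / (p - 0.5)) ** 2)

print(games_needed(5))   # a ~5 Elo gap: roughly 19,000 games
print(games_needed(20))  # a ~20 Elo gap: roughly 1,200 games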
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing procedure

Post by bob »

hgm wrote:
bob wrote:If two of the features somehow interact, however, the test is not as useful, so you really need to be sure that the 4 changes don't influence each other in unexpected ways.
If the features interact, testing them one by one is also no good.

But the nice thing about the orthogonal multi-testing is that it gives you an idea if the features interact, from the same data set. E.g. with two features A and B, you make test runs with the 4 combinations {--}, {A-}, {-B} and {AB}. To get the effect of A, you take

{A-} - {--} + {AB} - {-B}.

To get the effect of B, you take

{-B} - {--} + {AB} - {A-}.

To get an idea of their interaction, you take

{AB} + {--} - {A-} - {-B}.

All these quantities have the same statistical error, corresponding to the total number of games. If the effect of A and B together is the sum of the effect of A alone and B alone, the last quantity should be zero. If it is significantly positive or negative, you know A and B interact.
You don't have to convince me. :) I think you proposed an excellent idea, although at times it probably won't apply. One particularly difficult case, which is probably better dealt with via some optimization method like annealing, is where you have two variables that have a large range and need to be tuned together. This is a huge computational problem for the way I currently test: 10 steps for each value produces 100 combinations at 32,000 games each, i.e. 3.2 million games, which is still impractical even at very fast time controls.
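For what an annealing run over two such parameters might look like, here is a minimal sketch. The measure_elo stand-in (a noisy bowl with a known optimum) is purely illustrative; in practice it would be a match result, and the step sizes and cooling schedule are arbitrary assumptions.

Code:

import math, random

def measure_elo(x, y):
    # Stand-in for "play a gauntlet with parameters (x, y)": a noisy
    # quadratic bowl whose optimum sits at (55, 30).
    return -((x - 55) ** 2 + (y - 30) ** 2) / 50 + random.gauss(0, 2)

def anneal(x, y, steps=200, temp=10.0, cooling=0.98):
    current = measure_elo(x, y)
    for _ in range(steps):
        nx, ny = x + random.randint(-5, 5), y + random.randint(-5, 5)
        score = measure_elo(nx, ny)
        # Always accept improvements; accept regressions with a
        # probability that shrinks as the temperature cools.
        if score > current or random.random() < math.exp((score - current) / temp):
            x, y, current = nx, ny, score
        temp *= cooling
    return x, y

print(anneal(0, 0))  # ends near (55, 30) after ~200 noisy evaluations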
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing procedure

Post by bob »

Don wrote:
Don wrote:
hgm wrote:In principle, orthogonal multi-testing would allow you to evaluate 10 changes in 1024 games.
You simply must expand on that one! I'm interested. How would you go about such a thing?
Whoops, I missed your explanation on this. Your idea here is excellent!

The only issue I have after thinking about it is that you do miss the diversity of having foreign players. I often test against programs not authored by myself, as it sometimes happens that a big "improvement" is almost worthless against other players. However, the economy of effort here is huge and this is a great idea.
You don't have to give up on that. You can still play against N opponents. But you don't need as many games. In the example I gave, I was using multiple opponents. Given two ideas, A and B, you play AX (X = B change is not present) against the gauntlet, then XB, then XX, and then AB so that you have all 4 combinations. But since you have two sets of games with A on, and two sets with B on, you play 1/4 the number of games for each of the 4 gauntlets, because you are going to combine them using BayesElo to give you an overall rating for A on vs A off, and B on vs B off... But you play fewer games with each thing on or off since they can be combined.
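A small sketch of the bookkeeping bob describes: the total budget is split into four quarter-size gauntlets, yet every game contributes evidence about both A and B. Names and numbers here are illustrative, not from the thread.

Code:

TOTAL_GAMES = 32000
COMBINATIONS = ['XX', 'AX', 'XB', 'AB']   # each feature off/on
OPPONENTS = ['opp1', 'opp2', 'opp3', 'opp4']

per_gauntlet = TOTAL_GAMES // len(COMBINATIONS)
schedule = {(c, o): per_gauntlet // len(OPPONENTS)
            for c in COMBINATIONS for o in OPPONENTS}

# BayesElo-style pooling: all AX and AB games speak to "A on", and
# all XX and XB games to "A off", so each side gets half the budget.
a_on = sum(n for (c, _), n in schedule.items() if c[0] == 'A')
a_off = TOTAL_GAMES - a_on
print(a_on, a_off)  # 16000 16000, from only 32000 games total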
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Testing procedure

Post by bob »

mcostalba wrote:
Don wrote:
Don wrote:
hgm wrote:In principle, orthogonal multi-testing would allow you to evaluate 10 changes in 1024 games.
You simply must expand on that one! I'm interested. How would you go about such a thing?
Whoops, I missed your explanation on this. Your idea here is excellent!

The only issue I have after thinking about it is that you do miss the diversity of having foreign players. I often test against programs not authored by myself, as it sometimes happens that a big "improvement" is almost worthless against other players. However, the economy of effort here is huge and this is a great idea.
Yes, I agree. A digression, possibly by the author, on generalizing the method to the multi-opponent case would be very interesting.
When we discussed this, multiple opponents were part of the idea. It doesn't matter what you play the 4 versions of your program against; it could be one opponent or a group of opponents, and it works exactly the same.
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Testing procedure

Post by Tord Romstad »

mcostalba wrote:Returning to your ultra-fast idea: it is indeed _almost_ the best thing, but you have to be careful. As a trivial example, consider selective search. Futility pruning in Glaurung starts at ply 7, and it is more or less the same in Cyclone. So if your ultra-fast search does not reach at least ply 9-10 and the change you are testing is about futility, then you can get artifacts with ultra-fast games.
I used to think so, too.

One of the biggest surprises I got when porting my program to the iPhone was how little I had to tune the selective search to the slow CPU. On the iPhone, I get only around 8 plies on average in the middle game, and I thought the search would fall apart totally when using 7 plies of aggressive, dubious forward pruning. To my surprise, it turned out to work just as well as on fast computers. Pruning more conservatively or reducing the limit from 7 plies to a lower number made the program play worse.
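For readers unfamiliar with the mechanism under discussion, here is a minimal, self-contained sketch of a futility-pruning condition. The margin table and names are hypothetical, not Glaurung's or Cyclone's actual values; in a real engine this test sits inside the negamax search.

Code:

def futility_prune(static_eval, alpha, depth, in_check,
                   margins=(0, 100, 160, 220, 280, 340, 400, 460)):
    # Skip searching this node if even an optimistic margin on top of
    # the static evaluation cannot raise the score above alpha.
    return (not in_check
            and depth < len(margins)
            and static_eval + margins[depth] <= alpha)

# At 3 plies from the horizon, 300 below alpha with a 220 margin: prune.
print(futility_prune(static_eval=-300, alpha=0, depth=3, in_check=False))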