CLOP for Noisy Black-Box Parameter Optimization

Rémi Coulom · Post by **Rémi Coulom** » Wed Sep 07, 2011 10:19 pm

I added the script of Don, as well as a data file demonstrating the application of CLOP to piece values in Chess, contributed by Gian-Carlo Pascutto:
http://remi.coulom.free.fr/CLOP/

Thanks to both.

Rémi

Rémi Coulom · Post by **Rémi Coulom** » Thu Sep 08, 2011 6:44 pm

Hi,

Don just wrote this to me:

Don wrote:Remi,

Sorry to bug you, but there is an error in the script. Also, I found that it does not run on most unix systems due to an outdated tclsh on most systems. I use 1 modern command not in older tcls.

I also included a README file with my email for help.

So if you can update this, it would probably be useful to people.

the web site of CLOP is now updated with the new version.

Rémi

lkaufman · Post by **lkaufman** » Fri Sep 09, 2011 5:36 am

I would like to ask for more comments about the "correlations" option. If you are dealing with the evaluation function of a chess program, there are a huge number of terms, and many of them (maybe all) do correlate to some degree with each other. Se clearly the "correct" setting is correlations all. However if you have much more than a dozen or so eval terms, it seems to converge too slowly to be practical. With correlations none, things seem to happen much more quickly with a lot of terms. I have a suspicion that the time it takes to converge with correlations none may only go up in proportion to the number of terms (or maybe the square of it), whereas with correlations all it may be exponential. Do you think this is correct or close to it?
So the other question is: if you have to use correlations none to get convergence in any practical amount of time, how valid are the results if there really are correlations? What exactly are the types of errors you would see?
If you had a hundred or so eval terms in a chess program to optimize, how would you use your own software to best do this?

Rémi Coulom · Post by **Rémi Coulom** » Fri Sep 09, 2011 8:11 am

lkaufman wrote:I would like to ask for more comments about the "correlations" option. If you are dealing with the evaluation function of a chess program, there are a huge number of terms, and many of them (maybe all) do correlate to some degree with each other. Se clearly the "correct" setting is correlations all. However if you have much more than a dozen or so eval terms, it seems to converge too slowly to be practical. With correlations none, things seem to happen much more quickly with a lot of terms. I have a suspicion that the time it takes to converge with correlations none may only go up in proportion to the number of terms (or maybe the square of it), whereas with correlations all it may be exponential. Do you think this is correct or close to it?
So the other question is: if you have to use correlations none to get convergence in any practical amount of time, how valid are the results if there really are correlations? What exactly are the types of errors you would see?
If you had a hundred or so eval terms in a chess program to optimize, how would you use your own software to best do this?

In theory, "Correlations all" should always be best in terms of getting a good optimal from the same number of games, but there are at least two problems: first, computing the regression becomes very slow. The number of parameters of the regression for n parameters (of the program) is (n+1)*(n+2)/2. I use Newton's method to do the regression, which has a cost cubic in the number of parameters of the regression, so O(n^6) cost. In practice, this will be too slow when n becomes much larger than a dozen. The second problem is that because the number of parameters increases, there are more risks of over-fitting when the number of games is low. This problem should always go away when the number of games becomes higher.

Both problems are solvable, with a better algorithm (for regression), or stronger regularization (for overfitting) (ie, a stronger prior).

In practice "Correlations none" may seem to converge faster, but what it converges to might not be very good. In my low-dimensional experiments, even when there are no correlations at all, "Correlations none" does not perform significantly better than "Correlations all" in terms of simple regret. So "Correlations all" should be always better whith a high enough number of games.

The problem is for a large number of parameters "high enough" might be extremely high. Right now, it does not seem realistic to use "Correlations all" with more than a dozen parameters, as you noticed. But I will try to improve this.

Rémi

lkaufman · Post by **lkaufman** » Fri Sep 09, 2011 4:43 pm

Rémi Coulom wrote:
lkaufman wrote:I would like to ask for more comments about the "correlations" option. If you are dealing with the evaluation function of a chess program, there are a huge number of terms, and many of them (maybe all) do correlate to some degree with each other. Se clearly the "correct" setting is correlations all. However if you have much more than a dozen or so eval terms, it seems to converge too slowly to be practical. With correlations none, things seem to happen much more quickly with a lot of terms. I have a suspicion that the time it takes to converge with correlations none may only go up in proportion to the number of terms (or maybe the square of it), whereas with correlations all it may be exponential. Do you think this is correct or close to it?
So the other question is: if you have to use correlations none to get convergence in any practical amount of time, how valid are the results if there really are correlations? What exactly are the types of errors you would see?
If you had a hundred or so eval terms in a chess program to optimize, how would you use your own software to best do this?
In theory, "Correlations all" should always be best in terms of getting a good optimal from the same number of games, but there are at least two problems: first, computing the regression becomes very slow. The number of parameters of the regression for n parameters (of the program) is (n+1)*(n+2)/2. I use Newton's method to do the regression, which has a cost cubic in the number of parameters of the regression, so O(n^6) cost. In practice, this will be too slow when n becomes much larger than a dozen. The second problem is that because the number of parameters increases, there are more risks of over-fitting when the number of games is low. This problem should always go away when the number of games becomes higher.

Both problems are solvable, with a better algorithm (for regression), or stronger regularization (for overfitting) (ie, a stronger prior).

In practice "Correlations none" may seem to converge faster, but what it converges to might not be very good. In my low-dimensional experiments, even when there are no correlations at all, "Correlations none" does not perform significantly better than "Correlations all" in terms of simple regret. So "Correlations all" should be always better whith a high enough number of games.

The problem is for a large number of parameters "high enough" might be extremely high. Right now, it does not seem realistic to use "Correlations all" with more than a dozen parameters, as you noticed. But I will try to improve this.

Rémi

If a dozen is the upper practical limit for "correlations all", what would be the upper practical limit for "correlations none"? Does it make a huge difference in this limit or just a small one?

Rémi Coulom · Post by **Rémi Coulom** » Fri Sep 09, 2011 9:47 pm

lkaufman wrote:If a dozen is the upper practical limit for "correlations all", what would be the upper practical limit for "correlations none"? Does it make a huge difference in this limit or just a small one?

I never tried experiments with more than a dozen parameters. This being said, my guess is that "Correlations none" should work up to at least 100 parameters, probably many more. For a given total amount of games, tuning many parameters all at the same time with "Correlations none" should always be much better than tuning them one by one in sequence. But if your parameters are already well tuned, it might be better to tune just a few to extremely accurate values rather than many to less accurate values that may turn out to be worse than those you already have.

I wrote that tool to tune my Go program. My Go program has millions of parameters, automatically tuned by another learning algorithm. I know they are not optimal, but they are good enough, and I have no hope to improve them with CLOP.

It is frustrating, because I know an optimal value for all these parameters exists, and optimal parameters would make my program significantly stronger. It is simply not realistic to expect being able to find that optimum even with computing power that will be available in the far future. I only tune parameters that have a strong influence on strength with CLOP.

Also: when using "Correlations none", you can try to de-correlate your variables. For instance, values of Knight(N) and Bishop(B) are strongly correlated. Instead of optimizing N and B, you can optimize N+B and N-B: they are almost independent.

You should also try to use variables that have a stronger influence on strength. For instance, when optimizing a piece-square table, it is inefficient to tune the value of each square separately. It might be more efficient to use variables such as distance to border, rank, file, distance to promotion, legal moves on an empty board, .... More generally, you can try to be creative to reduce the dimensionality of the optimization problem.

Rémi

Rémi Coulom · Post by **Rémi Coulom** » Fri Sep 09, 2011 9:52 pm

lkaufman wrote:Does it make a huge difference in this limit or just a small one?

For that part of the question: if parameters are strongly correlated, then, yes, it makes a big difference. I tried "Correlations none" on the "Correlated" problem of the paper, and it performed really bad. I'll post the plot if I have time tomorrow.

Rémi

Rémi Coulom · Post by **Rémi Coulom** » Sat Sep 10, 2011 11:49 am

Rémi Coulom wrote:
lkaufman wrote:Does it make a huge difference in this limit or just a small one?
For that part of the question: if parameters are strongly correlated, then, yes, it makes a big difference. I tried "Correlations none" on the "Correlated" problem of the paper, and it performed really bad. I'll post the plot if I have time tomorrow.

Rémi

The dotted blue line is CLOP with "Correlations none", the solid blue line is with "Correlations all". Log-log scale, see paper for details. As you can see "Correlations none" is not a big improvement over "Correlations all" when parameters are completely independent. But when they are actually correlated, "Correlations none" performs really badly.

So I'll have to find a way to scale "Correlations all" to higher dimensions.

Rémi

Michel · Post by **Michel** » Sun Sep 25, 2011 10:18 am

I am making my first forays into parameter tuning so I am trying to read
your paper.

I am not an expert so I have a few naive questions.

When you initially describe the problem (in Section 1.1) f is supposed to take values in the space of Bernouilli trials. Does this mean that the (y_i)_i in Algorithm 1 are contained in {0,1} (i.e. the possible outcomes of
a Bernouilli trial)?

Bernouilli trials are too restrictive even when playing against a single opponent, because of draws.

Is it true that the algorithm remains valid if f takes values in the space
of all probability distributions?

If so what is the algorithm computing? The values x^* such that f(x^*) has maximal expectation value?

Thanks,
Michel

Rémi Coulom · Post by **Rémi Coulom** » Sun Sep 25, 2011 10:47 am

Michel wrote:I am making my first forays into parameter tuning so I am trying to read
your paper.

I am not an expert so I have a few naive questions.

When you initially describe the problem (in Section 1.1) f is supposed to take values in the space of Bernouilli trials. Does this mean that the (y_i)_i in Algorithm 1 are contained in {0,1} (i.e. the possible outcomes of
a Bernouilli trial)?

Bernouilli trials are too restrictive even when playing against a single opponent, because of draws.

Yes, my paper is restricted to 0/1 outcomes, but it is easy to generalize to win/loss/draw. The CLOP software can deal with W/L/D by using the same outcome model as bayeselo.

If you wish to test against multiple opponents, you can do it with CLOP by using the "Replications" parameter. CLOP will not estimate the strength of each individual opponent, but it will maximize the winning rate over replications.

Is it true that the algorithm remains valid if f takes values in the space
of all probability distributions?

If so what is the algorithm computing? The values x^* such that f(x^*) has maximal expectation value?

Thanks,
Michel

I suppose it would work well for any kind of outcome distribution, even deterministic continuous values. The convenient aspect of the logistic model is that variance is a function of mean. So there is no need to have a separate model for noise. My guess is that CLOP would work well even with no noise model, by using the variance of samples in order to estimate the confidence interval of the mean. But that remains to be tested.

Rémi

CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization

Re: CLOP for Noisy Black-Box Parameter Optimization