CLOP for Noisy BlackBox Parameter Optimization
Moderators: hgm, Harvey Williamson, bob
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

 Posts: 404
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: CLOP for Noisy BlackBox Parameter Optimization
I added the script of Don, as well as a data file demonstrating the application of CLOP to piece values in Chess, contributed by GianCarlo Pascutto:
http://remi.coulom.free.fr/CLOP/
Thanks to both.
Rémi
http://remi.coulom.free.fr/CLOP/
Thanks to both.
Rémi

 Posts: 404
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: CLOP for Noisy BlackBox Parameter Optimization
Hi,
Don just wrote this to me:
the web site of CLOP is now updated with the new version.
Rémi
Don just wrote this to me:
Don wrote:Remi,
Sorry to bug you, but there is an error in the script. Also, I found that it does not run on most unix systems due to an outdated tclsh on most systems. I use 1 modern command not in older tcls.
I also included a README file with my email for help.
So if you can update this, it would probably be useful to people.
the web site of CLOP is now updated with the new version.
Rémi
Re: CLOP for Noisy BlackBox Parameter Optimization
I would like to ask for more comments about the "correlations" option. If you are dealing with the evaluation function of a chess program, there are a huge number of terms, and many of them (maybe all) do correlate to some degree with each other. Se clearly the "correct" setting is correlations all. However if you have much more than a dozen or so eval terms, it seems to converge too slowly to be practical. With correlations none, things seem to happen much more quickly with a lot of terms. I have a suspicion that the time it takes to converge with correlations none may only go up in proportion to the number of terms (or maybe the square of it), whereas with correlations all it may be exponential. Do you think this is correct or close to it?
So the other question is: if you have to use correlations none to get convergence in any practical amount of time, how valid are the results if there really are correlations? What exactly are the types of errors you would see?
If you had a hundred or so eval terms in a chess program to optimize, how would you use your own software to best do this?
So the other question is: if you have to use correlations none to get convergence in any practical amount of time, how valid are the results if there really are correlations? What exactly are the types of errors you would see?
If you had a hundred or so eval terms in a chess program to optimize, how would you use your own software to best do this?

 Posts: 404
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: CLOP for Noisy BlackBox Parameter Optimization
In theory, "Correlations all" should always be best in terms of getting a good optimal from the same number of games, but there are at least two problems: first, computing the regression becomes very slow. The number of parameters of the regression for n parameters (of the program) is (n+1)*(n+2)/2. I use Newton's method to do the regression, which has a cost cubic in the number of parameters of the regression, so O(n^6) cost. In practice, this will be too slow when n becomes much larger than a dozen. The second problem is that because the number of parameters increases, there are more risks of overfitting when the number of games is low. This problem should always go away when the number of games becomes higher.lkaufman wrote:I would like to ask for more comments about the "correlations" option. If you are dealing with the evaluation function of a chess program, there are a huge number of terms, and many of them (maybe all) do correlate to some degree with each other. Se clearly the "correct" setting is correlations all. However if you have much more than a dozen or so eval terms, it seems to converge too slowly to be practical. With correlations none, things seem to happen much more quickly with a lot of terms. I have a suspicion that the time it takes to converge with correlations none may only go up in proportion to the number of terms (or maybe the square of it), whereas with correlations all it may be exponential. Do you think this is correct or close to it?
So the other question is: if you have to use correlations none to get convergence in any practical amount of time, how valid are the results if there really are correlations? What exactly are the types of errors you would see?
If you had a hundred or so eval terms in a chess program to optimize, how would you use your own software to best do this?
Both problems are solvable, with a better algorithm (for regression), or stronger regularization (for overfitting) (ie, a stronger prior).
In practice "Correlations none" may seem to converge faster, but what it converges to might not be very good. In my lowdimensional experiments, even when there are no correlations at all, "Correlations none" does not perform significantly better than "Correlations all" in terms of simple regret. So "Correlations all" should be always better whith a high enough number of games.
The problem is for a large number of parameters "high enough" might be extremely high. Right now, it does not seem realistic to use "Correlations all" with more than a dozen parameters, as you noticed. But I will try to improve this.
Rémi
Re: CLOP for Noisy BlackBox Parameter Optimization
If a dozen is the upper practical limit for "correlations all", what would be the upper practical limit for "correlations none"? Does it make a huge difference in this limit or just a small one?Rémi Coulom wrote:In theory, "Correlations all" should always be best in terms of getting a good optimal from the same number of games, but there are at least two problems: first, computing the regression becomes very slow. The number of parameters of the regression for n parameters (of the program) is (n+1)*(n+2)/2. I use Newton's method to do the regression, which has a cost cubic in the number of parameters of the regression, so O(n^6) cost. In practice, this will be too slow when n becomes much larger than a dozen. The second problem is that because the number of parameters increases, there are more risks of overfitting when the number of games is low. This problem should always go away when the number of games becomes higher.lkaufman wrote:I would like to ask for more comments about the "correlations" option. If you are dealing with the evaluation function of a chess program, there are a huge number of terms, and many of them (maybe all) do correlate to some degree with each other. Se clearly the "correct" setting is correlations all. However if you have much more than a dozen or so eval terms, it seems to converge too slowly to be practical. With correlations none, things seem to happen much more quickly with a lot of terms. I have a suspicion that the time it takes to converge with correlations none may only go up in proportion to the number of terms (or maybe the square of it), whereas with correlations all it may be exponential. Do you think this is correct or close to it?
So the other question is: if you have to use correlations none to get convergence in any practical amount of time, how valid are the results if there really are correlations? What exactly are the types of errors you would see?
If you had a hundred or so eval terms in a chess program to optimize, how would you use your own software to best do this?
Both problems are solvable, with a better algorithm (for regression), or stronger regularization (for overfitting) (ie, a stronger prior).
In practice "Correlations none" may seem to converge faster, but what it converges to might not be very good. In my lowdimensional experiments, even when there are no correlations at all, "Correlations none" does not perform significantly better than "Correlations all" in terms of simple regret. So "Correlations all" should be always better whith a high enough number of games.
The problem is for a large number of parameters "high enough" might be extremely high. Right now, it does not seem realistic to use "Correlations all" with more than a dozen parameters, as you noticed. But I will try to improve this.
Rémi

 Posts: 404
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: CLOP for Noisy BlackBox Parameter Optimization
I never tried experiments with more than a dozen parameters. This being said, my guess is that "Correlations none" should work up to at least 100 parameters, probably many more. For a given total amount of games, tuning many parameters all at the same time with "Correlations none" should always be much better than tuning them one by one in sequence. But if your parameters are already well tuned, it might be better to tune just a few to extremely accurate values rather than many to less accurate values that may turn out to be worse than those you already have.lkaufman wrote:If a dozen is the upper practical limit for "correlations all", what would be the upper practical limit for "correlations none"? Does it make a huge difference in this limit or just a small one?
I wrote that tool to tune my Go program. My Go program has millions of parameters, automatically tuned by another learning algorithm. I know they are not optimal, but they are good enough, and I have no hope to improve them with CLOP.
It is frustrating, because I know an optimal value for all these parameters exists, and optimal parameters would make my program significantly stronger. It is simply not realistic to expect being able to find that optimum even with computing power that will be available in the far future. I only tune parameters that have a strong influence on strength with CLOP.
Also: when using "Correlations none", you can try to decorrelate your variables. For instance, values of Knight(N) and Bishop(B) are strongly correlated. Instead of optimizing N and B, you can optimize N+B and NB: they are almost independent.
You should also try to use variables that have a stronger influence on strength. For instance, when optimizing a piecesquare table, it is inefficient to tune the value of each square separately. It might be more efficient to use variables such as distance to border, rank, file, distance to promotion, legal moves on an empty board, .... More generally, you can try to be creative to reduce the dimensionality of the optimization problem.
Rémi

 Posts: 404
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: CLOP for Noisy BlackBox Parameter Optimization
For that part of the question: if parameters are strongly correlated, then, yes, it makes a big difference. I tried "Correlations none" on the "Correlated" problem of the paper, and it performed really bad. I'll post the plot if I have time tomorrow.lkaufman wrote:Does it make a huge difference in this limit or just a small one?
Rémi

 Posts: 404
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: CLOP for Noisy BlackBox Parameter Optimization
Rémi Coulom wrote:For that part of the question: if parameters are strongly correlated, then, yes, it makes a big difference. I tried "Correlations none" on the "Correlated" problem of the paper, and it performed really bad. I'll post the plot if I have time tomorrow.lkaufman wrote:Does it make a huge difference in this limit or just a small one?
Rémi
The dotted blue line is CLOP with "Correlations none", the solid blue line is with "Correlations all". Loglog scale, see paper for details. As you can see "Correlations none" is not a big improvement over "Correlations all" when parameters are completely independent. But when they are actually correlated, "Correlations none" performs really badly.
So I'll have to find a way to scale "Correlations all" to higher dimensions.
Rémi
Re: CLOP for Noisy BlackBox Parameter Optimization
I am making my first forays into parameter tuning so I am trying to read
your paper.
I am not an expert so I have a few naive questions.
When you initially describe the problem (in Section 1.1) f is supposed to take values in the space of Bernouilli trials. Does this mean that the (y_i)_i in Algorithm 1 are contained in {0,1} (i.e. the possible outcomes of
a Bernouilli trial)?
Bernouilli trials are too restrictive even when playing against a single opponent, because of draws.
Is it true that the algorithm remains valid if f takes values in the space
of all probability distributions?
If so what is the algorithm computing? The values x^* such that f(x^*) has maximal expectation value?
Thanks,
Michel
your paper.
I am not an expert so I have a few naive questions.
When you initially describe the problem (in Section 1.1) f is supposed to take values in the space of Bernouilli trials. Does this mean that the (y_i)_i in Algorithm 1 are contained in {0,1} (i.e. the possible outcomes of
a Bernouilli trial)?
Bernouilli trials are too restrictive even when playing against a single opponent, because of draws.
Is it true that the algorithm remains valid if f takes values in the space
of all probability distributions?
If so what is the algorithm computing? The values x^* such that f(x^*) has maximal expectation value?
Thanks,
Michel

 Posts: 404
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: CLOP for Noisy BlackBox Parameter Optimization
Yes, my paper is restricted to 0/1 outcomes, but it is easy to generalize to win/loss/draw. The CLOP software can deal with W/L/D by using the same outcome model as bayeselo.Michel wrote:I am making my first forays into parameter tuning so I am trying to read
your paper.
I am not an expert so I have a few naive questions.
When you initially describe the problem (in Section 1.1) f is supposed to take values in the space of Bernouilli trials. Does this mean that the (y_i)_i in Algorithm 1 are contained in {0,1} (i.e. the possible outcomes of
a Bernouilli trial)?
Bernouilli trials are too restrictive even when playing against a single opponent, because of draws.
If you wish to test against multiple opponents, you can do it with CLOP by using the "Replications" parameter. CLOP will not estimate the strength of each individual opponent, but it will maximize the winning rate over replications.
I suppose it would work well for any kind of outcome distribution, even deterministic continuous values. The convenient aspect of the logistic model is that variance is a function of mean. So there is no need to have a separate model for noise. My guess is that CLOP would work well even with no noise model, by using the variance of samples in order to estimate the confidence interval of the mean. But that remains to be tested.Is it true that the algorithm remains valid if f takes values in the space
of all probability distributions?
If so what is the algorithm computing? The values x^* such that f(x^*) has maximal expectation value?
Thanks,
Michel
Rémi