How many games do you use in order to tune one parameter

pferd · Post by **pferd** » Tue Dec 02, 2014 11:43 pm

Hey,

how many games do you play, on average, to tune one parameter? In fishtest they tried using 200k games in total to tune 12 parameters and it ended up badly?

Ferdy · Post by **Ferdy** » Wed Dec 03, 2014 3:46 am

pferd wrote:Hey,

how many games do you play, on average, to tune one parameter?

It depends on the number of parameters and also on the parameter itself: the default value may already be optimal or near optimal, in which case it takes a ton of games to see any improvement from tuning.
Say I am tuning only my bonus for a passed pawn on the 7th rank, and my default value is 210 cp. One way of tuning is a self-test: default 210 vs a tuned value of 220. Take 5k unique positions as your test positions, then run an initial match of 10k games (5k positions x 2, with colors reversed). If 220 wins, you can increase it to 225, for example, and test another 10k games. If 220 loses to the default 210, you may try 215 for 10k games (or 225 for 10k games); if that still loses, you may try 205 for 10k games. If 205 beats the default 210, you may decrease it to 200 and check whether it improves further. The 5k test positions are just an example; more is better, of course.
So how many games is that so far? Say 10k games up from the default and 10k games down from the default: 20k games minimum.
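The step-up/step-down procedure above can be sketched as a small hill-climbing loop. This is only an illustration under simplifying assumptions: `tune_by_self_play` and the `play_match` callback are hypothetical names, each candidate is tested against the current best value rather than always against the original default, the step size is fixed, and "won the match" is reduced to a positive net win count with no statistical check.

```python
from typing import Callable

def tune_by_self_play(
    default: int,
    step: int,
    play_match: Callable[[int, int], int],
    max_rounds: int = 10,
) -> int:
    """Hill-climb one parameter using self-play matches.

    play_match(candidate, baseline) is a hypothetical callback that runs
    one match (for example 10k games) and returns the candidate's net wins.
    """
    best = default
    for direction in (+1, -1):           # try upward first, then downward
        improved = False
        for _ in range(max_rounds):
            candidate = best + direction * step
            if play_match(candidate, best) > 0:
                best = candidate         # candidate won: keep moving this way
                improved = True
            else:
                break                    # candidate lost: stop in this direction
        if improved:
            break                        # upward search paid off: skip downward
    return best
```

Every call to `play_match` costs a full match, so even the shortest run (one failed step up, one failed step down) already uses two matches, which matches the 20k-game minimum above.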

In practice it is more complicated. For example, in 210 vs 220, if after 10k games the 220 lead is only 5 games (+5 net wins), you may extend the test by another 5k games, for example, to see whether the default 210 recovers or the 220 lead grows.
There are also cases where, after only 5k of the scheduled 10k games, 220 leads by, say, 175 net wins. In that case you may calculate the margin of error, or perhaps the LOS, to see whether it is already safe to stop the match, or just decide to stop it and conclude that there is an improvement. Conversely, if 220 is behind by a net 200 games after 3k of the scheduled 10k games, you might stop the match at 3k games and try another parameter value, say 225, rather than continue testing 220.
The above is for only one TC, say 40 moves/30s repeating. After finding a better parameter value at that TC, you may repeat the test at 40 moves/60s repeating. After that there are also gauntlet tests, which are different from self-tests. So I probably play a minimum of 20k games, with or without a parameter improvement.
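The LOS (likelihood of superiority) mentioned above is commonly approximated from the win and loss counts alone, since draws say nothing about which side is stronger. A minimal sketch, assuming the usual normal approximation; the function name `los` and the win/loss/draw split are hypothetical, chosen only to produce a 175-game lead after 5k games:

```python
import math

def los(wins: int, losses: int) -> float:
    """Approximate likelihood of superiority from decisive games only.

    Uses the normal approximation LOS = Phi((W - L) / sqrt(W + L)).
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    z = (wins - losses) / math.sqrt(decisive)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical split of 5k games: 1500 wins, 1325 losses, 2175 draws (+175 net).
print(f"LOS = {los(1500, 1325):.4f}")
```

With that assumed split the LOS comes out above 99.9%. Keep in mind that checking the LOS repeatedly during a match and stopping as soon as it looks good inflates the false-positive rate compared with a fixed-length test.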

tpetzke · Post by **tpetzke** » Wed Dec 03, 2014 10:02 am

I usually run 16k games in a regression test. I use early stopping only if I intend to reject the patch. So I'm less paranoid about rejecting a maybe good patch than about accepting a maybe bad patch.

Thomas...

bob · Post by **bob** » Wed Dec 03, 2014 3:18 pm

pferd wrote:Hey,

how many games do you play, on average, to tune one parameter? In fishtest they tried using 200k games in total to tune 12 parameters and it ended up badly?

It depends on the significance of the change. If it is a 20 Elo change, you won't need nearly as many games as you need to recognize a 1 Elo gain. To measure 1 Elo you have to go well beyond 100K games.
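A rough way to see the scale involved: under the logistic Elo model, an Elo difference maps to an expected score, and the number of games needed grows with the inverse square of the score edge. The sketch below is an estimate under stated assumptions, not an exact test design: `games_needed` is a hypothetical helper, `z = 1.96` corresponds to a two-sided 95% level, and `sigma = 0.5` is the worst-case per-game standard deviation (no draws).

```python
import math

def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff: float, z: float = 1.96, sigma: float = 0.5) -> int:
    """Games required for the score edge to equal z standard errors.

    sigma is the per-game standard deviation of the result; 0.5 is the
    worst case (no draws), and draws make it smaller.
    """
    edge = expected_score(elo_diff) - 0.5
    return math.ceil((z * sigma / edge) ** 2)

for diff in (20, 1):
    print(diff, "Elo:", games_needed(diff), "games")
```

Under these assumptions a 20 Elo change needs on the order of a thousand games, while a 1 Elo change needs several hundred thousand. A realistic draw rate lowers `sigma` and therefore the count, but the 1 Elo figure stays far above 100K.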

jdart · Post by **jdart** » Wed Dec 03, 2014 4:42 pm

I don't accept changes unless they pass a verification test run with 36,000 games.

I have, though, used a smaller test run (8800 games) to explore possible changes.

--Jon

AlvaroBegue · Post by **AlvaroBegue** » Wed Dec 03, 2014 4:59 pm

jdart wrote:I don't accept changes unless they pass a verification test run with 36,000 games.

I have, though, used a smaller test run (8800 games) to explore possible changes.

--Jon

To combine this with what Bob was saying about 1-Elo improvements, a version that is 1-Elo stronger will fail the 36,000-game verification test (defined simply as who wins more points) about 29% of the time, if I did my math right.
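The figure can be checked with a normal approximation. This sketch assumes the logistic Elo model and no draws (per-game standard deviation `sigma = 0.5`), and ignores the small chance of an exact tie; `prob_fewer_points` is a hypothetical helper name.

```python
import math

def prob_fewer_points(elo_diff: float, games: int, sigma: float = 0.5) -> float:
    """Probability that the stronger side scores below 50% over `games` games."""
    edge = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0)) - 0.5
    z = edge * math.sqrt(games) / sigma   # edge measured in standard errors
    return 0.5 * (1.0 + math.erf(-z / math.sqrt(2.0)))

print(f"{prob_fewer_points(1.0, 36_000):.3f}")
```

Under the no-draw assumption this gives roughly 0.29, in line with the estimate above. A nonzero draw rate reduces `sigma`, which lowers the failure probability somewhat.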

cdani · Post by **cdani** » Wed Dec 03, 2014 8:33 pm

Hi.

For a relatively weak engine like Andscacs I'm able to find patches that gain 5-10 Elo fairly often, so mostly I don't spend much time tuning things that require a lot of games.

I prefer to do more tests, even if some are regressions that I don't detect, and try to gain more Elo in less time. It's also a lot more fun.

Anyway, most of the fine tuning you do will be invalidated when you change something else that, many times, you simply don't notice affects the previous tuning. Or you will simply throw it away yourself when you decide to rewrite the code in a different way.

So I work more with quantity of patches than quality.

It's always the typical compromise between the good and the optimal.

Of course it's not very professional, but it's for fun that I do all this.

PK · Post by PK » Wed Dec 03, 2014 10:09 pm

Most of the time I use a very unscientific approach: "whatever number of games can be played overnight", usually reaching only something between 1000 and 2000. For critical patches (usually major changes in search structure or in king safety) I then run another series of games at longer time control. Plus there is another test played after applying a couple of patches - if it fails, I rerun previous tests, looking for "a stinking egg", i.e. a wrong patch.

What's even more unscientific, I sometimes accept patches that are slight regressions but are needed for another reason. For example, I could not solve the conflict between using lazy eval and tunable eval options - somehow these features didn't go together well. Rodent 1.7 will not have lazy eval, even though it costs about 20 Elo and last month has been spent regaining the lost ground. On the plus side, after this change I was able to produce two or three distinct personalities holding their own against default (with the parameters that repeatedly failed with lazy eval on).

The same was the case with verified null move - a slight regression, but also slightly fewer short losses and greater awareness of mate threats. This loss has been recovered in two days.

This is a bit like squashing bugs - if I consider the difficulty with creating alternative personalities a bug, then it has to be fixed. And it might well happen that a bugfix decreases playing strength - at least temporarily, as the engine has been tuned to live with the bug.

cdani · Post by **cdani** » Wed Dec 03, 2014 11:26 pm

I forgot to tell the number of games.

When Andscacs was rated below 2600, roughly up to 1500 games per test, many times less than 1000. I don't recommend this to anyone. It gained a lot of rating quite fast, but at the expense of many bad patches. Of course I was warned about it by many of you.

It was mostly gauntlets.

After that I settled on something like 2000 for some months, and I slowly increased it to what I do now with this new version: roughly 4000-6000 games of self-test at 5+.05. Then if this works (or in any case if I think it's a patch that does better at longer time controls), maybe another 4000 or more at 15+.04. Sometimes, if I think it's very sensitive to time control, some 2000 or more at 30+.04. And when I think it's necessary, also a gauntlet against other engines at different time controls.

I can also run tests of maybe 1000-2000 games on another computer with next-to-test patches, to discard nonsense quickly.

op12no2 · Post by **op12no2** » Thu Dec 04, 2014 3:13 pm

To aid tuning my JavaScript engine I created this today:

http://op12no2.me/toys/lozzadev/tune.htm

Only tested in Chrome.

It fires up two web workers that endlessly play each other. Currently tuning mobility weight. Possible values down the left with overall score using those values next to them. Randomness in games is provided by random first move each side and random move time within a range. This kinda thing works well to fit params to infection models so I figured I'd give it a go. Not sure it'll work as is, but it's a lot of fun.
