Utilities for testing/tuning.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

eric.oldre
Posts: 18
Joined: Thu Jun 19, 2014 5:07 am
Location: Minnesota, USA

Utilities for testing/tuning.

Post by eric.oldre »

I have been using two pieces of software to test and tune NoraGrace.

When I want to test changes in strength, I typically run a gauntlet in the Arena GUI. Usually around 150-200 matches at 4:00 game against 4-5 opponents.

When I want to tune a specific parameter I use the NoraGrace.Engine.Tune utility. I based this utility on the Stockfish tuning method described on the Chess Programming Wiki.

https://chessprogramming.wikispaces.com ... ing+Method

My utility allows me to run 6 games in parallel (6 core cpu) between versions of NoraGrace with various parameters changed. It runs each of the competitors in the same process. However as a downside I need to make whatever parameter I want to change dynamic and create an API to change it. Additionally it can only run games between different versions of NoraGrace and not any other engine. This works OK for tuning eval parameters, but not for much else.

As we all discover eventually, one of the main things holding back progress is the ability to get meaningful test results back in a reasonable amount of time. So my question is if there are utilities out there to allow you to run tournaments in parallel on the same machine? Or even across different machines?

I'm sure that this is a problem a few of you have solved already. (and probably discussed at length in another thread, but I haven't found it)

Eric
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: Utilities for testing/tuning.

Post by tpetzke »

When I want to test changes in strength, I typically run a gauntlet in the Arena GUI. Usually around 150-200 matches at 4:00 game against 4-5 opponents.
I (and a lot of others) use cutechess-cli, but I run about 100 times more games (usually 16,000) to test changes. With only 200 games a change must be worth about 100 Elo before you can draw any conclusions.

100 Elo is a lot for a single change.
Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
User avatar
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Utilities for testing/tuning.

Post by Ajedrecista »

Hello Eric:

Welcome to TalkChess!

I fear I did not fully understand you. I was going to recommend cutechess-cli, but Thomas was faster than me. It is probably better to use cutechess-cli 0.5.1 than 0.6.0 due to a bug that I do not remember right now. It is a command-line interface (this is why it is named cli). Another option is the LittleBlitzer GUI (I think the latest version is 2.74).

In cutechess-cli, the number of parallel games is set with the -concurrency option. If you have an N-core machine, I would recommend -concurrency N-1 (for example, -concurrency 5 on a 6-core PC, leaving the remaining core for the OS).
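For reference, a gauntlet invocation along these lines might look like the sketch below. The engine paths, time control, and game count are placeholders, and exact option names can differ between cutechess-cli versions, so check the built-in help of your copy:

```shell
# Hypothetical gauntlet: NoraGrace vs one opponent, 5 games at a time.
# Engine binaries, tc=40/60 (40 moves in 60 s) and the game count are
# placeholders, not values from the thread.
cutechess-cli \
  -engine cmd=./NoraGrace proto=uci \
  -engine cmd=./Opponent proto=uci \
  -each tc=40/60 \
  -games 1000 \
  -concurrency 5 \
  -pgnout gauntlet.pgn
```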

I agree with Thomas that running more games reduces the error bars, which are the uncertainties in the Elo estimate. You can take |error bar| as proportional to 1/sqrt(games). One option is to reduce the time control in a first stage, to quickly filter out very bad patches, and then, if a change is promising, play at longer time controls to see whether the change scales well or vanishes. I think that changes in king safety are especially sensitive to the time control. Please do not stop a fixed-game test before the end; otherwise you bias the result, enlarge the error bars, etc.
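The 1/sqrt(games) rule can be made concrete with a quick back-of-envelope calculation. This sketch ignores draws (which in practice shrink the error bar further) and uses the slope of the Elo curve at a 50% score, roughly 695 Elo per unit of score:

```shell
# 95% error bar in Elo for an N-game match near a 50% score.
# Assumptions: per-game score std dev <= 0.5 (no draws), and
# dElo/dscore ~= 695 at 50% (= 400/ln(10) / 0.25).
errbar() {
  awk -v n="$1" 'BEGIN { printf "%.1f\n", 1.96 * 695 * 0.5 / sqrt(n) }'
}
errbar 200     # prints 48.2
errbar 16000   # prints 5.4
```

This is why 200 games can only resolve changes on the order of 50-100 Elo, while 16,000 games gets the resolution down to a handful of Elo.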

You can also think about using SPRT (sequential probability ratio tests) like in Stockfish testing framework, and not only matches with a fixed number of games.
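To illustrate the idea, here is a deliberately simplified SPRT log-likelihood ratio: it counts only wins and losses (draws ignored) and is not the BayesElo-based version the Stockfish framework actually uses. The elo0/elo1 hypotheses and the W/L counts in the usage line are made-up numbers:

```shell
# Simplified SPRT LLR on wins/losses only. H0: strength gain = elo0,
# H1: strength gain = elo1. With alpha = beta = 0.05, stop and reject
# when LLR < ln(0.05/0.95) ~= -2.94, stop and accept when
# LLR > ln(0.95/0.05) ~= 2.94; otherwise keep playing games.
sprt_llr() {  # args: wins losses elo0 elo1
  awk -v w="$1" -v l="$2" -v e0="$3" -v e1="$4" 'BEGIN {
    p0 = 1 / (1 + 10^(-e0/400))   # win probability under H0
    p1 = 1 / (1 + 10^(-e1/400))   # win probability under H1
    printf "%.3f\n", w * log(p1/p0) + l * log((1-p1)/(1-p0))
  }'
}
sprt_llr 600 550 0 5   # small positive LLR: not yet conclusive
```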

Having said that: I am neither a programmer nor a tester, so I have no experience in this area. I am a mere computer chess aficionado.

I read your first post here and I wish you the best.

Regards from Spain.

Ajedrecista.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Utilities for testing/tuning.

Post by bob »

eric.oldre wrote:I have been using two pieces of software to test and tune NoraGrace.

When I want to test changes in strength, I typically run a gauntlet in the Arena GUI. Usually around 150-200 matches at 4:00 game against 4-5 opponents.

When I want to tune a specific parameter I use the NoraGrace.Engine.Tune utility. I based this utility on the Stockfish tuning method described on the Chess Programming Wiki.

https://chessprogramming.wikispaces.com ... ing+Method

My utility allows me to run 6 games in parallel (6 core cpu) between versions of NoraGrace with various parameters changed. It runs each of the competitors in the same process. However as a downside I need to make whatever parameter I want to change dynamic and create an API to change it. Additionally it can only run games between different versions of NoraGrace and not any other engine. This works OK for tuning eval parameters, but not for much else.

As we all discover eventually, one of the main things holding back progress is the ability to get meaningful test results back in a reasonable amount of time. So my question is if there are utilities out there to allow you to run tournaments in parallel on the same machine? Or even across different machines?

I'm sure that this is a problem a few of you have solved already. (and probably discussed at length in another thread, but I haven't found it)

Eric
First, this was the problem that led to the Crafty personality stuff. Cozzie wanted to use an annealing approach to optimize parameter values, but he needed a UI protocol that allowed it (earlier versions of Crafty required source modifications to change anything).

However, that personality approach is pretty messy, so in Crafty I adopted a single special-purpose command "scale". It is recognized in option() and I modify it whenever I want to tune something. For example, if I have a new idea on using a dynamic reduction where it is computed as R = a * b + c, I often need to at least scale such values so that they don't grow too large. So I might change that to R = (a * b + c) / x, but then the question is what is the best x value.

I make option() spot the scale command, which is usually of the form "scale=n" for this kind of case, take the "n" value, and stuff it into x above. Now I can run multiple matches in an automated way; all that is needed is to either feed each execution instance of Crafty a "scale=n" command to set x for that run, or add it to the end of the .craftyrc for that specific run.

I can then run a series of tests easily and hands-off. My testing shell script has something like this at the top of the main loop:

foreach scale (2.0 2.25 2.5 2.75 3.0 etc)

and then I tell the matchmaker to pass the string "scale=n" (where n comes from the above choices) to the Crafty executables. It is a bit more complicated because I want to save all the PGN so that I can look at anything during each run, so there is some clever file naming and such that is also done. But to set up a specific test, I just need a version of Crafty where the "scale=n" command sets whatever I want to test to n...
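The driver loop above can be sketched end to end in POSIX sh (Bob's is csh). The `run_match` command, the `--init` flag, and the PGN naming scheme are hypothetical stand-ins; this version only prints what each run would execute:

```shell
# Sketch of a scale-sweep driver. run_match and its flags are made up;
# substitute your own match runner. One PGN file per scale setting so
# results can be inspected afterwards.
gen_runs() {
  for scale in 2.0 2.25 2.5 2.75 3.0; do
    pgn="match-scale-$scale.pgn"
    echo "run_match --engine crafty --init scale=$scale --pgnout $pgn"
  done
}
gen_runs
```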
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Utilities for testing/tuning.

Post by hgm »

WinBoard can run independent tournaments in parallel, or can run several games for the same tournament in parallel.
eric.oldre
Posts: 18
Joined: Thu Jun 19, 2014 5:07 am
Location: Minnesota, USA

Re: Utilities for testing/tuning.

Post by eric.oldre »

Thanks! cutechess-cli was pretty much just what I was looking for.

16k games to test a change? At what time control? Do you distribute your testing over multiple machines?