Quick Performance Test

Discussion of chess software programming and technical issues.

Moderator: Ras

Uri Blass
Posts: 11152
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Quick Performance Test

Post by Uri Blass »

Don wrote:
Henk wrote:If tuning is done by a tuner (algorithm) you can also try to find a better tuning algorithm. Or try to constrain the parameters to be tuned to a smaller domain. I guess not all combinations are allowed.
There has been progress, but nobody has come up with anything better than just playing games for testing a single change. One optimization algorithm that has some merit is CLOP, but guess what? It's based on playing thousands of games.

Consider this. To measure a small ELO change, you must play thousands of games. This is using the most direct measure possible, playing actual games.

Now, is it reasonable to expect that you could measure this just as accurately using some indirect method that requires far less effort?

Of course it isn't. If you want to explore this further, try to figure out how to get more out of your testing procedure. CLOP is one such way, and HG proposed another method called orthogonal multi-tuning. Both of these are based on playing games but try to squeeze more information out of those games.
You may need many games for most changes, but it depends on the change, and there are changes for which it is better not to use games to test them.

For example, if you make a change that gives a 1% speed improvement, then I think playing many games to prove it may be a waste of time; you can simply see that the program searches the same number of nodes slightly faster to a fixed depth in many positions.
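That fixed-depth check can be sketched as follows (a hypothetical helper, not from the post; the (nodes, seconds) pairs are assumed to come from running a fixed-depth search on the same positions with each build):

```python
# Hedged sketch: compare fixed-depth runs of an old and a new build.
# The (nodes, seconds) pairs per position are an assumed data format.

def compare_fixed_depth(old, new):
    """old, new: lists of (nodes, seconds) per position, in the same order.

    Returns the overall speedup (old time / new time). Raises if any node
    count changed, since a pure speedup must not alter the search tree.
    """
    for i, ((n_old, _), (n_new, _)) in enumerate(zip(old, new)):
        if n_old != n_new:
            raise ValueError(f"position {i}: node count changed "
                             f"({n_old} -> {n_new}); not a pure speedup")
    t_old = sum(t for _, t in old)
    t_new = sum(t for _, t in new)
    return t_old / t_new  # > 1.0 means the new build is faster

# Illustrative numbers only:
old = [(123456, 1.00), (654321, 2.02)]
new = [(123456, 0.99), (654321, 2.00)]
speedup = compare_fixed_depth(old, new)
```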

Another example is fixing a bug that you know is relevant only in some rare endgame.

The speed after fixing the bug is the same; you have simply changed an evaluation that is incorrect for some tablebase positions to the correct evaluation in some rare cases.

Again, my opinion is that testing it in many games is a waste of time.
Henk
Posts: 7251
Joined: Mon May 27, 2013 10:31 am

Re: Quick Performance Test

Post by Henk »

So if testing is slow, slow down the generation. I mean: only test promising changes. I would not spend so much computing power on minor changes, even if I could afford it. But I'm in a different position.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Quick Performance Test

Post by Don »

Uri Blass wrote:
Don wrote:
Henk wrote:If tuning is done by a tuner (algorithm) you can also try to find a better tuning algorithm. Or try to constrain the parameters to be tuned to a smaller domain. I guess not all combinations are allowed.
There has been progress, but nobody has come up with anything better than just playing games for testing a single change. One optimization algorithm that has some merit is CLOP, but guess what? It's based on playing thousands of games.

Consider this. To measure a small ELO change, you must play thousands of games. This is using the most direct measure possible, playing actual games.

Now, is it reasonable to expect that you could measure this just as accurately using some indirect method that requires far less effort?

Of course it isn't. If you want to explore this further, try to figure out how to get more out of your testing procedure. CLOP is one such way, and HG proposed another method called orthogonal multi-tuning. Both of these are based on playing games but try to squeeze more information out of those games.
You may need many games for most changes, but it depends on the change, and there are changes for which it is better not to use games to test them.

For example, if you make a change that gives a 1% speed improvement, then I think playing many games to prove it may be a waste of time; you can simply see that the program searches the same number of nodes slightly faster to a fixed depth in many positions.

Another example is fixing a bug that you know is relevant only in some rare endgame.

The speed after fixing the bug is the same; you have simply changed an evaluation that is incorrect for some tablebase positions to the correct evaluation in some rare cases.

Again, my opinion is that testing it in many games is a waste of time.
Naturally we would never play 50,000 games to test a free 1% speedup. I'm only talking about interesting changes where we don't already know in advance that it's an improvement. And that is probably 98% of the tests we do.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Quick Performance Test

Post by Don »

Henk wrote:So if testing is slow, slow down the generation. I mean: only test promising changes. I would not spend so much computing power on minor changes, even if I could afford it. But I'm in a different position.
We do not make any changes that we do not test, but we follow the basic principle you outline here by not MAKING changes that we don't consider promising.

We are in a position where we don't have promising 50 ELO ideas though.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Andres Valverde
Posts: 596
Joined: Sun Feb 18, 2007 11:07 pm
Location: Almeria. SPAIN
Full name: Andres Valverde Toresano

Re: Quick Performance Test

Post by Andres Valverde »

Don wrote: Right now I am running a test at a time control of 3s+0.3 fischer and getting about 36 games per minute using a 6 core i7 and utilizing all 6 cores.
At that time control an average game could last (60 moves x 0.3 secs/move + 3 secs) x 2 = 42 secs, if not more. This is about 1.43 games/min/core, or 8.57 games/min using 6 cores. Am I missing something?
Saludos, Andres
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Quick Performance Test

Post by Don »

Andres Valverde wrote:
Don wrote: Right now I am running a test at a time control of 3s+0.3 fischer and getting about 36 games per minute using a 6 core i7 and utilizing all 6 cores.
At that time control an average game could last (60 moves x 0.3 secs/move + 3 secs) x 2 = 42 secs, if not more. This is about 1.43 games/min/core, or 8.57 games/min using 6 cores. Am I missing something?
It was a misprint. The test is actually 3s + 0.03s, not 3s + 0.3s.
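With the corrected time control, Andres's formula (60 moves per side assumed, as in his post) roughly matches the reported throughput:

```python
# Re-running the estimate with 3s + 0.03s instead of 3s + 0.3s:
secs_per_game = (60 * 0.03 + 3) * 2        # ~9.6 s per game
games_per_min_core = 60 / secs_per_game    # ~6.25 games/min/core
games_per_min_6c = games_per_min_core * 6  # ~37.5, close to the observed 36
```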

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Quick Performance Test

Post by lucasart »

Henk wrote:Developing a program means generate and test.
If generating is fast, testing should be fast too. Is there a quick way to test whether your chess program has improved?
By "generate" do you mean writing code?
Henk wrote: Search depth and number of nodes does not say much.
In general no, but it depends. In the early stages of development, when you have a basic PVS with no search instabilities, the more you improve your time to depth, the more you improve your engine. This becomes less and less clear as you start to introduce some "tradeoffs".
Henk wrote: Playing games takes much time.
Setting up a database with chess positions is a lot of developing effort.
Maybe, but it's the only way. There are two kinds of code patches:

1/ Non-functional patches: here you need to test for bugs. You simply set up a deterministic run of a few positions at a given depth and look at the number of nodes. If it hasn't changed, then you can be almost sure that no bug was introduced.

2/ Functional changes: you have to play games. There's no other way to find out whether your engine is stronger than to play games. Forget about the "database of chess positions": it is not necessarily time-consuming (you can find an EPD from an external source), but it's completely the wrong way to tune your engine. If you tune your engine to solve chess problems, you are optimizing for the wrong target. Playing games is not that bad. Here's the testing methodology I apply with an 8-core machine (I play 7 games concurrently, as 8 introduces a very measurable amount of noise):
* 5000 games in 5"+0.05": if the result is < 50% (regardless of error bar) I stop here and reject the patch. That test takes on average (with aggressive draw and resign adjudication in cutechess-cli) 2*(5"+0.05"*60)/7*5000 = 11429" = 3.17h.
* 10000 games in 10"+0.1" (if the previous pre-selection test was successful). That one takes 2*(10"+0.1"*60)/7*10000 = 45714" = 12.7h. I just leave it overnight or start it in the morning before going to work.
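The two wall-clock estimates above can be checked mechanically (same assumptions as the post: 60 moves per side, 7 concurrent games):

```python
# Sketch of the duration formula 2*(base + inc*moves)*games/concurrency,
# with the post's assumption of 60 moves per side.
def hours(base_s, inc_s, games, concurrency=7, moves=60):
    per_game_s = 2 * (base_s + inc_s * moves)  # both players' clocks run
    return per_game_s * games / concurrency / 3600

stage1 = hours(5, 0.05, 5000)    # pre-selection stage, ~3.17 h
stage2 = hours(10, 0.1, 10000)   # confirmation stage, ~12.7 h
```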
If you use git and create branches for your test patches, it forces you to be organized, and that really pays in terms of productivity. While the test is running I create other branches. Sometimes I stop tests early when they are clearly not going well, and I kill the branch and run the next one etc. Eventually one branch is good and I merge it into the master branch etc.

In short: NO there is no easy way.

PS: SPRT is provably the best method, but the cutechess-cli version uses a hardcoded drawelo instead of estimating it out of sample. I need to hack cutechess-cli before I can use it (drawelo has a big impact on stopping time).
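For illustration, a minimal SPRT sketch (not cutechess-cli's code) using the fixed-drawelo BayesElo model the post describes; elo0/elo1 are the hypotheses in BayesElo units, and alpha/beta the allowed error rates:

```python
from math import log

def bayeselo_probs(elo, drawelo):
    """Win/draw/loss probabilities under the BayesElo model."""
    win = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
    return win, 1.0 - win - loss, loss

def sprt_llr(wins, draws, losses, elo0=0.0, elo1=5.0, drawelo=200.0):
    """Log-likelihood ratio of H1 (elo1) vs H0 (elo0) for a W/D/L record."""
    w0, d0, l0 = bayeselo_probs(elo0, drawelo)
    w1, d1, l1 = bayeselo_probs(elo1, drawelo)
    return (wins * log(w1 / w0) + draws * log(d1 / d0)
            + losses * log(l1 / l0))

def sprt_decision(llr, alpha=0.05, beta=0.05):
    if llr >= log((1 - beta) / alpha):   # upper bound, ~ +2.94 at 0.05/0.05
        return "accept H1"               # patch is an improvement; stop
    if llr <= log(beta / (1 - alpha)):   # lower bound, ~ -2.94
        return "accept H0"               # no improvement; stop
    return "continue"                    # keep playing games
```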
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: Quick Performance Test

Post by tpetzke »

There are many algorithms for finding local maxima, but for global maxima I know only simulated annealing and branch and bound. Well, if testing is the bottleneck, I think I can certainly forget neural networks.
What do neural networks have to do with a possible testing bottleneck?

With testing, you change a single small thing in your program and then run a lot of games to see whether the change was good.

If you use an optimization algorithm, the algorithm takes some time to produce a result. I'm currently using a genetic algorithm, and it takes about 2 weeks to converge to a solution. At the end I also perform a single test of whether the produced solution is stronger than my original solution.

But then I have not changed only a small thing in my program; I assign new values to all of my evaluation parameters at the same time. Others use CLOP for that, but I'd rather use my own stuff.
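As a toy illustration of that kind of run (not Thomas's actual setup): the fitness function here is a stand-in for match results, so the loop converges in seconds instead of two weeks, but the select/crossover/mutate structure is the same.

```python
import random

random.seed(1)
TARGET = [100, 320, 330, 500, 900]   # hypothetical "best" eval parameters

def fitness(genome):
    # Stand-in for "play games and measure strength": negative squared error.
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))

def evolve(pop_size=40, generations=200, mut=15):
    pop = [[random.randint(0, 1000) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]        # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TARGET))
            child = a[:cut] + b[cut:]                   # one-point crossover
            i = random.randrange(len(child))
            child[i] += random.randint(-mut, mut)       # single-gene mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```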

Thomas...