Long game vs short game testing

vladstamate · Post by **vladstamate** » Thu Apr 08, 2010 6:56 pm

Hi all,

I have a seemingly simple question: If I run n games at 20sec per game and I also run n games at 1min per game, should I trust the results of the second test more?

Intuitively I would say yes, since the second test allows more "game time", so the chance of exposing weaknesses is increased. Also, the engines being tested can reach a larger depth, so some techniques might fire more often (those that have if(depth>value) in them).

Can this be generalized so that we always trust longer time testing?

Here is a more interesting question: Would you trust a 3000-game test @ 20sec per game or a 1000-game test @ 1min per game? On average they both should take roughly the same time (about 90000sec or 1500min or 25 hours).

Am I correct in assuming the above?

Regards,
Vlad.

mainsworthy · Post by **mainsworthy** » Thu Apr 08, 2010 7:27 pm

Hi Vlad

I think it depends on the competition entered: blitz, bullet, or long. Human games also get different Elo ratings at different clock times, so Elo must be the universal measure of strength.

Hope this helps

vladstamate · Post by **vladstamate** » Thu Apr 08, 2010 7:32 pm

Hi,

This is mainly for engine vs engine testing via lots of games to get a more accurate Elo difference; no human games are involved.

Regards,
Vlad.

Sven · Post by **Sven** » Thu Apr 08, 2010 7:47 pm

vladstamate wrote:Hi all,

I have a seemingly simple question: If I run n games at 20sec per game and I also run n games at 1min per game, should I trust the results of the second test more?

Intuitively I would say yes, since the second test allows more "game time", so the chance of exposing weaknesses is increased. Also, the engines being tested can reach a larger depth, so some techniques might fire more often (those that have if(depth>value) in them).

Can this be generalized so that we always trust longer time testing?

Here is a more interesting question: Would you trust a 3000-game test @ 20sec per game or a 1000-game test @ 1min per game? On average they both should take roughly the same time (about 90000sec or 1500min or 25 hours).

Am I correct in assuming the above?

Regards,
Vlad.

Playing more games results in smaller error bars on the Elo numbers you get from the games. So in general, playing more games should be preferred over playing longer games. In your case, 20 sec/game and 1 min/game are not very far apart, so the engines will not play much stronger in the longer games, and I see no reason to choose the longer games.
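To make the error-bar argument concrete, here is a minimal sketch of how the uncertainty of an Elo estimate shrinks with the number of games. It uses the usual logistic Elo formula and a normal approximation for the score; the function names and the win/draw/loss counts are made up for illustration and do not come from any test in this thread.

```python
import math

def elo_from_score(score: float) -> float:
    """Convert a match score fraction into an Elo difference (logistic model)."""
    # Clamp so that a 0% or 100% score does not divide by zero.
    score = min(max(score, 1e-9), 1.0 - 1e-9)
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (elo, low, high): point estimate and approximate 95% bounds."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score, with outcomes valued 1, 0.5 and 0.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    low = elo_from_score(score - z * stderr)
    high = elo_from_score(score + z * stderr)
    return elo_from_score(score), low, high

# Hypothetical results with identical proportions at 1000 and 3000 games.
for w, d, l in [(350, 330, 320), (1050, 990, 960)]:
    elo, low, high = elo_interval(w, d, l)
    print(f"{w + d + l} games: {elo:+.1f} Elo, interval [{low:+.1f}, {high:+.1f}]")
```

With the same score percentage, tripling the number of games narrows the interval by roughly a factor of sqrt(3), which is the trade-off being weighed against longer games.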

60000 games at 1 sec/game could be different since the thinking time per move becomes very short. Weaker engines might search only 3 or 4 plies deep within these approx. 10-20 msec per move so that tactical errors could begin to dominate the results, even though both opponents have the same conditions.

Maybe you want to play 2000 games at 10 sec/game, which can be done overnight on 1 CPU so you don't have to wait a full day for the result.

Sven

bob · Post by **bob** » Fri Apr 09, 2010 5:27 am

vladstamate wrote:Hi all,

I have a seemingly simple question: If I run n games at 20sec per game and I also run n games at 1min per game, should I trust the results of the second test more?

Intuitively I would say yes, since the second test allows more "game time", so the chance of exposing weaknesses is increased. Also, the engines being tested can reach a larger depth, so some techniques might fire more often (those that have if(depth>value) in them).

Can this be generalized so that we always trust longer time testing?

Here is a more interesting question: Would you trust a 3000-game test @ 20sec per game or a 1000-game test @ 1min per game? On average they both should take roughly the same time (about 90000sec or 1500min or 25 hours).

Am I correct in assuming the above?

Regards,
Vlad.

Two answers, based on tens of millions of games.

1. For eval changes, and most search changes, short or long time controls won't matter. If you are better at one, you are better at the other. I've verified this by playing games from 10sec + 0.1sec increment to 60min + 60sec increment.

2. For timing changes, and some search changes that greatly alter the shape of the tree, this can change. A change to time usage will be influenced by the time control used, and really needs to be tuned at the time control you plan to play most of the time. Some search changes can cause a tree explosion at deeper depths if extensions can "cascade" down through the tree. As you go deeper, you extend more and before you know it, you are hung at some depth with no hope of getting to the next ply in finite time.
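As a toy illustration of the "cascade" in point 2, the sketch below counts nodes in an artificial uniform tree where the first child of every node receives a one-ply extension. It is not Crafty's code or any engine's search; `count_nodes`, the branching factor, the per-path extension cap `max_ext`, and the hard `max_ply` limit are all assumptions chosen to keep the example small. Capping extensions per path is shown here only as one commonly used safeguard, not something described in this thread.

```python
def count_nodes(depth, ply=0, ext_used=0, branching=3, max_ext=None, max_ply=10):
    """Count nodes in a toy tree where the first child of each node is extended."""
    # Without max_ply, an uncapped extension on every ply would never terminate.
    if depth <= 0 or ply >= max_ply:
        return 1
    nodes = 1
    for child in range(branching):
        extend = 1 if child == 0 else 0
        # Optional safeguard: stop extending once this path has used its budget.
        if max_ext is not None and ext_used >= max_ext:
            extend = 0
        nodes += count_nodes(depth - 1 + extend, ply + 1, ext_used + extend,
                             branching, max_ext, max_ply)
    return nodes

for depth in range(1, 6):
    plain = count_nodes(depth, max_ext=0)     # no extensions at all
    capped = count_nodes(depth, max_ext=2)    # at most two extensions per path
    uncapped = count_nodes(depth)             # extensions limited only by max_ply
    print(f"depth {depth}: plain={plain} capped={capped} uncapped={uncapped}")
```

In the uncapped case the extended line never loses depth, so only the hard ply limit stops it. That is the toy equivalent of being stuck at one iteration with no hope of reaching the next ply, and it only becomes visible once the nominal depth is large enough, i.e. at longer time controls.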

Osipov Jury · Post by **Osipov Jury** » Fri Apr 09, 2010 7:52 am

When I tested the evaluation function of Rybka 3, I noticed that it has almost no effect at short time controls. But at long time controls the test results improve significantly.

bob · Post by **bob** » Fri Apr 09, 2010 9:16 pm

Osipov Jury wrote: When I tested the evaluation function of Rybka 3, I noticed that it has almost no effect at short time controls. But at long time controls the test results improve significantly.

Yes, but you didn't read what I wrote. If you want to make a change to the evaluation, and test at either fast or slow time controls, 99+% of the time, the change will show up better at either. That is if A is original, and A' is modified, and the change is good, A' will generally be N elo better than A at any time control. What the actual elo is is a different matter, and a program can certainly do better or worse against a group of opponents, depending on the time control, but no matter, A' will almost always finish above A, which is the critical point for testing.

Does it matter whether A is 2500 and A' is 2510, or whether A is 2600 and A' is 2610? In either case, A' is better... And that's what we want to know.
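If the only question is whether A' finishes above A, one way to put a number on it is the likelihood of superiority (LOS) computed from a head-to-head match. The sketch below uses the common normal approximation in which draws cancel out; the function name and the sample counts are illustrative assumptions, not results from this thread.

```python
import math

def likelihood_of_superiority(wins: int, losses: int) -> float:
    """Approximate probability that the side with `wins` is the stronger one."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, so no evidence either way
    # Normal approximation: draws carry no information about which side is better.
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

# Hypothetical match results for A' against A at two sample sizes.
for wins, losses in [(120, 100), (1200, 1000)]:
    los = likelihood_of_superiority(wins, losses)
    print(f"A' wins {wins}, loses {losses}: LOS = {los:.3f}")
```

The absolute ratings never enter the calculation, which matches the point above: what matters for testing is the ordering of A and A', not whether they sit at 2500 or 2600.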

Uri Blass · Post by **Uri Blass** » Sat Apr 10, 2010 6:29 pm

bob wrote:
Osipov Jury wrote: When I tested the evaluation function of Rybka 3, I noticed that it has almost no effect at short time controls. But at long time controls the test results improve significantly.
Yes, but you didn't read what I wrote. If you want to make a change to the evaluation, and test at either fast or slow time controls, 99+% of the time, the change will show up better at either. That is if A is original, and A' is modified, and the change is good, A' will generally be N elo better than A at any time control. What the actual elo is is a different matter, and a program can certainly do better or worse against a group of opponents, depending on the time control, but no matter, A' will almost always finish above A, which is the critical point for testing.

Does it matter whether A is 2500 and A' is 2510, or whether A is 2600 and A' is 2610? In either case, A' is better... And that's what we want to know.

I understand from Jury Osipov that the case he is talking about is one where A is 2500 and A' is 2510 at a fast time control, whereas A is 2500 and A' is 2550 at a long time control.

He claims that the changes in the evaluation of Rybka help more at long time controls than at short time controls.

I believe that evaluation knowledge that is relevant for the middlegame but makes the program significantly slower can make the program worse at fast time controls but better at long time controls. I wonder how many changes you have tested that cause a significant reduction in the speed of Crafty (making Crafty at least 10% slower in nodes per second).

Uri

vladstamate · Post by **vladstamate** » Sat Apr 10, 2010 6:44 pm

bob wrote: Two answers, based on tens of millions of games.

1. For eval changes, and most search changes, short or long time controls won't matter. If you are better at one, you are better at the other. I've verified this by playing games from 10sec + 0.1sec increment to 60min + 60sec increment.

2. For timing changes, and some search changes that greatly alter the shape of the tree, this can change. A change to time usage will be influenced by the time control used, and really needs to be tuned at the time control you plan to play most of the time. Some search changes can cause a tree explosion at deeper depths if extensions can "cascade" down through the tree. As you go deeper, you extend more and before you know it, you are hung at some depth with no hope of getting to the next ply in finite time.

Hi,

This is exactly the kind of information I was looking for.

Thank you,
Vlad.

jarkkop · Post by **jarkkop** » Mon Apr 12, 2010 8:21 am

There must be some differences, because some code sections can only be activated once depth 20 is reached, e.g. in Ippolit derivatives. If the search time is too short, these code sections don't take effect at all => different behaviour.
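A rough way to see when such a depth gate can fire at all is a back-of-the-envelope depth estimate. Everything in this sketch is an assumption for illustration: the search speed of one million nodes per second, the effective branching factor of 2, and the helper names are hypothetical, and only the depth 20 threshold comes from the post above.

```python
import math

GATE_DEPTH = 20  # threshold mentioned above; the code it guards is not shown here

def estimated_depth(seconds_per_move, nodes_per_second=1_000_000, branching=2.0):
    """Very rough reachable depth: solve branching**depth = nodes searched."""
    nodes = seconds_per_move * nodes_per_second
    if nodes < 1:
        return 0
    return int(math.log(nodes) / math.log(branching))

def gate_fires(seconds_per_move, **kwargs):
    """True if the estimated depth reaches the gate within the time per move."""
    return estimated_depth(seconds_per_move, **kwargs) >= GATE_DEPTH

for seconds in (0.02, 0.3, 1.0, 10.0):
    print(f"{seconds} s/move: depth ~{estimated_depth(seconds)}, "
          f"gate fires: {gate_fires(seconds)}")
```

Under these assumed numbers the gated code would stay dormant at very fast time controls and only start to matter once there are several seconds per move, so a short-game test would be measuring a slightly different program.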
