Cluster Testing Pitfalls?


mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Cluster Testing Pitfalls?

Post by mhull »

Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.
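
For concreteness, the reduction bob describes is usually keyed off a static exchange evaluation (SEE); a minimal sketch of the idea follows, where the one-ply reduction and the depth guard are illustrative assumptions, not any particular engine's actual rule.

Code:
def reduce_losing_capture(see_value, depth):
    """Extra depth reduction for a capture that SEE says loses material.
    Thresholds are illustrative, not taken from any specific engine."""
    if see_value >= 0:
        return 0   # even or winning captures: search at full depth
    if depth <= 2:
        return 0   # too close to the horizon to reduce safely
    return 1       # reduce apparently losing captures by one ply

# Example: a capture losing roughly a pawn (SEE -100) at remaining depth 6
print(reduce_losing_capture(-100, 6))   # -> 1
print(reduce_losing_capture(+50, 6))    # -> 0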

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
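
One way to make the question measurable: run the old and new versions against each stable separately, convert each score to an Elo difference, and check whether the two deltas share a sign. A rough sketch, with invented game counts:

Code:
import math

def elo_diff(wins, draws, losses):
    """Elo difference implied by a match score (logistic model)."""
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    return -400.0 * math.log10(1.0 / score - 1.0)

def gain(old_wdl, new_wdl):
    """Elo gained by the new version relative to the old, same stable."""
    return elo_diff(*new_wdl) - elo_diff(*old_wdl)

# Hypothetical results of old/new versions against two unrelated stables.
gain_a = gain((4200, 3100, 2700), (4350, 3080, 2570))
gain_b = gain((3900, 3300, 2800), (3870, 3290, 2840))
print("stable A: %+.1f Elo, stable B: %+.1f Elo" % (gain_a, gain_b))
if gain_a * gain_b < 0:
    print("The change moves in opposite directions against the two stables.")
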
Matthew Hull
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Cluster Testing Pitfalls?

Post by Ferdy »

mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Cluster Testing Pitfalls?

Post by mhull »

Ferdy wrote:Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual as well as group average? If so, did you find that gain/loss would move in opposite directions against respective opponents?
Matthew Hull
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Cluster Testing Pitfalls?

Post by Daniel Shawul »

For me it is always a gauntlet against 12 engines, among which 1 or 2 are usually older versions of my engine. Improvements usually translate well to a larger pool of opponents as well, but you can't be sure. I don't think any programmer has the time to do those 40/40 tests against 50 engines just to prove an improvement, so you rely on outside testers for verification of improvements.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Cluster Testing Pitfalls?

Post by mhull »

Ferdy wrote:
mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
I just realized my statement could be misunderstood. I meant "stable" as in "a place to keep some horses". So you might have two "stables" with different horses in each. I think Bob has only one "stable" of opponents. I wondered if he ever had time to test changes against two stables of unrelated engines to see whether Elo changes were consistent between the two stables.
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Testing Pitfalls?

Post by bob »

mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
For me, no. I have a couple of requirements that are problematic.

(1) I need source. I have to compile whatever I run on our cluster. I've yet to find any statically linked executable that works correctly due to the lightweight kernel I run.

(2) I need programs that can play fast games reliably. Many fall apart at 20 seconds + 0.1 increment. I generally play 30K games per test, and only see 2-3 time forfeits, total, by the opponents I use. I really never see Crafty losing on time, which is a goal, obviously, but I also don't want opponents to lose just because the games are too fast.

The programs I use do really well here. I always test at 2-3-4 different time controls if I add a new opponent, to make sure that the fast time control games don't skew wildly away from the longer games. That convinces me that fast time controls are not producing garbage results...
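
That multi-time-control check can be made explicit by attaching an error bar to the Elo estimate at each time control and seeing whether the intervals overlap. A rough sketch using a simple normal approximation; the sample results below are invented:

Code:
import math

def elo_with_error(wins, draws, losses, z=1.96):
    """Elo difference and ~95% error bar from a match score,
    using a normal approximation on the per-game score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    var = (wins * (1 - score) ** 2 + draws * (0.5 - score) ** 2
           + losses * (0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    # Convert the score error into an Elo error via the local slope.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return elo, z * stderr * slope

# Hypothetical results against the same opponent at three time controls.
for tc, wdl in [("20s+0.1s", (520, 310, 170)),
                ("1m+1s",    (500, 330, 170)),
                ("5m+5s",    (490, 345, 165))]:
    elo, err = elo_with_error(*wdl)
    print("%8s: %+6.1f +/- %.1f Elo" % (tc, elo, err))
# If the intervals overlap, the fast games are not obviously skewed
# relative to the longer ones.
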
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Cluster Testing Pitfalls?

Post by Don »

We test against a gauntlet of just 3 opponents. We cannot add many more without giving them extra time - and we don't want to spend an excessive amount of CPU cycles running other people's chess programs when we are testing ours - so we don't have the luxury of testing against 50 other programs.

But I seriously doubt it really matters that much. We almost always get the same result in self-test as we do with the gauntlet approach.

Don

mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Cluster Testing Pitfalls?

Post by Ferdy »

mhull wrote:
Ferdy wrote:
mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
I just realized my statement could be misunderstood. I meant "stable" as in "a place to keep some horses". So you might have two "stables" with different horses in each. I think Bob has only one "stable" of opponents. I wondered if he ever had time to test changes against two stables of unrelated engines to see whether Elo changes were consistent between the two stables.
Sorry, what I mean is that I have only 1 stable of opponents, and it consists of 30 engines.

I used to have 3: low, mid, and high. But so as not to get confused, I just combine the results and compare; the version with the higher Elo is better.
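
Combining the low/mid/high results just means pooling the win/draw/loss counts before converting to an Elo figure; a small sketch of that bookkeeping, with invented counts:

Code:
import math

def pooled_elo(results):
    """Elo difference from win/draw/loss counts pooled over several stables."""
    wins = sum(w for w, d, l in results)
    draws = sum(d for w, d, l in results)
    games = sum(w + d + l for w, d, l in results)
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical results of two versions against low/mid/high stables.
version_a = [(300, 120, 80), (220, 160, 120), (140, 180, 180)]
version_b = [(310, 118, 72), (215, 162, 123), (150, 178, 172)]
print("A: %+.1f  B: %+.1f" % (pooled_elo(version_a), pooled_elo(version_b)))
# The version with the higher pooled Elo is kept.
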
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Testing Pitfalls?

Post by bob »

mhull wrote:
Ferdy wrote:Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual as well as group average? If so, did you find that gain/loss would move in opposite directions against respective opponents?
I see both sets of numbers, and yes, a change can improve overall while you do worse against one opponent and better against the other(s). I have tried to keep at least one -200 Elo program in my mix, since if you see a drop against that program, it should be of interest. However, I still go for best overall, but I do try to see what changed in the games against the weaker program...
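
That sort of bookkeeping is easy to automate: compute the score change per opponent as well as overall, and flag any opponent whose delta has the opposite sign to the overall delta. A minimal sketch with invented numbers:

Code:
def score(w, d, l):
    """Match score as a fraction of available points."""
    return (w + 0.5 * d) / (w + d + l)

# Hypothetical per-opponent results: opponent -> (old wdl, new wdl).
results = {
    "OpponentX": ((250, 140, 110), (262, 141, 97)),
    "OpponentY": ((210, 160, 130), (225, 155, 120)),
    "Weak-200":  ((390, 80, 30),   (378, 84, 38)),
}

old_total = [sum(v) for v in zip(*(old for old, new in results.values()))]
new_total = [sum(v) for v in zip(*(new for old, new in results.values()))]
overall = score(*new_total) - score(*old_total)
print("overall score change: %+.4f" % overall)

for name, (old, new) in results.items():
    delta = score(*new) - score(*old)
    flag = "  <-- opposite direction" if delta * overall < 0 else ""
    print("%12s: %+.4f%s" % (name, delta, flag))
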
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Cluster Testing Pitfalls?

Post by mhull »

bob wrote:
mhull wrote:
Ferdy wrote:Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual as well as group average? If so, did you find that gain/loss would move in opposite directions against respective opponents?
I see both sets of numbers, and yes, a change can improve overall while you do worse against one opponent and better against the other(s). I have tried to keep at least one -200 Elo program in my mix, since if you see a drop against that program, it should be of interest. However, I still go for best overall, but I do try to see what changed in the games against the weaker program...
I realize the difficulty of compiling a set of opponents that run reliably on the cluster. But there must surely be a nagging question: if you had another group of five or six strong opponents, would changes always move your program's relative strength in the same general direction, up or down, in both groups of opponents?
Matthew Hull