Cluster Testing Pitfalls?


mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Cluster Testing Pitfalls?

Post by mhull »

Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.
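
For concreteness, the reduction bob describes is usually keyed off a static exchange evaluation (SEE); a minimal sketch of the idea follows, where the one-ply reduction and the depth guard are illustrative assumptions, not any particular engine's actual rule.

Code:
def reduce_losing_capture(see_value, depth):
    """Extra depth reduction for a capture that SEE says loses material.
    Thresholds are illustrative, not taken from any specific engine."""
    if see_value >= 0:
        return 0   # even or winning captures: search at full depth
    if depth <= 2:
        return 0   # too close to the horizon to reduce safely
    return 1       # reduce apparently losing captures by one ply

# Example: a capture losing roughly a pawn (SEE -100) at remaining depth 6
print(reduce_losing_capture(-100, 6))   # -> 1
print(reduce_losing_capture(+50, 6))    # -> 0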

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
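
One way to make the question measurable: run the old and new versions against each stable separately, convert each score to an Elo difference, and check whether the two deltas share a sign. A rough sketch, with invented game counts:

Code:
import math

def elo_diff(wins, draws, losses):
    """Elo difference implied by a match score (logistic model)."""
    score = (wins + 0.5 * draws) / (wins + draws + losses)
    return -400.0 * math.log10(1.0 / score - 1.0)

def gain(old_wdl, new_wdl):
    """Elo gained by the new version relative to the old, same stable."""
    return elo_diff(*new_wdl) - elo_diff(*old_wdl)

# Hypothetical results of old/new versions against two unrelated stables.
gain_a = gain((4200, 3100, 2700), (4350, 3080, 2570))
gain_b = gain((3900, 3300, 2800), (3870, 3290, 2840))
print("stable A: %+.1f Elo, stable B: %+.1f Elo" % (gain_a, gain_b))
if gain_a * gain_b < 0:
    print("The change moves in opposite directions against the two stables.")
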
Matthew Hull
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Cluster Testing Pitfalls?

Post by Ferdy »

mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Cluster Testing Pitfalls?

Post by mhull »

Ferdy wrote:Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual as well as group average? If so, did you find that gain/loss would move in opposite directions against respective opponents?
Matthew Hull
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Cluster Testing Pitfalls?

Post by Daniel Shawul »

For me it is always a gauntlet against 12 engines, among which 1 or 2 are usually older versions of my engine. Improvements usually translate well to a larger pool of opponents as well, but you can't be sure. I don't think any programmer has the time to do those 40/40 tests against 50 engines just to prove an improvement, so you rely on outside testers for verification of improvements.
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Cluster Testing Pitfalls?

Post by mhull »

Ferdy wrote:
mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
I just realized my statement could be misunderstood. I meant "stable" as in "a place to keep some horses". So you might have two "stables" with different horses in each. I think Bob has only one "stable" of opponents. I wondered if he ever had time to test changes against two stables of unrelated engines to see whether Elo changes were consistent between the two stables.
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Testing Pitfalls?

Post by bob »

mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
For me, no. I have a couple of requirements that are problematic.

(1) I need source. I have to compile whatever I run on our cluster. I've yet to find any statically linked executable that works correctly due to the lightweight kernel I run.

(2) I need programs that can play fast games reliably. Many fall apart at 20 seconds + 0.1 increment. I generally play 30K games per test, and only see 2-3 time forfeits, total, by the opponents I use. I really never see Crafty losing on time, which is a goal, obviously, but I also don't want opponents to lose just because the games are too fast.

The programs I use do really well here. I always test at 2-3-4 different time controls if I add a new opponent, to make sure that the fast time control games don't skew wildly away from the longer games. That convinces me that fast time controls are not producing garbage results...
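
That multi-time-control check can be made explicit by attaching an error bar to the Elo estimate at each time control and seeing whether the intervals overlap. A rough sketch using a simple normal approximation; the sample results below are invented:

Code:
import math

def elo_with_error(wins, draws, losses, z=1.96):
    """Elo difference and ~95% error bar from a match score,
    using a normal approximation on the per-game score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    var = (wins * (1 - score) ** 2 + draws * (0.5 - score) ** 2
           + losses * (0 - score) ** 2) / n
    stderr = math.sqrt(var / n)
    # Convert the score error into an Elo error via the local slope.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return elo, z * stderr * slope

# Hypothetical results against the same opponent at three time controls.
for tc, wdl in [("20s+0.1s", (520, 310, 170)),
                ("1m+1s",    (500, 330, 170)),
                ("5m+5s",    (490, 345, 165))]:
    elo, err = elo_with_error(*wdl)
    print("%8s: %+6.1f +/- %.1f Elo" % (tc, elo, err))
# If the intervals overlap, the fast games are not obviously skewed
# relative to the longer ones.
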
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Cluster Testing Pitfalls?

Post by Don »

We test against a gauntlet of just 3 opponents. We cannot add many more without giving them extra time - and we don't want to spend an excessive amount of CPU cycles running other people's chess programs when we are testing ours - so we don't have the luxury of testing against 50 other programs.

But I seriously doubt it really matters that much. We almost always get the same result in self-test as we do with the gauntlet approach.

Don

mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Cluster Testing Pitfalls?

Post by Ferdy »

mhull wrote:
Ferdy wrote:
mhull wrote:Exchange clipped from another thread:
bob wrote:
Don wrote:I think captures are just too volatile to take any chances with.
..

I reduce losing captures. It isn't much of a gain, but it is a gain for me...
It's a gain against the stable of opponents available when you tested the change, yes. But what if it would be a loss against a different stable of opponents? I suppose this question could be asked of any 5-10 Elo tweak or group of tweaks.

Suppose you could have two stables of different programs and test some of these changes against both independently to see if the gain/loss ever go in opposite directions?

Has this been a concern to anyone doing this kind of testing? Has it been tried?
Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
I just realized my statement could be misunderstood. I meant "stable" as in "a place to keep some horses". So you might have two "stables" with different horses in each. I think Bob has only one "stable" of opponents. I wondered if he ever had time to test changes against two stables of unrelated engines to see whether Elo changes were consistent between the two stables.
Sorry, what I mean is that I have only 1 stable of opponents, and it consists of 30 engines.

I used to have 3: low, mid, and high. But so as not to get confused, I just combine the results and compare; the version with the higher Elo is better.
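
Combining the low/mid/high results just means pooling the win/draw/loss counts before converting to an Elo figure; a small sketch of that bookkeeping, with invented counts:

Code:
import math

def pooled_elo(results):
    """Elo difference from win/draw/loss counts pooled over several stables."""
    wins = sum(w for w, d, l in results)
    draws = sum(d for w, d, l in results)
    games = sum(w + d + l for w, d, l in results)
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)

# Hypothetical results of two versions against low/mid/high stables.
version_a = [(300, 120, 80), (220, 160, 120), (140, 180, 180)]
version_b = [(310, 118, 72), (215, 162, 123), (150, 178, 172)]
print("A: %+.1f  B: %+.1f" % (pooled_elo(version_a), pooled_elo(version_b)))
# The version with the higher pooled Elo is kept.
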
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Testing Pitfalls?

Post by bob »

mhull wrote:
Ferdy wrote:Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual as well as group average? If so, did you find that gain/loss would move in opposite directions against respective opponents?
I see both sets of numbers, and yes, a change can improve overall while you do worse against one opponent and better against the other(s). I have tried to keep at least one -200 Elo program in my mix, since if you see a drop against that program, it should be of interest. However, I still go for best overall, but I do try to see what changed in the games against the weaker program...
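
That sort of bookkeeping is easy to automate: compute the score change per opponent as well as overall, and flag any opponent whose delta has the opposite sign to the overall delta. A minimal sketch with invented numbers:

Code:
def score(w, d, l):
    """Match score as a fraction of available points."""
    return (w + 0.5 * d) / (w + d + l)

# Hypothetical per-opponent results: opponent -> (old wdl, new wdl).
results = {
    "OpponentX": ((250, 140, 110), (262, 141, 97)),
    "OpponentY": ((210, 160, 130), (225, 155, 120)),
    "Weak-200":  ((390, 80, 30),   (378, 84, 38)),
}

old_total = [sum(v) for v in zip(*(old for old, new in results.values()))]
new_total = [sum(v) for v in zip(*(new for old, new in results.values()))]
overall = score(*new_total) - score(*old_total)
print("overall score change: %+.4f" % overall)

for name, (old, new) in results.items():
    delta = score(*new) - score(*old)
    flag = "  <-- opposite direction" if delta * overall < 0 else ""
    print("%12s: %+.4f%s" % (name, delta, flag))
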
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Cluster Testing Pitfalls?

Post by mhull »

bob wrote:
mhull wrote:
Ferdy wrote:Talking about a stable of opponents, I have tried 30 opponents :) ; whatever the results are, I take them as they are. I also think the selection should be varied: some are good at endings, some at attacking, some at handling passers, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual as well as group average? If so, did you find that gain/loss would move in opposite directions against respective opponents?
I see both sets of numbers, and yes, a change can improve overall while you do worse against one opponent and better against the other(s). I have tried to keep at least one -200 Elo program in my mix, since if you see a drop against that program, it should be of interest. However, I still go for best overall, but I do try to see what changed in the games against the weaker program...
I realize the difficulty of compiling a set of opponents that run reliably on the cluster. But there must surely be a nagging question: if you had another group of five or six strong opponents, would changes always move your program's relative strength in the same general direction, up or down, in both groups of opponents?
Matthew Hull