Cluster Testing Pitfalls?



Re: Cluster Testing Pitfalls?

Post by Don »

mhull wrote:
bob wrote:
mhull wrote:
Ferdy wrote:Talking about stable opponents, I have tried 30 opponents :) ; whatever the results, I take them as they are. I also think that the selection should be varied: some programs are good in the endgame, some at attacking, some at passed-pawn handling, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual opponents as well as against the group average? If so, did you find that gain/loss would move in opposite directions against different opponents?
I see both sets of numbers, and yes, a change can improve the overall result while you do worse against one opponent and better against the other(s). I have tried to keep at least one -200 Elo program in my mix, because if you see a drop against that program, it should be of interest. However, I still go for the best overall result, but I do try to see what changed in the games against the weaker program...
I realize the difficulty of compiling a set of opponents that run reliably on the cluster. But there must surely be a nagging question: if you had another group of five or six strong opponents, would changes always move your program's relative strength in the same general direction, up or down, in both groups of opponents?
For Komodo that is the least of my worries. There are numerous problems that can be identified with properly testing a program change, and I would put this one near the bottom. Here are some others:

1. Getting a large enough sample, i.e. limited CPU resources.
2. Does an improvement at one level equal an improvement at another?
3. Does computer testing equal true strength improvement?
4. Does self-testing equal true improvement (against humans)
5. Are my openings realistic?
6. Opponent Intransitivity?

I put your concern at the end of the list but they are in no particular order.

If you are overly obsessive about each of these issues, it's almost certainly counterproductive, and that even includes item number 1, which is the most worthy of obsession! For example, if we insisted on getting the error margins down to +/- 1 ELO just to be sure, we would not be able to test more than a couple of changes per week and our progress would slow to a crawl. On the other hand, if you don't spend enough time, your changes become almost random.
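
To put a rough number on that, here is a back-of-the-envelope sketch (in Python; the 0.4 per-game standard deviation is an assumed figure that depends on your draw rate, so measure your own):

import math

# Cost of tight error bars, assuming a near-equal match and an assumed
# per-game score standard deviation of about 0.4 (measure your own).
ELO_PER_SCORE_POINT = 400.0 / math.log(10) / 0.25   # slope of the Elo curve at 50%, ~695
PER_GAME_SD = 0.4

def games_needed(elo_margin, z=2.0):
    # Games required for a +/- elo_margin error bar at roughly 95% confidence (z ~ 2).
    per_game_elo_sd = ELO_PER_SCORE_POINT * PER_GAME_SD
    return int((z * per_game_elo_sd / elo_margin) ** 2)

for margin in (1, 2, 5, 10):
    print(f"+/-{margin:2d} Elo: about {games_needed(margin):,} games")

Under those assumptions +/- 1 ELO costs on the order of 300,000 games per change, while +/- 5 ELO costs roughly 12,000.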

So we try to cover most things, but we don't go too crazy. For openings, our philosophy is to get out of the opening as quickly as possible so that we are testing the entire program and not just its middle-game play. So for testing, our opening library is only 5 moves deep, or 10 ply. But in real play we are probably in book a lot longer than that, so isn't this wrong? It is not the actual condition we will be playing under normally. Should I lie awake at night obsessing over this?
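
For what it's worth, here is a minimal sketch of how such a truncated opening set could be built, assuming the python-chess library and a made-up input file of master games:

import chess.pgn

MAX_PLIES = 10  # 5 moves per side, as described above

positions = set()
with open("master_games.pgn") as f:               # hypothetical source games
    while True:
        game = chess.pgn.read_game(f)
        if game is None:
            break
        board = game.board()
        for ply, move in enumerate(game.mainline_moves()):
            if ply >= MAX_PLIES:
                break
            board.push(move)
        positions.add(board.epd())                # dedupe identical 10-ply positions

with open("book_10ply.epd", "w") as out:
    out.write("\n".join(sorted(positions)) + "\n")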

Or how about this other "problem", we play each opening only 1 time as white (and once as black) but in reality some of those openings are much more popular than others, so this is not a realistic sampling of how games are played in the real world. What do I do about that? Why should I not make that a top priority? Is that more important or less important than obsessing over whether a player 500 ELO weaker might beat us more often than he should?

I firmly believe that a huge amount of time and energy (and CPU resources) could be wasted worrying about each of these issues, even issue number 1 which probably trumps all the others.

Your concern is the nagging doubt about intransitivity, which is having certain programs perform much better against your program than they should. Of all the things on the list, I put that near the end -- it's not worth much consideration for us. Please note that I'm not saying it cannot or does not happen, but in our testing methodology it is so far down the list, and something has to give!

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
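
To put rough numbers on that, here is a quick sketch that treats each game as a pure win/loss and ignores draws (a simplification):

import math

def delo_dp(p):
    # Slope of the logistic Elo curve at score p (Elo per unit of score).
    return 400.0 / math.log(10) / (p * (1.0 - p))

def elo_std_error(p, n):
    # Standard error of the Elo estimate from n games at score p
    # (binomial win/loss model, draws ignored).
    return delo_dp(p) * math.sqrt(p * (1.0 - p) / n)

print(elo_std_error(0.001, 1000))   # ~174 Elo: 1 win in 1000 tells you almost nothing
print(elo_std_error(0.500, 1000))   # ~11 Elo: the same 1000 games against an equal opponent

So the same 1000 games resolve the rating more than an order of magnitude better when the opponents are evenly matched.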

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
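
As a rough illustration of picking such a handicap (the Elo-per-time-doubling figure below is a placeholder assumption, not a measured value - it varies with the engines and the time control, so measure it yourself):

import math

def elo_from_score(p):
    return 400.0 * math.log10(p / (1.0 - p))

ELO_PER_TIME_DOUBLING = 70.0   # placeholder assumption - measure for your own setup
observed_score = 0.30          # our score vs the stronger opponent at equal time
target_gap = 50.0              # aim to be within ~50 ELO, as suggested above

gap = -elo_from_score(observed_score)                  # ~147 Elo deficit at equal time
doublings = max(0.0, (gap - target_gap) / ELO_PER_TIME_DOUBLING)
opponent_time_fraction = 0.5 ** doublings              # fraction of our time to give it
print(f"Give the stronger opponent about {opponent_time_fraction:.0%} of your time")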

Re: Cluster Testing Pitfalls?

Post by bob »

Don wrote:
mhull wrote:
bob wrote:
mhull wrote:
Ferdy wrote:Talking about stable opponents, I have tried 30 opponents :) ; whatever the results, I take them as they are. I also think that the selection should be varied: some programs are good in the endgame, some at attacking, some at passed-pawn handling, some in open positions, some in closed positions, and so on.
Do you track gain/loss against individual opponents as well as against the group average? If so, did you find that gain/loss would move in opposite directions against different opponents?
I see both sets of numbers, and yes, a change can improve the overall result while you do worse against one opponent and better against the other(s). I have tried to keep at least one -200 Elo program in my mix, because if you see a drop against that program, it should be of interest. However, I still go for the best overall result, but I do try to see what changed in the games against the weaker program...
I realize the difficulty of compiling a set of opponents that run reliably on the cluster. But there must surely be a nagging question: if you had another group of five or six strong opponents, would changes always move your program's relative strength in the same general direction, up or down, in both groups of opponents?
For Komodo that is the least of my worries. There are numerous problems that can be identified with properly testing a program change, and I would put this one near the bottom. Here are some others:

1. Getting a large enough sample, i.e. limited CPU resources.
2. Does an improvement at one level equal an improvement at another?
3. Does computer testing equal true strength improvement?
4. Does self-testing equal true improvement (against humans)
5. Are my openings realistic?
6. Opponent Intransitivity?

I put your concern at the end of the list but they are in no particular order.

If you are overly obsessive about each of these issues, it's almost certainly counterproductive, and that even includes item number 1, which is the most worthy of obsession! For example, if we insisted on getting the error margins down to +/- 1 ELO just to be sure, we would not be able to test more than a couple of changes per week and our progress would slow to a crawl. On the other hand, if you don't spend enough time, your changes become almost random.

So we try to cover most things, but we don't go too crazy. For openings, our philosophy is to get out of the opening as quickly as possible so that we are testing the entire program and not just its middle-game play. So for testing, our opening library is only 5 moves deep, or 10 ply. But in real play we are probably in book a lot longer than that, so isn't this wrong? It is not the actual condition we will be playing under normally. Should I lie awake at night obsessing over this?

Or how about this other "problem", we play each opening only 1 time as white (and once as black) but in reality some of those openings are much more popular than others, so this is not a realistic sampling of how games are played in the real world. What do I do about that? Why should I not make that a top priority? Is that more important or less important than obsessing over whether a player 500 ELO weaker might beat us more often than he should?

I firmly believe that a huge amount of time and energy (and CPU resources) could be wasted worrying about each of these issues, even issue number 1 which probably trumps all the others.

Your concern is the nagging doubt about intransitivity, which is having certain programs perform much better against your program than they should. Of all the things on the list, I put that near the end -- it's not worth much consideration for us. Please note that I'm not saying it cannot or does not happen, but in our testing methodology it is so far down the list, and something has to give!

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
I'd have to second what Don says. Yes, I'd really like to have even 4 good groups of opponents and run 30K games for each group. I might even be tempted to create a "knowledge group", a "tactical group" and an "endgame group" and run against each to see if I hurt one type of performance and help another. This is really getting into what is commonly called "data mining." One can only imagine how much information is "lost" in a 30K game test run where the whole thing is reduced to one Elo number and a small error bar. Solving that problem would be a major breakthrough, that being the ability to mine that 30K game PGN file to discover more about your program than just "is it a little stronger." But it is a _real_ task to do that...
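
A crude first pass at that kind of mining could be as simple as splitting the score out per opponent; the sketch below assumes the python-chess library and made-up file/engine names (splitting further by opening or game phase is the hard part):

from collections import defaultdict
import chess.pgn

ME = "Crafty"                                   # placeholder: your engine's name in the PGN tags
score = defaultdict(float)
games = defaultdict(int)

with open("test_run.pgn") as f:                 # placeholder: one big test-run PGN
    while True:
        headers = chess.pgn.read_headers(f)     # headers only - fast for huge files
        if headers is None:
            break
        white, black, result = headers["White"], headers["Black"], headers["Result"]
        if ME not in (white, black):
            continue
        opp = black if white == ME else white
        games[opp] += 1
        if result == "1/2-1/2":
            score[opp] += 0.5
        elif result == "1-0" and white == ME:
            score[opp] += 1.0
        elif result == "0-1" and black == ME:
            score[opp] += 1.0

for opp in sorted(games):
    print(f"{opp:24s} {score[opp]:7.1f} / {games[opp]}")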

The real testing issue, however, is turnaround time. I'd like to say "run test" and have an instant reply of "-1 Elo". I could make some progress like that. At present the testing takes far longer than the few lines of code a change requires, so I am always waiting on the results rather than trying new ideas and developing code.

Re: Cluster Testing Pitfalls?

Post by Laskos »

Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
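
For anyone who wants to check those figures, they drop (roughly) straight out of the logistic Elo curve:

import math

def elo(p):
    # Elo difference implied by an expected score p, from the logistic model.
    return 400.0 * math.log10(p / (1.0 - p))

# Elo value of one extra percentage point of score, at various score levels.
for p in (0.50, 0.70, 0.90, 0.95, 0.99):
    print(f"{p:.0%}: {elo(p + 0.005) - elo(p - 0.005):6.1f} Elo per 1%")

# The games needed for a given Elo error bar scale with the square of this slope:
print((40 / 7) ** 2)    # ~33x more games at a 95% score than at 50%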

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.

Re: Cluster Testing Pitfalls?

Post by Don »

Laskos wrote:
Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
We don't have a strict rule on this, but I don't remember ever letting the difference get more than approximately 50 ELO and it's usually closer than that. Probably a good rule of thumb is stay within 100 ELO.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.

Re: Cluster Testing Pitfalls?

Post by bob »

Don wrote:
Laskos wrote:
Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
We don't have a strict rule on this, but I don't remember ever letting the difference get more than approximately 50 ELO and it's usually closer than that. Probably a good rule of thumb is stay within 100 ELO.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
I think you need one opponent that is well in front. For me, that is Stockfish. I want to improve against the group, but NOT at the expense of getting worse against someone that is much stronger. And I have seen that happen a few times in testing. You add something that works against weaker opponents (say, king safety that leads you to attack them more), but against a stronger opponent it can backfire.

Re: Cluster Testing Pitfalls?

Post by Don »

bob wrote:
Don wrote:
Laskos wrote:
Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
We don't have a strict rule on this, but I don't remember ever letting the difference get more than approximately 50 ELO and it's usually closer than that. Probably a good rule of thumb is stay within 100 ELO.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
I think you need one opponent that is well in front. For me, that is Stockfish. I want to improve against the group, but NOT at the expense of getting worse against someone that is much stronger. And I have seen that happen a few times in testing. You add something that works against weaker opponents (say, king safety that leads you to attack them more), but against a stronger opponent it can backfire.
Actually we have always handicapped programs such that Komodo is losing matches and is on the bottom of our rating lists, but that is no longer the case unless we start handicapping Komodo instead. However we expect these other programs (SF, Critter and some clone) to be upgraded too which gives us some time before we have to start to handicap Komodo.

Houdini is of course the exception but Houdini is not reliable for us and we have to have rock solid stable programs. (Houdini does not appear to be able to handle the fast time controls we use without losing some games on time forfeit.)

Re: Cluster Testing Pitfalls?

Post by bob »

Don wrote:
bob wrote:
Don wrote:
Laskos wrote:
Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
We don't have a strict rule on this, but I don't remember ever letting the difference get more than approximately 50 ELO and it's usually closer than that. Probably a good rule of thumb is stay within 100 ELO.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
I think you need one opponent that is well in front. For me, that is Stockfish. I want to improve against the group, but NOT at the expense of getting worse against someone that is much stronger. And I have seen that happen a few times in testing. You add something that works against weaker opponents (say, king safety that leads you to attack them more), but against a stronger opponent it can backfire.
Actually we have always handicapped programs such that Komodo is losing matches and is on the bottom of our rating lists, but that is no longer the case unless we start handicapping Komodo instead. However we expect these other programs (SF, Critter and some clone) to be upgraded too which gives us some time before we have to start to handicap Komodo.

Houdini is of course the exception but Houdini is not reliable for us and we have to have rock solid stable programs. (Houdini does not appear to be able to handle the fast time controls we use without losing some games on time forfeit.)
I've not tried Houdini, although I suspect no Linux version is around, and even if there were, such versions generally won't run on my cluster. RobboLito was hopeless. I tried it a few times, and it was very strong, but it also left hundreds of core files while playing just 6,000 games, which is miserable and skews the results in ways I could not predict. (Did it crash just in endgames? Just with pawn promotions possible? When being attacked? Etc.)

Re: Cluster Testing Pitfalls?

Post by Don »

bob wrote:
Don wrote:
bob wrote:
Don wrote:
Laskos wrote:
Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
We don't have a strict rule on this, but I don't remember ever letting the difference get more than approximately 50 ELO and it's usually closer than that. Probably a good rule of thumb is stay within 100 ELO.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
I think you need one opponent that is well in front. For me, that is Stockfish. I want to improve against the group, but NOT at the expense of getting worse against someone that is much stronger. And I have seen that happen a few times in testing. You add something that works against weaker opponents (say, king safety that leads you to attack them more), but against a stronger opponent it can backfire.
Actually we have always handicapped programs such that Komodo is losing matches and is on the bottom of our rating lists, but that is no longer the case unless we start handicapping Komodo instead. However we expect these other programs (SF, Critter and some clone) to be upgraded too which gives us some time before we have to start to handicap Komodo.

Houdini is of course the exception but Houdini is not reliable for us and we have to have rock solid stable programs. (Houdini does not appear to be able to handle the fast time controls we use without losing some games on time forfeit.)
I've not tried Houdini, although I suspect no Linux version is around, and even if there were, such versions generally won't run on my cluster. RobboLito was hopeless. I tried it a few times, and it was very strong, but it also left hundreds of core files while playing just 6,000 games, which is miserable and skews the results in ways I could not predict. (Did it crash just in endgames? Just with pawn promotions possible? When being attacked? Etc.)
We have used the 32 bit Houdini in wine which at the time was still stronger than Komodo - and I think it would still be slightly stronger than Komodo but Houdini does not give up much going to 32 bits.

We test with RobboLito and have had NO trouble at all. It runs very stably. Perhaps you are using the wrong version?

We will probably have to drop Robbo pretty soon as it is the weakest of the versions we test against so we are looking for a replacement. Houdini is the strongest of the clones, but I wonder if one of the other clones is significantly stronger than Robbo?

Re: Cluster Testing Pitfalls?

Post by bob »

Don wrote:
bob wrote:
Don wrote:
bob wrote:
Don wrote:
Laskos wrote:
Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
We don't have a strict rule on this, but I don't remember ever letting the difference get more than approximately 50 ELO and it's usually closer than that. Probably a good rule of thumb is stay within 100 ELO.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
I think you need one opponent that is well in front. For me, that is Stockfish. I want to improve against the group, but NOT at the expense of getting worse against someone that is much stronger. And I have seen that happen a few times in testing. You add something that works against weaker opponents (say, king safety that leads you to attack them more), but against a stronger opponent it can backfire.
Actually we have always handicapped programs such that Komodo is losing matches and is on the bottom of our rating lists, but that is no longer the case unless we start handicapping Komodo instead. However we expect these other programs (SF, Critter and some clone) to be upgraded too which gives us some time before we have to start to handicap Komodo.

Houdini is of course the exception but Houdini is not reliable for us and we have to have rock solid stable programs. (Houdini does not appear to be able to handle the fast time controls we use without losing some games on time forfeit.)
I've not tried Houdini, although I suspect no Linux version is around, and even if there were, such versions generally won't run on my cluster. RobboLito was hopeless. I tried it a few times, and it was very strong, but it also left hundreds of core files while playing just 6,000 games, which is miserable and skews the results in ways I could not predict. (Did it crash just in endgames? Just with pawn promotions possible? When being attacked? Etc.)
We have used the 32 bit Houdini in wine which at the time was still stronger than Komodo - and I think it would still be slightly stronger than Komodo but Houdini does not give up much going to 32 bits.

We test with RobboLito and have had NO trouble at all. It runs very stably. Perhaps you are using the wrong version?

We will probably have to drop Robbo pretty soon as it is the weakest of the versions we test against so we are looking for a replacement. Houdini is the strongest of the clones, but I wonder if one of the other clones is significantly stronger than Robbo?
Are you using the version numbered something like 0.83g (this was a good while back, so my memory is not real clear)? I will try to find out exactly what I tested against. Just checked, and I deleted the thing from the cluster, so no idea. What kind of T/C are you using? I tried it at 20s + 0.1s and that's where I was seeing crashes galore and gave up...

Re: Cluster Testing Pitfalls?

Post by Don »

bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote:
Don wrote:
Laskos wrote:
Don wrote:

Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the ELO is too different. To understand this imagine that you play a 1000 game match against Carlsen, and win 1 game but lose 999. If you had won 2 games instead of 1 you would have won TWICE as many games and it would make an enormous impact on your rating. You would need tens of thousands of games against Carlsen (assuming you are several hundred ELO weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
Just use the Elo curve. A 1% change in score at a 50% result is worth about 7 Elo points, at a 70% result about 9 points, at 90% about 20 points, at 95% about 40 points, and at a 99% result roughly 200 Elo points. Therefore you can test safely up to 65-75% results. The number of games necessary to achieve the same error margins at a 95% result as at a 50% result would be (40/7)^2 ~ 33 times more, which is impractical, but up to 65-75% results it all seems fine, so one doesn't have to constantly re-tune the engines to within 100-200 Elo points of each other.
We don't have a strict rule on this, but I don't remember ever letting the difference get more than approximately 50 ELO and it's usually closer than that. Probably a good rule of thumb is stay within 100 ELO.

I have nothing against playing a large variety of computer opponents - I think that is good - but it's just not practical for Komodo: there are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we don't mind playing against programs hundreds of ELO weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly so that your program is scoring within 100 ELO (50 is better). Pick a few programs that have significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
I think you need one opponent that is well in front. For me, that is Stockfish. I want to improve against the group, but NOT at the expense of getting worse against someone that is much stronger. And I have seen that happen a few times in testing. You add something that works against weaker opponents (say, king safety that leads you to attack them more), but against a stronger opponent it can backfire.
Actually we have always handicapped programs such that Komodo is losing matches and is on the bottom of our rating lists, but that is no longer the case unless we start handicapping Komodo instead. However we expect these other programs (SF, Critter and some clone) to be upgraded too which gives us some time before we have to start to handicap Komodo.

Houdini is of course the exception but Houdini is not reliable for us and we have to have rock solid stable programs. (Houdini does not appear to be able to handle the fast time controls we use without losing some games on time forfeit.)
I've not tried Houdini, although I suspect no Linux version is around, and even if there were, such versions generally won't run on my cluster. RobboLito was hopeless. I tried it a few times, and it was very strong, but it also left hundreds of core files while playing just 6,000 games, which is miserable and skews the results in ways I could not predict. (Did it crash just in endgames? Just with pawn promotions possible? When being attacked? Etc.)
We have used the 32 bit Houdini in wine which at the time was still stronger than Komodo - and I think it would still be slightly stronger than Komodo but Houdini does not give up much going to 32 bits.

We test with RobboLito and have had NO trouble at all. It runs very stably. Perhaps you are using the wrong version?

We will probably have to drop Robbo pretty soon as it is the weakest of the versions we test against so we are looking for a replacement. Houdini is the strongest of the clones, but I wonder if one of the other clones is significantly stronger than Robbo?
Are you using the version numbered something like 0.83g (this was a good while back, so my memory is not real clear)? I will try to find out exactly what I tested against. Just checked, and I deleted the thing from the cluster, so no idea. What kind of T/C are you using? I tried it at 20s + 0.1s and that's where I was seeing crashes galore and gave up...
When I start up I get this:

RobboLito VERSION 0.084
compiled with PREFETCH

I can test Robbo at time controls much faster than 20s and it is stable.