Ferdy wrote: Talking about stable opponents, I have tried 30 opponents; whatever the results, I take them as they are. I also think that the selection should be varied: some programs are good at endings, some at attacking, some at passer handling, some in open positions, some in closed positions, and so on.

mhull wrote: Do you track gain/loss against individual opponents as well as the group average? If so, did you find that gain/loss would move in opposite directions against respective opponents?

bob wrote: I see both sets of numbers, and yes, a change can improve overall while you do worse against one opponent and better against the other(s). I have tried to keep at least one -200 Elo program in my mix; if you see a drop against that program, it should be of interest. However, I still go for best overall, though I do try to see what changed in the games against the weaker program...

mhull wrote: I realize the difficulty of compiling a set of opponents that run reliably in the cluster. But there must surely be a nagging question: if you had another group of five or six strong opponents, would changes always move your program's relative strength in the same general direction, up or down, in both groups of opponents?

For Komodo that is the least of my worries. There are numerous problems that can be identified with properly testing a program change, and I would put this near the bottom of the list. Here are some others:
1. Getting a large enough sample, given limited CPU resources.
2. Does an improvement at one level equal an improvement at another?
3. Does computer testing equal true strength improvement?
4. Does self-testing equal true improvement (against humans)?
5. Are my openings realistic?
6. Opponent intransitivity.
I put your concern at the end of the list, but the items are in no particular order.
Being overly obsessive about each of these issues is almost certainly counterproductive, and that even includes item number 1, which is the most worthy of obsession! For example, if we insisted on getting the error margins down to +/- 1 Elo just to be sure, we would not be able to test more than a couple of changes per week, and our progress would slow to a crawl. On the other hand, if you don't spend enough time on a test, your changes become almost random.
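To put a rough number on that trade-off, here is a minimal sketch in Python. The `games_needed` helper is hypothetical (not part of our tester), and the 95% z-value and the simple draw-ratio variance model are my assumptions; it estimates how many games a given Elo error margin would cost for two roughly equal opponents:

```python
import math

def games_needed(margin_elo, draw_ratio=0.0, z=1.96):
    """Games needed so the error margin on an Elo estimate is
    +/- margin_elo at confidence z, assuming a score near 50%."""
    # Per-game score variance with win=1, draw=0.5, loss=0:
    # at a 50% score it works out to 0.25 - draw_ratio / 4.
    var = 0.25 - draw_ratio / 4.0
    # Near 50%, one unit of score is worth 400/ln(10) * 4 ~ 695 Elo,
    # the derivative of elo(s) = -400*log10(1/s - 1) at s = 0.5.
    elo_per_score = 400.0 / math.log(10) * 4.0
    # Solve z * elo_per_score * sqrt(var / n) = margin_elo for n.
    n = (z * elo_per_score * math.sqrt(var) / margin_elo) ** 2
    return math.ceil(n)

print(games_needed(1.0))                   # ~464,000 games, no draws
print(games_needed(1.0, draw_ratio=0.5))   # ~232,000 even if half are drawn
print(games_needed(5.0, draw_ratio=0.5))   # ~9,300 for a +/- 5 Elo margin
```

Hundreds of thousands of games per change is exactly why insisting on +/- 1 Elo would limit you to a couple of changes per week.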
So we try to cover most things, but we don't go too crazy. For openings our philosophy is to get out of the opening as quickly as possible, so that we are testing the entire program and not just its middle-game play. So for testing, our opening library is only 5 moves deep, or 10 ply. But in real play we are probably in the book a lot longer than that, so isn't this wrong? This is not the actual condition under which we will normally be playing. Should I lie awake at night obsessing over this?
Or how about this other "problem": we play each opening only once as white (and once as black), but in reality some of those openings are much more popular than others, so this is not a realistic sample of how games are played in the real world. What do I do about that? Why should I not make that a top priority? Is it more important or less important than obsessing over whether a player 500 Elo weaker might beat us more often than he should?
I firmly believe that a huge amount of time and energy (and CPU resources) could be wasted worrying about each of these issues, even issue number 1, which probably trumps all the others.
Your concern is the nagging doubt about intransitivity, i.e. certain programs performing much better against your program than they should. Of all the things on the list, I put that near the end -- it's not worth much consideration for us. Please note that I'm not saying it cannot or does not happen, but in our testing methodology it is so far down the list, and something has to give!
Also, I think it's a mistake to test against opponents that are too weak or too strong. You need a LOT more games to properly resolve the performance if the Elo difference is too large. To understand this, imagine that you play a 1000 game match against Carlsen and win 1 game but lose 999. If you had won 2 games instead of 1, you would have won TWICE as many games, and it would make an enormous impact on your rating estimate. You would need tens of thousands of games against Carlsen (assuming you are several hundred Elo weaker in strength) to resolve your rating as well as if you had played just a few games against an opponent of nearly equal strength to yourself.
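A quick back-of-the-envelope check of the Carlsen example, using the standard logistic Elo model (plain Python; the specific numbers are my illustration, not anyone's measured ratings):

```python
import math

def elo_diff(score):
    """Elo difference implied by a match score (0 < score < 1),
    using the standard logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

# 1 win out of 1000 vs. 2 wins out of 1000:
print(round(elo_diff(1 / 1000)))  # about -1200
print(round(elo_diff(2 / 1000)))  # about -1079
```

One extra win out of a thousand games moves the estimate by roughly 120 Elo, whereas near a 50% score the same extra point moves it by around 1 Elo. That is the resolution problem in a nutshell.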
I have nothing against playing a large variety of computer opponents; I think that is good, but it's just not practical for Komodo. There are a limited number of players we can use on our tester that (as Bob has noted) have good auto-test behavior and are close enough to Komodo's strength. There are hundreds of programs we could use if we didn't mind playing against programs hundreds of Elo weaker. We could give those other players additional time to equalize the ratings, but then you are using most of your CPU resources on OTHER programs instead of your own. In fact, I advise that if you are developing a chess program, you find a few opponents that are MUCH stronger than your own program and handicap them accordingly, so that your program is scoring within 100 Elo (within 50 is better). Pick a few programs with significantly different styles if you can. In this way you can approach 100% CPU utilization on just your own program.
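As a rough sketch of picking that handicap (Python again; the "~70 Elo per doubling of thinking time" figure is a commonly quoted rule of thumb rather than a Komodo measurement, and `time_handicap` is a hypothetical helper):

```python
def time_handicap(opponent_elo_edge, elo_per_doubling=70.0):
    """Rough time-odds ratio that should pull an opponent who is
    opponent_elo_edge stronger back to roughly even strength,
    assuming each doubling of thinking time is worth ~70 Elo."""
    return 2.0 ** (opponent_elo_edge / elo_per_doubling)

# An opponent 140 Elo stronger would get about 1/4 of your time:
ratio = time_handicap(140)
print(f"give the opponent 1/{ratio:.1f} of your program's time")
```

Since the handicapped opponents think far less, nearly all of the CPU budget goes to the program you are actually testing.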