Comparing two versions of the same engine

Bob, as I read your comments, I conclude that we, as home testers, have to take a different approach than playing thousands of games. We don't have enough time or computing power to play that many.
Anyway, I will try to summarize what I consider the basic rules for our situation. Maybe you and others can propose more ideas.
1) Common sense and intuition are the general rule.
2) All else being equal, if two tests lead to the same result, always give more credit to the simpler version.
3) Sometimes, to compare evaluation changes only, use fixed-depth (ply) tournaments.
4) Use many engines.
5) Try to combine fast-time-control tournaments with slow ones.
6) Try to run tournaments with different opening positions (maybe mlmfl.ep).
7) If a 100-game test shows only a 5% difference between engine A v1 and engine A v2, consider the result equal for both. For a 1000-game tournament, maybe 2%. Do you agree with these numbers? (A rough significance check is sketched at the end of this post.)
These are the ideas that come to me...
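Since the post asks whether the 5%/2% numbers are reasonable, here is a minimal back-of-the-envelope check of rule 7. It is only a sketch: the 40% draw rate and the two-standard-deviation cut-off are my own assumptions, not anything stated in this thread.

[code]
import math

def match_stderr(games, score=0.5, draw_rate=0.4):
    """Standard error of the final score fraction over `games` games,
    when each game scores 0, 0.5 or 1 and draws occur at `draw_rate`."""
    p_win = score - 0.5 * draw_rate                # chosen so the mean equals `score`
    var = p_win + draw_rate * 0.25 - score ** 2    # E[X^2] - E[X]^2
    return math.sqrt(var / games)

for games, gap in [(100, 0.05), (1000, 0.02)]:
    score = 0.5 + gap / 2                          # a 5% gap means 52.5% vs 47.5%
    z = (score - 0.5) / match_stderr(games, score)
    print(f"{games:5d} games, {gap:.0%} gap: z = {z:.2f}")
[/code]

With these assumptions both cases come out well under two standard deviations (z is roughly 0.7 and 0.8), so treating such results as equal is, if anything, on the safe side.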
Re: Comparing two versions of the same engine
bob wrote: It isn't quite that simple. Your new change might have a side-effect of weakening some other part of your game, but your program doesn't understand (say) the finer points of king-side attack, so you won't notice that this new change has actually made your program worse, because the only opponent you test against can't exploit the weakness... This is why "inbreeding" is bad for biological reproduction.

I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines. I mean, all our engines are far from being perfect (this is true even for Rybka); they all have here and there their own weaknesses, bugs, glitches and so on. When you concentrate on some engines as testing opponents (I remember that you use Fruit, Glaurung, Arasan and maybe one or two more), don't you think that you tune Crafty against the weaknesses of these engines?

Somewhere in this thread you mentioned that you gained +75 Elo according to your tests. Did you cross-check this value against a greater variety of engines?

I'm asking because lately I started to do some work on Spike's evaluation again. I wanted to tune some of the values, and did it by playing thousands of fixed-depth matches against 6 opponents. Finally I had a margin of +20, and I thought this would be easy to confirm in longer games, against more engines and other positions. In the end Spike was better against the opponents I had used in the fixed-depth matches, but partly significantly worse against others.
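What Ralf describes is the classic way tuning over-fits to its opponent pool. One common safeguard is to keep a few opponents completely out of the tuning runs and only trust a gain that also shows up against them. A minimal sketch of that idea, assuming made-up opponent names, game counts and scores (none of this is Ralf's actual data):

[code]
import math

def elo_gain(score, games):
    """Approximate Elo for a score fraction, with a rough one-sigma error bar
    (simple binomial approximation; draws make the real error a bit smaller)."""
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    stderr = math.sqrt(score * (1.0 - score) / games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))   # d(Elo)/d(score)
    return elo, stderr * slope

# Hypothetical setup: tune only against the first pool, keep the second pool
# (plus fresh positions and a real time control) for a final cross-check.
pools = {
    "tuning pool (OppA..OppD)":   (0.53, 2000),   # new vs old looks like ~ +20 Elo here
    "held-out pool (OppE, OppF)": (0.49, 1000),   # ...but the gain does not transfer
}
for name, (score, games) in pools.items():
    gain, err = elo_gain(score, games)
    print(f"{name}: {gain:+.1f} +/- {err:.1f} Elo over {games} games")
[/code]

The pattern in these invented numbers mirrors Ralf's experience: a solid-looking gain against the tuning pool that is indistinguishable from noise against everyone else.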
Re: Comparing two versions of the same engine
Ralf wrote: I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines. I mean, all our engines are far from being perfect (this is true even for Rybka); they all have here and there their own weaknesses, bugs, glitches and so on. When you concentrate on some engines as testing opponents (I remember that you use Fruit, Glaurung, Arasan and maybe one or two more), don't you think that you tune Crafty against the weaknesses of these engines?

Somewhere in this thread you mentioned that you gained +75 Elo according to your tests. Did you cross-check this value against a greater variety of engines?

I'm asking because lately I started to do some work on Spike's evaluation again. I wanted to tune some of the values, and did it by playing thousands of fixed-depth matches against 6 opponents. Finally I had a margin of +20, and I thought this would be easy to confirm in longer games, against more engines and other positions. In the end Spike was better against the opponents I had used in the fixed-depth matches, but partly significantly worse against others.

You play "thousands" of games with only six opponents. Would it be better to play against a lot of engines (more than 30, say) with fewer games per engine? Maybe that strategy would be more reliable...
Re: Comparing two versions of the same engine
bob wrote: Getting to "reasonably strong" is not that hard. And can be done with any sort of ad hoc testing.

I guess what I am saying is that for Rodin you should concentrate on the above quote. Do not go overboard on testing; any sort of ad hoc testing will do.


-Sam
Re: Comparing two versions of the same engine
Ralf wrote: I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines.

I agree with this point. The key to good testing value per game is variety: variety in position, opponent, and time control. The main problem with self-test is lack of variety, but the same is true to a lesser extent of small groups of opponents, as you point out. Once you have decided how much time to allocate to a test, it is very unclear to me how to distribute it between number of opponents, number of positions tested, and time controls. I have not seen any good theories, let alone data, supporting a conclusion on this very interesting issue.
-Sam
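Sam says he has seen no good theory for the allocation question, and none is offered here either; the budget arithmetic itself is easy to make concrete, though, and it shows how directly the dimensions trade off. A small sketch; every number in it is invented for illustration:

[code]
# A fixed CPU budget buys a fixed number of games, and
# opponents x positions x 2 colors has to fit inside it.
cpu_hours    = 48.0
game_minutes = 5.0                                   # e.g. a fast blitz control
total_games  = int(cpu_hours * 60 / game_minutes)    # 576 games

for opponents in (2, 6, 12):
    positions = total_games // (opponents * 2)       # 2 games per position (both colors)
    print(f"{opponents:2d} opponents -> {positions:3d} positions, "
          f"{opponents * positions * 2} games")
[/code]

Doubling the opponent pool halves the position coverage (or forces a faster time control), which is exactly the trade-off Sam is pointing at.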
Re: Comparing two versions of the same engine
Ralf wrote:
bob wrote: It isn't quite that simple. Your new change might have a side-effect of weakening some other part of your game, but your program doesn't understand (say) the finer points of king-side attack, so you won't notice that this new change has actually made your program worse, because the only opponent you test against can't exploit the weakness... This is why "inbreeding" is bad for biological reproduction.
I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines.

I am not. But I am more confident in testing against 10 others than against only my own previous version...

Ralf wrote: I mean, all our engines are far from being perfect (this is true even for Rybka); they all have here and there their own weaknesses, bugs, glitches and so on. When you concentrate on some engines as testing opponents (I remember that you use Fruit, Glaurung, Arasan and maybe one or two more), don't you think that you tune Crafty against the weaknesses of these engines?

Also Toga, and a couple of others as well... The best answer is to play against all possible opponents. But just playing against one is a bad idea, and if that "one" is a previous version of your program, it is even worse...

Ralf wrote: Somewhere in this thread you mentioned that you gained +75 Elo according to your tests. Did you cross-check this value against a greater variety of engines? I'm asking because lately I started to do some work on Spike's evaluation again. I wanted to tune some of the values, and did it by playing thousands of fixed-depth matches against 6 opponents. Finally I had a margin of +20, and I thought this would be easy to confirm in longer games, against more engines and other positions. In the end Spike was better against the opponents I had used in the fixed-depth matches, but partly significantly worse against others.

Yes I did, and at different time controls to boot...
Re: Comparing two versions of the same engine
BubbaTough wrote:
Ralf wrote: I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines.
I agree with this point. The key to good testing value per game is variety: variety in position, opponent, and time control. The main problem with self-test is lack of variety, but the same is true to a lesser extent of small groups of opponents, as you point out. Once you have decided how much time to allocate to a test, it is very unclear to me how to distribute it between number of opponents, number of positions tested, and time controls. I have not seen any good theories, let alone data, supporting a conclusion on this very interesting issue.
-Sam

Actually this discussion has happened. Maximizing the number of positions is key, playing 2 games per position to alternate colors and factor out unbalanced starting points. You can either try a large number of opponents or a large number of positions. I would not try to "fine-tune" _anything_ personally, as it may well become opponent-specific tuning. However, after a ton of testing against a ton of different opponents, there is not a huge difference in what is being done in program A vs program B today. I also do other kinds of testing (including long matches on ICC against specific opponents to verify that a change doesn't make it play worse against opponents not in the test regime).

But that misses the point here. Testing against your previous version is _far_ worse than testing against a few other opponents. Testing against a few opponents is somewhat worse than testing against a large group...
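Here is a minimal sketch of the pairing scheme Bob describes: every position is played twice with colors reversed, so an unbalanced starting position costs both sides equally. The position tags and opponent names are placeholders, not anyone's actual test suite.

[code]
# Build a schedule where each position is played with both colors
# against every opponent.
positions = ["pos001", "pos002", "pos003"]       # e.g. FENs from an opening suite
opponents = ["OppA", "OppB"]

schedule = []
for fen in positions:
    for opp in opponents:
        schedule.append((fen, "MyEngine", opp))  # my engine takes White
        schedule.append((fen, opp, "MyEngine"))  # same position, colors reversed
for fen, white, black in schedule:
    print(f"{fen}: {white} (White) vs {black} (Black)")
[/code]

Scaling this up is then just a matter of how many positions and opponents the time budget allows, as discussed above.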
Re: Comparing two versions of the same engine
Kempelen wrote: Bob, as I read your comments, I conclude that we, as home testers, have to take a different approach than playing thousands of games. We don't have enough time or computing power to play that many.
Anyway, I will try to summarize what I consider the basic rules for our situation. Maybe you and others can propose more ideas.
1) Common sense and intuition are the general rule.
2) All else being equal, if two tests lead to the same result, always give more credit to the simpler version.
3) Sometimes, to compare evaluation changes only, use fixed-depth (ply) tournaments.

You have to be very careful here. Fixed-depth testing distorts the results significantly. If you add a slow eval term, that program will get an unequal advantage at fixed depth, since the slower eval won't be a penalty; it will just take longer to move than the opponent. I don't do _any_ fixed-depth testing myself.

Kempelen wrote: 4) Use many engines.
5) Try to combine fast-time-control tournaments with slow ones.
6) Try to run tournaments with different opening positions (maybe mlmfl.ep).
7) If a 100-game test shows only a 5% difference between engine A v1 and engine A v2, consider the result equal for both. For a 1000-game tournament, maybe 2%. Do you agree with these numbers?
These are the ideas that come to me...

It takes almost 40,000 games to get +/- 4 Elo accuracy. So the question becomes more one of "how significant is the change?", which dictates how many games you need to verify that it is better.
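A rough sanity check of the 40,000-games figure. This is only a sketch under my own assumptions about the draw rate and the confidence level; depending on those choices the answer lands anywhere in the tens of thousands of games, the same ballpark as Bob's number.

[code]
import math

def games_for_elo_margin(margin_elo, draw_rate=0.35, z=1.96):
    """Games needed before the z-sigma error bar on a near-50% result shrinks
    to about +/- margin_elo.  The draw rate and an even win/loss split are
    assumptions, not measurements from this thread."""
    elo_per_score_point = 400.0 / (math.log(10.0) * 0.25)  # slope of the Elo curve at 50%
    sigma_per_game = 0.5 * math.sqrt(1.0 - draw_rate)      # stdev of a single game's score
    score_margin = margin_elo / elo_per_score_point
    return (z * sigma_per_game / score_margin) ** 2

for z, label in [(1.96, "95%"), (2.58, "99%")]:
    print(f"+/- 4 Elo at {label} confidence: about {games_for_elo_margin(4, z=z):,.0f} games")
[/code]

Either way the conclusion is the same: resolving single-digit Elo differences takes tens of thousands of games, far beyond what a typical home tester can play per change.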
Re: Comparing two versions of the same engine
bob wrote: there is not a huge difference in what is being done in program A vs program B today.

This is the key question. If true, it justifies the small number of opponents being used. I find that Glaurung plays pawn-vs-piece endgames in an original way, Naum treats trapped pieces in an original way, Boot has interesting opposite-colored-bishop logic, Romi is missing an unusually large amount of critical endgame knowledge given its strength, and so on. Including large multiples of programs which are constructed by reading, implementing, and tweaking the same set of ideas is not necessarily that valuable. But including lots of programs like the ones I mention, which play at least certain positions very differently from most others in your testing group, has great value. How to translate this concept into a plan of action is unclear, particularly since the "originality" of programs is hard to determine for most (non-open-source) programs, but it is still an interesting area to me.
-Sam
Re: Comparing two versions of the same engine
bob wrote:
Kempelen wrote: 3) Sometimes, to compare evaluation changes only, use fixed-depth (ply) tournaments.
You have to be very careful here. Fixed-depth testing distorts the results significantly. If you add a slow eval term, that program will get an unequal advantage at fixed depth, since the slower eval won't be a penalty; it will just take longer to move than the opponent. I don't do _any_ fixed-depth testing myself.

Well, I was thinking of doing ply tournaments only when changing score values, not when adding new chess knowledge that needs extra execution time. I don't see any drawback in doing ply tourneys for that kind of testing. Do you?
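To put a rough number on the distortion Bob warns about, here is the back-of-the-envelope arithmetic. The 20% slowdown and the Elo-per-doubling figure are rules of thumb assumed for illustration, not measurements from this thread. If a change only shuffles score values and does not slow the evaluation down at all, the hidden cost below really is zero, which is the point Kempelen is making; as soon as the change costs speed, fixed depth stops telling the whole story.

[code]
import math

slowdown = 1.20            # assume the new eval term makes the engine 20% slower
elo_per_doubling = 70.0    # rough rule of thumb, engine-dependent

# At fixed depth both versions search the same tree, so the slowdown costs
# nothing in the result -- it only shows up as longer thinking time.
cost_at_fixed_depth = 0.0

# At a fixed time control the slower version searches about 1/slowdown as many
# nodes, which is worth very roughly this many Elo:
cost_at_fixed_time = elo_per_doubling * math.log2(slowdown)

print(f"hidden cost at fixed depth:              {cost_at_fixed_depth:.0f} Elo")
print(f"approximate cost at a real time control: {cost_at_fixed_time:.1f} Elo")
[/code]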