Comparing two versions of the same engine

Bob, as I read your comments, I conclude that we, as home testers, have to take a different approach than playing thousands of games. We don't have enough time or computing power to play that many.
Anyway, I will try to summarize what I consider the basic rules for our situation. Maybe you and others can propose more ideas.
1) Common sense and intuition are the general rule.
2) All else being equal, if two tests lead to the same result, always give more credit to the simpler version.
3) Sometimes, to compare evaluation changes only, use fixed-depth (ply) tournaments.
4) Use many engines.
5) Try to combine fast-time-control tournaments with slow ones.
6) Try to run tournaments with different opening positions (maybe mlmfl.ep).
7) If a 100-game test shows only a 5% difference between engine A v1 and engine A v2, consider the result equal for both. For a 1000-game tournament, maybe 2%. Do you agree with these numbers? (A rough significance check is sketched at the end of this post.)
These are the ideas that come to me...
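Since the post asks whether the 5%/2% numbers are reasonable, here is a minimal back-of-the-envelope check of rule 7. It is only a sketch: the 40% draw rate and the two-standard-deviation cut-off are my own assumptions, not anything stated in this thread.

[code]
import math

def match_stderr(games, score=0.5, draw_rate=0.4):
    """Standard error of the final score fraction over `games` games,
    when each game scores 0, 0.5 or 1 and draws occur at `draw_rate`."""
    p_win = score - 0.5 * draw_rate                # chosen so the mean equals `score`
    var = p_win + draw_rate * 0.25 - score ** 2    # E[X^2] - E[X]^2
    return math.sqrt(var / games)

for games, gap in [(100, 0.05), (1000, 0.02)]:
    score = 0.5 + gap / 2                          # a 5% gap means 52.5% vs 47.5%
    z = (score - 0.5) / match_stderr(games, score)
    print(f"{games:5d} games, {gap:.0%} gap: z = {z:.2f}")
[/code]

With these assumptions both cases come out well under two standard deviations (z is roughly 0.7 and 0.8), so treating such results as equal is, if anything, on the safe side.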
Re: Comparing two versions of the same engine
bob wrote: It isn't quite that simple. Your new change might have a side-effect of weakening some other part of your game, but your program doesn't understand (say) the finer points of king-side attack, so you won't notice that this new change has actually made your program worse, because the only opponent you test against can't exploit the weakness... This is why "inbreeding" is bad for biological reproduction.

I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines. I mean, all our engines are far from being perfect (this is true even for Rybka); they all have here and there their own weaknesses, bugs, glitches and so on. When you concentrate on some engines as testing opponents (I remember that you use Fruit, Glaurung, Arasan and maybe one or two more), don't you think that you tune Crafty against the weaknesses of these engines?

Somewhere in this thread you mentioned that you gained +75 Elo according to your tests. Did you cross-check this value against a greater variety of engines?

I'm asking because lately I started to do some work on Spike's evaluation again. I wanted to tune some of the values, and did it by playing thousands of fixed-depth matches against 6 opponents. Finally I had a margin of +20, and I thought this would be easy to confirm in longer games, against more engines and other positions. In the end Spike was better against the opponents I had used in the fixed-depth matches, but partly significantly worse against others.
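What Ralf describes is the classic way tuning over-fits to its opponent pool. One common safeguard is to keep a few opponents completely out of the tuning runs and only trust a gain that also shows up against them. A minimal sketch of that idea, assuming made-up opponent names, game counts and scores (none of this is Ralf's actual data):

[code]
import math

def elo_gain(score, games):
    """Approximate Elo for a score fraction, with a rough one-sigma error bar
    (simple binomial approximation; draws make the real error a bit smaller)."""
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    stderr = math.sqrt(score * (1.0 - score) / games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))   # d(Elo)/d(score)
    return elo, stderr * slope

# Hypothetical setup: tune only against the first pool, keep the second pool
# (plus fresh positions and a real time control) for a final cross-check.
pools = {
    "tuning pool (OppA..OppD)":   (0.53, 2000),   # new vs old looks like ~ +20 Elo here
    "held-out pool (OppE, OppF)": (0.49, 1000),   # ...but the gain does not transfer
}
for name, (score, games) in pools.items():
    gain, err = elo_gain(score, games)
    print(f"{name}: {gain:+.1f} +/- {err:.1f} Elo over {games} games")
[/code]

The pattern in these invented numbers mirrors Ralf's experience: a solid-looking gain against the tuning pool that is indistinguishable from noise against everyone else.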
Re: Comparing two versions of the same engine
Ralf wrote: I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines. I mean, all our engines are far from being perfect (this is true even for Rybka); they all have here and there their own weaknesses, bugs, glitches and so on. When you concentrate on some engines as testing opponents (I remember that you use Fruit, Glaurung, Arasan and maybe one or two more), don't you think that you tune Crafty against the weaknesses of these engines?

Somewhere in this thread you mentioned that you gained +75 Elo according to your tests. Did you cross-check this value against a greater variety of engines?

I'm asking because lately I started to do some work on Spike's evaluation again. I wanted to tune some of the values, and did it by playing thousands of fixed-depth matches against 6 opponents. Finally I had a margin of +20, and I thought this would be easy to confirm in longer games, against more engines and other positions. In the end Spike was better against the opponents I had used in the fixed-depth matches, but partly significantly worse against others.

You play "thousands" of games with only six opponents. Would it be better to play against a lot of engines (more than 30, say) with fewer games per engine? Maybe that strategy would be more reliable...
Re: Comparing two versions of the same engine
bob wrote: Getting to "reasonably strong" is not that hard. And can be done with any sort of ad hoc testing.

I guess what I am saying is that for Rodin you should concentrate on the above quote. Do not go overboard on testing; any sort of ad hoc testing will do.


-Sam
Re: Comparing two versions of the same engine
Ralf wrote: I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines.

I agree with this point. The key to good testing value per game is variety: variety in position, opponent, and time control. The main problem with self-test is lack of variety, but the same is true to a lesser extent of small groups of opponents, as you point out. Once you have decided how much time to allocate to a test, it is very unclear to me how to distribute it between number of opponents, number of positions tested, and time controls. I have not seen any good theories, let alone data, supporting a conclusion on this very interesting issue.
-Sam
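Sam says he has seen no good theory for the allocation question, and none is offered here either; the budget arithmetic itself is easy to make concrete, though, and it shows how directly the dimensions trade off. A small sketch; every number in it is invented for illustration:

[code]
# A fixed CPU budget buys a fixed number of games, and
# opponents x positions x 2 colors has to fit inside it.
cpu_hours    = 48.0
game_minutes = 5.0                                   # e.g. a fast blitz control
total_games  = int(cpu_hours * 60 / game_minutes)    # 576 games

for opponents in (2, 6, 12):
    positions = total_games // (opponents * 2)       # 2 games per position (both colors)
    print(f"{opponents:2d} opponents -> {positions:3d} positions, "
          f"{opponents * positions * 2} games")
[/code]

Doubling the opponent pool halves the position coverage (or forces a faster time control), which is exactly the trade-off Sam is pointing at.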
Re: Comparing two versions of the same engine
Ralf wrote:
bob wrote: It isn't quite that simple. Your new change might have a side-effect of weakening some other part of your game, but your program doesn't understand (say) the finer points of king-side attack, so you won't notice that this new change has actually made your program worse, because the only opponent you test against can't exploit the weakness... This is why "inbreeding" is bad for biological reproduction.
I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines.

I am not. But I am more confident in testing against 10 others than against only my own previous version...

Ralf wrote: I mean, all our engines are far from being perfect (this is true even for Rybka); they all have here and there their own weaknesses, bugs, glitches and so on. When you concentrate on some engines as testing opponents (I remember that you use Fruit, Glaurung, Arasan and maybe one or two more), don't you think that you tune Crafty against the weaknesses of these engines?

Also Toga, and a couple of others as well... The best answer is to play against all possible opponents. But just playing against one is a bad idea, and if that "one" is a previous version of your program, it is even worse...

Ralf wrote: Somewhere in this thread you mentioned that you gained +75 Elo according to your tests. Did you cross-check this value against a greater variety of engines? I'm asking because lately I started to do some work on Spike's evaluation again. I wanted to tune some of the values, and did it by playing thousands of fixed-depth matches against 6 opponents. Finally I had a margin of +20, and I thought this would be easy to confirm in longer games, against more engines and other positions. In the end Spike was better against the opponents I had used in the fixed-depth matches, but partly significantly worse against others.

Yes I did, and at different time controls to boot...
Re: Comparing two versions of the same engine
BubbaTough wrote:
Ralf wrote: I wonder why you are confident that this inbreeding can't occur when testing against some (few) other engines.
I agree with this point. The key to good testing value per game is variety: variety in position, opponent, and time control. The main problem with self-test is lack of variety, but the same is true to a lesser extent of small groups of opponents, as you point out. Once you have decided how much time to allocate to a test, it is very unclear to me how to distribute it between number of opponents, number of positions tested, and time controls. I have not seen any good theories, let alone data, supporting a conclusion on this very interesting issue.
-Sam

Actually this discussion has happened. Maximizing the number of positions is key, playing 2 games per position to alternate colors and factor out unbalanced starting points. You can either try a large number of opponents or a large number of positions. I would not try to "fine-tune" _anything_ personally, as it may well become opponent-specific tuning. However, after a ton of testing against a ton of different opponents, there is not a huge difference in what is being done in program A vs program B today. I also do other kinds of testing (including long matches on ICC against specific opponents to verify that a change doesn't make it play worse against opponents not in the test regime).

But that misses the point here. Testing against your previous version is _far_ worse than testing against a few other opponents. Testing against a few opponents is somewhat worse than testing against a large group...
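Here is a minimal sketch of the pairing scheme Bob describes: every position is played twice with colors reversed, so an unbalanced starting position costs both sides equally. The position tags and opponent names are placeholders, not anyone's actual test suite.

[code]
# Build a schedule where each position is played with both colors
# against every opponent.
positions = ["pos001", "pos002", "pos003"]       # e.g. FENs from an opening suite
opponents = ["OppA", "OppB"]

schedule = []
for fen in positions:
    for opp in opponents:
        schedule.append((fen, "MyEngine", opp))  # my engine takes White
        schedule.append((fen, opp, "MyEngine"))  # same position, colors reversed
for fen, white, black in schedule:
    print(f"{fen}: {white} (White) vs {black} (Black)")
[/code]

Scaling this up is then just a matter of how many positions and opponents the time budget allows, as discussed above.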
Re: Comparing two versions of the same engine
Kempelen wrote: Bob, as I read your comments, I conclude that we, as home testers, have to take a different approach than playing thousands of games. We don't have enough time or computing power to play that many.
Anyway, I will try to summarize what I consider the basic rules for our situation. Maybe you and others can propose more ideas.
1) Common sense and intuition are the general rule.
2) All else being equal, if two tests lead to the same result, always give more credit to the simpler version.
3) Sometimes, to compare evaluation changes only, use fixed-depth (ply) tournaments.

You have to be very careful here. Fixed-depth testing distorts the results significantly. If you add a slow eval term, that program will get an unequal advantage at fixed depth, since the slower eval won't be a penalty; it will just take longer to move than the opponent. I don't do _any_ fixed-depth testing myself.

Kempelen wrote: 4) Use many engines.
5) Try to combine fast-time-control tournaments with slow ones.
6) Try to run tournaments with different opening positions (maybe mlmfl.ep).
7) If a 100-game test shows only a 5% difference between engine A v1 and engine A v2, consider the result equal for both. For a 1000-game tournament, maybe 2%. Do you agree with these numbers?
These are the ideas that come to me...

It takes almost 40,000 games to get +/- 4 Elo accuracy. So the question becomes more one of "how significant is the change?", which dictates how many games you need to verify that it is better.
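A rough sanity check of the 40,000-games figure. This is only a sketch under my own assumptions about the draw rate and the confidence level; depending on those choices the answer lands anywhere in the tens of thousands of games, the same ballpark as Bob's number.

[code]
import math

def games_for_elo_margin(margin_elo, draw_rate=0.35, z=1.96):
    """Games needed before the z-sigma error bar on a near-50% result shrinks
    to about +/- margin_elo.  The draw rate and an even win/loss split are
    assumptions, not measurements from this thread."""
    elo_per_score_point = 400.0 / (math.log(10.0) * 0.25)  # slope of the Elo curve at 50%
    sigma_per_game = 0.5 * math.sqrt(1.0 - draw_rate)      # stdev of a single game's score
    score_margin = margin_elo / elo_per_score_point
    return (z * sigma_per_game / score_margin) ** 2

for z, label in [(1.96, "95%"), (2.58, "99%")]:
    print(f"+/- 4 Elo at {label} confidence: about {games_for_elo_margin(4, z=z):,.0f} games")
[/code]

Either way the conclusion is the same: resolving single-digit Elo differences takes tens of thousands of games, far beyond what a typical home tester can play per change.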
Re: Comparing two versions of the same engine
bob wrote: there is not a huge difference in what is being done in program A vs program B today.

This is the key question. If true, it justifies the small number of opponents being used. I find that Glaurung plays pawn-vs-piece endgames in an original way, Naum treats trapped pieces in an original way, Boot has interesting opposite-colored-bishop logic, Romi is missing an unusually large amount of critical endgame knowledge given its strength, and so on. Including large multiples of programs which are constructed by reading, implementing, and tweaking the same set of ideas is not necessarily that valuable. But including lots of programs like the ones I mention, which play at least certain positions very differently from most others in your testing group, has great value. How to translate this concept into a plan of action is unclear, particularly since the "originality" of programs is hard to determine for most (non-open-source) programs, but it is still an interesting area to me.
-Sam
Re: Comparing two versions of the same engine
bob wrote:
Kempelen wrote: 3) Sometimes, to compare evaluation changes only, use fixed-depth (ply) tournaments.
You have to be very careful here. Fixed-depth testing distorts the results significantly. If you add a slow eval term, that program will get an unequal advantage at fixed depth, since the slower eval won't be a penalty; it will just take longer to move than the opponent. I don't do _any_ fixed-depth testing myself.

Well, I was thinking of doing ply tournaments only when changing score values, not when adding new chess knowledge that needs extra execution time. I don't see any drawback in doing ply tourneys for that kind of testing. Do you?
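To put a rough number on the distortion Bob warns about, here is the back-of-the-envelope arithmetic. The 20% slowdown and the Elo-per-doubling figure are rules of thumb assumed for illustration, not measurements from this thread. If a change only shuffles score values and does not slow the evaluation down at all, the hidden cost below really is zero, which is the point Kempelen is making; as soon as the change costs speed, fixed depth stops telling the whole story.

[code]
import math

slowdown = 1.20            # assume the new eval term makes the engine 20% slower
elo_per_doubling = 70.0    # rough rule of thumb, engine-dependent

# At fixed depth both versions search the same tree, so the slowdown costs
# nothing in the result -- it only shows up as longer thinking time.
cost_at_fixed_depth = 0.0

# At a fixed time control the slower version searches about 1/slowdown as many
# nodes, which is worth very roughly this many Elo:
cost_at_fixed_time = elo_per_doubling * math.log2(slowdown)

print(f"hidden cost at fixed depth:              {cost_at_fixed_depth:.0f} Elo")
print(f"approximate cost at a real time control: {cost_at_fixed_time:.1f} Elo")
[/code]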