More Participants or More Number of Games?
-
swami
More Participants or More Number of Games?
I remember Bob posting test results of Crafty against 2 or 3 programs. I don't mean the recent one, where the same version of Crafty was renamed to 3 distinct names to check the consistency of the results, but the test data posted months ago comparing each 22.x version to 22.x+1 to see how much the latest version had improved.
So, Bob, the problem I see with that test is that you only have 3 opponents to play Crafty against, namely Toga, Fruit and Glaurung. I don't believe I can rely too much on Elo changes for Crafty measured that way, even if they are gathered from a lakh (100,000) of games.
I think the data would be more realistic if you added a few more participants. If you only use open-source engines, then adding Scorpio, Delfi, Booot and Cyrano to the list would help. Those are all open source, I believe they would all be a good challenge for Crafty, and they are closely rated, 50-100 Elo apart from Crafty. If that takes more time, then reducing the number of games to something like 15k would be OK. I don't think the difference in error bars between running 30k games against a few engines and 15k games against more engines would affect the statistics much.
Do you prefer 10k games against a wide range of participants (say 10 engines), or 33k games against a small group of participants (3 engines)?
Personally I'd prefer the former rather than the latter...
So I don't understand Bob's claim that 2000-3000 games from the testing sites aren't enough and that the error bars are higher with a smaller number of games, while completely ignoring the diverse range of engines, each with a different style, that Crafty has so far played against.
Surely the performance measured from 2000-odd games against the diverse field of engines would overshadow the performance measured by running thousands of games of Crafty against a small group of engines?
-
krazyken
Re: More Participants or More Number of Games?
Cyrano is a good choice for sure, Booot and Delfi are Windows only, and the latest versions of Scorpio have stability issues.
Cyrano does need a patch to increase the stack size to make it stable. I changed the value in ethread.cpp, line 788, put 256 instead of 64.
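For illustration, raising a per-thread stack limit usually comes down to changing one constant at thread-creation time. The sketch below is hypothetical and is not Cyrano's actual ethread.cpp code; it uses POSIX threads, and the 64 and 256 figures are borrowed from the post above with kilobytes assumed as the unit.

```cpp
// Hypothetical sketch only -- NOT Cyrano's actual ethread.cpp. It just
// illustrates the kind of one-constant change involved: reserving a larger
// per-thread stack so a deeply recursive search doesn't overflow it.
#include <pthread.h>
#include <cstddef>

static const std::size_t STACK_KB = 256;  // was 64 in this sketch

static void *search_thread(void *) {
    // ... the engine's search loop would run here ...
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, STACK_KB * 1024);  // request a bigger stack
    pthread_t tid;
    pthread_create(&tid, &attr, search_thread, nullptr);
    pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}
```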
-
Gian-Carlo Pascutto
- Posts: 1260
- Joined: Sat Dec 13, 2008 7:00 pm
Re: More Participants or More Number of Games?
I don't think we have much data on whether performance increases against a small set of engines might be non-transitive against others.
We do know *for sure*, though, the minimum error margins caused by playing a limited number of games.
Since Bob is testing small changes, he's going to be more worried about the latter (which we're certain affects us) than the former (which is an unknown factor that may or may not affect us).
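The minimum error margin mentioned here can be put in numbers with the standard binomial approximation. A back-of-the-envelope sketch (my own illustration, not anyone's actual test harness): it prints roughly +/- 15 Elo at 2000 games, shrinking to about +/- 4 at 32000, which is consistent with the +/- 12 figure quoted later in the thread for 2000-3000 games once draws are counted.

```cpp
// Back-of-the-envelope 95% error bar on a measured Elo difference after n
// games, assuming a true score near 50% and treating every game as an
// independent win/loss trial. Draws shrink the bar somewhat, so these are
// upper-end figures.
#include <cstdio>
#include <cmath>

// Elo difference implied by a match score p (fraction of points won).
static double score_to_elo(double p) {
    return -400.0 * std::log10(1.0 / p - 1.0);
}

int main() {
    const double p = 0.5;
    for (int n = 2000; n <= 32000; n *= 2) {
        double se = std::sqrt(p * (1.0 - p) / n);  // std. error of the score
        std::printf("%6d games: +/- %4.1f Elo\n", n, score_to_elo(p + 1.96 * se));
    }
    return 0;
}
```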
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: More Participants or More Number of Games?
swami wrote: I remember Bob posting test results of Crafty against 2 or 3 programs. I don't mean the recent one, where the same version of Crafty was renamed to 3 distinct names to check the consistency of the results, but the test data posted months ago comparing each 22.x version to 22.x+1 to see how much the latest version had improved. So, Bob, the problem I see with that test is that you only have 3 opponents to play Crafty against, namely Toga, Fruit and Glaurung. I don't believe I can rely too much on Elo changes for Crafty measured that way, even if they are gathered from a lakh (100,000) of games.

If you had noticed carefully, I have used more than 3 opponents. And still do. I never use fewer than 4. But for my fast testing, this is enough. For more thorough testing, I use more. I have switched opponents around a bit: months ago I was using Fruit, Glaurung 1 and 2, and Arasan, and I've since replaced Arasan with Toga. I'm not sure exactly what you are finding fault with. I have two programs that are significantly stronger, one that is slightly weaker, and one that is a little weaker than that.

Finding decent opponents is not easy. (1) Not all run under Linux; most don't. (2) Of those that do, many have trouble with fast time controls and lose too many games on time, which doesn't help me at all. (3) Of those that are left, many are too weak. You don't learn anything playing programs that are more than 200 Elo weaker.

swami wrote: I think the data would be more realistic if you added a few more participants. If you only use open-source engines, then adding Scorpio, Delfi, Booot and Cyrano to the list would help. Those are all open source, I believe they would all be a good challenge for Crafty, and they are closely rated, 50-100 Elo apart from Crafty. If that takes more time, then reducing the number of games to something like 15k would be OK. I don't think the difference in error bars between running 30k games against a few engines and 15k games against more engines would affect the statistics much.

See above. I've tried most of those. I don't remember the specifics, but some just do not deal with fast time controls, and I have to modify them to do so. The fast time controls are important for quick testing. While I play lots of longer games as well, I am most interested in quick testing where I can complete several test runs in a short period of time. More importantly, I've tried a larger number of opponents with a smaller number of games and found no significant difference in the results, so long as I don't include too many weak engines.

swami wrote: Do you prefer 10k games against a wide range of participants (say 10 engines), or 33k games against a small group of participants (3 engines)? Personally I'd prefer the former rather than the latter... So I don't understand Bob's claim that 2000-3000 games from the testing sites aren't enough and that the error bars are higher with a smaller number of games, while completely ignoring the diverse range of engines, each with a different style, that Crafty has so far played against.

I don't see what there is to not understand. I have not said _anything_ about the testing sites. I've been quite clear that my goal is not to take a group of programs and find out who is best. My goal is to take two versions of Crafty that are very similar and find out which is best. That is a _different_ goal from the testing sites, and that is the goal I am interested in. I _have_ said that if a single person just plays 2000-3000 games to determine whether their new version is better than their old version, the results are worthless. It looks like what you don't understand is what I have been writing, not the results.

swami wrote: Surely the performance measured from 2000-odd games against the diverse field of engines would overshadow the performance measured by running thousands of games of Crafty against a small group of engines?

Not even close...

Would you rather have 2 games against 1000 opponents? Surely that would be better than 32000 games against four (not three) opponents? I'd hope _anyone_ would see the flaw in that...
-
swami
- Posts: 6663
- Joined: Thu Mar 09, 2006 4:21 am
Re: More Participants or More Number of Games?
bob wrote: If you had noticed carefully, I have used more than 3 opponents. And still do. I never use fewer than 4. But for my fast testing, this is enough. For more thorough testing, I use more. I have switched opponents around a bit: months ago I was using Fruit, Glaurung 1 and 2, and Arasan, and I've since replaced Arasan with Toga. I'm not sure exactly what you are finding fault with. I have two programs that are significantly stronger, one that is slightly weaker, and one that is a little weaker than that.

Well, if I were you, I'd just get myself a spare computer (or several) with Windows to do all that testing. I think you're missing out on a lot of equally matched opponents that would give Crafty a good challenge. Linux may be more stable and faster, but it's not the platform for computer chess engines and matches, because there aren't many engines.

bob wrote: You don't learn anything playing programs that are more than 200 Elo weaker.

With this statement, you just contradicted yourself: Glaurung 2.2 is at least 150-200 Elo higher than Crafty. Toga likewise. Arasan is about 100 Elo lower than Crafty. Not too close in strength, these opponents. Would the ratings estimation be accurate considering that the strength differences of Crafty's opponents are huge? Perhaps the older versions of these Glaurungs and Togas might even the league, but an old version of the same engine has the same style. I'd rather see Crafty playing different engines with diverse playing styles to test out Crafty's wits.

Your claim about 2 games against 1000 opponents is sarcastic exaggeration at best.

I still believe that 2000 games gathered from 10 engines (closer in strength to Crafty, say 25-50 Elo apart) are far more reliable than 30k games gathered from matches against engines that are 100-200 Elo apart. When you find that the latest version has improved enough to belong in a higher league, you just replace the set of engines with a newer, stronger set. That's the best way to see the progress.
-
wgarvin
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: More Participants or More Number of Games?
swami wrote: Well, if I were you, I'd just get myself a spare computer (or several) with Windows to do all that testing. I think you're missing out on a lot of equally matched opponents that would give Crafty a good challenge. Linux may be more stable and faster, but it's not the platform for computer chess engines and matches, because there aren't many engines.

bob wrote: You don't learn anything playing programs that are more than 200 Elo weaker.

swami wrote: With this statement, you just contradicted yourself: Glaurung 2.2 is at least 150-200 Elo higher than Crafty. Toga likewise. Arasan is about 100 Elo lower than Crafty. Not too close in strength, these opponents.

He didn't contradict himself at all. Bob is testing Crafty, not those other engines. Playing against engines that are 200 Elo points *stronger* than Crafty is not the same as playing against engines that are 200 Elo points *weaker* than Crafty. None of the engines you just listed are 200 points weaker than Crafty, because (as Bob already said) if they were, he would not be able to learn very much by testing against them.
For the kind of testing bob does, a single spare Windows box would take months to finish a single test run.
It's a little presumptuous to come in and say you think you know better than him how to test his engine.
Dr. Hyatt has been writing chess engines for decades and has one of the most rigorous testing setups of ANY chess engine author around.
Statistics is a funny thing -- when it comes to statistics, human intuition is not always reliable. For example, this persistent idea that 200 or 2000 or any similar number of games is enough to accurately gauge the impact of a small change that might be worth 10 Elo at most. I think Dr. Hyatt has pretty conclusively demolished that idea here in previous threads.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: More Participants or More Number of Games?
swami wrote: Well, if I were you, I'd just get myself a spare computer (or several) with Windows to do all that testing. I think you're missing out on a lot of equally matched opponents that would give Crafty a good challenge. Linux may be more stable and faster, but it's not the platform for computer chess engines and matches, because there aren't many engines.

What part of my testing approach don't you get? I have a cluster with 128 nodes, two processors per node, running _Linux_. There is no other choice. I am not going to go out and buy a bunch of Windows boxes just to test. So it is Linux or nothing. I do not _need_ Windows engines to test, and I have no idea why you would think that is necessary.

BTW, Linux is the platform for the "oldest / longest-running" chess engine around.

swami wrote: With this statement, you just contradicted yourself: Glaurung 2.2 is at least 150-200 Elo higher than Crafty. Toga likewise.

bob wrote: You don't learn anything playing programs that are more than 200 Elo weaker.

First, I did not contradict anything. Glaurung and Toga are _not_ 150-200 Elo stronger than Crafty. I've posted my numbers here many times. Those two are somewhere around 50 Elo weaker the way I am testing, which excludes any opening book whatsoever. So again, I have no idea what you are talking about. And even if you were correct, playing against programs that are _stronger_ than yourself will still show any improvement. Playing against programs that are weaker (much weaker) shows absolutely nothing unless you break something. I want to go up, not down.

swami wrote: Arasan is about 100 Elo lower than Crafty. Not too close in strength, these opponents.

Again, if you would simply follow my results, you would know I am not using Arasan in my testing at present. Fruit is a good engine, and is about 50-70 Elo below Crafty in my testing. Glaurung 1 is about 100 Elo worse. The reason I want _worse_ engines included is obvious to someone who has thought about testing: I don't want to gain 1 Elo against stronger opponents and lose 20 against weaker ones.

swami wrote: Would the ratings estimation be accurate considering that the strength differences of Crafty's opponents are huge? Perhaps the older versions of these Glaurungs and Togas might even the league, but an old version of the same engine has the same style. I'd rather see Crafty playing different engines with diverse playing styles to test out Crafty's wits.

That's why I use what I use. The latest Toga and Glaurung are different programs with different styles of play.

swami wrote: Your claim about 2 games against 1000 opponents is sarcastic exaggeration at best.

I think I'd prefer a minimum of _200_ games against a single opponent. That's the minimum cut-off. So as the number of participants increases, the number of games to be played against each engine remains constant. It is difficult to find 150 opponents (150 * 200 = 30,000 games) that are strong enough and run under Linux. I agree more opponents would be somewhat better. But not a _lot_ better at the moment.

swami wrote: I still believe that 2000 games gathered from 10 engines (closer in strength to Crafty, say 25-50 Elo apart) are far more reliable than 30k games gathered from matches against engines that are 100-200 Elo apart. When you find that the latest version has improved enough to belong in a higher league, you just replace the set of engines with a newer, stronger set. That's the best way to see the progress.

Easier said than done, as far as replacing opponents goes. And Windows is not the answer; if "Windows?" is the question, "no" is the answer. As far as your speculation about 2000 games goes, you are simply _dead_ wrong. You are hung up in the wrong world. I'm not trying to find out which of the group is strongest. I am trying to test two versions of Crafty and decide which is best. 2000 games will _not_ allow that under any circumstances except when a change produces a revolutionary impact, and those are not likely since Crafty is a mature engine. If you can't understand the problem with 2000 games, there's little I can do to explain further. But an error bar of +/- 12 is _not_ going to help you decide whether A' is better than A or not. And I believe _most_ understand that concept.
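Turning that error-bar argument around gives a rough sample-size estimate: how many games before a true improvement of d Elo pokes above a 95% error bar. This is my own back-of-the-envelope sketch under the same no-draw binomial assumption as before, so real numbers will differ somewhat; it shows small changes of a few Elo needing tens of thousands of games.

```cpp
// Rough sample-size estimate: games needed before a true improvement of
// d Elo clears a 95% error bar (1.96 standard errors), ignoring draws.
#include <cstdio>
#include <cmath>

// Expected score for a d-Elo advantage under the logistic rating model.
static double elo_to_score(double d) {
    return 1.0 / (1.0 + std::pow(10.0, -d / 400.0));
}

int main() {
    for (double d = 2.0; d <= 16.0; d *= 2.0) {        // 2, 4, 8, 16 Elo
        double edge = elo_to_score(d) - 0.5;            // score gained over 50%
        double n = 1.96 * 1.96 * 0.25 / (edge * edge);  // need 1.96*se < edge
        std::printf("%4.0f Elo: about %6.0f games\n", d, n);
    }
    return 0;
}
```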
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: More Participants or More Number of Games?
wgarvin wrote: He didn't contradict himself at all. Bob is testing Crafty, not those other engines. Playing against engines that are 200 Elo points *stronger* than Crafty is not the same as playing against engines that are 200 Elo points *weaker* than Crafty. None of the engines you just listed are 200 points weaker than Crafty, because (as Bob already said) if they were, he would not be able to learn very much by testing against them.
For the kind of testing Bob does, a single spare Windows box would take months to finish a single test run.
It's a little presumptuous to come in and say you think you know better than him how to test his engine.
Dr. Hyatt has been writing chess engines for decades and has one of the most rigorous testing setups of ANY chess engine author around.
Statistics is a funny thing -- when it comes to statistics, human intuition is not always reliable. For example, this persistent idea that 200 or 2000 or any similar number of games is enough to accurately gauge the impact of a small change that might be worth 10 Elo at most. I think Dr. Hyatt has pretty conclusively demolished that idea here in previous threads.

Not conclusively enough, it would appear.
We keep coming back to the "basement tournament testing" approach, where someone is interested in which engine is strongest. I don't give a hoot about that. All I care about is comparing Crafty version A against Crafty version A', and making a go/no-go decision about keeping A', or keeping A and discarding A'.
I've gotten a ton of suggestions about the LMR issue. I have tried every last one, plus another hundred variations I thought might have a chance. To date, not a single one has worked out. Yet some swear by their results, or say "this is in my code, I believe it is better..."
I thought I had found something interesting with the offset-window idea I mentioned: it found solutions faster overall in tactical tests, and yet it played _worse_ in real games, whether the games were very fast or very slow, at various levels of "tuning" of how far to offset the window. Intuition will only take you so far in computer chess. I relied on it far too long. It's unfortunate that it requires so many games to reduce the error margin to an acceptable level, but as is often said, "it is what it is, and nothing more..."
I can remember thinking many years ago about Ed's testing approach with Rebel, where he had a group of machines playing games 24/7. And I thought "Wow, wish I could do that..." Now I can do it about 100x better.
All I can say for sure is that whatever works, after rigorous testing, will show up in Crafty. Anybody can look at the source to see what is happening, and have a pretty good feeling that anything new is an improvement with regard to Elo, though not necessarily with regard to test suites, which I don't use at all.
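One standard way to turn such a match result into a go/no-go number is the likelihood of superiority: the probability that A' really is stronger than A given the win/loss counts. The sketch below illustrates the statistic itself; it is not necessarily what the cluster harness computes, and the game counts in it are made up.

```cpp
// Likelihood of superiority: probability that A' really is stronger than A,
// given only the win and loss counts (draws carry no signal here).
#include <cstdio>
#include <cmath>

static double los(double wins, double losses) {
    return 0.5 * (1.0 + std::erf((wins - losses) / std::sqrt(2.0 * (wins + losses))));
}

int main() {
    // Hypothetical 32000-game run: 12000 wins, 11500 losses, the rest drawn.
    std::printf("LOS = %.4f\n", los(12000, 11500));  // ~0.999: keep A'
    return 0;
}
```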
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: More Participants or More Number of Games?
Gian-Carlo Pascutto wrote: I don't think we have much data on whether performance increases against a small set of engines might be non-transitive against others. We do know *for sure*, though, the minimum error margins caused by playing a limited number of games. Since Bob is testing small changes, he's going to be more worried about the latter (which we're certain affects us) than the former (which is an unknown factor that may or may not affect us).

I think it might be a bigger danger if we were looking at games played in the testing and then trying to tune the program to improve on mistakes made, because that would be more like training against a specific opponent. We are not developing like that. Our changes are based on intuition and chess skill / understanding; we simply use testing to confirm or disprove our intuition. We don't make a single change to try to beat a single opponent. I can tune king safety differently and improve the results against Glaurung 2 / Toga 2, but I know those changes are not generic in nature and have a high risk of backfiring against a different program.
-
mhull
- Posts: 13447
- Joined: Wed Mar 08, 2006 9:02 pm
- Location: Dallas, Texas
- Full name: Matthew Hull
Re: More Participants or More Number of Games?
bob wrote: I think it might be a bigger danger if we were looking at games played in the testing and then trying to tune the program to improve on mistakes made, because that would be more like training against a specific opponent. We are not developing like that.

But you used to. IIRC, Roman would make suggestions based on his observations of Crafty's play on ICC, or a player would lean on a known weakness (merciless), and you would tune to improve on the mistakes made.

bob wrote: Our changes are based on intuition and chess skill / understanding; we simply use testing to confirm or disprove our intuition. We don't make a single change to try to beat a single opponent. I can tune king safety differently and improve the results against Glaurung 2 / Toga 2, but I know those changes are not generic in nature and have a high risk of backfiring against a different program.

But it is true that big increases have been found without the benefit of cluster testing. Shredder held its edge for a long time, and now Rybka has a large edge that was not found with the aid of cluster testing. Cluster testing is without a doubt a very powerful tool, yet it's natural for people to wonder how other championship projects (Shredder, Rybka) discovered their crushingly harmonious balance of techniques without large computing resources.
Matthew Hull