More Participants or More Number of Games?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: More Participants or More Number of Games?

Post by bob »

mhull wrote:
bob wrote:
Gian-Carlo Pascutto wrote:I don't think we have much data on whether performance increases against a small set of engines might be non-transitive against others.

We know *for sure* though the minimum error margins caused by playing a limited number of games.

Since Bob is testing small changes, he's going to be more worried about the latter (which we know for certain affects us) than the former (an unknown factor that may or may not affect us).
I think it might be a bigger danger if we were looking at games played during the testing and then trying to tune the program to improve on the mistakes made, because that would be more like training against a specific opponent. We are not developing like that.
But you used to. IIRC, Roman would make suggestions based on his observations of Crafty's play on ICC, or a player would lean on a known weakness (merciless), and you would tune to improve on the mistakes made.
Yes. Against a specific "weakness" the program shows. Not against a specific weakness the opponent shows. That was my point. I am not tuning to "beat glaurung 2 or Toga 2." I am tuning to play better chess overall, and then measuring against those to see if the result is better or worse. It is certainly possible, although not extremely likely, that a "better eval" would play worse against one of them. But I am using 4 opponents most of the time, sometimes 6 or 8, and aggregating the results to look at overall change. +=good, -=bad.

It is important to note that we never tuned against a weakness an opponent had, which is more dangerous. "Gambit Tiger" is one example. It could beat humans right and left with its speculative style of play. Against programs with weak king safety, it would also win with spectacular attacks. But against programs that were more solid, it would go down in flames when the attack failed, and the material sacrificed to initiate the attack led to an easy win for the opponent.

If I were watching individual cluster games and tuning to win games we are losing or drawing, I'd be much more concerned that we were training to beat specific opponents, probably to the detriment of play against other opponents. But we are not doing that. We simply use the test results to accept or reject a change we made that had nothing to do with the opponents and how they play.
bob wrote:Our changes are based on intuition and chess skill / understanding, we simply use testing to confirm or disprove our intuition. We don't make a single change to try to beat a single opponent. I can tune king safety differently and improve the results against Glaurung 2 / Toga 2. But I know those changes are not generic in nature and have a high risk of backfiring against a different program.
But it is true that big increases have been found without the benefit of cluster testing. Shredder held its edge for a long time. Now Rybka has a large edge, not found with the aid of cluster testing.
Not quite. Kaufman claims their improvements have been the result of their testing, which is quite similar to mine, except that it takes them overnight to play 40,000 games on an 8-core box, whereas I can play somewhat longer games and get the results back in an hour. Our approaches are quite similar other than the speed at which we get results back. And that is all cluster testing offers: speed. Nothing I couldn't learn with a single box, except that I learn it far faster.

Yet, cluster testing is without a doubt a very powerful tool. But it's natural for people to wonder how other championship projects (Shredder, Rybka) discovered their crushingly harmonious balance of techniques without large computing resources.
Thorough testing. They just can't come anywhere near the turnaround time I can produce; other than that, our approaches are quite similar.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: More Participants or More Number of Games?

Post by bob »

swami wrote:
If you had looked carefully, you would have noticed that I use more than 3 opponents, and still do. I never use fewer than 4, but for my fast testing this is enough. For more thorough testing, I use more. I have switched opponents around a bit; months ago I was using Fruit, Glaurung 1 and 2, and Arasan, and I've since replaced Arasan with Toga. I'm not sure exactly what you are finding fault with. I have two programs that are significantly stronger, one that is slightly weaker, and one that is a little weaker than that.
Well, if I were you, I'd just get a spare computer (or several) with Windows to do all that testing. I think you're missing out on a lot of equally matched opponents that would give Crafty a good challenge. Linux may be more stable and faster, but it's not the platform for computer chess engines and matches, because there aren't many engines.
You don't learn anything playing programs that are more than 200 Elo weaker.
With this statement, you just contradicted yourself: Glaurung 2.2 is at least 150-200 Elo higher than Crafty. Toga likewise. Arasan is about 100 Elo lower than Crafty. Not too close in strength, these opponents. Would the rating estimates be very accurate given that the strength differences of Crafty's opponents are so large? Perhaps the older versions of these Glaurungs and Togas might even out the league, but an old version of the same engine has the same style. I'd rather see Crafty playing different engines with diverse playing styles to test out Crafty's wits.

Your claim about 2 games against 1000 opponents is sarcastic exaggeration at best ;) I think I'd prefer a minimum of _200_ games against a single opponent. That's the minimum cutoff. So increase the number of participants while the number of games played against each engine remains constant.

I still believe that 2000 games gathered from 10 engines (of nearer strength to Crafty, say 25-50 Elo apart) are far more reliable than 30K games gathered from matches against engines that are 100-200 Elo apart. When you find that the latest version has improved enough to belong in a higher league, you just replace the set of engines with a newer set of even stronger ones. That's the best way to see the progress.

BTW, my claim about the number of opponents was not an exaggeration. To get a +/- 4 error bar, you need approximately 32K games. No way around that. More opponents with fewer games per opponent does not drop the error bar at all unless you play more than 32K games. So I am not quite sure what you were talking about there.

200 games per opponent _still_ leaves me needing 150 opponents to get to the 30,000 games necessary for a reasonably tight +/- 4 error bar.
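As a sanity check on those numbers, the error bar for a match scored near 50% can be sketched like this. This is a back-of-envelope model, not anything from the thread; the draw ratios and the 1.96 z-value (95% confidence) are illustrative assumptions:

```python
import math

def elo_error_bar(n_games, draw_ratio=0.0, z=1.96):
    """Approximate 95% error bar, in Elo, for a match scored near 50%.

    Draws land exactly on the 0.5 mean, so a higher draw ratio
    shrinks the per-game variance and hence the error bar.
    """
    var = (1.0 - draw_ratio) * 0.25                 # per-game score variance at 50%
    se_score = math.sqrt(var / n_games)             # standard error of the mean score
    elo_per_score = 400.0 / (math.log(10) * 0.25)   # slope of the Elo curve at 50%
    return z * se_score * elo_per_score

print(round(elo_error_bar(32000), 1))                   # about 3.8 Elo
print(round(elo_error_bar(2000, draw_ratio=0.35), 1))   # about 12.3 Elo
```

With no draws, roughly 30K games is indeed what it takes to bring the bar near +/- 4, and 2000 games with a realistic draw ratio lands near the +/- 12 figure quoted elsewhere in the thread.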
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: More Participants or More Number of Games?

Post by bob »

OK, someone suggested Cyrano. It was not a good suggestion. I have been running tests on my laptop, and I have seen only two results so far: (1) Crafty wins the game, or (2) Cyrano crashes/hangs and loses on time.

That's not usable in an automated test, as the result using my arbiter software will be (so far) a 100% score for Crafty, which is useless.
krazyken

Re: More Participants or More Number of Games?

Post by krazyken »

bob wrote:OK, someone suggested Cyrano. This was not a good suggestion. I have been running tests on my laptop. I have seen two results so far: (1) crafty wins the game or (2) Cyrano crashes/hangs and loses on time.

That's not usable in an automated test as the result using my arbiter software will be (so far) a 100% score for Crafty which is useless.
Did you increase the stack size as I suggested in the other post? I've run a few hundred slower games without crashing after that fix.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: More Participants or More Number of Games?

Post by bob »

krazyken wrote:
bob wrote:OK, someone suggested Cyrano. This was not a good suggestion. I have been running tests on my laptop. I have seen two results so far: (1) crafty wins the game or (2) Cyrano crashes/hangs and loses on time.

That's not usable in an automated test as the result using my arbiter software will be (so far) a 100% score for Crafty which is useless.
did you increase the stack size as I suggested in the other post? I've run a few hundred slower games without crashing after that fix.
The crashing went away, but it still hangs occasionally and will not even terminate after xboard declares a time forfeit. I have to kill the thing to go to the next game...
swami
Posts: 6663
Joined: Thu Mar 09, 2006 4:21 am

Re: More Participants or More Number of Games?

Post by swami »

bob wrote: Easier said than done, as far as replacing opponents. Windows is not the answer. "Windows?" is the question; "no" is the answer. As far as your speculation about 2000 games, you are simply _dead_ wrong. You are hung up in the wrong world. I'm not trying to find out which of the group is strongest. I am trying to test two versions of Crafty and decide which is best. 2000 games will _not_ allow that under any circumstances except when a change produces a revolutionary impact. Those are not likely, since Crafty is a mature engine. If you can't understand the problem with 2000 games, there's little I can do to explain further. But an error bar of +/- 12 is _not_ going to help you decide whether A' is better than A or not. And I believe _most_ understand that concept.
With Windows, you get more than 30 engines that are 30-50 Elo apart from Crafty. With Linux you get only 4 or 5 engines, and you have to resort to using older versions of the same engines because of the lack of participants.

I wasn't implying that you had to find out which group of engines was stronger. I just said you could track Crafty's progress against a set of 15 engines first in a division gauntlet; call it Division C. If a version plays well enough to finish consistently in the top 2, you promote it to a higher division (Division B) and run the _gauntlet_ there, but only if you have Windows. That's how football club leagues work, no? Except in this case it's a one-sided gauntlet, mainly to test versions of Crafty.

OK, then you can play a minimum of 1K games against each of 30 engines. That's 30,000. Error bars will be lower.

My point is that I'd much prefer to see a diversity of playing styles from various engines pitted against Crafty to test Crafty's wits. 30 Windows engines, all with different playing styles: that's my definition of diversity.

With Linux, you get no diverse playing styles because the list of players is limited. Adding an older version of the same engine (Glaurung, Toga...) is not adding a different engine altogether.

Experiment 1:
4 genuine engines: (Toga, Glaurung, Fruit, Arasan) + 2 or 3 old versions of these engines. Total ~ 7 participants. 30k games.

Experiment 2:
30 genuine engines: (Scorpio, Booot, Cyrano, Colossus, Delfi...etc ). No old versions of the same engine. Total ~ 30 participants. 30K games.

(Don't assume that I'm suggesting a round robin, I have always been suggesting Gauntlet)

Now doesn't anybody see that the 2nd experiment is clearly better?

Do you think it makes any difference to the stats whether you gather 30K games from the first experiment or from the second? I think it does.

For one thing, after 30K games, the ratings produced by the first experiment will be much different from the ratings produced by the second. Why? Because Crafty has played 30 different opponents with diverse _playing styles_. Who knows whether Crafty copes well with engines that are very aggressive? Who knows whether it plays well against engines that are very sacrificial? Who knows whether it plays well against engines that have an exceptional understanding of positional chess? That's what diversity is for.
swami
Posts: 6663
Joined: Thu Mar 09, 2006 4:21 am

Re: More Participants or More Number of Games?

Post by swami »

wgarvin wrote:For the kind of testing bob does, a single spare Windows box would take months to finish a single test run.
Could be, but just adding one more computer would help. Windows is the common platform for a lot of engines.

Suppose 30 friends who are a good challenge for you want to play basketball outdoors, but you only want to play indoors, where you can find just 4 players who also challenge you. Would you rather stick indoors and test your skills against those 4? I'd have thought playing outdoors would be better.

Comp chess on Windows: outdoor basketball with 30 players, each with a much different playing style. Some may act as your nemesis.

Comp chess on Linux: indoor basketball with 4 players plus 3 additional juniors who play exactly like the aforementioned 4 seniors.
It's a little presumptuous to come in and say you think you know better than him how to test his engine. :lol:
Dr. Hyatt has been writing chess engines for decades and has one of the most rigorous testing setups of ANY chess engine author around.
Nowhere have I stated that I know more about testing than Bob. I claim here, right now, that Bob knows more about testing than I do. Does that make you feel better? :wink:

I too have a right to argue these matters with Bob, or anyone else for that matter. I'm not arguing for the sake of arguing or for the sake of trying to outwit others. I'm arguing for something I feel deserves mentioning.
Statistics is a funny thing -- when it comes to statistics, human intuition is not always reliable. Take, for example, this persistent idea that 200 or 2000 or any similar number of games is enough to accurately gauge the impact of a small change that might be worth 10 Elo at most. I think Dr. Hyatt has pretty conclusively demolished that idea in previous threads here.
Yes, the keyword here is "accurately" measuring the Elo gain of changes. I agree. But the question I asked is under what _conditions_ you find it better to measure the Elo gain from changes. Certainly in outdoor basketball, or experiment two... :)
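Putting a rough number on the "10 Elo at most" remark: the game count needed before the error bar shrinks below a hoped-for gain grows with the inverse square of the gain. The sketch below is my own back-of-envelope model; the 30% draw ratio is an assumed illustration, not a figure from the thread:

```python
import math

def games_needed(elo_gain, draw_ratio=0.30, z=1.96):
    """Games required before the 95% error bar is smaller than elo_gain.

    Near a 50% score, 1 Elo corresponds to ln(10)/1600 ~ 0.00144 of
    score fraction (the slope of the logistic Elo curve at equality).
    """
    score_gain = elo_gain * math.log(10) / 1600.0   # Elo gain -> score fraction
    var = (1.0 - draw_ratio) * 0.25                 # per-game score variance
    # require z * sqrt(var / n) < score_gain  =>  n > var * (z / score_gain)^2
    return math.ceil(var * (z / score_gain) ** 2)

print(games_needed(10))   # a 10-Elo change: a few thousand games
print(games_needed(4))    # a 4-Elo change: ~20K games even with 30% draws
```

Halving the gain you want to resolve quadruples the games required, which is why a handful of hundred-game matches says essentially nothing about a small change.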
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: More Participants or More Number of Games?

Post by bob »

swami wrote:
wgarvin wrote:For the kind of testing bob does, a single spare Windows box would take months to finish a single test run.
Could be, but just adding one more computer would help. Windows is the common place for a lot of engines.

If 30 friends who are comparatively challenging to you want to play basketball outdoor. But you only want to play indoors, where you can find 4 players who are also challenging to you. Would you rather stick indoors and test your skills against these 4? I'd have thought playing outdoor would be better.

Comp chess Windows: Outdoor basketball with 30 players -each having much different playing styles. Some may act as your nemesis.

Comp chess Linux: Indoor Basketball with 4 players plus additional 3 juniors that play exactly like the aforementioned 4 seniors.
It's a little presumptuous to come in and say you think you know better than him how to test his engine. :lol:
Dr. Hyatt has been writing chess engines for decades and has one of the most rigorous testing setups of ANY chess engine author around.
I have precisely stated nowhere I know about testing more than Bob. I claim here right now that Bob knows more things than me in testing. Does that make you feel better? :wink:

I too have rights to argue these matters with Bob or anyone else for that matter.I'm not arguing for the sake of arguing or for the sake of trying to outwit others. I'm arguing for something I feel deserves mentioning.
Statistics is a funny thing -- when it comes to statistics, human intuition is not always reliable. For example, this persistent idea that 200 or 2000 or any similar number of games, is enough to accurately gauge the impact of a small change that might be worth 10 ELO at most. I think Dr. Hyatt has pretty conclusively demolished that idea here in previous threads.
Yes, the keyword here is "accurately" measuring the elo gain of changes. I agree. But the question I asked is under what _conditions_ do you find it better to measure the gain in elo via changes. Certainly in Outdoors basketball or experiment two... :)
You are simply wrong. Which is better: using 256 computers to play 256 games at a time, or using maybe 4 to play 2 games at a time? This is about time. If I have to wait 128 hours rather than one hour, I can test one change a week, basically. That's worse. Far worse. As it is, right now I can get a 32,000-game test, with an error of +/- 4 or 5, and make a decision.

A 2000-game test will have a huge error bar and is useless. A 32,000-game test on a few boxes is too slow to be useful...

That leaves one logical choice which is working very well so far based on past results...
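The speed argument is just arithmetic. Here is a hypothetical sketch with made-up per-game times; none of these numbers describe Bob's actual cluster, and it assumes every game takes the same time with no idle machines:

```python
import math

def wall_clock_hours(total_games, concurrent_games, minutes_per_game):
    """Hours to finish a fixed-size test at a given level of parallelism,
    assuming uniform game length and machines that are never idle."""
    batches = math.ceil(total_games / concurrent_games)  # sequential rounds
    return batches * minutes_per_game / 60.0

# Illustrative only: 32,000 three-minute games.
print(wall_clock_hours(32000, 256, 3))  # 256 concurrent games: 6.25 hours
print(wall_clock_hours(32000, 4, 3))    # 4 concurrent games: 400.0 hours
```

The absolute numbers are invented, but the ratio is the point: the cluster's only advantage is turnaround time, and that advantage is the full factor of parallelism.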
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: More Participants or More Number of Games?

Post by bob »

swami wrote:
bob wrote: Easier said than done, as far as replacing opponents. Windows is not the answer. "Windows?" is the question; "no" is the answer. As far as your speculation about 2000 games, you are simply _dead_ wrong. You are hung up in the wrong world. I'm not trying to find out which of the group is strongest. I am trying to test two versions of Crafty and decide which is best. 2000 games will _not_ allow that under any circumstances except when a change produces a revolutionary impact. Those are not likely, since Crafty is a mature engine. If you can't understand the problem with 2000 games, there's little I can do to explain further. But an error bar of +/- 12 is _not_ going to help you decide whether A' is better than A or not. And I believe _most_ understand that concept.
With Windows, You get more than 30 engines that are 30-50 elo apart from Crafty. With Linux you get only 4 or 5 engines, and you begin to resort to using the older versions of the same engines because of the lack of participants.

I wasn't implying that you had to find out which group of engines were stronger. I just said you could see Crafty's progress in a set of 15 engines first in a division gauntlet, call it Division C. If the version plays better to be able to finish consistently in Top 2, You just put that version to higher division(Division B) and do the _gauntlet_, only if you've windows. That's what the football club matches are like, no? Except in this case, one sided gauntlet version mainly to test versions of Crafty.

Ok, then you can play a minimum of 1k games against 30 engines, That's 30,000. Error bars will be lower.

My point is that I'd much prefer to see the diversity of playing styles of various engines pitted against Crafty to test Crafty's wit. 30 windows engines, all with different playing styles, that's my definition of diversity.

With linux, you get no diverse playing styles because the list of players is limited. Adding older version of the same engine(Glaurung, Toga...) is not different engine altogether.
I am not doing that except with Glaurung 1 and 2. And if you watch them play, they are essentially completely different programs in how they play, both positionally and tactically. And they are both _robust_. I tried Cyrano this afternoon, as someone suggested it would be a good match. At 1+1, Crafty won about 2 to 1 in the games that were decided. Several times Cyrano hung and had to be killed. That's simply no good for testing. I have others I play against as well, including GNU Chess and a couple of others. I like the current four because out of 32,000 very fast games, there are _no_ losses on time.
Experiment 1:
4 genuine engines: (Toga, Glaurung, Fruit, Arasan) + 2 or 3 old versions of these engines. Total ~ 7 participants. 30k games.
How about what I actually do? Roughly 8K games each against Toga, Glaurung 2, Fruit, and Glaurung 1. Not 7 participants here, just 4.

Experiment 2:
30 genuine engines: (Scorpio, Booot, Cyrano, Colossus, Delfi...etc ). No old versions of the same engine. Total ~ 30 participants. 30K games.

(Don't assume that I'm suggesting a round robin, I have always been suggesting Gauntlet)

Now doesn't anybody see that the 2nd experiment is clearly better?

"better"?? Perhaps a small bit. But not as much better as you would suspect since just to take the case of Cyrano, it can't seem to play a dozen games without hanging and losing on time. That doesn't exactly help me measure _my_ changes.

Do you think the stats will have any effect if you gather 30 k games from the result of 1st experiment or you gather 30k games from experiment 2? I think it does.
I can't answer for 30. I can answer for 9, which is the most I have used. And I found no difference in my measurement of A vs A'. That's what you are overlooking. I don't care how much better A' is than A, or how much better or worse it is against other programs. I only want to know whether I should keep the changes or not. And the less random noise the better... And there are quite a few programs that introduce a lot of random noise. For example, Arasan 9/10 could not cope with my fast time control and lost 1/3 of its games on time. Useless for helping me make a decision.

For one thing, after 30K games, ratings outputted from first experiment will be much different from the ratings outputted from the 2nd experiment. why? Because Crafty has played 30 different opponents with diverse _playing styles_. Who knows if Crafty doesn't correspond well with engines that are too aggressive. Who knows if it plays well against engines that are too sacrificial. Who knows if it doesn't play all that well with engines that have exceptional understanding of positional chess. That's what diversity is for.
Again, you are hung up on the "rating" (you mentioned the ratings would be different). I don't care about the "rating". I only care whether the new or the old version of Crafty did better in the test. Not how much better or worse, just "go" or "nogo". And more opponents doesn't help a bit there at the moment. It might, but we are being careful not to tune for the opponents; we are tuning simply to play better chess.
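That go/nogo decision can be sketched as comparing two gauntlet results against the same opponents and asking whether the score difference clears the combined error bars. This is a hypothetical illustration of the idea, not Bob's actual arbiter code, and the win/loss/draw triples below are invented:

```python
import math

def score_and_se(wins, losses, draws):
    """Mean per-game score and its standard error for one gauntlet run."""
    n = wins + losses + draws
    s = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - s) ** 2 + losses * s ** 2
           + draws * (0.5 - s) ** 2) / n       # empirical per-game variance
    return s, math.sqrt(var / n)

def decide(old_result, new_result, z=1.96):
    """'go' if the new version is measurably better than the old,
    'nogo' if measurably worse, 'inconclusive' otherwise."""
    s_old, se_old = score_and_se(*old_result)
    s_new, se_new = score_and_se(*new_result)
    diff = s_new - s_old
    bar = z * math.sqrt(se_old ** 2 + se_new ** 2)  # combined 95% bar
    if diff > bar:
        return "go"
    if diff < -bar:
        return "nogo"
    return "inconclusive"

# 32K games resolve a 2.5% score gain; 2K games with the same gain do not.
print(decide((8000, 8000, 16000), (8800, 7200, 16000)))  # go
print(decide((500, 500, 1000), (525, 475, 1000)))        # inconclusive
```

Note that the rating never appears: the rule only compares two runs of the same test, which is exactly the "go or nogo" framing above.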
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: More Participants or More Number of Games?

Post by MattieShoes »

This is a bit of a tangent, but do you think it's possible to OVER-test?

I was just thinking that most of the current values in Crafty are tested to work near-optimally with each other, so any tweak of the old settings makes it worse. When testing an addition, it could upset this careful balance and appear worse, even though it could potentially be better if all the other settings were retuned with respect to the new addition.

I wonder if a better test-bed for new ideas might contain settings that are more generic, less tuned to work perfectly together. On the other hand, I suppose it could lead one down a lot of dead-end roads. I don't know, just rambling... :-)