Results of Crafty 22.0, Sloppy 0.2.0 and Atlanchess 4.1

Tony Thomas

Post by Tony Thomas »

Michael Sherwin wrote:
ilari wrote:
Tony Thomas wrote:Sloppy did perform better, but not as well as I expected. The author had said that it would perform much better at my time controls, but I didn't observe anything that spectacular. Still a very good engine overall.
Thanks for testing the new Sloppy. I have to say I'm a bit surprised as well. Sure, I only tested your time control with a couple hundred games (mostly against Bugchess), but the results looked pretty conclusive.

To be sure, I just ran another test (1 min + 1 sec increment) against Sloppy 0.1.1, and this happened: Match Sloppy-0.2.0 vs. Sloppy-0.1.1: final score 46-20-34. I do all my testing in 64-bit Linux using Xboard as the GUI.

Could you give some info about your testing, mainly operating system, 32-bit or 64-bit, GUI (Arena, Winboard, etc.), GUI settings (pondering, show thinking) and Sloppy's configuration (hash size, opening book, egbbs)? Then I might be able to reproduce your results better.
I have found out the hard way that testing two versions of the same engine against each other can be very misleading. The newer 'better' version can even be worse overall. And just because the new version kills a particular engine worse than it has ever done before does not mean that it will play better against a weaker engine.

Lots of games against a lot of opponents is what is needed. If there is no time for that, then I suggest a gauntlet of 100 games against 50 engines. And run it twice. If the score differs much between the two gauntlets, then run it again.
It is also possible that, with such high error bars, my test results are inconclusive. From testing Romi, I know darn well about an engine performing really well against a top engine but rarely drawing a 30-game match against an engine rated 200-300 points lower.
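To put a rough number on those error bars, here is a minimal Python sketch. It assumes two equally strong engines, a draw ratio of 30% (a made-up but plausible figure), the standard logistic Elo model, and a normal approximation; the function name `elo_margin` is just for illustration.

```python
import math

def elo_margin(games, draw_ratio=0.3, z=1.96):
    """Approximate 95% Elo margin for a match between two equal engines.

    Assumptions (illustrative only): expected score of 50%, fixed draw ratio,
    normal approximation, and a linearised Elo curve around 50%.
    """
    # Per-game variance around 0.5: each decisive game contributes 0.25, each draw 0
    variance = 0.25 * (1.0 - draw_ratio)
    score_margin = z * math.sqrt(variance / games)
    # Slope of elo = 400*log10(s/(1-s)) at s = 0.5 is 1600 / ln(10)
    return score_margin * 1600.0 / math.log(10)

for n in (30, 100, 1000, 5000):
    print(f"{n:5d} games: +/- {elo_margin(n):.0f} Elo")
```

Under these assumptions a 100-game match carries a margin of well over 50 Elo, and the margin only halves each time the number of games is quadrupled.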
ilari
Posts: 750
Joined: Mon Mar 27, 2006 7:45 pm
Location: Finland

Post by ilari »

Michael Sherwin wrote:I have found out the hard way that testing two versions of the same engine against each other can be very misleading. The newer 'better' version can even be worse overall. And just because the new version kills a particular engine worse than it has ever done before does not mean that it will play better against a weaker engine.

Lots of games against a lot of opponents is what is needed. If there is no time for that, then I suggest a gauntlet of 100 games against 50 engines. And run it twice. If the score differs much between the two gauntlets, then run it again.
I'm aware of this. I have tested the new Sloppy with thousands of games against various opponents (mostly Crafty, SlowChess, Wildcat, Bugchess and Pseudo). So I have a pretty good idea of what Sloppy's rating should be on my time controls.

Here's one test I ran last night with Tony's time control: Match Sloppy-0.2.0 vs. Crafty-22.0 JA: final score 48-34-18. I couldn't compile the latest Crafty, so I ran Jim Ablett's 32-bit build under Wine. Not optimal for Crafty, and it's just 100 games, but it still makes me doubt whether Crafty really is 75 Elo points stronger than Sloppy.

I'll test the Windows build under Arena with Tony's settings to find out if something is different.
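As a rough check on what a 100-game result like 48-34-18 can say, the sketch below converts a score to an Elo difference with an approximate 95% interval. It assumes the score is listed as wins-losses-draws from the first engine's point of view, the logistic Elo model, and a normal approximation; the helper names are hypothetical and not part of any GUI or engine mentioned here.

```python
import math

def clamp(x, lo=1e-6, hi=1.0 - 1e-6):
    # Keep score fractions strictly inside (0, 1) so the logarithm is defined
    return max(lo, min(hi, x))

def elo_from_score(score):
    """Elo difference implied by a score fraction under the logistic Elo model."""
    s = clamp(score)
    return 400.0 * math.log10(s / (1.0 - s))

def elo_estimate(wins, losses, draws, z=1.96):
    """Return (elo, low, high) for a W-L-D result, using a normal approximation."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Variance of the per-game result (1, 0.5 or 0) around the mean score
    variance = (wins * (1.0 - score) ** 2
                + losses * score ** 2
                + draws * (0.5 - score) ** 2) / games
    margin = z * math.sqrt(variance / games)
    return (elo_from_score(score),
            elo_from_score(score - margin),
            elo_from_score(score + margin))

elo, low, high = elo_estimate(48, 34, 18)
print(f"Elo difference: {elo:+.0f} (95% interval roughly {low:+.0f} to {high:+.0f})")
```

For 48-34-18 the point estimate is a bit under +50 Elo, but the interval reaches below zero, so a single 100-game match of this kind cannot settle which engine is stronger.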
ilari

Post by ilari »

Okay, I've run my first test under Arena in Windows. Same settings that I used in the previous test against Crafty. This time, the score was dead even:
Sloppy-0.2.0 JA vs. Crafty-22.0 JA: final score 37-37-26

What's different from the previous test:
- The 32-bit Sloppy is a little slower, not much though, thanks to Jim's fast build
- Crafty may have been a tiny bit faster without the Wine emulation layer
- The engines were restarted after each game, even though I thought I had that option disabled

Time for a couple more tests in Windows, one under Winboard and one under Arena with Sloppy's thinking disabled...
Tony Thomas

Post by Tony Thomas »

Your tests indicate that Crafty and Sloppy are almost equal. I guess it is possible that Crafty likes my computer a lot. It is also possible that Sloppy's rating fluctuation is due to Sloppy playing in the First Division, unlike Crafty, which plays in Premier. Sloppy hasn't scored well enough for me to promote it to Premier, but for experiment's sake, I will let it play in Premier. If it indeed scores better than or close to Crafty, I will let it stay.
ilari

Post by ilari »

Tony Thomas wrote:Your tests indicate that Crafty and Sloppy are almost equal.
On your time controls, yes. Here's another test I ran, again the same settings as previously, but now under Winboard: Match Sloppy-0.2.0 JA vs. Crafty-22.0 JA: final score 38-41-21. Another close result, which also suggests that Arena was not to blame.
Tony Thomas

Post by Tony Thomas »

Can you send me the software that makes Sloppy fall in love with my computer? Uri did notice some weird search depth on my computer a while back, but I think I clicked on Internet Explorer when the engine was searching. All of the Sloppy games were played overnight when I wasn't using the computer. By the way, I will have more results for Sloppy in about 3 days. I will rename this install to Sloppy 0.2.0 exp JA, to differentiate it from the previously tested version.
ilari

Post by ilari »

Tony Thomas wrote:Can you send me the software that makes Sloppy fall in love with my computer?
Sloppy probably performs better at longer time controls, which also means that it performs better on faster hardware. So maybe that has an effect.

Here's one more test of Sloppy (64-bit) versus Crafty (32-bit), this time with only 16 megabytes for Sloppy's hash table: Match Sloppy-0.2.0 vs. Crafty-22.0 JA: final score 34-33-33.
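One way to summarise the four Sloppy vs. Crafty matches quoted in this thread is to pool them and compute a likelihood of superiority (LOS). The sketch below is only an illustration: it reads each score as wins-losses-draws for Sloppy, uses the common normal-approximation LOS formula, and pools matches that were actually run under different conditions (OS, GUI, hash size), which is a simplifying assumption.

```python
import math

# Sloppy vs. Crafty results quoted in the thread, read as (wins, losses, draws) for Sloppy
results = [(48, 34, 18), (37, 37, 26), (38, 41, 21), (34, 33, 33)]

def likelihood_of_superiority(wins, losses):
    """Normal-approximation LOS; draws do not favour either side, so they are ignored."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

wins = sum(r[0] for r in results)
losses = sum(r[1] for r in results)
draws = sum(r[2] for r in results)
score = (wins + 0.5 * draws) / (wins + losses + draws)
los = likelihood_of_superiority(wins, losses)

print(f"Pooled: +{wins} -{losses} ={draws}, score {score:.1%}, LOS {los:.1%}")
```

Pooled this way, the score sits only slightly above 50% and the LOS stays well short of the 95% or so usually wanted before calling one engine stronger, which fits the "almost equal" reading.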