Crafty vs Stockfish

Discussion of chess software programming and technical issues.


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Crafty vs Stockfish

Post by bob »

Since a "couple here" have some pretty uninformed views, here's a different rating metric.

6000 games, Crafty-23.4, stockfish 1.8.whatever is current, 1 cpu each on my cluster. No books. No pondering. No learning. No outside programs running. No other opponents... no nothing but pure head-to-head.

Code: Select all

Rank Name                  Elo    +    - games score oppo. draws
   1 Stockfish 1.8 64bit  2707    4    4  6000   78%  2493   22%
   2 Crafty-23.4-1        2493    4    4  6000   22%  2707   22%
+214, right in line with the big match results (I just extracted Crafty vs Stockfish). So, does any other rating list actually contain 6,000 games between Crafty and SF, with no opening-book interference or anything? The one nice thing about this particular data is that it is "mano a mano": there is no third program that could unduly influence things. For example, Rybka may win every game for all I know. Based on CCT events I would be greatly surprised, but it could happen with a strong opening book. So against Stockfish, Crafty would score 22%; against Rybka, hypothetically 0%. If Rybka's rating (which comes from games against all the other programs) is not very far above Stockfish's, then something has to give: Crafty cannot be 800 (or more) below Rybka and only 200 below Stockfish while Rybka and Stockfish are rated close to each other.

Are my 6,000 games more accurate? For no books, absolutely. For what would happen in a tournament? Nope; there, books, learning, and hardware come into play. But this data certainly suggests that at this time control, on this hardware, using reasonable starting positions with no book, the difference is just over 200 Elo, 22% to 78%.
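As a rough sanity check on those error bars (a back-of-the-envelope sketch, not BayesElo's actual model, which also accounts for draws and so likely reports a slightly tighter +/-4):

Code: Select all

    /* standard error of a 78% score over 6000 games, converted to
       Elo via the slope of the logistic rating curve at that score */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double s = 0.78, n = 6000.0;
        double se = sqrt(s * (1.0 - s) / n);                /* std. error of score */
        double slope = 400.0 / (log(10.0) * s * (1.0 - s)); /* d(Elo)/d(score)     */
        printf("score %.0f%% +/- %.2f%% -> Elo +/- %.1f\n",
               100.0 * s, 100.0 * se, se * slope);
        return 0;
    }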

If I wanted to fudge the data, why would I go to the trouble to put myself 214 below, rather than going to 50, or 20, or even equal?

As the saying goes, it simply is what it is in this test. If you don't like the no-book test, then ignore these results. But they are certainly helping me make steady progress. And that was the goal of starting this kind of massive testing methodology...
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Crafty vs Stockfish

Post by Milos »

I don't know how you don't understand it.
Your result is correct: head-to-head, Crafty 23.4 vs. SF 1.8 is 214 Elo, or 22%.
However, the real SF 1.8 rating minus the real Crafty 23.4 rating is ~280 Elo.

H2H comparison diminishes the actual rating difference.
Why? Because you practically tuned your Crafty engine against SF.
Not intentionally; it's just a consequence of the non-representative sample of opponents you play Crafty against while optimizing your engine!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

Milos wrote:I don't know how you don't understand it.
Your result is correct: head-to-head, Crafty 23.4 vs. SF 1.8 is 214 Elo, or 22%.
However, the real SF 1.8 rating minus the real Crafty 23.4 rating is ~280 Elo.
So there are _other_ programs that are giving me more trouble? And you do understand Elo, correct? In head-to-head, the Elo number gives me an accurate predictor of game outcomes between the two. The more programs you toss in, the more the ratings get spread around, because a single rating number for Crafty has to show how it does against all other opponents on average.

The number of games per opponent affects this.

So a simple statistical question comes to mind. If Crafty and Stockfish play a single game, what is Crafty's expected score? 22% (which is an Elo difference of 214), or something less than 22% (an Elo difference of 250 or more)? If you think the latter, justify why my number says 22%, and why, if I play 100 games, I get roughly 22 points.
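For the record, here is a minimal sketch of the standard logistic Elo formula behind those numbers (BayesElo fits a draw model on top of this, which is presumably why it reports 214 where the plain formula gives about 220 for a 78% score):

Code: Select all

    /* logistic Elo model: rating difference <-> expected score */
    #include <math.h>
    #include <stdio.h>

    /* expected score for a player rated diff Elo above the opponent */
    static double expected_score(double diff) {
        return 1.0 / (1.0 + pow(10.0, -diff / 400.0));
    }

    /* rating difference implied by an expected score s, 0 < s < 1 */
    static double elo_diff(double s) {
        return -400.0 * log10(1.0 / s - 1.0);
    }

    int main(void) {
        printf("expected score at -214 Elo: %.1f%%\n",
               100.0 * expected_score(-214.0));                     /* ~22.6% */
        printf("Elo diff at a 78%% score: %.0f\n", elo_diff(0.78)); /* ~220   */
        return 0;
    }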


H2H comparison diminishes the actual rating difference.
No it doesn't. It makes it more accurate. The purpose of the two ratings is to predict the outcome of a game between _those_ two players. If all you have are games between them, the predictor becomes _very_ accurate. The more players you mix in, the less accurate the predictor becomes. It _must_ be that way; there is no alternative, _if_ you understand the Elo statistical model.

And the model fails in famous cases, such as the one where A beats B 100% of the time, B beats C 100% of the time, and C beats A 100% of the time. All three will have the _same_ rating if you play an equal number of games between all opponents, and that is not worth a flip for predicting who beats whom. Whereas if you use just A vs B games to predict the outcome of A vs B, you get "truth" rather than "fiction."
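A toy simulation makes this concrete (illustrative only, not real data): feed the A-beats-B, B-beats-C, C-beats-A cycle into ordinary incremental Elo updates and the three ratings are driven together, no matter where they start:

Code: Select all

    /* an intransitive win cycle drives Elo ratings toward equality,
       even though no pairing is actually a 50% matchup */
    #include <math.h>
    #include <stdio.h>

    static double expected(double ra, double rb) {
        return 1.0 / (1.0 + pow(10.0, (rb - ra) / 400.0));
    }

    static void win(double *w, double *l, double k) {
        double e = expected(*w, *l);   /* winner's expected score */
        *w += k * (1.0 - e);
        *l -= k * (1.0 - e);
    }

    int main(void) {
        double a = 2600.0, b = 2500.0, c = 2400.0; /* arbitrary start */
        for (int i = 0; i < 10000; i++) {
            win(&a, &b, 10.0);   /* A beats B */
            win(&b, &c, 10.0);   /* B beats C */
            win(&c, &a, 10.0);   /* C beats A */
        }
        /* all three end up within a few points of one another */
        printf("A=%.0f B=%.0f C=%.0f\n", a, b, c);
        return 0;
    }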

So please make accurate statements. The above is bogus, UNLESS you agree that the non-H2H Elo is inaccurate and the gap is too wide, and that the H2H-only number narrows the gap precisely because it is more accurate.

In that case, I would agree, and accept that my point has been made.


Why? Because you practically tuned your Crafty engine against SF.
Not intentionally; it's just a consequence of the non-representative sample of opponents you play Crafty against while optimizing your engine!
Nice handwaving. But it doesn't matter. The _only_ use for comparing the ratings of two programs is to predict the outcome of a game or games between those two programs. What you think Elo is for is completely beyond me; apparently the absolute value has some hidden meaning to you. Yet in Elo's book you will discover that the absolute number means nothing, only the difference between the Elo numbers of two distinct opponents. Boy, do you need a course in basic statistics...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Crafty vs Stockfish

Post by mcostalba »

bob wrote:Since a "couple here" have some pretty uninformed views, here's a different rating metric.

6000 games, Crafty-23.4, stockfish 1.8.whatever is current
Did you compile SF with the profiling option?

The publicly tested version is JA compiled, and this carries weight...

You can get something similar, though not as fast, if you do:

Code: Select all

make profile-build ARCH=x86-64-modern
If your cluster is 32-bit, then do:

Code: Select all

make profile-build ARCH=x86-32
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Crafty vs Stockfish

Post by Don »

The Intel JA compile is actually worth quite a bit of Elo improvement on Komodo, and I would assume on just about any program.

This may explain the discrepancy.

There are 2 other considerations:

1. Crafty is not JA compiled either - so this is more or less a wash. Unless of course Bob is using the Intel compiler.

2. In experiments I have done in the past, head to head ratings come out about the same, within experimental error. If we can demonstrate that head to head is really that biased, then we have to adjust accordingly.

On the other hand, why are we comparing to Stockfish 1.8 instead of Rybka 4? The lists are showing over 300 ELO difference between these two programs.

What we have to do is see how far Crafty was from the top on our two reference dates. Why is this so complicated?

It is disturbing to me that this simple thing is not good enough for Bob and that he continues to produce his own numbers - in this case by running more private tests and rejecting existing data. Doesn't that make anyone wonder what is going on here?

mcostalba wrote:
bob wrote:Since a "couple here" have some pretty uninformed views, here's a different rating metric.

6000 games, Crafty-23.4, stockfish 1.8.whatever is current
Did you compile SF with the profiling option?

The publicly tested version is JA compiled, and this carries weight...

You can get something similar, though not as fast, if you do:

Code: Select all

make profile-build ARCH=x86-64-modern
If your cluster is 32-bit, then do:

Code: Select all

make profile-build ARCH=x86-32
Uri Blass
Posts: 10302
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Crafty vs Stockfish

Post by Uri Blass »

Milos wrote:I don't know how you don't understand it.
Your result is correct: head-to-head, Crafty 23.4 vs. SF 1.8 is 214 Elo, or 22%.
However, the real SF 1.8 rating minus the real Crafty 23.4 rating is ~280 Elo.

H2H comparison diminishes the actual rating difference.
Why? Because you practically tuned your Crafty engine against SF.
Not intentionally; it's just a consequence of the non-representative sample of opponents you play Crafty against while optimizing your engine!
Where do you find a rating for Crafty 23.4?
It is not publicly available.
Crafty 23.3 is, but you cannot learn about Crafty 23.4 from the rating of Crafty 23.3.

Note that Crafty 23.3 scored only 10% against Stockfish 1.8 on the CCRL 40/4 rating list, so your claim that Crafty is tuned against Stockfish seems to be nonsense.
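(For scale, under the standard logistic Elo model a 10% score corresponds to a gap of 400 * log10(0.9/0.1), about 382 Elo, far wider than the 214 measured head-to-head above.)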

I believe that it is almost impossible to tune program A against program B, and the CCRL FRC list strongly suggests this: I cannot find a single case where A beat B in a 100-game FRC match despite being more than 50 Elo weaker than B, and I believe there were programmers who tested against only one program.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

Uri Blass wrote:
Milos wrote:I don't know how you don't understand it.
Your result is correct: head-to-head, Crafty 23.4 vs. SF 1.8 is 214 Elo, or 22%.
However, the real SF 1.8 rating minus the real Crafty 23.4 rating is ~280 Elo.

H2H comparison diminishes the actual rating difference.
Why? Because you practically tuned your Crafty engine against SF.
Not intentionally; it's just a consequence of the non-representative sample of opponents you play Crafty against while optimizing your engine!
Where do you find a rating for Crafty 23.4?
It is not publicly available.
Crafty 23.3 is, but you cannot learn about Crafty 23.4 from the rating of Crafty 23.3.

Note that Crafty 23.3 scored only 10% against Stockfish 1.8 on the CCRL 40/4 rating list, so your claim that Crafty is tuned against Stockfish seems to be nonsense.

I believe that it is almost impossible to tune program A against program B, and the CCRL FRC list strongly suggests this: I cannot find a single case where A beat B in a 100-game FRC match despite being more than 50 Elo weaker than B, and I believe there were programmers who tested against only one program.
Uri: You can take my advice, or leave it, but I advise giving up on this particular poster. This is not about logic and accuracy, it is about trolling to create a disturbance...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

mcostalba wrote:
bob wrote:Since a "couple here" have some pretty uninformed views, here's a different rating metric.

6000 games, Crafty-23.4, stockfish 1.8.whatever is current
Did you compile SF with the profiling option?

The publicly tested version is JA compiled, and this carries weight...

You can get something similar, though not as fast, if you do:

Code: Select all

make profile-build ARCH=x86-64-modern
If your cluster is 32-bit, then do:

Code: Select all

make profile-build ARCH=x86-32
The cluster is 64-bit, and everything gets compiled with profiling, Crafty included. While I didn't check Stockfish, Crafty typically gains 10-15% from it. Our cluster compiler is not the most recent, since updating is not so easy when there are so many potential machines, but it is a good version...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

Don wrote:The Intel JA compile is actually worth quite a bit of Elo improvement on Komodo, and I would assume on just about any program.

This may explain the discrepancy.

There are 2 other considerations:

1. Crafty is not JA compiled either - so this is more or less a wash. Unless of course Bob is using the Intel compiler.
As I have said many times, that is _all_ I use, unless I run on AMD. For reasons unknown, gcc seems to be better for AMD processors _every_ time I compare them, particularly for SMP code. But on Intel boxes, and all of our stuff is currently Intel, icc is far better. And its PGO actually works for multi-threaded code as well; gcc's PGO crashes and burns there.


2. In experiments I have done in the past, head to head ratings come out about the same, within experimental error. If we can demonstrate that head to head is really that biased, then we have to adjust accordingly.
This depends on circumstances. For example, take A, B and C (let A be Stockfish). Suppose B and C are each -200 head-to-head against A, but B is +100 head-to-head against C (and this is not that uncommon). Elo can't give an accurate rating here, because when you choose any two programs and compare their Elos, the ratings should predict the result between them with reasonable accuracy. There is no set of ratings you can assign to A, B and C that will be accurate: B has to be +100 over C, yet B and C both have to be -200 to A. So you only get a rough approximation for B and C, and depending on how many games each pair plays, you could get A=2600, B=2450 and C=2350. B and C have the right interval between them, but neither A/B nor A/C does; each is off by 50.
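A small sketch shows why the error has to land somewhere (illustrative, with C pinned at 0 as the reference point): least-squares-fitting ratings to the three pairwise targets splits the inconsistency differently from the 2600/2450/2350 example, but no assignment can satisfy all three, since A-B = 200 and A-C = 200 together force B-C = 0:

Code: Select all

    /* fit ratings to inconsistent pairwise targets:
       A-B = 200, A-C = 200, B-C = 100 (no exact solution exists) */
    #include <stdio.h>

    int main(void) {
        double a = 0.0, b = 0.0;            /* c is fixed at 0 */
        for (int i = 0; i < 100000; i++) {  /* simple gradient descent */
            double e_ab = (a - b) - 200.0;  /* error on the A-B gap */
            double e_ac = a - 200.0;        /* error on the A-C gap */
            double e_bc = b - 100.0;        /* error on the B-C gap */
            a -= 0.001 * 2.0 * (e_ab + e_ac);
            b -= 0.001 * 2.0 * (e_bc - e_ab);
        }
        printf("fitted gaps: A-B=%.0f A-C=%.0f B-C=%.0f\n", a - b, a, b);
        return 0;
    }

Here every target is missed by about 33 points; weight one pair with more games and the fit shifts its error onto the other pairs, which is exactly the game-count effect described above.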

Nothing can be done for that case except to use the head-to-head rating if you have it. USCF/FIDE doesn't use that kind of data that I am aware of.


On the other hand, why are we comparing to Stockfish 1.8 instead of Rybka 4? The lists are showing over 300 ELO difference between these two programs.
300 points between 1.8 and Rybka? Where? I am certain that is wrong.


What we have to do is see how far Crafty was from the top on our two reference dates. Why is this so complicated?
It isn't, but I can't run head-to-head matches with Rybka 4. In the lists I have seen, stockfish 1.8 is pretty close to R3/4.



It is disturbing to me that this simple thing is not good enough for Bob and that he continues to produce his own numbers - in this case by running more private tests and rejecting existing data. Doesn't that make anyone wonder what is going on here?
Only makes one wonder what your "hidden agenda" is. I am not rejecting anything. I am trying to produce results with as few outside influences as possible, given the facilities that I have. A large rating list doesn't say a thing about how far apart two specific programs are, because the ratings are an average sampling over games between various pairs of programs in the list. I've already explained this. To precisely measure A vs B, you play A vs B and run it through BayesElo or your favorite Elo calculator. The more programs you use, the less accurate a pairwise rating becomes. It gives you a better idea of how a program stacks up against the entire population as a whole, but a single Elo number can't predict the outcome between any two opponents with high accuracy unless only results from those two opponents playing each other are used...

I simply set out to roughly quantify hardware vs software. You, on the other hand, have continually shifted the discussion, made outright false statements ("Crafty data shows software more important...") and then brought in Deep Blue for no valid reason at all. So what's your game here, instead of questioning what mine is?

mcostalba wrote:
bob wrote:Since a "couple here" have some pretty uninformed views, here's a different rating metric.

6000 games, Crafty-23.4, stockfish 1.8.whatever is current
Did you compile SF with the profiling option?

The publicly tested version is JA compiled, and this carries weight...

You can get something similar, though not as fast, if you do:

Code: Select all

make profile-build ARCH=x86-64-modern
If your cluster is 32-bit, then do:

Code: Select all

make profile-build ARCH=x86-32
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Crafty vs Stockfish

Post by BubbaTough »

If the goal is to compare the two open-source programs, it seems MORE fair to compile both the same way, as Bob is doing. The fact that there are Stockfish builds that use an improved compiling approach may help explain some of the strength gap on the public lists, but that is less a criticism of Bob's methodology than a slight weakness of using the public lists to compare open-source programs (in my opinion).

-Sam