Crafty vs Stockfish

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Crafty vs Stockfish

Post by Don »

bob wrote:
Don wrote:On the head to head thing I did a quick study based on some existing data I had. My disclaimer is that I am only going to report the numbers without drawing any conclusions. So draw your own conclusions about the validity of this test.

In this particular set of programs, run at very fast fischer time controls I have this data:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 sf18-12_2                 2992    5    4 29860   65%  2894   24% 
   2 Robbolito-handicapped-6s  2986    5    6 17124   63%  2891   25% 
   3 leiden                    2980    5    5 29859   63%  2897   29% 
   4 Komodo_1.0                2948    5    5 17124   57%  2891   28% 
   5 k-3015.26-3hard           2866    5    6 17124   45%  2891   25% 
   6 k-3015.48-ref             2860   10    9  3957   44%  2891   24% 
   7 spike-24s                 2700    5    5 29860   15%  2950   15% 

These programs are all run at different time controls so that there is no ridiculous disparity between them. The k- programs are weak versions of an experimental program I'm working on; leiden, Komodo_1.0, and the k- programs are all heavily related.

Robbo is running twice as fast as Stockfish in order to be approximately equal, and Stockfish is running faster than Komodo so that Komodo is not too far behind. Spike is given more time than any other program.

In this test Stockfish is 6 Elo stronger.

I would like to note that in this test the Komodo-based programs never play each other, but the "foreign" programs play everyone.

I removed all games except those between Robbo and Stockfish and got this result:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 sf18-12_2                    1    5    5  5708   50%    -1   33% 
   2 Robbolito-handicapped-6s    -1    5    5  5708   50%     1   33% 
In head-to-head play Stockfish is 2 Elo stronger. The error margins are too large to make any firm conclusions, but low enough to suggest that the effect (in this test) is minor, if any.
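A rough way to put a number on that margin is a plain normal approximation on the match score. The C sketch below plugs in the 5708 games, 50% score and 33% draw rate from the table above; it is a back-of-the-envelope estimate only, and BayesElo's reported +/- figures come from its own likelihood model and confidence convention, so they will not match it exactly.

Code: Select all

/* Rough 95% error bar, in Elo, for a head-to-head match.
   Normal-approximation sketch only; BayesElo's model differs in detail. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    int    n = 5708;    /* games between the two programs (from the table) */
    double s = 0.50;    /* score of the first program */
    double d = 0.33;    /* draw rate */

    double w   = s - d / 2.0;            /* win rate */
    double var = w + d / 4.0 - s * s;    /* per-game variance of one result */
    double se  = sqrt(var / n);          /* standard error of the mean score */

    /* Slope of the logistic Elo curve at score s converts score error to Elo. */
    double slope = 400.0 / (log(10.0) * s * (1.0 - s));
    printf("score %.1f%% +/- %.1f Elo (95%%)\n", 100.0 * s, 1.96 * se * slope);
    return 0;
}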

You can draw your own conclusions. Perhaps if I used more of a variety of programs we would see a more noticeable trend?
Depends on what you want.

Do you want the Elo to _accurately_ predict the outcome of games between two specific players? If so, use only head-to-head games, and the Elo will be _extremely_ accurate for those two programs. Notice, for the record, the absolute value of the Elo is meaningless anyway, the only thing that matters is the Elo gap between the two players.

Do you want a rough idea of how everybody stacks up to everybody else? Knowing that the individual Elo numbers become less meaningful for anything but this rough ordering? If so, munge all the pgn together and run it thru bayeselo. And you get a pretty good ranking from top to bottom, but you really can not expect to take any two Elo numbers, compare them and use that to predict head-to-head results.

So two different objectives. One way to reach either. But statistically, the statement "program X is N elo better than program Y" has a very specific meaning, because N is supposed to specify a very accurate winning/losing ratio for those two programs.
The reason this came up is that you were accused of using head-to-head results to draw conclusions about the incorrectness of the rating lists. I don't know if that is what you were really trying to do, but it's pretty odd that you ran this test when there is already plenty of data from numerous testing agencies.

If your objective really is to prove they are wrong, you need to run a more appropriate test, if you believe what you just wrote. Otherwise, this is interesting, but it doesn't relate to any of the recent discussions.

When you mix in other programs, you lose that specificity, to gain an overall view of who is best and who is worse, but the "how much better or worse" is significantly less accurate as a result.

A clear example of "you can't have your cake and eat it too..."
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

rbarreira wrote:
bob wrote: With the compiler versions I use, you don't get that. You can tell it _explicitly_ which architecture to produce code for so that you don't get that overhead. And I have looked at quite a bit of assembly output (xxx.S files) over the years and have not seen any architectural testing.
You are talking about the -x option, right? That's even worse, as it will produce a binary that will only run on Intel CPUs.

I just did this. Created a simple "Hello World" program and compiled it with:

icc -xSSE3 icc.c -o icc

On an Intel CPU it works fine, on an AMD Phenom II X6 (i.e. AMD's newest CPU), this is what happens:

Code: Select all

ricardo@ricardo-desktop:~$ ./icc

Fatal Error: This program was not built to run on the processor in your system.
The allowed processors are: Intel(R) Pentium(R) 4 and compatible Intel processors with Streaming SIMD Extensions 3 (SSE3) instruction support.
edit: even -xSSE2 fails on AMD CPUs... it prints a similar message to the above, saying that it only works on a Pentium 4 or above.
What version are you using? My current version is 11.0. I do not know what was current when I had the opteron issue, but it was at least 4 years ago, as I have been running on a dual-quad-core intel box for the last 3 CCTs...

I haven't tried moving from Intel to AMD in 3 years since the Core-2 has been faster, significantly, for my testing.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Crafty vs Stockfish

Post by rbarreira »

bob wrote:
rbarreira wrote:
bob wrote: With the compiler versions I use, you don't get that. You can tell it _explicitly_ which architecture to produce code for so that you don't get that overhead. And I have looked at quite a bit of assembly output (xxx.S files) over the years and have not seen any architectural testing.
You are talking about the -x option, right? That's even worse, as it will produce a binary that will only run on Intel CPUs.

I just did this. Created a simple "Hello World" program and compiled it with:

icc -xSSE3 icc.c -o icc

On an Intel CPU it works fine, on an AMD Phenom II X6 (i.e. AMD's newest CPU), this is what happens:

Code: Select all

ricardo@ricardo-desktop:~$ ./icc

Fatal Error: This program was not built to run on the processor in your system.
The allowed processors are: Intel(R) Pentium(R) 4 and compatible Intel processors with Streaming SIMD Extensions 3 (SSE3) instruction support.
edit: even -xSSE2 fails on AMD CPUs... it prints a similar message to the above, saying that it only works on a Pentium 4 or above.
What version are you using? My current version is 11.0. I do not know what was current when I had the opteron issue, but it was at least 4 years ago, as I have been running on a dual-quad-core intel box for the last 3 CCTs...

I haven't tried moving from Intel to AMD in 3 years since the Core-2 has been faster, significantly, for my testing.
It's version 11.1. But this stuff has been documented for many years, for example:

http://www.agner.org/optimize/blog/read.php?i=49
http://web.archive.org/web/200706261125 ... intel.html
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Crafty vs Stockfish

Post by mcostalba »

bob wrote:
mcostalba wrote: BTW what make command have you used to compile SF for this test ?
"make". :)
It is difficult to get it right this way; there are some variables to pass, like POPCNT and PREFETCH and similar, that are difficult to handle properly with plain cut & paste... anyhow, if it is an issue to use the original makefile I won't argue anymore...
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Crafty vs Stockfish

Post by wgarvin »

I'm not sure what the status of it is now, but some past versions of Intel's compiler have blatantly discriminated against non-Intel chips by detecting the vendor string from CPUID and dispatching to non-SSE code if it wasn't 'GenuineIntel'. IIRC they were sued over it and may have agreed not to do that anymore as part of a settlement.

http://www.agner.org/optimize/blog/read.php?i=49

http://www.osnews.com/story/22683/Intel ... _Compiler_

It used to be popular with some users of Intel's compiler to patch out the CPU check in the compiled binaries so they would run the fast path on AMD chips too.
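For anyone curious what such a check involves, here is a minimal, purely illustrative sketch (assuming a gcc-style <cpuid.h> on x86; this is not Intel's actual dispatcher code). It reads the CPUID vendor string and the vendor-neutral SSE3 feature bit; the latter is the whole point of the complaint, since the capability can be tested without ever looking at the vendor.

Code: Select all

/* Illustrative only: how a runtime dispatcher could identify the CPU.
   Assumes an x86 compiler that provides <cpuid.h> (gcc does). */
#include <stdio.h>
#include <string.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;
    char vendor[13];

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 1;                          /* CPUID leaf 0 not available */

    /* Leaf 0: the 12-byte vendor ID comes back in EBX, EDX, ECX, in that order. */
    memcpy(vendor + 0, &ebx, 4);
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    printf("vendor: %s\n", vendor);        /* "GenuineIntel", "AuthenticAMD", ... */

    /* Leaf 1: feature flags.  ECX bit 0 is SSE3, regardless of vendor. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        printf("SSE3 supported: %s\n", (ecx & 1) ? "yes" : "no");

    /* A dispatcher keyed on the feature bit would run SSE3 code on any CPU
       that reports it; one keyed on strcmp(vendor, "GenuineIntel") would not. */
    return 0;
}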
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

mcostalba wrote:
bob wrote:
mcostalba wrote: BTW what make command have you used to compile SF for this test ?
"make". :)
It is difficult to get it right this way; there are some variables to pass, like POPCNT and PREFETCH and similar, that are difficult to handle properly with plain cut & paste... anyhow, if it is an issue to use the original makefile I won't argue anymore...
I'm not using your makefile, remember. I used mine, taking the recommended options. As far as popcnt goes, these are not i7s; the best CPUs here are Core 2-type Xeons, without popcnt...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

Don wrote:
bob wrote:
Don wrote:On the head to head thing I did a quick study based on some existing data I had. My disclaimer is that I am only going to report the numbers without drawing any conclusions. So draw your own conclusions about the validity of this test.

In this particular set of programs, run at very fast fischer time controls I have this data:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 sf18-12_2                 2992    5    4 29860   65%  2894   24% 
   2 Robbolito-handicapped-6s  2986    5    6 17124   63%  2891   25% 
   3 leiden                    2980    5    5 29859   63%  2897   29% 
   4 Komodo_1.0                2948    5    5 17124   57%  2891   28% 
   5 k-3015.26-3hard           2866    5    6 17124   45%  2891   25% 
   6 k-3015.48-ref             2860   10    9  3957   44%  2891   24% 
   7 spike-24s                 2700    5    5 29860   15%  2950   15% 

These programs are all run at different time controls so that there is no ridiculous disparity between them. The k- programs are weak versions of an experimental program I'm working on; leiden, Komodo_1.0, and the k- programs are all heavily related.

Robbo is running twice as fast as Stockfish in order to be approximately equal, and Stockfish is running faster than Komodo so that Komodo is not too far behind. Spike is given more time than any other program.

In this test Stockfish is 6 Elo stronger.

I would like to note that in this test the Komodo-based programs never play each other, but the "foreign" programs play everyone.

I removed all games except those between Robbo and Stockfish and got this result:

Code: Select all

Rank Name                       Elo    +    - games score oppo. draws 
   1 sf18-12_2                    1    5    5  5708   50%    -1   33% 
   2 Robbolito-handicapped-6s    -1    5    5  5708   50%     1   33% 
In head-to-head play Stockfish is 2 Elo stronger. The error margins are too large to make any firm conclusions, but low enough to suggest that the effect (in this test) is minor, if any.

You can draw your own conclusions. Perhaps if I used more of a variety of programs we would see a more noticeable trend?
Depends on what you want.

Do you want the Elo to _accurately_ predict the outcome of games between two specific players? If so, use only head-to-head games, and the Elo will be _extremely_ accurate for those two programs. Notice, for the record, the absolute value of the Elo is meaningless anyway, the only thing that matters is the Elo gap between the two players.

Do you want a rough idea of how everybody stacks up to everybody else? Knowing that the individual Elo numbers become less meaningful for anything but this rough ordering? If so, munge all the pgn together and run it thru bayeselo. And you get a pretty good ranking from top to bottom, but you really can not expect to take any two Elo numbers, compare them and use that to predict head-to-head results.

So two different objectives. One way to reach either. But statistically, the statement "program X is N elo better than program Y" has a very specific meaning, because N is supposed to specify a very accurate winning/losing ratio for those two programs.
The reason this came up is that you were accused of using head-to-head results to draw conclusions about the incorrectness of the rating lists. I don't know if that is what you were really trying to do, but it's pretty odd that you ran this test when there is already plenty of data from numerous testing agencies.

Don, I understand the Elo model. I understand that a huge group of opponents produces a different Elo for A and B than just playing A vs B. The statement made was "your tests are showing SF as only better than Crafty by about +200, while other (much bigger) lists have it significantly better than that."

I produced accurate numbers, 6000 games of SF vs Crafty, no duplicated games, no odd opening book lines, etc. Which is far more accurate for determining an Elo that will predict Crafty vs SF game outcomes. So it isn't "odd" at all. It comes from an understanding of what the term "Elo rating" means and then trying to get the most accurate "difference" between Crafty's and SF's Elo.

If your objective really is to prove they are wrong, you need to run a more appropriate test, if you believe what you just wrote. Otherwise, this is interesting, but it doesn't relate to any of the recent discussions.

It _directly_ answers the question how much stronger is SF than Crafty? Answer, strong enough that Crafty wins 22% of the games, SF wins 78%. And the _only_ Elo difference that will produce that ratio is 216 or whatever it comes out in my data. Not +300, not +250. So my number is extremely accurate to answer _that_ question...
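For reference, the arithmetic behind that is just the standard logistic Elo relation, where an Elo gap D gives an expected score s = 1/(1 + 10^(-D/400)). The C sketch below is the plain formula only, not BayesElo's draw-aware model, which is why it gives about 220 where BayesElo reports 216.

Code: Select all

/* Minimal sketch of the plain logistic Elo relation (not BayesElo's model):
   expected score s for gap D is 1/(1+10^(-D/400)), so the gap implied by an
   observed score is D = 400*log10(s/(1-s)). */
#include <stdio.h>
#include <math.h>

static double elo_gap_from_score(double s)   /* s = fraction of points won */
{
    return 400.0 * log10(s / (1.0 - s));
}

static double score_from_elo_gap(double d)
{
    return 1.0 / (1.0 + pow(10.0, -d / 400.0));
}

int main(void)
{
    printf("78%% score -> gap of about %.0f Elo\n", elo_gap_from_score(0.78));
    printf("+300 Elo  -> expected score of about %.0f%%\n",
           100.0 * score_from_elo_gap(300.0));
    return 0;
}

It also shows that a +300 gap would imply an expected score of roughly 85%, well above the 78% observed in the match.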

When you mix in other programs, you lose that specificity, to gain an overall view of who is best and who is worse, but the "how much better or worse" is significantly less accurate as a result.

A clear example of "you can't have your cake and eat it too..."
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Crafty vs Stockfish

Post by Don »

bob wrote: Don, I understand the Elo model. I understand that a huge group of opponents produces a different Elo for A and B than just playing A vs B. The statement made was "your tests are showing SF as only better than Crafty by about +200, while other (much bigger) lists have it significantly better than that."
This is not a sample-size issue at all. EVERY list out there disagrees with your number; your test is producing a different result from all the lists.

I want to know why your numbers are so different. Do you have a theory?

I produced accurate numbers, 6000 games of SF vs Crafty, no duplicated games, no odd opening book lines, etc. Which is far more accurate for determining an Elo that will predict Crafty vs SF game outcomes.
That's all well and good but why are we trying to predict Crafty vs SF game outcomes?

I would rather see a test that is more relevant, one that would predict what ratings you would get against a variety of opponents and compare that to the rating that Stockfish would get.

But a separate issue is why are we not comparing Crafty to the much stronger Rybka 4? If you cannot test that, you could use one of the clones that is much stronger than SF.

So it isn't "odd" at all. It comes from an understanding of what the term "Elo rating" means and then trying to get the most accurate "difference" between Crafty's and SF's Elo.
I just don't see what we are going to do with that difference. What we want to know is how far away Crafty is from the top.

In tennis, people are interested in who the number 1 player is. The number 1 player may not do very well against some specific opponent, but wins more against a variety of players.

So I don't care how Crafty does against Stockfish in head to head. The rating lists already give us the numbers. The error margins of a head to head test are not relevant in any way because that is a whole different test.

If your objective really is to prove they are wrong, you need to run a more appropriate test, if you believe what you just wrote. Otherwise, this is interesting, but it doesn't relate to any of the recent discussions.

It _directly_ answers the question how much stronger is SF than Crafty?
It answers the question of how they would do in a head-to-head match.

You have said yourself that you should test against a variety of opponents to get an accurate rating.
Answer, strong enough that Crafty wins 22% of the games, SF wins 78%. And the _only_ Elo difference that will produce that ratio is 216 or whatever it comes out in my data. Not +300, not +250. So my number is extremely accurate to answer _that_ question...
But we don't need the answer to that question. I'm happy for you that you know the answer, but how is it relevant to our discussion? In other words, where are you going to take this? This will produce a premise that cannot be applied to the "debate" that we are having.

When you mix in other programs, you lose that specificity, to gain an overall view of who is best and who is worse, but the "how much better or worse" is significantly less accurate as a result.
The overall view is the one we want if you are trying to answer the question what is the relative strength difference between players. I don't know how head to head factors in to any of this.


A clear example of "you can't have your cake and eat it too..."
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

Don wrote:
bob wrote: Don, I understand the Elo model. I understand that a huge group of opponents produces a different Elo for A and B than just playing A vs B. The statement made was "your tests are showing SF as only better than Crafty by about +200, while other (much bigger) lists have it significantly better than that."
This is not a sample-size issue at all. EVERY list out there disagrees with your number; your test is producing a different result from all the lists.

I want to know why your numbers are so different. Do you have a theory?

Don:

Are you reading _any_ of the things I write? Here is a simple scenario that happened to me.

Three programs, A, B and C. I played A vs B and found A to be +200 (it was winning about 75% of the games). I played A vs C and found the same. And then I played B vs C and found B was about +100 better than C.

So you run all of that thru (say) BayesElo and what do you get? For Elo to work, the rating difference between A and B _must_ be 200 in order to correctly predict that 75% score. The difference between A and C must be the same, for the same reason. Now your problem is to figure out how to make all of this true:

A - B = 200
A - C = 200
B - C = 100.

There is _no_ solution. A simple try would be A=2600, B=2450, C=2350. That predicts (correctly) that A is stronger than B or C, and that B is stronger than C. But if you use 2600 vs 2450, you don't get that 75% prediction; it will be lower, because the difference is not correct.
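To make that concrete, here is a toy least-squares fit of a single rating per program to those three target gaps, with C anchored at 0. This is illustrative only; BayesElo fits a likelihood over the game results rather than the gaps directly, but the effect is the same: no assignment satisfies all three, so every fitted difference is a compromise.

Code: Select all

/* Toy least-squares fit of ratings to the three pairwise gaps
   A-B=200, A-C=200, B-C=100, with C anchored at 0.  Illustrative only. */
#include <stdio.h>

int main(void)
{
    double rA = 0.0, rB = 0.0, rC = 0.0;   /* rC stays anchored at 0 */
    const double lr = 0.05;                /* gradient-descent step size */

    for (int i = 0; i < 10000; i++) {
        double eAB = (rA - rB) - 200.0;    /* residuals of the three gaps */
        double eAC = (rA - rC) - 200.0;
        double eBC = (rB - rC) - 100.0;
        rA -= lr * (eAB + eAC);            /* proportional to dE/drA */
        rB -= lr * (-eAB + eBC);           /* proportional to dE/drB */
    }
    printf("A=%.0f  B=%.0f  C=%.0f\n", rA, rB, rC);
    printf("A-B=%.0f  A-C=%.0f  B-C=%.0f  (targets 200/200/100)\n",
           rA - rB, rA - rC, rB - rC);
    return 0;
}

It settles on roughly A=233, B=67, C=0, i.e. gaps of about 167, 233 and 67 against the targets of 200, 200 and 100: the ordering survives, but no individual difference is right.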

So a large list does a really good job at showing who is strongest, who is weakest, and how the others should be spread out between those two. But if you take two individual Elo numbers and use that to compute a match result, and then play a long match to get a really precise Elo difference just for those two programs, the numbers may or may not match even closely. In my case above, not a single difference is correct, even though the sorted order is correct.

The numbers I posted last were simply 6,000 games, head-to-head. The Elo difference is precise. The number of games between SF and Crafty in a list is necessarily low, because they are trying to play everyone against everyone, so it is a small head-to-head sample. Now you have to somehow sort programs into order and maintain the individual Elo differences. It isn't possible, particularly in the above simple example. The difference could come from that. Or elsewhere: I don't use any opening books at all. It could be that someone has a better book than I have. Or it could be that someone uses a book to force Crafty to play positions it does not like. Who knows? It would take a _ton_ of time to figure that out, and even my hardware is not enough, particularly due to all the non-Linux programs. So I'm not going to lose sleep over it, because this is _not_ unexpected. It is perfectly normal for anyone that understands what Elo really means.

I had email exchanges with Remi several years ago about this and other issues, and he guided me along in how to set this stuff up, how one introduces unexpected error (more opponents as discussed above) and such. And after some thought, I developed an understanding of what elo is all about.

If Crafty loses 78% of the games vs program A, it is 216 Elo worse. Not 240. Not 180. This is an exact number. If other lists have a different losing percentage than the 78% in my test, then that is another variable. Why? Different time controls can change things. Ponder on or off? Books? Learning? EGTBs (I am convinced they hurt me more than they help, after a lot of testing).


I produced accurate numbers, 6000 games of SF vs Crafty, no duplicated games, no odd opening book lines, etc. Which is far more accurate for determining an Elo that will predict Crafty vs SF game outcomes.
That's all well and good but why are we trying to predict Crafty vs SF game outcomes?
Because someone mentioned +300. I said that based on _my_ testing that was wrong. I posted results to support that. And here we are...

I would rather see a test that is more relevant, one that would predict what ratings you would get against a variety of opponents and compare that to the rating that Stockfish would get.
When you compare two programs by looking at their Elo ratings, what are you trying to determine? 99% of the people on planet earth are trying to answer the question "if these two play, who comes out on top and by how much." Ratings against a pool of players don't show that. Rating against a single opponent shows it extremely accurately.

I don't test against just one opponent, because I want to make sure that something that works against A doesn't hurt against B. And all I care about in my testing is answering the question "is version A' better or worse than A against this set of opponents?" I don't play A vs A', because I am not interested in that result nearly as much as the results against other programs...

What you do depends on what you want to know. Over the past 3-4 years, I have very carefully defined what I am trying to measure, and post those results. Occasionally, when it takes no effort, I will produce the head-to-head values I gave previously as it takes no work, just grab that part of the PGN and stuff it into BayesElo...




But a separate issue is why are we not comparing Crafty to the much stronger Rybka 4? If you cannot test that, you could use one of the clones that is much stronger than SF.

I have answered this many times. Point me to one that doesn't crash/hang regularly. I have not found one yet, although I have not looked that hard after getting burned so many times. And I do not believe, based on numbers I have seen, that R4 is "much stronger" than SF 1.8. If you have contradictory data, please point me to it. But testing vs R4 is out, without source...


So it isn't "odd" at all. It comes from an understanding of what the term "Elo rating" means and then trying to get the most accurate "difference" between Crafty's and SF's Elo.
I just don't see what we are going to do with that difference. What we want to know is how far away Crafty is from the top.
Different question, with a different answer. If you want to know exactly how far Crafty is from the top program, play 'em head to head for an exact (within statistical reason) answer. That was not the question being discussed here. I didn't start the topic. I did start a new thread because I do not see why we are discussing this in a Deep Blue thread, an SMP thread, etc...


In tennis, people are interested in who the number 1 player is. The number 1 player may not do very well against some specific opponent, but wins more against a variety of players.

So I don't care how Crafty does against Stockfish in head to head. The rating lists already give us the numbers. The error margins of a head to head test are not relevant in any way because that is a whole different test.
It is relevant if that is the issue that was raised, explicitly... I didn't bring it up. Someone said +300, I said good numbers show +200, and there we went...


If your objective really is to prove they are wrong, you need to run a more appropriate test, if you believe what you just wrote. Otherwise, this is interesting, but it doesn't relate to any of the recent discussions.

It _directly_ answers the question how much stronger is SF than Crafty?
It answers the question of how they would do in a head-to-head match.
Which is _exactly_ what "how much stronger is SF than Crafty?" means in any version of the English language I understand. In a tournament, the top two will eventually meet. Would you want to just stop before you finish the thing and compute "performance ratings" and let the biggest number win? Not desirable.

You have said yourself that you should test against a variety of opponents to get an accurate rating.
You are just blowing the meaning of "rating" all to hell. Elo's book covers that concept. Absolute number means nothing. Just delta between two opponents. The more games between those two opponents, and the fewer games with other opponents, the more accurate that "difference" becomes. Using an overall rating, which is a statistical average, is perfect for ordering everyone, knowing that you are going to get a few wrong because of the peculiarities of the game and the players. But it works well. But if you _really_ want to compare A to B and get a "Elo difference" then head-to-head it is. If you want to take a group and rank 'em from low to high, the rating lists do this well.

But we can't even get different rating systems to agree (USCF, BCF, ECF, CCF, FIDE, and who knows what else) so talking about "accuracy" is a bit misleading. "accuracy in ordering" is something else entirely, and the lists do that pretty nicely. But not accuracy in individual Elo from the intent of the original idea Elo implemented.

Answer, strong enough that Crafty wins 22% of the games, SF wins 78%. And the _only_ Elo difference that will produce that ratio is 216 or whatever it comes out in my data. Not +300, not +250. So my number is extremely accurate to answer _that_ question...
But we don't need the answer to that question. I'm happy for you that you know the answer, but how is it relevant to our discussion? In other words, where are you going to take this? This will produce a premise that cannot be applied to the "debate" that we are having.
The question asked, was how much stronger is SF than Crafty. My number is _the_ definitive answer for the test conditions I use, without the outside noise of other opponents, books, different hardware, different settings, etc. That's all I can say. If you want to take +300, fine by me. But if you play em in a match, SF is not going to win 90% of the games under my conditions. It is going to win 78%.


When you mix in other programs, you lose that specificity, to gain an overall view of who is best and who is worse, but the "how much better or worse" is significantly less accurate as a result.
The overall view is the one we want if you are trying to answer the question what is the relative strength difference between players. I don't know how head to head factors in to any of this.
Unfortunately you are not answering that "what is the strength difference" as that can _only_ be done head-to-head. Otherwise you can answer "what is the "average" ordering of these opponents from strongest to weakest?" But the elo difference between any two will not work as an accurate predictor of results in most cases. It might be close most of the time. But not accurate.


A clear example of "you can't have your cake and eat it too..."
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Crafty vs Stockfish

Post by Don »

bob wrote:
Don wrote:
bob wrote: Don, I understand the Elo model. I understand that a huge group of opponents produces a different Elo for A and B than just playing A vs B. The statement made was "your tests are showing SF as only better than Crafty by about +200, while other (much bigger) lists have it significantly better than that."
This is not a sample-size issue at all. EVERY list out there disagrees with your number; your test is producing a different result from all the lists.

I want to know why your numbers are so different. Do you have a theory?

Don:

Are you reading _any_ of the things I write? Here is a simple scenario that happened to me.

Three programs, A, B and C. I played A vs B and found A to be +200 (it was winning about 75% of the games). I played A vs C and found the same. And then I played B vs C and found B was about +100 better than C.

So you run all of that thru (say) BayesElo and what do you get? For Elo to work, the rating difference between A and B _must_ be 200 in order to correctly predict that 75% score. The difference between A and C must be the same, for the same reason. Now your problem is to figure out how to make all of this true:

A - B = 200
A - C = 200
B - C = 100.

There is _no_ solution. A simple try would be A=2600, B=2450, C=2350. That predicts (correctly) that A is stronger than B or C, and that B is stronger than C. But if you use 2600 vs 2450, you don't get that 75% prediction; it will be lower, because the difference is not correct.

So a large list does a really good job at showing who is strongest, who is weakest, and how the others should be spread out between those two. But if you take two individual Elo numbers and use that to compute a match result, and then play a long match to get a really precise Elo difference just for those two programs, the numbers may or may not match even closely. In my case above, not a single difference is correct, even though the sorted order is correct.
What you are talking about is intransitivity, and as you say there is no solution to it. But playing head to head is the worst solution, unless the only thing you want to know is how well program A does against program B.

What I'm challenging is the idea that we care about the head to head with SF and Crafty. I don't think we care.

You seem to be assuming that if you play head to head you get an "accurate" rating difference, but that is far from correct when intransitivity is involved.

If A beats B 70% of the time, B beats C 70% of the time, and C beats A 70% of the time, you have a dilemma. You cannot say that C is better than A; the BEST you can assume is that all players are equal in strength. In a massive round robin they would come out with equal ratings. And we would have the situation where, if I wanted to "prove" that A was better than B, I could play a head-to-head match, say "see, I proved it," and then use this to infer other incorrect things.
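As a quick sanity check on that, the expected round-robin scores for such a cycle can be tabulated directly. A minimal C sketch (expected scores only, no simulated games; the 70% figures are the hypothetical ones above):

Code: Select all

/* Expected round-robin scores for the hypothetical 70% cycle:
   A beats B 70%, B beats C 70%, C beats A 70%. */
#include <stdio.h>

int main(void)
{
    /* p[i][j] = expected score of player i against player j */
    double p[3][3] = {
        {0.5, 0.7, 0.3},   /* A: beats B 70%, loses to C 70% */
        {0.3, 0.5, 0.7},   /* B: loses to A, beats C         */
        {0.7, 0.3, 0.5}    /* C: beats A, loses to B         */
    };
    const char *name[3] = {"A", "B", "C"};

    for (int i = 0; i < 3; i++) {
        double total = 0.0;
        for (int j = 0; j < 3; j++)
            if (j != i)
                total += p[i][j];
        printf("%s scores %.0f%% of the available points\n",
               name[i], 100.0 * total / 2.0);
    }
    return 0;
}

Each program collects exactly 50% of the available points, so any rating fit driven by total score against the same pool of opposition has to give them equal ratings.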


The numbers I posted last were simply 6,000 games, head-to-head. The Elo difference is precise. The number of games between SF and Crafty in a list is necessarily low, because they are trying to play everyone against everyone. So small head-to-head sample.
Yes, that exact matchup is low, but that does not mean the results are incorrect. For example, if they DID run every possible combination out to 1 million games, it would still not make Crafty move way up the rating list, close to Stockfish.

Now you have to somehow sort programs into order and maintain the individual Elo differences. It isn't possible, particularly in the above simple example.
Correct, because Elo is not completely transitive. But you cannot correct it by playing a single pair of players. After you do your head to head, you then have to consider the fact that if Crafty is stronger, then all the programs who played Crafty are under-rated too, which affects Stockfish's rating adversely. You cannot fix it; the best you can do is say that you need variety to balance out the various injustices.

Someone brought up a valid point earlier. If I were to tune against Crafty, I might get spectacular results against Crafty but horrible results against everyone else. I could run a test like you are doing to prove I'm better than Crafty when in fact I'm not.


The difference could come from that. Or elsewhere: I don't use any opening books at all. It could be that someone has a better book than I have. Or it could be that someone uses a book to force Crafty to play positions it does not like. Who knows? It would take a _ton_ of time to figure that out, and even my hardware is not enough, particularly due to all the non-Linux programs. So I'm not going to lose sleep over it, because this is _not_ unexpected. It is perfectly normal for anyone that understands what Elo really means.

I had email exchanges with Remi several years ago about this and other issues, and he guided me along in how to set this stuff up, how one introduces unexpected error (more opponents as discussed above) and such. And after some thought, I developed an understanding of what elo is all about.

If Crafty loses 78% of the games vs program A, it is 216 Elo worse. Not 240. Not 180. This is an exact number. If other lists have a different losing percentage than the 78% in my test, then that is another variable. Why? Different time controls can change things. Ponder on or off? Books? Learning? EGTBs (I am convinced they hurt me more than they help, after a lot of testing).


I produced accurate numbers, 6000 games of SF vs Crafty, no duplicated games, no odd opening book lines, etc. Which is far more accurate for determining an Elo that will predict Crafty vs SF game outcomes.
That's all well and good but why are we trying to predict Crafty vs SF game outcomes?
Because someone mentioned +300. I said that based on _my_ testing that was wrong. I posted results to support that. And here we are...

I would rather see a test that is more relevant, one that would predict what ratings you would get against a variety of opponents and compare that to the rating that Stockfish would get.
When you compare two programs by looking at their Elo ratings, what are you trying to determine? 99% of the people on planet earth are trying to answer the question "if these two play, who comes out on top and by how much." Ratings against a pool of players don't show that. Rating against a single opponent shows it extremely accurately.

I don't test against just one opponent, because I want to make sure that something that works against A doesn't hurt against B. And all I care about in my testing is answering the question "is version A' better or worse than A against this set of opponents?" I don't play A vs A', because I am not interested in that result nearly as much as the results against other programs...

What you do depends on what you want to know. Over the past 3-4 years, I have very carefully defined what I am trying to measure, and post those results. Occasionally, when it takes no effort, I will produce the head-to-head values I gave previously as it takes no work, just grab that part of the PGN and stuff it into BayesElo...




But a separate issue is why are we not comparing Crafty to the much stronger Rybka 4? If you cannot test that, you could use one of the clones that is much stronger than SF.

I have answered this many times. Point me to one that doesn't crash/hang regularly. I have not found one yet, although I have not looked that hard after getting burned so many times. And I do not believe, based on numbers I have seen, that R4 is "much stronger" than SF 1.8. If you have contradictory data, please point me to it. But testing vs R4 is out, without source...


So it isn't "odd" at all. It comes from an understanding of what the term "Elo rating" means and then trying to get the most accurate "difference" between Crafty's and SF's Elo.
I just don't see what we are going to do with that difference. What we want to know is how far away Crafty is from the top.
Different question, with a different answer. If you want to know exactly how far Crafty is from the top program, play 'em head to head for an exact (within statistical reason) answer. That was not the question being discussed here. I didn't start the topic. I did start a new thread because I do not see why we are discussing this in a Deep Blue thread, an SMP thread, etc...


In tennis, people are interested in who the number 1 player is. The number 1 player may not do very well against some specific opponent, but wins more against a variety of players.

So I don't care how Crafty does against Stockfish in head to head. The rating lists already give us the numbers. The error margins of a head to head test are not relevant in any way because that is a whole different test.
It is relevant if that is the issue that was raised, explicitly... I didn't bring it up. Someone said +300, I said good numbers show +200, and there we went...


If your objective really is to prove they are wrong, you need to run a more appropriate test, if you believe what you just wrote. Otherwise, this is interesting, but it doesn't relate to any of the recent discussions.

It _directly_ answers the question how much stronger is SF than Crafty?
It answers the question of how they would do in a head-to-head match.
Which is _exactly_ what "how much stronger is SF than Crafty?" means in any version of the English language I understand. In a tournament, the top two will eventually meet. Would you want to just stop before you finish the thing and compute "performance ratings" and let the biggest number win? Not desirable.

You have said yourself that you should test against a variety of opponents to get an accurate rating.
You are just blowing the meaning of "rating" all to hell. Elo's book covers that concept. Absolute number means nothing. Just delta between two opponents. The more games between those two opponents, and the fewer games with other opponents, the more accurate that "difference" becomes. Using an overall rating, which is a statistical average, is perfect for ordering everyone, knowing that you are going to get a few wrong because of the peculiarities of the game and the players. But it works well. But if you _really_ want to compare A to B and get a "Elo difference" then head-to-head it is. If you want to take a group and rank 'em from low to high, the rating lists do this well.

But we can't even get different rating systems to agree (USCF, BCF, ECF, CCF, FIDE, and who knows what else) so talking about "accuracy" is a bit misleading. "accuracy in ordering" is something else entirely, and the lists do that pretty nicely. But not accuracy in individual Elo from the intent of the original idea Elo implemented.

Answer, strong enough that Crafty wins 22% of the games, SF wins 78%. And the _only_ Elo difference that will produce that ratio is 216 or whatever it comes out in my data. Not +300, not +250. So my number is extremely accurate to answer _that_ question...
But we don't need the answer to that question. I'm happy for you that you know the answer, but how is it relevant to our discussion? In other words, where are you going to take this? This will produce a premise that cannot be applied to the "debate" that we are having.
The question asked, was how much stronger is SF than Crafty. My number is _the_ definitive answer for the test conditions I use, without the outside noise of other opponents, books, different hardware, different settings, etc. That's all I can say. If you want to take +300, fine by me. But if you play em in a match, SF is not going to win 90% of the games under my conditions. It is going to win 78%.


When you mix in other programs, you lose that specificity, to gain an overall view of who is best and who is worse, but the "how much better or worse" is significantly less accurate as a result.
The overall view is the one we want if you are trying to answer the question what is the relative strength difference between players. I don't know how head to head factors in to any of this.
Unfortunately you are not answering that "what is the strength difference" as that can _only_ be done head-to-head. Otherwise you can answer "what is the "average" ordering of these opponents from strongest to weakest?" But the elo difference between any two will not work as an accurate predictor of results in most cases. It might be close most of the time. But not accurate.


A clear example of "you can't have your cake and eat it too..."