Crafty vs Stockfish

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty vs Stockfish

Post by bob »

Don wrote:
bob wrote:
Don wrote:
bob wrote: Don, I understand the Elo model. I understand that a huge group of opponents produces a different Elo for A and B than just playing A vs B. The statement made was "your tests are showing SF as only better than Crafty by about +200, while other (much bigger) lists have it significantly better than that."
This is not a sample size issue at all. EVERY list out there says your test is producing a different result from all the lists.

I want to know why your numbers are so different. Do you have a theory?

Don:

Are you reading _any_ of the things I write? Here is a simple scenario that happened to me.

Three programs, A, B and C. I played A vs B and found A to be +200 (it was winning about 75% of the games). I played A vs C and found the same. And then I played B vs C and found B was about +100 better than C.

So you run all of that through (say) BayesElo and what do you get? For Elo to work, the rating difference between A and B _must_ be 200 in order to correctly predict that 75% score. The difference between A and C must be the same, for the same reason. Now your problem is to figure out how to make this true:

A - B = 200
A - C = 200
B - C = 100.

There is _no_ solution. A simple try would be A=2600, B=2450, C=2350. That predicts (correctly) that A is stronger than B or C, and B is stronger than C. But if you use 2600 vs 2450, you don't get that 75% prediction, it will be lower, because the difference is not correct.
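
A minimal sketch of the standard logistic Elo expectation (the plain formula, not BayesElo's exact draw model), using the hypothetical numbers from the example above, makes the mismatch concrete:

def expected_score(diff):
    """Expected score for the higher-rated side, given a rating gap 'diff'."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# (A - B) + (B - C) must equal (A - C), but 200 + 100 != 200, so no
# assignment of three ratings reproduces all three measured results.
print(expected_score(200))   # ~0.76 -- what A actually scored vs B and vs C
print(expected_score(150))   # ~0.70 -- what A=2600 vs B=2450 would predict
print(expected_score(100))   # ~0.64 -- what B actually scored vs C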

So a large list does a really good job at showing who is strongest, who is weakest, and how the others should be spread out between those two. But if you take two individual Elo numbers and use that to compute a match result, and then play a long match to get a really precise Elo difference just for those two programs, the numbers may or may not match even closely. In my case above, not a single difference is correct, even though the sorted order is correct.
What you are talking about is intransitivity, and as you say, there is no solution to it. But playing head to head is the worst solution, unless the only thing you want to know is how well program A does against program B.
So we agree, finally. And when someone asks "How much stronger is Stockfish than Crafty?" what is the most direct interpretation of that question? In every tournament we have played over the past ten years you would get a different answer when comparing two specific programs. So which one is right? The only reasonable answer is to play them and measure the difference in Elo... Any other answer has to be qualified in some way.

What I'm challenging is the idea that we care about the head to head with SF and Crafty. I don't think we care.
Fine by me. For the question asked, I did care, because comparing the two, IMHO, is asking which is better. It is quite easy to create some results for a group so that A is rated slightly higher, yet B is slightly better. The ratings are not correct because they (a) didn't play each other and (b) played different percentages of the games against different opponents. All you can say is "Based on the games we have to go on, A is stronger." Or one can do the actual test and prove "Nope, B is slightly stronger and here are the games to prove it..."

That was my _only_ interest. When someone asks a question like that, I have pre-computed data (since I save the PGN for recent matches) that makes it easy to whip out a very accurate answer.

You seem to be assuming that if you play head to head you get an "accurate" rating difference, but that is far from correct when intransitivity is involved.

That is absolutely, 100% wrong. Head to head solves the intransitivity issue since there _is_ none in head to head...

Take my A vs B vs C example. Add them together and you get wrong Elos. Play only head to head and you get perfect Elos...

If A beat B 70% of the time, and B beat C 70% of the time and C beats A 70% of the time you have a dilemma. You cannot say that C is better than A but the BEST you can assume is that all players are equal in strength.
Only in moron land. Ask me any question about who is better than who and I can answer. Ask me which is the best of the three and I can't. Head to head eliminates that noise, if you care about comparing any two. If you want to compare 3, things get very difficult.

In a massive round robin they would come out with equal ratings. And we would have the situation where if I wanted to "prove" that A was better than B, I would play a head to head and just say, "see, I proved it" and then I could use this to infer other incorrect things.
Since all 3 will finish up equal, I would play 3 head to head matches, and discover that this particular group has a problem. Shoot, in operating-system demand paging, increasing the resident set size decreases the page fault count _almost_ every time. And logically, every time. But Belady's anomaly pops up to burst that bubble. So there are always exceptions. Except when there aren't. Or vice-versa. But in College Football, or basketball, or soccer, or whatever, when someone asks "who is better" nobody cares about how they do against a common or disjoint set of opponents, everyone wants to know what will happen when those two teams play.
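
For the cyclic 70% example, a toy maximum-likelihood fit (a stand-in sketch for what a tool like BayesElo does, assuming hypothetical 100-game matches and ignoring draws) does indeed pull all three ratings together:

import math

def expected_score(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# A beats B 70%, B beats C 70%, C beats A 70% -- the cyclic case above.
matches = [("A", "B", 70, 30), ("B", "C", 70, 30), ("C", "A", 70, 30)]
ratings = {"A": 200.0, "B": 0.0, "C": -200.0}       # deliberately unequal start

for _ in range(20000):                              # simple gradient ascent
    grad = {k: 0.0 for k in ratings}
    for first, second, wins, losses in matches:
        e = expected_score(ratings[first] - ratings[second])
        g = (wins - (wins + losses) * e) * math.log(10) / 400.0
        grad[first] += g                            # d(log-likelihood)/d(rating)
        grad[second] -= g
    for k in ratings:
        ratings[k] += 20.0 * grad[k]

print({k: round(v) for k, v in ratings.items()})    # all three end up equal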



The numbers I posted last were simply 6,000 games, head-to-head. The Elo difference is precise. The number of games between SF and Crafty in a list is necessarily low, because they are trying to play everyone against everyone. So small head-to-head sample.
Yes, that exact matchup is low, but that does not mean the results are incorrect. For example, if they DID run every possible combination to 1 million games it would still not make Crafty move way up the rating list close to Stockfish.

Again, it depends on what you are measuring. Elo was defined to predict the outcome between two players, based on past history. And the more different opponents everyone plays, the better the list is sorted, but the less the Elo numbers mean when you try to use them to say for any two players, who will win how many games in a 100 game match.

That's my point. Head to head, you get a perfect predictor. In a large pool of players, you get a perfect ordering. At least as perfect as it can be. But the gap between two players is not meaningful in the Elo sense. It would rarely be wrong (but one can contrive a set of opponents where A > B in rating, yet A loses to B every time) in predicting who is best, but it would rarely be right for predicting exactly how much better the best player is.

Now you have to somehow sort programs into order and maintain the individual Elo differences. It isn't possible, particularly in the above simple example.
Correct, because Elo is not completely transitive. But you cannot correct it by playing a single pair of players. After you do your head to head you then have to consider the fact that if Crafty is stronger, then all the programs who played Crafty are under-rated too, which affects Stockfish's rating adversely. You cannot fix it; the best you can do is say that you need variety to balance out the various injustices.
Fine. What you want to do is to back up and use a wide-angle lens to see everyone. I want to use a microscope to look at two specific players, since that is the question as I saw it... Either makes sense. But when you say A is better than B, that is a reasonable statement if Elo(A) > Elo(B) with the Elo computed in a large pool of players. But when you say A is 200 better than B, that is a very specific statement that should imply A will win 75% of the games (approximately; again, I like the simple 200 Elo = 75% winning advantage. Not exact, but simple enough for mental calculations).


Someone brought up a valid point earlier. If I were to tune against Crafty, I might get spectacular results against Crafty but horrible results against everyone else. I could run a test like you are doing to prove I'm better than Crafty when in fact I'm not.
You have a very strange definition of "better". You beat me most games. But you are not better. If you change that to say "better against other players" then that is a different claim. But "better than Crafty" has to mean you beat me more often than not, otherwise we are speaking a different language.



The difference could be there. I don't use any opening books at all. It could be that someone has a better book than I have. Or it could be that someone uses a book to force Crafty to play positions it does not like. Who knows? It would take a _ton_ of time to figure that out, and even my hardware is not enough, particularly due to all the non-Linux programs. So I'm not going to lose sleep over it, because this is _not_ unexpected. It is perfectly normal for anyone that understands what Elo really means.

I had email exchanges with Remi several years ago about this and other issues, and he guided me along in how to set this stuff up, how one introduces unexpected error (more opponents, as discussed above) and such. And after some thought, I developed an understanding of what Elo is all about.

If Crafty loses 78% of the games vs program A, it is 216 Elo worse. Not 240. Not 180. This is an exact number. If other lists have a different losing percentage than the 78% in my test, then that is another variable. Why? Different time controls can change things. Ponder on or off? Books? Learning? EGTBs (I am convinced they hurt me more than they help, after a lot of testing).
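
For reference, a minimal sketch of the plain logistic inversion from a score fraction to an Elo gap (BayesElo models draws explicitly, so its figure, 216 in the data quoted here, comes out a few points lower than this raw formula):

import math

def elo_gap(score):
    """Elo difference implied by an overall score fraction, plain logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

print(round(elo_gap(0.78)))   # ~220 for a 78% score
print(round(elo_gap(0.75)))   # ~191 -- the "200 Elo = 75%" rule is only approximate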


I produced accurate numbers: 6000 games of SF vs Crafty, no duplicated games, no odd opening book lines, etc., which is far more accurate for determining an Elo that will predict Crafty vs SF game outcomes.
That's all well and good but why are we trying to predict Crafty vs SF game outcomes?
Because someone mentioned +300. I said that based on _my_ testing that was wrong. I posted results to support that. And here we are...

I would rather see a test that is more relevant, one that would predict what ratings you would get against a variety of opponents and compare that to the rating that Stockfish would get.
When you compare two programs by looking at their Elo ratings, what are you trying to determine? 99% of the people on planet earth are trying to answer the question "if these two play, who comes out on top and by how much." Ratings against a pool of players don't show that. Rating against a single opponent shows it extremely accurately.

I don't test against just one opponent, because I want to make sure that something that works against A doesn't hurt against B. And all I care about in my testing is answering the question "is version A' better or worse than A against this set of opponents?" I don't play A vs A', because I am not interested in that result nearly as much as the results against other programs...

What you do depends on what you want to know. Over the past 3-4 years, I have very carefully defined what I am trying to measure, and post those results. Occasionally, when it takes no effort, I will produce the head-to-head values I gave previously as it takes no work, just grab that part of the PGN and stuff it into BayesElo...




But a separate issue is why are we not comparing Crafty to the much stronger Rybka 4? If you cannot test that, you could use one of the clones that is much stronger than SF.

I have answered this many times. Point me to one that doesn't crash/hang regularly. I have not found one yet, although I have not looked that hard after getting burned so many times. And I do not believe, based on numbers I have seen, that R4 is "much stronger" than SF 1.8. If you have contradictory data, please point me to it. But testing vs R4 is out, without source...


So it isn't "odd" at all. It comes from an understanding of what the term "Elo rating" means and then trying to get the most accurate "difference" between Crafty's and SF's Elo.
I just don't see what we are going to do with that difference. What we want to know is how far away Crafty is from the top.
Different question, with a different answer. If you want to know exactly how far Crafty is from the top program, play 'em head to head for an exact (within statistical reason) answer. That was not the answer being discussed here. I didn't start the topic. I did start a new thread because I do not see why we are discussing this in a Deep Blue thread, an SMP thread, etc...


In tennis, people are interested in who the number 1 player is. The number 1 player may not do very well against some specific opponent, but wins more against a variety of players.

So I don't care how Crafty does against Stockfish in head to head. The rating lists already give us the numbers. The error margins of a head to head test are not relevant in any way because that is a whole different test.
It is relevant if that is the issue that was raised, explicitly... I didn't bring it up. Someone said +300, I said good numbers show +200, and there we went...


If your objective really is to prove they are wrong, you need to run a more appropriate test if you believe what you just wrote. Otherwise, this is interesting, but doesn't relate to any of the recent discussions.

It _directly_ answers the question how much stronger is SF than Crafty?
It answers the question, how would they do in a head to head match.
Which is _exactly_ what "how much stronger is SF than Crafty?" means in any version of the English language I understand. In a tournament, the top two will eventually meet. Would you want to just stop before you finish the thing and compute "performance ratings" and let the biggest number win? Not desirable.

You have said yourself that you should test against a variety of opponents to get an accurate rating.
You are just blowing the meaning of "rating" all to hell. Elo's book covers that concept. Absolute number means nothing. Just delta between two opponents. The more games between those two opponents, and the fewer games with other opponents, the more accurate that "difference" becomes. Using an overall rating, which is a statistical average, is perfect for ordering everyone, knowing that you are going to get a few wrong because of the peculiarities of the game and the players. But it works well. But if you _really_ want to compare A to B and get a "Elo difference" then head-to-head it is. If you want to take a group and rank 'em from low to high, the rating lists do this well.

But we can't even get different rating systems to agree (USCF, BCF, ECF, CCF, FIDE, and who knows what else), so talking about "accuracy" is a bit misleading. "Accuracy in ordering" is something else entirely, and the lists do that pretty nicely. But not accuracy in the individual Elo differences, in the sense Elo originally intended.

Answer: strong enough that Crafty wins 22% of the games and SF wins 78%. And the _only_ Elo difference that will produce that ratio is 216, or whatever it comes out to in my data. Not +300, not +250. So my number is extremely accurate to answer _that_ question...
But we don't need the answer to that question. I'm happy for you that you know the answer, but how is it relevant to our discussion? In other words, where are you going to take this? This will produce a premise that cannot be applied to the "debate" that we are having.
The question asked was how much stronger SF is than Crafty. My number is _the_ definitive answer for the test conditions I use, without the outside noise of other opponents, books, different hardware, different settings, etc. That's all I can say. If you want to take +300, fine by me. But if you play 'em in a match, SF is not going to win 90% of the games under my conditions. It is going to win 78%.


When you mix in other programs, you lose that specificity, to gain an overall view of who is best and who is worse, but the "how much better or worse" is significantly less accurate as a result.
The overall view is the one we want if you are trying to answer the question what is the relative strength difference between players. I don't know how head to head factors in to any of this.
Unfortunately, you are not answering "what is the strength difference," as that can _only_ be done head-to-head. Otherwise you are answering "what is the 'average' ordering of these opponents from strongest to weakest?" But the Elo difference between any two will not work as an accurate predictor of results in most cases. It might be close most of the time. But not accurate.


A clear example of "you can't have your cake and eat it too..."