Don wrote:
bob wrote:
Don, I understand the Elo model. I understand that a huge group of opponents produces a different Elo for A and B than just playing A vs B. The statement made was "your tests are showing SF as only better than Crafty by about +200, while other (much bigger) lists have it significantly better than that."
This is not a sample size issue at all. EVERY list out there says your test is producing a different result from all the lists.
I want to know why your numbers are so different. Do you have a theory?
Don:
Are you reading _any_ of the things I write? Here is a simple scenario that happened to me.
Three programs, A, B and C. I played A vs B and found A to be +200 (it was winning about 75% of the games). I played A vs C and found the same. And then I played B vs C and found B was about +100 better than C.
So you run all of that through (say) BayesElo and what do you get? For Elo to work, the rating difference between A and B _must_ be 200 in order to correctly predict that 75% score. The difference between A and C must be the same, for the same reason. Now your problem is to figure out how to make this true:
A - B = 200
A - C = 200
B - C = 100.
There is _no_ solution, because the differences have to add up: (A - B) + (B - C) must equal (A - C), and 200 + 100 is not 200. A simple try would be A=2600, B=2450, C=2350. That predicts (correctly) that A is stronger than B or C, and B is stronger than C. But if you use 2600 vs 2450, you don't get that 75% prediction; it will be lower (a 150-point gap predicts only about a 70% score), because the difference is not correct.
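For anyone who wants to run the arithmetic, here is a minimal Python sketch of that inconsistency. It uses the plain logistic Elo formula, nothing BayesElo-specific; the program names and trial ratings are just the ones from the example above.
[code]
# A minimal sketch of the inconsistency, using the plain logistic Elo model:
# expected score = 1 / (1 + 10^(-diff/400)).

def expected_score(diff):
    """Expected score of the higher-rated side for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Measured head-to-head differences from the example: A beat B and C by
# about +200 (~76% score), while B beat C by only about +100 (~64%).
measured = {("A", "B"): 200, ("A", "C"): 200, ("B", "C"): 100}

# One trial assignment of single ratings (the 2600/2450/2350 try above).
ratings = {"A": 2600, "B": 2450, "C": 2350}

for (p, q), true_diff in measured.items():
    fitted_diff = ratings[p] - ratings[q]
    print(f"{p} vs {q}: measured +{true_diff} ({expected_score(true_diff):.0%}), "
          f"list says +{fitted_diff} ({expected_score(fitted_diff):.0%})")

# Output: A vs B comes out +150 (~70%) and A vs C comes out +250 (~81%),
# neither of which reproduces the measured ~76%.  No single rating list
# can, because (A-B) + (B-C) always equals (A-C).
[/code]
It prints the measured difference next to what the single rating list would predict for each pair; shuffle the three numbers however you like and at least one pair always comes out wrong.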
So a large list does a really good job of showing who is strongest, who is weakest, and how the others should be spread out between those two. But if you take two individual Elo numbers from it and use them to predict a match result, and then play a long match to get a really precise Elo difference just for those two programs, the numbers may or may not match even closely. In my case above, not a single difference is correct, even though the sorted order is correct.
The numbers I posted last were simply 6,000 games, head-to-head. The Elo difference is precise. The number of games between SF and Crafty on a list is necessarily low, because they are trying to play everyone against everyone, so the head-to-head sample is small. Now you have to somehow sort the programs into order and still maintain the individual Elo differences. It isn't possible, as the simple example above shows. Part of the difference could also come from test conditions. I don't use any opening books at all. It could be that someone has a better book than I have, or that someone uses a book to force Crafty to play positions it does not like. Who knows? It would take a _ton_ of time to figure that out, and even my hardware is not enough, particularly due to all the non-Linux programs. So I'm not going to lose sleep over it, because this is _not_ unexpected. It is perfectly normal for anyone that understands what Elo really means.
I had email exchanges with Remi several years ago about this and other issues, and he guided me through how to set this stuff up, how one introduces unexpected error (more opponents, as discussed above) and such. And after some thought, I developed an understanding of what Elo is all about.
If Crafty loses 78% of the games vs program A, it is 216 Elo worse. Not 240. Not 180. This is an exact number. If other lists show a different losing percentage than the 78% in my test, then that is another variable. Why? Different time controls can change things. Ponder on or off? Books? Learning? EGTBs (I am convinced they hurt me more than they help, after a lot of testing)?
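For reference, here is the score-to-Elo conversion as a small Python sketch. It uses the plain logistic Elo formula; BayesElo fits wins, draws and losses with its own draw model, which is presumably where the ~216 figure comes from, so expect it to differ by a few points from the naive ~220 this gives.
[code]
import math

def elo_diff_from_score(p):
    """Elo difference implied by an expected score p (0 < p < 1), logistic model."""
    return 400.0 * math.log10(p / (1.0 - p))

def score_from_elo_diff(d):
    """Expected score implied by an Elo difference d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

print(elo_diff_from_score(0.78))   # ~ +220: the only difference consistent with 78%
print(score_from_elo_diff(300))    # ~ 0.85: the score a +300 gap would actually imply
[/code]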
I produced accurate numbers: 6,000 games of SF vs Crafty, no duplicated games, no odd opening book lines, etc., which is far more accurate for determining an Elo difference that will predict Crafty vs SF game outcomes.
That's all well and good but why are we trying to predict Crafty vs SF game outcomes?
Because someone mentioned +300. I said that based on _my_ testing that was wrong. I posted results to support that. And here we are...
I would rather see a test that is more relevant, one that would predict what ratings you would get against a variety of opponents and compare that to the rating that Stockfish would get.
When you compare two programs by looking at their Elo ratings, what are you trying to determine? 99% of the people on planet earth are trying to answer the question "if these two play, who comes out on top and by how much." Ratings against a pool of players don't show that. Rating against a single opponent shows it extremely accurately.
I don't test against just one opponent, because I want to make sure that something that works against A doesn't hurt against B. And all I care about in my testing is answering the question "is version A' better or worse than A against this set of opponents?" I don't play A vs A', because I am not interested in that result nearly as much as the results against other programs...
What you do depends on what you want to know. Over the past 3-4 years, I have very carefully defined what I am trying to measure, and post those results. Occasionally, when it takes no effort, I will produce the head-to-head values I gave previously; it's no work, I just grab that part of the PGN and stuff it into BayesElo...
But a separate issue is why are we not comparing Crafty to the much stronger Rybka 4? If you cannot test that, you could use one of the clones that is much stronger than SF.
I have answered this many times. Point me to one that doesn't crash/hang regularly. I have not found one yet, although I have not looked that hard after getting burned so many times. And I do not believe, based on numbers I have seen, that R4 is "much stronger" than SF 1.8. If you have contradictory data, please point me to it. But testing vs R4 is out, without source...
So it isn't "odd" at all. It comes from an understanding of what the term "Elo rating" means and then trying to get the most accurate "difference" between Crafty's and SF's Elo.
I just don't see what we are going to do with that difference. What we want to know is how far away Crafty is from the top.
Different question, with a different answer. If you want to know exactly how far Crafty is from the top program, play 'em head to head for an exact (within statistical reason) answer. That was not the question being discussed here. I didn't start the topic. I did start a new thread because I do not see why we are discussing this in a Deep Blue thread, an SMP thread, etc...
In tennis, people are interested in who the number 1 player is. The number 1 player may not do very well against some specific opponent, but wins more against a variety of players.
So I don't care how Crafty does against Stockfish in head to head. The rating lists already give us the numbers. The error margins of a head to head test are not relevant in any way because that is a whole different test.
It is relevant if that is the issue that was raised, explicitly... I didn't bring it up. Someone said +300, I said good numbers show +200, and there we went...
If your objective really is to prove they are wrong, you need to run a more appropriate test, if you believe what you just wrote. Otherwise, this is interesting, but doesn't relate to any of the recent discussions.
It _directly_ answers the question how much stronger is SF than Crafty?
It answers the question, how would they do in a head-to-head match.
Which is _exactly_ what "how much stronger is SF than Crafty?" means in any version of the English language I understand. In a tournament, the top two will eventually meet. Would you want to just stop before you finish the thing and compute "performance ratings" and let the biggest number win? Not desirable.
You have said yourself that you should test against a variety of opponents to get an accurate rating.
You are just blowing the meaning of "rating" all to hell. Elo's book covers that concept. The absolute number means nothing; only the delta between two opponents does. The more games between those two opponents, and the fewer games with other opponents, the more accurate that "difference" becomes. Using an overall rating, which is a statistical average, is perfect for ordering everyone, knowing that you are going to get a few wrong because of the peculiarities of the game and the players. It works well. But if you _really_ want to compare A to B and get an "Elo difference", then head-to-head it is. If you want to take a group and rank 'em from low to high, the rating lists do that well.
But we can't even get different rating systems to agree (USCF, BCF, ECF, CCF, FIDE, and who knows what else), so talking about "accuracy" is a bit misleading. "Accuracy in ordering" is something else entirely, and the lists do that pretty nicely. But not accuracy in the individual Elo differences, in the sense Elo originally intended.
Answer: strong enough that Crafty wins 22% of the games and SF wins 78%. And the _only_ Elo difference that will produce that ratio is 216, or whatever it comes out to in my data. Not +300, not +250. So my number is extremely accurate to answer _that_ question...
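To give a rough feel for how tightly 6,000 head-to-head games pin that number down, here is a simplified Python sketch. It treats each game as an independent win/loss trial at a 78% score, ignoring draws (which actually shrink the variance) and any correlation between games, so take it as a ballpark only.
[code]
import math

def elo_diff(p):
    """Elo difference implied by an expected score p (plain logistic model)."""
    return 400.0 * math.log10(p / (1.0 - p))

n, p = 6000, 0.78
se = math.sqrt(p * (1.0 - p) / n)   # standard error of the observed score
lo = elo_diff(p - 1.96 * se)        # 95% confidence interval, low end
hi = elo_diff(p + 1.96 * se)        # 95% confidence interval, high end
print(f"score {p:.0%} +/- {1.96 * se:.3f} -> Elo difference roughly {lo:.0f} to {hi:.0f}")
[/code]
That comes out to roughly +210 to +230, which is why +250 or +300 simply cannot be reconciled with the match result.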
But we don't need the answer to that question. I'm happy for you that you know the answer, but how is it relevant to our discussion? In other words, where are you going to take this? This will produce a premise that cannot be applied to the "debate" that we are having.
The question asked was how much stronger is SF than Crafty. My number is _the_ definitive answer for the test conditions I use, without the outside noise of other opponents, books, different hardware, different settings, etc. That's all I can say. If you want to take +300, fine by me. But if you play 'em in a match, SF is not going to win 90% of the games under my conditions. It is going to win 78%.
When you mix in other programs, you lose that specificity, to gain an overall view of who is best and who is worse, but the "how much better or worse" is significantly less accurate as a result.
The overall view is the one we want if you are trying to answer the question "what is the relative strength difference between players?" I don't know how head to head factors into any of this.
Unfortunately you are not answering "what is the strength difference," as that can _only_ be done head-to-head. Otherwise you can answer 'what is the "average" ordering of these opponents from strongest to weakest?' But the Elo difference between any two will not work as an accurate predictor of results in most cases. It might be close most of the time. But not accurate.
A clear example of "you can't have your cake and eat it too..."