krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bnemias wrote:
Can we just produce some data points to illustrate?
Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100, maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.
In fact, this issue keeps recurring. How about, once it's done, computing how many of the n runs produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
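[Editor's note: here is a minimal sketch of that sampling procedure. It is illustrative only: since the actual 32,000-game result list is not reproduced in the thread, it fabricates one from assumed win/draw/loss rates; the strength gap, draw rate, window length, and sample count are all assumptions, not data from the thread.]

import random

random.seed(1)

# Assumed parameters -- not taken from any real match in this thread.
N_GAMES = 32000   # length of the "linear" match
WIN_P   = 0.42    # probability A wins a single game (assumption)
DRAW_P  = 0.22    # draw rate (assumption, roughly the rate mentioned later)
RUN_LEN = 10      # length of each sampled run
N_RUNS  = 100     # number of random starting points

def random_result():
    """One game result from A's point of view: 1, 0.5, or 0."""
    r = random.random()
    if r < WIN_P:
        return 1.0
    if r < WIN_P + DRAW_P:
        return 0.5
    return 0.0

results = [random_result() for _ in range(N_GAMES)]
true_score = sum(results) / N_GAMES

starts = random.sample(range(N_GAMES - RUN_LEN + 1), N_RUNS)
run_scores = [sum(results[s:s + RUN_LEN]) / RUN_LEN for s in starts]
errors = [abs(rs - true_score) for rs in run_scores]

print("full-match score        : %.3f" % true_score)
print("most skewed 10-game runs: %.2f .. %.2f" % (min(run_scores), max(run_scores)))
print("runs within 5%% of truth : %d of %d" % (sum(e <= 0.05 for e in errors), N_RUNS))

[The spread of the 10-game run scores is large; converting them to Elo would make the swings look even more dramatic, which is exactly the point argued in the replies below.]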
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/- 100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was a little high; by 32,000 games it had settled down completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.
We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, either 0 or 1, with a 1 representing a win for program A and a 0 a win for program B. He then searched for runs of 0's or 1's, and posted some analysis showing that a run of 10 wins or losses is hardly unexpected with two equal opponents.
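[Editor's note: a quick way to reproduce that thought experiment, as described above rather than Whittington's original code, is to generate the 1,000-long 0/1 string and look for a 10-long streak. The 1,000 games and the streak length of 10 come from the post; the trial count is an assumption.]

import random

random.seed(2)

N_GAMES = 1000    # string length from the post
RUN_LEN = 10      # streak length being discussed
TRIALS  = 20000   # Monte Carlo trials (assumption)

def longest_run(bits):
    """Length of the longest run of identical symbols in the sequence."""
    best = cur = 1
    for prev, nxt in zip(bits, bits[1:]):
        cur = cur + 1 if nxt == prev else 1
        best = max(best, cur)
    return best

hits = 0
for _ in range(TRIALS):
    games = [random.randint(0, 1) for _ in range(N_GAMES)]
    if longest_run(games) >= RUN_LEN:   # a streak of 10 wins OR 10 losses
        hits += 1

print("fraction of 1000-game strings containing a 10+ streak: %.3f" % (hits / TRIALS))

[Running this shows that between two equal opponents, with draws ignored, such a streak turns up in a substantial fraction of 1,000-game strings -- which was Whittington's point.]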
This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
Well, if you are ignoring draws, that is entirely possible; somewhere around 96%. But in the real world, not very likely at all.
What is the difference between 20 results with 10 draws and 10 wins, as opposed to 10 draws and 10 losses? Did you see Remi's discussion a couple of years back about ignoring draws anyway?
In the results I posted, I believe draws were happening 22% of the time, so roughly one game in five was a 1/2. That doesn't really change things with respect to the complete randomness of taking a 10-game sample like the data that started this thread. 5.5-4.5 or 6.5-3.5 is simply meaningless when comparing the two programs. Simply meaningless...
Well, you were talking about a string of 10 wins in a row, and I was assuming you were talking about a set of 1,000 real games. If you insist on adding new criteria this late in the discussion, it makes the math pointless. If two programs are equal and get a 22% draw rate, the chance of a win in any given game is 39%; if you are ignoring draws, the chance of a win is 50%. When you are talking about 10 in a row, that is a HUGE difference.
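[Editor's note: to put a rough number on that difference (my arithmetic, not from the thread), the chance that a particular stretch of 10 games is all wins shrinks by roughly a factor of 12 once draws are counted:]

# Per-window probability of 10 straight wins between equal programs
# (illustrative arithmetic only, not from the thread's data).
p_with_draws    = 0.39 ** 10   # 22% draw rate counted, win chance 39%
p_draws_removed = 0.50 ** 10   # draws thrown out first, win chance 50%
print(p_with_draws, p_draws_removed, p_draws_removed / p_with_draws)
# ~8.1e-05 vs ~9.8e-04 -- about 12x likelier per window once draws are removed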
So if you are ignoring draws, I'm going to use fuzzier math, because I don't have time to work out exact formulas for this new scenario just now. Play 1,000 games, throw out the 22% that are draws, and you are left with about 780 games that can be represented as a string of 1's and 0's, each a win with probability 50%. That string contains roughly 770 windows of 10 consecutive games, each all-wins with probability 1/1024, so the expected number of 10-win runs is about 0.75; a run of ten straight wins is far from unlikely. The chance of seeing both a 10-win run and a 10-loss run in the same series is harder to pin down, but is presumably no more than about (0.75)^2 if you treat the two as roughly independent. Someone with more time may want to verify that. So in the end, it is likely that in this subset of the chess universe you are right.
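[Editor's note: since the poster invites verification, here is a quick Monte Carlo check one could run. It is entirely a sketch: the 780 decisive games and the 10-game streak come from the post, the trial count is an assumption. It estimates how often a 10-win run, and how often both a 10-win and a 10-loss run, appear between equal programs.]

import random

random.seed(3)

N_DECISIVE = 780     # ~1,000 games minus 22% draws, per the post
RUN_LEN    = 10
TRIALS     = 20000   # assumption

def has_run(games, symbol, length):
    """True if `games` contains `length` consecutive results equal to `symbol`."""
    streak = 0
    for g in games:
        streak = streak + 1 if g == symbol else 0
        if streak >= length:
            return True
    return False

win_runs = both_runs = 0
for _ in range(TRIALS):
    games = [random.randint(0, 1) for _ in range(N_DECISIVE)]
    w = has_run(games, 1, RUN_LEN)
    l = has_run(games, 0, RUN_LEN)
    win_runs  += w
    both_runs += (w and l)

print("P(10-win streak somewhere)   ~ %.3f" % (win_runs / TRIALS))
print("P(10-win AND 10-loss streak) ~ %.3f" % (both_runs / TRIALS))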
Let's back up to what I said. If you look at the _original_ data posted, there were 3-4 sets of 10-game matches. _None_ of them were 10-0. I pointed out that from a set of 10 games you can draw no conclusions, even if you did get a 10-0 or 0-10 result, because the error bar is too large, and 10 consecutive wins or losses is not exactly a rare event.
When comparing _two_ programs, draws are irrelevant. Remi explained this in detail a few years back with great clarity. And the results that were posted were exactly that: four 10-game matches against four different opponents. But my point all along was "ten games is not enough to learn _anything_." Nothing more, nothing less.
Rank Name            Elo    +    -  games  score  oppo.  draws
   1 Fruit 2.1      2644   70   66     16    66%   2556    31%
   2 Crafty-23.1-1  2556   66   70     16    34%   2644    31%
There's a 16-game match, with an error bar roughly 140 Elo wide (+70/-66).
Here's how that ended:
Rank Name            Elo    +    -  games  score  oppo.  draws
   1 Crafty-23.1-1  2623    5    5   7782    56%   2577    26%
   2 Fruit 2.1      2577    5    5   7782    44%   2623    26%
Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story," and you get the final match results (only ~8K games; waiting 15 minutes was enough to make the point here).
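[Editor's note: as a rough illustration of why the second table is trustworthy and the first is not, here is how the error bar shrinks with the number of games. This is a simplified binomial approximation of my own using the standard logistic Elo formula; it is not BayesElo's actual model, so it will not reproduce its +/- columns exactly.]

import math

def elo_diff(score):
    """Logistic Elo difference implied by a score fraction (0 < score < 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_interval(score, games, z=1.96):
    """Crude 95% interval: binomial error on the score, mapped to Elo.

    Ignores draws and the opponent-pool modelling BayesElo does; the point is
    the 1/sqrt(N) shrinkage of the error bar, not the exact width.
    """
    se = math.sqrt(score * (1.0 - score) / games)
    lo = max(score - z * se, 1e-6)
    hi = min(score + z * se, 1.0 - 1e-6)
    return elo_diff(lo), elo_diff(score), elo_diff(hi)

# Score percentages and game counts taken from the two tables above.
for score, games in ((0.66, 16), (0.56, 7782)):
    lo, mid, hi = elo_interval(score, games)
    print("%5d games at %.0f%%: Elo diff %+6.0f  (95%% CI %+6.0f .. %+6.0f)"
          % (games, score * 100, mid, lo, hi))

[Even this crude approximation makes the contrast plain: a couple of hundred Elo of uncertainty at 16 games versus well under +/-10 at roughly 8,000.]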