Toga The Killer 1Y MP 4CPU is the strongest Toga....


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote:
bnemias wrote:Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100 maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about once it's done, compute how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/- 100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was still a little high; by 32,000 games it had settled down completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.

We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, each either 0 or 1, with a 1 representing a win for program A and a 0 a win for program B. He then searched for runs of 0's and 1's, and posted some analysis showing that a run of 10 consecutive wins or losses is hardly unexpected with two equal opponents.

This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
Well, if you are ignoring draws, that is totally possible, somewhere around 96%. But in the real world, not very likely at all.
What is the difference between 20 results with 10 draws and 10 wins, as opposed to 10 draws and 10 losses? Did you see Remi's discussion a couple of years back about ignoring draws anyway???

In the results I posted, I believe draws were happening 22% of the time, so roughly one in every five games was a draw. That doesn't really change things with respect to the complete randomness of taking a 10-game sample, as in the data that started this thread. 5.5-4.5 or 6.5-3.5 is simply meaningless when comparing the two programs. Simply meaningless...
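
For anyone who wants to see this without tying up a cluster for a day, the experiment is easy to simulate. Here is a minimal Monte Carlo sketch (Python; the 20 Elo "true" gap and the 22% draw rate are illustrative assumptions, not measured values):

import math, random

TRUE_ELO_DIFF = 20     # assumed true gap between the two programs
DRAW_RATE     = 0.22   # roughly the draw rate seen in my test
GAMES         = 32000

def expected_score(diff):
    # standard logistic Elo model
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def elo_from_score(score):
    # invert the model; clamp to keep the log finite
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)

p_win = expected_score(TRUE_ELO_DIFF) - DRAW_RATE / 2.0
points = 0.0
for game in range(1, GAMES + 1):
    r = random.random()
    if r < DRAW_RATE:
        points += 0.5            # draw
    elif r < DRAW_RATE + p_win:
        points += 1.0            # win for the stronger side
    if game in (200, 1000, 32000):
        print(game, "games: estimated diff =", round(elo_from_score(points / game), 1))

Run it a few times: the 200-game estimate jumps all over, while the 32,000-game estimate barely moves between runs.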
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
bnemias wrote:Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100 maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about once it's done, compute how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/- 100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was still a little high; by 32,000 games it had settled down completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.

We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, each either 0 or 1, with a 1 representing a win for program A and a 0 a win for program B. He then searched for runs of 0's and 1's, and posted some analysis showing that a run of 10 consecutive wins or losses is hardly unexpected with two equal opponents.

This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
Well, if you are ignoring draws, that is totally possible, somewhere around 96%. But in the real world, not very likely at all.
What is the difference between 20 results with 10 draws and 10 wins, as opposed to 10 draws and 10 losses? Did you see Remi's discussion a couple of years back about ignoring draws anyway???

In the results I posted, I believe draws were happening 22% of the time, so roughly one in every five games was a draw. That doesn't really change things with respect to the complete randomness of taking a 10-game sample, as in the data that started this thread. 5.5-4.5 or 6.5-3.5 is simply meaningless when comparing the two programs. Simply meaningless...
Well, you were talking about a string of 10 wins in a row, and I was assuming you were talking about a set of 1000 real games. If you insist on adding new criteria this late in the discussion, it kind of makes the whole discussion, and the math, completely pointless. If two programs are equal and have a 22% draw rate, the chance of a win is 39%; if you are ignoring draws, the chance of a win is 50%. When you are talking about 10 in a row, that is a HUGE difference.

So if you are ignoring draws, I'm going to use fuzzier math, because I haven't the time to waste figuring out formulas for this new scenario just now. Play 1000 games, throw out the 22% that are draws, and you are left with about 780 games that can be represented as a string of 1's and 0's with a 50% probability of a win; it comes out to about a 75% chance that there are 10 wins in a row somewhere. The chance that there are 10 wins in a row and ten losses in a row in the same series will probably have an upper bound of (75%)^2. Someone with more time may want to verify that. So in the end, it is likely that in this subset of the chess universe you are right.
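
If anyone does have the time, the verification is a five-minute simulation rather than a formula. A rough sketch (Python; the 10,000-trial count is arbitrary) that estimates the chance of at least one 10-win streak under both assumptions:

import random

def has_win_run(n_games, p_win, run_len=10):
    # scan one simulated series for run_len consecutive wins;
    # any non-win (loss or draw) resets the streak
    streak = 0
    for _ in range(n_games):
        if random.random() < p_win:
            streak += 1
            if streak >= run_len:
                return True
        else:
            streak = 0
    return False

def estimate(n_games, p_win, trials=10000):
    return sum(has_win_run(n_games, p_win) for _ in range(trials)) / trials

# draws ignored: ~780 decisive games at a 50% win chance
print("no draws :", estimate(780, 0.50))
# draws kept: 1000 games where a win happens 39% of the time
print("22% draws:", estimate(1000, 0.39))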
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bnemias wrote:Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100 maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about once it's done, compute how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/- 100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was still a little high; by 32,000 games it had settled down completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.

We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, each either 0 or 1, with a 1 representing a win for program A and a 0 a win for program B. He then searched for runs of 0's and 1's, and posted some analysis showing that a run of 10 consecutive wins or losses is hardly unexpected with two equal opponents.

This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
Well, if you are ignoring draws, that is totally possible, somewhere around 96%. But in the real world, not very likely at all.
What is the difference between 20 results with 10 draws and 10 wins, as opposed to 10 draws and 10 losses? Did you see Remi's discussion a couple of years back about ignoring draws anyway???

In the results I posted, I believe draws were happening 22% of the time, so roughly one in every five games was a draw. That doesn't really change things with respect to the complete randomness of taking a 10-game sample, as in the data that started this thread. 5.5-4.5 or 6.5-3.5 is simply meaningless when comparing the two programs. Simply meaningless...
Well, you were talking about a string of 10 wins in a row, and I was assuming you were talking about a set of 1000 real games. If you insist on adding new criteria this late in the discussion, it kind of makes the whole discussion, and the math, completely pointless. If two programs are equal and have a 22% draw rate, the chance of a win is 39%; if you are ignoring draws, the chance of a win is 50%. When you are talking about 10 in a row, that is a HUGE difference.

So if you are ignoring draws, I'm going to use fuzzier math, because I haven't the time to waste figuring out formulas for this new scenario just now. Play 1000 games, throw out the 22% that are draws, and you are left with about 780 games that can be represented as a string of 1's and 0's with a 50% probability of a win; it comes out to about a 75% chance that there are 10 wins in a row somewhere. The chance that there are 10 wins in a row and ten losses in a row in the same series will probably have an upper bound of (75%)^2. Someone with more time may want to verify that. So in the end, it is likely that in this subset of the chess universe you are right.
Let's back up to what I said. If you look at the _original_ data posted, there were 3-4 sets of 10-game matches. _None_ of them were 10-0. I pointed out that from a set of 10 games, you can draw no conclusions, even if you did get a 10-0 or 0-10 result, because the error bar is too large, and 10 consecutive wins or losses is not exactly a rare event.

When comparing _two_ programs, draws are irrelevant. Remi explained this in detail a few years back with great clarity. And the results that were posted were exactly that, 4 ten-game matches against 4 different opponents. But my point always was "ten games is not enough to learn _anything_." Nothing more, nothing less.

Rank Name           Elo   +   - games score  oppo draws
   1 Fruit 2.1     2644  70  66    16   66%  2556   31%
   2 Crafty-23.1-1 2556  66  70    16   34%  2644   31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name           Elo   +   - games score  oppo draws
   1 Crafty-23.1-1 2623   5   5  7782   56%  2577   26%
   2 Fruit 2.1     2577   5   5  7782   44%  2623   26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story", and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).
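
For reference, error bars like the ones above come from nothing more exotic than the spread of a small sample. A simplified sketch (Python; this is a plain two-sigma normal approximation, not BayesElo's actual algorithm, and the win/draw/loss splits are my guesses at counts consistent with the percentages):

import math

def elo_diff(score):
    # logistic Elo model, score given as a fraction of points
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_interval(wins, draws, losses):
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # per-game variance of the score, then two-sigma bounds on the mean
    var = (wins * (1.0 - s) ** 2 + draws * (0.5 - s) ** 2
           + losses * (0.0 - s) ** 2) / n
    sigma = math.sqrt(var / n)
    return elo_diff(s), elo_diff(s - 2 * sigma), elo_diff(s + 2 * sigma)

# 16-game match, ~66% score with 31% draws (8-5-3 is one consistent split)
print(elo_interval(8, 5, 3))
# 7782-game match, ~56% score with 26% draws
print(elo_interval(3346, 2023, 2413))

The 16-game interval is enormous; the 7782-game interval is a handful of Elo wide. That is the whole argument in two lines of output.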
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bnemias wrote:Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100 maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about once it's done, compute how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/- 100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was still a little high; by 32,000 games it had settled down completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.

We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, each either 0 or 1, with a 1 representing a win for program A and a 0 a win for program B. He then searched for runs of 0's and 1's, and posted some analysis showing that a run of 10 consecutive wins or losses is hardly unexpected with two equal opponents.

This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
Well, if you are ignoring draws, that is totally possible, somewhere around 96%. But in the real world, not very likely at all.
What is the difference between 20 results with 10 draws and 10 wins, as opposed to 10 draws and 10 losses? Did you see Remi's discussion a couple of years back about ignoring draws anyway???

In the results I posted, I believe draws were happening 22% of the time, so roughly one in every five games was a draw. That doesn't really change things with respect to the complete randomness of taking a 10-game sample, as in the data that started this thread. 5.5-4.5 or 6.5-3.5 is simply meaningless when comparing the two programs. Simply meaningless...
Well, you were talking about a string of 10 wins in a row, and I was assuming you were talking about a set of 1000 real games. If you insist on adding new criteria this late in the discussion, it kind of makes the whole discussion, and the math, completely pointless. If two programs are equal and have a 22% draw rate, the chance of a win is 39%; if you are ignoring draws, the chance of a win is 50%. When you are talking about 10 in a row, that is a HUGE difference.

So if you are ignoring draws, I'm going to use fuzzier math, because I haven't the time to waste figuring out formulas for this new scenario just now. Play 1000 games, throw out the 22% that are draws, and you are left with about 780 games that can be represented as a string of 1's and 0's with a 50% probability of a win; it comes out to about a 75% chance that there are 10 wins in a row somewhere. The chance that there are 10 wins in a row and ten losses in a row in the same series will probably have an upper bound of (75%)^2. Someone with more time may want to verify that. So in the end, it is likely that in this subset of the chess universe you are right.
Let's back up to what I said. If you look at the _original_ data posted, there were 3-4 sets of 10-game matches. _None_ of them were 10-0. I pointed out that from a set of 10 games, you can draw no conclusions, even if you did get a 10-0 or 0-10 result, because the error bar is too large, and 10 consecutive wins or losses is not exactly a rare event.
This is a separate topic: 10 consecutive wins or losses is a rare event except when one engine is clearly stronger than the other. The math doesn't lie.
bob wrote: When comparing _two_ programs, draws are irrelevant. Remi explained this in detail a few years back with great clarity. And the results that were posted were exactly that, 4 ten-game matches against 4 different opponents. But my point always was "ten games is not enough to learn _anything_." Nothing more, nothing less.

Rank Name           Elo   +   - games score  oppo draws
   1 Fruit 2.1     2644  70  66    16   66%  2556   31%
   2 Crafty-23.1-1 2556  66  70    16   34%  2644   31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name           Elo   +   - games score  oppo draws
   1 Crafty-23.1-1 2623   5   5  7782   56%  2577   26%
   2 Fruit 2.1     2577   5   5  7782   44%  2623   26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story", and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).
I would trust both results. Results like that can and do happen. Never a doubt. If results like that happen frequently, then I would definitely be checking for anomalies in the test procedure. Of course, if I could run 8000 games in 15 minutes, it probably wouldn't be worth the time to sort out the anomalies.
Ryan Benitez
Posts: 719
Joined: Thu Mar 09, 2006 1:21 am
Location: Portland Oregon

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by Ryan Benitez »

bob wrote:
Rank Name           Elo   +   - games score  oppo draws
   1 Fruit 2.1     2644  70  66    16   66%  2556   31%
   2 Crafty-23.1-1 2556  66  70    16   34%  2644   31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name           Elo   +   - games score  oppo draws
   1 Crafty-23.1-1 2623   5   5  7782   56%  2577   26%
   2 Fruit 2.1     2577   5   5  7782   44%  2623   26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story", and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).
If someone does not already agree that 10 or 16 games is not enough, they cannot be helped. Maybe you are hunting such people out in case you see them at a poker table some day? It is always good to know who lacks elementary math skills at the poker table.
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

Ryan Benitez wrote:
bob wrote:
Rank Name           Elo   +   - games score  oppo draws
   1 Fruit 2.1     2644  70  66    16   66%  2556   31%
   2 Crafty-23.1-1 2556  66  70    16   34%  2644   31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name           Elo   +   - games score  oppo draws
   1 Crafty-23.1-1 2623   5   5  7782   56%  2577   26%
   2 Fruit 2.1     2577   5   5  7782   44%  2623   26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story", and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).
If someone does not already agree that 10 or 16 games is not enough, they cannot be helped. Maybe you are hunting such people out in case you see them at a poker table some day? It is always good to know who lacks elementary math skills at the poker table.
Strange thing is, I have a degree in math. But I guess you haven't been following along: the question isn't whether more games are better; the question is whether a small set of games has any value at all. Just because a small sample isn't always right does not mean that it is always wrong. If you have set things up correctly, it will be right far more often than it will be wrong.
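
That claim is easy to quantify. Here is a sketch (Python) of how often a 10-game sample at least points at the genuinely stronger program; the 50 Elo edge and 22% draw rate are illustrative assumptions:

import random

def sample_score(n_games, elo_edge, draw_rate):
    # expected score from the logistic Elo model, split into win/draw/loss
    exp = 1.0 / (1.0 + 10.0 ** (-elo_edge / 400.0))
    p_win = exp - draw_rate / 2.0
    pts = 0.0
    for _ in range(n_games):
        r = random.random()
        if r < draw_rate:
            pts += 0.5
        elif r < draw_rate + p_win:
            pts += 1.0
    return pts / n_games

trials, right, tied = 20000, 0, 0
for _ in range(trials):
    s = sample_score(10, 50, 0.22)
    if s > 0.5:
        right += 1          # the sample picked the stronger side
    elif s == 0.5:
        tied += 1
print("right:", right / trials, "tied:", tied / trials)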
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
krazyken wrote:
bob wrote:
bnemias wrote:Can we just produce some data points to illustrate?

Say, take a linear 32,000-game match between A and A'. Pick n unique random starting points in the list (100 maybe) to produce n 10-game runs. Then it should be easy to compute the accuracy of a run. Also, it might be interesting to post some of the most skewed runs.

In fact, this issue keeps recurring. How about once it's done, compute how many runs of n produce results within some Elo of the actual difference? Then bookmark the link so you can reference the data any time this issue comes up again. Heh.
I believe I already posted most of what you asked for. I played a 32,000-game match and grabbed the partial results every few seconds. For the first 200 games, the Elo was +/- 100 from the truth, all over the place. By the time we hit 1,000 games it had settled down, although it was still a little high; by 32,000 games it had settled down completely. Three runs with the same programs produced final Elo values within the stated +/-5 error bar BayesElo gives.

We had this same discussion several years ago. Chris Whittington used a different methodology to illustrate this point. He assumed two programs of identical strength, ignored draws, and simply generated a string of 1,000 numbers, each either 0 or 1, with a 1 representing a win for program A and a 0 a win for program B. He then searched for runs of 0's and 1's, and posted some analysis showing that a run of 10 consecutive wins or losses is hardly unexpected with two equal opponents.

This was the Nth time testing came up. I know everyone wants to be able to see the truth with 10 games. But you don't even see a good guess with 100 games. Unfortunately.
Well, if you are ignoring draws, that is totally possible, somewhere around 96%. But in the real world, not very likely at all.
What is the difference between 20 results with 10 draws and 10 wins, as opposed to 10 draws and 10 losses? Did you see Remi's discussion a couple of years back about ignoring draws anyway???

In the results I posted, I believe draws were happening 22% of the time, so roughly one in every five games was a draw. That doesn't really change things with respect to the complete randomness of taking a 10-game sample, as in the data that started this thread. 5.5-4.5 or 6.5-3.5 is simply meaningless when comparing the two programs. Simply meaningless...
Well, you were talking about a string of 10 wins in a row, and I was assuming you were talking about a set of 1000 real games. If you insist on adding new criteria this late in the discussion, it kind of makes the whole discussion, and the math, completely pointless. If two programs are equal and have a 22% draw rate, the chance of a win is 39%; if you are ignoring draws, the chance of a win is 50%. When you are talking about 10 in a row, that is a HUGE difference.

So if you are ignoring draws, I'm going to use fuzzier math, because I haven't the time to waste figuring out formulas for this new scenario just now. Play 1000 games, throw out the 22% that are draws, and you are left with about 780 games that can be represented as a string of 1's and 0's with a 50% probability of a win; it comes out to about a 75% chance that there are 10 wins in a row somewhere. The chance that there are 10 wins in a row and ten losses in a row in the same series will probably have an upper bound of (75%)^2. Someone with more time may want to verify that. So in the end, it is likely that in this subset of the chess universe you are right.
Let's back up to what I said. If you look at the _original_ data posted, there were 3-4 sets of 10-game matches. _None_ of them were 10-0. I pointed out that from a set of 10 games, you can draw no conclusions, even if you did get a 10-0 or 0-10 result, because the error bar is too large, and 10 consecutive wins or losses is not exactly a rare event.
This is a separate topic: 10 consecutive wins or losses is a rare event except when one engine is clearly stronger than the other. The math doesn't lie.
bob wrote: When comparing _two_ programs, draws are irrelevant. Remi explained this in detail a few years back with great clarity. And the results that were posted were exactly that, 4 ten-game matches against 4 different opponents. But my point always was "ten games is not enough to learn _anything_." Nothing more, nothing less.

Rank Name           Elo   +   - games score  oppo draws
   1 Fruit 2.1     2644  70  66    16   66%  2556   31%
   2 Crafty-23.1-1 2556  66  70    16   34%  2644   31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name           Elo   +   - games score  oppo draws
   1 Crafty-23.1-1 2623   5   5  7782   56%  2577   26%
   2 Fruit 2.1     2577   5   5  7782   44%  2623   26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story", and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).
I would trust both results. Results like that can and do happen. Never a doubt. If results like that happen frequently, then I would definitely be checking for anomalies in the test procedure. Of course, if I could run 8000 games in 15 minutes, it probably wouldn't be worth the time to sort out the anomalies.
You can't trust _both_. They are mutually exclusive. So either Crafty is better, or it is worse. It can't possibly be both. But a small number of games is not enough to answer the question.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by bob »

krazyken wrote:
Ryan Benitez wrote:
bob wrote:
Rank Name           Elo   +   - games score  oppo draws
   1 Fruit 2.1     2644  70  66    16   66%  2556   31%
   2 Crafty-23.1-1 2556  66  70    16   34%  2644   31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name           Elo   +   - games score  oppo draws
   1 Crafty-23.1-1 2623   5   5  7782   56%  2577   26%
   2 Fruit 2.1     2577   5   5  7782   44%  2623   26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story", and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).
If someone does not already agree that 10 or 16 games is not enough, they cannot be helped. Maybe you are hunting such people out in case you see them at a poker table some day? It is always good to know who lacks elementary math skills at the poker table.
Strange thing is, I have a degree in math. But I guess you haven't been following along: the question isn't whether more games are better; the question is whether a small set of games has any value at all. Just because a small sample isn't always right does not mean that it is always wrong. If you have set things up correctly, it will be right far more often than it will be wrong.
That is simply _WRONG_.

I've given several examples. There is no "right way" to set up a test so that 10 games will tell you with any sort of usable certainty that A is better or worse than B. Simply no way.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by Laskos »

bob wrote:
That is simply _WRONG_.

I've given several examples. There is no "right way" to set up a test so that 10 games will tell you with any sort of usable certainty that A is better or worse than B. Simply no way.
There are simple probabilistic formulas to figure out the reliability of the results. 10-0 is probably a one-to-two sigma result; 7-3 is less than one sigma. I don't know why such discussions take place again and again, when it is clear that for a 10 Elo point difference one has to play thousands of games to get two sigmas.
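
The arithmetic is one line. A sketch (Python; it assumes a per-game score standard deviation of about 0.4, which is typical when 20-30% of games are drawn) of how many games two sigmas of separation requires:

import math

def games_for_two_sigma(elo_diff, per_game_sd=0.4):
    # score edge of the stronger side over 50%, logistic Elo model
    edge = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0)) - 0.5
    # require 2 * per_game_sd / sqrt(n) <= edge, so n >= (2*sd/edge)^2
    return math.ceil((2.0 * per_game_sd / edge) ** 2)

for d in (5, 10, 20, 50):
    print(d, "Elo:", games_for_two_sigma(d), "games")

At 10 Elo this lands in the low thousands of games, which is exactly the point.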

Kai
krazyken

Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....

Post by krazyken »

bob wrote:
krazyken wrote:
Ryan Benitez wrote:
bob wrote:
Rank Name           Elo   +   - games score  oppo draws
   1 Fruit 2.1     2644  70  66    16   66%  2556   31%
   2 Crafty-23.1-1 2556  66  70    16   34%  2644   31%

There's a 16-game match, with an error bar that is 150 points wide.

Here's how that ended:

Rank Name           Elo   +   - games score  oppo draws
   1 Crafty-23.1-1 2623   5   5  7782   56%  2577   26%
   2 Fruit 2.1     2577   5   5  7782   44%  2623   26%

Which one would you trust? Which one would give you a _reasonable_ idea of which is better? That is all I have been saying, from the get-go. If you want to draw a conclusion from that first set of data, fine. But it is not very meaningful, particularly when, as Paul Harvey used to say, "now here's the rest of the story", and you get the final match results (only 8K games; waiting 15 minutes was enough to make the point here).
If someone does not already agree that 10 or 16 games is not enough, they cannot be helped. Maybe you are hunting such people out in case you see them at a poker table some day? It is always good to know who lacks elementary math skills at the poker table.
Strange thing is, I have a degree in math. But I guess you haven't been following along: the question isn't whether more games are better; the question is whether a small set of games has any value at all. Just because a small sample isn't always right does not mean that it is always wrong. If you have set things up correctly, it will be right far more often than it will be wrong.
That is simply _WRONG_.

I've given several examples. There is no "right way" to set up a test so that 10 games will tell you with any sort of usable certainty that A is better or worse than B. Simply no way.
Proof by example? My professors never let me get away with that.
I suppose it is possible that BayesElo uses an algorithm that requires a minimum sample size. It is more likely that there are some assumptions inherent in the algorithms that are not being satisfied by some of your samples. Regardless, BayesElo is not the only possible statistical tool; just because it has trouble with a particular sample doesn't mean we need to throw away all the rest of the statistical tools we have at our disposal and declare the sample worthless.
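
One concrete example of another tool: the "likelihood of superiority" number, which turns a small win/loss sample into a probability that one side is genuinely stronger. A sketch (Python; this is the standard normal approximation to the binomial, the sample results are made up, and draws are dropped per Remi's argument):

from math import erf, sqrt

def los(wins, losses):
    # P(first program is stronger | results), normal approximation;
    # draws carry no information about which side is stronger
    return 0.5 + 0.5 * erf((wins - losses) / sqrt(2.0 * (wins + losses)))

print(los(6, 3))    # a 6.5-3.5-style result with the draw removed
print(los(10, 0))   # a 10-0 sweep

A 6-3 sample gives an LOS in the mid-80s percent, evidence but not proof; even a 10-0 sweep leaves a sliver of doubt.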