krazyken wrote:
As a final note, if I post 10 games and a few others post their 10-game matches, I would have much more info than if nobody posted their 10 games for fear of having their time called worthless.

bob wrote:
However, that is _not_ what is happening. Someone is posting ten game results, and drawing a conclusion that is not supported by any statistical analysis of any kind. 10 games. Almost a +/- 200 error bar. Not very informative or useful.

krazyken wrote:
I didn't see anybody drawing conclusions in this thread from only 10 games. It seems the person that was being ridiculed for posting only 10 games has already posted 70, and still nobody has drawn any conclusions based on the data.

Then why are we having this discussion?
Toga The Killer 1Y MP 4CPU is the strongest Toga....
Moderator: Ras
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
bob wrote:
Then why are we having this discussion?

It might be for the camaraderie?
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
bob wrote:
Tell me what you can determine from 10 games between A and B. You can't, with any confidence at all, decide which is best. You can't, with any confidence, decide that A and B are at least reliable and can play long matches without one crashing. You can't, with any confidence, decide that neither A nor B will uncork an illegal move or fail to recognize an opponent's legal move in a long match.
So exactly _what_ can you conclude from 10 games?

That will depend on the results. 10-0-0 tells you one thing, 0-0-10 tells you another.

bob wrote:
Yes, one tells me I won ten games, the other tells me I lost ten games. I can draw _no_ conclusions about which is better. I've shown the data in the past. Many times the first 100 games show something _entirely_ different from what I find after 32,000 games.

krazyken wrote:
Your setup is different. You are using various starting positions, so your results will be dependent upon the order of the starting positions.

bob wrote:
Except that the tests start in a random order... so there is no predicting.

I suppose that's why not all of your tests take unpredicted turns.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
krazyken wrote:
It might be for the camaraderie?

Let's go back to "here":

jpqy wrote:
More results:
Blitz 5min Core i7 @3.89Ghz 2009
JP.
TTK.cirebonb1Y.st.4cpu_b-Stockfish_13_win32_ja 5.0 - 5.0 +4/-4/=2 50.00%
TTK.cirebonb1Y.st.4cpu_b-Grapefruit 1.0 alpha 3 5.0 - 5.0 +3/-3/=4 50.00%
TTK.cirebonb1Y.st.4cpu_b-MP-x86-Inert---Thinker 5.4D 5.5 - 4.5 +4/-3/=3 55.00%
TTK.cirebonb1Y.st.4cpu_b-TogaII141SE6-4cpu 3.0 - 7.0 +1/-5/=4 30.00%
TTK.cirebonb1Y.st.4cpu_b-Glaurung22_win32_ja 6.0 - 4.0 +3/-1/=6 60.00%

Dr.Wael Deeb wrote:
Thanks, a pretty good results I assume....
Dr.D

"Thanks, a pretty good results I assume...."

Is that not a "conclusion"? Based on ten games per opponent? With an astronomical error bar?
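For anyone who wants to check the size of that error bar, here is a minimal sketch. It assumes a plain normal approximation on the per-game score and the usual logistic score-to-Elo conversion; it is not what BayesElo does, and the names `elo_from_score` and `elo_interval` are made up for illustration. It turns a +W/-L/=D result into an Elo difference with an approximate 95% interval.

```python
import math

def elo_from_score(p, eps=1e-6):
    """Elo difference implied by a score fraction p (logistic model).

    p is clamped away from 0 and 1 so that 10-0-0 style results stay finite.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_interval(wins, losses, draws, z=1.96):
    """Return (elo, low, high) for one match result.

    Assumption: games are independent and the normal approximation holds,
    which is only a rough guide for very small samples.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    half = z * math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - half),
            elo_from_score(score + half))

if __name__ == "__main__":
    # The +3/-1/=6 result quoted above (10 games, 60%).
    elo, low, high = elo_interval(3, 1, 6)
    print(f"Elo {elo:+.0f}, approx. 95% interval [{low:+.0f}, {high:+.0f}]")
```

With ten games the interval comes out well over 200 Elo wide and includes zero, so under these assumptions the result cannot separate the two engines.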
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
krazyken wrote:
I suppose that's why not all of your tests take unpredicted turns.

Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.
It is just the nature of computer vs computer chess games.
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
bob wrote:
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.
It is just the nature of computer vs computer chess games.

What I mean is that by introducing a 3rd variable, with starting positions, your early results can easily vary widely based on the order of the positions. You won't reach a good representation of the 32000 positions until after you've run a few thousand games. What you see in your practice runs due to the way you have set them up will not necessarily apply to tests set up in a different manner.
-
- Posts: 921
- Joined: Mon May 29, 2006 11:18 pm
- Location: For now the planet Earth
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
bob wrote:
Not sure what you mean. I produce a ton of scripts to play a set of positions. I queue them up and they are assigned semi-random priorities, mixed in with other jobs. I never know which single run (which will be one opponent, 8 positions, 2 games per position) will be run first or last. Early results _always_ jump all over the place, with the "jumping around" smoothing out as we pass the 5,000-10,000 game level, until we reach the final 32,000 game result which is very repeatable within the +/-4 error bar. I can eliminate all but one opponent and see the _same_ jumping around until we reach a significant number of games. Crafty vs Glaurung 2.2 is a good example. G2.2 is about +60 better in my testing. But it is not uncommon for Crafty to start off +100 after a few dozen games. Or sometimes -200. And then as the games pile up, the result settles in at around -60 as it should.
It is just the nature of computer vs computer chess games.

Bob,
Two questions.
1) When you are playing 10,000+ computer matches with given opponents, what is the game time? E.g. game in one minute, game in sixty minutes?
2) Will the ELO results be approximately the same whether the game is one minute or sixty minutes? Just as an example, Crafty plays 20,000 games at game in one minute and let's say this gives an ELO of 2700. What would you expect Crafty's ELO to be after 20,000 games with the same opponents at game in two hours?
Jaimes
Veritas Vos Liberabit
Veritas Vos Liberabit
-
- Posts: 10121
- Joined: Thu Mar 09, 2006 12:57 am
- Location: van buren,missouri
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
Jaimes Conda wrote:
Bob,
Two questions.
1) When you are playing 10,000+ computer matches with given opponents, what is the game time? E.g. game in one minute, game in sixty minutes?
2) Will the ELO results be approximately the same whether the game is one minute or sixty minutes? Just as an example, Crafty plays 20,000 games at game in one minute and let's say this gives an ELO of 2700. What would you expect Crafty's ELO to be after 20,000 games with the same opponents at game in two hours?
Jaimes

Very good question. I have been testing engines for 5 years now.
Playing engines vs engines at 10 mins and at 40 mins. Getting mixed results on your question.
Best,
Gerold.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
Jaimes Conda wrote:
Bob,
Two questions.
1) When you are playing 10,000+ computer matches with given opponents, what is the game time? E.g. game in one minute, game in sixty minutes?

Depends on what I am testing. For evaluation ideas, I start with very fast games (10 secs on clock, 10ms increment) so that I can complete the 32,000 games in an hour or so. For changes that look good, I then will play slower games. 1+1 (one min on clock, 1 sec increment) takes about 12 hours to complete. If I am testing search ideas, I usually run 1+1 games, and for changes that look reasonable, or which intuition suggests might be better at longer time controls, I play longer games. I rarely go past 5+5 except for final verification. 5+5 takes about 2 days to finish. 60+60 is almost 2 weeks.

Jaimes Conda wrote:
2) Will the ELO results be approximately the same whether the game is one minute or sixty minutes? Just as an example, Crafty plays 20,000 games at game in one minute and let's say this gives an ELO of 2700. What would you expect Crafty's ELO to be after 20,000 games with the same opponents at game in two hours?
Jaimes

The last question points out a real problem. There are changes that look good at fast time controls and bad at long ones, and vice-versa. But those are fairly rare. I see more of the case where a change either helps at fast games and does little or nothing for longer games, or else it doesn't make any difference in fast games, but helps in longer games.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Toga The Killer 1Y MP 4CPU is the strongest Toga....
krazyken wrote:
What I mean is that by introducing a 3rd variable, with starting positions, your early results can easily vary widely based on the order of the positions. You won't reach a good representation of the 32000 positions until after you've run a few thousand games. What you see in your practice runs due to the way you have set them up will not necessarily apply to tests set up in a different manner.

Couple of points.
1. I have 4,000 positions. Each position can be played in any order, but as a position is played, two games alternating sides are always played back-to-back. No matter what the order, things are always very volatile for the first 500-1000 games at least. I suppose it might be possible to "replay" a complete match, running samples thru BayesElo every 30 seconds to see how things start and finish and vary over the course of 32,000 games, to show the point.
2. What kind of "different manner" could you use? If you use one computer, you could play the positions in the same order each time. But if you use more than one machine, inherent randomness is still going to get the machines "out of step" pretty quickly. Using an opening book is certainly going to increase the instability, not help it.
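The "replay" idea in point 1 can be sketched without BayesElo. This is a minimal illustration only: it assumes the results are available as a list of scores (1, 0.5, 0) in the order they finished, and it uses the simple logistic score-to-Elo conversion rather than BayesElo's model. The names `running_elo` and `score_to_elo`, and the sample sequence, are hypothetical.

```python
import math

def score_to_elo(p, eps=1e-6):
    """Elo difference implied by score fraction p, clamped to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / p - 1.0)

def running_elo(results, checkpoints):
    """Yield (games_played, elo_estimate) each time a checkpoint is reached.

    results     -- iterable of per-game scores: 1 (win), 0.5 (draw), 0 (loss)
    checkpoints -- game counts at which to report the estimate so far
    """
    wanted = set(checkpoints)
    total = 0.0
    for games, score in enumerate(results, start=1):
        total += score
        if games in wanted:
            yield games, score_to_elo(total / games)

if __name__ == "__main__":
    # Made-up sequence purely for illustration, not data from this thread.
    sample = [0, 0.5, 0, 1, 0.5, 1, 1, 0.5, 1, 0.5, 1, 1]
    for games, elo in running_elo(sample, [3, 6, 12]):
        print(f"{games:>3} games: {elo:+.0f} Elo relative to the opposition")
```

Sampling by game count rather than every 30 seconds keeps the sketch deterministic; a time-based version would only change how the checkpoints are chosen.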
Here's some sample data I did have. This represents the results over the first 1,500 games played in one of my matches. More after the data.
Code:
Name Elo +error -error total-games win-pct
Crafty-23.1 2608 145 146 3 50% 2596 33%
Crafty-23.1 2591 130 134 5 40% 2608 40%
Crafty-23.1 2586 114 118 9 39% 2614 33%
Crafty-23.1 2562 109 115 11 32% 2623 27%
Crafty-23.1 2582 103 107 13 38% 2606 31%
Crafty-23.1 2557 84 88 23 37% 2603 30%
Crafty-23.1 2591 74 75 34 46% 2594 26%
Crafty-23.1 2604 68 67 49 53% 2551 20%
Crafty-23.1 2628 60 59 67 57% 2534 19%
Crafty-23.1 2639 50 49 92 57% 2556 22%
Crafty-23.1 2629 43 43 122 54% 2567 21%
Crafty-23.1 2636 39 38 156 54% 2578 19%
Crafty-23.1 2659 36 36 188 57% 2580 16%
Crafty-23.1 2663 34 34 212 58% 2579 16%
Crafty-23.1 2665 33 32 230 58% 2578 16%
Crafty-23.1 2659 31 31 244 58% 2580 18%
Crafty-23.1 2663 31 30 254 58% 2580 18%
Crafty-23.1 2661 29 29 277 59% 2578 19%
Crafty-23.1 2656 29 28 297 58% 2580 19%
Crafty-23.1 2645 27 27 317 56% 2582 19%
Crafty-23.1 2639 27 26 341 56% 2585 19%
Crafty-23.1 2642 25 25 378 56% 2585 19%
Crafty-23.1 2642 24 24 397 57% 2585 20%
Crafty-23.1 2644 24 24 417 57% 2586 19%
Crafty-23.1 2648 24 23 440 58% 2584 19%
Crafty-23.1 2645 23 23 463 57% 2585 19%
Crafty-23.1 2647 22 22 483 58% 2583 19%
Crafty-23.1 2646 22 22 500 58% 2584 20%
Crafty-23.1 2650 22 22 517 58% 2581 19%
Crafty-23.1 2650 21 21 531 58% 2582 20%
Crafty-23.1 2648 21 21 556 58% 2584 21%
Crafty-23.1 2644 20 20 592 57% 2585 21%
Crafty-23.1 2649 20 20 617 58% 2583 21%
Crafty-23.1 2649 19 19 634 58% 2584 21%
Crafty-23.1 2651 19 19 651 58% 2584 21%
Crafty-23.1 2648 19 19 674 58% 2584 23%
Crafty-23.1 2644 18 18 704 57% 2585 23%
Crafty-23.1 2642 18 18 731 57% 2585 23%
Crafty-23.1 2642 18 18 742 57% 2586 23%
Crafty-23.1 2642 18 17 765 57% 2586 24%
Crafty-23.1 2642 17 17 788 57% 2587 24%
Crafty-23.1 2643 17 17 810 57% 2586 24%
Crafty-23.1 2644 17 17 839 57% 2586 23%
Crafty-23.1 2644 17 16 867 57% 2586 23%
Crafty-23.1 2644 16 16 887 57% 2586 23%
Crafty-23.1 2643 16 16 911 57% 2586 23%
Crafty-23.1 2644 16 16 933 57% 2586 23%
Crafty-23.1 2643 16 16 949 57% 2586 23%
Crafty-23.1 2642 16 16 972 57% 2586 23%
Crafty-23.1 2640 15 15 1000 57% 2587 23%
Crafty-23.1 2640 15 15 1028 57% 2586 23%
Crafty-23.1 2640 15 15 1047 57% 2586 23%
Crafty-23.1 2640 15 15 1070 57% 2586 22%
Crafty-23.1 2640 15 15 1086 57% 2586 22%
Crafty-23.1 2638 15 15 1115 56% 2587 22%
Crafty-23.1 2637 15 14 1137 56% 2587 22%
Crafty-23.1 2636 14 14 1154 56% 2588 22%
Crafty-23.1 2637 14 14 1177 56% 2587 22%
Crafty-23.1 2637 14 14 1191 56% 2588 22%
Crafty-23.1 2636 14 14 1218 56% 2588 22%
Crafty-23.1 2635 14 14 1249 56% 2588 22%
Crafty-23.1 2634 14 14 1271 56% 2588 22%
Crafty-23.1 2632 14 14 1295 55% 2589 22%
Crafty-23.1 2630 14 13 1327 55% 2589 22%
Crafty-23.1 2628 13 13 1343 55% 2590 22%
Crafty-23.1 2627 13 13 1365 55% 2590 22%
Crafty-23.1 2627 13 13 1385 55% 2590 22%
Crafty-23.1 2626 13 13 1405 54% 2591 22%
Crafty-23.1 2626 13 13 1426 54% 2591 22%
Crafty-23.1 2627 13 13 1445 55% 2590 22%
Crafty-23.1 2625 13 13 1469 54% 2591 22%
...
Crafty-23.1 2611 5 4 31128 53% 2591 22%
But notice the "early returns". If you had run just 10 games, as someone here did, you would be at about -50 (way under-estimating the final result). If you had run just 200 games, you would be at about +50 (way over-estimating the final result).
This is the reason I said 10 games mean absolutely nothing. All this takes is some "balanced" positions where the games can go either way. And they will, thanks to the inherent randomness caused by using time to constrain the search. Adding opening books makes this worse, which is why I started testing without books a long time back.
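The same effect can be reproduced with a small simulation. This is a hedged sketch, not a model of the actual test cluster: it assumes independent games with fixed win and draw probabilities (0.42 and 0.22, made-up values chosen to give a 53% expected score), and the function names are hypothetical. It plays many simulated matches of a given length and reports how far apart the 5th and 95th percentile Elo estimates are.

```python
import math
import random

def score_to_elo(p, eps=1e-6):
    """Elo difference implied by score fraction p, clamped to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return -400.0 * math.log10(1.0 / p - 1.0)

def simulate_match(n_games, p_win, p_draw, rng):
    """Return the score fraction of one simulated match of n_games."""
    total = 0.0
    for _ in range(n_games):
        x = rng.random()
        if x < p_win:
            total += 1.0
        elif x < p_win + p_draw:
            total += 0.5
    return total / n_games

def elo_spread(n_games, p_win=0.42, p_draw=0.22, trials=500, seed=1):
    """Return (5th, 95th) percentile of the Elo estimate over many matches.

    Assumption: p_win and p_draw are illustrative values, not measured ones.
    """
    rng = random.Random(seed)
    elos = sorted(score_to_elo(simulate_match(n_games, p_win, p_draw, rng))
                  for _ in range(trials))
    return elos[int(0.05 * trials)], elos[int(0.95 * trials)]

if __name__ == "__main__":
    for n in (10, 100, 1000):
        low, high = elo_spread(n)
        print(f"{n:>5} games: 90% of estimates fall in [{low:+.0f}, {high:+.0f}]")
```

The band narrows roughly with the square root of the number of games, which is why a few dozen games can land far on either side of the value the long run settles on.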