Amazing ... I thought about the same thing 2 hours ago
Vincent
recent cluster testing
-
- Posts: 5228
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
-
- Posts: 1627
- Joined: Thu Mar 09, 2006 12:35 pm
Re: recent cluster testing
bob wrote: Here are 4 test runs with 23.0 and 23.1. All use same positions, same everything, very consistent results at 40,000 games per run
Code: Select all
Name Elo + - games score oppo. draws
Crafty-23.1-4 2645 4 3 40000 56% 2597 23%
Crafty-23.1-2 2644 4 4 40000 56% 2597 23%
Crafty-23.1-1 2644 4 4 40000 56% 2597 23%
Crafty-23.1-3 2643 4 4 40000 56% 2597 23%
Crafty-23.0-1 2560 4 4 40000 45% 2597 21%
Crafty-23.0-2 2559 4 4 40000 45% 2597 21%
Crafty-23.0-4 2559 4 4 40000 45% 2597 21%
Crafty-23.0-3 2558 4 4 40000 45% 2597 21%

Interesting....
But is it possible to provide the same list per 5000 games? That means 8 lists, with the 1st list after the first 5000 games, the 2nd after 10000 games, the 3rd after 15000 games, etc.
Also, is it possible to provide the list after the first 100 games? And then after 1000 games?
It would be interesting to see the progression of the Elo of the different versions.
Also, is it possible to provide the % score of each engine with, let's say, 3 decimal digits? With 40000 games there is no point in rounding to whole percentages; it throws away valuable information.
After his son's birth they've asked him:
"Is it a boy or girl?"
YES! He replied.....
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: recent cluster testing (EC2?)
I have experimented a bit recently with Amazon's EC2 compute cloud. It is possible to provision a "high CPU" virtual machine from them; I think it's under $1 an hour for a Linux VM. Unfortunately, although you get a decent amount of memory, you don't actually get a lot of processing power: I found that my program runs maybe 25% slower on this than on a mid-size local quad box. But these are really easy to set up, and you could get a lot of them going easily. Worth thinking about, and it will only get cheaper/faster with time.
--Jon
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: recent cluster testing
George Tsavdaris wrote:
bob wrote: Here are 4 test runs with 23.0 and 23.1. All use same positions, same everything, very consistent results at 40,000 games per run
Code: Select all
Name Elo + - games score oppo. draws
Crafty-23.1-4 2645 4 3 40000 56% 2597 23%
Crafty-23.1-2 2644 4 4 40000 56% 2597 23%
Crafty-23.1-1 2644 4 4 40000 56% 2597 23%
Crafty-23.1-3 2643 4 4 40000 56% 2597 23%
Crafty-23.0-1 2560 4 4 40000 45% 2597 21%
Crafty-23.0-2 2559 4 4 40000 45% 2597 21%
Crafty-23.0-4 2559 4 4 40000 45% 2597 21%
Crafty-23.0-3 2558 4 4 40000 45% 2597 21%
Interesting....
But is it possible to provide the same list per 5000 games? That means 8 lists, with the 1st list after the first 5000 games, the 2nd after 10000 games, the 3rd after 15000 games, etc.
Also, is it possible to provide the list after the first 100 games? And then after 1000 games?
It would be interesting to see the progression of the Elo of the different versions.
Also, is it possible to provide the % score of each engine with, let's say, 3 decimal digits? With 40000 games there is no point in rounding to whole percentages; it throws away valuable information.

I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.
As for the precision of the output, this is directly from BayesElo. I don't work on that code at all...
Notice that the above output is made by combining _all_ the PGN from the entire 40K * 8 games into one file and then passing it through BayesElo, as Remi recommended, to make the Elos comparable.
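For anyone who wants to try the "pick N random games" idea on their own PGN, here is a minimal Python sketch. It is not bob's tooling; the function names are made up for illustration, and it assumes every game in the file starts with an `[Event "..."]` tag line, which is the usual PGN layout but is not guaranteed for every file.

```python
import random


def split_pgn_games(pgn_text):
    """Split PGN text into games, assuming each game starts with an [Event tag."""
    games, current = [], []
    for line in pgn_text.splitlines():
        if line.startswith("[Event ") and current:
            games.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        games.append("\n".join(current).strip())
    # Drop empty chunks (for example blank lines before the first game).
    return [g for g in games if g]


def sample_games(pgn_text, n, seed=None):
    """Return n games drawn at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(split_pgn_games(pgn_text), n)


def combine_pgn(pgn_texts):
    """Concatenate several PGN texts into one, separated by blank lines."""
    return "\n\n".join(text.strip() for text in pgn_texts if text.strip()) + "\n"
```

The combined or sampled text can then be written to a single file and rated in one pass, so that all ratings come out on the same scale.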
-
- Posts: 1627
- Joined: Thu Mar 09, 2006 12:35 pm
Re: recent cluster testing
bob wrote: I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.

Yes, random games should work too (as long as each engine gets an equal number of games with White and Black).
It is interesting to me because I want to see how much the Elo of the different Crafty versions fluctuates after e.g. 100 games, 500 games, 1000 games, 5000 games, 10000 games, etc., and compare these Elos with the final Elos after the 40000 games.
For example, was there any moment (any set of X games; since games are independent of each other, it doesn't matter whether we take the first X games or X random games, with an equal number of White and Black games of course) where e.g. Crafty-23.0-1 was first?
This is just an example (I'm not specifically interested in it).
I ask because I see that most people here, and generally everywhere, play a small number of games like 20, 50, 100, or maybe 1000, and seem satisfied with the results. I want to know whether it is generally correct to trust these results, or whether upsets can happen much later, after 8000 games for example, and whether the table can turn completely upside down.
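One way to get a feel for this without the real games is a small Monte Carlo sketch. Everything here is an assumption for illustration: each game is treated as an independent draw with fixed win/draw probabilities, and the probabilities are back-computed from the final table (56% score with 23% draws for 23.1, 45% score with 21% draws for 23.0, so win rate = score - draws/2). The function names are hypothetical.

```python
import random


def play_match(n_games, p_win, p_draw, rng):
    """Simulate n_games independent games and return the score fraction."""
    points = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            points += 1.0
        elif r < p_win + p_draw:
            points += 0.5
    return points / n_games


def upset_probability(n_games, strong, weak, trials=1000, seed=1):
    """Estimate how often the weaker engine out-scores the stronger one.

    strong and weak are (p_win, p_draw) pairs; each engine plays its own
    n_games match against the same kind of opposition.
    """
    rng = random.Random(seed)
    upsets = 0
    for _ in range(trials):
        weak_score = play_match(n_games, weak[0], weak[1], rng)
        strong_score = play_match(n_games, strong[0], strong[1], rng)
        if weak_score > strong_score:
            upsets += 1
    return upsets / trials


if __name__ == "__main__":
    strong = (0.445, 0.23)  # assumed: 56% score, 23% draws
    weak = (0.345, 0.21)    # assumed: 45% score, 21% draws
    for n in (20, 100, 1000):
        print(n, upset_probability(n, strong, weak, trials=500))
```

Under these assumptions the upset rate is substantial for a few dozen games and falls toward zero as the match gets longer; the exact figures depend on the seed and on how well the independence assumption holds.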
After his son's birth they've asked him:
"Is it a boy or girl?"
YES! He replied.....
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: recent cluster testing
George Tsavdaris wrote:
bob wrote: I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.
Yes, random games should work too (as long as each engine gets an equal number of games with White and Black).
It is interesting to me because I want to see how much the Elo of the different Crafty versions fluctuates after e.g. 100 games, 500 games, 1000 games, 5000 games, 10000 games, etc., and compare these Elos with the final Elos after the 40000 games.
For example, was there any moment (any set of X games; since games are independent of each other, it doesn't matter whether we take the first X games or X random games, with an equal number of White and Black games of course) where e.g. Crafty-23.0-1 was first?
This is just an example (I'm not specifically interested in it).
I ask because I see that most people here, and generally everywhere, play a small number of games like 20, 50, 100, or maybe 1000, and seem satisfied with the results. I want to know whether it is generally correct to trust these results, or whether upsets can happen much later, after 8000 games for example, and whether the table can turn completely upside down.

There was a very long thread on this subject here about a year ago.
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: recent cluster testing
bob wrote:
George Tsavdaris wrote:
bob wrote: Here are 4 test runs with 23.0 and 23.1. All use same positions, same everything, very consistent results at 40,000 games per run
Code: Select all
Name Elo + - games score oppo. draws
Crafty-23.1-4 2645 4 3 40000 56% 2597 23%
Crafty-23.1-2 2644 4 4 40000 56% 2597 23%
Crafty-23.1-1 2644 4 4 40000 56% 2597 23%
Crafty-23.1-3 2643 4 4 40000 56% 2597 23%
Crafty-23.0-1 2560 4 4 40000 45% 2597 21%
Crafty-23.0-2 2559 4 4 40000 45% 2597 21%
Crafty-23.0-4 2559 4 4 40000 45% 2597 21%
Crafty-23.0-3 2558 4 4 40000 45% 2597 21%
Interesting....
But is it possible to provide the same list per 5000 games? That means 8 lists, with the 1st list after the first 5000 games, the 2nd after 10000 games, the 3rd after 15000 games, etc.
Also, is it possible to provide the list after the first 100 games? And then after 1000 games?
It would be interesting to see the progression of the Elo of the different versions.
Also, is it possible to provide the % score of each engine with, let's say, 3 decimal digits? With 40000 games there is no point in rounding to whole percentages; it throws away valuable information.
I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.
As for the precision of the output, this is directly from BayesElo. I don't work on that code at all...
Notice that the above output is made by combining _all_ the PGN from the entire 40K * 8 games into one file and then passing it through BayesElo, as Remi recommended, to make the Elos comparable.

I would like to see win/loss/draw percentages for each position, though it would probably take millions of games, not tens of thousands, to get small enough error bars. I think this could provide a start toward correlating evaluations with estimated score.
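A per-position tally like that could be sketched as follows. This is only an illustration under a loud assumption: that each saved game records its starting position in a `[FEN "..."]` tag and its outcome in a `[Result "..."]` tag. Nothing in this thread confirms how bob's PGN files are tagged, and the helper name is hypothetical. Results are counted from White's point of view, not per engine.

```python
import re
from collections import defaultdict

# Matches a PGN tag line such as: [Result "1-0"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')
RESULT_INDEX = {"1-0": 0, "1/2-1/2": 1, "0-1": 2}


def tally_by_position(pgn_text):
    """Map each starting FEN to [white_wins, draws, black_wins]."""
    tallies = defaultdict(lambda: [0, 0, 0])

    def record(tags):
        result = tags.get("Result")
        if result in RESULT_INDEX:  # skips unfinished games marked "*"
            key = tags.get("FEN", "startpos")  # assumed fallback label
            tallies[key][RESULT_INDEX[result]] += 1

    tags = {}
    for line in pgn_text.splitlines():
        match = TAG_RE.match(line.strip())
        if not match:
            continue  # move text and blank lines carry no tags
        name, value = match.groups()
        if name == "Event" and tags:
            record(tags)  # a new game begins, so store the previous one
            tags = {}
        tags[name] = value
    if tags:
        record(tags)
    return dict(tallies)


def percentages(counts):
    """Convert [wins, draws, losses] counts to percentages."""
    total = sum(counts)
    return [100.0 * c / total for c in counts] if total else [0.0, 0.0, 0.0]
```

With only a handful of games per position the percentages will be very noisy, which is the point made above about needing far more games.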
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: recent cluster testing
George Tsavdaris wrote:
bob wrote: I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.
Yes, random games should work too (as long as each engine gets an equal number of games with White and Black).
It is interesting to me because I want to see how much the Elo of the different Crafty versions fluctuates after e.g. 100 games, 500 games, 1000 games, 5000 games, 10000 games, etc., and compare these Elos with the final Elos after the 40000 games.
For example, was there any moment (any set of X games; since games are independent of each other, it doesn't matter whether we take the first X games or X random games, with an equal number of White and Black games of course) where e.g. Crafty-23.0-1 was first?
This is just an example (I'm not specifically interested in it).
I ask because I see that most people here, and generally everywhere, play a small number of games like 20, 50, 100, or maybe 1000, and seem satisfied with the results. I want to know whether it is generally correct to trust these results, or whether upsets can happen much later, after 8000 games for example, and whether the table can turn completely upside down.

You can essentially predict this statistically. The error bar provides the normal range, and it "tightens up" as the number of games increases.
But here are some samples. The number of games grows with each line.
Code: Select all
Rank Name Elo + - games score oppo. draws
1 Crafty-23.1 2669 144 131 6 75% 2538 17%
2 Crafty-23.1 2652 127 121 9 67% 2530 22%
2 Crafty-23.1 2652 101 96 16 66% 2542 31%
3 Crafty-23.1 2683 97 90 23 72% 2528 22%
2 Crafty-23.1 2737 79 74 46 72% 2546 17%
2 Crafty-23.1 2724 68 64 68 71% 2530 15%
2 Crafty-23.1 2698 57 55 93 66% 2547 15%
2 Crafty-23.1 2683 44 43 142 61% 2571 15%
3 Crafty-23.1 2666 37 36 197 58% 2577 16%
3 Crafty-23.1 2652 34 33 233 55% 2588 17%
3 Crafty-23.1 2651 33 33 239 54% 2589 19%
3 Crafty-23.1 2650 33 33 240 54% 2590 19%
3 Crafty-23.1 2649 33 32 242 54% 2589 19%
3 Crafty-23.1 2652 33 32 245 55% 2588 19%
3 Crafty-23.1 2651 32 32 247 55% 2588 19%
3 Crafty-23.1 2651 32 32 247 55% 2588 19%
3 Crafty-23.1 2652 32 32 253 55% 2586 19%
3 Crafty-23.1 2649 32 32 261 55% 2584 18%
3 Crafty-23.1 2647 31 31 270 55% 2582 18%
3 Crafty-23.1 2648 31 31 278 56% 2581 18%
2 Crafty-23.1 2649 30 30 290 56% 2584 20%
2 Crafty-23.1 2652 29 29 303 57% 2583 20%
2 Crafty-23.1 2653 29 28 321 57% 2581 19%
2 Crafty-23.1 2642 27 27 351 56% 2584 20%
2 Crafty-23.1 2639 26 26 389 56% 2586 20%
3 Crafty-23.1 2633 25 25 415 55% 2588 20%
3 Crafty-23.1 2632 24 24 447 55% 2590 18%
2 Crafty-23.1 2634 24 23 470 55% 2591 19%
2 Crafty-23.1 2632 23 23 482 55% 2593 20%
2 Crafty-23.1 2630 23 23 488 55% 2593 20%
2 Crafty-23.1 2631 23 23 491 55% 2593 20%
2 Crafty-23.1 2632 23 23 493 55% 2592 20%
3 Crafty-23.1 2632 23 23 498 55% 2591 19%
2 Crafty-23.1 2633 23 23 508 55% 2591 20%
2 Crafty-23.1 2633 22 22 517 55% 2590 20%
2 Crafty-23.1 2632 22 22 523 55% 2589 20%
2 Crafty-23.1 2634 22 22 534 56% 2588 19%
2 Crafty-23.1 2636 22 22 555 56% 2588 19%
2 Crafty-23.1 2636 21 21 573 56% 2588 19%
2 Crafty-23.1 2638 21 21 602 56% 2588 20%
3 Crafty-23.1 2634 20 20 634 55% 2588 20%
3 Crafty-23.1 2635 20 20 655 55% 2587 20%
3 Crafty-23.1 2638 20 20 686 55% 2589 20%
3 Crafty-23.1 2636 19 19 704 55% 2590 20%
3 Crafty-23.1 2636 19 19 714 55% 2590 20%
2 Crafty-23.1 2637 19 19 726 55% 2591 20%
2 Crafty-23.1 2639 19 19 740 55% 2591 20%
2 Crafty-23.1 2640 19 19 748 55% 2590 20%
2 Crafty-23.1 2639 19 19 753 55% 2590 20%
2 Crafty-23.1 2641 19 18 763 56% 2589 20%
2 Crafty-23.1 2642 19 19 773 56% 2588 20%
2 Crafty-23.1 2644 18 18 790 56% 2588 20%
2 Crafty-23.1 2644 18 18 797 57% 2587 20%
2 Crafty-23.1 2646 18 18 811 57% 2587 20%
2 Crafty-23.1 2645 18 18 833 57% 2587 20%
2 Crafty-23.1 2641 18 18 857 56% 2587 20%
3 Crafty-23.1 2639 18 17 884 56% 2587 20%
3 Crafty-23.1 2638 17 17 903 56% 2587 20%
3 Crafty-23.1 2637 17 17 927 56% 2589 20%
3 Crafty-23.1 2638 17 17 946 56% 2590 20%
3 Crafty-23.1 2636 17 17 967 56% 2590 20%
3 Crafty-23.1 2636 17 16 975 56% 2590 20%
3 Crafty-23.1 2636 17 16 984 56% 2591 20%
3 Crafty-23.1 2636 16 16 995 56% 2591 20%
3 Crafty-23.1 2635 16 16 1006 56% 2590 20%
3 Crafty-23.1 2634 16 16 1014 55% 2591 20%
3 Crafty-23.1 2635 16 16 1030 55% 2590 20%
.....
.....
2 Crafty-23.1 2645 4 3 40000 56% 2597 23%
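The way the error bar "tightens up" can be approximated with a short Python sketch. To be clear, this is not how BayesElo computes its intervals; it is a plain normal approximation that treats the games as independent, uses the logistic Elo formula, and takes the score and draw rate as given. The function names are made up for this example.

```python
import math


def elo_from_score(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_margin_95(score, draw_rate, n_games):
    """Approximate 95% half-width of the Elo estimate after n_games.

    Per-game variance is E[x^2] - score^2 with x in {1, 0.5, 0}, and the
    score error is mapped to Elo through the slope of the logistic curve.
    """
    win_rate = score - draw_rate / 2.0
    variance = win_rate + 0.25 * draw_rate - score * score
    std_err = math.sqrt(variance / n_games)
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return 1.96 * std_err * slope


if __name__ == "__main__":
    for n in (100, 1000, 10000, 40000):
        print(n, round(elo_margin_95(0.56, 0.23, n), 1))
```

The margin shrinks with the square root of the number of games, so 100 times as many games buys an interval only 10 times narrower. With a 56% score and 23% draws this approximation gives roughly 3 Elo at 40,000 games, the same order as the +4/-3 in the last row, and it will not match the table exactly because BayesElo uses a different model.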
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: recent cluster testing
jwes wrote:
George Tsavdaris wrote:
bob wrote: I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.
Yes, random games should work too (as long as each engine gets an equal number of games with White and Black).
It is interesting to me because I want to see how much the Elo of the different Crafty versions fluctuates after e.g. 100 games, 500 games, 1000 games, 5000 games, 10000 games, etc., and compare these Elos with the final Elos after the 40000 games.
For example, was there any moment (any set of X games; since games are independent of each other, it doesn't matter whether we take the first X games or X random games, with an equal number of White and Black games of course) where e.g. Crafty-23.0-1 was first?
This is just an example (I'm not specifically interested in it).
I ask because I see that most people here, and generally everywhere, play a small number of games like 20, 50, 100, or maybe 1000, and seem satisfied with the results. I want to know whether it is generally correct to trust these results, or whether upsets can happen much later, after 8000 games for example, and whether the table can turn completely upside down.
There was a very long thread on this subject here about a year ago.

Note that 23.0 and 23.1 are not playing in the same 40,000-game match. I play one against the normal opponents, then play the other, so I don't see those intermediate results. But if you look at the data I posted in another follow-up in this thread, the range of ratings varies significantly up front, so 23.0 could quite easily look better at those points.
I'll run the same test with 23.0 and append to the other post.
-
- Posts: 5228
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: recent cluster testing
Will Crafty 23.1 be out soon?