recent cluster testing

Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: recent cluster testing

Post by Vinvin »

Amazing ... I thought about the same thing 2 hours ago :D

Vincent
George Tsavdaris
Posts: 1627
Joined: Thu Mar 09, 2006 12:35 pm

Re: recent cluster testing

Post by George Tsavdaris »

bob wrote: Here are 4 test runs each with 23.0 and 23.1. All use the same positions, same everything; very consistent results at 40,000 games per run.

Name                  Elo    +    - games score oppo. draws
Crafty-23.1-4        2645    4    3 40000   56%  2597   23%
Crafty-23.1-2        2644    4    4 40000   56%  2597   23%
Crafty-23.1-1        2644    4    4 40000   56%  2597   23%
Crafty-23.1-3        2643    4    4 40000   56%  2597   23%
Crafty-23.0-1        2560    4    4 40000   45%  2597   21%
Crafty-23.0-2        2559    4    4 40000   45%  2597   21%
Crafty-23.0-4        2559    4    4 40000   45%  2597   21%
Crafty-23.0-3        2558    4    4 40000   45%  2597   21%

Interesting....

But is it possible to provide the same list for every 5000 games? That means 8 lists, with the 1st list after the first 5000 games, the 2nd after 10000 games, the 3rd after 15000 games, etc.

Also, is it possible to provide the list after the first 100 games? And then after 1000 games?

It would be interesting to see the progression of the Elo of the different versions.

Also, is it possible to provide the % score of each engine with, let's say, 3 decimal digits? With 40000 games there is no point in rounding to a whole percent; that throws away useful information.
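
To put a number on the rounding point, here is a minimal sketch that converts a score fraction to an Elo difference with the standard logistic formula and shows how wide a range a displayed "56%" can cover. This is not how BayesElo computes its ratings (it fits its own model, including draws), and the win/draw/loss counts are hypothetical, picked only so that they give 56% with 23% draws.

```python
import math

def elo_diff(score: float) -> float:
    """Elo difference implied by a score fraction in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def score_fraction(wins: int, draws: int, losses: int) -> float:
    """Score as a fraction: a win counts 1, a draw 0.5, a loss 0."""
    games = wins + draws + losses
    return (wins + 0.5 * draws) / games

# Any score from 55.5% up to 56.5% is displayed as "56%".
low, high = elo_diff(0.555), elo_diff(0.565)
print(f"a displayed 56% covers about {low:.1f} to {high:.1f} Elo")

# Hypothetical counts for one 40000-game run (not taken from the thread).
s = score_fraction(wins=17800, draws=9200, losses=13000)
print(f"score with 3 decimals: {100.0 * s:.3f}%")
```

Under this simple model the whole-percent bucket is roughly 7 Elo wide, which is larger than the 3 to 4 Elo error bars in the list.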
After his son's birth they've asked him:
"Is it a boy or girl?"
YES! He replied.....
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: recent cluster testing (EC2?)

Post by jdart »

I have experimented a bit recently with Amazon's EC2 compute cloud. It is possible to provision a "high CPU" virtual machine from them; I think it's under $1 an hour for a Linux VM. Unfortunately, although you get a decent amount of memory, you don't actually get a lot of processing power: I found that my program runs maybe 25% slower on this than on a mid-size local quad box. But it's really easy to set these up, and you could get a lot of them going easily. Worth thinking about, and it will only get cheaper/faster with time.

--Jon
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: recent cluster testing

Post by bob »

I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.

As far as the accuracy (precision) of the output, this is directly from BayesElo. I don't work on that code at all...

Notice that the above output is made by combining _all_ the PGN from the entire 40K * 8 games into one file and then passing it through BayesElo, as Remi recommended, to make the Elos comparable.
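
As an illustration of the "pick 100 random games" idea, here is a rough sketch that splits a PGN file's text into games and draws a random subset. It assumes every game starts with an `[Event ...]` tag line, which is the usual PGN layout; the function names are made up for this example and it is not the script used for the cluster runs. It also does not balance White and Black games in the sample.

```python
import random

def split_games(pgn_text: str) -> list[str]:
    """Split PGN text into games, assuming each game starts with an [Event tag."""
    games, current = [], []
    for line in pgn_text.splitlines():
        if line.startswith("[Event ") and current:
            games.append("\n".join(current).strip())
            current = []
        current.append(line)
    games.append("\n".join(current).strip())
    return [g for g in games if g]  # drop empty chunks (e.g. leading blank lines)

def sample_games(pgn_text: str, k: int, seed=None) -> list[str]:
    """Return k distinct games chosen uniformly at random."""
    games = split_games(pgn_text)
    return random.Random(seed).sample(games, k)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    with open("all_games.pgn") as src:
        subset = sample_games(src.read(), 100, seed=1)
    with open("sample_100.pgn", "w") as dst:
        dst.write("\n\n".join(subset) + "\n")
```

The subset file can then be fed to the rating tool the same way as the full file.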
George Tsavdaris
Posts: 1627
Joined: Thu Mar 09, 2006 12:35 pm

Re: recent cluster testing

Post by George Tsavdaris »

bob wrote: I save all 40,000 games in PGN form. I can extract from that in any way that seems interesting. I could pick 100 random games, or 1000 random games. Picking the first 100 is not so easy. And from my perspective, this really offers no useful information. During the first 1,000 games, the Elo bounces around all over. As the error bar drops, so does the variation. The results after 100 games could be almost _anything_.
Yes, random games should work too (as long as each engine gets an equal number of games with White and Black).

This interests me because I want to see how much the Elo of the different Crafty versions fluctuates after e.g. 100 games, 500 games, 1000 games, 5000 games, 10000 games, etc., and compare those Elos with the final Elos after the 40000 games.

For example, was there any moment (any set of X games; since games are independent of each other, it doesn't matter whether we take the first X games or X random games, with an equal number of White and Black games of course) where e.g. Crafty-23.0-1 was first?
This is just an example (I'm not specifically interested in it).

I ask because I see that most people here, and generally everywhere, play a small number of games like 20, 50, 100, or maybe 1000, and seem satisfied with the results. I want to know whether it is generally correct to trust these results, or whether upsets can happen much later, after 8000 games for example, and whether the table can turn completely upside down.
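
One way to get a feel for this without the real data is a small simulation. The sketch below assumes independent games with fixed win and draw probabilities (hypothetical values: 45% wins and 22% draws, i.e. a true score of 56%) and reports the Elo estimate at a few checkpoints. It uses the plain logistic score-to-Elo formula, not BayesElo.

```python
import math
import random

def elo_from_score(score: float) -> float:
    """Logistic Elo difference for a score fraction, clamped away from 0 and 1."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)

def running_elo(p_win: float, p_draw: float, checkpoints, seed: int = 0) -> dict:
    """Simulate one match and return {games_played: Elo estimate} at each checkpoint."""
    rng = random.Random(seed)
    wanted = set(checkpoints)
    points, estimates = 0.0, {}
    for n in range(1, max(wanted) + 1):
        r = rng.random()
        if r < p_win:
            points += 1.0
        elif r < p_win + p_draw:
            points += 0.5
        if n in wanted:
            estimates[n] = elo_from_score(points / n)
    return estimates

checkpoints = [20, 100, 1000, 8000, 40000]
for seed in range(3):  # three independent simulated runs
    est = running_elo(0.45, 0.22, checkpoints, seed=seed)
    print(seed, {n: round(e) for n, e in est.items()})
```

Comparing the early checkpoints across seeds shows how far a 20- or 100-game result can sit from the value the run eventually settles on.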
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: recent cluster testing

Post by jwes »

There was a very long thread on this subject here about a year ago.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: recent cluster testing

Post by jwes »

I would like to see win/loss/draw percentages for each position, though it would probably take millions of games, not tens of thousands, to get small enough error bars. I think this could provide a start toward correlating evaluations and estimated score.
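
A possible starting point for such a tally, as a sketch only: it assumes each game in the PGN carries a `[FEN "..."]` tag for its starting position and a `[Result "..."]` tag, which may not match how the test games are actually stored. Results are counted from White's point of view, and the function names are invented for this example.

```python
import re
from collections import defaultdict

TAG = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')
RESULTS = ("1-0", "1/2-1/2", "0-1")

def game_tags(pgn_text: str) -> list[dict]:
    """Collect the tag pairs of each game; a new game starts at every Event tag."""
    games = []
    for line in pgn_text.splitlines():
        match = TAG.match(line.strip())
        if not match:
            continue  # move text or blank line
        name, value = match.groups()
        if name == "Event" or not games:
            games.append({})
        games[-1][name] = value
    return games

def tally_by_position(pgn_text: str) -> dict:
    """Return {FEN: {result: count}} for games that have both FEN and Result tags."""
    counts = defaultdict(lambda: {r: 0 for r in RESULTS})
    for tags in game_tags(pgn_text):
        fen, result = tags.get("FEN"), tags.get("Result")
        if fen and result in RESULTS:
            counts[fen][result] += 1
    return counts

def percentages(c: dict) -> tuple:
    """(White win %, draw %, Black win %) for one position's counts."""
    total = sum(c.values())
    return tuple(100.0 * c[r] / total for r in RESULTS)
```

With only a few dozen games per position the percentages will be very noisy, which is the error-bar problem mentioned above.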
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: recent cluster testing

Post by bob »

You can essentially predict this statistically. The error bar gives the normal range, and it "tightens up" as the number of games increases.

But here are some samples. Each successive line includes more games:

Rank Name                  Elo   +    -  games score oppo. draws

   1 Crafty-23.1          2669  144  131     6   75%  2538   17% 
   2 Crafty-23.1          2652  127  121     9   67%  2530   22% 
   2 Crafty-23.1          2652  101   96    16   66%  2542   31% 
   3 Crafty-23.1          2683   97   90    23   72%  2528   22% 
   2 Crafty-23.1          2737   79   74    46   72%  2546   17% 
   2 Crafty-23.1          2724   68   64    68   71%  2530   15% 
   2 Crafty-23.1          2698   57   55    93   66%  2547   15% 
   2 Crafty-23.1          2683   44   43   142   61%  2571   15% 
   3 Crafty-23.1          2666   37   36   197   58%  2577   16% 
   3 Crafty-23.1          2652   34   33   233   55%  2588   17% 
   3 Crafty-23.1          2651   33   33   239   54%  2589   19% 
   3 Crafty-23.1          2650   33   33   240   54%  2590   19% 
   3 Crafty-23.1          2649   33   32   242   54%  2589   19% 
   3 Crafty-23.1          2652   33   32   245   55%  2588   19% 
   3 Crafty-23.1          2651   32   32   247   55%  2588   19% 
   3 Crafty-23.1          2651   32   32   247   55%  2588   19% 
   3 Crafty-23.1          2652   32   32   253   55%  2586   19% 
   3 Crafty-23.1          2649   32   32   261   55%  2584   18% 
   3 Crafty-23.1          2647   31   31   270   55%  2582   18% 
   3 Crafty-23.1          2648   31   31   278   56%  2581   18% 
   2 Crafty-23.1          2649   30   30   290   56%  2584   20% 
   2 Crafty-23.1          2652   29   29   303   57%  2583   20% 
   2 Crafty-23.1          2653   29   28   321   57%  2581   19% 
   2 Crafty-23.1          2642   27   27   351   56%  2584   20% 
   2 Crafty-23.1          2639   26   26   389   56%  2586   20% 
   3 Crafty-23.1          2633   25   25   415   55%  2588   20% 
   3 Crafty-23.1          2632   24   24   447   55%  2590   18% 
   2 Crafty-23.1          2634   24   23   470   55%  2591   19% 
   2 Crafty-23.1          2632   23   23   482   55%  2593   20% 
   2 Crafty-23.1          2630   23   23   488   55%  2593   20% 
   2 Crafty-23.1          2631   23   23   491   55%  2593   20% 
   2 Crafty-23.1          2632   23   23   493   55%  2592   20% 
   3 Crafty-23.1          2632   23   23   498   55%  2591   19% 
   2 Crafty-23.1          2633   23   23   508   55%  2591   20% 
   2 Crafty-23.1          2633   22   22   517   55%  2590   20% 
   2 Crafty-23.1          2632   22   22   523   55%  2589   20% 
   2 Crafty-23.1          2634   22   22   534   56%  2588   19% 
   2 Crafty-23.1          2636   22   22   555   56%  2588   19% 
   2 Crafty-23.1          2636   21   21   573   56%  2588   19% 
   2 Crafty-23.1          2638   21   21   602   56%  2588   20% 
   3 Crafty-23.1          2634   20   20   634   55%  2588   20% 
   3 Crafty-23.1          2635   20   20   655   55%  2587   20% 
   3 Crafty-23.1          2638   20   20   686   55%  2589   20% 
   3 Crafty-23.1          2636   19   19   704   55%  2590   20% 
   3 Crafty-23.1          2636   19   19   714   55%  2590   20% 
   2 Crafty-23.1          2637   19   19   726   55%  2591   20% 
   2 Crafty-23.1          2639   19   19   740   55%  2591   20% 
   2 Crafty-23.1          2640   19   19   748   55%  2590   20% 
   2 Crafty-23.1          2639   19   19   753   55%  2590   20% 
   2 Crafty-23.1          2641   19   18   763   56%  2589   20% 
   2 Crafty-23.1          2642   19   19   773   56%  2588   20% 
   2 Crafty-23.1          2644   18   18   790   56%  2588   20% 
   2 Crafty-23.1          2644   18   18   797   57%  2587   20% 
   2 Crafty-23.1          2646   18   18   811   57%  2587   20% 
   2 Crafty-23.1          2645   18   18   833   57%  2587   20% 
   2 Crafty-23.1          2641   18   18   857   56%  2587   20% 
   3 Crafty-23.1          2639   18   17   884   56%  2587   20% 
   3 Crafty-23.1          2638   17   17   903   56%  2587   20% 
   3 Crafty-23.1          2637   17   17   927   56%  2589   20% 
   3 Crafty-23.1          2638   17   17   946   56%  2590   20% 
   3 Crafty-23.1          2636   17   17   967   56%  2590   20% 
   3 Crafty-23.1          2636   17   16   975   56%  2590   20% 
   3 Crafty-23.1          2636   17   16   984   56%  2591   20% 
   3 Crafty-23.1          2636   16   16   995   56%  2591   20% 
   3 Crafty-23.1          2635   16   16  1006   56%  2590   20% 
   3 Crafty-23.1          2634   16   16  1014   55%  2591   20% 
   3 Crafty-23.1          2635   16   16  1030   55%  2590   20% 
.....
.....
   2 Crafty-23.1          2645    4    3 40000   56%  2597   23%

This shows pretty well why I am playing 40,000 games: I am trying to measure changes of 3 to 5 Elo.
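
The shrinking error bar can be approximated with a back-of-the-envelope calculation. The sketch below uses a normal approximation for the score and the slope of the logistic Elo curve; it is not BayesElo's computation and will not reproduce its +/- columns exactly, but the trend with the number of games is the same. The score and draw ratio passed in are just the rounded values from the list.

```python
import math

Z95 = 1.96  # two-sided 95% normal quantile

def elo_per_score(score: float) -> float:
    """Slope of the logistic Elo curve: Elo change per unit of score fraction."""
    return 400.0 / (math.log(10.0) * score * (1.0 - score))

def score_variance(score: float, draw_ratio: float) -> float:
    """Per-game variance of the result (1, 0.5 or 0) given score and draw ratio."""
    return score * (1.0 - score) - draw_ratio / 4.0

def elo_margin(games: int, score: float, draw_ratio: float, z: float = Z95) -> float:
    """Approximate +/- Elo margin after the given number of games."""
    std_error = math.sqrt(score_variance(score, draw_ratio) / games)
    return z * std_error * elo_per_score(score)

def games_needed(margin_elo: float, score: float, draw_ratio: float, z: float = Z95) -> int:
    """Smallest number of games whose approximate margin is at most margin_elo."""
    ratio = z * elo_per_score(score) / margin_elo
    return math.ceil(score_variance(score, draw_ratio) * ratio ** 2)

for n in (100, 1000, 10000, 40000):
    print(n, round(elo_margin(n, 0.56, 0.23), 1))
```

Because the margin falls with the square root of the number of games, halving it takes four times as many games.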
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: recent cluster testing

Post by bob »

Note that 23.0 and 23.1 are not playing in the same 40,000-game match. I play one against the normal opponents, then play the other, so I don't see those intermediate results. But if you look at the data I posted in another follow-up in this thread, the range of ratings varies significantly up front, so 23.0 could quite easily look better at those points.

I'll run the same test with 23.0 and append to the other post.
Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: recent cluster testing

Post by Vinvin »

Will Crafty 23.1 be out soon?