Edsel Apostol wrote:It would be interesting to see this data shown in a graph, including the error bars. I'm interested to see how consistent (straight) the Elo line is and how the error bars converge toward the Elo line as the number of games increases.
I'm actually working on a program that does this very thing. I have a small Windows app for my testing and am trying to incorporate regular calls to bayeselo for a graphical display. I will release it when ready.
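For what it's worth, a minimal sketch of what such periodic bayeselo calls could look like, assuming the usual readpgn / elo / mm / ratings command sequence (command names and behaviour vary between bayeselo builds, so treat this as a starting point rather than a recipe):
Code:
import subprocess

def bayeselo_ratings(pgn_path, bayeselo_bin="bayeselo"):
    # Feed bayeselo its commands on stdin and capture the ratings table.
    # Adjust the command list if your bayeselo build differs.
    commands = "\n".join([
        "readpgn " + pgn_path,  # load the accumulated game results
        "elo",                  # enter rating-estimation mode
        "mm",                   # run the maximum-likelihood fit
        "ratings",              # print the rating table
        "x",                    # leave elo mode; EOF on stdin ends the program
    ]) + "\n"
    result = subprocess.run([bayeselo_bin], input=commands,
                            capture_output=True, text=True, check=True)
    return result.stdout
The returned table can then be parsed and appended to the data series behind the graph after every batch of games.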
interesting test data.
-
nthom
- Posts: 112
- Joined: Thu Mar 09, 2006 6:15 am
- Location: Australia
Re: interesting test data.
-
Kirill Kryukov
- Posts: 518
- Joined: Sun Mar 19, 2006 4:12 am
- Full name: Kirill Kryukov
Re: interesting test data.
Edsel Apostol wrote:It would be interesting to see this data shown in a graph, including the error bars. I'm interested to see how consistent (straight) the Elo line is and how the error bars converge toward the Elo line as the number of games increases.
I show such graphs on my tournament site. Example (Sjeng 12.13):

green line - rating
red lines - rating error interval
yellow line - average opponent
blue line - number of games (scale not shown)
And a smoother version where days without games are skipped:

Similar graphs are also featured on the CCRL sites.
Of course, the number of games is much smaller than 40k, but it does give an idea of how much the ratings fluctuate with few games.
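To make that concrete, here is a rough back-of-the-envelope sketch (my own simplification, not how bayeselo computes its intervals): treat each game as an independent win/loss trial, convert the score fraction to an Elo difference, and propagate the binomial standard error. The 1/sqrt(n) shrinkage of the error bar is what the graphs show.
Code:
import math

def elo_and_error(score, n_games, z=1.96):
    # Elo difference from a score fraction plus a rough 95% error bar.
    # Simplified model: draws ignored, normal approximation for the error.
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    se_score = math.sqrt(score * (1.0 - score) / n_games)
    # Propagate the score error through d(elo)/d(score).
    se_elo = 400.0 / (math.log(10.0) * score * (1.0 - score)) * se_score
    return elo, z * se_elo

for n in (100, 1000, 10000, 40000):
    elo, err = elo_and_error(0.55, n)
    print("%6d games: %+5.0f +/- %3.0f Elo" % (n, elo, err))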
Best,
Kirill
-
Uri Blass
- Posts: 10941
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: interesting test data.
bob wrote:<snipped>
This does show the danger of relying on small numbers of games to predict whether a change is good or bad...
I think the conclusion is simply not to optimize engines for games but to optimize them for solving swami's test suite. In this case it is easy to predict whether a change is good or bad, and hopefully most of the changes that are productive for swami's test suite will also be productive in games.
Uri
-
Uri Blass
- Posts: 10941
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: interesting test data.
I can add that you cannot do it if you test changes in time management or changes of learning during the game, but I hope that swami develops a good test suite. It may be possible to test it by comparing program ratings based on CCRL or CEGT with ratings derived from some formula based on the test results.
Uri
-
mcostalba
- Posts: 2684
- Joined: Sat Jun 14, 2008 9:17 pm
Re: interesting test data.
Uri Blass wrote:I can add that you cannot do it if you test changes in time management or changes of learning during the game, but I hope that swami develops a good test suite. It may be possible to test it by comparing program ratings based on CCRL or CEGT with ratings derived from some formula based on the test results.
Uri
What I find missing in swami's tests (and I have already said this some weeks ago) is that the final score uses equal weights for all the tests.
I am not sure that some strategic aspects have the same impact as others; to move toward inferring an engine's Elo rating from the swami test, the individual test scores should be weighted before being summed together.
The correct weights can only be deduced from various engines' results, knowing their real Elo, by finding the weight vector that best approximates the engines' Elo from their swami test results.
-
Dann Corbit
- Posts: 12799
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: interesting test data.
mcostalba wrote:Uri Blass wrote:I can add that you cannot do it if you test changes in time management or changes of learning during the game, but I hope that swami develops a good test suite. It may be possible to test it by comparing program ratings based on CCRL or CEGT with ratings derived from some formula based on the test results.
Uri
What I find missing in swami's tests (and I have already said this some weeks ago) is that the final score uses equal weights for all the tests. I am not sure that some strategic aspects have the same impact as others; to move toward inferring an engine's Elo rating from the swami test, the individual test scores should be weighted before being summed together. The correct weights can only be deduced from various engines' results, knowing their real Elo, by finding the weight vector that best approximates the engines' Elo from their swami test results.
I guess you are right that some features have more value than others.
We might be able to guess the proper weights by doing a linear least squares fit or some other simple math.
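As a purely hypothetical illustration of that idea (the category scores and ratings below are invented), an ordinary least-squares fit of per-category test scores against known CCRL/CEGT ratings could look like this:
Code:
import numpy as np

# Invented per-category swami scores (rows: engines, columns: categories)
# for engines whose CCRL/CEGT Elo is assumed known.
scores = np.array([
    [0.92, 0.71],
    [0.85, 0.66],
    [0.78, 0.59],
    [0.65, 0.48],
])
elo = np.array([3050.0, 2980.0, 2870.0, 2700.0])

# Append a constant column so the fit also finds an intercept.
X = np.hstack([scores, np.ones((scores.shape[0], 1))])
weights, *_ = np.linalg.lstsq(X, elo, rcond=None)

# Predict the rating of a new engine from its category scores.
new_engine = np.array([0.80, 0.62, 1.0])
print("weights:", weights)
print("predicted Elo:", new_engine @ weights)
With more engines than weight parameters the system is overdetermined, which is exactly where least squares makes sense; with only a handful of reference engines the fitted weights will of course be very noisy.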
-
MattieShoes
- Posts: 718
- Joined: Fri Mar 20, 2009 8:59 pm
Re: interesting test data.

Something like that, maybe? I didn't muck about with error bars, just the rating and the 95% confidence intervals.
-
Don
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: interesting test data.
bob wrote:liuzy wrote:Bob, why don't you improve ipplite using your cluster?
Because we are busy improving Crafty, which is my own code. What would be the benefit of improving IP* if we one day discover it is derived from Rybka??? Also, what would be the motivation? I want to win with my code. Not much point in copying or using what someone else has done, IMHO. This is about enjoying a hobby, not copying what others have done. (Of course, not _everyone_ believes in that philosophy, but that's a separate issue.)
I agree with Bob in a big way and, with all due respect, I think the question was a stupid one.
-
Daniel Shawul
- Posts: 4186
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: interesting test data.
Can you guide a newbie on how to run multiple games on a cluster?
Are you using cutechess-cli for that purpose, or do you use your own script?
For other software I use that is capable of running on a cluster, I just invoke
"mpirun -np 12 `which xxxx`" to run the command xxxx on 12 processors. I guess I need an MPI script in place of xxxx that assigns the games to different IP addresses.
Right now, when I do "mpirun -np 4 scorpio", it starts 4 instances of scorpio on 4 nodes.
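I don't know what Bob's setup looks like, but one common pattern is to give every MPI rank its own cutechess-cli run, distinguished only by its rank number. A sketch assuming Open MPI (which exports OMPI_COMM_WORLD_RANK to each process) and a reasonably recent cutechess-cli; check your version's options before relying on the exact flags:
Code:
import os
import subprocess

# Launched as "mpirun -np 12 python worker.py". Other MPI implementations
# use different environment variable names for the rank.
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))

# Each rank plays its own batch of games with its own seed and PGN file,
# so the nodes do not duplicate each other's work.
subprocess.run([
    "cutechess-cli",
    "-engine", "cmd=./engine_new",
    "-engine", "cmd=./engine_base",
    "-each", "proto=uci", "tc=40/60",
    "-rounds", "100",
    "-srand", str(1000 + rank),
    "-pgnout", "games_rank%d.pgn" % rank,
], check=True)
Afterwards the per-rank PGN files can be concatenated and fed to bayeselo in one go.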
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: interesting test data.
Uri Blass wrote:bob wrote:<snipped>
This does show the danger of relying on small numbers of games to predict whether a change is good or bad...
I think the conclusion is simply not to optimize engines for games but to optimize them for solving swami's test suite. In this case it is easy to predict whether a change is good or bad, and hopefully most of the changes that are productive for swami's test suite will also be productive in games.
Uri
You are kidding, right???