Question to the members of the ranking lists..

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

Dann Corbit
Posts: 12791
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Question to the members of the ranking lists..

Post by Dann Corbit »

lkaufman wrote:Regardless of the wording issue, the real question is whether one can use engine-engine rating lists to predict (with reasonable margin of error) the ratings engines would have against top humans in competition. I say "yes, if the calculation is done correctly, but not just by looking at the raw number on the list or by just adding or subtracting a constant". I have the feeling that you (Ingo) would say "no, regardless of how the calculation is done". Please correct me if I am wrong.
In order to find out how the lists relate (e.g. FIDE Elo and CCRL Elo), FIDE-rated players and CCRL-tested engines would need to play against each other.

Without this, we know definitely that there is a correlation, but we cannot know what it is numerically.

I guess that almost everyone knows this.
Albert Silver
Posts: 3026
Joined: Wed Mar 08, 2006 9:57 pm
Location: Rio de Janeiro, Brazil

Re: Question to the members of the ranking lists..

Post by Albert Silver »

Dann Corbit wrote:
lkaufman wrote:Regardless of the wording issue, the real question is whether one can use engine-engine rating lists to predict (with reasonable margin of error) the ratings engines would have against top humans in competition. I say "yes, if the calculation is done correctly, but not just by looking at the raw number on the list or by just adding or subtracting a constant". I have the feeling that you (Ingo) would say "no, regardless of how the calculation is done". Please correct me if I am wrong.
In order to find out how the lists relate (e.g. FIDE Elo and CCRL Elo), FIDE-rated players and CCRL-tested engines would need to play against each other.

Without this, we know definitely that there is a correlation, but we cannot know what it is numerically.

I guess that almost everyone knows this.
:lol: :lol: :lol:
"Tactics are the bricks and sticks that make up a game, but positional play is the architectural blueprint."
lkaufman
Posts: 6258
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Question to the members of the ranking lists..

Post by lkaufman »

In the past century, there were plenty of games between FIDE-rated players and SSDF-rated engines, enough to derive a formula for predicting the FIDE rating of an engine from its SSDF rating. No one did this precisely, but my 75% formula is a good approximation. Presumably, CCRL (or CEGT) slow ratings are comparable to SSDF ratings with the addition or subtraction of some constant, so the same formula with a different constant should be applicable to them. Of course it's not too precise, due to the scarcity of games between top GMs and top engines in this century, but there are still a few dozen such games, so the formula should be at least in the ballpark. The many handicap matches played by Rybka seem to roughly confirm the accuracy of my formula, although we can't be too precise, as these matches require estimating the rating value of the various handicaps. Roughly speaking, they indicate that when Rybka was rated around 3100 on the lists, it performed around 3000 against humans after allowing for the handicap, which roughly agrees with subtracting 25% of the excess over the base rating of 2750.
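A minimal sketch of that contraction, using only the numbers mentioned above (the 2750 base, the 75% factor, and Rybka's roughly 3100 list rating); the function name is just for illustration:

Code:
# Hedged sketch of the 75% formula described above: keep 75% of the excess
# over the base rating, i.e. subtract 25% of it. The base (2750) and the
# ~3100 list rating come from the post; nothing here is official.
def estimated_human_rating(list_rating, base=2750, keep=0.75):
    return base + keep * (list_rating - base)

print(estimated_human_rating(3100))  # ~3012, close to Rybka's ~3000 vs. humans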
Dann Corbit
Posts: 12791
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Question to the members of the ranking lists..

Post by Dann Corbit »

lkaufman wrote:In the past century, there were plenty of games between FIDE-rated players and SSDF-rated engines, enough to derive a formula for predicting the FIDE rating of an engine from its SSDF rating. No one did this precisely, but my 75% formula is a good approximation. Presumably, CCRL (or CEGT) slow ratings are comparable to SSDF ratings with the addition or subtraction of some constant, so the same formula with a different constant should be applicable to them. Of course it's not too precise, due to the scarcity of games between top GMs and top engines in this century, but there are still a few dozen such games, so the formula should be at least in the ballpark. The many handicap matches played by Rybka seem to roughly confirm the accuracy of my formula, although we can't be too precise, as these matches require estimating the rating value of the various handicaps. Roughly speaking, they indicate that when Rybka was rated around 3100 on the lists, it performed around 3000 against humans after allowing for the handicap, which roughly agrees with subtracting 25% of the excess over the base rating of 2750.
It will take about 500 games between members of both lists under constant conditions (IOW, the numbers are only valid for the conditions of the tests) before an accurate relationship can be known.

On the other hand, I think we know what we need to know (e.g. Rybka is really, really, really strong.)
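For a rough sense of why something like 500 games is needed, here is a back-of-the-envelope error-margin calculation. It assumes independent games and, pessimistically, no draws (draws would shrink the margin); the arithmetic is standard binomial/Elo math, not anything taken from the lists:

Code:
import math

def elo_margin(n_games, sigmas=2, p=0.5):
    # Approximate +/- Elo margin after n_games at expected score p.
    se_score = math.sqrt(p * (1 - p) / n_games)            # std error of the score
    elo_per_score = (400 / math.log(10)) / (p * (1 - p))   # local slope of Elo vs. score
    return sigmas * se_score * elo_per_score

print(f"+/-{elo_margin(500):.0f} Elo after 500 games")     # roughly +/-31
print(f"+/-{elo_margin(2000):.0f} Elo after 2000 games")   # roughly +/-16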
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Question to the members of the ranking lists..

Post by Don »

Larry,

What effect does pondering have on rating distortion? Will two programs that ponder show a larger rating difference, in general, than if they don't ponder, all else being equal?


Don


lkaufman wrote:
IWB wrote:Hello Rainer
"Nevertheless Frank and I chosed to have an identical starting point. The 2800 for Shredder were an easy solution because we both had the 32bit Version wih quite a lot of games in our lists and we did not want to have a single engine (at that time) with 3000 Elo in our lists as this seems to be unrealistic to us. (Of course the values are not comparable to human rating, but people do compare ...)
Bye
Ingo
"

Regarding comparing engine ratings to human ratings, I'd like to make two points. First, 2800 for Shredder 12 is obviously too low in human terms, as it is much stronger than any of the programs that played Kasparov or Kramnik successfully (drawn or won matches). But I agree that the top ratings on the CCRL/CEGT lists seem too high in human terms, as they imply very little chance for Anand to get even a draw against Deep Rybka 4.

The explanation is clear from a study of the SSDF ratings over more than two decades. SSDF had to regularly reduce their whole list to avoid having inflated ratings at the top. The reason is simply that engine vs. engine testing overstates rating gains compared to human vs. engine games, probably because the more similar two entities are, the more certain it is that superiority will decide a game. If two players have totally different evals and search, a doubling of search speed for one will have limited benefit, but if they are otherwise identical it will be decisive.

Anyway, a study of the SSDF ratings indicates that engine vs. engine ratings need to be contracted by roughly 3/4 to be comparable to human ratings. So to estimate a human rating for an engine, first decide at what level the list being used seems right (maybe 3000 for IPON, maybe somewhere in the 2700-2800 range for CCRL and CEGT), then move the rating of a given engine by 25% towards this number. This should produce a pretty accurate estimate of the rating the engine would get in human competition.
lkaufman
Posts: 6258
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Question to the members of the ranking lists..

Post by lkaufman »

It has been theorized that pondering favors the stronger program, since if it sees things faster, it is more likely to predict the opponent's move. If someone wants to go to the trouble, this theory can be checked by comparing the standard deviation of the ratings of engines on the IPON list with that of the same engines on the CCRL or CEGT blitz lists. If this theory is correct, the IPON list should have a larger standard deviation, as it uses ponder on, unlike the other two lists.
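A sketch of what that comparison could look like. The engine names and ratings below are invented purely to show the calculation; real IPON and CCRL/CEGT blitz numbers for the common engines would have to be substituted:

Code:
from statistics import pstdev

# Invented ratings for four engines appearing on both lists (ponder on vs. off).
ipon = {"EngineA": 3050, "EngineB": 2980, "EngineC": 2890, "EngineD": 2760}
ccrl = {"EngineA": 3120, "EngineB": 3070, "EngineC": 3000, "EngineD": 2900}

common = sorted(set(ipon) & set(ccrl))
spread_ipon = pstdev([ipon[e] for e in common])
spread_ccrl = pstdev([ccrl[e] for e in common])

# If ponder-on really stretches the gaps between strong and weak engines,
# the ponder-on list should show the larger standard deviation.
print(f"IPON spread: {spread_ipon:.1f}  CCRL spread: {spread_ccrl:.1f}")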
Dann Corbit
Posts: 12791
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Question to the members of the ranking lists..

Post by Dann Corbit »

lkaufman wrote:It has been theorized that pondering favors the stronger program, since if it sees things faster, it is more likely to predict the opponent's move. If someone wants to go to the trouble, this theory can be checked by comparing the standard deviation of the ratings of engines on the IPON list with that of the same engines on the CCRL or CEGT blitz lists. If this theory is correct, the IPON list should have a larger standard deviation, as it uses ponder on, unlike the other two lists.
A program could be superior due to evaluation or search or both.
I think the effects of greater time will be different depending on whether the superiority is due to search or eval.

Also, if pondering favors the stronger program, will longer time control also favor the stronger program? If that is the case, will we see a biased strength shift as the time scale increases?
E.g. we have 40/4 lists, a 40/20 list, a 40/40 list, and a few longer time control lists. Do these lists show an Elo shift in favor of the stronger programs as we increase time control? After all, pondering is little more than a doubling of time control.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Question to the members of the ranking lists..

Post by Don »

Dann Corbit wrote:
lkaufman wrote:It has been theorized that pondering favors the stronger program, since if it sees things faster, it is more likely to predict the opponent's move. If someone wants to go to the trouble, this theory can be checked by comparing the standard deviation of the ratings of engines on the IPON list with that of the same engines on the CCRL or CEGT blitz lists. If this theory is correct, the IPON list should have a larger standard deviation, as it uses ponder on, unlike the other two lists.
A program could be superior due to evaluation or search or both.
I think the effects of greater time will be different depending on whether the superiority is due to search or eval.
I think you are probably correct about this.

There is a bit of a discontinuity between search improvements and evaluation improvements in terms of how programs play chess, and I think improving the search probably makes a program a better predictor of the opponent's moves, more than making it stronger through evaluation improvements would.

I'm probably being a little superstitious about this, but if I could get 100 Elo from either evaluation or search, I would rather get it from evaluation improvements, as I think it makes the program a more "rounded" player. No halfway reasonable program has a problem with tactics anyway, but nevertheless we seem to spend more time on search than on evaluation.

Also, if pondering favors the stronger program, will longer time control also favor the stronger program? If that is the case, will we see a biased strength shift as the time scale increases?
E.g. we have 40/4 lists, a 40/20 list, a 40/40 list, and a few longer time control lists. Do these lists show an Elo shift in favor of the stronger programs as we increase time control? After all, pondering is little more than a doubling of time control.
Steve B
Posts: 3697
Joined: Tue Jul 31, 2007 4:26 pm

Re: Question to the members of the ranking lists..

Post by Steve B »

lkaufman wrote:It has been theorized that pondering favors the stronger program, since if it sees things faster, it is more likely to predict the opponent's move. If someone wants to go to the trouble, this theory can be checked by comparing the standard deviation of the ratings of engines on the IPON list with that of the same engines on the CCRL or CEGT blitz lists. If this theory is correct, the IPON list should have a larger standard deviation, as it uses ponder on, unlike the other two lists.
Hi GM Kaufman
excuse me if I digress
firstly, it's good to see you well again and back posting on the forums
perhaps you will remember me?
I helped sponsor the GM Joel Benjamin v Rybka "White Odds" match a few years ago

quick question for you if I may..

I am sure you are well aware of the now legendary BB report
some computer chess enthusiasts do not take this report seriously because it is anonymous (not signed)
Zach Wegner, who is now preparing for the WCCC, has met BB in person, and he also mentioned that you have spoken to BB in person
forgetting for the moment the actual content of the report..
do you think the BB report is tainted in any way by virtue of its being unsigned?

Best Regards
Steve
lkaufman
Posts: 6258
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Question to the members of the ranking lists..

Post by lkaufman »

Dann Corbit wrote:
"A program could be superior due to evaluation or search or both.
I think the effects of greater time will be different depending on whether the superiority is due to search or eval."

This is of course true, but in general stronger programs are stronger in both search and eval, so a comparison of the standard deviations on the lists, as I suggest, should be a valid test of the hypothesis.

"Also, if pondering favors the stronger program, will longer time control also favor the stronger program? If that is the case, will we see a biased strength shift as the time scale increases?
E.g. we have 40/4 lists, a 40/20 list, a 40/40 list, and a few longer time control lists. Do these lists show an Elo shift in favor of the stronger programs as we increase time control? After all, pondering is little more than a doubling of time control.
"

This is definitely not the case. Longer time limits favor the weaker program, not the stronger one, at least if we measure things in Elo points. The reason is that the longer the time limit, the higher the percentage of draws. The effect of pondering is quite different. If my program is just like yours but a ply faster, I will consistently guess your move and thus will reply quickly to your moves, gaining clock time and increasing my advantage over you.
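A small numeric illustration of the draw effect: hold the win/loss ratio of the decisive games fixed, raise only the draw rate, and the measured Elo gap shrinks. The win/draw/loss splits below are invented; only the score-to-Elo conversion is standard:

Code:
import math

def elo_diff(score):
    # Elo difference implied by an expected score (standard logistic model).
    return -400 * math.log10(1 / score - 1)

# Blitz-like: stronger engine wins 50%, draws 30%, loses 20% -> score 0.65.
blitz_score = 0.50 + 0.30 / 2
# Longer TC: same 2.5:1 win/loss ratio among decisive games, but 70% draws.
long_score = 0.30 * (2.5 / 3.5) + 0.70 / 2

print(f"blitz-like: {elo_diff(blitz_score):+.0f} Elo")   # about +108
print(f"longer TC:  {elo_diff(long_score):+.0f} Elo")    # about +45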