Alternative methods of testing

Hart · Post by **Hart** » Sat Mar 28, 2009 9:17 am

Inspired by studies that try to rank GMs and World Champions, living and deceased, by engine analysis, I reasoned that the stronger the engine used for the analysis, the better a judge of play it should be. The rankings such analysis produces appear sensible. Judging human play by raw engine calculation is obviously not always best, though, since humans have alternative strategies, such as choosing objectively worse moves to create complications, simplifying, etc. But what about ranking other engines?

My hypothesis is that the stronger the engine (or play), the higher the correlation between determined rank (rated time) and actual rank (as determined by CCRL/CEGT) of x number of engines.

To test this I used Toga 1.3.1, Fruit 2.1, Toga 1.4 beta5c, Rybka 2.2, Naum 4, and Glaurung 2.2, with 4 CPUs for all engines capable of using them. Rather than use another engine to analyze these engines' games, I fed Arena high-level CC games gathered from ICCF, the idea still being that the higher the level of play, the better the scores (rated times) will correlate with actual ratings.

I had two data sets. The first was a PGN of CC games between players rated 2200-2300, using positions from move 20 onward. The second consisted of games played at the 2600+ level. All games in both sets ended in a draw.

Sure enough, rankings from the lower-rated games were significantly worse than the rankings from the 2600+ level games. Coefficients of determination were .741 and .91, respectively. When mapping Elo from rated time with my new linear function (fitted on the 2600+ games), I now get:

Fruit 2.1...............2862
Toga 1.3.1...........2870
Toga 1.4 beta5c...2952
Glaurung 2.2........2970
Rybka 2.2............3100
Naum 4................3177

Here are the CCRL (40/40?) ratings:

Fruit 2.1...............2794
Toga 1.3.1...........2894
Toga 1.4 beta5c...2991
Glaurung 2.2........2994
Rybka 2.2............3109
Naum 4................3152

The rankings are perfect; the ratings, however, are not. Average difference: 31.5 Elo. Maybe two of the results fall outside CCRL's ratings including error bars.
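The 31.5 Elo figure and the claim of identical rankings can be checked directly from the two lists above. This is a small sketch; the dictionary and function names are mine, and "average difference" is assumed to mean the mean absolute difference.

```python
# Estimated ratings (from the linear fit) and CCRL ratings, as listed in the post.
estimated = {
    "Fruit 2.1": 2862,
    "Toga 1.3.1": 2870,
    "Toga 1.4 beta5c": 2952,
    "Glaurung 2.2": 2970,
    "Rybka 2.2": 3100,
    "Naum 4": 3177,
}
ccrl = {
    "Fruit 2.1": 2794,
    "Toga 1.3.1": 2894,
    "Toga 1.4 beta5c": 2991,
    "Glaurung 2.2": 2994,
    "Rybka 2.2": 3109,
    "Naum 4": 3152,
}

# Assumption: "average difference" means mean absolute difference.
abs_diffs = {name: abs(estimated[name] - ccrl[name]) for name in estimated}
mean_abs_diff = sum(abs_diffs.values()) / len(abs_diffs)


def ranking(ratings):
    """Engine names ordered from weakest to strongest."""
    return sorted(ratings, key=ratings.get)


same_order = ranking(estimated) == ranking(ccrl)

print(mean_abs_diff)  # 31.5
print(same_order)     # True
```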

In total I used 5 high-level ICCF drawn games, with players rated 2600+. I only counted moves after move 20 to exclude openings, for a total of maybe 350-400 positions. I am guessing (hoping?) that with more useful positions and shallower depths and times, ratings could be determined in much less time than traditional testing methods currently need.

Hart · Post by **Hart** » Sat Mar 28, 2009 10:26 am

I just ran Bright 0.4a through my "test suite" and it scored 2978. Its CCRL rating: 2998 +/- 19

My quadratic yields an R^2 = .954

But, is it fair to switch from a linear to a quadratic model simply because it has a smaller error?

Marc Lacrosse · Post by **Marc Lacrosse** » Sat Mar 28, 2009 11:00 am

Very interesting, but could we get a more detailed explanation of the rating process?
Do the engines play games from the selected positions from correspondence games? Or do they analyse them and you search for correlation with the moves actually played? Or correlation with Rybka's evaluation of the same moves?

Marc

Hart · Post by **Hart** » Sat Mar 28, 2009 11:15 am

I will take Bright 0.4a as an example.

In using Automatic Analysis in Arena I use five ICCF games, all ending in draws. I want to see how many times and how quickly Bright can find the move the CC players made in these games from move 20. If a move is not found I add 4 to the total time searched, where 4 is the maximum time (in seconds) allowed for a given position. At the end of the analysis I add up the total time to find solutions plus 4 for every move not found. Actually, Arena does most of the work for me.
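The scoring rule described above can be sketched as follows. This is only an illustration of the arithmetic, not of what Arena does internally; `rated_time` and the example solve times are hypothetical, with `None` standing for a move that was not found within the limit.

```python
from typing import Optional, Sequence

MAX_TIME = 4.0  # seconds allowed per position, as stated in the post


def rated_time(solve_times: Sequence[Optional[float]],
               max_time: float = MAX_TIME) -> float:
    """Total score: time to find each game move, plus max_time per miss.

    A value of None (or anything above max_time) counts as "not found".
    """
    total = 0.0
    for t in solve_times:
        if t is None or t > max_time:
            total += max_time  # penalty for a move that was not found
        else:
            total += t
    return total


# Hypothetical results for five positions: two of them are misses.
example = [0.5, 1.5, None, 3.0, None]
print(rated_time(example))  # 13.0
```

A lower total means the engine matched the correspondence players' moves more often and more quickly.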

I am left with a score for every engine. I take these scores and run them through a linear regression analysis with their CCRL ratings. This gives a linear function E(t) = -2.706t + 4455.63 that will then give me an estimate of an engine's strength in terms of CCRL ratings.

Bright took 546 seconds to find solutions for 360 positions, with maximum time = 4 seconds. After plugging this into my function I get 2978
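Plugging the Bright figure into the linear function reproduces the quoted estimate. The coefficients are the ones given in the post; the function name is mine.

```python
# Coefficients of the linear function E(t) = -2.706t + 4455.63 from the post.
SLOPE = -2.706
INTERCEPT = 4455.63


def estimate_elo(rated_time_seconds: float) -> float:
    """Map a total rated time (in seconds) to an estimated CCRL rating."""
    return SLOPE * rated_time_seconds + INTERCEPT


bright_estimate = estimate_elo(546)
print(round(bright_estimate))  # 2978
```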

Hart · Post by **Hart** » Sat Mar 28, 2009 11:42 am

Also, many of these 360 positions are crap because a) the move is forced or b) no engine can find solutions. I could also think of many reasons why Arena is not perfectly suited to this analysis; it just happens to be the most convenient tool I have right now.
I believe that with a proper GUI to automate the analysis and a better set of positions, I could rate an engine with error bars as small as 5-10 Elo in a matter of minutes.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Sat Mar 28, 2009 1:28 pm

Hart wrote: But, is it fair to switch from a linear to a quadratic model simply because it has a smaller error?

No, because the quadratic model has more degrees of freedom.

michiguel · Post by **michiguel** » Sat Mar 28, 2009 3:44 pm

Gian-Carlo Pascutto wrote:
Hart wrote: But, is it fair to switch from a linear to a quadratic model simply because it has a smaller error?
No, because the quadratic model has more degrees of freedom.

It depends how much better the fitting is. There are some tests that evaluate this.

Miguel
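Two common checks of the kind alluded to here are adjusted R^2 and the F-test for nested models, both of which penalize the extra parameter. The sketch below computes them from R^2 values alone. The function names are mine, and the example call assumes both fits were made on the same 7 engines (the original six plus Bright), which the thread does not confirm.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for n observations and k predictors (intercept excluded)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def nested_f_statistic(r2_reduced: float, r2_full: float,
                       n: int, k_reduced: int, k_full: int) -> float:
    """F statistic for comparing a reduced model with a fuller nested model.

    Degrees of freedom are (k_full - k_reduced, n - k_full - 1).
    """
    numerator = (r2_full - r2_reduced) / (k_full - k_reduced)
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator


# ASSUMPTION: n = 7 and both R^2 values refer to the same data set.
n = 7
adj_linear = adjusted_r2(0.91, n, k=1)
adj_quadratic = adjusted_r2(0.954, n, k=2)
f_stat = nested_f_statistic(0.91, 0.954, n, k_reduced=1, k_full=2)

print(adj_linear, adj_quadratic, f_stat)
```

Under that assumption the F statistic has 1 and 4 degrees of freedom, for which the 5% critical value is about 7.71, so the quadratic term would have to improve the fit considerably before it could be called significant with so few engines.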

Laskos · Post by **Laskos** » Sun Mar 29, 2009 2:06 pm

Hart wrote:
I am left with a score for every engine. I take these scores and run them through a linear regression analysis with their CCRL ratings. This gives a linear function E(t) = -2.706t + 4455.63 that will then give me an estimate of an engine's strength in terms of CCRL ratings.

Can you fit it with A+B*ln(t)? How does R^2 look in this case compared to the linear fit (not the quadratic, because it has more parameters, and asymptotically it looks weird)?

Kai
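A fit of the form A+B*ln(t) is an ordinary linear regression on ln(t), so it uses the same number of parameters as the linear model and the two R^2 values can be compared directly. The sketch below shows the mechanics with plain least squares. The data are synthetic, generated from a made-up logarithmic law (A = 5000, B = -300); they are not the engine results from this thread.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


def fit_log(ts, ys):
    """Fit y = A + B*ln(t) by regressing y on ln(t)."""
    return fit_line([math.log(t) for t in ts], ys)


# SYNTHETIC data for illustration only: rated times and ratings that follow
# a hypothetical law E = 5000 - 300*ln(t) exactly.
times = [400.0, 450.0, 500.0, 550.0, 600.0, 650.0]
elos = [5000.0 - 300.0 * math.log(t) for t in times]

a_lin, b_lin, r2_lin = fit_line(times, elos)
a_log, b_log, r2_log = fit_log(times, elos)
print(r2_lin, r2_log)
```

On data that really are logarithmic, the log fit recovers A and B and its R^2 is essentially 1, while the linear fit comes close but falls slightly short. Over a narrow range of times the two models are hard to tell apart.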

CRoberson · Post by **CRoberson** » Sun Mar 29, 2009 5:56 pm

When you say CC, do you mean computer chess or correspondence chess?

This is a nice stab at an approximation method. Things like this have been tried for years and failed. However, maybe it could be accurate to, say, 200 points.

There are flaws in the method. It takes more positions to do this accurately. Also, your method of only allowing 4 seconds and then adding 4 for failures has multiple issues.

1) A program may produce the "correct" answer in 4 seconds then move to something else at 5 seconds and never return.
2) You are clustering programs together that don't see the answer in 4 seconds as programs that will see it in 8 seconds. Some will see it in 8 and others may never see it.
3) Sometimes there are multiple correct answers.
4) How do you come up with the "correct" answer?

Your method ignores issues in creating a good or great computer chess program, such as the timing algorithm.

I think your luck has been in testing only top programs. Try a set of programs scattered through the CCRL list. Let's say 1 program for every 50 points. This would be 16 programs from 2200 to 2800.

Actually, I see your method as a decent way to detect potential clones.

Hart · Post by **Hart** » Sun Mar 29, 2009 8:06 pm

With more data now, I have a logarithmic regression with R^2=.883, whereas my linear is now .882. Not much difference.
