Alternative methods of testing

Hart

Re: Alternative methods of testing

Post by Hart »

Charles Roberson wrote: When you say CC, do you mean computer chess or correspondence chess?

This is a nice stab at an approximation method. Things like this
have been tried for years and have failed. However, maybe it could be
accurate to within, say, 200 points.

There are flaws in the method. It takes more positions to do this
accurately. Also, your method of only allowing 4 seconds then adding
4 for failures has multiple issues.

1) A program may produce the "correct" answer in 4 seconds then
move to something else at 5 seconds and never return.
2) You are clustering programs together that don't see the answer
in 4 seconds as programs that will see it in 8 seconds. Some will
see it in 8 and others may never see it.
3) Sometimes there are multiple correct answers.
4) How do you come up with the "correct" answer?

Your method ignores issues in creating a good or great
computer chess program such as the timing algorithm.

I think your luck has been in testing only top programs. Try a set
of programs scattered through the CCRL list. Let's say 1 program
for every 50 points. This would be 16 programs from 2200 to 2800.

Actually, I see your method as a decent way to detect potential
clones.



Correspondence chess.

My average absolute error is 38 Elo. And this is all assuming that CCRL ratings are more or less correct. My test does not measure "game Elo", where a game score is also partly a function of the quality of an engine's time management.

1) Of course you are correct. My hypothesis was that higher rated engines would find the move in less time and they did. I have no doubt that Fruit 2.1, with 1000x more time, or Naum with 2x more time would be comparable to Rybka with x time. My very hypothesis rests on the fact that different engines of different strengths will find the "solution" in varying amounts of time, depending on their "absolute" strength, if such a thing could be measured.

2) See 1)

3) I know of no way around this problem at this time. I am sure some engines were penalized for finding solutions that were even better than the proposed solution but I have no way to determine this.

4) My "correct" answers were taken from the CC games. I believe these CC games are played at a level at least 500 Elo higher than these engines at t=4. So, for all practical purposes, the moves played in these games should be judged as more or less correct from the perspective of a player 500+ Elo below these CC GMs. What I end up with is that the more closely an engine's moves match those of the average CC GM, the higher rated that engine will actually be.

"Your method ignores issues in creating a good or great
computer chess program such as the timing algorithm."

It was not my intention to measure the performance of the "timing algorithm", assuming we are talking about something like time management. In fact, I wish CCRL and CEGT did not either. An engine's usefulness for raw analysis might be overestimated because it has exceptional time management and therefore scores better in tournaments. It is not clear to me how effective time management translates into actual analysis.

I threw Movei into the mix to see what would happen and it scored 2763; its CCRL rating is 2732. I will test other lower rated engines when I get a chance.
CRoberson
Posts: 2056
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Alternative methods of testing

Post by CRoberson »

Ok, you are looking for a way to find good analysis engines. Then
your technique has more merit. Because of several things in a full
game test, it is possible that some engines have higher value as
analysis engines than their CCRL/CEGT/WBEC.... rating would indicate.

However, it seems you are missing one of my points. I was pointing
out that the variance between your method and the CCRL..... would
likely be greater with the lower rated engines. This is likely to be
true for several reasons; just one of which is the timing algorithm.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Alternative methods of testing

Post by Laskos »

Hart wrote:With more data now, I have a logarithmic regression with R^2=.883, whereas my linear is now .882. Not much difference.
Thanks, a little bit disappointed :(, but if you can throw in much weaker engines (400-500 Elo weaker), maybe the logarithmic fit would show itself better. Maybe you should use more positions too; I am not even sure how to calculate your error bars.

Regards,
Kai
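
For anyone wanting to repeat the linear vs. logarithmic comparison discussed above, here is a minimal sketch. It is not Hart's actual analysis: the function names are hypothetical, and the sample numbers are made up purely to exercise the code. It fits rating against rated time (linear) and against ln(rated time) (logarithmic) by ordinary least squares and reports R^2 for each.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def compare_models(times, ratings):
    """Return (R^2 of the linear fit, R^2 of the logarithmic fit)."""
    lin = fit_line(times, ratings)
    log_times = [math.log(t) for t in times]
    log = fit_line(log_times, ratings)
    return (r_squared(times, ratings, *lin),
            r_squared(log_times, ratings, *log))

# Made-up (rated time in seconds, rating) pairs, NOT data from this thread.
times = [500.0, 600.0, 700.0, 800.0, 1000.0]
ratings = [3100.0, 2850.0, 2600.0, 2500.0, 2200.0]

r2_linear, r2_log = compare_models(times, ratings)
print(f"linear R^2 = {r2_linear:.3f}, logarithmic R^2 = {r2_log:.3f}")
```

With engines spread over a narrow rating band the two fits will tend to look alike, since ln(t) is close to linear over a short range of t; a wider spread of engine strengths gives the logarithmic model more room to differ.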
Hart

Re: Alternative methods of testing

Post by Hart »

I am rerunning the test now with different positions, from higher level games, and with the addition of Lime v66, Yace, Greko, and Rybka 3.
Andres Valverde
Posts: 557
Joined: Sun Feb 18, 2007 11:07 pm
Location: Almeria. SPAIN

Re: Alternative methods of testing

Post by Andres Valverde »

Hart wrote:I am rerunning the test now with different positions, from higher level games, and with the addition of Lime v66, Yace, Greko, and Rybka 3.
Hi Michael, if I understood correctly, you extract an EPD file with the positions in each game from move 20 onward? Is there a chance of getting those EPD files?
Saludos, Andres
Andres Valverde
Posts: 557
Joined: Sun Feb 18, 2007 11:07 pm
Location: Almeria. SPAIN

Re: Alternative methods of testing

Post by Andres Valverde »

Andres Valverde wrote:
Hart wrote:I am rerunning the test now with different positions, from higher level games, and with the addition of Lime v66, Yace, Greko, and Rybka 3.
Hi Michael, if I understood correctly, you extract an EPD file with the positions in each game from move 20 onward? Is there a chance of getting those EPD files?
Never mind, I realized that Arena can analyze PGNs as well :)
Saludos, Andres
Andres Valverde
Posts: 557
Joined: Sun Feb 18, 2007 11:07 pm
Location: Almeria. SPAIN

Re: Alternative methods of testing

Post by Andres Valverde »

Hart wrote:I will take Bright 0.4a as an example.

In using Automatic Analysis in Arena I use five ICCF games, all ending in draws. I want to see how many times and how quickly Bright can find the move the CC players made in these games from move 20. If a move is not found I add 4 to the total time searched, where 4 is the maximum time (in seconds) allowed for a given position. At the end of the analysis I add up the total time to find solutions plus 4 for every move not found. Actually, Arena does most of the work for me.

I am left with a score for every engine. I take these scores and run them through a linear regression analysis with their CCRL ratings. This gives a linear function E(t) = -2.706t + 4455.63 that will then give me an estimate of an engine's strength in terms of CCRL ratings.

Bright took 546 seconds to find solutions for 360 positions, with maximum time = 4 seconds. After plugging this into my function I get 2978.
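
The scoring quoted above can be sketched in a few lines. This is only an illustration, not Hart's actual workflow (Arena does the bookkeeping in the original). The function names and the use of None for an unsolved position are assumptions; the 4-second cap, the coefficients -2.706 and 4455.63, and Bright's 546 seconds come from the quoted post.

```python
MAX_TIME = 4.0  # seconds allowed per position (from the quoted post)

def rated_time(solve_times, max_time=MAX_TIME):
    """Total time to find the CC moves; a position that was not
    solved (None, an assumed convention) costs max_time."""
    return sum(max_time if t is None else t for t in solve_times)

def estimate_elo(total_time, slope=-2.706, intercept=4455.63):
    """Hart's fitted line E(t) = -2.706 t + 4455.63."""
    return slope * total_time + intercept

# Bright 0.4a: 546 seconds in total over 360 positions.
bright_estimate = estimate_elo(546)
print(round(bright_estimate))  # 2978
```

Note that the slope and intercept are specific to Hart's position set and engine pool; a different set of positions would need its own regression.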
1) Imagine a slow but good engine that solves 1000 positions at an average of 3" per position. Total time:

1000 x 3" = 3000 s

2) A _fast_ but not so good engine finds only 500 solutions out of 1000, but it does so at 2" per position. Total time:

500 x 2" + 500 x 4" = 3000 s

Both engines would get the same rating, but the former solved twice as many positions as the latter!

So the number of positions solved has to be used somehow in your formula, or I'm missing something.
Saludos, Andres
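
The two-engine example above is easy to check numerically. The sketch below is hypothetical (the helper names are not from any tool mentioned in this thread) and assumes, as in the example, that every unsolved position costs the 4-second maximum. It shows that total time alone cannot separate the two engines, while the solved fraction does.

```python
MAX_TIME = 4  # seconds charged for each unsolved position

def total_time(solved, avg_solve_time, positions, max_time=MAX_TIME):
    """Rated time: solved positions at their average time,
    unsolved positions at the maximum time."""
    return solved * avg_solve_time + (positions - solved) * max_time

def solved_fraction(solved, positions):
    """Share of positions where the engine found the CC move."""
    return solved / positions

slow_good = total_time(solved=1000, avg_solve_time=3, positions=1000)
fast_weak = total_time(solved=500, avg_solve_time=2, positions=1000)

print(slow_good, fast_weak)  # 3000 3000
print(solved_fraction(1000, 1000), solved_fraction(500, 1000))  # 1.0 0.5
```

Whether such a tie ever occurs with real engines is a separate question; the sketch only confirms the arithmetic of the example.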
Richard Allbert
Posts: 792
Joined: Wed Jul 19, 2006 9:58 am

Re: Alternative methods of testing

Post by Richard Allbert »

Hart wrote:I am rerunning the test now with different positions, from higher level games, and with the addition of Lime v66, Yace, Greko, and Rybka 3.
If you have some results, I'd be really interested in seeing them for Lime. :)

Thanks!!

Richard
Hart

Re: Alternative methods of testing

Post by Hart »

Andres Valverde wrote:1) Imagine a slow but good engine that solves 1000 positions at an average of 3" per position. Total time:

1000 x 3" = 3000 s

2) A _fast_ but not so good engine finds only 500 solutions out of 1000, but it does so at 2" per position. Total time:

500 x 2" + 500 x 4" = 3000 s

Both engines would get the same rating, but the former solved twice as many positions as the latter!

So the number of positions solved has to be used somehow in your formula, or I'm missing something.
A better model might include the number or percentage of solved positions. However, what I have consistently seen is markedly high correlations between rated time and positions solved. In other words, they are both measuring, to a large extent, the same thing: playing strength. In most of my analyses I get better rankings using rated time as well, go figure.

As for your example, I am not quite sure I understand it. From what I can see, either it does not account for the law of averages or it is an extreme case where my model would of course fail. No engine in these tests will solve x positions at one discrete time and then y positions at another. The solve times will actually be distributed unevenly throughout the time period t, largely depending on the engine's relative strength, in which case this distribution will be captured in rated time and will correlate with playing strength.
Hart

Re: Alternative methods of testing

Post by Hart »

I don't know if I was very unlucky with my latest position set or very lucky with my first one, but my latest results are not holding up to expectations.

Still working out some kinks...