Hardware vs Software

Discussion of chess software programming and technical issues.

Moderator: Ras

Bill Rogers
Posts: 3562
Joined: Thu Mar 09, 2006 3:54 am
Location: San Jose, California

Re: Hardware vs Software - test results

Post by Bill Rogers »

Mr. Hyatt
Am I to understand that Crafty 8.11 used only material for its evaluation?
Bill
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

Bill Rogers wrote:Mr. Hyatt
Am I to understand that Crafty 8.11 used only material for its evaluation?
Bill
No. It had an evaluation that is fairly similar to the ideas used today. The no-eval issue was something Charles wanted me to test, just to see what evaluation adds to an engine...
Uri Blass
Posts: 10789
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hardware vs Software - test results

Post by Uri Blass »

bob wrote:
Bill Rogers wrote:Mr. Hyatt
Am I to understand that Crafty 8.11 used only material for its evaluation?
Bill
No. It had an evaluation that is fairly similar to the ideas used today. The no-eval issue was something Charles wanted me to test, just to see what evaluation adds to an engine...
I suspect that the rating estimate for the no-evaluation version is not correct, and that material-only Crafty simply performs better against stronger opponents than programs of similar strength (those that score near 50% against it) do.

The problem is that material-only Crafty is not a normal weaker engine: from time to time it may win by tactics against tactically weaker opponents.

To get a better estimate of material-only Crafty's rating, you need opponents that are no more than 100 Elo better than it, and you need ratings for those opponents.

This means that you need to find opponents that score no more than 60-70% against it in the first place.

Uri
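
For readers who want to relate the Elo gaps and score percentages mentioned here, the following is a minimal sketch using the plain logistic Elo formula. It is an assumption that this simple model is close enough for illustration; BayesElo, which produced the tables in this thread, fits its own model with a draw parameter. The function names are illustrative only.

```python
import math

def expected_score(elo_diff):
    """Expected score for a player rated elo_diff points above the opponent
    (plain logistic Elo model, an assumption for this sketch)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_diff_from_score(score):
    """Inverse of expected_score: rating gap implied by a score in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

# A 100 Elo advantage, converted to an expected score.
print(f"100 Elo better -> {expected_score(100):.1%} expected score")

# The 60-70% score range, converted back to rating gaps.
for score in (0.60, 0.70):
    print(f"{score:.0%} score -> {elo_diff_from_score(score):.0f} Elo gap")
```

Under this model a 100 Elo advantage corresponds to a score of roughly 64%, and scores of 60% and 70% correspond to gaps of roughly 70 and 147 Elo.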
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

Did you see the opponents I used? In my current tests, Glaurung 2 is 60-70 points better than Crafty, as is the newest Toga. Fruit 2.1 is about 30-50 lower...

So I didn't use any opponents that are more than 100 Elo better, and I do not see your point. What part of the following BayesElo output do you not understand:

Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.1      2695    4    4 62256   65%  2585   20%
   2 Toga2             2695    4    3 62256   64%  2585   20%
   3 Crafty-22.9X1-1   2638    4    4 31128   51%  2629   21%
   4 Crafty-22.9X1-2   2636    5    5 31128   51%  2629   21%
   5 Fruit 2.1         2597    3    3 62256   52%  2585   22%
   6 Crafty-22.9X2-2   2596    4    4 31128   45%  2629   21%
   7 Crafty-22.9X3-2   2596    4    4 31128   46%  2629   20%
   8 Crafty-22.9X2-1   2594    4    5 31128   45%  2629   21%
   9 Crafty-22.9X3-1   2591    4    5 31128   45%  2629   20%
  10 Glaurung 1.1 SMP  2530    3    4 62256   43%  2585   19%
  11 Crafty-22.9X4-1   2517    5    5 31128   35%  2629   19%
  12 Crafty-22.9X4-2   2514    5    5 31128   35%  2629   18% 
Glaurung 2.1 scored 65% against Crafty, Toga2 64%, Fruit 2.1 52% and Glaurung 1 43%. Note that those are combined and the actual percentages against the "best" Crafty are lower. Let me see if I can find some data...

Here you go:

Rank Name               Elo    +    - games score oppo. draws
   1 Glaurung 2.1      2664    6    6  7782   58%  2603   21% 
   2 Toga2             2663    6    6  7782   58%  2603   22% 
   3 Crafty-22.9-100   2603    4    4 31128   50%  2599   21% 
   4 Fruit 2.1         2569    6    6  7782   45%  2603   23% 
   5 Glaurung 1.1 SMP  2501    7    6  7782   36%  2603   18% 
That is the "best" current version. Glaurung 2 scores 58% against it.

Again, I don't get your point at all... None of them scored even 60% against it...
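
As a rough cross-check of the second table above, the sketch below parses its rows and compares each reported score with the score a plain logistic Elo model would predict from the "Elo" and "oppo." columns. The parsing helper and the use of the simple logistic formula are assumptions made for illustration; BayesElo's own model differs, and the reported percentages are rounded, so only approximate agreement should be expected.

```python
# Rows copied from the second BayesElo table in this post.
TABLE = """\
   1 Glaurung 2.1      2664    6    6  7782   58%  2603   21%
   2 Toga2             2663    6    6  7782   58%  2603   22%
   3 Crafty-22.9-100   2603    4    4 31128   50%  2599   21%
   4 Fruit 2.1         2569    6    6  7782   45%  2603   23%
   5 Glaurung 1.1 SMP  2501    7    6  7782   36%  2603   18%
"""

def expected_score(elo_diff):
    """Plain logistic Elo expectation (an assumption, not BayesElo's model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def parse_row(line):
    """Split one table row; engine names may contain spaces, so the seven
    numeric columns are taken from the right-hand side."""
    fields = line.split()
    elo, plus, minus, games, score, oppo, draws = fields[-7:]
    return {
        "name": " ".join(fields[1:-7]),
        "elo": int(elo),
        "games": int(games),
        "score": float(score.rstrip("%")) / 100.0,
        "oppo": int(oppo),
    }

rows = [parse_row(line) for line in TABLE.splitlines() if line.strip()]
for row in rows:
    predicted = expected_score(row["elo"] - row["oppo"])
    print(f'{row["name"]:<18} reported {row["score"]:.0%}  predicted {predicted:.1%}')
```

With these rows the predicted and reported scores agree to within about one percentage point, and no opponent is predicted to reach 60%.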
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software

Post by bob »

CRoberson wrote:
bob wrote:I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.

Crafty-22.9R01     2650    5    5 31128   51%  2644   21% 
Crafty-22.9R02     2261    5    6 31128    9%  2644    7% 
Ok, here is another test. No book. Combine that with the full Crafty
and the raw material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
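
The testing scheme described here (each starting position played twice, once with each colour, against every opponent) can be sketched as follows. The position count of 3891 is an inference, not something stated in the thread: it is half of the 7782 games per opponent shown in the tables earlier in the thread. The function and variable names are hypothetical.

```python
from itertools import product

# Opponent names as they appear in the tables in this thread.
OPPONENTS = ["Glaurung 2.1", "Toga2", "Fruit 2.1", "Glaurung 1.1 SMP"]

# Assumption: 7782 games per opponent / 2 colours = 3891 starting positions.
NUM_POSITIONS = 3891

def schedule(engine, opponents, num_positions):
    """Yield (position_index, white, black) for every game in the match."""
    for opponent, position in product(opponents, range(num_positions)):
        yield position, engine, opponent   # engine under test plays White
        yield position, opponent, engine   # same position, colours swapped

games = list(schedule("Crafty", OPPONENTS, NUM_POSITIONS))
print(len(games), "games in total")
print(len(games) // len(OPPONENTS), "games per opponent")
```

Playing every position from both sides means that an unbalanced starting position favours each engine once, so it does not bias the match result in either direction.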
Uri Blass
Posts: 10789
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hardware vs Software - test results

Post by Uri Blass »

bob wrote:
Again, I don't get your point at all... None of them scored even 60% against it...
My point is not about the best Crafty but about the rating of material-only Crafty, which is supposed to be 400 Elo weaker; I believe that the difference is bigger.

I assume that you use the same opponents, and I suspect that material-only Crafty performs relatively better against significantly stronger opponents: it usually plays weakly, but from time to time it finds some tactics and beats a stronger opponent, which is not the normal behaviour of an engine that is 400 Elo weaker.

The point is that I believe you may get a lower rating for material-only Crafty if you find some other engines that are about 400 Elo weaker than Crafty 22.9 and test material-only Crafty 22.9 against them.

Possible candidates with free source, based on the CCRL list, may be:

Phalanx XXII Reborn JA
Thor's Hammer 2.28
NanoSzachy 3.1
GreKo 5.3
Natwarlal 0.14

Uri
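
The argument in this post can be illustrated with a toy model. The sketch below assumes, purely hypothetically, that an engine plays a small share of its games at a high "tactical" strength and the rest at a much lower "positional" strength, and then asks what single rating its results would imply against a strong and against a weak opponent. All three parameters are invented for illustration and are not measurements of Crafty; the logistic Elo formula is also an assumption.

```python
import math

def expected_score(elo_diff):
    """Plain logistic Elo expectation (an assumption for this sketch)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_diff_from_score(score):
    """Rating gap implied by a score in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

# Hypothetical parameters, chosen only to illustrate the effect.
TACTICAL_SHARE = 0.10   # fraction of games decided by tactics
TACTICAL_ELO = 2700     # effective strength in those games
POSITIONAL_ELO = 2100   # effective strength in all other games

def mixture_score(opponent_elo):
    """Expected score of the two-mode engine against one opponent."""
    tactical = expected_score(TACTICAL_ELO - opponent_elo)
    positional = expected_score(POSITIONAL_ELO - opponent_elo)
    return TACTICAL_SHARE * tactical + (1.0 - TACTICAL_SHARE) * positional

def implied_rating(opponent_elo):
    """Single rating that would explain the score against this opponent."""
    return opponent_elo + elo_diff_from_score(mixture_score(opponent_elo))

for opponent_elo in (2650, 2250):
    print(f"vs {opponent_elo}: score {mixture_score(opponent_elo):.1%}, "
          f"implied rating {implied_rating(opponent_elo):.0f}")
```

With these invented numbers the implied rating is roughly 100 Elo lower when measured against the weaker opponent than against the stronger one. That is the direction of the effect suggested here; the real size of the effect, if any, would have to come from actual games.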
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Hardware vs Software - test results

Post by bob »

So now we need to pick the opponents that make it look the worst? Makes perfect sense to me. Crafty is 60 Elo below G2, 100 Elo above G1, 40 above Fruit, and 60 below Toga. And against that group, performance dropped 400, which is a huge drop IMHO. I don't have the time to start picking different opponents, and then figuring out how to make them work properly on the cluster. Against the current set, with 2 above and 2 below Crafty, the results are pretty easy to interpret. Otherwise I would need to get _every_ chess program into the mix in order to say "against computers, the evaluation is worth XXX Elo." And then there is the human problem as well.

I feel comfortable with the 400 elo drop. Whether it is 400 or 500 is sort of moot. 400 means you lose 95% of the games, which is pretty dominating. You could also restrict the program to 1 ply searches to see how important the search is to playing skill. I'd suspect that would drop way on down, losing close to 100% of the time.
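
For reference, the win, draw and loss shares behind a reported score can be recovered with a little arithmetic. The sketch below uses the 9% score and 7% draw rate from the material-only row (Crafty-22.9R02) quoted earlier in the thread, together with the plain logistic Elo expectation for a 400-point deficit. The helper name is hypothetical, and the table percentages are rounded, so the results are approximate.

```python
def win_draw_loss(score, draw_rate):
    """Split a score into win/draw/loss fractions.
    score = wins + draws / 2, so wins = score - draws / 2."""
    wins = score - draw_rate / 2.0
    losses = 1.0 - wins - draw_rate
    return wins, draw_rate, losses

# Figures from the Crafty-22.9R02 row: 9% score, 7% draws.
wins, draws, losses = win_draw_loss(0.09, 0.07)
print(f"wins {wins:.1%}, draws {draws:.1%}, losses {losses:.1%}")

# Expected score for the side that is 400 Elo weaker (logistic model).
weaker_expected = 1.0 / (1.0 + 10.0 ** (400.0 / 400.0))
print(f"expected score at -400 Elo: {weaker_expected:.1%}")
```

On these figures the material-only version wins about 5.5% of its games, draws 7% and loses about 87.5%, and the logistic model gives an expected score of about 9.1% for a 400 Elo deficit, in line with the reported 9%.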
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: Hardware vs Software - test results

Post by BubbaTough »

My point is not about the best Crafty but about the rating of material-only Crafty, which is supposed to be 400 Elo weaker; I believe that the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't really care enough personally about measuring this that I would actually spend a bunch of cycles testing it, just writing to tip my hat to a good catch by Uri (I think).

-Sam
CRoberson
Posts: 2091
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Hardware vs Software

Post by CRoberson »

bob wrote:
CRoberson wrote:
bob wrote:I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.

Crafty-22.9R01     2650    5    5 31128   51%  2644   21% 
Crafty-22.9R02     2261    5    6 31128    9%  2644    7% 
Ok, here is another test. No book. Combine that with the full Crafty
and the raw material Crafty.
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
Yes, I forgot about that. What about position learning? Is that on or off?
If on, how about turning it off - I think learning would skew the results.
Uri Blass
Posts: 10789
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Hardware vs Software - test results

Post by Uri Blass »

BubbaTough wrote:
My point is not about the best Crafty but about the rating of material-only Crafty, which is supposed to be 400 Elo weaker; I believe that the difference is bigger.
I think it is quite plausible that Uri is right. With no eval, crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't really care enough personally about measuring this that I would actually spend a bunch of cycles testing it, just writing to tip my hat to a good catch by Uri (I think).

-Sam
Yes, not making progress is part of the problem.

Another problem is simply getting an inferior position and being killed positionally. Seeing more than your opponent does not help then, because all you see is that you are losing, sooner than the opponent sees it.

It happens in only some of the games, but when it happens you can lose even to engines that are 1000 Elo weaker.

The point is that sometimes you may win or draw against relatively stronger engines because the search finds a big material win, and sometimes you can lose even to weak engines because you get a bad position and the search only helps you see that you are losing before the opponent does.

Uri