Mr. Hyatt
Am I to understand that Crafty 8.11 used only material for its evaluation?
Bill
Hardware vs Software
Moderator: Ras
-
- Posts: 3562
- Joined: Thu Mar 09, 2006 3:54 am
- Location: San Jose, California
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software - test results
Bill Rogers wrote:
Mr. Hyatt
Am I to understand that Crafty 8.11 used only material for its evaluation?
Bill

No. It had an evaluation that is fairly similar to the ideas used today. The no-eval issue was something Charles wanted me to test, just to see what evaluation adds to an engine...
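For readers unfamiliar with the term, a "material only" evaluation can be pictured with the minimal sketch below. It is an illustration only: the piece values and the function name `material_score` are assumptions chosen for the example, not Crafty's actual code or numbers.

```python
# Hypothetical piece values in centipawns (assumed, not Crafty's real values).
PIECE_VALUES = {"P": 100, "N": 300, "B": 300, "R": 500, "Q": 900, "K": 0}


def material_score(fen_board: str) -> int:
    """Return the material balance from White's point of view.

    fen_board is the piece-placement field of a FEN string:
    uppercase letters are White pieces, lowercase are Black.
    """
    score = 0
    for ch in fen_board:
        value = PIECE_VALUES.get(ch.upper())
        if value is None:
            continue  # digits and '/' separators carry no material
        score += value if ch.isupper() else -value
    return score


if __name__ == "__main__":
    start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR"
    print(material_score(start))
```

An engine using only this score still searches normally; it simply has no positional terms to choose between moves that leave material equal.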
-
- Posts: 10789
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Hardware vs Software - test results
bob wrote:
No. It had an evaluation that is fairly similar to the ideas used today. The no-eval issue was something Charles wanted me to test, just to see what evaluation adds to an engine...

I suspect that the rating estimate is not correct for the no-evaluation version, and that material-only Crafty simply performs better against stronger opponents than programs of similar strength (ones that score near 50% against it) do.
The problem is that material-only Crafty is not a normal weaker engine, and it may from time to time win by some tactics against tactically weaker opponents.
In order to get a better estimate of material-only Crafty's rating, you need opponents that are not more than 100 Elo better than it, and you need to get ratings for them.
This means that you need to find opponents that score not more than 60-70% against it in the first place.
Uri
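As a rough check on how a score percentage maps to a rating gap, here is a small sketch using the standard logistic Elo formula. That formula is an assumption for illustration; BayesElo's own model also accounts for draws, so its numbers can differ somewhat. The function name `elo_diff_from_score` is made up for the example.

```python
import math


def elo_diff_from_score(score: float) -> float:
    """Rating difference implied by an expected score (logistic Elo model)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


if __name__ == "__main__":
    for pct in (0.50, 0.60, 0.64, 0.70):
        print(f"{pct:.0%} -> {elo_diff_from_score(pct):+.0f} Elo")
```

Under this model a 100 Elo advantage corresponds to a score of roughly 64%, which sits inside the 60-70% range mentioned above.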
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software - test results
Uri Blass wrote:
I suspect that the rating estimate is not correct for the no-evaluation version, and that material-only Crafty simply performs better against stronger opponents than programs of similar strength (ones that score near 50% against it) do.
The problem is that material-only Crafty is not a normal weaker engine, and it may from time to time win by some tactics against tactically weaker opponents.
In order to get a better estimate of material-only Crafty's rating, you need opponents that are not more than 100 Elo better than it, and you need to get ratings for them.
This means that you need to find opponents that score not more than 60-70% against it in the first place.
Uri

Did you see the opponents I used? In my current tests, Glaurung 2 is 60-70 points better than Crafty, as is the newest Toga. Fruit 2.1 is about 30-50 lower...
So I didn't use any opponents that are more than 100 Elo better, and I do not see your point. What part of the following BayesElo output do you not understand:
Rank Name              Elo  +  - games score oppo. draws
   1 Glaurung 2.1     2695  4  4 62256   65%  2585   20%
   2 Toga2            2695  4  3 62256   64%  2585   20%
   3 Crafty-22.9X1-1  2638  4  4 31128   51%  2629   21%
   4 Crafty-22.9X1-2  2636  5  5 31128   51%  2629   21%
   5 Fruit 2.1        2597  3  3 62256   52%  2585   22%
   6 Crafty-22.9X2-2  2596  4  4 31128   45%  2629   21%
   7 Crafty-22.9X3-2  2596  4  4 31128   46%  2629   20%
   8 Crafty-22.9X2-1  2594  4  5 31128   45%  2629   21%
   9 Crafty-22.9X3-1  2591  4  5 31128   45%  2629   20%
  10 Glaurung 1.1 SMP 2530  3  4 62256   43%  2585   19%
  11 Crafty-22.9X4-1  2517  5  5 31128   35%  2629   19%
  12 Crafty-22.9X4-2  2514  5  5 31128   35%  2629   18%
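Glaurung 2.1 scored 65% against Crafty, Toga2 64%, Fruit 2.1 52% and Glaurung 1 43%. Note that those are combined, and the actual percentages against the "best" Crafty are lower. Let me see if I can find some data...
Here you go: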
Rank Name              Elo  +  - games score oppo. draws
   1 Glaurung 2.1     2664  6  6  7782   58%  2603   21%
   2 Toga2            2663  6  6  7782   58%  2603   22%
   3 Crafty-22.9-100  2603  4  4 31128   50%  2599   21%
   4 Fruit 2.1        2569  6  6  7782   45%  2603   23%
   5 Glaurung 1.1 SMP 2501  7  6  7782   36%  2603   18%
That is the "best" current version. Against Glaurung 2, Crafty is losing 58%.
Again, I don't get your point at all... None of them scored even 60% against it...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software
CRoberson wrote:
bob wrote:
I ran this overnight. I simply made Evaluate() return the material score only. It was almost exactly a 400 point drop in Elo from the version with the most recent evaluation.
Crafty-22.9R01 2650 5 5 31128 51% 2644 21%
Crafty-22.9R02 2261 5 6 31128  9% 2644  7%

Ok, here is another test. No book. Combine that with the full Crafty and the raw material Crafty.

I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).
-
- Posts: 10789
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Hardware vs Software - test results
bob wrote:
Did you see the opponents I used? In my current tests, Glaurung 2 is 60-70 points better than Crafty, as is the newest Toga. Fruit 2.1 is about 30-50 lower...
So I didn't use any opponents that are more than 100 Elo better, and I do not see your point. What part of the following BayesElo output do you not understand:
Rank Name              Elo  +  - games score oppo. draws
   1 Glaurung 2.1     2695  4  4 62256   65%  2585   20%
   2 Toga2            2695  4  3 62256   64%  2585   20%
   3 Crafty-22.9X1-1  2638  4  4 31128   51%  2629   21%
   4 Crafty-22.9X1-2  2636  5  5 31128   51%  2629   21%
   5 Fruit 2.1        2597  3  3 62256   52%  2585   22%
   6 Crafty-22.9X2-2  2596  4  4 31128   45%  2629   21%
   7 Crafty-22.9X3-2  2596  4  4 31128   46%  2629   20%
   8 Crafty-22.9X2-1  2594  4  5 31128   45%  2629   21%
   9 Crafty-22.9X3-1  2591  4  5 31128   45%  2629   20%
  10 Glaurung 1.1 SMP 2530  3  4 62256   43%  2585   19%
  11 Crafty-22.9X4-1  2517  5  5 31128   35%  2629   19%
  12 Crafty-22.9X4-2  2514  5  5 31128   35%  2629   18%
Glaurung 2.1 scored 65% against Crafty, Toga2 64%, Fruit 2.1 52% and Glaurung 1 43%. Note that those are combined, and the actual percentages against the "best" Crafty are lower. Let me see if I can find some data...
Here you go:
Rank Name              Elo  +  - games score oppo. draws
   1 Glaurung 2.1     2664  6  6  7782   58%  2603   21%
   2 Toga2            2663  6  6  7782   58%  2603   22%
   3 Crafty-22.9-100  2603  4  4 31128   50%  2599   21%
   4 Fruit 2.1        2569  6  6  7782   45%  2603   23%
   5 Glaurung 1.1 SMP 2501  7  6  7782   36%  2603   18%
That is the "best" current version. Against Glaurung 2, Crafty is losing 58%.
Again, I don't get your point at all... None of them scored even 60% against it...

My point is not about the best Crafty but about the rating of material-only Crafty, which is supposed to be 400 Elo weaker; I believe that the difference is bigger.
I assume that you use the same opponents, and I suspect that material-only Crafty performs relatively better against significantly stronger opponents because it usually plays weakly but from time to time finds some tactics and beats stronger opponents, which is not the normal behaviour of an engine 400 Elo weaker.
The point is that I believe you may get a lower rating for material-only Crafty if you find some other engines that are nearly 400 Elo weaker than Crafty 22.9 and test material-only Crafty 22.9 against them.
Possible candidates with free source, based on the CCRL list, may be:
Phalanx XXII Reborn JA
Thor's Hammer 2.28
NanoSzachy 3.1
GreKo 5.3
Natwarlal 0.14
Uri
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Hardware vs Software - test results
Uri Blass wrote:
My point is not about the best Crafty but about the rating of material-only Crafty, which is supposed to be 400 Elo weaker; I believe that the difference is bigger.
I assume that you use the same opponents, and I suspect that material-only Crafty performs relatively better against significantly stronger opponents because it usually plays weakly but from time to time finds some tactics and beats stronger opponents, which is not the normal behaviour of an engine 400 Elo weaker.
The point is that I believe you may get a lower rating for material-only Crafty if you find some other engines that are nearly 400 Elo weaker than Crafty 22.9 and test material-only Crafty 22.9 against them.
Possible candidates with free source, based on the CCRL list, may be:
Phalanx XXII Reborn JA
Thor's Hammer 2.28
NanoSzachy 3.1
GreKo 5.3
Natwarlal 0.14
Uri

So now we need to pick the opponents that make it look the worst? Makes perfect sense to me. Crafty is 60 Elo below G2, 100 Elo above G1, 40 above Fruit, and 60 below Toga. And against that group, performance dropped 400, which is a huge drop IMHO. I don't have the time to start picking different opponents and then figuring out how to make them work properly on the cluster. Against the current set, with 2 above and 2 below Crafty, the results are pretty easy to interpret. Otherwise I would need to get _every_ chess program into the mix in order to say "against computers, the evaluation is worth XXX Elo." And then there is the human problem as well.
I feel comfortable with the 400 elo drop. Whether it is 400 or 500 is sort of moot. 400 means you lose 95% of the games, which is pretty dominating. You could also restrict the program to 1 ply searches to see how important the search is to playing skill. I'd suspect that would drop way on down, losing close to 100% of the time.
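For reference, the expected score at a given rating gap can be sketched with the standard logistic Elo formula. This is an assumption for illustration: BayesElo's model also handles draws, and an expected score counts a draw as half a point, so it is not the same thing as a loss percentage. The function name `expected_score` is made up for the example.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score of the stronger side for a given Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


if __name__ == "__main__":
    for diff in (400, 500):
        weaker = 1.0 - expected_score(diff)
        print(f"{diff} Elo gap: weaker side expected score {weaker:.1%}")
```

Under this formula a 400-point deficit gives the weaker side an expected score of about 9%, in line with the 9% scored by Crafty-22.9R02 in the table quoted earlier in the thread, while a 500-point deficit gives about 5%.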
-
- Posts: 1154
- Joined: Fri Jun 23, 2006 5:18 am
Re: Hardware vs Software - test results
Uri Blass wrote:
My point is not about the best Crafty but about the rating of only material Crafty that is supposed to be 400 elo weaker and I believe that the difference is bigger.

I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't really care enough personally about measuring this that I would actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam
-
- Posts: 2091
- Joined: Mon Mar 13, 2006 2:31 am
- Location: North Carolina, USA
Re: Hardware vs Software
bob wrote:
I never use a book. I always start from about 4,000 unique starting positions and play the games out from both sides (hence the 8,000 games per opponent).

Yes, I forgot about that. What about position learning? Is that on or off?
If on, how about turning it off - I think learning would skew the results.
-
- Posts: 10789
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: Hardware vs Software - test results
BubbaTough wrote:
I think it is quite plausible that Uri is right. With no eval, Crafty may have difficulty making progress in many positions and may settle for draws against many weak engines. By not having weak engines to play against, you get better results than you should. I don't really care enough personally about measuring this that I would actually spend a bunch of cycles testing it; I'm just writing to tip my hat to a good catch by Uri (I think).
-Sam

Yes.
Not making progress is part of the problem.
Another problem is simply getting an inferior position and being killed positionally; seeing more than your opponent does not help then, because you only see sooner that you are losing.
It happens in only some of the games, but when it does, you can lose even to engines that are 1000 Elo weaker.
The point is that sometimes you may win or draw against relatively stronger engines because the search finds a big material win, and sometimes you can even lose against weak engines because you get a bad position and the search only helps you see, sooner than the opponent does, that you are losing.
Uri