testing, again. Glaurung 2 change

krazyken

Re: error in testing...

Post by krazyken »

2.1 came with a lot more options, with several default values changed. For sp, the biggest changes were probably the "single reply extensions"
Uri Blass
Posts: 10909
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: testing, again. Glaurung 2 change

Post by Uri Blass »

bob wrote:I just realized what you are saying. You do realize that each of these sets of games is just 4,000 games long? And looking at the results against one program is falling right back into the random result issue as in the past. Look at the error bars for each program (except for crafty.) Nothing looks unusual at all. I was simply pointing out that in the first two of these runs I made, G2.1 was doing somewhat worse.
I disagree that nothing looks unusual.
Worse performance by Glaurung 2.1 is certainly unusual, because 2.1 is better based on many games of CCRL or CEGT.
In addition, the difference between Fruit 2.1's ratings seems to be unusual:

2 Fruit 2.1 68 9 9 3894 64% -34 24% (+102 elo relative to Crafty)
5 Crafty-22.2 -34 5 5 19470 44% 7 23%

2 Fruit 2.1 52 11 11 2267 60% -22 24% (+74 elo relative to Crafty)
5 Crafty-22.2 -22 5 6 11344 46% 4 24%

If we ignore Crafty's games against non-Fruit programs, then
the effective difference in performance against Crafty is 28 Elo,
and it certainly seems unusual.
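
For readers following the arithmetic, here is a minimal sketch of where the +102, +74 and 28 come from. It assumes the plain logistic Elo formula, which may not be exactly the model the rating tool used; the function and variable names are only illustrative.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score for the stronger side under the plain logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# (Fruit rating, Crafty rating) as listed for the two runs above
runs = {"first run": (68, -34), "second run": (52, -22)}

# Gap between Fruit and Crafty inside each rating list
gaps = {name: fruit - crafty for name, (fruit, crafty) in runs.items()}
for name, gap in gaps.items():
    print(f"{name}: Fruit - Crafty = {gap:+d} Elo, "
          f"expected score {expected_score(gap):.1%}")

# How much the gap moved between the two runs
shift = gaps["first run"] - gaps["second run"]
print(f"change in the gap between runs: {shift} Elo")
```

Under that assumption the two gaps correspond to expected scores of roughly 64% and 60%, which lines up with the score column in the two lists.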

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, again. Glaurung 2 change

Post by bob »

Uri Blass wrote:
bob wrote:I just realized what you are saying. You do realize that each of these sets of games is just 4,000 games long? And looking at the results against one program is falling right back into the random result issue as in the past. Look at the error bars for each program (except for crafty.) Nothing looks unusual at all. I was simply pointing out that in the first two of these runs I made, G2.1 was doing somewhat worse.
I disagree that nothing looks unusual.
Worse performance by Glaurung 2.1 is certainly unusual, because 2.1 is better based on many games of CCRL or CEGT.
In addition, the difference between Fruit 2.1's ratings seems to be unusual:

Read my other post. I had not noticed that Tord had significantly changed the polyglot.ini file and I was running the new version of Glaurung with the old 2.0 e5 file. I am re-running the tests now.

As to the difference in Fruit's rating, are we now going to flip sides where you say even if the new rating is inside the error bar for the old one, that it is "wrong"???

I have only been pointing out the cases where the two ratings +/- the error do not overlap at all. Here they do.


2 Fruit 2.1 68 9 9 3894 64% -34 24% (+102 elo relative to Crafty)
5 Crafty-22.2 -34 5 5 19470 44% 7 23%

2 Fruit 2.1 52 11 11 2267 60% -22 24% (+74 elo relative to Crafty)
5 Crafty-22.2 -22 5 6 11344 46% 4 24%

If we ignore Crafty's games against non-Fruit programs, then
the effective difference in performance against Crafty is 28 Elo,
and it certainly seems unusual.

Uri
It doesn't to me, considering that there are 2,000 positions and roughly 4,000 games. 68 - 9 and 52 + 11 are certainly within expectation.
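
The overlap criterion described here can be written down as a small check. This is only a sketch of the "do the two ratings +/- the error overlap" test applied to the Fruit 2.1 numbers quoted above; the helper names are made up for illustration.

```python
def interval(elo: float, plus: float, minus: float) -> tuple:
    """Rating interval from the Elo, + and - columns of a rating list."""
    return (elo - minus, elo + plus)

def overlaps(a: tuple, b: tuple) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

fruit_first = interval(68, 9, 9)      # lower end is 68 - 9
fruit_second = interval(52, 11, 11)   # upper end is 52 + 11

print(fruit_first, fruit_second, overlaps(fruit_first, fruit_second))
```

The lower end of the first interval (59) is below the upper end of the second (63), so by this criterion the two Fruit ratings overlap.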
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: error in testing...

Post by bob »

Dirt wrote:
bob wrote:I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1.

I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?
There are some scoring parameters. For example, old:

Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130


new:

Mobility (Middle Game) = 100
Mobility (Endgame) = 100
Aggressiveness = 100
Cowardice = 100
Pawn Structure (Middle Game) = 100
Pawn Structure (Endgame) = 100
Passed Pawns (Middle Game) = 100
Passed Pawns (Endgame) = 100


I always try to run engines with "default settings", and I probably should just remove all of those completely, but they are "as distributed". I have Crafty set so that default settings are optimal, so that setting hash size and max threads is enough....
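
A mismatch like this between two .ini files can be spotted mechanically. The sketch below uses only the excerpts listed above (not the complete files), assumes the options live in an `[Engine]` section, and compares option names case-insensitively, which may or may not match how the engine treats them.

```python
import configparser

# Excerpts quoted above; the real files contain more options.
OLD_INI = """\
[Engine]
Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130
"""

NEW_INI = """\
[Engine]
Mobility (Middle Game) = 100
Mobility (Endgame) = 100
Aggressiveness = 100
Cowardice = 100
Pawn Structure (Middle Game) = 100
Pawn Structure (Endgame) = 100
Passed Pawns (Middle Game) = 100
Passed Pawns (Endgame) = 100
"""

def engine_options(text: str) -> dict:
    """Parse the [Engine] section; configparser lower-cases option names."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return dict(parser["Engine"])

old, new = engine_options(OLD_INI), engine_options(NEW_INI)

only_old = sorted(old.keys() - new.keys())   # options the new file no longer has
only_new = sorted(new.keys() - old.keys())   # options the old file never set
changed = {k: (old[k], new[k])
           for k in sorted(old.keys() & new.keys()) if old[k] != new[k]}

print("only in old:", only_old)
print("only in new:", only_new)
print("changed:", changed)
```

On these excerpts, four option names appear only in the old file, four only in the new one, and three shared options have different values.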
Uri Blass
Posts: 10909
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: testing, again. Glaurung 2 change

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:I just realized what you are saying. You do realize that each of these sets of games is just 4,000 games long? And looking at the results against one program is falling right back into the random result issue as in the past. Look at the error bars for each program (except for crafty.) Nothing looks unusual at all. I was simply pointing out that in the first two of these runs I made, G2.1 was doing somewhat worse.
I disagree that nothing looks unusual.
Worse performance by Glaurung 2.1 is certainly unusual, because 2.1 is better based on many games of CCRL or CEGT.
In addition, the difference between Fruit 2.1's ratings seems to be unusual:

Read my other post. I had not noticed that Tord had significantly changed the polyglot.ini file and I was running the new version of Glaurung with the old 2.0 e5 file. I am re-running the tests now.

As to the difference in Fruit's rating, are we now going to flip sides where you say even if the new rating is inside the error bar for the old one, that it is "wrong"???

I have only been pointing out the cases where the two ratings +/- the error do not overlap at all. Here they do.


2 Fruit 2.1 68 9 9 3894 64% -34 24% (+102 elo relative to Crafty)
5 Crafty-22.2 -34 5 5 19470 44% 7 23%

2 Fruit 2.1 52 11 11 2267 60% -22 24% (+74 elo relative to Crafty)
5 Crafty-22.2 -22 5 6 11344 46% 4 24%

If we ignore Crafty's games against non-Fruit programs, then
the effective difference in performance against Crafty is 28 Elo,
and it certainly seems unusual.

Uri
It doesn't to me, considering that there are 2,000 positions and roughly 4,000 games. 68 - 9 and 52 + 11 are certainly within expectation.
I do not say it is impossible, but this is not a result that I would expect to happen often, and the difference is practically bigger if we give Crafty the same rating in both lists and ignore games of other programs against Crafty.

If, in calculating the rating for Fruit, you simply ignore Crafty's games against non-Fruit programs and give Crafty a constant rating of 0, you will probably get
for Fruit 102 ± 9 and 74 ± 11, which is clearly not within expectation.
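
A rough way to put a number on this hypothetical is sketched below. It takes the guessed 102 ± 9 and 74 ± 11 as given, assumes the ± columns are roughly 95% intervals (an assumption about the rating tool, not something stated in the lists), treats the two runs as independent, and ignores any uncertainty in Crafty's own rating.

```python
import math

Z95 = 1.96  # assumption: the +/- columns are roughly 95% intervals

def z_of_difference(r1: float, e1: float, r2: float, e2: float,
                    z_level: float = Z95) -> float:
    """z-score of r1 - r2, treating the two estimates as independent."""
    s1, s2 = e1 / z_level, e2 / z_level   # convert error bars to std. deviations
    return (r1 - r2) / math.hypot(s1, s2)

# Hypothetical Fruit ratings with Crafty pinned at 0 in both runs
z = z_of_difference(102, 9, 74, 11)
print(f"difference = {102 - 74} Elo, z = {z:.2f}")
```

Under those assumptions a 28 Elo gap comes out at nearly four standard deviations, which is the sense in which it would not be "within expectation"; if the games are not independent, or the error bars mean something else, the figure changes.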

It is possible to explain it by more than statistical noise: Fruit may simply not play relatively well in the subset of positions
that you tested when you got the ± 11.

I do not claim that there has to be an error in your games, only that it is a possibility that I think is worth checking based on the data.

Uri
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: testing, again. Glaurung 2 change

Post by Tord Romstad »

bob wrote:Tord:

A while back you mentioned that I should move from the older 2.0 epsilon whatever to the most recent. I didn't change at the time because I didn't want to alter a constant opponent that was represented in a lot of old data.

With the new testing approach, I am now in the process of re-evaluating the opponents, and perhaps adding a few more opponents (to do a few less games per opponent to keep things close computationally).

One oddity I found is this:

crafty-22.2R5
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   115    9    9  3894   70%   -34   20% 
   2 Fruit 2.1               68    9    9  3894   64%   -34   24% 
   3 opponent-21.7           20    8    8  3894   58%   -34   34% 
   4 Glaurung 1.1 SMP        14    9    9  3894   57%   -34   20% 
   5 Crafty-22.2            -34    5    5 19470   44%     7   23% 
   6 Arasan 10.0           -184    9    9  3894   30%   -34   20% 

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2.1            95   11   11  2271   65%   -22   18%
   2 Fruit 2.1               52   11   11  2267   60%   -22   24%
   3 Glaurung 1.1 SMP        27   11   11  2263   57%   -22   21%
   4 opponent-21.7           16   11   10  2269   56%   -22   35%
   5 Crafty-22.2            -22    5    6 11344   46%     4   24%
   6 Arasan 10.0           -169   11   11  2274   30%   -22   20%
While the current test has not completed, I did run one complete run but threw it out because I accidentally replaced the wrong glaurung with the newest. But the thing I noticed is that at least for Crafty, the new glaurung is not doing quite as well as the previous version (old was 70% vs crafty, new is 65%). I will post the complete run when it finishes, but I thought it interesting. Whether it suggests that some change was not so good, or just not so good against Crafty I am not sure.
All rating lists agree that 2.1 is stronger than 2 - ε/5, so I'm quite sure the latter of your two explanations is right. This isn't a big surprise: Glaurung 2.1 has a strange and very speculative evaluation function, which often returns huge scores even in materially equal positions. You may have noticed that Glaurung 2.1 often fails to win even when it reaches scores of +3 or +4. Making the program less speculative would almost certainly make it stronger, but also less fun, which is of course more important.

Highly speculative play works better against some engines than against others. I suppose Crafty has a very sound and solid style, and excels at refuting risky play.

Tord
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: error in testing...

Post by Tord Romstad »

bob wrote:
Dirt wrote:
bob wrote:I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1.

I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?
There are some scoring parameters. For example, old:

Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130
Actually, these parameters seem to be from the polyglot.ini file of Glaurung 1. Several of the above parameters don't even exist in Glaurung 2.

I don't think using the old polyglot.ini file will hurt the strength significantly, though. I think the biggest part of the explanation for Glaurung 2.1's poor results compared to 2 - ε/5 is simply that it has problems against Crafty.
I have Crafty set so that default settings are optimal so that setting hash size and max threads is enough....
I do exactly the same in Glaurung: The default settings are identical to what is found in the polyglot.ini file supplied with Glaurung, apart from the hash size and the number of threads.

Tord
Ovyron
Posts: 4562
Joined: Tue Jul 03, 2007 4:30 am

Re: testing, again. Glaurung 2 change

Post by Ovyron »

Tord Romstad wrote:Making the program less speculative would almost certainly make it stronger, but also less fun, which is of course more important.
Your way of thinking is appreciated.
Uri Blass
Posts: 10909
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: error in testing...

Post by Uri Blass »

Tord Romstad wrote:
bob wrote:
Dirt wrote:
bob wrote:I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1.

I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?
There are some scoring parameters. For example, old:

Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130
Actually, these parameters seem to be from the polyglot.ini file of Glaurung 1. Several of the above parameters don't even exist in Glaurung 2.

I don't think using the old polyglot.ini file will hurt the strength significantly, though. I think the biggest part of the explanation for Glaurung 2.1's poor results compared to 2 - ε/5 is simply that it has problems against Crafty.
I have Crafty set so that default settings are optimal so that setting hash size and max threads is enough....
I do exactly the same in Glaurung: The default settings are identical to what is found in the polyglot.ini file supplied with Glaurung, apart from the hash size and the number of threads.

Tord
The results of Glaurung 2.1 in another thread suggest that Glaurung 2.1 has no special problems against Crafty.

I think that usually, when A is significantly stronger than B, it means
that every opponent is going to perform worse against A than against B.

You can verify it by the CCRL FRC results

http://computerchess.org.uk/ccrl/404FRC ... t_all.html

For example
Glaurung 2.1 scored better than 2.01 against common weaker opponents, and there were only 100 games in every single match:

Movei 00.8.438 70.5 − 29.5 (2.01: 67 − 33)
Pharaon 3.5.1 76.5 − 23.5 (2.01: 70.5 − 29.5)
Hamsters 0.6 2595 76.5 − 23.5 (2.01: 75 − 25)
Ufim 8.02 2590 82 − 18 (2.01: 79 − 21)
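
To compare these match scores on a rating scale, they can be converted to Elo differences. The sketch below assumes the plain logistic Elo formula and uses only the four 100-game results listed above; with so few games per match each individual number carries a wide error margin.

```python
import math

def elo_from_score(score: float) -> float:
    """Elo difference implied by a score fraction (0 < score < 1), logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

# opponent: (Glaurung 2.1 score %, Glaurung 2.01 score %), from the list above
matches = {
    "Movei 00.8.438": (70.5, 67.0),
    "Pharaon 3.5.1": (76.5, 70.5),
    "Hamsters 0.6": (76.5, 75.0),
    "Ufim 8.02": (82.0, 79.0),
}

gains = {}
for opponent, (new_pct, old_pct) in matches.items():
    gains[opponent] = elo_from_score(new_pct / 100) - elo_from_score(old_pct / 100)
    print(f"{opponent}: 2.1 gains {gains[opponent]:+.0f} Elo over 2.01")
```

All four differences come out positive, which is the pattern being pointed to: each of these opponents did worse against 2.1 than against 2.01.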

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: error in testing...

Post by bob »

Tord Romstad wrote:
bob wrote:
Dirt wrote:
bob wrote:I had forgotten about the polyglot.ini, and was using the 2.0 e5 .ini file. I am re-running the test with the polyglot.ini from 2.1 when running 2.1.

I will post new results tomorrow if nothing goes wrong tonight. Weather is a bit foul down here due to this tropical storm that is in the area, so we might lose power at some point.
What difference is there in the polyglot.ini files that would cause a substantial change?
There are some scoring parameters. For example, old:

Aggressiveness = 150
Cowardice = 100
Passed pawns = 140
Pawn structure = 150
Mobility (middle game) = 130
Mobility (endgame) = 110
Space = 100
Development = 130
Actually, these parameters seem to be from the polyglot.ini file of Glaurung 1. Several of the above parameters don't even exist in Glaurung 2.

I don't think using the old polyglot.ini file will hurt the strength significantly, though. I think the biggest part of the explanation for Glaurung 2.1's poor results compared to 2 - ε/5 is simply that it has problems against Crafty.
I have Crafty set so that default settings are optimal so that setting hash size and max threads is enough....
I do exactly the same in Glaurung: The default settings are identical to what is found in the polyglot.ini file supplied with Glaurung, apart from the hash size and the number of threads.

Tord
Actually the new one is doing much better if you have seen the data I published in the q-search checks thread. Something in the old polyglot file was not so good for the new one, not sure what...