Unified eval tournament?

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

Uri Blass
Posts: 10892
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Unified eval tournament?

Post by Uri Blass »

Tord Romstad wrote:
Michael Sherwin wrote:
Tord Romstad wrote:
cyberfish wrote:Ah thanks!

We just need to get a few more people now...

I just implemented the simplified eval in my engine, and in ~2-second (limited-depth) games, it's 52-72 Elo points weaker.
That's far less than I would have thought. What does your evaluation contain, apart from material and piece square tables?

I just finished a quick Silver match between the normal version of my program and an otherwise identical version with the evaluation function replaced by Tomasz Michniewski's piece square table evaluation:

Code: Select all

Glaurung 090122: 86.5 (+81,=11,-8)
Glaurung UFO 090122: 13.5 (+8,=11,-81)
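
For anyone who hasn't seen it, the simplified eval is just material plus piece-square tables. A minimal sketch of the scheme, with invented values rather than Tomasz's actual tables:

Code: Select all

// Material + piece-square-table evaluation, sketched for illustration.
// The values are invented; Tomasz Michniewski's actual tables differ.
// board[sq]: 0 = empty, 1..6 = white P,N,B,R,Q,K, -1..-6 = black.
static const int PieceValue[7] = { 0, 100, 320, 330, 500, 900, 0 };
static int Psqt[7][64];  // one 64-entry table per piece type, white's view

int evaluate(const int board[64], bool whiteToMove) {
    int score = 0;
    for (int sq = 0; sq < 64; ++sq) {
        int p = board[sq];
        if (p > 0)        // white piece: use the square as-is
            score += PieceValue[p] + Psqt[p][sq];
        else if (p < 0)   // black piece: mirror the square vertically
            score -= PieceValue[-p] + Psqt[-p][sq ^ 56];
    }
    return whiteToMove ? score : -score;  // from the side to move
}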
Tord
I have not read this thread, so sorry if this has already been asked.

By about how many plies would you have to slow down the search to get an even result?
I have no idea. 300 Elo points should normally correspond to about five plies in the middle game, but it's difficult to believe that the evaluation function can really be worth that much.

Tord
I guess that there is no constant number here, because the version with the simple evaluation does relatively better at shorter time controls.

I expect 1-ply normal Glaurung to lose against 6-ply simple evaluation, but I expect 16-ply normal Glaurung to beat 21-ply simple evaluation.

I also do not believe that you get only 60 Elo per ply when we talk about small depths; I believe the difference is bigger than that.

The difference may be 60 Elo per ply at a long time control, but it may also be 100 Elo per ply at the blitz time control that you are playing.
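
To put numbers on it, the usual logistic model converts an Elo difference into an expected score; a small illustrative sketch (just the standard formula, not engine code):

Code: Select all

#include <cmath>
#include <cstdio>

// Expected score for a player rated `diff` Elo above the opponent,
// under the usual logistic rating model.
double expectedScore(double diff) {
    return 1.0 / (1.0 + std::pow(10.0, -diff / 400.0));
}

int main() {
    std::printf("%.1f\n", 100.0 * expectedScore(60));   // 1 ply at 60 Elo: ~58.5%
    std::printf("%.1f\n", 100.0 * expectedScore(100));  // 1 ply at 100 Elo: ~64.0%
    std::printf("%.1f\n", 100.0 * expectedScore(300));  // 5 plies at 60 Elo: ~84.9%
}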

Uri
Uri Blass
Posts: 10892
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Unified eval tournament?

Post by Uri Blass »

Tord Romstad wrote:
Allard Siemelink wrote:Here are the results of the match at 40 moves/40 seconds repeating:
18.14% elo=-263, +38 -261 =51

Indeed the results of the simple eval version have gone down, as I expected.
Yet, it still scores better than Glaurung's 13.5%.
Tord, may I ask what time control you used?
I used 1 minute/game, with a 0.5 second increment. Perhaps I played too few games, or perhaps my evaluation function is better than I think.

Tord
Another possibility is that your search is tuned for your own evaluation function rather than for the simple evaluation.

If you want to compare only the evaluations, then results at fixed depth, with no pruning based on evaluation, give better data.

Note that even null-move pruning is pruning based on evaluation, and it may change the Elo difference between different evaluations (I do not know whether it is going to increase or reduce it).
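
For example, a typical null-move implementation looks something like this (a sketch with invented names and margins, not any particular engine): the static evaluation both triggers the null move and scores the leaves of the verification search.

Code: Select all

// Alpha-beta search with a typical null-move pruning condition.
// Sketch only: Position, evaluate() and the constants are invented.
int search(Position& pos, int depth, int alpha, int beta) {
    if (depth <= 0)
        return evaluate(pos);                 // eval at the leaves

    // Null-move pruning: only tried when the static eval already
    // fails high, so the pruning decision depends on the evaluation.
    if (!pos.inCheck() && depth >= 3 && evaluate(pos) >= beta) {
        pos.makeNullMove();
        int score = -search(pos, depth - 3, -beta, -beta + 1);  // R = 2
        pos.undoNullMove();
        if (score >= beta)
            return beta;                      // prune this node
    }

    // ... normal move loop goes here ...
    return alpha;
}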

Uri
hgm
Posts: 28387
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Unified eval tournament?

Post by hgm »

Tord Romstad wrote:I have no idea. 300 Elo points should normally correspond to about five plies in the middle game, but it's difficult to believe that the evaluation function can really be worth that much.
The difference between a reasonably good (e.g. TSCP-like) and a superb evaluation can probably not be worth that much. But I guess it is easy enough to spoil an evaluation by any number of ply by making a very poor one. (Flipping the sign would do a good job at that, for example. :lol: ) I expect this one to be quite poor even for its complexity class (i.e. poorly tuned).
Tord Romstad
Posts: 1808
Joined: Wed Mar 08, 2006 9:19 pm
Location: Oslo, Norway

Re: Unified eval tournament?

Post by Tord Romstad »

hgm wrote:
Tord Romstad wrote:I have no idea. 300 Elo points should normally correspond to about five plies in the middle game, but it's difficult to believe that the evaluation function can really be worth that much.
The difference between a reasonably good (e.g. TSCP-like) and a superb evaluation can probably not be worth that much. But I guess it is easy enough to spoil an evaluation by any number of ply by making a very poor one. (Flipping the sign would do a good job at that, for example. :lol: ) I expect this one to be quite poor even for its complexity class (i.e. poorly tuned).
You are right that Tomasz's piece square table evaluation is almost certainly worse than TSCP's more complex evaluation, but on the other hand my own eval is very far from qualifying for the label "superb". Most of the evaluation weights are taken out of thin air and are completely untested, and I have not even verified that most of my evaluation terms improve the strength. I assume the same is true for most other amateur programs.

I am a little bit surprised that Bright seems to lose less strength than Glaurung with the simplified evaluation function. I've never seen Bright in action (the last time I checked, neither the source code nor a Mac OS X binary was available), but as they are amateur engines of comparable strength, I would assume the quality of their searches and evaluation functions to be roughly equal. I think it is most likely that I just played too few test games, and that Glaurung UFO would have performed better in a longer match.

Tord
Stan Arts

Re: Unified eval tournament?

Post by Stan Arts »

Ah well, against some pawn structure and king safety knowledge UFO doesn't stand a chance. In a quick test, for a moment I thought my normal version would score 100%, but UFO got a few points with tactical shots at the fast time control I tested at, 3+1. (For me, UFO runs about 1.5x faster, and the PSQTs don't do much to upset the PV or search, frequently adding up to an extra ply.)

Perhaps against the same engine these things get magnified, and in a field of engines the difference would be less. But that's just a guess.

Stan
Uri Blass
Posts: 10892
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Unified eval tournament?

Post by Uri Blass »

Stan Arts wrote:Ah well, against some pawn structure and king safety knowledge UFO doesn't stand a chance. In a quick test, for a moment I thought my normal version would score 100%, but UFO got a few points with tactical shots at the fast time control I tested at, 3+1. (For me, UFO runs about 1.5x faster, and the PSQTs don't do much to upset the PV or search, frequently adding up to an extra ply.)

Perhaps against the same engine these things get magnified, and in a field of engines the difference would be less. But that's just a guess.

Stan
At a very fast time control, UFO is going to win part of the games thanks to tactics.

At a very long time control, UFO is going to draw often, because it is going to find the best moves thanks to search.


Conclusion: there is some time control at which UFO gets its worst results, and both a longer and a faster time control are going to cause it to score better.

The question is whether our hardware is fast enough to find that optimal time control (if the optimal time control against UFO is one day per move, then our hardware today is too slow for serious testing).

Uri
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Unified eval tournament?

Post by Allard Siemelink »

When examining the losses against the simple-eval version, it looked like Bright often overvalued its attacks on the opponent's king, e.g. sacrificing a pawn for a failing king attack and then losing the endgame because of the missing pawn. It may be that the rewritten king safety evaluation is out of tune.
On the other hand, king safety seems to be one of the major causes of its victories as well, so I am not sure that turning down the weight will help. Tests will have to tell...
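
A cheap way to run such a test is to put a scale factor on the king-safety term and play matches at, say, 80% and 120% (an illustrative sketch, not Bright's actual code):

Code: Select all

// Illustrative only, not Bright's code: expose the king-safety
// weight as a percentage so matches can probe 80%, 100%, 120%, ...
int KingSafetyWeight = 100;  // percent, e.g. set via a UCI option

int scaledKingSafety(int rawKingSafety) {
    return rawKingSafety * KingSafetyWeight / 100;
}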

I also would have thought that the search and eval of both engines would be of comparable strength; perhaps the limited testing distorted the outcome.
But it would certainly be interesting if that is not the case, because then we would both know what to improve :)

Sorry, I can't help you with a Mac version of Bright (I do not own a Mac); perhaps you can use Wine or something similar? I guess a Linux binary would be no good either?
Tord Romstad wrote: I am a little bit surprised that Bright seems to lose less strength than Glaurung with the simplified evaluation function. I've never seen Bright in action (the last time I checked, neither the source code nor a Mac OS X binary was available), but as they are amateur engines of comparable strength, I would assume the quality of their searches and evaluation functions to be roughly equal. I think it is most likely that I just played too few test games, and that Glaurung UFO would have performed better in a longer match.

Tord
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Unified eval tournament?

Post by Allard Siemelink »

I tuned Bright's eval a bit; the results are now clearly better and in the same ballpark as Glaurung's:

simple eval scores 11.2% for -359 elo in 250 games at 64k nodes/move (~5 seconds/game):
+194/250=88.80% elo=+359, +211 -17 =22

and
simple eval scores 14.5% for -300 elo in 142 games at 40 moves/40 seconds
+120/142=84.5% elo=+301, +112 -14 =16

Considering that the NPS of the simple-eval version is double the NPS of the version with the real eval, adding ~60 Elo, the results match almost suspiciously well.
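
The Elo figures come from inverting the usual logistic formula; a small sketch that approximately reproduces them from the raw scores (the small gap to the reported +301 is presumably just a slightly different model or rounding):

Code: Select all

#include <cmath>
#include <cstdio>

// Elo difference implied by a score fraction p, from inverting the
// logistic model: diff = 400 * log10(p / (1 - p)).
double eloFromScore(double p) {
    return 400.0 * std::log10(p / (1.0 - p));
}

int main() {
    // 64k nodes/move match: (211 + 22/2) / 250 = 88.8%
    std::printf("%+.0f\n", eloFromScore(0.888));  // about +360
    // 40 moves/40 seconds match: (112 + 16/2) / 142 = 84.5%
    std::printf("%+.0f\n", eloFromScore(0.845));  // about +295
}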
Allard Siemelink wrote:Here are the results of the match at 40 moves/40 seconds repeating:
18.14% elo=-263, +38 -261 =51

Indeed the results of the simple eval version have gone down, as I expected.
Yet, it still scores better than Glaurung's 13.5%.
Allard Siemelink wrote:Bright's numbers are a little less pronounced than Glaurung's, but the simple eval is still ~200 Elo worse than its own.

Here are the results of a 3000 game match (4096 nodes/move) that just finished:
22.67% elo=-213, +562 -2202 =236

I think I'll run a match with longer time controls to see if that yields different results.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Unified eval tournament?

Post by Edsel Apostol »

Allard Siemelink wrote:I tuned Bright's eval a bit; the results are now clearly better and in the same ballpark as Glaurung's:

simple eval scores 11.2% for -359 elo in 250 games at 64k nodes/move (~5 seconds/game):
+194/250=88.80% elo=+359, +211 -17 =22

and
simple eval scores 14.5% for -300 elo in 142 games at 40 moves/40 seconds
+120/142=84.5% elo=+301, +112 -14 =16

Considering that the NPS of the simple-eval version is double the NPS of the version with the real eval, adding ~60 Elo, the results match almost suspiciously well.
Allard Siemelink wrote:Here are the results of the match at 40 moves/40 seconds repeating:
18.14% elo=-263, +38 -261 =51

Indeed the results of the simple eval version have gone down, as I expected.
Yet, it still scores better than Glaurung's 13.5%.
Allard Siemelink wrote:Bright's numbers are a little less pronounced than Glaurung's, but the simple eval is still ~200 Elo worse than its own.

Here are the results of a 3000 game match (4096 nodes/move) that just finished:
22.67% elo=-213, +562 -2202 =236

I think I'll run a match with longer time controls to see if that yields different results.
Hi Allard,

What GUI did you use to test with fixed nodes? I'm using Arena, and it does not seem fast enough for me: a game can sometimes last 30 seconds even though I'm using fewer nodes than you are. My engine's NPS is slightly lower than yours, and my hardware is old.

By the way, have you tried your experiment above with a fixed depth of, for example, 1? Which is more reliable in your opinion for testing changes in eval, fixed depth or fixed nodes?
Allard Siemelink
Posts: 297
Joined: Fri Jun 30, 2006 9:30 pm
Location: Netherlands

Re: Unified eval tournament?

Post by Allard Siemelink »

Hi Edsel,

Arena is indeed too slow, so I rolled my own. It is a command-line thingy built into Bright itself: basically, it starts some UCI engine and then talks UCI to play games against the hosting Bright exe.

I have not tried fixed depth=1 matches; I would think they are prone to simple tactical traps and unreliable for endgames. The 4096-node searches still reach depth=4 on average, and a fixed-node match will actually search deeper during the endgame, as in real games.
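
The core of such a tester is not much code. A rough POSIX sketch of driving one engine at a fixed node count (illustrative only, not Bright's actual tester; error handling omitted):

Code: Select all

#include <cstdio>
#include <cstring>
#include <unistd.h>

int main() {
    int toEngine[2], fromEngine[2];
    pipe(toEngine);
    pipe(fromEngine);
    if (fork() == 0) {                    // child: become the engine
        dup2(toEngine[0], 0);             // engine's stdin  <- our commands
        dup2(fromEngine[1], 1);           // engine's stdout -> our reader
        execlp("./engine", "./engine", (char*)0);
        _exit(1);
    }
    FILE* in  = fdopen(fromEngine[0], "r");
    FILE* out = fdopen(toEngine[1], "w");
    setvbuf(out, 0, _IONBF, 0);           // send each command immediately

    char line[4096];
    std::fprintf(out, "uci\nisready\n");
    while (std::fgets(line, sizeof line, in))
        if (std::strstr(line, "readyok")) // wait for the handshake
            break;

    // One move from the start position at exactly 4096 nodes.
    std::fprintf(out, "position startpos\ngo nodes 4096\n");
    while (std::fgets(line, sizeof line, in))
        if (std::strstr(line, "bestmove")) {
            std::printf("%s", line);      // e.g. "bestmove e2e4"
            break;
        }
    std::fprintf(out, "quit\n");
}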

Edsel Apostol wrote: Hi Allard,

What GUI did you use to test with fixed nodes? I'm using Arena, and it does not seem fast enough for me: a game can sometimes last 30 seconds even though I'm using fewer nodes than you are. My engine's NPS is slightly lower than yours, and my hardware is old.

By the way, have you tried your experiment above with a fixed depth of, for example, 1? Which is more reliable in your opinion for testing changes in eval, fixed depth or fixed nodes?