Engine results: a surprise!

Discussion of chess software programming and technical issues.

Moderator: Ras

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Engine results: a surprise!

Post by Sven »

xmas79 wrote:the simplest thing I could think to improve its performance: pieces pseudo-mobility
What is that? Do you mean: number of pseudo-legal moves of pieces?
xmas79 wrote:"pawn distance to promotion square bonus" to encourage pawn pushing in endgames (but it actually takes that as an advice in the opening LOL, so you see h4 g4 f4 f5 and moves like that :D)
I think you should encourage pawn pushing for passers only, unless you implement much more knowledge about pawn structure (e.g. candidate pawns). Getting this right can already help to keep the pawns in front of the king after castling.
xmas79 wrote:, as they are very basic things that are really easy to implement.
But these "easy to implement" eval features can already turn your evaluation function into quite a bad one if you don't do them right.

The approach of starting with a technically clean and stable engine and then implementing a decent search first while using only a minimal evaluation function is certainly good, but in that case I suggest that you even reduce your eval to only two or at most three features for a really consistent approach:
- material (possibly including a bishop pair bonus),
- piece-square-table,
- and possibly passed pawn detection (with a rank-dependent bonus as mentioned above).
King PST (the one for the opening/middlegame) can be written in a way that encourages castling, although this will certainly need to be corrected later on when implementing king safety and being faced with the "exceptional" cases where the king is safer in the middle. Other topics for PST include centralization (e.g. knight, pawn, king in endgame), back rank penalties for knights/bishops, or bishops on long diagonals.
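
For illustration, a minimal sketch of such an opening/middlegame king PST in C++. The values here are invented for the example, not taken from any particular engine: the castled squares get a bonus, the centre a growing penalty.

Code:

// Hypothetical middlegame king PST from White's point of view,
// index 0 = a1 ... 7 = h1 ... 63 = h8, values in centipawns.
// Castled first-rank squares are rewarded, the centre penalised;
// an endgame king PST would do the opposite and reward
// centralization instead.
static const int kingPstMg[64] = {
     20, 30, 10,  0,  0, 10, 30, 20,   // rank 1: b1/g1 areas best
     10, 10,  0,  0,  0,  0, 10, 10,
    -10,-20,-20,-20,-20,-20,-20,-10,
    -20,-30,-30,-40,-40,-30,-30,-20,
    -30,-40,-40,-50,-50,-40,-40,-30,
    -30,-40,-40,-50,-50,-40,-40,-30,
    -30,-40,-40,-50,-50,-40,-40,-30,
    -30,-40,-40,-50,-50,-40,-40,-30
};

// For Black, mirror the square vertically before the lookup:
inline int kingPstScore(int sq, bool white) {
    return kingPstMg[white ? sq : sq ^ 56];
}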

I would fully drop everything else in the beginning, like endgame recognition, "piece pseudo mobility", or pawn-push encouragement for non-passers.

Sven
xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Re: Engine results: a surprise!

Post by xmas79 »

Sven Schüle wrote:I suggest to save the games as PGN. This can serve for several purposes, e.g.:
- look what really happened in the games;
- get some statistics, like average game length (my question came from thoughts about the total time of your test, i.e. 5000 games with 2 CPU minutes per game based on an average of 60 moves per game);
- check for duplicates and remove them;
- calculate an estimated relative rating.
I will do that. But before I start a new test I need more information:
1) What kind of statistics do you think are useful? Game length is OK. What else can I extract from the games?
2) How do I check for duplicates? Do I have to check whether, let's say, the first 10 moves are identical and then discard these "duplicated" games? Are there any tools that can help with this job?
3) I read I can use BayesElo & company. I will learn how to use these tools. Can you suggest one tool in particular?

Thank you,
Natale.
xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Re: Engine results: a surprise!

Post by xmas79 »

hgm wrote:I am not sure if with '1 sec/move' you really mean a fixed maximum time per move, or something like 40 moves/40 sec. Fairy-Max performs very poorly at a fixed max time/move, because of its simplistic time management. To not forfeit every game within 10 moves it has to take a huge safety margin when it starts an iteration, as it must be >99.5% sure that it can finish it (and even then it would still lose about 30% of the games on time, if they last 60 moves). As a result it uses on average only about 10% of the allotted time.
I really meant maximum time per move. I don't understand the logic, but it's OK that FM performs very badly at that time control.
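
A rough sketch of the logic hgm describes, i.e. only starting an iteration that is almost certain to finish. The growth factor is an assumption for this example, not Fairy-Max's actual code:

Code:

// Sketch: at a fixed maximum time per move, a new iteration may
// only be started if it is (nearly) certain to finish, otherwise
// the game is forfeited on time. ITER_GROWTH is an assumed guess
// at how much longer each iteration takes than the previous one.
const double ITER_GROWTH = 4.0;

bool mayStartNextIteration(double elapsed, double lastIterTime,
                           double maxTimePerMove) {
    double predictedEnd = elapsed + ITER_GROWTH * lastIterTime;
    return predictedEnd < maxTimePerMove;   // huge safety margin
}
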
Sven's question about the starting positions relates to the problem of not repeating the same game over and over again when you play engines without book. So 'starting positions' can also be read as 'book lines' here.

With Fairy-Max the problem is not that severe, as it randomizes its first few opening moves by adding a rather large random score to all moves in the root. Nevertheless, if you play it 5000 times against a non-randomizing engine, I am pretty sure there will be many duplicate games. When I use Fairy-Max for measuring piece values, it plays against itself (so both engines randomize), but even then I do Chess960-like shuffling of the initial positions to prevent duplicates.
Ok, I have to check this. I'll be ready for the next match.
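
For reference, root-move randomization in the style hgm describes could look roughly like this; the bonus range and move cutoff are invented numbers for the sketch:

Code:

#include <cstdlib>

// Sketch of opening randomization: add a random bonus to every
// root move score during the first few moves, so that repeated
// games diverge even against a deterministic opponent.
// RANDOM_RANGE and RANDOM_MOVES are assumptions for this example.
const int RANDOM_RANGE = 50;   // centipawns
const int RANDOM_MOVES = 6;    // randomize only the early opening

int randomizedRootScore(int searchScore, int gameMoveNumber) {
    if (gameMoveNumber <= RANDOM_MOVES)
        return searchScore + std::rand() % RANDOM_RANGE;
    return searchScore;
}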


Thank you,
Natale.
xmas79
Posts: 286
Joined: Mon Jun 03, 2013 7:05 pm
Location: Italy

Re: Engine results: a surprise!

Post by xmas79 »

Sven Schüle wrote:
xmas79 wrote:the simplest thing I could think to improve its performance: pieces pseudo-mobility
What is that? Do you mean: number of pseudo-legal moves of pieces?
Yes :D I mean that. Suppose one queen per side and two bishops per side. I count the pseudo-legal moves of the white queen and use this number directly as a score, adding it to the material evaluation. So if the queen has 13 moves, then the score is MATERIAL_VALUE+0.13. Then I evaluate the black queen by counting all its pseudo-legal moves and subtract that from the eval, and so on... Very simple (and probably wrong), but I used it to "randomize" the search a bit, because without it every move in the mid-game looks the same, and the engine actually played much worse... I had very simple PSTs, but I dropped them because I know they must be tuned in some way (like everything), and must be done separately for opening/mid-game/endgame, and I didn't want to do that yet. Pseudo-legal moves are ideal (I think), as each piece assumes a different value at different times during the game (except for the king, which is why it never castles). They also let the engine play a little "aggressively", because a knight that can attack opponent pieces is evaluated better than a knight in a defensive position... Only funny reasoning here...
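
In code, the scheme described above amounts to something like this; Position, materialBalance and countPseudoLegalMoves are placeholders for the engine's own types and functions:

Code:

// One centipawn per pseudo-legal move, added on top of material,
// as described above. The helpers are placeholders standing in
// for the engine's own move generator and material counter.
int evaluate(const Position &pos) {
    int score = materialBalance(pos);             // e.g. Q = 900 cp
    score += countPseudoLegalMoves(pos, WHITE);   // +1 cp per move
    score -= countPseudoLegalMoves(pos, BLACK);
    return score;   // from White's point of view
}
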
xmas79 wrote:"pawn distance to promotion square bonus" to encourage pawn pushing in endgames (but it actually takes that as an advice in the opening LOL, so you see h4 g4 f4 f5 and moves like that :D)
I think you should encourage pawn pushing for passers only, unless you implement much more knowledge about pawn structure (e.g. candidate pawns). Getting this right can already help to keep the pawns in front of the king after castling.
I simply wanted to put something into the pawn evaluation. That was the first thing I thought of, and it allowed the engine to win some endgames, because it knows that a pawn near the promotion rank has a better evaluation even if it does not have the depth to see the promotion.
xmas79 wrote:, as they are very basic things that are really easy to implement.
But these "easy to implement" eval features can already turn your evaluation function into quite a bad one if you don't do them right.

The approach of starting with a technically clean and stable engine and then implementing a decent search first while using only a minimal evaluation function is certainly good, but in that case I suggest that you even reduce your eval to only two or at most three features for a really consistent approach:
- material (possibly including a bishop pair bonus),
- piece-square-table,
- and possibly passed pawn detection (with a rank-dependent bonus as mentioned above).
King PST (the one for the opening/middlegame) can be written in a way that encourages castling, although this will certainly need to be corrected later on when implementing king safety and being faced with the "exceptional" cases where the king is safer in the middle. Other topics for PST include centralization (e.g. knight, pawn, king in endgame), back rank penalties for knights/bishops, or bishops on long diagonals.

I would fully drop everything else in the beginning, like endgame recognition, "piece pseudo mobility", or pawn-push encouragement for non-passers.

Sven
As you said, my intent was to have a bug-free working search framework, and I think that goal was achieved.
I had only a material evaluation function, but I soon recognized that I needed something to go one step further.

Remember my first post about the KNB-K endgame? How do you test the engine? Forced mates (and consequently endgames) are the most obvious positions...

Once the engine can mate in KNB-K after a long search, how do you go a step further? Add a mating table and let the search find the mate faster and faster. Then you move to KQK and KRK endgames and find that you need another table to give some hints; then you go to KBBK and check that everything is OK; then you try a simple KPK and see that it will promote, but needs some time to discover the promotion, so you add some "push that pawn!" code to the evaluation. Then you start adding a piece here and there and watch how the engine behaves, until you see no progress because every move looks the same... What to do? Run a game, or fix that stupid drawn KBP-K endgame that your engine cannot handle at short depths? ---> basic endgame recognition.

Then everything looks OK and it's time to run a game, and it is lost even before the first move is played on the board, because without a PST or something like that, 1.f4 xxx 2.Qh5 (as Henk wrote) is the same as 1.e4 xxx 2.Nf3... Stuck, just as before the endgame recognition... You see? How do you "shuffle" the search without crippling it completely? Without introducing (a lot of) instability? Considering that a piece in the middle of the board is better than one in the corner because it has more "moves", here we go: a highly dynamic PST based on the position, with a simple method. Of course, in the endgame every move looks the same... and it is far from perfect in the opening... So this still has problems... But that's another story!
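
As an illustration of the "mating table" idea mentioned above, a common hand-rolled substitute is a bonus that drives the bare king to the edge and brings the attacking king closer; the weights here are invented for the sketch:

Code:

#include <cstdlib>

// Chebyshev distance between two squares (0..63, a1 = 0).
int kingDistance(int a, int b) {
    int df = std::abs((a & 7) - (b & 7));
    int dr = std::abs((a >> 3) - (b >> 3));
    return df > dr ? df : dr;
}

// Distance of a square from the nearest edge: 0 on the rim,
// 3 in the four centre squares.
int edgeDistance(int sq) {
    int f = sq & 7, r = sq >> 3;
    int df = f < 4 ? f : 7 - f;
    int dr = r < 4 ? r : 7 - r;
    return df < dr ? df : dr;
}

// Bonus for the strong side in KQK/KRK-like endings: push the
// defending king to the edge and approach it with our own king.
// (KNB-K additionally needs to prefer the right-coloured corner.)
// The weights 10 and 5 are invented for this sketch.
int matingBonus(int strongKing, int weakKing) {
    return 10 * (3 - edgeDistance(weakKing))
         +  5 * (7 - kingDistance(strongKing, weakKing));
}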

I know there are things I may still need to tune (in the search), but my eval function was a quick hack, right or wrong... I think I really just tested my search code :D... and I need to completely wipe out my eval :D

Sorry for long posting!

Thank you,
Natale.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Engine results: a surprise!

Post by Sven »

xmas79 wrote:
Sven Schüle wrote:I suggest to save the games as PGN. This can serve for several purposes, e.g.:
- look what really happened in the games;
- get some statistics, like average game length (my question came from thoughts about the total time of your test, i.e. 5000 games with 2 CPU minutes per game based on an average of 60 moves per game);
- check for duplicates and remove them;
- calculate an estimated relative rating.
I will do that. But before I start a new test I need more information:
1) What kind of statistics do you think are useful? Game length is OK. What else can I extract from the games?
2) How do I check for duplicates? Do I have to check whether, let's say, the first 10 moves are identical and then discard these "duplicated" games? Are there any tools that can help with this job?
3) I read I can use BayesElo & company. I will learn how to use these tools. Can you suggest one tool in particular?

Thank you,
Natale.
Please look here and here to find a lot of useful tools.

1) You could check the number of draws of certain types, for instance. If 90% of all draws are stalemates then this *might* be suspicious. You could also calculate the score percentage for white, and if you get 80% instead of the usual ~55% then this is certainly suspicious, and you should try to find the reason. These are just some wild examples. Maybe this is not the most important reason for saving PGN, though.
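
A minimal sketch of extracting such a statistic from a PGN file by scanning the [Result "..."] tags; draw-type breakdowns would additionally need the termination comments, which vary by GUI:

Code:

#include <fstream>
#include <iostream>
#include <string>

// Sketch: White's score percentage from the [Result "..."] tags.
int main(int argc, char **argv) {
    if (argc < 2) return 1;
    std::ifstream in(argv[1]);
    std::string line;
    double points = 0.0;
    int games = 0;
    while (std::getline(in, line)) {
        if (line.rfind("[Result \"", 0) != 0) continue;
        if (line.find("1-0") != std::string::npos)      points += 1.0;
        else if (line.find("1/2") != std::string::npos) points += 0.5;
        else if (line.find("0-1") == std::string::npos) continue; // "*"
        ++games;
    }
    if (games > 0)
        std::cout << games << " games, White scored "
                  << 100.0 * points / games << "%\n";
    return 0;
}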

2) See the links above. With "duplicates" I mean fully identical games, not only the first 10 moves or so. I came up with this after I noticed that you did not use the typical setup of many different starting positions (and probably no opening book either), so it might be necessary to check how many different games were actually played. One of Norm Pollock's tools supports removing duplicates. The point is, if the 5000 games are in fact only 1000 different ones, then you get larger (more realistic) error bars after removing the duplicates.
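
The duplicate check itself can be as simple as hashing the complete, normalized movetext of every game. A sketch; real tools like Norm Pollock's handle the PGN parsing and normalization for you:

Code:

#include <string>
#include <unordered_set>

// Sketch: a game is a duplicate only if its *complete* move
// sequence is identical to one seen before. 'moveText' is assumed
// to be the movetext with comments, NAGs, and whitespace already
// normalized away by a PGN parser.
bool isNewGame(const std::string &moveText,
               std::unordered_set<std::string> &seen) {
    return seen.insert(moveText).second;  // false if already present
}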

3) I use BayesElo but Ordo is a good tool as well.

Sven
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Engine results: a surprise!

Post by Adam Hair »

Hi Natale,

You can download custom PGNs of openings from here:
http://kirill-kryukov.com/chess/tools/opening-sampler/

Or, here is the PGN that the Stockfish test framework uses:
http://www.mediafire.com/download/qb8bt ... oves_GM.7z
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Engine results: a surprise!

Post by lkaufman »

The most efficient way to test is increment testing (assuming both engines support it), and the ratio of base time to increment should be at least 100, I think. So game in ten seconds plus 0.1 second per move is a good time control for a reasonably strong engine. Also, I recommend testing 8 games at once rather than four on a quad with HT; it is considerably more efficient, and I've never seen any evidence of a problem with it. Others will disagree, I'm sure.
jshriver
Posts: 1372
Joined: Wed Mar 08, 2006 9:41 pm
Location: Morgantown, WV, USA

Re: Engine results: a surprise!

Post by jshriver »

Pgn-extract is your friend :)
PK
Posts: 913
Joined: Mon Jan 15, 2007 11:23 am
Location: Warszawa

Re: Engine results: a surprise!

Post by PK »

Once you count pseudo-legal moves in your evaluation function, it makes sense to weight them differently. Sungorus uses 4 * knight mobility, 3 * bishop, 2 * rook and 1 * queen. It is sufficiently good for a start.
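
As a sketch, that weighting amounts to the following; the per-piece move counts are assumed to come from the engine's own move generation:

Code:

// Sungorus-style weighted mobility: minor-piece moves count most,
// queen moves least, since each individual queen move is worth
// comparatively little.
int weightedMobility(int knightMoves, int bishopMoves,
                     int rookMoves, int queenMoves) {
    return 4 * knightMoves + 3 * bishopMoves
         + 2 * rookMoves   + 1 * queenMoves;
}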
Richard Allbert
Posts: 795
Joined: Wed Jul 19, 2006 9:58 am

Re: Engine results: a surprise!

Post by Richard Allbert »

Hi Sven

Spike is only too strong if the TCs are equal. I test my engine vs Spike (and others rated > 2700) using 1s / 40 moves for them, and a LOT more time for mine.

This has the benefit of saving a bit of time. :)