Engine results: a surprise!

xmas79 · Post by **xmas79** » Sat Oct 12, 2013 11:09 pm

Hi all,
I wanted to show my results and ask how to move on....

I fixed what I had to in my engine (search stuff) and let it play a match against Fairy-Max 4.8S in WinBoard (which I'm currently using as the interface; Fairy-Max is its default engine). It played about 5000 games overnight at 1 sec/move, and here are the results:

win: 3945
loss: 556
draw: 227

total games: 4728

That seems like a good start, doesn't it? I thought that maybe I had run an unfair test, because I noticed that Fairy-Max sometimes loses on time, so I'm now trying 5 sec/move. Currently we are at:

win: 99
loss: 18
draw: 9

This also looks promising, even if I'm talking about only 100 games or so... This is painfully slow, as I have only one quad-core with HT (8 logical cores), and I'm running only 4 games in parallel to avoid HT effects...

I have already set up a cutechess-cli framework, because I really don't know if sec/move is a good TC... What I need to know is:

1) What are the best time controls to test the engine?
2) How many opponents should I match against? (And which ones?)
3) How many games on average?
4) Main TT size?

Best regards,
Natale.

P.S.
I'm now running 1 match at 30 secs/move: 3 win, 0 loss, 1 draw...

P.P.S.
My engine never castled, unless there was a mate on the horizon!!! LOL

jdart · Post by **jdart** » Sun Oct 13, 2013 3:50 am

FairyMax is evidently too weak an opponent and the loss on time issue is a problem. There are plenty of other engines that don't lose on time.

You can try Spike (http://spike.lazypics.de/dl_index_en.html). This is a reasonably stable engine. Version 1.2 is not too strong and runs on both Windows and Linux (you didn't say what platform you are testing on). Version 1.4 is quite a bit stronger I think but is Windows only.

1 sec/move is probably ok. It depends on what you are testing. For eval changes many testers use even faster time controls. Game in 5 sec + 0.1 sec increment for example (assumes your program can handle fractional increments).

--Jon

tpetzke · Post by **tpetzke** » Sun Oct 13, 2013 10:23 am

Hi

1 sec / move I would already consider a long time control compared to what my tests usually run with. This means a game lasts about two minutes. (166 CPU hours for 5000 games).
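The arithmetic behind that figure can be sketched as follows. This is only a rough estimate; the function name and the default of 60 moves per game are assumptions made for illustration.

```python
def match_cpu_hours(games, moves_per_game=60, sec_per_move=1.0):
    """Rough CPU time for a match at a fixed time per move.

    Assumes both engines use the full sec_per_move on every move,
    so one game costs 2 * moves_per_game * sec_per_move seconds.
    """
    seconds_per_game = 2 * moves_per_game * sec_per_move
    return games * seconds_per_game / 3600.0


# 5000 games of about 60 moves at 1 sec/move: roughly 166 to 167 CPU hours.
print(f"{match_cpu_hours(5000):.1f} CPU hours")
```

Dividing the result by the number of games played in parallel gives an approximate wall-clock time.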

I run my tests (eval related) at 3'+0.03/40. That seems OK so far. I think that a short time control puts more pressure on the eval, and if the eval is what you are testing, this is beneficial.

If your engine does not castle, you might have set the bonus for possessing castling rights too high. When your engine castles it loses those rights (and the bonus), so it will avoid castling in order to keep the bonus.

I think some engines award a bonus for executing a castling move to overcome that. But this is actually a hack. An engine should consider a position with a castled king better than a position with a king in the center who possesses its castling rights. And then it will castle.
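A minimal sketch of that idea, for White only: the score depends on where the king stands and on the remaining castling rights, not on whether a castling move was ever played. The bonus values, the set of "safe" squares and the function name are hypothetical and chosen only for illustration; a real king-safety term would look at much more.

```python
# Hypothetical centipawn values, for illustration only.
CASTLING_RIGHTS_BONUS = 10   # king still in the centre, but castling is possible
KING_ON_WING_BONUS = 40      # king stands where castling would have put it

# Simplifying assumption: these first-rank squares count as "castled".
SAFE_KING_SQUARES_WHITE = {"a1", "b1", "c1", "g1", "h1"}


def white_king_placement_score(king_square, can_castle_kingside, can_castle_queenside):
    """Score the white king's placement from the position alone."""
    if king_square in SAFE_KING_SQUARES_WHITE:
        return KING_ON_WING_BONUS
    if can_castle_kingside or can_castle_queenside:
        return CASTLING_RIGHTS_BONUS
    return 0


print(white_king_placement_score("e1", True, True))    # rights kept, king in centre
print(white_king_placement_score("g1", False, False))  # king castled
```

Because the castled position scores higher than merely holding the rights, giving up the rights by castling is a net gain rather than a loss.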

Thomas...

Sven · Post by **Sven** » Sun Oct 13, 2013 12:53 pm

jdart wrote:FairyMax is evidently too weak an opponent and the loss on time issue is a problem. There are plenty of other engines that don't lose on time.

You can try Spike (http://spike.lazypics.de/dl_index_en.html). This is a reasonably stable engine. Version 1.2 is not too strong and runs on both Windows and Linux (you didn't say what platform you are testing on). Version 1.4 is quite a bit stronger I think but is Windows only.

1 sec/move is probably ok. It depends on what you are testing. For eval changes many testers use even faster time controls. Game in 5 sec + 0.1 sec increment for example (assumes your program can handle fractional increments).

--Jon

I think even Spike 1.2 is much too strong for Natale's engine at its current stage; it is >2700 Elo in CCRL 40/4. Based on the results above (~80..85% against FairyMax) and on the assumption that FairyMax has a strength of ~1950 Elo in CCRL 40/4 terms (HGM wrote somewhere that FairyMax has almost the same strength as MicroMax), Natale's engine has performed at somewhere around 2200-2250 in his tests, which is indeed quite good for a start but far away from the level of engines like Spike 1.2. Therefore I suggest selecting a couple of opponents from the 2100-2400 range to get a first Elo approximation.
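For reference, a performance estimate of this kind can be reproduced with the standard logistic Elo model, where an expected score s corresponds to a rating difference of -400 * log10(1/s - 1). The sketch below applies it to the 1 sec/move result; the function names are made up for this example, and the ~1950 figure for FairyMax is the assumption stated above, not a measured value.

```python
import math


def score_fraction(wins, losses, draws):
    """Match score as a fraction, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + losses + draws)


def elo_difference(score):
    """Rating difference implied by a score under the logistic Elo model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


s = score_fraction(wins=3945, losses=556, draws=227)
diff = elo_difference(s)
assumed_opponent_elo = 1950  # assumption taken from the post above
print(f"score {s:.1%}, difference {diff:+.0f}, performance ~{assumed_opponent_elo + diff:.0f}")
```

The raw score comes out a little above 85%, i.e. a difference of a bit more than 300 Elo. That is the same ballpark as the estimate above, and it is only as reliable as the assumed opponent rating.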

Some more questions @Natale:
- How many of these ~5000 games were lost on time by FairyMax?
- What was the average game length?
- How many different starting positions did you use?

Sven

xmas79 · Post by **xmas79** » Sun Oct 13, 2013 3:22 pm

Hi Sven,
thank you for your suggestions. I'll try different opponents within the range you gave, as FM seems weak even at these long time controls. And indeed, after 750 games at 5 sec/move I have:

win: 601
loss: 103
draw: 49

total :753

which seems to me to confirm that my engine is stronger at long time controls, although looking at some games I found that my engine's endgame knowledge is sometimes very wrong. I have seen it lose a couple of plainly drawn endgames with one bishop on each side, on opposite colors... After shuffling for a while, repeating the same moves (trying to reach a third repetition, with FM avoiding it), the engine made a bad move that allowed the opponent to promote... Something I need to investigate before I move on.

At even longer time control (30 sec/move) I actually have:

win: 23
loss: 7
draw: 5

But now I'm stopping it, since testing at these time controls seems useless to me for now.

I'm just curious to test it against FM at shorter time controls, where I'm pretty confident my engine will perform pretty badly. But I realized I don't have the "logic" to handle "game in X time + increment Y per move", so this test will be delayed a little...

Your questions:
1) From what I understood, FM loses on time when it encounters an unexpected fail-low situation, e.g. a forced mate. It doesn't happen that often, but it does happen, which is a bit annoying. I don't have statistics for the 5k games, as I ran them as a match and not a tournament. I do have logs of the test currently running: it seems that it has lost 168 games on time so far, most of them with a negative score (like -15.00), so they were lost anyway.
2) Good question.. I saw games 30 moves long, and games 120 moves long... I don't really know the average length of the games...
3) I don't know what you mean by "How many different starting positions did you use?". I use the only legal starting position in normal chess. Were you maybe referring to Chess960?

Best Regards,
Natale.

xmas79 · Post by **xmas79** » Sun Oct 13, 2013 3:37 pm

Hi Thomas,

tpetzke wrote:...
I run my tests (eval related) at 3'+0.03/40. Seem ok so far. I think that a short time control puts more pressure on the eval, and if this is what you are testing this is beneficial...

Ok, this is exactly the test I'd like to run. Read after.

If your engine does not castle, you might have set the bonus for possessing castling rights too high. When your engine castles it loses those rights (and the bonus), so it will avoid castling in order to keep the bonus.

I think some engines award a bonus for executing a castling move to overcome that. But this is actually a hack. An engine should consider a position with a castled king better than a position with a king in the center who possesses its castling rights. And then it will castle.

Thomas...

Yep, assuming you actually have a basic, working evaluation function... My engine currently has only known-endgame recognition (which sometimes doesn't work well...) and material evaluation, plus the simplest thing I could think of to improve its play: piece pseudo-mobility. No bonuses of any sort: no king safety, no castling bonus, no pawn structure, no imbalance tables, nothing, nada, nisba... I also included a small bonus for the bishop pair and a "pawn distance to promotion square" bonus to encourage pawn pushing in endgames (but the engine takes that as advice in the opening too LOL, so you see h4 g4 f4 f5 and moves like that), as they are very basic things that are really easy to implement. So from my engine's point of view, a castling move seems pretty useless... Once I implement king safety I'm sure engine strength will be much greater, and it will castle... and keep the pawns in front of the king.

Thanks a lot for your suggestions,
Natale.

jdart · Post by **jdart** » Sun Oct 13, 2013 5:29 pm

Beowulf is another possibility if you want a weaker opponent (http://www.frayn.net/beowulf/).

--Jon

hgm · Post by **hgm** » Sun Oct 13, 2013 8:08 pm

Fairy-Max will have some time losses no matter what TC you play it on. Fast TC is not a problem per se, it should be able to play very fast. The problem is that when the last piece is traded, and it switches off null move, it will try to top the depth for the position stored in its hash table with a null-move-free search. Which at 1 sec/move can easily take 5 min. Fairy-Max has a very simplistic time management, which only looks at the clock after an iteration in the root completes.

I never bothered to fix this problem, because it almost exclusively loses on time in positions that are extremely lost anyway.

I am not sure whether by '1 sec/move' you really mean a fixed maximum time per move, or something like 40 moves/40 sec. Fairy-Max performs very poorly at a fixed maximum time per move because of its simplistic time management. To not forfeit every game within 10 moves it has to take a huge safety margin when it starts an iteration, as it must be >99.5% sure that it can finish it (and then it would still lose about 30% of the games on time, if they last 60 moves). As a result it uses on average only about 10% of the allotted time.
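The forfeit figure quoted here follows from compounding the per-move risk over a whole game. A small worked example, assuming each move independently has the same probability of finishing its last iteration in time (the function name is hypothetical):

```python
def forfeit_probability(p_finish_per_move, moves):
    """Chance of at least one time forfeit in a game of `moves` moves,
    assuming every move independently finishes in time with
    probability p_finish_per_move."""
    return 1.0 - p_finish_per_move ** moves


# 99.5% safety per move over 60 moves still forfeits roughly a quarter of the games.
print(f"{forfeit_probability(0.995, 60):.0%}")
```

With exactly 99.5% per move this gives about 26%, in line with the "about 30%" quoted above. Even a per-move margin that looks very safe compounds into a large loss rate, which is why the engine ends up using so little of its time.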

Sven's question about the starting positions relates to the problem of not repeating the same game over and over again when you play engines without book. So 'starting positions' can also be read as 'book lines' here.

With Fairy-Max the problem is not that severe, as it randomizes its first few opening moves by adding a rather large random score to all moves in the root. Nevertheless, if you play it 5000 times against a non-randomizing engine, I am pretty sure there will be many duplicate games. When I use Fairy-Max for measuring piece values, it plays against itself (so both engines randomize), but even then I do Chess960-like shuffling of initial positions to prevent duplicates.

Sven · Post by **Sven** » Sun Oct 13, 2013 9:44 pm

hgm wrote:Fairy-Max will have some time losses no matter what TC you play it on. Fast TC is not a problem per se, it should be able to play very fast. The problem is that when the last piece is traded, and it switches off null move, it will try to top the depth for the position stored in its hash table with a null-move-free search. Which at 1 sec/move can easily take 5 min. Fairy-Max has a very simplistic time management, which only looks at the clock after an iteration in the root completes.

I never bothered to fix this problem, because it almost exclusively loses on time in positions that are extremely lost anyway.

I am not sure whether by '1 sec/move' you really mean a fixed maximum time per move, or something like 40 moves/40 sec. Fairy-Max performs very poorly at a fixed maximum time per move because of its simplistic time management. To not forfeit every game within 10 moves it has to take a huge safety margin when it starts an iteration, as it must be >99.5% sure that it can finish it (and then it would still lose about 30% of the games on time, if they last 60 moves). As a result it uses on average only about 10% of the allotted time.

Sven's question about the starting positions relates to the problem of not repeating the same game over and over again when you play engines without book. So 'starting positions' can also be read as 'book lines' here.

With Fairy-Max the problem is not that severe, as it randomizes its first few opening moves by adding a rather large random score to all moves in the root. Nevertheless, if you play it 5000 times against a non-randomizing engine, I am pretty sure there will be many duplicate games. When I use Fairy-Max for measuring piece values, it plays against itself (so both engines randomize), but even then I do Chess960-like shuffling of initial positions to prevent duplicates.

Hmmm ... after that post I start to believe that FairyMax did not play at ~1950 level in Natale's test but perhaps much below that.

So my current estimate of the playing strength of Natale's engine is:

... we don't know yet!

Sven

Sven · Post by **Sven** » Sun Oct 13, 2013 9:47 pm

xmas79 wrote:Your questions:
1) From what I understood, FM loses on time when it encounters an unexpected fail-low situation, e.g. a forced mate. It doesn't happen that often, but it does happen, which is a bit annoying. I don't have statistics for the 5k games, as I ran them as a match and not a tournament. I do have logs of the test currently running: it seems that it has lost 168 games on time so far, most of them with a negative score (like -15.00), so they were lost anyway.
2) Good question.. I saw games 30 moves long, and games 120 moves long... I don't really know the average length of the games...

I suggest saving the games as PGN. This can serve several purposes, e.g.:
- look what really happened in the games;
- get some statistics, like average game length (my question came from thoughts about the total time of your test, i.e. 5000 games with 2 CPU minutes per game based on an average of 60 moves per game);
- check for duplicates and remove them;
- calculate an estimated relative rating.
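As a rough illustration of the second and third points, the sketch below reads PGN text, counts exact duplicate games and computes the average game length. It is a simplified parser written for this example: it assumes every game starts with an [Event ...] tag and that the movetext contains no variations or NAGs. All function names are hypothetical.

```python
import re
from collections import Counter

RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}
GAME_START = re.compile(r"^(?=\[Event )", re.MULTILINE)


def split_games(pgn_text):
    """Split PGN text into games, one per [Event ...] tag."""
    return [part for part in GAME_START.split(pgn_text) if part.strip()]


def moves_of(game):
    """Return the game's moves (plies) as a tuple of SAN strings."""
    body = "\n".join(line for line in game.splitlines() if not line.startswith("["))
    body = re.sub(r"\{[^}]*\}", " ", body)   # drop {comments}
    body = re.sub(r"\d+\.+", " ", body)      # drop move numbers such as "12." or "12..."
    return tuple(tok for tok in body.split() if tok not in RESULTS)


def match_statistics(pgn_text):
    """Number of games, exact duplicates, and average length in full moves."""
    games = [moves_of(g) for g in split_games(pgn_text)]
    duplicates = sum(n - 1 for n in Counter(games).values())
    avg_moves = sum(len(g) for g in games) / (2 * len(games)) if games else 0.0
    return {"games": len(games), "duplicates": duplicates, "avg_moves": avg_moves}
```

Calling `match_statistics(open("match.pgn").read())` on the saved games (the file name is just an example) gives the game count, the number of duplicates and the average length. For the rating estimate itself, a dedicated rating tool is the better choice.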

xmas79 wrote:3) I don't know what you mean by "How many different starting positions did you use?". I use the only legal starting position in normal chess. Were you maybe referring to Chess960?

No, see HGM's reply.

Sven
