STS and the scaling of Houdini and Stockfish

Laskos · Post by **Laskos** » Thu Dec 12, 2013 11:49 am

This post is only half serious, because we know that test suites are not the best indicators of engine strength. But with the question coming up of whether some top engines scale better at long time controls than others, and since I am not particularly a tester, I decided to try a one-day solution instead of spending weeks on long time control matches:

Run the full pack of Strategic Test Suites by Swaminathan N. and Dann Corbit (1400 positions), which are not particularly tactical (Houdini Tactical is no better than Houdini default at STS), at different time controls, and see how Houdini 4 and Stockfish DD behave with increasing time control, from blitz to tournament TC.

1400 test positions, each engine on 4 i7 cores.

1s/position:
H4: 1226
SF: 1203

30s/position
H4: 1319
SF: 1311

180s/position
SF: 1355
H4: 1351

So, it's apparent that SF scales better to longer time control, even overtaking Houdini 4 at long TC, 180s/position. Houdini is no longer the king of STS at long time controls.
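
As a rough sanity check on the numbers above, here is a minimal Python sketch that turns the raw scores into solve percentages and the H4 minus SF gap at each time control. The `scores` dictionary and the helper name `summarize` are illustrative only; the values are copied from the results above.

```python
# Scores copied from the results above (positions solved out of 1400).
TOTAL = 1400
scores = {
    1:   {"H4": 1226, "SF": 1203},
    30:  {"H4": 1319, "SF": 1311},
    180: {"H4": 1351, "SF": 1355},
}

def summarize(scores, total=TOTAL):
    """Return (seconds, H4 %, SF %, H4 - SF gap) for each time control."""
    rows = []
    for secs in sorted(scores):
        h4, sf = scores[secs]["H4"], scores[secs]["SF"]
        rows.append((secs, 100.0 * h4 / total, 100.0 * sf / total, h4 - sf))
    return rows

for secs, h4_pct, sf_pct, gap in summarize(scores):
    print(f"{secs:>3}s/pos  H4 {h4_pct:.1f}%  SF {sf_pct:.1f}%  gap {gap:+d}")
```

The gap shrinks from 23 positions at 1s to 8 at 30s and flips to 4 in Stockfish's favour at 180s, which is the trend described above; that final margin is about 0.3% of the suite.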

Jouni · Post by **Jouni** » Thu Dec 12, 2013 8:42 pm

I have noticed that SF has slowly but steadily improved in the STS suite. And SF DD was the first version ever to surpass the 90% limit under my conditions = 15 sec per position on one CPU. It scored 1264 when H3 got 1274. But how many of the 1400 positions are still correct? Your results with 180s may indicate that most are OK.

Laskos · Post by **Laskos** » Thu Dec 12, 2013 11:23 pm

Jouni wrote: I have noticed that SF has slowly but steadily improved in the STS suite. And SF DD was the first version ever to surpass the 90% limit under my conditions = 15 sec per position on one CPU. It scored 1264 when H3 got 1274. But how many of the 1400 positions are still correct? Your results with 180s may indicate that most are OK.

I guess there are some wrong solutions: I tested several positions that neither Houdini nor Stockfish can solve in half an hour or so. The important thing is that there are not so many of them as to distort the results at 3 min/pos. Testing further, to say 15 min/pos, is probably useless.

Carlos Ylich · Post by **Carlos Ylich** » Fri Dec 13, 2013 4:37 pm

Great job Kai.
I consider these tests important for assessing the evolution of engine performance.
Thanks for your work

Jouni · Post by **Jouni** » Sun Feb 09, 2014 9:39 am

And finally SF 8.2.2014 beats the Houdini 3 score, with 1276! BTW, another test suite where Stockfish excels is the Arasan suite. There SF solved 170 positions while H3 got "only" 159 under the same conditions.

lucasart · Post by **lucasart** » Sun Feb 09, 2014 10:03 am

Laskos wrote: 1s/position:
H4: 1226
SF: 1203

Nice to see Stockfish closing the gap at fast time control. SF used to be a slow starter, but not anymore.

Jouni · Post by **Jouni** » Sat Jan 31, 2015 4:41 pm

I tested SF 6 on the complete 1500-position suite (15s limit). It scored 1327, but Houdini 3 got 1338! I guess that even with a much longer time limit SF is worse on my PC.

bob · Post by **bob** » Sat Jan 31, 2015 6:04 pm

Laskos wrote: This post is only half serious, because we know that test suites are not the best indicators of engine strength. But with the question coming up of whether some top engines scale better at long time controls than others, and since I am not particularly a tester, I decided to try a one-day solution instead of spending weeks on long time control matches:

Run the full pack of Strategic Test Suites by Swaminathan N. and Dann Corbit (1400 positions), which are not particularly tactical (Houdini Tactical is no better than Houdini default at STS), at different time controls, and see how Houdini 4 and Stockfish DD behave with increasing time control, from blitz to tournament TC.

1400 test positions, each engine on 4 i7 cores.

1s/position:

H4: 1226
SF: 1203

30s/position
H4: 1319
SF: 1311

180s/position
SF: 1355
H4: 1351

So, it's apparent that SF scales better to longer time control, even overtaking Houdini 4 at long TC, 180s/position. Houdini is no longer the king of STS at long time controls.

The math.

#1. Did you test single thread first? It might be that Stockfish gets better as depth (D) increases. In fact, it might get MUCH better, with the SMP scaling actually hurting enough to take away most of the gain.

There's no way you can do a test like this and conclude anything without also running the same test with just one thread first, to see how things look at each of those depths. Then, when you increase threads and compare the results to the 1-thread numbers, all you are measuring is the 1-thread to n-thread speedup. As it is, these are simply random numbers until they are verified with 1 thread first...
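
The comparison suggested here can be sketched in Python as follows. This is only an illustration: the helper `gain` is a hypothetical name, the 4-core scores are the ones quoted above, and the single-thread table is deliberately left empty because no single-thread results appear in this thread.

```python
# 4-core scores quoted above: seconds per position -> positions solved (of 1400).
four_core = {
    "H4": {1: 1226, 30: 1319, 180: 1351},
    "SF": {1: 1203, 30: 1311, 180: 1355},
}
# Placeholder: fill in with real single-thread runs; none are reported here.
one_thread = {}

def gain(by_tc):
    """Positions gained going from the shortest to the longest time control."""
    shortest, longest = min(by_tc), max(by_tc)
    return by_tc[longest] - by_tc[shortest]

for label, table in (("4 cores", four_core), ("1 thread", one_thread)):
    for engine, by_tc in table.items():
        print(f"{label:8} {engine}: +{gain(by_tc)} positions")
```

On the 4-core data this gives +125 for H4 and +152 for SF. Running the same calculation on single-thread results would help separate how much of that difference comes from search depth and how much from SMP scaling.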

Dann Corbit · Post by **Dann Corbit** » Sat Jan 31, 2015 8:54 pm

Jouni wrote: I have noticed that SF has slowly but steadily improved in the STS suite. And SF DD was the first version ever to surpass the 90% limit under my conditions = 15 sec per position on one CPU. It scored 1264 when H3 got 1274. But how many of the 1400 positions are still correct? Your results with 180s may indicate that most are OK.

There are clearly bugs in it. When I first started on STS, my operating system was 32 bits and the strongest engine was Rybka 1.0. I was running on a single core.

An hour of CPU was given (with the three top engines of the time), but we can replicate that today in less than a minute due to the exponential increase in both compute power and software excellence.

I have started the process of identifying errors, and there clearly are some that need to be corrected.
