STS and the scaling of Houdini and Stockfish


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos


Post by Laskos »

This post is only half serious, because we know that test suites are not the best indicators of engine strength. But with the question coming up of whether some top engines scale better at long time controls than others, and since I am not particularly a tester, I decided to try a one-day solution instead of spending weeks on long time control matches:

Run the full pack of Strategic Test Suites by Swaminathan N. and Dann Corbit (1400 positions), which are not particularly tactical (Houdini Tactical is no better than Houdini default at STS), at different time controls, and see how Houdini 4 and Stockfish DD behave with increasing time control, from blitz to tournament TC.

1400 test positions, each engine on 4 i7 cores.

1s/position:
H4: 1226
SF: 1203

30s/position
H4: 1319
SF: 1311

180s/position
SF: 1355
H4: 1351


So, it's apparent that SF scales better to longer time control, even overtaking Houdini 4 at long TC, 180s/position. Houdini is no longer the king of STS at long time controls.
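
To put the gaps in perspective, here is a minimal sketch that converts the counts above into percentages and attaches a rough error bar to each H4 minus SF gap. It assumes every position is an independent pass/fail trial and that the two runs are independent of each other; since both engines see the same 1400 positions, that is only an approximation. The function names are made up for illustration.

```python
import math

TOTAL = 1400  # positions in the suite, as stated above

# Solved counts copied from the results above: seconds per position -> (H4, SF)
RESULTS = {1: (1226, 1203), 30: (1319, 1311), 180: (1351, 1355)}


def gap_and_error(h4, sf, total=TOTAL):
    """Return (H4 minus SF, rough one-sigma error of that gap), both in positions."""
    p_h4, p_sf = h4 / total, sf / total
    # ASSUMPTION: independent binomial trials for each engine. The runs share
    # the same positions, so treat the error as a rough guide only.
    variance = total * (p_h4 * (1 - p_h4) + p_sf * (1 - p_sf))
    return h4 - sf, math.sqrt(variance)


def summarize(results=RESULTS, total=TOTAL):
    """Return rows of (seconds, H4 percent, SF percent, gap, error)."""
    rows = []
    for secs, (h4, sf) in sorted(results.items()):
        gap, err = gap_and_error(h4, sf, total)
        rows.append((secs, 100 * h4 / total, 100 * sf / total, gap, err))
    return rows


if __name__ == "__main__":
    for secs, h4_pct, sf_pct, gap, err in summarize():
        print(f"{secs:>4}s  H4 {h4_pct:5.1f}%  SF {sf_pct:5.1f}%  "
              f"gap {gap:+d}  (+/- {err:.0f})")
```

Under those assumptions the gaps are 23, 8 and -4 positions, with one-sigma errors of roughly 18, 13 and 10 positions, so the trend in the gap is more telling than any single time control.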
Jouni
Posts: 3315
Joined: Wed Mar 08, 2006 8:15 pm


Post by Jouni »

I have noticed that SF has slowly but steadily improved on the STS suite. SF DD was the first version ever to pass the 90% mark under my conditions (15 sec per position on one CPU). It scored 1264, while H3 got 1274. But how many of the 1400 positions are still correct? Your results at 180s may indicate that most are OK.
Jouni
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos


Post by Laskos »

Jouni wrote:I have noticed that SF has slowly but steadily improved on the STS suite. SF DD was the first version ever to pass the 90% mark under my conditions (15 sec per position on one CPU). It scored 1264, while H3 got 1274. But how many of the 1400 positions are still correct? Your results at 180s may indicate that most are OK.
I guess there are wrong solutions: I tested several positions which neither Houdini nor Stockfish can solve in half an hour or so. The important thing is that there are not so many of them as to distort the results at 3min/pos. Testing further, say to 15min/pos, is probably useless.
Carlos Ylich
Posts: 175
Joined: Wed Apr 28, 2010 9:31 pm
Location: Brazil


Post by Carlos Ylich »

Great job Kai.
I consider these tests important for assessing the evolution of engine performance.
Thanks for your work!
Remember Sabra and Chatila
Jouni
Posts: 3315
Joined: Wed Mar 08, 2006 8:15 pm


Post by Jouni »

And finally, SF 8.2.2014 beats the Houdini 3 score with 1276! BTW, another test suite where Stockfish excels is the Arasan suite. There SF solved 170 positions, while H3 got "only" 159 under the same conditions.
Jouni
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart


Post by lucasart »

Laskos wrote: 1s/position:
H4: 1226
SF: 1203
Nice to see Stockfish closing the gap at fast time control. SF used to be a slow starter, but not anymore :D
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Jouni
Posts: 3315
Joined: Wed Mar 08, 2006 8:15 pm


Post by Jouni »

I tested SF 6 on the complete 1500-position suite (15s limit). It scored 1327, but Houdini 3 got 1338! I guess that even with a much longer time limit SF is worse on my PC.
Jouni
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL


Post by bob »

Laskos wrote:This post is only half serious, because we know that test suites are not the best indicators of engine strength. But with the question coming up of whether some top engines scale better at long time controls than others, and since I am not particularly a tester, I decided to try a one-day solution instead of spending weeks on long time control matches:

Run the full pack of Strategic Test Suites by Swaminathan N. and Dann Corbit (1400 positions), which are not particularly tactical (Houdini Tactical is no better than Houdini default at STS), at different time controls, and see how Houdini 4 and Stockfish DD behave with increasing time control, from blitz to tournament TC.

1400 test positions, each engine on 4 i7 cores.

1s/position:

H4: 1226
SF: 1203

30s/position
H4: 1319
SF: 1311

180s/position
SF: 1355
H4: 1351


So, it's apparent that SF scales better to longer time control, even overtaking Houdini 4 at long TC, 180s/position. Houdini is no longer the king of STS at long time controls.
The math.

#1. Did you test single thread first? It might be that Stockfish gets better as depth (D) increases. In fact, it might get MUCH better, and its SMP scaling actually hurts enough to take away most of the gain.

There's no way you can do a test like this and conclude anything without also running the same test with just one thread first, to see how things look at each of those depths. Then when you increase threads and compare the results to the 1-thread numbers, all you are measuring is the 1-thread to n-thread speedup. As it is, these are simply random numbers until they are verified with 1 thread first...
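
One way to make that comparison concrete, as a rough sketch only: take the 1-thread scores at several time limits, find the time at which the 1-thread curve reaches the score an n-thread run achieved, and divide by the n-thread time. The function names and the log-time interpolation are assumptions made for illustration, not anything measured or used in this thread.

```python
import math


def equivalent_time(single_curve, target_score):
    """Time at which the 1-thread curve reaches target_score.

    single_curve: list of (seconds_per_position, solved_count) from 1-thread runs.
    Interpolates linearly in log(time) between neighbouring measurements and
    returns None if target_score lies outside the measured range.
    """
    points = sorted(single_curve)
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if s1 > s0 and s0 <= target_score <= s1:
            frac = (target_score - s0) / (s1 - s0)
            log_t = math.log(t0) + frac * (math.log(t1) - math.log(t0))
            return math.exp(log_t)
    return None


def effective_speedup(single_curve, multi_time, multi_score):
    """Ratio of 1-thread time to n-thread time needed for the same score."""
    t_equiv = equivalent_time(single_curve, multi_score)
    return None if t_equiv is None else t_equiv / multi_time


if __name__ == "__main__":
    # Placeholder numbers, invented purely to show the call; NOT measurements.
    single = [(1, 100), (4, 200)]
    print(effective_speedup(single, multi_time=1, multi_score=150))
```

Run per engine, this would separate how much of a score gain comes from extra depth on one thread and how much from the n-thread speedup.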
Dann Corbit
Posts: 12564
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA


Post by Dann Corbit »

Jouni wrote:I have noticed that SF has slowly but steadily improved on the STS suite. SF DD was the first version ever to pass the 90% mark under my conditions (15 sec per position on one CPU). It scored 1264, while H3 got 1274. But how many of the 1400 positions are still correct? Your results at 180s may indicate that most are OK.
There are clearly bugs in it. When I first started on STS, my operating system was 32 bits and the strongest engine was Rybka 1.0. I was running on a single core.

An hour of CPU was given (with the three top engines of the time), but we can replicate that today in less than a minute due to the exponential increase in both compute power and software excellence.

I have started the process of identifying errors, and there clearly are some that need to be corrected.