This post is only half serious, since we know that test suites are not the best indicators of engine strength. But with the question coming up of whether some top engines scale better at long time controls than others, and since I am not particularly a tester, I decided to try a one-day solution instead of spending weeks on long time control matches:
Run the full pack of Strategic Test Suites by Swaminathan N. and Dann Corbit (1400 positions), which are not particularly tactical (Houdini Tactical is no better than Houdini default at STS), at different time controls, and see how Houdini 4 and Stockfish DD behave with increasing time control, from blitz to tournament TC.
1400 test positions, each engine on 4 i7 cores.
1s/position:
H4: 1226
SF: 1203
30s/position:
H4: 1319
SF: 1311
180s/position:
SF: 1355
H4: 1351
So, it's apparent that SF scales better to longer time control, even overtaking Houdini 4 at long TC, 180s/position. Houdini is no longer the king of STS at long time controls.
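As a rough cross-check of the numbers above, here is a minimal Python sketch that turns the solved counts into solve rates and into the SF minus H4 gap at each time control. The counts are copied from the post; the error estimate is an assumption of this sketch, not something the post computes: it treats each run as an independent binomial sample of 1400 positions, which ignores that both engines ran the same positions. The names `RESULTS`, `solve_rate` and `gap_with_error` are made up for illustration.

```python
from math import sqrt

TOTAL = 1400  # positions in the suite, as stated in the post

# Solved counts copied from the post: seconds per position -> engine -> solved
RESULTS = {
    1: {"H4": 1226, "SF": 1203},
    30: {"H4": 1319, "SF": 1311},
    180: {"H4": 1351, "SF": 1355},
}


def solve_rate(solved, total=TOTAL):
    """Fraction of the suite solved."""
    return solved / total


def gap_with_error(h4, sf, total=TOTAL):
    """Return (gap, stderr) in positions, where gap = SF - H4.

    stderr is a rough one-sigma estimate that assumes the two runs are
    independent binomial samples (an assumption of this sketch).
    """
    p_h4 = h4 / total
    p_sf = sf / total
    variance = total * (p_h4 * (1 - p_h4) + p_sf * (1 - p_sf))
    return sf - h4, sqrt(variance)


if __name__ == "__main__":
    for seconds, row in RESULTS.items():
        gap, err = gap_with_error(row["H4"], row["SF"])
        print(
            f"{seconds:>3}s/position: "
            f"H4 {solve_rate(row['H4']):.1%}, SF {solve_rate(row['SF']):.1%}, "
            f"SF - H4 = {gap:+d} (rough 1-sigma ~ {err:.1f})"
        )
```

Under that independence assumption, the 4-position lead at 180s/position comes out smaller than the rough one-sigma figure, so the crossover is suggestive rather than conclusive. Since both engines ran the same positions, a paired comparison on per-position results (which the post does not give) would be a sharper test.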
STS and the scaling of Houdini and Stockfish
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
-
- Posts: 3315
- Joined: Wed Mar 08, 2006 8:15 pm
Re: STS and the scaling of Houdini and Stockfish
I have noticed that SF has slowly but steadily improved on the STS suite. SF DD was the first version ever to surpass the 90% mark under my conditions: 15 sec per position on one CPU. It scored 1264, while H3 got 1274. But how many of the 1400 positions are still correct? Your results at 180s may indicate that most are OK.
Jouni
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: STS and the scaling of Houdini and Stockfish
Jouni wrote: I have noticed, that SF has slowly but steadily improved in STS suite. And SF DD was first version ever to surpass 90% limit with my conditions = 15 sec for position in one CPU. It scored 1264 when H3 got 1274. But how many of 1400 positions are still correct? Your results with 180s may indicate, that most are OK..
I guess there are wrong solutions. I tested several positions which neither Houdini nor Stockfish can solve in half an hour or so. The important thing is that there are not so many of them as to distort the results at 3min/pos. Testing further, say to 15min/pos, is probably useless.
-
- Posts: 175
- Joined: Wed Apr 28, 2010 9:31 pm
- Location: Brazil
Re: STS and the scaling of Houdini and Stockfish
Great job Kai.
I consider these tests important for assessing the evolution of engine performance.
Thanks for your work
Remember Sabra and Chatila
-
- Posts: 3315
- Joined: Wed Mar 08, 2006 8:15 pm
Re: STS and the scaling of Houdini and Stockfish
And finally SF 8.2.2014 beats Houdini 3's score, with 1276! BTW, another test suite where Stockfish excels is the Arasan suite. There SF solved 170 positions, while H3 got "only" 159 under the same conditions.
Jouni
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: STS and the scaling of Houdini and Stockfish
Laskos wrote: 1s/position:
H4: 1226
SF: 1203
Nice to see Stockfish closing the gap at fast time control. SF used to be a slow starter, but not anymore.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 3315
- Joined: Wed Mar 08, 2006 8:15 pm
Re: STS and the scaling of Houdini and Stockfish
I tested SF 6 on the complete 1500-position suite (15s limit). It scored 1327, but Houdini 3 got 1338! I guess that even with a much longer time limit SF is worse on my PC.
Jouni
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: STS and the scaling of Houdini and Stockfish
Laskos wrote: The post is half serious, because we know that test suites are not the best indicators of engines' strength. But, with issue appearing that some top engines scale better at long time control than others, instead of weeks spent on longtime control matches, as I am not particularly a tester, I decided to try a one day solution:
Run the full pack of Strategic Test Suites by Swaminathan N. and Dann Corbit (1400 positions), which are not particularly tactical (Houdini Tactical is no better than Houdini default at STS), at different time controls, and see how Houdini 4 and Stockfish DD behave with increasing time control, from blitz to tournament TC.
1400 test positions, each engine on 4 i7 cores.
1s/position:
H4: 1226
SF: 1203
30s/position
H4: 1319
SF: 1311
180s/position
SF: 1355
H4: 1351
So, it's apparent that SF scales better to longer time control, even overtaking Houdini 4 at long TC, 180s/position. Houdini is no longer the king of STS at long time controls.
The math.
#1. Did you test single thread first? It might be that Stockfish gets better as depth (D) increases. In fact, it might get MUCH better, and the SMP scaling actually hurts enough to take away most of the gain.
There's no way you can do a test like this and conclude anything without also running the same test with just one thread first, to see how things look at each of those depths. Then when you increase threads and compare them to the 1-thread numbers, all you are measuring is the 1-thread to n-thread speedup. As it is, these are simply random numbers until they are verified with 1 thread first...
-
- Posts: 12564
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: STS and the scaling of Houdini and Stockfish
Jouni wrote: I have noticed, that SF has slowly but steadily improved in STS suite. And SF DD was first version ever to surpass 90% limit with my conditions = 15 sec for position in one CPU. It scored 1264 when H3 got 1274. But how many of 1400 positions are still correct? Your results with 180s may indicate, that most are OK..
There are clearly bugs in it. When I first started on STS, my operating system was 32 bits and the strongest engine was Rybka 1.0. I was running on a single core.
An hour of CPU was given (with the three top engines of the time), but we can replicate that today in less than a minute due to the exponential increase in both compute power and software excellence.
I have started the process of identifying the errors, and there clearly are some that need to be corrected.