bob wrote:
Strange indeed. Is it possible to set up a remote login at some point? I could at least watch things as it runs and perhaps get an idea. The last time I ran on a Nehalem box, it ran like the blazes. It might just need some tuning. One thing is for sure: LMR is way more aggressive (just compare the depths on your two 8-thread searches; the one running at half normal speed went 3 plies deeper). It might be that smpsn needs tuning. If you want to run a few tests, try smpsn=2000, smpsn=4000, and even smpsn=8000. Run a test for 60 seconds with 8 threads, and run the same position 4 times. Then change the smpsn value and repeat. You can fine-tune it even further, but changing it by the amounts above will usually at least point you to the right area to try...
I'll run a couple of these tests on my 8-core box to see if it is also more sensitive to this than I realize...
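A minimal sketch of the bookkeeping for that sweep, assuming you record the NPS reported by each 60-second, 8-thread run (time to a fixed depth could be compared the same way). The struct and function names are hypothetical, not part of Crafty, and summarizing each setting by its mean plus the max-min spread is just one reasonable way to compare them:

```c
#include <assert.h>
#include <stddef.h>

#define RUNS_PER_SETTING 4 /* same position, repeated 4 times per setting */

/* Hypothetical record: one smpsn value and the NPS measured on each run. */
struct smpsn_trial {
    int smpsn;
    double nps[RUNS_PER_SETTING];
};

/* Mean NPS over the repeated runs; averaging smooths out SMP run-to-run noise. */
double mean_nps(const struct smpsn_trial *t)
{
    double sum = 0.0;
    for (size_t i = 0; i < RUNS_PER_SETTING; i++)
        sum += t->nps[i];
    return sum / RUNS_PER_SETTING;
}

/* (max - min) / mean: a large value means the runs are too noisy
   to tell two settings apart. Assumes positive NPS values. */
double relative_spread(const struct smpsn_trial *t)
{
    double lo = t->nps[0], hi = t->nps[0];
    for (size_t i = 1; i < RUNS_PER_SETTING; i++) {
        if (t->nps[i] < lo) lo = t->nps[i];
        if (t->nps[i] > hi) hi = t->nps[i];
    }
    return (hi - lo) / mean_nps(t);
}

/* Index of the setting with the highest mean NPS. */
size_t best_setting(const struct smpsn_trial *trials, size_t n)
{
    assert(n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (mean_nps(&trials[i]) > mean_nps(&trials[best]))
            best = i;
    return best;
}
```

If the spread within one setting is as large as the difference between settings, the sweep is not telling you much, which would be consistent with smpsn having essentially no effect.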
Varying the smpsn parameter had essentially no effect.
I'm coming to the conclusion that this is a gcc-related issue. Have you tried compiling with gcc? I know you use icc.
Here are summary results for 23.2 and 23.3 respectively, each compiled with three versions of gcc.
Some recent change in the Crafty source really affected gcc-4.4.
My first thought would be to remove the -O3 and see what happens with 1 CPU vs 8. Then try -O2 if the first test scales reasonably (you should expect nps to be at least 7x faster, and usually pretty close to 8x). If -O2 breaks it, we have something to look at.
Never mind. Just tried gcc here. Something is beyond wrong, as my numbers suddenly get very close to yours. I'll experiment a bit to see if the problem is an obvious one... ugh...
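The scaling check above boils down to a ratio. A minimal sketch, assuming you have the NPS from a 1-thread run and an N-thread run of the same position with the same binary; the function names are hypothetical, and extending the "at least 7x on 8 threads" rule to other thread counts as an efficiency of 7/8 is an assumption:

```c
#include <assert.h>
#include <stdbool.h>

/* NPS speedup of an N-thread run over a 1-thread run on the same position. */
double nps_speedup(double nps_1thread, double nps_nthreads)
{
    assert(nps_1thread > 0.0);
    return nps_nthreads / nps_1thread;
}

/* Parallel efficiency: speedup divided by thread count (1.0 = perfect scaling). */
double nps_efficiency(double nps_1thread, double nps_nthreads, int threads)
{
    assert(threads > 0);
    return nps_speedup(nps_1thread, nps_nthreads) / threads;
}

/* "At least 7x on 8 threads", generalized (assumption) to efficiency >= 7/8. */
bool scales_reasonably(double nps_1thread, double nps_nthreads, int threads)
{
    return nps_efficiency(nps_1thread, nps_nthreads, threads) >= 7.0 / 8.0;
}
```

Run this once for the build without -O3 and again for the -O2 build; a build that passes the first check but fails the second points at the optimization level.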
bob wrote: BTW, for me gcc sucks for either 23.2 or 23.3... not sure why, yet...
Sorry---and relieved---to hear that. At least it's not just me. Thanks for the help.
Have you tried downloading Intel's free compiler for Linux? I suspect it will work on your Mac Unix system.
The free compiler for Linux won't install (well, maybe it would with some hacking) on OS X. I can download the compiler for Mac OS X, but it's for evaluation and stops working in 30 days.
After discovering that binaries compiled with recent versions of gcc were almost as fast as those produced with icc, I decided to give up on icc. Hope you can discover why gcc is struggling with 23.2 and 23.3.
23.3 crafty slow(er)?
well, i found some other compilations of the 23.3 version on the site of Jim Ablett's winboard projects. i haven't looked at the nps, though. while it does seem to run a bit slower than the 23.2 version, i find the evaluation function much better, and as a result i have included it in my latest Bookbuilder package (4.09), which you can download at superchess.com.
the 23.3 compilations now also seem to be available on Peter Skinner's site, http://www.webkikr.com/; i still have to compare which one is best for me..
best regards,
jef
23.3 is a little slower, but only by a few percent. Whenever you get more aggressive in pruning and reduce the effective branching factor, NPS suffers: you do a lot of work at the front of a node to generate moves and such, then throw some of it away without using it, which slows things down a bit. But here we are talking about well over 50% with gcc. On my 8-core cluster box, ICC builds run at 20M nps and up. Using gcc, this drops to 8M or so, for reasons that (so far) are unknown...
I have tried to profile, but the damned profiler doesn't work with parallel search; it produces corrupted data...
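To put the two effects side by side: a pruning-related slowdown of a few percent is normal, while 20M vs 8M nps is a drop of (20 - 8) / 20 = 60%. A minimal sketch of that arithmetic; the function name is hypothetical:

```c
#include <assert.h>

/* Percentage by which NPS drops when going from a baseline build to another build. */
double nps_drop_percent(double baseline_nps, double other_nps)
{
    assert(baseline_nps > 0.0);
    return 100.0 * (baseline_nps - other_nps) / baseline_nps;
}
```

With the figures quoted for the 8-core box, nps_drop_percent(20e6, 8e6) gives 60, which is the "well over 50%" gap.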
Newer versions of gcc have -fprofile-correction, which handles parallel threads, I think.
I know. And that's crazy-looking. I can't imagine what a compiler could do to cause that kind of SMP slowdown, unless some hidden library calls are inserted that require synchronization primitives to avoid some sort of internal data corruption. I'm looking into it, as I always use gcc on AMD boxes: the Intel compiler seems to produce some sort of bad code when run on AMD, something that slows things down about as much as gcc is doing here.