Remarkable...
I suppose this was for the version that specifies captures, castlings, etc., which took 199 sec on my 2.8 GHz P4.
For perft(6) the laptop is nearly as fast as my P4 desktop, which has a 1.4 times higher clock (that is already strange, but can perhaps be explained by a better compiler). But for perft(7) the laptop is then actually much faster.
I did not like the large difference I got between the version that was giving the count break-down vs the one that only calculated total count. (199 vs 125 sec, a ridiculous 60% slowdown due to counting, that only occurred on the P4; on the Core 2 Duo the counting only gave a 9% slowdown.) This suggests there is something very sick going on with the translation of a few counter increments.
Could you delete all the conditional counting after the count++; near the recursive perft call,
/* recursion or count end leaf */
if(depth>1) perft(COLOR-color, stack[i], depth-1, d+1);
else
{ count++; if(to!=capt) epcnt++; if(victim!=DUMMY)xcnt++;
if((unsigned int)stack[i]>0xA2000000) cascnt++;
if(mode==0xA1) promcnt++;
}
and time how long it takes then? I have the feeling that the poor performance I have on P4 here is due to heavy branch misprediction caused by poor translation (gcc is not using conditional moves here). Your compiler might do it differently, and thus give much more efficient code.
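To make clear what I mean by avoiding the branches: a minimal sketch, not the code from the actual program. It adds the 0/1 result of each comparison to the counter instead of jumping around an increment, so the compiler has no need for a conditional branch. The function name count_leaf, its parameter list and the value of DUMMY are assumptions made up for this illustration. Whether gcc really emits branch-free code for it would have to be checked in the assembly output.

```c
#define DUMMY 0   /* assumed sentinel value, only for this sketch */

static long count, epcnt, xcnt, cascnt, promcnt;

/* Hypothetical leaf-counting helper: every comparison yields 0 or 1,
 * and that value is added directly, so no 'if' is needed. */
static void count_leaf(int to, int capt, int victim,
                       unsigned int move, int mode)
{
    count++;
    epcnt   += (to != capt);          /* e.p.: capture square differs from to-square */
    xcnt    += (victim != DUMMY);     /* any capture */
    cascnt  += (move > 0xA2000000u);  /* castling, same test as in the snippet above */
    promcnt += (mode == 0xA1);        /* promotion */
}
```

In the leaf branch this would be called as count_leaf(to, capt, victim, (unsigned int)stack[i], mode); in place of the five statements between the braces.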
That the speed advantage does not show up with the shallower perfts might be because there is some power saving trick in the laptop, that only switches the CPU to full speed after a few seconds of no idle time.
Anyway, the Athlon 3400+ seems way faster than my 2.8GHz P4. It is 1.57 times faster than your 2GHz P4 laptop, which translates to an effective clock performance rating of 3157. Not so far from 3400...
I guess that the compiler influence is too big to compare your timings with mine. On the P4 things behave extremely erratically. Without the count breakdown I measure 6.7 sec / 124.7 sec (perft(6) / perft(7)) on the 2.8GHz P4, after compilation with 'gcc -O2'. When I compile with 'gcc -O3' (which is supposed to be better...) this goes to 6.8 sec / 182.5 sec. Almost no change on the perft(6), and an enormous slowdown of the perft(7)! When I then compile with 'gcc -O2 -mno-cygwin', which is supposed to give no other change than using native Windows I/O calls instead of those from the cygwin dll, I get 4.3 sec / 169.5 sec. An enormous speedup of the perft(6) compared to using cygwin.dll, but an enormous slowdown for the perft(7).
Anyway, your timing on the Athlon corresponds to 33M nps.
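For reference, the nps figure is just the leaf count divided by the elapsed time. A small sketch of that arithmetic follows; the helper name nps is made up here, and the node count in the comment assumes the perft(7) was run from the standard opening position (3,195,901,860 leaves), which is not stated explicitly above.

```c
/* Hypothetical helper: nodes per second from a leaf count and a time.
 * Example under the assumption above: nps(3195901860ULL, 124.7) comes
 * out at roughly 25.6 million for my 'gcc -O2' run on the P4. */
static double nps(unsigned long long nodes, double seconds)
{
    return (double)nodes / seconds;
}
```

Under the same assumption, 33M nps would correspond to a perft(7) time of a bit under 97 sec.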