mcostalba wrote: Congratulations!
Actually I have tested with -std=c++11 but unfortunately no luck here.
I finally found the culprit. In DiscoCheck, my move struct is defined like this:
Code:
struct move_t {
    uint16_t fsq:6, tsq:6;
    uint16_t promotion:3;
    uint16_t ep:1;
};
If I use instead:
Code:
struct move_t {
    uint16_t fsq:8, tsq:8;
    uint16_t promotion:8;
    uint16_t ep:1;
};
the speed gain on a perft benchmark is already huge (from 56 M leaf/sec to 85 M leaf/sec).
And I get almost 100 M leaf/sec with this:
Code:
struct move_t {
    unsigned fsq, tsq;
    int promotion;
    bool ep;
};
And the reason is obvious if you think about how the code runs in assembly language. Extracting m.fsq and m.tsq, for example, requires some bit shifting and masking in the first case, while in the second and third cases it's straightforward. For example, in the second case sizeof(move_t) = 4, and if the 32-bit register EAX contains move_t m, then AL == m.fsq, AH == m.tsq, and m.promotion is the third byte of EAX (less commonly used; if there isn't a trick to get it directly, then simply shr EAX, 16 and AL contains m.promotion).
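To make the shift-and-mask cost concrete, here is a rough sketch of what field extraction looks like when written by hand for a 6/6/3/1 split, next to a byte-aligned variant. The names (packed_move, pack, packed_fsq, wide_move and so on) and the exact bit positions are my own assumptions for illustration. The real layout of a bit-field is implementation-defined, so this is not necessarily what the compiler generates for DiscoCheck.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-bit encoding mirroring the 6/6/3/1 split of the first struct.
// Bit positions are an assumption: real bit-field layout is implementation-defined.
using packed_move = std::uint16_t;

constexpr packed_move pack(unsigned fsq, unsigned tsq, unsigned promotion, bool ep) {
    return static_cast<packed_move>((fsq & 0x3Fu)
                                    | ((tsq & 0x3Fu) << 6)
                                    | ((promotion & 0x7u) << 12)
                                    | (static_cast<unsigned>(ep) << 15));
}

// Every accessor needs a mask, and all but the first also need a shift.
constexpr unsigned packed_fsq(packed_move m)       { return m & 0x3Fu; }
constexpr unsigned packed_tsq(packed_move m)       { return (m >> 6) & 0x3Fu; }
constexpr unsigned packed_promotion(packed_move m) { return (m >> 12) & 0x7u; }
constexpr bool     packed_ep(packed_move m)        { return (m >> 15) != 0; }

// Byte-aligned variant: each field occupies its own byte, so reading it
// is a plain byte access with no shifting or masking in the source.
struct wide_move {
    std::uint8_t fsq, tsq, promotion;
    bool ep;
};
```

For instance, packed_tsq(pack(12, 28, 5, true)) has to shift right by 6 and mask with 0x3F to recover 28, while wide_move just reads w.tsq.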
The problem is that sizeof(move_t) really has to be 2, so that hash entries can later be packed into 16 bytes while retaining a sufficient number of bits from the hash key to avoid collisions (I use 62 bits in DiscoCheck to be on the safe side).
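To illustrate the size constraint, here is a minimal sketch of a 16-byte hash entry built around a 2-byte move and a 62-bit key. Only those two sizes come from the post. The remaining fields (score, eval, depth, age), the use of the low 2 bits for a bound type, and all the names are assumptions, not DiscoCheck's actual layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16-byte entry. Only the 62-bit key and the 2-byte move come
// from the post; every other field here is an assumption.
struct hash_entry {
    std::uint64_t key_and_bound;  // upper 62 bits: hash key, lower 2 bits: bound type (assumed)
    std::uint16_t move;           // the 2-byte packed move
    std::int16_t  score;          // assumed
    std::int16_t  eval;           // assumed
    std::int8_t   depth;          // assumed
    std::uint8_t  age;            // assumed
};
static_assert(sizeof(hash_entry) == 16, "entry is expected to fit in 16 bytes");

constexpr std::uint64_t KEY_MASK = ~std::uint64_t(3);  // clears the low 2 bits

inline void store_key(hash_entry& e, std::uint64_t key, unsigned bound) {
    e.key_and_bound = (key & KEY_MASK) | (bound & 3u);
}

// Compare only the 62 key bits that are actually stored.
inline bool key_matches(const hash_entry& e, std::uint64_t key) {
    return ((e.key_and_bound ^ key) & KEY_MASK) == 0;
}

inline unsigned bound_of(const hash_entry& e) {
    return static_cast<unsigned>(e.key_and_bound & 3u);
}
```

With a 4-byte move in this particular layout, the fields would add up to 18 bytes, and with 8-byte alignment the struct would typically be padded to 24, which is why the faster move layouts don't fit here.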
So, for a pure perft(), rather than a fully fledged engine, this is quite an optimization.
PS: Contrary to my initial (stupid) idea, this has nothing to do with C++11 or C++ vs C. The same code compiled in C99 or C++11 performed exactly the same (the C++ executable being slightly more bloated, and a tiny bit slower, as usual).
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.