Compiler switches

abik · Post by **abik** » Mon Apr 16, 2007 8:35 pm

Dear Jarkko,
It pains me to admit, but this is a compiler bug all right (in 9.1, no longer in 10.0). I downloaded the source and could reproduce and debug the difference with “go depth 15” exactly. By a very strange coincidence, but most fitting, the bug was in my own module, namely automatic vectorization. Thanks to your sharp eye, I am able to correct this mistake in the 9.1 version! Ironic how my hobby and job met here.
Thanks again,
Aart Bik
http://www.aartbik.com/

jarkkop · Post by **jarkkop** » Tue Apr 17, 2007 12:02 am

Nice that I could help, and that I was not imagining things, as is sometimes the case.
As an expert, can you say which switches would get the most out of "your" compiler to make the Toga executable even faster? And with your fixed version, does /QxT make Toga any faster than /QxP on an E4300?

Jarkko

abik · Post by **abik** » Wed Apr 18, 2007 7:46 pm

Dear Jarkko,

FWIW, I just committed the compiler fix to our development workspace, which means that it will eventually find its way into a product update. As for your performance question, some good suggestions were already made in this thread. Below, I show some results with the fixed 9.1 and the upcoming 10.0 on a 2.4GHz Conroe (keep in mind that the results you reported earlier for -QxT after about 9 seconds exposed the bug, which would change the variation on depth 15 a few seconds later; the results below are the only variation reported for depth 15). Chess engines pose challenges for compiler optimization, partly due to the nature of the application and probably partly due to the fact that most chess programmers understand compilers well enough to do a lot of optimization at source level already. So I am glad to see that at least some performance benefits are obtained.

-O2 (9.1)
info multipv 1 depth 15 seldepth 44 score cp 16 time 24156 nodes 21493535 pv b1c3 g8f6 d2d4 d7d5 c1f4 c7c5 e2e3 c5d4 e3d4 d8b6 d1d3 c8d7 e1c1 b8a6 d3f3 a6b4 c1b1

-Qprof_use -O3 -Qipo -QxP (9.1)
info multipv 1 depth 15 seldepth 44 score cp 16 time 19172 nodes 21493535 pv b1c3 g8f6 d2d4 d7d5 c1f4 c7c5 e2e3 c5d4 e3d4 d8b6 d1d3 c8d7 e1c1 b8a6 d3f3 a6b4 c1b1

-Qprof_use -O3 -Qipo -QxT (9.1)
info multipv 1 depth 15 seldepth 44 score cp 16 time 19094 nodes 21493535 pv b1c3 g8f6 d2d4 d7d5 c1f4 c7c5 e2e3 c5d4 e3d4 d8b6 d1d3 c8d7 e1c1 b8a6 d3f3 a6b4 c1b1

-Qprof_use -O3 -Qipo -QxP (10.0)
info multipv 1 depth 15 seldepth 44 score cp 16 time 18828 nodes 21493535 pv b1c3 g8f6 d2d4 d7d5 c1f4 c7c5 e2e3 c5d4 e3d4 d8b6 d1d3 c8d7 e1c1 b8a6 d3f3 a6b4 c1b1

-Qprof_use -O3 -Qipo -QxT (10.0)
info multipv 1 depth 15 seldepth 44 score cp 16 time 18672 nodes 21493535 pv b1c3 g8f6 d2d4 d7d5 c1f4 c7c5 e2e3 c5d4 e3d4 d8b6 d1d3 c8d7 e1c1 b8a6 d3f3 a6b4 c1b1
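
To make the five runs above easier to compare, here is a small sketch that turns the reported times into nodes per second and a speedup relative to plain -O2. It assumes the "time" field is in milliseconds, as is usual for UCI info lines; the names Run, kRuns, nodesPerSecond, speedup and printSummary are made up for this illustration and have nothing to do with Toga or the compiler.

```cpp
#include <cassert>
#include <cstdio>

// One row of the depth-15 results quoted above.
struct Run {
    const char *switches;  // compiler switches and version, as reported
    long timeMs;           // "time" field of the info line (assumed milliseconds)
};

// All five runs report the same node count for depth 15.
const long long kNodes = 21493535LL;

const Run kRuns[] = {
    {"-O2 (9.1)",                        24156},
    {"-Qprof_use -O3 -Qipo -QxP (9.1)",  19172},
    {"-Qprof_use -O3 -Qipo -QxT (9.1)",  19094},
    {"-Qprof_use -O3 -Qipo -QxP (10.0)", 18828},
    {"-Qprof_use -O3 -Qipo -QxT (10.0)", 18672},
};

// Nodes searched per second for one run.
double nodesPerSecond(const Run &run) {
    assert(run.timeMs > 0);
    return static_cast<double>(kNodes) * 1000.0 / run.timeMs;
}

// How many times faster `run` finished than `baseline`.
double speedup(const Run &baseline, const Run &run) {
    assert(run.timeMs > 0);
    return static_cast<double>(baseline.timeMs) / run.timeMs;
}

// Print one line per run, using the -O2 build as the baseline.
void printSummary() {
    for (const Run &run : kRuns) {
        std::printf("%-34s %9.0f nps  %.3fx\n",
                    run.switches, nodesPerSecond(run), speedup(kRuns[0], run));
    }
}
```

Calling printSummary() from main prints the table. With these numbers the profile-guided builds come out roughly 26% to 29% faster than plain -O2, while the gap between -QxP and -QxT stays under 1% on either compiler version.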

Thanks again for bringing this bug to my attention. One final comment: I did not peek at the Toga source other than to debug the compiler (the “weakness” of my own chess engine gives sufficient proof for that).

Aart Bik
http://www.aartbik.com/

jwes · Post by **jwes** » Wed Apr 18, 2007 9:17 pm

I read in the Intel optimization manual that the bit operations are now very fast on the Core 2 Duo. Does the Intel compiler use these? E.g., translate

if (x & (1 << n))
do something
x &= ~(1 << n)

to

BTR x,n
JNC xx
do something
xx:

abik · Post by **abik** » Wed Apr 18, 2007 10:09 pm

If you simply are referring to bit-test instructions, then yes, see below. If I miss a subtle detail in your question, please forgive my ignorance and elaborate.

int x, n;

if (x & (1 << n))
    global = 0;

translates by default (O2) to:

mov ecx, DWORD PTR [_n]
mov eax, 1
shl eax, cl
test DWORD PTR [_x], eax
je skip

mov DWORD PTR [_global], 0
skip:

but when compiled for Core 2 Duo (QxT) to:

mov eax, DWORD PTR [_x]
mov edx, DWORD PTR [_n]
bt eax, edx
jae skip

mov DWORD PTR [_global], 0
skip:

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Apr 19, 2007 2:44 pm

abik wrote: If you simply are referring to bit-test instructions, then yes, see below. If I miss a subtle detail in your question, please forgive my ignorance and elaborate.

Hi Aart,

I guess Wesley's question was whether the compiler understands the semantics of resetting the bit, and so uses btr instead of bt. E.g., what is the assembly for this inlined bitTestAndReset routine:

Code: Select all

bool bitTestAndReset(unsigned int &set, unsigned int bitIndex)
{
    unsigned int bit = 1u << bitIndex;
    bool isSet = (set & bit) != 0;
    set &= ~bit;
    return isSet;
}

Code: Select all

if ( bitTestAndReset(x, n))
   doSomething();

Does it translate to something like this?

Code: Select all

mov eax, DWORD PTR [_x]
mov edx, DWORD PTR [_n]
btr eax, edx
mov DWORD PTR [_x], eax
jnc skip

Or do we explicitly need the _bittestandreset (or _bittestandreset64) intrinsics?

Code: Select all

if ( _bittestandreset(&x, n))
   doSomething();

Thanks,
Gerd

abik · Post by **abik** » Thu Apr 19, 2007 6:48 pm

Thanks for the detailed explanation Gerd, which was very helpful. In that case the answer is unfortunately no, or perhaps, not yet, as I am going to discuss this idea with our code generator experts.
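
Until the compiler recognizes the pattern by itself, one way to keep the source independent of that is a thin wrapper around the intrinsic Gerd mentions. The following is only a sketch under some assumptions: the name testAndResetBit is made up here, the _MSC_VER guard assumes a Microsoft-compatible toolchain that declares _bittestandreset(long *, long) in <intrin.h>, and the fallback branch is plain C++ in the spirit of Gerd's routine. Whether either branch ends up as a single btr is up to the compiler and is not verified here.

```cpp
#include <cassert>
#if defined(_MSC_VER)
#include <intrin.h>  // declares _bittestandreset on Microsoft-compatible compilers
#endif

// Hypothetical helper: returns whether bit `bitIndex` of `set` was set,
// and clears that bit in either case.
inline bool testAndResetBit(long &set, long bitIndex) {
    // Stay within the 32 bits that `long` has on Windows.
    assert(bitIndex >= 0 && bitIndex < 32);
#if defined(_MSC_VER)
    // Intrinsic path: the intrinsic returns the old value of the bit.
    return _bittestandreset(&set, bitIndex) != 0;
#else
    // Portable fallback: do the masking on an unsigned copy.
    const unsigned long bit = 1UL << bitIndex;
    const unsigned long bits = static_cast<unsigned long>(set);
    const bool wasSet = (bits & bit) != 0;
    set = static_cast<long>(bits & ~bit);
    return wasSet;
#endif
}
```

It would be used like the earlier snippets: if (testAndResetBit(x, n)) doSomething(); with x declared as long so that it matches the intrinsic's pointer type.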
