The best compiler for chess, Intel or gcc or something else?

Discussion of chess software programming and technical issues.

Moderator: Ras

Has GCC caught up with Intel with respect to performance?

Poll ended at Sun Oct 14, 2012 4:32 pm

Yes: 15 (60%)
No: 10 (40%)

Total votes: 25

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some interesting reading.

Post by Don »

diep wrote:
Don wrote:
rbarreira wrote: Vincent, if you say GCC sucks on AMD, what is a better compiler for AMD?
Vincent tends to be highly opinionated - but he is dead wrong about GCC being horribly slow, etc. Also he tends to use hyperbole to emphasize his points.

Sometimes he is right on the money, but for the most part you have to take what he says with a grain of salt.

Also:

http://developer.amd.com/tools/opensour ... fault.aspx
AMD recognizes that the GNU toolchain plays a critical part in the software development ecosystem, and therefore has been actively contributing to its evolution for over a decade. As far back as 2001, AMD developers and analysts have brought cutting edge code generation and reliability improvements for all x86 platforms to the GCC, glibc, binutils, GDB projects.

We feel strongly about our role, as a part of the greater open source community, to drive the quality and adoption of GCC and other components of the GNU project. AMD maintains close relationships with operating systems and compiler teams at major linux distribution vendors, as well as with GNU tools developers. These collaborations ensure that distributions include relevant GNU toolchain support for AMD platforms and help future-proof software products, and to analyze cutting edge instruction sets.
Facts speak for themselves:
Exactly. You are claiming that icc wins at least a 20% speedup compared to gcc's 7%, which is a difference of 13% - you are claiming that icc is 13% faster, and that is your lower bound, because the 20% speedup is really "20-25%" according to you, and Visual Studio is 22%+ (the plus means "at least").

This is typical of your hyperbole, and it's why you're difficult to believe.
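(As a side calculation: if one takes those numbers at face value and assumes, purely for the arithmetic, that both compilers' non-PGO binaries were equally fast - which is not established anywhere in this thread - then an icc binary at +20% versus a gcc binary at +7% works out to 1.20 / 1.07 ≈ 1.12, i.e. roughly a 12% difference; close to, but not the same thing as, the 13-percentage-point gap in PGO gains.)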

I know that Jim Ablett used to compile my binaries using the Intel compiler, and he is known to be exceptionally good at squeezing the most from compiles. His compiles with the Intel compiler were always a few percent faster than my GCC compiles, but it was way under 10%, and while I almost surely did not get the most from GCC, he probably got close to what could be obtained from ICC. We are talking about the GCC compiler BEFORE 4.6.

I knew a kid in school who exaggerated like you do. He rounded everything up or down, whichever made his point best, and he chained all these round-off errors together. He was even worse than you, but if he were in computers he would have been able to claim a 2 to 1 advantage by accumulating round-up errors and applying them all to the Intel side! He would go down a list of all the things he believed were better on the Intel side and add an exaggerated estimate of each of them to the total. Then, if you tried to call him on it, he had so much error built in that he would give you the benefit of the doubt on a couple of them.

If you actually ran the test you say you did and came up with a 25% speedup, then I would have to say that you probably used the wrong compiler options or did something else wrong and didn't realize it.

GCC still has a horrible PGO. It wins just 7% on Intel processors, against icc winning 20-25% and Visual Studio winning 22%+.

Now, where GCC 4.7.0 sped up a lot compared to the 4.4/4.5 series on Intel hardware, I see no visible difference (or it must be somewhere around 1% or less) between GCC 4.7.0 and 4.4/4.5 on AMD hardware.

The reason, seemingly, is that GCC prematurely rewrites simple branches into complex constructs that afterwards do not get considered by the PGO for optimization. Yet this rewrite slows GCC down on both architectures, AMD and Intel, and AMD especially gets hit hard by it.

I'm guessing most bitboarders hardly have those branches.

I'm sure the reason for this is the fact that it wins just 7% with PGO. That's very little.

Even AMD's own compiler wins a whopping 30-40% with PGO.

GCC is still the same sabotaged compiler, from AMD's viewpoint.

Now that AMD has a massive disadvantage in number of cores versus Intel, I'd guess that would no longer be needed for Intel to seem faster.

ICC's PGO wins a whopping 25% for Diep on an older K7, for example, when using "P3 settings" with somewhat older ICC versions.
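For reference, the PGO build cycle being argued about looks roughly like this for the two compilers in question; the flag spellings are those of GCC 4.x and the classic ICC of that era, and the source list, binary name, and "bench" training run are only placeholders:

# GCC: instrument, run a training workload, then rebuild using the profile
gcc -O3 -fprofile-generate -o diep *.c
./diep bench
gcc -O3 -fprofile-use -o diep *.c

# ICC (classic): the same cycle with -prof-gen / -prof-use
icc -O3 -prof-gen -o diep *.c
./diep bench
icc -O3 -prof-use -o diep *.c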
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: The best compiler for chess, Intel or gcc or something else?

Post by bob »

rbarreira wrote:
bob wrote: Think about what you would do if you knew the history of every branch in your program.

If you had code like this:

if (c) { statements }

but you knew that most of the time (> 50%) c is false, you would want to write it like this:

if (c) goto xxxx;
back_again:
...


and somewhere else you would do this:

xxxx:
{ statements }
goto back_again;


Now when that executes, the {statements} are not in the direct code path and don't get brought into cache, taking up space and time, as well as booting something else out.

That is about all PGO can do. Branch prediction is done in the hardware, so it can't help there. But if you have a ton of if statements, as in a chess engine evaluation (for one place), then it can help. I see about a 10% improvement with icc. gcc has always been problematic for me and is unreliable when doing PGO. It either crashes or produces corrupt PGO temp files, particularly if I try to PGO everything including the threaded code...
PGO does more than that, at least in some compilers:

http://msdn.microsoft.com/en-us/library ... 80%29.aspx

http://software.intel.com/sites/product ... C30CE3.htm
If you read the simple statement I wrote, it covers 99% of those optimizations. About all that can be done is to produce code paths that contain often-executed code but not rarely-executed code. Each of the things listed on the MS page is simply a variation on that theme.
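As an aside, the hot/cold layout described above can also be requested by hand with GCC's __builtin_expect (also accepted by ICC), which is the manual version of what PGO derives from profile data. A minimal compilable sketch - the function and macro names are invented for the example:

#include <stdio.h>

/* Hint that a condition is almost always false, so the compiler can move the
   guarded block out of the fall-through path, much as PGO would. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static int handle_rare_case(void) { return -1; }

static int search_step(int c)
{
    if (unlikely(c))                 /* rare: emitted off the hot path */
        return handle_rare_case();
    return 0;                        /* common case stays straight-line */
}

int main(void)
{
    printf("%d %d\n", search_step(0), search_step(1));
    return 0;
}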
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Some interesting reading.

Post by diep »

Don wrote:
diep wrote:
Don wrote:
rbarreira wrote: Vincent, if you say GCC sucks on AMD, what is a better compiler for AMD?
Vincent tends to be highly opinionated - but he is dead wrong about GCC being horribly slow, etc. Also he tends to use hyperbole to emphasize his points.

Sometimes he is right on the money, but for the most part you have to take what he says with a grain of salt.

Also:

http://developer.amd.com/tools/opensour ... fault.aspx
AMD recognizes that the GNU toolchain plays a critical part in the software development ecosystem, and therefore has been actively contributing to its evolution for over a decade. As far back as 2001, AMD developers and analysts have brought cutting edge code generation and reliability improvements for all x86 platforms to the GCC, glibc, binutils, GDB projects.

We feel strongly about our role, as a part of the greater open source community, to drive the quality and adoption of GCC and other components of the GNU project. AMD maintains close relationships with operating systems and compiler teams at major linux distribution vendors, as well as with GNU tools developers. These collaborations ensure that distributions include relevant GNU toolchain support for AMD platforms and help future-proof software products, and to analyze cutting edge instruction sets.
Facts speak for themselves:
Exactly. You are claiming that icc wins at least a 20% speedup compared to gcc's 7%, which is a difference of 13% - you are claiming that icc is 13% faster, and that is your lower bound, because the 20% speedup is really "20-25%" according to you, and Visual Studio is 22%+ (the plus means "at least").
Yes, what's difficult to understand there?

The ICC compiler only works well with PGO. Without it, it's slower of course; everyone knows that. It always was.

It's 100% parameter tuning that's doing the job, of course.

GCC is doing a bad parameter tuning job. It rewrites the code early on, making it tougher afterwards to tune the branches so that they can be optimized according to what the PGO data tells you.

See Bob's explanation on that as well.

Is this so difficult to understand?

It's what I'm saying non-stop: it wins too little from PGO. Just 7% on modern Intel processors, versus the rest winning 20-30%. In fact AMD's own compiler wins even more (but its default output is really slow).

So a number of compilers nowadays do this: generate PGO code and speed up really a lot because of it. By default, none of those compilers is really fast, of course.

I'm not telling you anything new here, am I?

The AMD compiler used to be the MIPSpro compiler; that thing didn't have PGO. So they added it quickly, and it now wins nearly 40 percent or so - I didn't measure it exactly. Overall it's slower.

ICC, however, is 11% faster than GCC 4.7.0. That's with PGO for both, and with an old ICC version, 11.0.

So that's not a fair comparison, in fact. Also, I didn't test the latest Visual Studio yet. 11% is already a lot.

I'm interested in the fastest compile a compiler can produce. We see that all the other compilers win dozens of percent from PGO and GCC doesn't, so which compiler is doing it wrong?
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Some interesting reading.

Post by diep »

Don wrote:
I know that Jim Ablett used to compile my binaries using the Intel compiler, and he is known to be exceptionally good at squeezing the most from compiles. His compiles with the Intel compiler were always a few percent faster than my GCC compiles, but it was way under 10%
Which is weird, as for other bitboarders the difference was way bigger with older GCC compilers (4.5 and before).

However, realize Diep has many branches, as it has the world's largest evaluation function. Probably around a factor of 100 larger than yours.

So there are a lot of complex branches, also skipping a lot of knowledge.

This means there is a lot to win by means of PGO - see Bob's explanation on that if you don't know what PGO is with respect to branches.

Yet one should also use the PGO to avoid getting a bigger L1i miss rate. GCC seems completely ignorant there.

I bet your current bitboard code is so tiny that you don't have L1i problems with it at all. It will probably fit entirely within L1i.

In short, GCC is doing a bad job there at several levels: it both has a bigger L1i miss rate and messes up the branches.

If you have few complex branches in your code, that isn't demanding much from GCC, in short. That doesn't give it an excuse to generate objectively slower code.

If it then has a few vectorisation optimizations specific to bitboards, it might be faster for you.

Another difference between YOUR bitboard code and other guys' bitboard code might be that you code everything in C, whereas most Windows guys use intrinsics on Windows.

That might give them extra speed on Windows and lose them maybe a lot with GCC, which doesn't have those intrinsics. Which means you need assembler code, like Crafty does it. I'd say have fun with that assembler. It's a choice.

So, summarized, why GCC is slow for Diep:

- it's bad with branches, as it prematurely rewrites them instead of letting the PGO decide what to do with them
- it generates more instructions than other compilers, causing more misses in L1i, which hurts big time. An old measurement with cachegrind gave me a 1.6% miss rate or so for GCC. That's a huge slowdown.
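For anyone who wants to reproduce that kind of L1i measurement, the usual route is valgrind's cachegrind tool; the engine name and arguments below are only placeholders:

valgrind --tool=cachegrind ./diep bench
# the summary printed at exit contains an "I1 miss rate" line - that is the
# instruction-cache figure quoted above; cg_annotate cachegrind.out.<pid>
# breaks the misses down per function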
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Some interesting reading.

Post by Don »

diep wrote:
Don wrote:
I know that Jim Ablett used to compile my binaries using the Intel compiler, and he is known to be exceptionally good at squeezing the most from compiles. His compiles with the Intel compiler were always a few percent faster than my GCC compiles, but it was way under 10%
Which is weird, as for other bitboarders the difference was way bigger with older GCC compilers (4.5 and before).

However, realize Diep has many branches, as it has the world's largest evaluation function. Probably around a factor of 100 larger than yours.
Yes, it is altogether possible that our programs are so different that they respond much differently to different compilers. You can test that by seeing what happens with some open source program.


So there are a lot of complex branches, also skipping a lot of knowledge.

This means there is a lot to win by means of PGO - see Bob's explanation on that if you don't know what PGO is with respect to branches.

Yet one should also use the PGO to avoid getting a bigger L1i miss rate. GCC seems completely ignorant there.

I bet your current bitboard code is so tiny that you don't have L1i problems with it at all. It will probably fit entirely within L1i.

In short, GCC is doing a bad job there at several levels: it both has a bigger L1i miss rate and messes up the branches.

If you have few complex branches in your code, that isn't demanding much from GCC, in short. That doesn't give it an excuse to generate objectively slower code.

If it then has a few vectorisation optimizations specific to bitboards, it might be faster for you.

Another difference between YOUR bitboard code and other guys' bitboard code might be that you code everything in C, whereas most Windows guys use intrinsics on Windows.

That might give them extra speed on Windows and lose them maybe a lot with GCC, which doesn't have those intrinsics. Which means you need assembler code, like Crafty does it. I'd say have fun with that assembler. It's a choice.

So, summarized, why GCC is slow for Diep:

- it's bad with branches, as it prematurely rewrites them instead of letting the PGO decide what to do with them
- it generates more instructions than other compilers, causing more misses in L1i, which hurts big time. An old measurement with cachegrind gave me a 1.6% miss rate or so for GCC. That's a huge slowdown.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: The best compiler for chess, Intel or gcc or something else?

Post by rbarreira »

bob wrote:
rbarreira wrote:
bob wrote: Think about what you would do if you knew the history of every branch in your program.

If you had code like this:

if (c) { statements }

but you knew that most of the time (> 50%) c is false, you would want to write it like this:

if (c) goto xxxx;
back_again:
...


and somewhere else you would do this:

xxxx:
{ statements }
goto back_again;


Now when that executes, the {statements} are not in the direct code path and don't get brought into cache, taking up space and time, as well as booting something else out.

That is about all PGO can do. Branch prediction is done in the hardware, so it can't help there. But if you have a ton of if statements, as in a chess engine evaluation (for one place), then it can help. I see about a 10% improvement with icc. gcc has always been problematic for me and is unreliable when doing PGO. It either crashes or produces corrupt PGO temp files, particularly if I try to PGO everything including the threaded code...
PGO does more than that, at least in some compilers:

http://msdn.microsoft.com/en-us/library ... 80%29.aspx

http://software.intel.com/sites/product ... C30CE3.htm
If you read the simple statement I wrote, it covers 99% of those optimizations. About all that can be done is to produce code paths that contain often-executed code but not rarely-executed code. Each of the things listed on the MS page is simply a variation on that theme.
How can it cover 99% of the optimizations on the MS page when there are only 10 listed there? Talk about hyperbole...

I've been in enough arguments with you to realize you rarely admit you're wrong, but here are the optimizations which are definitely not covered by your simple trick for if statements:
Size/Speed Optimization – Functions where the program spends a lot of time can be optimized for speed.
Register Allocation – Optimizing with profile data results in better register allocation.
Virtual Call Speculation – If a virtual call, or other call through a function pointer, frequently targets a certain function, a profile-guided optimization can insert a conditionally-executed direct call to the frequently-targeted function, and the direct call can be inlined.
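A rough illustration of that last one, virtual call speculation, written with a C function pointer rather than a C++ virtual call; the evaluation functions here are invented purely for the example:

#include <stdio.h>

typedef int (*eval_fn)(int sq);

static int eval_pawn(int sq)   { return 100 + sq; }
static int eval_knight(int sq) { return 300 + sq; }

/* What the source looks like: an indirect call the compiler cannot inline. */
static int eval_indirect(eval_fn f, int sq)
{
    return f(sq);
}

/* What virtual-call speculation effectively turns that into once profile data
   shows f is almost always eval_pawn (written by hand here to show the shape
   of the transformation): */
static int eval_speculated(eval_fn f, int sq)
{
    if (f == eval_pawn)
        return eval_pawn(sq);   /* direct call: predictable and inlinable */
    return f(sq);               /* cold fallback keeps the code correct */
}

int main(void)
{
    eval_fn f = eval_pawn;
    (void)eval_knight;          /* second possible target, unused in this run */
    printf("%d %d\n", eval_indirect(f, 3), eval_speculated(f, 3));
    return 0;
}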