Can we learn anything from PGO compiled executables?

mhalstern · Post by **mhalstern** » Fri Nov 27, 2009 11:36 pm

Can we learn anything from PGO compiled executables?

Does the compilation process produce any revised source code? Can we decompile a PGO generated executable and see what changes the compiler made and use this for future projects?

jwes · Post by **jwes** » Sat Nov 28, 2009 12:59 am

mhalstern wrote:Can we learn anything from PGO compiled executables?

Does the compilation process produce any revised source code? Can we decompile a PGO generated executable and see what changes the compiler made and use this for future projects?

I believe that the main point of PGO is to make the program more cache friendly, e.g. moving code around so that function entries and jump targets for the most used code are cache aligned and do not map to the same cache locations. I would much rather let the compiler do this.

Greg Strong · Post by **Greg Strong** » Sat Nov 28, 2009 1:19 am

I have no answers, but while on the subject of PGO, I'm curious about people's experience with this...

Has anyone had significant (measurable) benefit from PGO? If so, what compiler were you using, and what was your procedure for running the instrumented build to generate the profile data?

bob · Post by **bob** » Sat Nov 28, 2009 2:42 am

mhalstern wrote:Can we learn anything from PGO compiled executables?

Does the compilation process produce any revised source code? Can we decompile a PGO generated executable and see what changes the compiler made and use this for future projects?

PGO is only about branches. The idea is this.

Given this C statement:

if (condition) {

}

The question is whether "condition" is true most of the time or not. If it is usually true, then the above code is optimal. If it is usually false, then you want to change it to this:

if (!condition) goto boondocks;
return_from_boondocks:

and at label boondocks you place the block of code from inside the braces above, followed by a jmp return_from_boondocks.

The idea is this. If the branch is true most of the time, you execute that code most of the time, and since it is in sequential memory addresses, cache block fills will be correctly pre-fetching things you actually need. But if it is false most of the time, you fetch that code (in the original example) and then skip around it. By moving that code elsewhere, and only jumping to it on the less common case where condition is false, you increase cache hits. Yes, you can see that if you ask for assembly output. I don't think you can get the modified C, however, but I might be wrong since I have not tried in years.
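The layout change described above can also be requested by hand. The following is a minimal sketch, assuming GCC or Clang, whose `__builtin_expect` builtin tells the compiler which way a branch usually goes. The function name `sum_with_rare_fixup` and the `LIKELY`/`UNLIKELY` macros are made up for illustration, and whether the rare block is really moved out of line depends on the compiler and optimization level.

```c
#include <assert.h>

/* __builtin_expect is a GCC/Clang builtin; on other compilers the
 * macros fall back to the plain condition and carry no hint. */
#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hypothetical example: sums absolute values, where negative inputs
 * are assumed (by us, not measured) to be rare. */
int sum_with_rare_fixup(const int *values, int n)
{
    assert(n >= 0);
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(values[i] < 0)) {
            /* The "boondocks" block: the hint allows the compiler to
             * place this code away from the hot fall-through path. */
            sum -= values[i];
        } else {
            sum += values[i];
        }
    }
    return sum;
}
```

Compiling with `-O2 -S` and comparing the assembly with and without the hint is one way to see the effect. PGO does the same thing, except that the probabilities come from measured counts rather than from the programmer's guess.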

bob · Post by **bob** » Sat Nov 28, 2009 2:44 am

Greg Strong wrote:I have no answers, but while on the subject of PGO, I'm curious about people's experience with this...

Has anyone had significant (measurable) benefit from PGO? If so, what compiler were you using, and what was your procedure for running the instrumented build to generate the profile data?

I have had excellent results with Intel's C++ compiler. gcc produces good results when it works, but I have had many versions of gcc that would simply crash and burn when profiling Crafty, particularly when I try to profile a parallel search run so that all the parallel stuff gets optimized as well. This often produces corrupted PGO profile data files, which break the compiler.
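For readers who have not run a PGO build before, here is a minimal sketch of the usual three-step procedure with GCC. The flags `-fprofile-generate` and `-fprofile-use` are real GCC options. The file name `pgo_demo.c`, the `PGO_DEMO_MAIN` macro, and the Collatz workload are made up for illustration and merely stand in for an engine's benchmark run.

```c
/* pgo_demo.c (hypothetical file name)
 *
 * Step 1, instrumented build:
 *   gcc -O2 -fprofile-generate -DPGO_DEMO_MAIN pgo_demo.c -o pgo_demo
 * Step 2, training run (writes .gcda profile data on exit):
 *   ./pgo_demo
 * Step 3, rebuild using the collected profile:
 *   gcc -O2 -fprofile-use -DPGO_DEMO_MAIN pgo_demo.c -o pgo_demo
 */
#include <assert.h>
#include <stdio.h>

/* Stand-in hot function with a data-dependent branch. */
unsigned long collatz_steps(unsigned long n)
{
    assert(n >= 1);
    unsigned long steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        steps++;
    }
    return steps;
}

/* Stand-in for a benchmark: the training run should resemble
 * the way the program is used in practice. */
unsigned long run_workload(unsigned long limit)
{
    unsigned long total = 0;
    for (unsigned long n = 1; n <= limit; n++)
        total += collatz_steps(n);
    return total;
}

#ifdef PGO_DEMO_MAIN
int main(void)
{
    printf("%lu\n", run_workload(100000));
    return 0;
}
#endif
```

Regarding the corrupted profiles from multi-threaded runs: later GCC releases added `-fprofile-update=atomic`, which is intended to keep the counters consistent when several threads update them. Whether it helps with a particular engine and compiler version would have to be tested.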

shiv · Post by **shiv** » Tue Dec 01, 2009 8:14 am

bob wrote: The idea is this. If the branch is true most of the time, you execute that code most of the time, and since it is in sequential memory addresses, cache block fills will be correctly pre-fetching things you actually need. But if it is false most of the time, you fetch that code (in the original example) and then skip around it. By moving that code elsewhere, and only jumping to it on the less common case where condition is false, you increase cache hits. Yes, you can see that if you ask for assembly output. I don't think you can get the modified C, however, but I might be wrong since I have not tried in years.

Right. One other thing worth mentioning is that modern instruction pipelines rarely wait for a branch condition to be evaluated. Instead, the processor predicts taken or not taken and continues executing speculatively. If the prediction turns out to be wrong, the speculative work has to be discarded, so a mispredicted branch entails a costly rollback.

PGO helps moderate this problem by doing a statistical analysis of branches as Bob points out.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Tue Dec 01, 2009 10:44 am

PGO also allows the compiler to inline those few functions where it really matters.

This saves call overhead (when inlining) but also I-cache space (when *not* inlining because it's not needed).
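A hand-written approximation of that inlining decision, as a minimal sketch assuming GCC or Clang: the `noinline` and `cold` function attributes are real, while the function names and the 0..63 square numbering are made up for illustration. With PGO the compiler makes the same choice from measured call counts instead of annotations.

```c
#include <assert.h>

/* noinline and cold are GCC/Clang function attributes; other
 * compilers get an empty macro and make their own decision. */
#if defined(__GNUC__)
#define COLD_NOINLINE __attribute__((noinline, cold))
#else
#define COLD_NOINLINE
#endif

/* Tiny and called constantly: a good inlining candidate.
 * Assumes squares are numbered 0..63 (hypothetical layout). */
static inline int file_of(int sq)
{
    assert(sq >= 0 && sq < 64);
    return sq & 7;
}

/* Rarely executed: kept out of line so that it does not take up
 * I-cache space inside the hot caller. */
COLD_NOINLINE static int handle_bad_square(int sq)
{
    (void)sq;
    return -1;
}

/* Hot caller: the helper is expected to be inlined, the error
 * path stays a real call. */
int file_distance(int a, int b)
{
    if (a < 0 || a > 63)
        return handle_bad_square(a);
    if (b < 0 || b > 63)
        return handle_bad_square(b);
    int d = file_of(a) - file_of(b);
    return d < 0 ? -d : d;
}
```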

mhalstern · Post by **mhalstern** » Tue Dec 01, 2009 9:39 pm

This is interesting:

What if I have a condition that is only met if there are more than 2 queens on the board? This happens rarely. What if, when profiling the build, I gave the engine 100 test positions with 4 queens on the board? Would the compiled code assume that the condition is met initially, and have to take the time to roll back?

In any event, if not using PGO, are there always options to disable these branch predictions?

shiv · Post by **shiv** » Wed Dec 02, 2009 12:49 am

mhalstern wrote:This is interesting:

What if I have a condition that is only met if there are more than 2 queens on the board? This happens rarely. What if, when profiling the build, I gave the engine 100 test positions with 4 queens on the board? Would the compiled code assume that the condition is met initially, and have to take the time to roll back?

In any event, if not using PGO, are there always options to disable these branch predictions?

For the first question, I think the answer is yes. If pathological data was fed in during the profiling phase, the compiler lays out the code for the wrong case, and since C/C++ is compiled ahead of time to fixed machine code, you are forced to live with the worse branch layout. However, the impact may not be that bad: the code that checks whether there are more than 2 queens on the board probably runs rarely, so you will waste CPU cycles only in rare cases.
That being said, there is hardware-level dynamic branch prediction available on modern CPUs, which can compensate for badly profiled PGO data. An example is the "branch whether hint" on the Itanium processor. These are not used often yet.

If you do not use PGO, the compiler typically assumes that every branch has a 50/50 likelihood. There are also heuristics in this case for loops, string comparisons, etc., but the typical branch is assumed to be taken about 50% of the time.

shiv · Post by **shiv** » Wed Dec 02, 2009 1:20 am

I meant to add in my previous post that branch prediction is done in hardware anyway (with a small branch prediction cache). Thus, if the PGO seed data is pathological, the hardware branch predictor could come to the rescue, but one should not depend on it.

PGO normally helps the hardware branch predictor, not the other way round.
