PGO improvement for Stockfish?

zullil · Post by **zullil** » Thu Jun 05, 2014 2:05 am

When building the latest Stockfish with g++-4.8 using

make profile-build ARCH=x86-64-modern

I consistently get a binary that is about 3% faster if I replace -fprofile-generate and -fprofile-use with -fprofile-arcs and -fbranch-probabilities, respectively, in the supplied makefile:

Code:

gcc-profile-make:
	$(MAKE) ARCH=$(ARCH) COMP=$(COMP) \
	EXTRACXXFLAGS='-fprofile-generate' \
	EXTRALDFLAGS='-lgcov' \
	all

gcc-profile-use:
	$(MAKE) ARCH=$(ARCH) COMP=$(COMP) \
	EXTRACXXFLAGS='-fprofile-use' \
	EXTRALDFLAGS='-lgcov' \
	all
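
The listing above shows the stock targets. A minimal sketch of the swap described in this post, assuming a POSIX shell and a `sed` that accepts `-i.bak` (the function name `swap_pgo_flags` and the makefile path are made up for illustration, not part of Stockfish):

```shell
#!/bin/sh
# Sketch only: rewrite the two GCC PGO flags in a makefile in place.
# The original file is kept next to it with a .bak suffix.
swap_pgo_flags() {
    makefile="$1"   # path to the makefile to patch (assumption: caller supplies it)
    sed -i.bak \
        -e "s/-fprofile-generate/-fprofile-arcs/" \
        -e "s/-fprofile-use/-fbranch-probabilities/" \
        "$makefile"
}

# Example (run from the directory that holds the makefile):
# swap_pgo_flags Makefile
# make profile-build ARCH=x86-64-modern
```

To go back to the stock flags, restore the `.bak` copy.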

Timing was done by disabling Turbo Boost and invoking the standard

Code:

./stockfish bench

which defaults to a single-threaded deterministic search.
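
A single run can be noisy, so comparing the two builds over several runs is safer. A rough sketch of a repeat-and-time helper, assuming GNU `date` (for `%N` nanoseconds) as found on typical Linux systems; `time_runs` is a hypothetical name:

```shell
#!/bin/sh
# Sketch only: run a command several times and print wall-clock ms per run.
# Assumes GNU date, which supports %N (nanoseconds).
time_runs() {
    runs="$1"; shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s%N)
        "$@" >/dev/null 2>&1          # discard the command's own output
        end=$(date +%s%N)
        echo "run $i: $(( (end - start) / 1000000 )) ms"
        i=$((i + 1))
    done
}

# Example (assumes the binary is in the current directory):
# time_runs 5 ./stockfish bench
```

Because the bench search is deterministic, the node count is the same on every run and the elapsed times can be compared directly between binaries.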

Does anyone else get similar results using gcc? I'm running Linux.

ZirconiumX · Post by **ZirconiumX** » Thu Jun 05, 2014 8:35 am

Not at my computer, so I can't check your patch, but I've gotten a small speedup by adding -march=native -mtune=native which lets GCC choose extra optimization flags based on the user's machine.

However, it probably isn't a good patch because all the release binaries would be optimized for whatever CPU abrok.eu uses to host its servers.

Matthew:out

Joerg Oster · Post by **Joerg Oster** » Thu Jun 05, 2014 12:45 pm

zullil wrote: When building the latest Stockfish with g++-4.8 using
Code:
make profile-build ARCH=x86-64-modern
I consistently get a binary that is about 3% faster if I replace -fprofile-generate and -fprofile-use with -fprofile-arcs and -fbranch-probabilities, respectively, in the supplied makefile:
Code:
gcc-profile-make:
	$(MAKE) ARCH=$(ARCH) COMP=$(COMP) \
	EXTRACXXFLAGS='-fprofile-generate' \
	EXTRALDFLAGS='-lgcov' \
	all

gcc-profile-use:
	$(MAKE) ARCH=$(ARCH) COMP=$(COMP) \
	EXTRACXXFLAGS='-fprofile-use' \
	EXTRALDFLAGS='-lgcov' \
	all
Timing was done by disabling Turbo Boost and invoking the standard
Code:
./stockfish bench
which defaults to a single-threaded deterministic search.

Does anyone else get similar results using gcc? I'm running Linux.

Well, I cannot confirm.
Trying with latest dev, default profile-build is slightly faster than with your modification.

Also running Linux and g++-4.8.x.

zullil · Post by **zullil** » Thu Jun 05, 2014 1:19 pm

Joerg Oster wrote: Well, I cannot confirm.
Trying with latest dev, default profile-build is slightly faster than with your modification.

Also running Linux and g++-4.8.x.

Interesting. Thanks for testing this. Did you have Turbo Boost (or Turbo Core) disabled? I only just realized the impossibility of benchmarking accurately with Turbo Boost on.

zullil · Post by **zullil** » Thu Jun 05, 2014 1:32 pm

ZirconiumX wrote:I've gotten a small speedup by adding -march=native -mtune=native which lets GCC choose extra optimization flags based on the user's machine.

Matthew:out

I used to do the same. But my current testing seems to indicate that removing all invocations of -msse and -msse3 from the makefile and just using -O3 -fno-tree-pre is best. At least on my system using gcc version 4.8.1 (Ubuntu 4.8.1-2ubuntu1~12.04). Odd.

Joerg Oster · Post by **Joerg Oster** » Thu Jun 05, 2014 4:55 pm

zullil wrote:
Joerg Oster wrote: Well, I cannot confirm.
Trying with latest dev, default profile-build is slightly faster than with your modification.

Also running Linux and g++-4.8.x.
Interesting. Thanks for testing this. Did you have Turbo Boost (or Turbo Core) disabled? I only just realized the impossibility of benchmarking accurately with Turbo Boost on.

Of course.

Turbo Boost (or Turbo Core) is an absolute no-go in serious engine testing.

bob · Post by **bob** » Thu Jun 05, 2014 6:34 pm

zullil wrote: When building the latest Stockfish with g++-4.8 using
Code:
make profile-build ARCH=x86-64-modern
I consistently get a binary that is about 3% faster if I replace -fprofile-generate and -fprofile-use with -fprofile-arcs and -fbranch-probabilities, respectively, in the supplied makefile:
Code:
gcc-profile-make:
	$(MAKE) ARCH=$(ARCH) COMP=$(COMP) \
	EXTRACXXFLAGS='-fprofile-generate' \
	EXTRALDFLAGS='-lgcov' \
	all

gcc-profile-use:
	$(MAKE) ARCH=$(ARCH) COMP=$(COMP) \
	EXTRACXXFLAGS='-fprofile-use' \
	EXTRALDFLAGS='-lgcov' \
	all
Timing was done by disabling Turbo Boost and invoking the standard
Code:
./stockfish bench
which defaults to a single-threaded deterministic search.

Does anyone else get similar results using gcc? I'm running Linux.

I reported this for Crafty a few months back. The newer options (generate/use) seem to re-order data, but not particularly effectively. profile-arcs and profile-use are definitely better for me too.

zullil · Post by **zullil** » Thu Jun 05, 2014 10:28 pm

bob wrote:profile-arcs and profile-use are definitely better for me too.

For me it has to be profile-arcs paired with branch-probabilities. If I replace the latter with profile-use, I lose the gain. According to the documentation, profile-use enables branch-probabilities, but it also enables a half-dozen other things (at least one of which seems to degrade the optimization).

Krgp · Post by **Krgp** » Sun Jun 08, 2014 1:56 pm

Well ... '-fprofile-arcs' and/or '-fbranch-probabilities' (paired, or either one alone) do not work with GCC 4.7.3 (internal compiler error: in edge_badness, at ipa-inline.c:793, make[2]: *** [ucioption.o] Error 1). With 4.8.2, 4.8.3 or 4.9.0, the two together do give a considerable speed gain on an i7-4770K: around 4% for 4.8.2 and around 3% for 4.8.3 and 4.9.0. The gain holds with Turbo Boost on, with the overclock (@ 4.5 GHz) on, and with both off.

bob · Post by **bob** » Sun Jun 08, 2014 4:53 pm

Krgp wrote: Well ... '-fprofile-arcs' and/or '-fbranch-probabilities' (paired, or either one alone) do not work with GCC 4.7.3 (internal compiler error: in edge_badness, at ipa-inline.c:793, make[2]: *** [ucioption.o] Error 1). With 4.8.2, 4.8.3 or 4.9.0, the two together do give a considerable speed gain on an i7-4770K: around 4% for 4.8.2 and around 3% for 4.8.3 and 4.9.0. The gain holds with Turbo Boost on, with the overclock (@ 4.5 GHz) on, and with both off.

I use those all the time with Crafty and gcc 4.7.3...

If you do any multi-threaded benchmarking, you do need to add

-fprofile-correction

on the final compile, because the threaded profiling apparently has a few issues with corruption in the .gcda file. The above fixes it.
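
One way to apply this to a makefile like the one quoted earlier in the thread is to append the flag to whichever use-phase flag is present. A minimal sketch, assuming a POSIX shell and a `sed` that accepts `-i.bak`; `add_profile_correction` is a hypothetical helper name, not something shipped with Stockfish or Crafty:

```shell
#!/bin/sh
# Sketch only: add -fprofile-correction to the profile-use phase of a makefile.
# Handles both -fprofile-use and -fbranch-probabilities; keeps a .bak copy.
add_profile_correction() {
    makefile="$1"   # path to the makefile to patch (assumption: caller supplies it)
    # Do nothing if the flag is already there, so repeated calls are harmless.
    if grep -q -e '-fprofile-correction' "$makefile"; then
        return 0
    fi
    sed -i.bak \
        -e "s/-fprofile-use/& -fprofile-correction/" \
        -e "s/-fbranch-probabilities/& -fprofile-correction/" \
        "$makefile"
}

# Example (run from the directory that holds the makefile):
# add_profile_correction Makefile
```

The flag only matters for the second (use) compile; the instrumented build is unchanged.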
