PGO improvement for Stockfish?

Discussion of chess software programming and technical issues.

Krgp
Posts: 20
Joined: Mon Nov 04, 2013 6:18 am

Re: PGO improvement for Stockfish?

Post by Krgp »

zullil wrote: Here are the changes from the default makefile that produce the fastest Stockfish binary for me:

Code:

louis@LZsT5610:~/Documents/Chess/Stockfish/src$ diff Makefile_new Makefile
274,275c274,275
< 		CXXFLAGS +=
< 		DEPENDFLAGS +=
---
> 		CXXFLAGS += -msse
> 		DEPENDFLAGS += -msse
288c288
< 	CXXFLAGS += -DUSE_POPCNT
---
> 	CXXFLAGS += -msse3 -DUSE_POPCNT
308c308
< 			CXXFLAGS +=
---
> 			CXXFLAGS += -flto
450c450
< 	EXTRACXXFLAGS='-fprofile-arcs' \
---
> 	EXTRACXXFLAGS='-fprofile-generate' \
456c456
< 	EXTRACXXFLAGS='-fbranch-probabilities' \
---
> 	EXTRACXXFLAGS='-fprofile-use' \
With gcc-4.8.1, this improves nps by about 4%, as measured using the standard Stockfish (single-threaded, deterministic) benchmark. I'm building with

Code:

make profile-build ARCH=x86-64-modern
For me, with the exception of the inlined popcnt instruction, enabling sse or sse3 actually produced code that was a bit slower.
Indeed, -fprofile-arcs and -fbranch-probabilities with no link-time optimization produce the fastest builds, but one has to use gcc 4.8 or above for that. If it could be done on 4.7.3, there is still a 3% speed gain over and above the 4% given by -fprofile-arcs and -fbranch-probabilities: a total of around 7%!

There is a workaround: BYO by Brice Allenbrand (RW builds) at https://www.dropbox.com/sh/4rubami2nvld ... ft7y9a/BYO does it! What Brice has done is use cpuz to get the profile and then -fbranch-probabilities; the result is that 7% speed gain on 4.7.3. I tried his script, modified a bit (-O3 instead of the -Ofast used by Brice), and did get the 7% gain! If only the idea (with no need for cpuz, I suppose) could be incorporated into the makefile ...
KP
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Post by zullil »

Krgp wrote: There is a workaround: BYO by Brice Allenbrand (RW builds) at https://www.dropbox.com/sh/4rubami2nvld ... ft7y9a/BYO does it! What Brice has done is use cpuz to get the profile and then -fbranch-probabilities; the result is that 7% speed gain on 4.7.3. I tried his script, modified a bit (-O3 instead of the -Ofast used by Brice), and did get the 7% gain! If only the idea (with no need for cpuz, I suppose) could be incorporated into the makefile ...
The file at the URL you provided is some sort of bundled Windows executable, with the Stockfish source code and scripts contained within. Since I don't use Windows, I can't even open it to see what's there.

Post by Krgp »

Well, the bundle contains three GCCs (473, 48b and 49c), cpuz and a batch file. Here is the script Brice uses:

Code:

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
ECHO ** Welcome to Stockfish BYO (aka Build You Own) 64 bits
ECHO ** Please wait...
ECHO ** (If something goes wrong run BYO as Admin)
ECHO **
ECHO ** If you want to compile another Stockfish version, simply copy
ECHO ** one or more 14#### folder(s) on your desktop and launch BYO.
ECHO **
ECHO ** System Check
ECHO ** CPUz from Franck Delattre www.cpuid.com
SET HERE=%CD%
cpuz.exe -txt=_1 >NUL
FINDSTR /I /C:"AVX2" _1.txt >NUL
SET AVX2=%ERRORLEVEL%
FINDSTR /I /C:"AVX" _1.txt >NUL
SET AVX=%ERRORLEVEL%
FINDSTR /I /C:"SSE4.2" _1.txt >NUL
SET SSE=%ERRORLEVEL%
DEL _1.txt
ECHO ** DONE, Compilation now....
IF %AVX2% EQU 0 (
	SET PARAM=-DUSE_AVX2 -DUSE_PEXT -DUSE_POPCNT
	SET NAME=avx2
) ELSE IF %AVX% EQU 0 (
	SET PARAM=-DUSE_AVX -DUSE_POPCNT
	SET NAME=modern_sse42_avx
) ELSE IF %SSE% EQU 0 (
	SET PARAM=-DUSE_POPCNT
	SET NAME=modern_sse42
) ELSE (
	SET PARAM=-DPIPO
	SET NAME=x64
)

DEL %HERE%\StockFishRW_BYO.txt >NUL 2>NUL
FOR %%I IN (473 482a 49c) DO (
	SET GG=%HERE%\GCC%%~I\BIN
	ECHO SET PATH=!GG!>pipo.bat
	CALL pipo.bat
	ECHO %%I >>%HERE%\StockFishRW_BYO.txt
	gcc -Q --help=target -march=native >>%HERE%\StockFishRW_BYO.txt
)

FOR /L %%D IN (140000,1,160000) DO (
	IF EXIST "%USERPROFILE%\Desktop\%%D\main.cpp" (
		ECHO == Found %%D src...
		SET COPT=-DBYO -fno-exceptions -Wno-long-long -fno-rtti -ansi -pedantic -DNDEBUG -O3 -march=native
		FINDSTR /I cpuid "%USERPROFILE%\Desktop\%%D\opt.cpp" >NUL 2>NUL
		IF %ERRORLEVEL% EQU 0 (SET COPT=!COPT! -std=gnu++11)
		SET ts=%%D
		SET tts=!ts:~-2!!ts:~2,2!!ts:~0,2!
		SET PARAM=!PARAM! -DMYDATE="!tts!"
		CD /D "%USERPROFILE%\Desktop\%%D"
		SET CPP=
		FOR /F "usebackq delims=," %%I IN (`dir /b /A-D "*.cpp"`) DO (IF %%I NEQ tbcore.cpp (IF %%I NEQ sort.cpp (IF %%I NEQ misc2.cpp (SET CPP=%%I !CPP!))))
		copy misc.cpp misc.bak >NUL
		%HERE%\C\mrp.exe misc.cpp misc.cpp Stockfish Stockfish_BYO
		%HERE%\C\mrp.exe misc.cpp misc.cpp _BYORW RW
		SET OK=1
		FOR %%I IN (473 482a 49c) DO (
			SET GG=%HERE%\GCC%%~I\BIN
			ECHO SET PATH=!GG!>pipo.bat
			ECHO SET GCC=%%~I>>pipo.bat
			CALL pipo.bat
			ECHO ** First pass compilation GCC%%I
			g++ !CPP! !COPT! !PARAM! --coverage -o _stock+Profiled.exe 1>NUL 2>NUL
			_stock+Profiled bench 1024 1 10 1>NUL 2>NUL
			DEL ucioption.gcda ucioption.gcno
			ECHO ** Last pass compilation
			g++ !CPP! !COPT! !PARAM! -fbranch-probabilities -o %HERE%\StockFishRW_!ts!_BYO!GCC!_%NAME%.exe 1>NUL 2>NUL
			DEL *.s *.o *.gcda *.gcno _stock+profiled.exe
			ECHO ** DONE
		)
		MOVE /Y misc.cpp misc.bak >NUL 2>NUL
		DEL pipo.bat
		RMDIR /Q /S "%USERPROFILE%\Desktop\%%D" >NUL 2>NUL
		RMDIR /Q /S "%USERPROFILE%\Desktop\%%D" >NUL 2>NUL
	)
)
IF !OK! NEQ 1 (
	FOR /L %%D IN (140000,1,160000) DO (IF EXIST "%%D\main.cpp" (SET ts=%%D))
	SET COPT=-DBYO -fno-exceptions -Wno-long-long -fno-rtti -ansi -pedantic -DNDEBUG -O3 -DIS_64BIT -DUSE_BSFQ -DUSE_PREFETCH -static -s -march=native
	FINDSTR /I cpuid "%%D\opt.cpp" >NUL 2>NUL
	IF %ERRORLEVEL% EQU 0 (SET COPT=!COPT! -std=gnu++11)
	CD /D !ts!
	FOR /F "usebackq delims=," %%I IN (`dir /b /A-D "*.cpp"`) DO (IF %%I NEQ tbcore.cpp (IF %%I NEQ sort.cpp (SET CPP=%%I !CPP!)))
	REM Thanks to Paul Henri!
	REM FOR /F "usebackq tokens=1,2 delims==" %%i IN (`wmic os get LocalDateTime /VALUE 2^>NUL`) DO IF '.%%i.'=='.LocalDateTime.' SET "ts=%%j" & SET "ts=!ts:~2,2!!ts:~4,2!!ts:~6,2!"
	SET tts=!ts:~-2!!ts:~2,2!!ts:~0,2!
	ECHO No src found on desktop, using internal...!tts!...
	SET PARAM=!PARAM! -DMYDATE="!tts!"
	FOR %%I IN (473 482a 49c) DO (
		SET GG=%HERE%\GCC%%~I\BIN
		ECHO SET PATH=!GG!>pipo.bat
		ECHO SET GCC=%%~I>>pipo.bat
		CALL pipo.bat
		ECHO ** First pass compilation GCC%%I
		g++ !CPP! !COPT! !PARAM! --coverage -o _stock+Profiled.exe 1>NUL 2>NUL
		_stock+Profiled bench 512 1 10 1>NUL 2>NUL
		DEL ucioption.gcda ucioption.gcno
		ECHO ** Last pass compilation
		g++ !CPP! !COPT! !PARAM! -fbranch-probabilities -o %HERE%\StockFishRW_!ts!_BYO!GCC!_%NAME%.exe 1>NUL 2>NUL
		DEL *.s *.o *.gcda *.gcno _stock+profiled.exe
		ECHO ** DONE
	)
	DEL pipo.bat
)
ECHO ** Compression (7z)
CD /D "%HERE%"
C\7z.exe a -t7z -mx9 StockFishRW_BYO StockFishRW_*.*
MOVE /Y StockFishRW_*.exe "%USERPROFILE%\Desktop" 1>NUL 
MOVE /Y StockFishRW_*.7z "%USERPROFILE%\Desktop" 1>NUL 
rem DEL /Q StockFishRW_*.*
pause
KP

Post by zullil »

Krgp wrote: What Brice has done is use cpuz to get the profile and then -fbranch-probabilities; the result is that 7% speed gain on 4.7.3. I tried his script, modified a bit (-O3 instead of the -Ofast used by Brice), and did get the 7% gain! If only the idea (with no need for cpuz, I suppose) could be incorporated into the makefile ...
Well, I spent about three minutes with the script. It seems cpuz is simply used to determine which instruction sets the processor supports, so that flags can be set appropriately (e.g., -DUSE_POPCNT, -DUSE_PEXT).

The actual workaround for the gcc-4.7.3 "bug" appears to be the deletion of the files ucioption.gcda and ucioption.gcno prior to the final compilation; if I recall correctly the error you reported involved ucioption.o.

If I can get a copy of gcc-4.7.3, I might spend a few minutes, but there's nothing magical in the script.
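
On Linux the same detection step should not need cpuz at all. A minimal sketch, assuming the CPU features are read from /proc/cpuinfo; the function name pick_params is hypothetical, and the defines only mirror the AVX2 / AVX / SSE4.2 cascade of the batch script above, not anything in the Stockfish makefile:

```shell
#!/bin/sh
# Hypothetical helper: map a space-separated CPU feature list to the
# -D defines chosen by the batch script's AVX2 / AVX / SSE4.2 cascade.
pick_params() {
    case " $1 " in
        *" avx2 "*)   echo "-DUSE_AVX2 -DUSE_PEXT -DUSE_POPCNT" ;;
        *" avx "*)    echo "-DUSE_AVX -DUSE_POPCNT" ;;
        *" sse4_2 "*) echo "-DUSE_POPCNT" ;;
        *)            echo "" ;;
    esac
}

# Assumption: /proc/cpuinfo exists and has a "flags" line (x86 Linux).
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
    echo "PARAM=$(pick_params "$cpu_flags")"
fi
```

The feature names are matched as whole words, so "avx" does not accidentally match "avx2"; the batch script gets the same effect by testing AVX2 first.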

Post by zullil »

Turns out I already had 4.7.3 installed. Indeed, there is an error in the final compilation:

Code:

g++-4.7 -Wall -Wcast-qual -fno-exceptions -fno-rtti -fbranch-probabilities -ansi -pedantic -Wno-long-long -Wextra -Wshadow -DNDEBUG -O3 -DIS_64BIT  -DUSE_BSFQ -DUSE_POPCNT    -c -o ucioption.o ucioption.cpp
ucioption.cpp:160:1: internal compiler error: in edge_badness, at ipa-inline.c:793
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-4.7/README.Bugs> for instructions.
The bug is not reproducible, so it is likely a hardware or OS problem.
make[2]: *** [ucioption.o] Error 1
make[2]: Leaving directory `/home/louis/Documents/Chess/Stockfish/src'
make[1]: *** [gcc-profile-use] Error 2
make[1]: Leaving directory `/home/louis/Documents/Chess/Stockfish/src'
make: *** [profile-build] Error 2
I note that a ucioption.gcda is not present at all, so for me removing this file is clearly not the workaround.

Post by Krgp »

zullil wrote:
The actual workaround for the gcc-4.7.3 "bug" appears to be the deletion of the files ucioption.gcda and ucioption.gcno prior to the final compilation; if I recall correctly the error you reported involved ucioption.o.

If I can get a copy of gcc-4.7.3, I might spend a few minutes, but there's nothing magical in the script.


Robert Hyatt also mentioned 'corruption' (?!) of .gcda files.

4.7.3 seems to be hard to get, but 4.7.4 has been released with bug fixes (http://gcc.gnu.org/gcc-4.7/); hopefully this bug is fixed. In the meantime, is there any other way to delete the .gcno and .gcda files prior to the final compilation? The additional 3% speed gain is too tempting ...
KP

Post by zullil »

My problem with 4.7.3 was that it failed when no .gcda file was present, so deleting was not the fix. It seems that running only

Code:

./stockfish bench 32 1 1 default time
to generate the profile simply doesn't involve ucioption.cpp. At least, no ucioption.gcda is generated.

As a crude workaround, after the error I simply did

Code:

./stockfish
setoption name Hash value 32
quit
to generate a .gcda, and then modified the makefile to complete the final compilation only.

In any case, I got only a 1% speed-up over gcc-4.8, so I'm not sure finding a good workaround is worth the effort. May see if I can get 4.7.4 for my Linux system ...
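
For anyone comparing builds, the percentage can be computed straight from two bench runs. A rough sketch, assuming the bench output contains a line of the form "Nodes/second : <n>"; the helper names nps_of and pct_gain are made up for illustration, and the numbers in the example are invented, not measurements from this thread:

```shell
#!/bin/sh
# Hypothetical helper: pull the nodes-per-second figure out of bench
# output on stdin. Assumption: a line like "Nodes/second : <n>" exists.
nps_of() {
    awk -F: '/Nodes\/second/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Percentage speed-up of the new figure over the old one, one decimal.
pct_gain() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) * 100 / old }'
}

# Example with invented numbers.
old=$(printf 'Nodes/second    : 1000000\n' | nps_of)
new=$(printf 'Nodes/second    : 1010000\n' | nps_of)
echo "speed-up: $(pct_gain "$old" "$new")%"
```

Since differences of a few percent are easily within run-to-run noise, it is probably worth averaging several runs of each binary before feeding the figures to pct_gain.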

Post by zullil »

After more careful investigation:

ucioption.gcda is produced when the bench command is run during the profiling stage. But it must be "corrupt" in some manner; using it during the final build causes the gcc-4.7.3 error (and the file is deleted for some reason, which is why I thought it was never present).

If this corrupt .gcda is deleted prior to the final compilation, no error occurs. There's simply a message:

Code:

ucioption.cpp:160:1: note: file /home/louis/Documents/Chess/Stockfish/src/ucioption.gcda not found, execution counts estimated
So the workaround in the Windows script you posted should indeed work.

I'm just building with the following shell script instead of make:

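Code:

# Start clean: remove the old binary and any stale profile data.
rm -f stockfish *.gcda

# Pass 1: instrumented build that records arc (branch) counts.
g++-4.7 -Wall -Wcast-qual -fno-exceptions -fno-rtti -ansi -pedantic -Wno-long-long -Wextra -Wshadow\
        -DNDEBUG -O3 -DIS_64BIT -DUSE_BSFQ -DUSE_POPCNT -fprofile-arcs -o stockfish *.cpp -lpthread

# Run the benchmark so the instrumented binary writes its .gcda files.
./stockfish bench 32 1 1 default time > /dev/null

# Workaround for the gcc-4.7.3 internal compiler error: drop the profile for ucioption.cpp.
rm -f ucioption.gcda

# Pass 2: rebuild using the recorded counts.
g++-4.7 -Wall -Wcast-qual -fno-exceptions -fno-rtti -ansi -pedantic -Wno-long-long -Wextra -Wshadow\
        -DNDEBUG -O3 -DIS_64BIT -DUSE_BSFQ -DUSE_POPCNT -fbranch-probabilities -o stockfish *.cpp -lpthread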
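
To see which profile files the benchmark run really left behind before the final compilation, something like the following could be run between the two passes. It is only a sketch, assuming the instrumented binary writes each .gcda next to its source file (as the path in the note above suggests); missing_profiles is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: print every .cpp file in a directory that has no
# matching .gcda profile next to it.
missing_profiles() {
    for src in "$1"/*.cpp; do
        [ -e "$src" ] || continue                     # directory has no .cpp files
        [ -e "${src%.cpp}.gcda" ] || echo "${src##*/}"
    done
}

# Check the current directory (e.g. Stockfish's src directory).
missing_profiles .
```

After the rm ucioption.gcda step, ucioption.cpp should be the only source the helper reports; anything else listed was not exercised by the bench run.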

Post by Krgp »

I am doing something similar at the moment (though I use Brice's batch file after various modifications, and do get the additional 3% speed gain). A slight difference is that I can afford to use -march=native, so there is no need for the other flags. I had initially thought cpuz was used just to determine the basic architecture, but the .txt file generated is quite similar to the output of gcc -c -Q -march=native --help=target, so I suspected it was being used to get profile data (though I could have checked that in the script itself). Anyway, thanks for your shell script; I will try it out and post feedback.
KP

Post by zullil »

The script you posted invokes gcc -Q --help=target -march=native and writes the output into StockFishRW_BYO.txt

I think all that does is document which compiler switches are enabled/disabled when -march=native is used during compilation.

On my computer, here's what -march=native does:

Code:

louis@LZsT5610:~/Documents/Chess/Stockfish/src$ gcc -Q --help=target -march=native
The following options are target specific:
  -m128bit-long-double        		[disabled]
  -m32                        		[disabled]
  -m3dnow                     		[disabled]
  -m3dnowa                    		[disabled]
  -m64                        		[enabled]
  -m80387                     		[enabled]
  -m8bit-idiv                 		[disabled]
  -m96bit-long-double         		[enabled]
  -mabi=                      		
  -mabm                       		[disabled]
  -maccumulate-outgoing-args  		[disabled]
  -maes                       		[enabled]
  -malign-double              		[disabled]
  -malign-functions=          		
  -malign-jumps=              		
  -malign-loops=              		
  -malign-stringops           		[enabled]
  -mandroid                   		[disabled]
  -march=                     		corei7-avx
  -masm=                      		
  -mavx                       		[enabled]
  -mavx256-split-unaligned-load 	[disabled]
  -mavx256-split-unaligned-store 	[disabled]
  -mbionic                    		[disabled]
  -mbmi                       		[disabled]
  -mbranch-cost=              		
  -mcld                       		[disabled]
  -mcmodel=                   		
  -mcpu=                      		
  -mcrc32                     		[disabled]
  -mcx16                      		[enabled]
  -mdispatch-scheduler        		[disabled]
  -mf16c                      		[enabled]
  -mfancy-math-387            		[enabled]
  -mfentry                    		[enabled]
  -mfma                       		[disabled]
  -mfma4                      		[disabled]
  -mforce-drap                		[disabled]
  -mfp-ret-in-387             		[enabled]
  -mfpmath=                   		
  -mfsgsbase                  		[enabled]
  -mfused-madd                		
  -mglibc                     		[enabled]
  -mhard-float                		[enabled]
  -mieee-fp                   		[enabled]
  -mincoming-stack-boundary=  		
  -minline-all-stringops      		[disabled]
  -minline-stringops-dynamically 	[disabled]
  -mintel-syntax              		
  -mlarge-data-threshold=     		
  -mlwp                       		[disabled]
  -mmmx                       		[disabled]
  -mmovbe                     		[disabled]
  -mms-bitfields              		[disabled]
  -mno-align-stringops        		[disabled]
  -mno-fancy-math-387         		[disabled]
  -mno-push-args              		[disabled]
  -mno-red-zone               		[disabled]
  -mno-sse4                   		[disabled]
  -momit-leaf-frame-pointer   		[disabled]
  -mpc                        		
  -mpclmul                    		[enabled]
  -mpopcnt                    		[enabled]
  -mprefer-avx128             		[disabled]
  -mpreferred-stack-boundary= 		
  -mpush-args                 		[enabled]
  -mrdrnd                     		[enabled]
  -mrecip                     		[disabled]
  -mred-zone                  		[enabled]
  -mregparm=                  		
  -mrtd                       		[disabled]
  -msahf                      		[enabled]
  -msoft-float                		[disabled]
  -msse                       		[enabled]
  -msse2                      		[enabled]
  -msse2avx                   		[disabled]
  -msse3                      		[enabled]
  -msse4                      		[enabled]
  -msse4.1                    		[enabled]
  -msse4.2                    		[enabled]
  -msse4a                     		[disabled]
  -msse5                      		
  -msseregparm                		[disabled]
  -mssse3                     		[enabled]
  -mstack-arg-probe           		[disabled]
  -mstackrealign              		[enabled]
  -mstringop-strategy=        		
  -mtbm                       		[disabled]
  -mtls-dialect=              		
  -mtls-direct-seg-refs       		[enabled]
  -mtune=                     		generic
  -muclibc                    		[disabled]
  -mveclibabi=                		
  -mvect8-ret-in-mem          		[disabled]
  -mvzeroupper                		[disabled]
  -mxop                       		[disabled]
I'll try using -march=native with 4.7.3, but with 4.8 it actually yielded slower code.
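
A listing this long is easier to compare between compilers once it is reduced to the switches that are actually on. A small sketch, assuming gcc prints the option name in the first column and the state in the second, as in the output above; enabled_switches is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: list only the switches marked [enabled] in
# `gcc -Q --help=target` style output read from stdin.
enabled_switches() {
    awk '$2 == "[enabled]" { print $1 }'
}

# Run against the local compiler only if gcc is installed.
if command -v gcc >/dev/null 2>&1; then
    gcc -Q --help=target -march=native 2>/dev/null | enabled_switches
fi
```

Saving the result once per compiler and running diff on the two files shows exactly which target switches -march=native changes between, say, 4.7.3 and 4.8.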