abulmo wrote:
-O4 does nothing more than -O3 with gcc 4.7
Well, on mine it does (see my detailed post).
I guess it is just a problem of accuracy.
I reproduced your experiment on Stockfish 2.3.1, except that I ran the bench 10 times and rebuilt the executable several times:
3612
3563
3590
3599
3619
3555
3603
3601
3547
3607
on average: 3590 +/- 25
My conclusion is that bench time fluctuates between runs and also between compilations. IMHO it is very hard to detect small speed enhancements accurately.
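As a quick sanity check on those figures, here is a throwaway C snippet (not from the original post) that computes the mean and sample standard deviation of the ten bench times; it reproduces the 3590 +/- 25 quoted above.
#include <stdio.h>
#include <math.h>   /* link with -lm */

int main(void)
{
    /* the ten bench times listed above */
    double t[] = { 3612, 3563, 3590, 3599, 3619, 3555, 3603, 3601, 3547, 3607 };
    int n = sizeof t / sizeof t[0];

    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += t[i];
    double mean = sum / n;

    double ss = 0.0;
    for (int i = 0; i < n; i++)
        ss += (t[i] - mean) * (t[i] - mean);
    double sd = sqrt(ss / (n - 1));   /* sample standard deviation */

    printf("mean %.1f, stddev %.1f\n", mean, sd);   /* about 3589.6 and 25.4 */
    return 0;
}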
If the question is whether there is any difference between gcc -O3 and gcc -O4, just look at the resulting binaries. If they are identical, there is no difference (at least not for that program). If they are not identical, there is a difference.
Bench time certainly fluctuates between runs, but not between compilations provided you use the same compiler and compiler options, and the same source code. The resulting binaries are simply the same. (Ok, it might fluctuate if you use profile guided optimization with different profiles, because then the resulting binaries might differ...)
syzygy wrote:If the question is whether there is any difference between gcc -O3 and gcc -O4, just look at the resulting binaries. If they are identical, there is no difference (at least not for that program). If they are not identical, there is a difference.
If I compile Stockfish twice with the same options I get different binaries, so I do not think it is conclusive.
syzygy wrote:Bench time certainly fluctuates between runs, but not between compilations provided you use the same compiler and compiler options, and the same source code. The resulting binaries are simply the same. (Ok, it might fluctuate if you use profile guided optimization with different profiles, because then the resulting binaries might differ...)
With -O3, gcc enables -fguess-branch-probability, which means (if I understand correctly) that it will choose some branch probabilities at random and produce non-deterministic binaries.
syzygy wrote:If the question is whether there is any difference between gcc -O3 and gcc -O4, just look at the resulting binaries. If they are identical, there is no difference (at least not for that program). If they are not identical, there is a difference.
If I compile Stockfish twice with the same options I get different binaries, so I do not think it is conclusive.
But doesn't stockfish offer a compilation mode with profile-guided optimisations?
syzygy wrote:Bench time certainly fluctuates between runs, but not between compilations provided you use the same compiler and compiler options, and the same source code. The resulting binaries are simply the same. (Ok, it might fluctuate if you use profile guided optimization with different profiles, because then the resulting binaries might differ...)
With -O3, gcc enables -fguess-branch-probability, which means (if I understand correctly) that it will choose some branch probabilities at random and produce non-deterministic binaries.
This surprises me, but the gcc documentation agrees with you. I have never noticed non-deterministic behaviour with -O3 or even -O6, though. Although I might turn out to be wrong, I am guessing that -fguess-branch-probability uses a random number generator seeded identically on every run of gcc (and is therefore deterministic).
abulmo wrote:If I compile twice stockfish with the same options I got different binaries.
This has been my experience. I suspect gcc embeds some kind of compile-time data that includes the timestamp. If that's the case, you'd want to disable anything like that before comparing binaries with a hash.
abulmo wrote:If I compile twice stockfish with the same options I got different binaries.
This has been my experience. I suspect gcc embeds some kind of compile-time data that includes the timestamp. If that's the case, you'd want to disable anything like that before comparing binaries with a hash.
Yes, even more so if macros like __DATE__ and __TIME__ are used in the program. -fguess-branch-probability does have an effect too: if I disable it, the size of the binary changes. Maybe the binary size is enough to establish that -O3 and -O4 are the same optimization level, as they produce two binaries of the same size.
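For reference, the __DATE__ / __TIME__ point is easy to see in isolation; the trivial C program below (not Stockfish's actual code) embeds the compilation timestamp, so the binary changes on every rebuild even though the source does not.
#include <stdio.h>

int main(void)
{
    /* __DATE__ and __TIME__ expand to string literals such as
       "Oct 12 2012" and "14:03:27", so the object code differs
       between two compilations of the exact same source.       */
    printf("built on " __DATE__ " at " __TIME__ "\n");
    return 0;
}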
abulmo wrote:If I compile twice stockfish with the same options I got different binaries.
This has been my experience. I suspect gcc embeds some kind of compile-time data that includes the timestamp. If that's the case, you'd want to disable anything like that before comparing binaries with a hash.
In my experience this is not the case. With gcc 4.4 at least, the md5sum of two successive compilations is exactly the same. I also tried -O3 and -O4 and the resulting files were exactly equal.
Of course, if you use PGO or things like __TIME__ the binaries may not be the same. I haven't checked whether Stockfish uses that.
abulmo wrote:If I compile Stockfish twice with the same options I get different binaries, so I do not think it is conclusive.
Actually, it looks like the -g flag (always set in Stockfish's Makefile) is the culprit, and the strip command cannot remove all of its "garbage": about 20 bytes of random data remain at addresses 0x24c - 0x260.
After correcting the Makefile by removing the -g flag, adding -s to the LDFLAGS to strip the executable automatically at link time, and removing the __DATE__ macro from Stockfish's code, I get the same executable across different compilations.
Don wrote:P.S. PGO was useful before 4.6 - at least for me. But since then it has worked extremely well for me.
Don wrote:
lucasart wrote:In my experience, nothing beats GCC 4.7. As for PGO, I have never found that to be faster: maybe it used to in earlier versions, but with -O4 -flto, it's as fast w/o PGO
Who needs ICC or Mickeysoft VC++ anymore
That is pretty odd; I found major benefits from PGO with Komodo. Maybe it is very program-specific then.
For well-written programs, PGO is not going to produce huge improvements. But it will improve things significantly. 10% to as much as 20% is certainly possible. But this is mainly about optimizing the direct instruction path of a program so that the cache doesn't load blocks of code that are rarely used. There's very little gain elsewhere. But with a lot of if statements, particularly if-then-else type structures, it will move the uncommon path out of the primary execution stream and cause cache prefetching (filling an entire block) to work better since it won't prefetch the blocks of code that are infrequently used.
I actually meant that before 4.6 it was USELESS - I could see no advantage at all. By mistake I said it was "useful" but that is not what I meant.
I do take some care to avoid conditional instructions as much as possible but any chess program is relatively heavy on logic. But complex nested if/then statements - are you saying that PGO does the best with them? I could easily believe that.
Think about what you would do if you knew the history of every branch in your program.
If you had code like this:
if (c) { statements }
but you knew that most of the time (> 50%) c is false, you would want to write it like this:
if (c) goto xxxx;
back_again:
...
and somewhere else you would do this:
xxxx:
{ statements }
goto back_again;
Now when that executes, the {statements} are not in the direct code path and don't get brought into cache, taking up space and time, as well as booting something else out.
That is about all PGO can do. Branch prediction is done in the hardware, so it can't help there. But if you have a ton of if statements, like a chess engine evaluation (for one place) then it can help. I see about a 10% improvement with icc. gcc has always been problematic for me and is unreliable when doing PGO. It either crashes or produces corrupt PGO temp files, particularly if I try to PGO everything including the threaded code...
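For what it's worth, gcc also lets you hand it this kind of branch information manually through __builtin_expect, without PGO and without writing the gotos yourself; it will then tend to lay the unlikely block out of the hot path. A minimal sketch, with made-up function and macro names:
#include <stdio.h>

/* common convention: wrap __builtin_expect in likely()/unlikely() macros */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int search(int depth)
{
    if (unlikely(depth <= 0)) {
        /* rare case: gcc moves this block out of the straight-line path,
           much like the manual "goto xxxx / back_again" rewrite above   */
        return 0;
    }
    /* common case stays in the fall-through code */
    return search(depth - 1) + 1;
}

int main(void)
{
    printf("%d\n", search(10));
    return 0;
}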
Well things are more complex.
Suppose 'c' is just a variable that hasn't been changed for, say, 20 instructions before the branch.
I would guess some processors can already evaluate the branch before reaching it, or at least save some cycles that way.
Yet most CPU manufacturers do not exactly document how their branch prediction works, so it's not clear to us mortals how to write the C code in order to lose fewer cycles.
Wouldn't it be possible to build a small test to check this?
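One way to build such a small test, without assuming anything about a particular CPU, is to time the same loop once with a perfectly predictable branch and once with an essentially random one; the gap is roughly the misprediction cost. Note that at -O2/-O3 gcc may replace the branch with a conditional move or vectorize the loop and hide the effect, so compile with little or no optimization. A rough sketch:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

/* sum the elements above 128; the if() is the branch under test */
static long run(const unsigned char *v)
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        if (v[i] > 128)
            sum += v[i];
    return sum;
}

int main(void)
{
    static unsigned char pred[N], rnd[N];
    for (int i = 0; i < N; i++) {
        pred[i] = 200;              /* branch always taken: fully predictable   */
        rnd[i]  = rand() & 255;     /* taken about half the time, unpredictably */
    }

    clock_t t0 = clock();
    long s1 = run(pred);
    clock_t t1 = clock();
    long s2 = run(rnd);
    clock_t t2 = clock();

    printf("predictable: %.3fs (sum %ld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("random:      %.3fs (sum %ld)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, s2);
    return 0;
}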
bob wrote:
Think about what you would do if you knew the history of every branch in your program.
If you had code like this:
if (c) { statements }
but you knew that most of the time (> 50%) c is false, you would want to write it like this:
if (c) goto xxxx;
back_again:
...
and somewhere else you would do this:
xxxx:
{ statements }
goto back_again;
Now when that executes, the {statements} are not in the direct code path and don't get brought into cache, taking up space and time, as well as booting something else out.
That is about all PGO can do. Branch prediction is done in the hardware, so it can't help there. But if you have a ton of if statements, like a chess engine evaluation (for one place) then it can help. I see about a 10% improvement with icc. gcc has always been problematic for me and is unreliable when doing PGO. It either crashes or produces corrupt PGO temp files, particularly if I try to PGO everything including the threaded code...
PGO does more than that, at least in some compilers: