You're not listening. After GCC has wrangled my neatly written C code into this spaghetti code, it doesn't know how to speed up something like that with PGO, because it isn't a branch anymore; it's just some sort of far jump forward and a jump back. Nothing more, nothing less. PGO doesn't help then. It stays spaghetti.

bob wrote:
Don't know what you are talking about myself. Both AMD and Intel _are_ "American projects". I've used both. And had success with both. Intel currently leads the pack. AMD led it for quite a while. I use what is best, not what is politically correct or whatever.

diep wrote:
In GCC, yes, the PGO is really ugly bad, and I am using very polite words.

hgm wrote:
Are you implying that PGO is totally stupid? If I were to write a profiler, it would not make the slightest difference how many jumps or branches there were in the code. You simply count the number of times each possible control path is traversed, either by incrementing a counter after every branch and at every branch target, or, less invasively, by just recording the return address on every timer interrupt.

diep wrote:
The reason for that is that GCC turns branches into ugly code like this. After the code has been rewritten into this spaghetti code, PGO can do nothing for you.
Beginner's level would be a compliment.
At least 4 times worse than how Intel C++ / Visual C++ do it.
Your way of doing PGO in GCC would mean that either we hear 1 clean shot, and you're dead by a bullet shot by a hired criminal, or GCC team would soon undo your code as AMD would be faster IPC wise than Nehalem.
That's why I showed the code sample mangled by GCC. You must do SOMETHING to make the code from GCC slower.
Of course 2 simple CMOVs would be a lot faster than all the silly jumping and should get generated by default. It is just 1 example. There is worse.
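To make the CMOV point concrete, here is a minimal sketch in C of the same clamp written three ways. The function names are made up for illustration, and whether a given compiler version actually emits a jump or a CMOV for each form is an assumption you would have to check in the generated assembly (for example with gcc -O2 -S).

```c
#include <stdint.h>

/* Branchy form: a compiler may lower this to a conditional jump. */
int32_t clamp_low_branchy(int32_t score, int32_t floor_value)
{
    if (score < floor_value)
        score = floor_value;
    return score;
}

/* Select form: a side-effect-free conditional expression, which
   compilers commonly (not always) lower to CMOV on x86-64. */
int32_t clamp_low_select(int32_t score, int32_t floor_value)
{
    return (score < floor_value) ? floor_value : score;
}

/* Mask form: no branch in the source at all. The comparison yields
   0 or 1, so the mask is either all zeros or all ones. */
int32_t clamp_low_mask(int32_t score, int32_t floor_value)
{
    int32_t mask = -(int32_t)(score < floor_value);
    return (floor_value & mask) | (score & ~mask);
}
```

All three return the same value for every input; only the shape of the code handed to the compiler differs.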
Both are American companies, that's the weird thing. Intel produces in Israel and then ships the wafers to Malaysia, where the hard work starts of testing them for weeks and then putting a stamp on them, and they get sold. AMD prints the CPUs in Germany and then ships them to Indonesia, where the testing happens for weeks, a stamp goes on, and they get sold.
Usually Bob only defends American projects "till death". Maybe he made a mistake this time and assumed the code would take the branch only seldom.
If you do not use PGO, and you said you didn't, then the compiler has absolutely no way to know how the branch will behave, and yes, it will produce poor code when it makes the wrong assumption. Simple fix is to use PGO. Then the problem goes away.
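For anyone who has not done a PGO build with GCC, a rough sketch of the workflow follows. The function, the file name engine.c and the training binary engine_train are hypothetical; the two flags, -fprofile-generate and -fprofile-use, are the real GCC options.

```c
#include <stddef.h>

/*
 * Assumed build steps (file and binary names are placeholders):
 *
 *   gcc -O2 -fprofile-generate engine.c -o engine_train
 *   ./engine_train            (run on representative input, writes .gcda files)
 *   gcc -O2 -fprofile-use engine.c -o engine
 *
 * After the second compile, GCC knows how often the branch below was
 * taken during training and can lay out the code accordingly.
 */

/* Hypothetical hot function with a data-dependent branch. */
size_t count_above(const int *values, size_t count, int threshold)
{
    size_t hits = 0;
    for (size_t i = 0; i < count; i++) {
        if (values[i] > threshold)  /* taken rate depends on the data */
            hits++;
    }
    return hits;
}
```

The profile is only as good as the training run, so the input used in the middle step should look like real workload data.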
Intel C is _not_ 4 times faster/better than GCC. It is better. And it might be 10% faster or so, as it has been in the past. I haven't run on AMD hardware recently, but when I have, I used GCC and it worked quite well. On occasion, it would fail miserably and crash, particularly when I tried to profile a threaded program so that the thread-management code would also get optimized. However, I have also had Intel versions that would crash as well, so it happens.
But you only want clean fall-through if it falls through _most_ of the time. Otherwise that would be slower, it would cause the i-cache to prefetch stuff that gets skipped, and is just not as efficient overall.
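Without a profile, the programmer can also tell the compiler which way a branch usually goes. A minimal sketch, assuming GCC or a compatible compiler: __builtin_expect is a real GCC builtin, while the LIKELY/UNLIKELY macro names and the function are just illustrative choices.

```c
#include <stddef.h>

#if defined(__GNUC__)
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
/* Fallback for compilers without the builtin: no hint at all. */
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Sums the non-negative entries. The hint says negatives are rare,
   so the compiler is free to keep the common path as fall-through. */
long sum_skipping_negatives(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++) {
        if (UNLIKELY(values[i] < 0))
            continue;               /* assumed-rare path */
        total += values[i];
    }
    return total;
}
```

The hint changes only code layout, never the result, and a wrong hint costs the same way a wrong compiler guess does.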
However, the branch GCC generates here costs around 30 cycles on AMD.
On Intel it is a lot cheaper; Intel has more clever rewrite mechanisms and a bigger lookahead buffer. Nehalem can probably already reorder the code.
Even then it is slower than doing a CMOV.
CMOV costs 1 cycle at most, even on Intel.
I forgot to add that taking the branch costs AMD another cycle of penalty, and the jump back costs yet another cycle on AMD.
So even if AMD predicted that branch 100% correctly, it would still lose 2 cycles unnecessarily.
That's why i want clean fall through code.
Thanks,
Vincent
Now we can have a discussion about whether macaroni is a better form of spaghetti because it is easier to eat with a spoon. The point is that if the compiler generates clean, straight fall-through code with CMOVs (or, for my part, with a set* instruction), the PGO pass still CAN turn the code into a branch when needed. GCC isn't that clever, though. GCC generates something that is slower on both Core 2 and AMD; it just hurts AMD *major league*. Not sure about Intel. I assume Nehalem isn't worse than Core 2 here.
There is no excuse for generating spaghetti code here, except when you want to keep GCC slower than other compilers and hurt AMD a tad more than Intel. When I say AMD I also mean other manufacturers such as IBM (if you use GCC on that chip).
Vincent