GCC quirk

bob · Post by **bob** » Sun Jan 26, 2014 12:27 am

Last night I was getting ready for CCT, starting this morning. As I did a profile-guided-optimization run, gcc puked up one of those "crafty.gcda is corrupted" messages. I remembered running across a "hack" the compiler guys had added to supposedly work around that, but I could not remember what it was.

What I do is compile with the -fprofile-arcs option, then run a bunch of single-thread positions, and then I run a multi-threaded position so that the parallel search code gets profiled as well. I then go back and compile again using -fprofile-use.

I couldn't remember the corrupt gcda workaround, so I started reading the gcc docs and found it: -fprofile-correction, which seems to work around the corrupted file just fine. But in reading, I found that it was recommended to use -fprofile-generate rather than -fprofile-arcs, because it does a little more in terms of optimization. I changed that, did a quick one-cpu benchmark and all was well.
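For reference, a minimal sketch of the two-pass build described above, written as a small Python helper that only assembles the GCC command lines (it does not run them). The flags -fprofile-arcs, -fprofile-use, and -fprofile-correction are the ones named in this post; the source file name, the output name, and the -O3 and -pthread options are assumptions made for illustration, not Crafty's actual Makefile settings.

```python
import shlex


def pgo_commands(sources, output="crafty", instrument_flag="-fprofile-arcs"):
    """Return (instrumented, optimized) GCC command lines for a PGO build.

    The training runs (single-thread positions, then a multi-threaded one)
    happen between the two commands and are not modelled here.
    File names and the base flags are placeholders, not Crafty's real ones.
    """
    base = ["gcc", "-O3", "-pthread"]  # assumed base flags
    # Pass 1: instrumented build. Running the result writes .gcda profiles.
    instrumented = base + [instrument_flag] + list(sources) + ["-o", output]
    # Pass 2: rebuild from the profiles. -fprofile-correction lets GCC smooth
    # counters that multi-threaded training runs left inconsistent.
    optimized = (
        base
        + ["-fprofile-use", "-fprofile-correction"]
        + list(sources)
        + ["-o", output]
    )
    return instrumented, optimized


if __name__ == "__main__":
    for cmd in pgo_commands(["crafty.c"]):
        print(shlex.join(cmd))
```

Passing instrument_flag="-fprofile-generate" gives the variant recommended by the docs, which makes it easy to build both and compare them.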

BTW before I changed the Makefile above, I was seeing about 40M nodes per second as the lower bound on speeds on the 12-core box I am using (a 2010 cluster we bought, fairly old, using just one node here).

When round one started, search speed was 25M nodes per second. It SLOWLY increased to a little over 30M peak but never went past that. I puzzled over this, recompiled, nothing fixed it. I played the first 3 rounds leaving it alone, but had a little time to play with this on another node. When I changed the profile option from -fprofile-generate to -fprofile-arcs, the speed went back to 40M+. No idea why. The docs do say that -fprofile-generate enables -fprofile-arcs, -fprofile-values, and -fvpt. One of those breaks Crafty badly. Plays normally, but 25M vs 40M.

I'm going to experiment later. But two key bits. The -fprofile-correction option solves the age-old problem of trying to PGO a program that uses threads, and the obvious -fprofile-generate seems to hurt a threaded program significantly. Might only be Crafty, but 40M down to 25M is a major slowdown.
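To put the numbers in this post on one scale, here is a small arithmetic sketch using only the figures quoted above (40M baseline, 25M at the start of round one, roughly 30M at peak). The function name is made up for illustration. Since 40M was described as a lower bound, the real slowdown may be somewhat larger than what this prints.

```python
def relative_speed(observed_nps, baseline_nps):
    """Observed speed as a fraction of the baseline speed."""
    return observed_nps / baseline_nps


BASELINE_NPS = 40e6  # lower bound seen with -fprofile-arcs

for label, nps in [("start of round one", 25e6), ("peak", 30e6)]:
    ratio = relative_speed(nps, BASELINE_NPS)
    print(f"{label}: {ratio:.1%} of baseline, {1 - ratio:.1%} slower")
# start of round one: 62.5% of baseline, 37.5% slower
# peak: 75.0% of baseline, 25.0% slower
```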

AlvaroBegue · Post by **AlvaroBegue** » Sun Jan 26, 2014 4:58 am

The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.

lucasart · Post by **lucasart** » Sun Jan 26, 2014 6:35 am

I never managed to get anything from PGO using GCC for DiscoCheck. Perhaps Clang is better at PGO. But, in any case, don't risk anything before a tournament for the sake of micro optimization. It's not worth it. Just go with a nice and stable Crafty.

When I compile Stockfish with and without PGO using GCC, it's the same. The PGO executable is not faster. And, of course, it's faster to build a non-PGO one, so I never use PGO.

michiguel · Post by **michiguel** » Sun Jan 26, 2014 7:06 am

bob wrote:Last night I was getting ready for CCT, starting this morning. As I did a profile-guided-optimization run, gcc puked up one of those "crafty.gcda is corrupted" messages. I remembered running across a "hack" the compiler guys had added to supposedly work around that, but I could not remember what it was.

What I do is compile with the -fprofile-arcs option, then run a bunch of single-thread positions, and then I run a multi-threaded position so that the parallel search code gets profiled as well. I then go back and compile again using -fprofile-use.

I couldn't remember the corrupt gcda workaround, so I started reading the gcc docs and found it: -fprofile-correction, which seems to work around the corrupted file just fine. But in reading, I found that it was recommended to use -fprofile-generate rather than -fprofile-arcs, because it does a little more in terms of optimization. I changed that, did a quick one-cpu benchmark and all was well.

BTW before I changed the Makefile above, I was seeing about 40M nodes per second as the lower bound on speeds on the 12-core box I am using (a 2010 cluster we bought, fairly old, using just one node here).

When round one started, search speed was 25M nodes per second. It SLOWLY increased to a little over 30M peak but never went past that. I puzzled over this, recompiled, nothing fixed it. I played the first 3 rounds leaving it alone, but had a little time to play with this on another node. When I changed the profile option from -fprofile-generate to -fprofile-arcs, the speed went back to 40M+. No idea why. The docs do say that -fprofile-generate enables -fprofile-arcs, -fprofile-values, and -fvpt. One of those breaks Crafty badly. Plays normally, but 25M vs 40M.

I'm going to experiment later. But two key bits. The -fprofile-correction option solves the age-old problem of trying to PGO a program that uses threads, and the obvious -fprofile-generate seems to hurt a threaded program significantly. Might only be Crafty, but 40M down to 25M is a major slowdown.

That is strange, but I just tested it and -fprofile-arcs is slightly better for me than -fprofile-generate (which is what I tested before and had in my script). It would never have occurred to me to check that, and it is unexpected. This is not really a thorough test, only repeated twice, but roughly this is what I got with Gaviota:

(relative speed single core)
Intel = 100%
Intel pgo = 111%
GCC (4.6.3) pgo (with -fprofile-generate) = 111%
GCC (4.6.3) pgo (with -fprofile-arcs) = 114%

Miguel

bob · Post by **bob** » Sun Jan 26, 2014 5:00 pm

AlvaroBegue wrote:The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.

I tested it. And it looked normal. I just didn't test with threads. Never had a case where single CPU NPS was the same or better, due to compiler optimizations, but thread NPS was slower. Not until this case anyway. Certainly won't happen again...

bob · Post by **bob** » Sun Jan 26, 2014 5:01 pm

lucasart wrote:I never managed to get anything from PGO using GCC for DiscoCheck. Perhaps Clang is better at PGO. But, in any case, don't risk anything before a tournament for the sake of micro optimization. It's not worth it. Just go with a nice and stable Crafty.

When I compile Stockfish with and without PGO using GCC, it's the same. The PGO executable is not faster. And, of course, it's faster to build a non-PGO one, so I never use PGO.

For me, with Intel C and GCC, PGO makes a faster executable. I have not measured gcc difference lately (will do this later today and post results) but intel usually gains about 10% overall, or it did the last time I tested.

bob · Post by **bob** » Sun Jan 26, 2014 5:02 pm

michiguel wrote:
bob wrote:Last night I was getting ready for CCT, starting this morning. As I did a profile-guided-optimization run, gcc puked up one of those "crafty.gcda is corrupted" messages. I remembered running across a "hack" the compiler guys had added to supposedly work around that, but I could not remember what it was.

What I do is compile with the -fprofile-arcs option, then run a bunch of single-thread positions, and then I run a multi-threaded position so that the parallel search code gets profiled as well. I then go back and compile again using -fprofile-use.

I couldn't remember the corrupt gcda workaround, so I started reading the gcc docs and found it: -fprofile-correction, which seems to work around the corrupted file just fine. But in reading, I found that it was recommended to use -fprofile-generate rather than -fprofile-arcs, because it does a little more in terms of optimization. I changed that, did a quick one-cpu benchmark and all was well.

BTW before I changed the Makefile above, I was seeing about 40M nodes per second as the lower bound on speeds on the 12-core box I am using (a 2010 cluster we bought, fairly old, using just one node here).

When round one started, search speed was 25M nodes per second. It SLOWLY increased to a little over 30M peak but never went past that. I puzzled over this, recompiled, nothing fixed it. I played the first 3 rounds leaving it alone, but had a little time to play with this on another node. When I changed the profile option from -fprofile-generate to -fprofile-arcs, the speed went back to 40M+. No idea why. The docs do say that -fprofile-generate enables -fprofile-arcs, -fprofile-values, and -fvpt. One of those breaks Crafty badly. Plays normally, but 25M vs 40M.

I'm going to experiment later. But two key bits. The -fprofile-correction option solves the age-old problem of trying to PGO a program that uses threads, and the obvious -fprofile-generate seems to hurt a threaded program significantly. Might only be Crafty, but 40M down to 25M is a major slowdown.

That is strange, but I just tested it and -fprofile-arcs is slightly better for me than -fprofile-generate (which is what I tested before and had in my script). It would never have occurred to me to check that, and it is unexpected. This is not really a thorough test, only repeated twice, but roughly this is what I got with Gaviota:

(relative speed single core)
Intel = 100%
Intel pgo = 111%
GCC (4.6.3) pgo (with -fprofile-generate) = 111%
GCC (4.6.3) pgo (with -fprofile-arcs) = 114%

Miguel

see if it greatly slows down parallel execution, although if you use processes rather than threads there is probably no effect.

AlvaroBegue · Post by **AlvaroBegue** » Sun Jan 26, 2014 5:31 pm

bob wrote:
AlvaroBegue wrote:The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.
I tested it. And it looked normal. I just didn't test with threads. Never had a case where single CPU NPS was the same or better, due to compiler optimizations, but thread NPS was slower. Not until this case anyway. Certainly won't happen again...

That's completely understandable: I wouldn't have expected it either.

I'll learn from your mistake and add a multi-thread speed test to my automated checks. So thanks for reporting this.
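A check like that could look roughly like the sketch below: compare measured nodes per second against stored baselines for each thread count and flag anything that drops too far. The function name, the 5% tolerance, and the single-thread figures are placeholders chosen for illustration; only the 12-thread numbers (40M baseline, 25M measured) come from this thread.

```python
def nps_regressions(measured, baseline, tolerance=0.05):
    """Return the thread counts whose NPS fell more than `tolerance`
    below the stored baseline (or that were not measured at all)."""
    failures = []
    for threads, base_nps in baseline.items():
        nps = measured.get(threads)
        if nps is None or nps < base_nps * (1 - tolerance):
            failures.append(threads)
    return sorted(failures)


if __name__ == "__main__":
    # Single-thread values are hypothetical; 12-thread values are from the thread.
    baseline = {1: 4.0e6, 12: 40e6}
    measured = {1: 4.1e6, 12: 25e6}
    print(nps_regressions(measured, baseline))  # [12]
```

With these inputs the single-thread run passes while the 12-thread run is flagged, which is exactly the case a one-cpu benchmark alone would miss.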

jdart · Post by **jdart** » Mon Jan 27, 2014 3:54 pm

That is interesting and I will give it a try.

Btw, I notice that recent versions of MSVC have "safe mode" and "fast mode" for PGO - "fast mode" is the default but only "safe mode" is thread-safe.

But with Microsoft I don't get much from PGO especially on x64. The output indicates that few routines are being optimized for speed during the PGO phase.

gcc may do better.

--Jon
