GCC quirk

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

GCC quirk

Post by bob »

Last night I was getting ready for CCT, starting this morning. As I did a profile-guided-optimization run, gcc puked up one of those "crafty.gcda is corrupted" messages. I remembered running across a "hack" the compiler guys had added to supposedly work around that, but I could not remember what it was.

What I do is compile with the -fprofile-arcs option, then run a bunch of single-thread positions, and then I run a multi-threaded position so that the parallel search code gets profiled as well. I then go back and compile again using -fprofile-use.

I couldn't remember the corrupt .gcda workaround, so I started reading the gcc docs. And found it: -fprofile-correction, which seems to work around the corrupted file just fine. But in reading, I found that it was recommended to use -fprofile-generate rather than -fprofile-arcs, because it does a little more in terms of optimization. I changed that, did a quick one-cpu benchmark and all was well.
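
The three steps described so far (instrument, run training positions, rebuild with the profile) could be sketched as a small build driver. This is only an illustration: the source file name, output name, the -O3 and -pthread flags, and the two training scripts are assumptions, not the actual Crafty Makefile. The only options taken from the post are -fprofile-arcs / -fprofile-generate, -fprofile-use and -fprofile-correction.

```python
import shlex


def pgo_commands(sources, output, training_runs,
                 instrument_flag="-fprofile-arcs",
                 base_flags=("-O3", "-pthread")):
    """Return the shell command lines for a three-step GCC PGO build.

    sources, output, training_runs and base_flags are placeholders chosen
    for illustration; only the -fprofile-* flags come from the discussion.
    """
    if instrument_flag not in ("-fprofile-arcs", "-fprofile-generate"):
        raise ValueError("unexpected instrumentation flag: " + instrument_flag)

    compile_prefix = ["gcc", *base_flags]
    target = ["-o", output, *sources]

    # Step 1: build an instrumented binary that writes .gcda files on exit.
    step1 = compile_prefix + [instrument_flag] + target

    # Step 2: run the training workloads (single-threaded, then threaded).
    step2 = [list(run) for run in training_runs]

    # Step 3: rebuild using the collected profile. -fprofile-correction asks
    # GCC to smooth over inconsistent counts left by multi-threaded runs.
    step3 = compile_prefix + ["-fprofile-use", "-fprofile-correction"] + target

    return [shlex.join(cmd) for cmd in [step1, *step2, step3]]


if __name__ == "__main__":
    # Hypothetical file and script names, for illustration only.
    runs = [["./run_single_thread_positions.sh"],
            ["./run_multi_thread_position.sh"]]
    for line in pgo_commands(["crafty.c"], "crafty", runs):
        print(line)
```

Passing instrument_flag="-fprofile-generate" produces the variant that was tried before the tournament, so both builds can be generated from the same script and compared.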

BTW before I changed the Makefile above, I was seeing about 40M nodes per second as the lower bound on speeds on the 12-core box I am using (a 2010 cluster we bought, fairly old, using just one node here).

When round one started, search speed was 25M nodes per second. It SLOWLY increased to a little over 30M peak but never went past that. I puzzled over this and recompiled, but nothing fixed it. I played the first 3 rounds leaving it alone, but had a little time to play with this on another node. When I changed the profile option from -fprofile-generate back to -fprofile-arcs, the speed went back to 40M+. No idea why. The docs do say that -fprofile-generate enables -fprofile-arcs, -fprofile-values, and -fvpt. One of those breaks Crafty badly. It plays normally, but at 25M vs 40M.

I'm going to experiment later. But two key bits: -fprofile-correction solves the age-old problem of trying to PGO a program that uses threads, and the obvious -fprofile-generate seems to hurt a threaded program significantly. Might only be Crafty, but 40M down to 25M is a major slowdown.
AlvaroBegue
Posts: 932
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: GCC quirk

Post by AlvaroBegue »

The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: GCC quirk

Post by lucasart »

I never managed to get anything from PGO using GCC for DiscoCheck. Perhaps Clang is better at PGO. But, in any case, don't risk anything before a tournament for the sake of micro optimization. It's not worth it. Just go with a nice and stable Crafty.

When I compile Stockfish with and without PGO using GCC, it's the same. The PGO executable is not faster. And, of course, it's faster to build a non-PGO one, so I never use PGO.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: GCC quirk

Post by michiguel »

bob wrote:Last night I was getting ready for CCT, starting this morning. As I did a profile-guided-optimization run, gcc puked up one of those "crafty.gcda is corrupted" messages. I remembered running across a "hack" the compiler guys had added to supposedly work around that, but I could not remember what it was.

What I do is compile with the -fprofile-arcs option, then run a bunch of single-thread positions, and then I run a multi-threaded position so that the parallel search code gets profiled as well. I then go back and compile again using -fprofile-use.

I couldn't remember the corrupt .gcda workaround, so I started reading the gcc docs. And found it: -fprofile-correction, which seems to work around the corrupted file just fine. But in reading, I found that it was recommended to use -fprofile-generate rather than -fprofile-arcs, because it does a little more in terms of optimization. I changed that, did a quick one-cpu benchmark and all was well.

BTW before I changed the Makefile above, I was seeing about 40M nodes per second as the lower bound on speeds on the 12-core box I am using (a 2010 cluster we bought, fairly old, using just one node here).

When round one started, search speed was 25M nodes per second. It SLOWLY increased to a little over 30M peak but never went past that. I puzzled over this and recompiled, but nothing fixed it. I played the first 3 rounds leaving it alone, but had a little time to play with this on another node. When I changed the profile option from -fprofile-generate back to -fprofile-arcs, the speed went back to 40M+. No idea why. The docs do say that -fprofile-generate enables -fprofile-arcs, -fprofile-values, and -fvpt. One of those breaks Crafty badly. It plays normally, but at 25M vs 40M.

I'm going to experiment later. But two key bits: -fprofile-correction solves the age-old problem of trying to PGO a program that uses threads, and the obvious -fprofile-generate seems to hurt a threaded program significantly. Might only be Crafty, but 40M down to 25M is a major slowdown.
That is strange, but I just tested it and -fprofile-arcs is slightly better for me than -fprofile-generate (which is what I tested before and had in my script). It would never have occurred to me to check that, and it is unexpected. This is not really a thorough test, only repeated twice, but roughly this is what I got with Gaviota:

(relative speed single core)
Intel = 100%
Intel pgo = 111%
GCC (4.6.3) pgo (with -fprofile-generate) = 111%
GCC (4.6.3) pgo (with -fprofile-arcs) = 114%
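
To put the two sets of numbers in this thread on the same scale, the relative change can be worked out directly. This is just arithmetic on the figures already quoted (the percentages above and the 25M vs 40M from the first post); the helper name relative_gain is made up for the example.

```python
def relative_gain(new, old):
    """Fractional speed change of `new` relative to `old` (0.10 = 10% faster)."""
    return new / old - 1.0


# Gaviota, single core: percentages from the table (Intel non-PGO = 100).
arcs_vs_generate = relative_gain(114, 111)

# Crafty, multi-threaded: millions of nodes per second from the first post.
generate_vs_arcs = relative_gain(25, 40)

print(f"Gaviota, -fprofile-arcs vs -fprofile-generate: {arcs_vs_generate:+.1%}")
print(f"Crafty, -fprofile-generate vs -fprofile-arcs: {generate_vs_arcs:+.1%}")
```

So the single-core difference here is under 3%, while the threaded difference reported for Crafty is a loss of more than a third.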

Miguel
bob

Re: GCC quirk

Post by bob »

AlvaroBegue wrote:The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.
I tested it. And it looked normal. I just didn't test with threads. Never had a case where single CPU NPS was the same or better, due to compiler optimizations, but thread NPS was slower. Not until this case anyway. Certainly won't happen again...
bob

Re: GCC quirk

Post by bob »

lucasart wrote:I never managed to get anything from PGO using GCC for DiscoCheck. Perhaps Clang is better at PGO. But, in any case, don't risk anything before a tournament for the sake of micro optimization. It's not worth it. Just go with a nice and stable Crafty.

When I compile Stockfish with and without PGO using GCC, it's the same. The PGO executable is not faster. And, of course, it's faster to build a non-PGO one, so I never use PGO.
For me, with Intel C and GCC, PGO makes a faster executable. I have not measured the GCC difference lately (will do this later today and post results), but Intel usually gains about 10% overall, or it did the last time I tested.
bob

Re: GCC quirk

Post by bob »

michiguel wrote:
bob wrote:Last night I was getting ready for CCT, starting this morning. As I did a profile-guided-optimization run, gcc puked up one of those "crafty.gcda is corrupted" messages. I remembered running across a "hack" the compiler guys had added to supposedly work around that, but I could not remember what it was.

What I do is compile with the -fprofile-arcs option, then run a bunch of single-thread positions, and then I run a multi-threaded position so that the parallel search code gets profiled as well. I then go back and compile again using -fprofile-use.

I couldn't remember the corrupt .gcda workaround, so I started reading the gcc docs. And found it: -fprofile-correction, which seems to work around the corrupted file just fine. But in reading, I found that it was recommended to use -fprofile-generate rather than -fprofile-arcs, because it does a little more in terms of optimization. I changed that, did a quick one-cpu benchmark and all was well.

BTW before I changed the Makefile above, I was seeing about 40M nodes per second as the lower bound on speeds on the 12-core box I am using (a 2010 cluster we bought, fairly old, using just one node here).

When round one started, search speed was 25M nodes per second. It SLOWLY increased to a little over 30M peak but never went past that. I puzzled over this and recompiled, but nothing fixed it. I played the first 3 rounds leaving it alone, but had a little time to play with this on another node. When I changed the profile option from -fprofile-generate back to -fprofile-arcs, the speed went back to 40M+. No idea why. The docs do say that -fprofile-generate enables -fprofile-arcs, -fprofile-values, and -fvpt. One of those breaks Crafty badly. It plays normally, but at 25M vs 40M.

I'm going to experiment later. But two key bits: -fprofile-correction solves the age-old problem of trying to PGO a program that uses threads, and the obvious -fprofile-generate seems to hurt a threaded program significantly. Might only be Crafty, but 40M down to 25M is a major slowdown.
That is strange, but I just tested it and -fprofile-arcs is slightly better for me than -fprofile-generate (which is what I tested before and had in my script). It would never have occurred to me to check that, and it is unexpected. This is not really a thorough test, only repeated twice, but roughly this is what I got with Gaviota:

(relative speed single core)
Intel = 100%
Intel pgo = 111%
GCC (4.6.3) pgo (with -fprofile-generate) = 111%
GCC (4.6.3) pgo (with -fprofile-arcs) = 114%

Miguel
See if it greatly slows down parallel execution, although if you use processes rather than threads there is probably no effect.
AlvaroBegue

Re: GCC quirk

Post by AlvaroBegue »

bob wrote:
AlvaroBegue wrote:The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.
I tested it. And it looked normal. I just didn't test with threads. Never had a case where single CPU NPS was the same or better, due to compiler optimizations, but thread NPS was slower. Not until this case anyway. Certainly won't happen again...
That's completely understandable: I wouldn't have expected it either.

I'll learn from your mistake and add a multi-thread speed test to my automated checks. So thanks for reporting this.
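
A minimal sketch of what such a check might look like, assuming the engine's nodes per second have already been measured at each thread count by some engine-specific benchmark (not shown here). The function name, the 5% tolerance, and the single-thread figures in the example are made up; only the 40M and 25M figures come from this thread.

```python
def nps_regressions(baseline, measured, tolerance=0.05):
    """Return {threads: ratio} for runs more than `tolerance` below baseline.

    baseline and measured map a thread count to nodes per second.
    """
    slow = {}
    for threads, base_nps in baseline.items():
        if threads not in measured:
            raise KeyError(f"no measurement for {threads} thread(s)")
        ratio = measured[threads] / base_nps
        if ratio < 1.0 - tolerance:
            slow[threads] = ratio
    return slow


if __name__ == "__main__":
    # 12-thread numbers are the ones reported in this thread; the 1-thread
    # numbers are invented to show a case where single-CPU speed looks fine.
    baseline = {1: 4.0e6, 12: 40.0e6}
    new_build = {1: 4.0e6, 12: 25.0e6}
    for threads, ratio in nps_regressions(baseline, new_build).items():
        print(f"{threads} threads: {ratio:.1%} of baseline NPS")
```

Checking every thread count that will be used in play, rather than only one CPU, is the point: a single-thread benchmark alone would have passed here.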
jdart
Posts: 4420
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: GCC quirk

Post by jdart »

That is interesting and I will give it a try.

Btw, I notice that recent versions of MSVC have a "safe mode" and a "fast mode" for PGO. "Fast mode" is the default, but only "safe mode" is thread-safe.

But with Microsoft I don't get much from PGO, especially on x64. The output indicates that few routines are being optimized for speed during the PGO phase.

gcc may do better.

--Jon