Last night I was getting ready for CCT, starting this morning. As I did a profile-guided-optimization run, gcc puked up one of those "crafty.gcda is corrupted" messages. I remembered running across a "hack" the compiler guys had added to supposedly work around that, but I could not remember what it was.
What I do is compile with the -fprofile-arcs option, then run a bunch of single-thread positions, and then I run a multi-threaded position so that the parallel search code gets profiled as well. I then go back and compile again using -fprofile-use.
I couldn't remember the corrupt gcda workaround, so I started reading the gcc docs. And found it: -fprofile-correction, which seems to work around the corrupted file just fine. But in reading, I found that it was recommended to use -fprofile-generate rather than -fprofile-arcs, because it does a little more in terms of optimization. I changed that, did a quick one-cpu benchmark and all was well.
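For anyone who wants to script the two-pass build described above, here is a minimal sketch, not the actual Crafty Makefile. The function name `pgo_commands`, the source file `crafty.c`, and the `-O3 -pthread` base flags are assumptions made for illustration; only `-fprofile-arcs`, `-fprofile-use`, and `-fprofile-correction` come from the post.

```python
def pgo_commands(sources, output, instrument_flag="-fprofile-arcs"):
    """Return (instrumented_build, optimized_build) as gcc argument lists.

    Assumed base flags: -O3 and -pthread. Adjust to match the real Makefile.
    """
    base = ["gcc", "-O3", "-pthread"]
    # Pass 1: build an instrumented binary that writes .gcda files when run.
    instrumented = base + [instrument_flag, "-o", output] + list(sources)
    # Pass 2: rebuild using the collected profile. -fprofile-correction asks
    # gcc to smooth out inconsistent counters left by multi-threaded runs.
    optimized = base + ["-fprofile-use", "-fprofile-correction",
                        "-o", output] + list(sources)
    return instrumented, optimized


if __name__ == "__main__":
    first, second = pgo_commands(["crafty.c"], "crafty")
    print(" ".join(first))
    print("# run the single-thread and multi-thread training positions here")
    print(" ".join(second))
```

Passing `instrument_flag="-fprofile-generate"` gives the variant that was slower here, which makes it easy to build both and compare.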
BTW before I changed the Makefile above, I was seeing about 40M nodes per second as the lower bound on speeds on the 12-core box I am using (a 2010 cluster we bought, fairly old, using just one node here).
When round one started, search speed was 25M nodes per second. It SLOWLY increased to a little over 30M peak but never went past that. I puzzled over this, recompiled, nothing fixed it. I played the first 3 rounds leaving it alone, but had a little time to play with this on another node. When I changed the profile option from -fprofile-generate to -fprofile-arcs, the speed went back to 40M+. No idea why. The docs do say that -fprofile-generate enables -fprofile-arcs, -fprofile-values, and -fvpt. One of those breaks Crafty badly. Plays normally, but 25M vs 40M.
I'm going to experiment later. But two key bits: -fprofile-correction solves the age-old problem of trying to PGO a program that uses threads, and the obvious -fprofile-generate seems to hurt a threaded program significantly. Might only be Crafty, but 40M down to 25M is a major slowdown.
GCC quirk
Moderator: Ras
-
AlvaroBegue
- Posts: 932
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: GCC quirk
The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.
-
lucasart
- Posts: 3243
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: GCC quirk
I never managed to get anything from PGO using GCC for DiscoCheck. Perhaps Clang is better at PGO. But, in any case, don't risk anything before a tournament for the sake of micro optimization. It's not worth it. Just go with a nice and stable Crafty.
When I compile Stockfish with and without PGO using GCC, it's the same. The PGO executable is not faster. And, of course, it's faster to build a non-PGO one, so I never use PGO.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
michiguel
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: GCC quirk
That is strange, but I just tested it, and -fprofile-arcs is slightly better for me than -fprofile-generate (which is what I tested before and had in my script). It would never have occurred to me to check that, and it is unexpected. This is not a really thorough test, only repeated twice, but roughly this is what I got with Gaviota:
(relative speed single core)
Intel = 100%
Intel pgo = 111%
GCC (4.6.3) pgo (with -fprofile-generate) = 111%
GCC (4.6.3) pgo (with -fprofile-arcs) = 114%
Miguel
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: GCC quirk
AlvaroBegue wrote: The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.

I tested it. And it looked normal. I just didn't test with threads. Never had a case where single CPU NPS was the same or better, due to compiler optimizations, but thread NPS was slower. Not until this case anyway. Certainly won't happen again...
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: GCC quirk
lucasart wrote: I never managed to get anything from PGO using GCC for DiscoCheck. Perhaps Clang is better at PGO. But, in any case, don't risk anything before a tournament for the sake of micro optimization. It's not worth it. Just go with a nice and stable Crafty.
When I compile Stockfish with and without PGO using GCC, it's the same. The PGO executable is not faster. And, of course, it's faster to build a non-PGO one, so I never use PGO.

For me, with Intel C and GCC, PGO makes a faster executable. I have not measured gcc difference lately (will do this later today and post results) but intel usually gains about 10% overall, or it did the last time I tested.
-
bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: GCC quirk
michiguel wrote: That is strange, but I just tested it and -fprofile-arcs is slightly better for me than -fprofile-generate (which is what I tested before and had in my script). It would have never occurred to me to check that and it is unexpected. This is not really thorough test, only repeated twice, but roughly this is what I got with Gaviota:
(relative speed single core)
Intel = 100%
Intel pgo = 111%
GCC (4.6.3) pgo (with -fprofile-generate) = 111%
GCC (4.6.3) pgo (with -fprofile-arcs) = 114%
Miguel

See if it greatly slows down parallel execution, although if you use processes rather than threads there is probably no effect.
-
AlvaroBegue
- Posts: 932
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: GCC quirk
AlvaroBegue wrote: The most important lesson from this story should be "don't change anything before a tournament without testing it thoroughly." Speed in nodes per second is particularly cheap to test, and it should definitely be run after changing compilation options.
bob wrote: I tested it. And it looked normal. I just didn't test with threads. Never had a case where single CPU NPS was the same or better, due to compiler optimizations, but thread NPS was slower. Not until this case anyway. Certainly won't happen again...

That's completely understandable: I wouldn't have expected it either.
I'll learn from your mistake and add a multi-thread speed test to my automated checks. So thanks for reporting this.
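A check like that could look roughly like the sketch below. It only compares numbers that have already been measured; how the engine is run and how NPS is read out is left open. The function name `nps_regressions`, the 10% tolerance, and the single-thread figures are made up for illustration; the 40M and 25M values for the 12-core run are the ones reported in this thread.

```python
def nps_regressions(baseline, measured, tolerance=0.10):
    """Return {threads: (baseline_nps, measured_nps)} for every thread count
    whose measured speed is more than `tolerance` below its baseline."""
    slow = {}
    for threads, base_nps in baseline.items():
        nps = measured.get(threads)
        if nps is None:
            # A missing measurement counts as a failure rather than a pass.
            slow[threads] = (base_nps, None)
        elif nps < base_nps * (1.0 - tolerance):
            slow[threads] = (base_nps, nps)
    return slow


if __name__ == "__main__":
    # Single-thread values are hypothetical; the 12-thread values are from the thread.
    baseline = {1: 4.0e6, 12: 40.0e6}
    measured = {1: 4.1e6, 12: 25.0e6}
    for threads, (base, now) in sorted(nps_regressions(baseline, measured).items()):
        print(f"{threads} threads: baseline {base:.3g} NPS, measured {now:.3g} NPS")
```

With these inputs the single-thread run passes and only the 12-thread run is flagged, which is exactly the case a one-cpu benchmark misses.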
-
jdart
- Posts: 4420
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: GCC quirk
That is interesting and I will give it a try.
Btw. I notice that recent versions of MSVC have "safe mode" and "fast mode" for PGO - "fast mode" is the default but only "safe mode" is thread-safe.
But with Microsoft I don't get much from PGO especially on x64. The output indicates that few routines are being optimized for speed during the PGO phase.
gcc may do better.
--Jon