Beginner's guide to graphical profiling

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Beginner's guide to graphical profiling

Post by matthewlai »

Here is a quick beginner's guide to graphical profiling. I am doing this on Linux, but see the last section about doing this on Windows.

What is profiling? Profiling is instrumenting your code to see how much time your CPU(s) spend in different parts of your code.

Profiling is very important when you care about performance, because humans are notoriously bad at guessing where performance bottlenecks are, and if you want to optimize your program, you REALLY want to know where to focus your effort.

Have you heard of the saying "premature optimization is the root of all evil"? It's true, but when is it not premature? After you discover a performance bottleneck through profiling!

So how do you profile your own engine? 4 easy steps!

1. Implement a bench mode if you don't already have one. Have the engine just search a few positions and exit cleanly (this is important - you can't rely on Ctrl-C, because the profile is only written out when the program exits normally). A minimal sketch of such a bench mode follows the steps below.

2. Install the necessary tools (this is for Ubuntu/Debian, I'm sure you can figure it out for other distros) -

Code:

sudo apt-get install python python-pip graphviz
sudo pip install gprof2dot
3. Compile your engine with debugging symbols and instrumentation code inserted -

Code:

g++ (or gcc) -Os -g -pg <your normal options except don't do further optimization>
4. Run your program and generate a profile!

Code:

./<your engine> bench
gprof <your engine> | gprof2dot -s | dot -Tpng -o profile.png
That's it! Now look at profile.png
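For reference, here is a minimal sketch of what I mean by a bench mode. SearchPosition, the FEN list, and the depth are placeholders for whatever your engine actually uses:

Code:

// Minimal sketch of a bench mode. SearchPosition stands in for your engine's
// own search entry point; the positions and depth are arbitrary.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

uint64_t SearchPosition(const std::string &fen, int depth); // placeholder

void RunBench()
{
    // Fixed positions searched to a fixed depth, so every profiling run
    // does the same amount of work.
    const std::vector<std::string> fens = {
        "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
        "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3"
    };

    uint64_t totalNodes = 0;
    for (const auto &fen : fens)
        totalNodes += SearchPosition(fen, 10);

    std::cout << "Total nodes: " << totalNodes << std::endl;
    // Return normally instead of relying on Ctrl-C - the -pg instrumentation
    // only writes gmon.out when the program exits cleanly.
}
Hook that up to a "bench" command-line argument, and the instrumented build leaves a gmon.out file in the working directory, which is what gprof reads in step 4.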
----------------------------------------------------------------------------------------------
As an example, this is the profile of Giraffe: http://matthewlai.ca/tmp/profile.png

Each box represents a function. For example, the Search::Search() box says

Code:

Search::Search
94.85%
(0.11%)
1056108x
That means 94.85% of the total run time of the program (for this profiling run) is spent in this function and everything it calls, but only 0.11% of the time is spent in the function itself (as opposed to in the functions it calls). The function is called 1056108 times in total.

Then we see that 73.91% is spent in QSearch, and 85% in the eval function (regular search calls eval as well). Then if we follow the call graph all the way down, we see that almost all of that time is spent in the Forward call of linear layers in the neural network, which spends all its time in Eigen's matrix-vector product function, as we would expect.
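To make that last step concrete: the Forward call of a linear layer is essentially just one matrix-vector product plus a bias, along the lines of this sketch (illustrative only, not Giraffe's actual code):

Code:

#include <Eigen/Dense>

// Illustrative linear-layer forward pass - the weights * input product is
// the Eigen matrix-vector multiply that dominates the profile.
Eigen::VectorXf Forward(const Eigen::MatrixXf &weights,
                        const Eigen::VectorXf &bias,
                        const Eigen::VectorXf &input)
{
    return weights * input + bias;
}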

What are some of the things I can learn from this graph, if this is the first time I profiled?
1. I don't do pseudo-legal move generation, and if we look at Board::GenerateAllLegalMoves, we see that only 1.39% of the time is spent there anyway. There is absolutely no need to do pseudo-legal move generation (and make the code unnecessarily complex) in Giraffe. Obviously incremental move generation is even further out of the question.
2. I determine whether a move is checking or not by applying and unapplying. This is a highly inefficient way to do that, and other engines have very complicated ways of doing this. But should I spend time optimizing this in Giraffe? No. CheckLegal only takes 1.22% of the time (and it's called from absolutely everywhere).
3. SEE also takes almost no time at all. Neither does the SEE-maps thing I was doing, which I had thought was quite slow.
4. The only place I should really be spending time optimizing is the neural network (and feature generation, a bit), since they take up 86% of the time.
----------------------------------------------------------------------------------------------
On Windows: This is supposedly possible on Windows as well if you use GCC (MinGW or Cygwin), but I never tried it: http://yzhong.co/profiling-with-gprof-u ... -window-7/

If you use Microsoft tools, MSVC has its own profiler, but I have no idea how that works, and I believe you have to pay a lot of money to get that with the Enterprise edition.

Another popular (free) option is Very Sleepy: http://www.codersnotes.com/sleepy/
I believe gprof2dot also supports visualizing Very Sleepy profiles.
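If so, converting a Very Sleepy capture should look something like this - untested, so check gprof2dot --help for the exact format name and input it expects (capture.sleepy is a placeholder filename):

Code:

gprof2dot -f sleepy capture.sleepy | dot -Tpng -o profile.png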
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Beginner's guide to graphical profiling

Post by Ferdy »

matthewlai wrote:Here is a quick beginner's guide to graphical profiling. I am doing this on Linux, but see the last section about doing this on Windows.
Very nice, thanks.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beginner's guide to graphical profiling

Post by mcostalba »

Thanks, here is a run on current Stockfish master:

https://postimg.org/image/lhnz5s7dh
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Beginner's guide to graphical profiling

Post by matthewlai »

mcostalba wrote:Thanks, here is a run on current Stockfish master:

https://postimg.org/image/lhnz5s7dh
That's really cool! I thought eval in SF would take more than just 25%!
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beginner's guide to graphical profiling

Post by mcostalba »

matthewlai wrote:
mcostalba wrote:Thanks, here is a run on current Stockfish master:

https://postimg.org/image/lhnz5s7dh
That's really cool! I thought eval in SF would take more than just 25%!
Eval is cached in the TT along with the usual TT score.
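Roughly speaking, each TT entry carries the static eval next to the search score, something like this sketch (illustrative layout only, not Stockfish's actual TTEntry):

Code:

#include <cstdint>

// Illustrative only - not Stockfish's actual TTEntry layout.
struct TTEntry {
    uint64_t key;        // position hash
    int16_t  score;      // search score from the previous visit
    int16_t  staticEval; // cached static evaluation
    uint16_t move;       // best move found
    uint8_t  depth;
    uint8_t  bound;      // exact / lower / upper
};
On a TT hit the stored static eval is reused instead of recomputing the evaluation, so eval shows up with a smaller share of the run time than you might expect.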
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Beginner's guide to graphical profiling

Post by jwes »

matthewlai wrote:
mcostalba wrote:Thanks, here is a run on current Stockfish master:

https://postimg.org/image/lhnz5s7dh
That's really cool! I thought eval in SF would take more than just 25%!
I would not have thought MovePicker::next_move would take more time than eval.
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: Beginner's guide to graphical profiling

Post by elcabesa »

Getting a good move order is very important for keeping the tree small, so Stockfish spends a lot of time there.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Beginner's guide to graphical profiling

Post by bob »

matthewlai wrote:Here is a quick beginner's guide to graphical profiling. I am doing this on Linux, but see the last section about doing this on Windows.
Only thing I would add is that leaving off optimization will drastically skew the results and lead you down the wrong path frequently. I've not found that optimization causes any profiling problems if you are simply looking at the top level (as you are) to see which procedures are the heavy-hitters. If you want to look inside a procedure to specific lines of code, optimization confuses things of course.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Beginner's guide to graphical profiling

Post by matthewlai »

bob wrote: Only thing I would add is that leaving off optimization will drastically skew the results and lead you down the wrong path frequently. I've not found that optimization causes any profiling problems if you are simply looking at the top level (as you are) to see which procedures are the heavy-hitters. If you want to look inside a procedure to specific lines of code, optimization confuses things of course.
Yeah that would depend on the program. I found -Os to be a good compromise because it's almost as fast as -O3, but doesn't do function inlining (except for very tiny functions).
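Along the same lines, keeping a higher optimization level but reining in inlining and keeping frame pointers should also give readable call graphs, e.g. (standard GCC flags, but I haven't benchmarked this combination):

Code:

g++ -O2 -g -pg -fno-inline-functions -fno-omit-frame-pointer <your normal options>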
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Beginner's guide to graphical profiling

Post by bob »

matthewlai wrote:
bob wrote: Only thing I would add is that leaving off optimization will drastically skew the results and lead you down the wrong path frequently. I've not found that optimization causes any profiling problems if you are simply looking at the top level (as you are) to see which procedures are the heavy-hitters. If you want to look inside a procedure to specific lines of code, optimization confuses things of course.
Yeah that would depend on the program. I found -Os to be a good compromise because it's almost as fast as -O3, but doesn't do function inlining (except for very tiny functions).
I guess I don't understand your "don't do other optimizations". -O anything (other than 0) is a massive optimization anyway and makes it impossible to profile at the source line level.