Profiling PARPG


This article covers an established workflow.

Established workflows have been discussed among and agreed upon by the development department(s). They should serve as documentation of how these workflows are successfully utilized in the project.

This tutorial explains how to profile the PARPG codebase.


An introduction to profiling

Let us start with a quote from Donald Knuth: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.”

To optimize an application's performance, at some point we need to pinpoint the exact locations where resources are wasted; this process is called profiling. During profiling a dynamic analysis of the application is conducted, and the result is a breakdown that shows where resources are spent.

However, profiling can itself be tricky business, since it is somewhat related to the Heisenberg uncertainty principle: by profiling you typically alter the state of the system. Profiling can happen by code annotation or by runtime analysis, but neither comes for free, so profiling often results in the application running several times slower than normal.

On top of this, we have an extra problem in our context, since we are writing a Python frontend to a C++ backend. This means that a Python-only solution or a C/C++-only solution will not be sufficient. As a result, this page presents several suggestions on how profiling can be done and which tools we can use. Whoever wants to perform profiling will still need to decide which solution fits the problem under investigation best.

One important note: not everything on this page is portable. Everything described here has only been tested on a .deb based Linux system, which might introduce some problems when applying this information on other systems. Valgrind, for example, is only available on Linux and Mac but not on BSD/Win32/... and all your other favorite choices.

Profiling Python: cProfile

See for example this document on python.org, which explains how to profile Python applications. On a .deb based system one would install the package python-profiler. The easiest way to profile is then simply calling:

python -m cProfile run.py

This will execute PARPG, and when it finishes it will show you a listing with timing information and the calls that were made. By default the output is sorted on function name (which is basically nonsense). Typically you will be interested in the total number of calls, or in the total time spent in a function, to find the oddities, so you will want to sort on a different column. To accomplish this you can use the -s parameter:

python -m cProfile -s tim run.py
python -m cProfile -s cal run.py
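If you want to keep the raw profiling data for later analysis instead of printing it right away, you can also dump the statistics to a file with the -o option (the file name 'mainprof' is just an example):

python -m cProfile -o mainprof run.py

The resulting file can then be loaded with the pstats module, as shown below.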

An alternative approach would be to modify the main function in run.py, for example in the following way:

if __name__ == '__main__':
    import cProfile
    import pstats
    # profile main() and dump the raw statistics to the file 'mainprof'
    cProfile.run('main()', 'mainprof')
    stats = pstats.Stats('mainprof')
    stats.sort_stats('tim')     # 'tim' is an abbreviation of the 'time' sort key
    #stats.sort_stats('cal')    # sort on the number of calls instead
    stats.print_stats()

In this case you can make use of the pstats module to do some data mining of your own (within Python), or, for example, save the profiling output to a file and postprocess it in a standalone application. (Note: you can pass multiple keys to sort on; however, this doesn't work that well for timing information, since time information is stored more precisely than it is displayed, so stats.sort_stats('tim', 'cal') will give unexpected results.)
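As a minimal sketch of such a standalone postprocessing script (assuming the profile was dumped to a file called 'mainprof' as above; the limit of 20 entries and the 'Engine_pump' filter are arbitrary choices for illustration):

import pstats

# load the dump written earlier by cProfile
stats = pstats.Stats('mainprof')
stats.strip_dirs()                  # shorten file paths for readability
stats.sort_stats('time')            # total time spent inside the function itself
stats.print_stats(20)               # only print the 20 most expensive entries
stats.print_callers('Engine_pump')  # who is calling the FIFE heartbeat?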

This approach is great for finding bottlenecks that live inside the Python part of PARPG, but unfortunately most of the time the output will resemble this:

         614729 function calls (612610 primitive calls) in 54.367 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1378   47.932    0.035   49.570    0.036 {_fife.Engine_pump}
        1    1.398    1.398    1.398    1.398 {_fife.Engine_destroy}
        4    0.336    0.084    1.332    0.333 xmlmap.py:226(parseInstances)
      421    0.251    0.001    0.251    0.001 {_fife.Label_setWidth}
     1953    0.188    0.000    0.188    0.000 {_fife.Camera_getMatchingInstances}
        1    0.183    0.183    0.183    0.183 {_fife.Engine_init}
    34501    0.174    0.000    0.214    0.000 fife.py:14(_swig_setattr_nondynamic)
    34501    0.161    0.000    0.375    0.000 fife.py:27(_swig_setattr)
    69053    0.149    0.000    0.149    0.000 {built-in method get}
      221    0.142    0.001    0.651    0.003 xmlanimation.py:40(do_load_resource)

This tells us that the application is spending most of its time within FIFE (_fife.Engine_pump is the heartbeat which makes FIFE render a new frame). So at this point cProfile is only capable of pointing to FIFE.

In order to profile more specific chunks of Python code, the approach should be to profile not the entire main application, but only the suspected guilty function(s). For this the example code above can be reused, as in the sketch below.
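A minimal sketch of this approach (update_world is a placeholder for whatever function you suspect, not an actual PARPG function):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()        # start collecting profiling data
update_world()           # placeholder for the suspected guilty code
profiler.disable()       # stop collecting profiling data

stats = pstats.Stats(profiler)
stats.sort_stats('time')
stats.print_stats()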

Profiling mixed C++/Python universes: Callgrind

Callgrind is a tool which is part of the Valgrind tool suite; this is also the method mentioned in the FIFE wiki. Callgrind works by detecting function calls and returns. Unfortunately callgrind is a very heavy profiler, meaning it will have a heavy impact on performance. To use this tool, one would install the valgrind and kcachegrind packages (the latter is a visualization frontend).

The default way to use this tool is passing the application as an argument:

valgrind --tool=callgrind python run.py

You will immediately notice the impact of this tool on performance. After having enough patience and exiting the game properly, you will end up with a file called 'callgrind.out.' suffixed by the PID of the application. This file can easily be analyzed using kcachegrind. Run

kcachegrind callgrind.out.6957

to open kcachegrind and visualise the file.

This will resemble the following output (click for larger version):

Callgrind example.png

Do note that at this point we are also profiling the Python interpreter, which will prevent us from pinpointing problems within Python code. When you want to look deeper (for example, at which system calls are used), you will also need to pass special options to callgrind.
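For example, the stock callgrind switches below record extra detail at the instruction level, at an even bigger performance cost (check valgrind --tool=callgrind --help for the exact options your version supports):

valgrind --tool=callgrind --dump-instr=yes --collect-jumps=yes python run.py

With these, kcachegrind can annotate individual instructions and conditional jumps instead of whole functions.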

On the left you see the call graph, where you can sort on function name, number of times called, time spent in the function itself, and total time. Typically you will find the fat functions by sorting on the 'self' column. On the right you can see who calls this function and how time is distributed over the called functions.

Note that this is a profile of the full application run, thus also including the loading of resources etc. However, most of the time you are only interested in the 'why-o-why-is-my-framerate-that-low' question. This is why there is a possibility to start/stop callgrind instrumentation on request:

valgrind --tool=callgrind --help
valgrind --tool=callgrind --instr-atstart=no python run.py

This will start PARPG but will not start the instrumentation just yet. In order to trigger the instrumentation you would do the following:

e:~$ callgrind_control -s
PID 8361: python run.py [requesting 'Status'...]
  No information available as instrumentation is switched off.
 
e:~$ callgrind_control --instr=on
PID 8361: python run.py [requesting '+Instrumentation'...]
  OK.
e:~$ callgrind_control -s
PID 8361: python run.py [requesting 'Status'...]
  Number of running threads: 4, thread IDs: 1 2 3 4
  Events collected: Ir
  Functions: 12,536 (executed 27,586,101, contexts 12,536)
  Basic blocks: 23,645 (executed 137,441,003, call sites 7,339)
 
e:~$ callgrind_control --instr=off
PID 8361: python run.py [requesting '-Instrumentation'...]
  OK.

This will resemble the following output (click for larger version):

Callgrind example1.png

If you compare both outputs, you see that in the first output some initialization functions (like FIFE::Pool::sanityCheck()) consumed 2% of CPU but disappeared in the second run. Also, since the first run didn't render that many frames, FIFE::Camera::render() apparently seemed of lesser importance there.
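As an alternative to toggling the instrumentation by hand with callgrind_control, callgrind can also switch collection on and off automatically when entering and leaving a given function. Both --collect-atstart and --toggle-collect are stock callgrind options, but the function pattern below is only an illustration and has to match a real symbol in your build:

valgrind --tool=callgrind --collect-atstart=no --toggle-collect='FIFE::Camera::render*' python run.py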

Profiling a system: oprofile

Install oprofile by getting the packages called oprofile and oprofile-gui. oprofile consists of a kernel module which periodically samples where the instruction pointer of your CPU is pointing, so it has an extremely low impact on the system under test. But it also goes extremely wide: it profiles the ENTIRE system (it has no knowledge of program boundaries). This means that if you happen to be running a SETI client and a distributed.net client along with Eclipse while watching YouTube, you will find a lot of garbage before you find what you are actually looking for.

Launch the GUI by running

sudo oprof_start

Mind the sudo: since this is a very low level operation, you will require elevated privileges. In the configuration tab you might want to indicate that you have no kernel image available for profiling (we don't need it anyhow). Check the 'per-application' checkbox and click start to begin profiling. Then do whatever you want to profile, click stop, save the samples file, and terminate the application.
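If you prefer the command line over the GUI, a similar session can be driven with opcontrol, which ships with the same oprofile package (the flags below match the 0.9.x releases of that era; later oprofile versions replaced opcontrol with operf):

sudo opcontrol --no-vmlinux   # we have no kernel image to profile
sudo opcontrol --start        # start the sampling daemon
# ... run PARPG and exercise whatever you want to profile ...
sudo opcontrol --stop         # stop sampling
sudo opcontrol --dump         # flush the collected samples so opreport sees them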

Next you can run opreport to examine the samples files:

e:~$ opreport
CPU: Core 2, speed 2101 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
CPU_CLK_UNHALT...|
  samples|      %|
------------------
   595737 82.0421 python2.6
	CPU_CLK_UNHALT...|
	  samples|      %|
	------------------
	   402607 67.5813 _fife.so
	   143190 24.0358 libGLcore.so.185.18.36
	     8979  1.5072 libc-2.10.1.so
	     5920  0.9937 python2.6
	     4757  0.7985 zero
	     4575  0.7680 libSDL-1.2.so.0.11.2
	     4152  0.6970 libGL.so.185.18.36
	     3278  0.5502 libpulsecommon-0.9.19.so
	     2968  0.4982 libvorbis.so.0.4.0
	     2326  0.3904 libguichan-0.8.1.so.1.0.0
	     1898  0.3186 libm-2.10.1.so
	     1613  0.2708 libz.so.1.2.3.3
	     1502  0.2521 libopenal_soft.so
	     1405  0.2358 libstdc++.so.6.0.13
	     1154  0.1937 libpthread-2.10.1.so
	     1033  0.1734 libpulse.so.0.12.0
	      858  0.1440 libpng12.so.0.29.0
	      698  0.1172 [vdso] (tgid:9468 range:0xfa5000-0xfa6000)
	      498  0.0836 libguichan_opengl-0.8.1.so.1.0.0
	      488  0.0819 libogg.so.0.6.0
	      476  0.0799 librt-2.10.1.so
	      283  0.0475 libSDL_ttf-2.0.so.0.6.3
	      240  0.0403 libasound.so.2.0.0
	      213  0.0358 libasound_module_pcm_pulse.so
	      205  0.0344 libvorbisfile.so.3.2.0
	      203  0.0341 libX11.so.6.2.0
	      103  0.0173 libxcb.so.1.1.0
	       55  0.0092 libXext.so.6.4.0
	       48  0.0081 libguichan_sdl-0.8.1.so.1.0.0
	        8  0.0013 libfreetype.so.6.3.20
	        4 6.7e-04 libSDL_image-1.2.so.0.1.6

This basically shows that FIFE is running away with 67% of the CPU, rendering takes about 25%, and the entire application spends less than 1% of its time chewing Python code (aren't those numbers that make you modest, since PARPG is a Python project? :D). Fair enough, now let's try to look deeper:

e:~$ opreport -l `which python` | more 
warning: /dev/zero could not be found.
warning: [vdso] (tgid:9468 range:0xfa5000-0xfa6000) could not be found.
CPU: Core 2, speed 2101 MHz (estimated)
143190   26.3083  libGLcore.so.185.18.36   /usr/lib/libGLcore.so.185.18.36
27212     4.9997  _fife.so                 FIFE::Camera::render()
21668     3.9811  _fife.so                 __i686.get_pc_thunk.bx
20632     3.7907  _fife.so                 std::map<FIFE::Camera*, FIFE::InstanceVisualCacheItem, std::less<FIFE::Camera*>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::operator[](FIFE::Camera* const&)
15154     2.7842  _fife.so                 std::_Rb_tree<FIFE::Camera*, std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem>, std::_Select1st<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >, std::less<FIFE::Camera*
>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::_M_lower_bound(std::_Rb_tree_node<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >*, std::_Rb_tree_node<std::pair<FIFE::Camera* const, FIFE
::InstanceVisualCacheItem> >*, FIFE::Camera* const&)
11753     2.1594  _fife.so                 std::_Rb_tree<FIFE::Camera*, std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem>, std::_Select1st<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >, std::less<FIFE::Camera*
>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::lower_bound(FIFE::Camera* const&)
11601     2.1315  _fife.so                 FIFE::GLImage::render(FIFE::Rect const&, SDL_Surface*, unsigned char)
10997     2.0205  _fife.so                 std::less<FIFE::Camera*>::operator()(FIFE::Camera* const&, FIFE::Camera* const&) const
9556      1.7557  _fife.so                 std::map<FIFE::Camera*, FIFE::InstanceVisualCacheItem, std::less<FIFE::Camera*>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::lower_bound(FIFE::Camera* const&)
9227      1.6953  _fife.so                 std::_Rb_tree<FIFE::Camera*, std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem>, std::_Select1st<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >, std::less<FIFE::Camera*
>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::end()
8691      1.5968  _fife.so                 std::map<FIFE::Camera*, FIFE::InstanceVisualCacheItem, std::less<FIFE::Camera*>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::end()
8057      1.4803  _fife.so                 FIFE::InstanceVisual* FIFE::Instance::getVisual<FIFE::InstanceVisual>() const
8008      1.4713  _fife.so                 FIFE::Instance::update()
7758      1.4254  _fife.so                 FIFE::Pool::get(unsigned int, bool)
7517      1.3811  _fife.so                 std::_Rb_tree_iterator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >::operator==(std::_Rb_tree_iterator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > const&) co
nst
7351      1.3506  _fife.so                 std::map<FIFE::Camera*, FIFE::InstanceVisualCacheItem, std::less<FIFE::Camera*>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::key_comp() const
6944      1.2758  _fife.so                 FIFE::Visual2DGfx::isVisible()
6872      1.2626  _fife.so                 FIFE::InstanceDistanceSort::operator()(FIFE::Instance const*, FIFE::Instance const*)
6624      1.2170  _fife.so                 std::_Rb_tree<FIFE::Camera*, std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem>, std::_Select1st<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >, std::less<FIFE::Camera*
>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::_S_key(std::_Rb_tree_node<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > const*)
6577      1.2084  _fife.so                 std::_Rb_tree<FIFE::Camera*, std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem>, std::_Select1st<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >, std::less<FIFE::Camera*
>, std::allocator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> > >::_M_begin()
5920      1.0877  python2.6                /usr/bin/python2.6
5861      1.0768  _fife.so                 std::_Rb_tree_iterator<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >::_Rb_tree_iterator(std::_Rb_tree_node<std::pair<FIFE::Camera* const, FIFE::InstanceVisualCacheItem> >*)
5745      1.0555  _fife.so                 FIFE::Rect::Rect(int, int, unsigned int, unsigned int)
5397      0.9916  _fife.so                 FIFE::InstanceVisual::getCacheItem(FIFE::Camera*)
5327      0.9787  _fife.so                 FIFE::LogManager::instance()
5196      0.9547  _fife.so                 FIFE::LogManager::isVisible(logmodule_t)
4827      0.8869  _fife.so                 FIFE::InstanceRenderer::render(FIFE::Camera*, FIFE::Layer*, std::vector<FIFE::Instance*, std::allocator<FIFE::Instance*> >&)
4757      0.8740  zero                     /dev/zero
4613      0.8475  _fife.so                 FIFE::Rect::intersects(FIFE::Rect const&) const

The verbosity of a module (i.e. why we have details for _fife.so but not for libGLcore.so) depends on whether debugging symbols are available. Reading the C++ STL output is also an acquired taste: keep in mind that the template system produces very verbose names (as can be seen above), and the untrained eye will probably not know that std::_Rb_tree<> is nothing more than a lookup happening in a map (which is backed by a red-black tree).
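When the function-level listing is not enough and a binary was built with debugging symbols, oprofile can also annotate the source code itself via opannotate (part of the oprofile package; the path to _fife.so below is just an example and depends on where your installation puts it):

opannotate --source image:/usr/lib/python2.6/dist-packages/fife/_fife.so | more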
