Profiling

This page outlines the profiling of the code and documents the attempts to improve both CPU and memory usage. Much of this has become outdated by the increased integration with C.

I am really not sure what tools to use for the best here. One thing is certain, this needs to be done at the command line to avoid bloating the code: in particular jupyter notebook is horrendously memory guzzling. The inbuilt profiling tools are controlled by runtime boolean parameters b_profile_cpu and b_profile_mem. However, with v0.2 of the code moving to C these have become much less useful so I will probably get rid of them.

I have yet to investigate profiling of the C code as the python bookkeeping still dominates.

All profiling mentioned below was undertaken on the first 1000 halos of Mill/pygal_063_5.hdf5 with verbosity set to 0 (ie minimal output) and with standard optimisation (-O) enabled.

CPU time

See https://realpython.com/python-timer/#the-python-timer-code for hints.

Of the three methods described below, I have found cProfile to be the most useful: it does not require editing of the code and it produces a modest-sized profile file.

cProfile

This is a command line tool that can tell you in which functions the code spends most of its time. It is easy to use as it requires no modification to the code itself: python -O -m cProfile -o output_dir/L-Galaxies.prof L-Galaxies.py.

Then analyse using python -m pstats output_dir/L-Galaxies.prof which pops you into an interactive interface. The recommended basic commands are:

  • strip – cleans the output;

  • sort (tot|cum)time – sorts by time spent in functions excluding (tot) or including (cum) subfunctions

  • stats # – shows the top # entries.

Once you know which function/method is taking most of the time, then you can use a line profiler to determine which individual lines of code are at fault. This requires an @profile directive to be placed in front of the function to be profiled.

  • pip install line_profiler

  • kernprof -lv -o output_dir/L-Galaxies.py.lprof L-Galaxies.py

  • python -m line_profiler output_dir/L-Galaxies.py.lprof

Because the internal profiling takes up significant resources, the following timings were run with both b_profile_cpu=False and b_profile_mem=False. They also had star formation histories enabled, b_SFH=True.

v0.1

Prior to any attempt to move any code to C, these were the timings on my MacBook Pro, 2.3 GHz 8-Core Intel Core i9, 16 GB

505608955 function calls (471733059 primitive calls) in 518 seconds

ncalls

tottime

cumtime

filename:lineno(function)

6040454

112

128

star_formation_and_feedback.py:15(F_gal_form_stars)

46895

58

70

gals.py:237(append)

163604

50

357

driver.py:55(F_process_halos)

6047421

46

69

star_formation_and_feedback.py:155(F_gal_SNR_feedback)

1536069

33

63

cooling.py:130(F_sub)

46895

26

45

driver.py:170(F_update_halos)

546477

22

26

sfh.py:203(F_sfh_update_bins)

2403

15

15

dataset.py:858(__setitem__)

1536069

15

20

cooling.py:270(F_get_metaldependent_cooling_rate)

6047421

13

13

star_formation_and_feedback.py:112(F_star_formation_unresolved)

As can be seen:

  • Most of the work is undertaken within F_process_halos, i.e. within the astrophysics routines.

  • The bookkeeping within F_update_halos takes much less CPU.

During optimisation of v0.1, the line profiler (not shown here) had also revealed the following:

  • Accidental leaving of units within the main body of the code is very slow.

  • Use of column headings in numpy structured arrays takes up most of the time – could be effectively eliminated by interfacing with C.

  • Manipulation of halo and subhalo class instances is slow but does not seem to dominate.

That suggests that significant speed-up could be achieved by:

  • Defining a (static?) C struct to hold the internal parameters that are currently attributes of parameters.py.

  • Defining a (static?) C struct to hold the internal variabless that are currently stored in commons.py.

  • Associating the gals numpy structured array with an equivalent array of C structures (possibly only one at a time rather than an array).

  • Changing all the routines below F_update_halos (ie most of them!) to C equivalents.

v0.2

After converting most of the astrophysics routines to C, and with exactly the same flags and identical output on a test run, these are the timings:

234767941 function calls (234706396 primitive calls) in 445 seconds

ncalls

tottime

cumtime

filename:lineno(function)

163604

96

258

driver.py:17(F_process_halos)

33293342

56

80

ctypeslib.py:184(from_param)

46895

48

64

gals.py:243(append)

33293342

39

39

__init__.py:511(cast)

46895

35

58

driver.py:130(F_update_halos)

33293342

21

21

_internal.py:249(__init__)

33293342

20

59

_internal.py:266(data_as)

2403

19

20

dataset.py:858(__setitem__)

33293342

12

71

_internal.py:344(as_parameter_)

46895

10

12

subs.py:306(append)

Here we see that:

  • The astrophysics routines have effectively disappeared from the output.

  • However, this has not resulted in a huge saving in time overall. The time taken within the astrophysics functions has been largely replaced by overhead that seems most likely to be associated with using ctypes in the function calls. This then necessitates one of the following:

    • Change the way that ctypes is used to make the overhead less demanding, or

    • Progress the C implementation by converting the whole of driver.py into C. This will also require replacing halo and subhalo classes with numpy arrays and functions.

    I suspect that the latter approach will be needed anyway to make the code run super-efficiently, as most halos adn subhalos require very little astrophysics work and the overhead associated with a class instance is too great to be justifiable.

v0.3

The following main changes were made:

  • Eliminate classes for halos and subhalos and instead use numpy arrays (keep classes for I/O)

  • Split driver.py into two pieces:

    • push_snap.py to do the logistics;

    • process_snap.c to perform the astrophysics.

with these results:

27393842 function calls (27332266 primitive calls) in 197.873 seconds

ncalls

tottime

cumtime

filename:lineno(function)

46895

52

73

push_snap.py:11(F_push_snap)

46895

49

65

gals.py:277(append)

2403

19

19

dataset.py:858(append)

46895

10

12

subs.py:346(append)

46895

10

12

halos.py:381(append)

46895

10

10

subs.py:146(F_subs_initialise)

46895

10

10

halos.py:146(F_halos_initialise)

1000

4

15

graphs.py:50(__init__)

27003

4

6

group.py:348(__getitem__)

1455269

4

4

{method ‘reduce’ of numpy.ufunc’ objects}

Here we see that:

  • The ctypes calls have effectively disappeared from the output.

  • The run time has more than halved.

  • The main time taken is now in push_snap: that does a lot of work. To convert this to C one would first have to extract all the required graph properties into numpy arrays, then pass those into this routine as arguments. This would not be too hard to do.

  • A lot of the remaining CPU time seems to be taken up with the append call in the instance of the galaxy output class. However, I realise now that I am doing a loop over galaxies in python rather than using block assignment using numpy ranges, so I’ll try that and see what difference it makes (see v0.4, below)

  • The dataset.py:858(append) call is probably related to h5py and may not be reducible, except that I have not experimented with the size of the galaxy I/O buffer (parameter numerics:n_HDF5_io_rec).

v0.4

The following change was made:

  • When appending to output buffers (halos, subhalos, galaxies) use ranges [:] rather than looping over objects one at a time.

with these results:

27543316 function calls (27481677 primitive calls) in 128.781 seconds

ncalls

tottime

cumtime

filename:lineno(function)

46895

49

69

push_snap.py:11(F_push_snap)

2403

18

18

dataset.py:858(__setitem__)

46895

7

7

subs.py:146(F_subs_initialise)

46895

6

6

halos.py:178(F_halos_initialise)

1000

4

14

graphs.py:50(__init__)

27003

4

6

group.py:348(__getitem__)

1455269

3

3

{method ‘reduce’ of ‘numpy.ufunc’ objects}

25000

3

3

{method ‘read’ of ‘h5py._selector.Reader’ objects}

631268

3

3

{built-in method numpy.arange}

46895

3

18

gals.py:275(append)

Here we see that:

  • The append methods of the output classes do indeed now take a lot less time: almost certainly not worth making the code uglier by converting them into C.

  • The dataset.py method now claims that __setitem__ is the offending routine rather that append. My guess is that these are essentially the same and experimenting with the size of the output buffer (n_HDF5_io_rec) might help.

  • The time taken in push_snap now dominates the python work. To convert this to C one would first have to extract all the required graph properties into numpy arrays, then pass those into this routine as arguments. This would not be too hard to do.

C_time class

A simple class that holds a dictionary of numpy records:

  • key – name of the record

  • value – n_start, n_stop, cpu_time_start, cpu_time_total

with methods:

  • __repr__ – prints out the dictionary.

  • dump(filename) – saves the dictionary as a pickle file filename.

  • start(name) – adds an entry to the dictionary with key name, or reopens an existing one.

  • stop(name) – accumulates the time spent in cpu_time_total.

This routine should be relatively lightweight. It is used to track the time taken to process each graph. The following plots show the distribution of process times on the first 1000 graphs in Millennium File 5. This is for the pure python version of the code (ie v0.1).

CPU time taken to process the first 1000 halos in Millennium File 5 Cumulative CPU time taken to process the first 1000 halos in Millennium File 5

codetiming.Timer

https://pypi.org/project/codetiming/

This can be used as a decorator to profile individual python functions. It is useful but the output seems incredibly bloated. For example, on processing just 1000 halos it produces an output file that is 300MB in size.

The following plot shows the function process times on the first 1000 graphs in Millennium File 5. Note that these times are inclusive of subfunctions.

Function time taken to process the first 1000 halos in Millennium File 5, inclusive of subfunctions.

Somewhat surprisingly, F_sfh_update_bins does not appear in this listing, even though cProfile has it as the second most CPU-hungry routine.

Overall, cProfile seems to be much less resource-hungry and more useful.

Memory usage

To follow once CPU optimisation is complete. The expectation, however, is that py-galaxies should be memory efficient as it only loads in halos, subhalos and galaxies for two snapshots of a single graph at a time.