Python is a very popular programming language and has successfully found its way into many application areas including data science, machine learning, and scientific computing.

On this page

Prerequisites

This page refers to concepts that are presented in Message Passing Interface, OpenMP and High Performance Libraries.

Context

Python's popularity in data science and scientific computing is mainly related to its relatively smooth learning curve, as well as to the support provided by a very active user community.

Python is available on all Pawsey systems. Developers of Python codes should pay special attention to computational performance and always consider using available high-performance Python tools. This section of the documentation includes the description of selected Python libraries and extensions which are particularly useful for increasing performance of a Python code.

Parallel programming

Code developers and users using Python are not limited to running serial applications. Modern Python libraries exist that enable the use of distributed-memory and shared-memory parallelisation. The optimal parallelisation approach is highly application- and architecture-dependent. Two well-known Python parallelisation packages supported at Pawsey are mpi4py and Cython/OpenMP.

MPI with mpi4py

As its name suggests, mpi4py is a package that implements the Message Passing Interface (or MPI) library, enabling distributed memory parallelisation in an application. MPI allows work to be performed by tasks residing on different CPU cores, that may or may not be on the same compute node. Information can be passed between tasks in the form of messages.

mpi4py is available on all Pawsey systems as a module.

Let's have a look at a sample Python script, called mpi.py, that uses mpi4py. Note how one of the first instructions is to import the mpi4py library.

Listing 1. Mpi4py example

####################################################################################
## Simple script that determines the sum of the IDs of all active MPI tasks using a 
## single global collective call
####################################################################################

from mpi4py import MPI
import numpy as np

# Query the MPI communicator for the number of active MPI tasks and the ID of each task

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Call the Allreduce global collective to produce the sum of all MPI task ids

data = np.array( int(rank) )
sum = np.zeros( (1,), dtype=np.int )
comm.Allreduce( data, sum, op=MPI.SUM )

# Display the result on all ranks, along with the correct answer

print( "Rank " + str(rank) + " has sum of ranks " + str(sum) + "; Answer = " + str( int((size-1)*size/2) ) )

Now let's execute the code above in an interactive Slurm session on Setonix using 4 processes:

Terminal 1. Run mpi4py code

$ salloc -N 1 -n 4 -p debug -t 0:01:00

$ module load python/3.9.7 py-numpy/1.20.3 py-mpi4py/3.1.2

$ srun -n 4 python mpi.py
Rank 0 has sum of ranks [6]; Answer = 6
Rank 2 has sum of ranks [6]; Answer = 6
Rank 1 has sum of ranks [6]; Answer = 6
Rank 3 has sum of ranks [6]; Answer = 6

OpenMP with Cython

Beside allowing the running of C (or C-like) code in a Python script (see Mixing Python and C via Cython), Cython also allows the use of OpenMP threads to achieve shared-memory parallelisation. In this approach, a program can spawn multiple threads to perform work. All of these threads must reside on the same compute node, as they use the shared global physical memory to pass information. OpenMP is a quick and efficient way to add parallelisation to an existing serial application, and Cython enables this possibility for Python code.

The following example demonstrates OpenMP parallelisation in a simple Python script that computes an approximation of the mathematical constant pi. We're going to comment on the various steps required to build and compile Cython code, partially reiterating the concepts presented in the pure Cython example further down in this page. The starting point is a file that contains the OpenMP-enabled code for the pi approximation:

Listing 2. Cython/OpenMP example

from cython.parallel cimport parallel,prange

def pi_omp( num_rects ):
    cdef double step, x, pi, sum = 0.0
    cdef int i, n

    n = num_rects
    step = 1.0 / n
    with nogil, parallel():
         for i in prange(n):
             x = (i+0.5)*step
             sum += 4.0/(1.0+x*x)

    pi = step * sum
    return pi

Let's call this file OpenmpPI.pyx (the .pyx extension is important here).

Now, the C-enabled Python code must be compiled into a shared-object library that will be called later by the main code. This compilation requires an auxiliary Python script, which we are calling setup.py:

Listing 3. setup.py for Cython/OpenMP

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "OpenmpPI",
        ["OpenmpPI.pyx"],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
]

setup(
    name='parallel-pi',
    ext_modules = cythonize(ext_modules)
)

Most of this is standard Cython compilation rules, but we do have to add the -fopenmp flag to enable OpenMP support in our Cython file. Be sure to name the correct pyx file in setup.py so that the required shared object library is built. Let's build the library:

Terminal 2. Build a Cython/OpenMP object

$ salloc -N 1 -n 1 -p debug -t 0:05:00

$ module load python/3.9.7 py-cython

$ srun python setup.py build_ext --inplace

As a result, a shared-object library file is built, named OpenmpPI.cpython-36m-x86_64-linux-gnu.so. We can now call the pi_omp function, defined herein, from a standard Python script, script.py:

Listing 4. Use a Cython/OpenMP object

##
##  Calculating Pi with using midpoint rectangle method integral 4.0/(1+x^2) dx = pi
##   or
##  summation (for i=0 to n) 4.0/(1+x_i^2) where x_i = (i + 1/2) * delta x
##  OpenMP Version
##

import timeit
import OpenmpPI
import math

num_rects=100000000

start = timeit.default_timer()
pi = OpenmpPI.pi_omp( num_rects )
diff_time = timeit.default_timer() - start
print( "Estimated value of pi = {!r}".format(pi) )
print( "Error = {!r}".format(abs(math.pi-pi)) )
print( "Compute time = {0:.3f} seconds".format(diff_time) )

Let's execute this code like any other OpenMP or multithreaded code, by means of an interactive Slurm session with 4 cores per process:

Terminal 3. Run Cython/OpenMP code

$ salloc -N 1 -n 1 -c 4 -p debug -t 0:01:00

$ module load python/3.9.7 py-cython/0.29.24
$ export OMP_NUM_THREADS=4

$ srun -c 4 python script.py
Estimated value of pi = 3.1415926535896825
Error = 1.1057821325266559e-13
Compute time = 0.724 seconds

High-performance numerical libraries

Numerical libraries like NumPy and SciPy are add-on modules for Python that provide common mathematical and numerical routines in precompiled, performant functions. The NumPy (Numeric Python) package provides basic routines for manipulating large arrays and matrices of numeric data. The SciPy (Scientific Python) package extends the functionality of NumPy with a substantial collection of useful algorithms, like minimisation, Fourier transformation, regression, and other applied mathematical techniques. Developers of Python codes should look into possible replacements of their numerical computations and use numerical libraries like NumPy or SciPy wherever possible.

NumPy

The NumPy module is an extremely powerful package for dealing with numerical calculations. At its core, it provides a way for users to create and manipulate N-dimensional arrays. It also provides a number of mathematical functions (such as exponentials and trigonometric functions) and solvers (such as linear algebra, FFTW). More importantly, NumPy arrays behave much like arrays in C, C++ and Fortran:

Arrays have a fixed size at creation (no dynamic growth)
They have a homogenous datatype (for example, all floats)
They are contiguous in memory (improved performance)

These features allow for better performance, while still maintaining the flexibility and ease of use of Python. Ultimately, NumPy's power comes from being able to couple Python's convenience with the speed of compiled libraries.

NumPy is available on all Pawsey systems as a module; it is required to load the python module first, and then the numpy module.

Bad practices

Here is something that you should definitely avoid doing. While NumPy offers great performance improvements in many use cases, the idiomatic way to iterate over elements in a Python built-in sequence (such as list or tuple) actually performs very poorly on NumPy lists. Consider the following Python console activity.

Terminal 4. Bad practice with NumPy arrays

>>> import numpy
>>> import time
>>> 
>>> def time_iteration(it):
...     start = time.perf_counter()
...     for x in it: pass
...     return time.perf_counter() - start
... 
>>> 
>>> my_list = list(range(10000000))
>>> time_iteration(my_list)
0.0792810750008357
>>> 
>>> np_list = numpy.array(my_list)
>>> time_iteration(np_list)
0.5176245779985038

See how iterating through a standard list versus a NumPy array takes 0.079 versus 0.52 seconds. What is happening here?

The reason behind this behaviour is that NumPy packs all integers in a C array. When iterating the NumPy array, the Python interpreter has to create a Python object for every integer extracted from the C array so that it can be used within the Python program. On the other hand, each element in the regular list my_list is a reference to an already existing Python object, whose creation happened when the list was first populated.

NumPy example

In the example file below, named matrix.py, we show how to use some basic matrix and matrix-vector operations in NumPy.

Listing 5. NumPy example: basic matrix operations

import numpy as np

# Generate two random matrices A and B
A = np.random.randint(1,10,size=(4,4))
B = np.random.randint(1,10,size=(4,4))

# Generate random vector x
x = np.random.rand(4)

# Compute matrix vector product
y = np.dot(A,x)
print(y)

# Compute the inverse of the matrix B
C = np.linalg.inv(B)
print(C)

# Compute the determinant of the matrix C
d = np.linalg.det(C)
print(d)

# Compute eigenvalues of the matrix C
w,v = np.linalg.eig(C)
print(w)
print(v)

Let's execute this code by means of an interactive Slurm session with one core:

Terminal 5. Run NumPy code

$ salloc -N 1 -n 1 -p debug -t 0:05:00

$ module load python/3.9.7 py-numpy/1.20.3

$ srun python matrix.py
[ 11.32995843  15.18201852  13.1489202   12.46920531]
[[ -5.52631579e-01   1.13157895e+00   1.57894737e-01  -1.84210526e-01]
 [ -1.11111111e-01   5.55555556e-01  -6.76965259e-18  -2.22222222e-01]
 [  7.30994152e-02  -1.54970760e-01  -5.26315789e-02   1.72514620e-01]
 [  3.15789474e-01  -7.89473684e-01   5.26315789e-02   1.05263158e-01]]
0.00292397660819
[ 0.63860313 -0.45638416  0.05517347 -0.18183689]
[[ 0.60206445  0.85629788 -0.619295   -0.00817806]
 [ 0.57685743 -0.0186308  -0.31291905 -0.21176471]
 [-0.19459178  0.05700502 -0.60213224  0.68340563]
 [-0.51661198 -0.51298856 -0.39495839 -0.69860259]]

SciPy

SciPy is a free and open-source Python library used for scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks that are common in science and engineering.

SciPy builds on top of the NumPy array object and is part of the NumPy stack, which also includes tools packages like Matplotlib, Pandas and SymPy, and an ever growing set of scientific computing libraries. This NumPy stack has similar uses to other applications such as MATLAB, GNU Octave and Scilab.

SciPy is available on all Pawsey systems as a module. To use it you must load the python module first and then the scipy module.

SciPy example

In the file below, named fft.py, we show how to use the FFT implementation available in SciPy.

Listing 6. SciPy example

import numpy as np
from scipy.fftpack import fft, ifft

x = np.array([1.0, 2.0, 1.0, -1.0, 1.5])

y = fft(x)
print(y)

Let's execute this code by means of an interactive Slurm session with one core:

Terminal 6. Run SciPy code

$ salloc -N 1 -n 1 -p debug -t 0:05:00

$ module load python/3.9.7 py-numpy/1.20.3 py-scipy/1.7.1

$ srun python fft.py
[ 4.50000000+0.j          2.08155948-1.65109876j -1.83155948+1.60822041j
 -1.83155948-1.60822041j  2.08155948+1.65109876j]

Mixing Python and C via Cython

Cython is a package that allows running C (or C-like) code in a Python script. The code is compiled prior to runtime, providing significantly faster execution compared to standard Python interpreted execution.

The building and compilation of Cython code must be done in a couple of steps. In the following example we employ a simple Python script that computes an approximation of the mathematical constant pi. We start with a file that contains the code for such computation:

Listing 7. Cython example

def pi( num_rects ):
    cdef double step, x, pi, sum = 0.0
    cdef int i, n
 
    n = num_rects
    step = 1.0 / n
    for i in range(0,n):
      x = (i+0.5)*step
      sum += 4.0/(1.0+x*x)
 
    pi = step * sum
    return pi

Let's call this file PI.pyx (the .pyx extension is important here).

Now, the C-enabled Python code must be compiled into a shared-object library that will be called later by the main code. This compilation requires an auxiliary Python script, which we are calling setup.py:

Listing 8. setup.py for Cython

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
 
ext_modules = [
    Extension(
        "PI",
        ["PI.pyx"],
        extra_compile_args=['-O3'],
        extra_link_args=['-O3'],
    )
]
 
setup(
    name='pi',
    ext_modules = cythonize(ext_modules)
)

Most of this is standard Cython compilation rules, with the addition of the -O3 flag to set a higher C compiler optimisation level. Be sure to name the correct pyx file in setup.py so that the required shared object library is built. Let's build the library:

Terminal 7. Build a Cython object

$ salloc -N 1 -n 1 -p debug -t 0:05:00

$ module load python/3.9.7 py-cython/0.29.24

$ srun python setup.py build_ext --inplace

As a result, a shared-object library file is built, named PI.cpython-36m-x86_64-linux-gnu.so. We can now call the pi function, defined herein, from a standard Python script, script.py:

Listing 9. Use a Cython object

##
##  Calculating Pi with using midpoint rectangle method integral 4.0/(1+x^2) dx = pi
##   or
##  summation (for i=0 to n) 4.0/(1+x_i^2) where x_i = (i + 1/2) * delta x
##  Serial Version
##
 
import timeit
import PI
import math
 
num_rects=100000000
 
start = timeit.default_timer()
pi = PI.pi( num_rects )
diff_time = timeit.default_timer() - start
print( "Estimated value of pi = {!r}".format(pi) )
print( "Error = {!r}".format(abs(math.pi-pi)) )
print( "Compute time = {0:.3f} seconds".format(diff_time) )

Let's execute this code by means of an interactive Slurm session with one core:

Terminal 8. Run Cython code

$ salloc -N 1 -n 1 -p debug -t 0:01:00

$ module load python/3.9.7 py-cython/0.29.24

$ srun python script.py
Estimated value of pi = 3.1415926535904264
Error = 6.332712132461893e-13
Compute time = 0.547 seconds

User Support Documentation

Developing with Python

Prerequisites

Context

Parallel programming

MPI with mpi4py

OpenMP with Cython

High-performance numerical libraries

NumPy

Bad practices

NumPy example

SciPy

SciPy example

Mixing Python and C via Cython