Python is a very popular programming language and has successfully found its way into many application areas including data science, machine learning, and scientific computing.
Prerequisites
This page refers to concepts that are presented in Message Passing Interface, OpenMP and High Performance Libraries.
Context
Python's popularity in data science and scientific computing is mainly related to its relatively smooth learning curve, as well as to the support provided by a very active user community.
Python is available on all Pawsey systems. Developers of Python codes should pay special attention to computational performance and always consider using available high-performance Python tools. This section of the documentation includes the description of selected Python libraries and extensions which are particularly useful for increasing performance of a Python code.
Parallel programming
Code developers and users using Python are not limited to running serial applications. Modern Python libraries exist that enable the use of distributed-memory and shared-memory parallelisation. The optimal parallelisation approach is highly application- and architecture-dependent. Two well-known Python parallelisation packages supported at Pawsey are mpi4py
and Cython/OpenMP
.
MPI with mpi4py
As its name suggests, mpi4py
is a package that implements the Message Passing Interface (or MPI) library, enabling distributed memory parallelisation in an application. MPI allows work to be performed by tasks residing on different CPU cores, that may or may not be on the same compute node. Information can be passed between tasks in the form of messages.
mpi4py
is available on all Pawsey systems as a module.
Let's have a look at a sample Python script, called mpi.py
, that uses mpi4py
. Note how one of the first instructions is to import the mpi4py
library.
#################################################################################### ## Simple script that determines the sum of the IDs of all active MPI tasks using a ## single global collective call #################################################################################### from mpi4py import MPI import numpy as np # Query the MPI communicator for the number of active MPI tasks and the ID of each task comm = MPI.COMM_WORLD rank = comm.Get_rank() size = comm.Get_size() # Call the Allreduce global collective to produce the sum of all MPI task ids data = np.array( int(rank) ) sum = np.zeros( (1,), dtype=np.int ) comm.Allreduce( data, sum, op=MPI.SUM ) # Display the result on all ranks, along with the correct answer print( "Rank " + str(rank) + " has sum of ranks " + str(sum) + "; Answer = " + str( int((size-1)*size/2) ) )
Now let's execute the code above in an interactive Slurm session on Setonix using 4 processes:
$ salloc -N 1 -n 4 -p debug -t 0:01:00 $ module load python/3.9.7 py-numpy/1.20.3 py-mpi4py/3.1.2 $ srun -n 4 python mpi.py Rank 0 has sum of ranks [6]; Answer = 6 Rank 2 has sum of ranks [6]; Answer = 6 Rank 1 has sum of ranks [6]; Answer = 6 Rank 3 has sum of ranks [6]; Answer = 6
OpenMP with Cython
Beside allowing the running of C (or C-like) code in a Python script (see Mixing Python and C via Cython), Cython also allows the use of OpenMP threads to achieve shared-memory parallelisation. In this approach, a program can spawn multiple threads to perform work. All of these threads must reside on the same compute node, as they use the shared global physical memory to pass information. OpenMP is a quick and efficient way to add parallelisation to an existing serial application, and Cython enables this possibility for Python code.
The following example demonstrates OpenMP parallelisation in a simple Python script that computes an approximation of the mathematical constant pi. We're going to comment on the various steps required to build and compile Cython code, partially reiterating the concepts presented in the pure Cython example further down in this page. The starting point is a file that contains the OpenMP-enabled code for the pi approximation:
from cython.parallel cimport parallel,prange def pi_omp( num_rects ): cdef double step, x, pi, sum = 0.0 cdef int i, n n = num_rects step = 1.0 / n with nogil, parallel(): for i in prange(n): x = (i+0.5)*step sum += 4.0/(1.0+x*x) pi = step * sum return pi
Let's call this file OpenmpPI.pyx
(the .pyx
extension is important here).
Now, the C-enabled Python code must be compiled into a shared-object library that will be called later by the main code. This compilation requires an auxiliary Python script, which we are calling setup.py
:
from distutils.core import setup from distutils.extension import Extension from Cython.Build import cythonize ext_modules = [ Extension( "OpenmpPI", ["OpenmpPI.pyx"], extra_compile_args=['-fopenmp'], extra_link_args=['-fopenmp'], ) ] setup( name='parallel-pi', ext_modules = cythonize(ext_modules) )
Most of this is standard Cython compilation rules, but we do have to add the -fopenmp
flag to enable OpenMP support in our Cython file. Be sure to name the correct pyx file in setup.py
so that the required shared object library is built. Let's build the library:
$ salloc -N 1 -n 1 -p debug -t 0:05:00 $ module load python/3.9.7 py-cython $ srun python setup.py build_ext --inplace
As a result, a shared-object library file is built, named OpenmpPI.cpython-36m-x86_64-linux-gnu.so
. We can now call the pi_omp
function, defined herein, from a standard Python script, script.py
:
## ## Calculating Pi with using midpoint rectangle method integral 4.0/(1+x^2) dx = pi ## or ## summation (for i=0 to n) 4.0/(1+x_i^2) where x_i = (i + 1/2) * delta x ## OpenMP Version ## import timeit import OpenmpPI import math num_rects=100000000 start = timeit.default_timer() pi = OpenmpPI.pi_omp( num_rects ) diff_time = timeit.default_timer() - start print( "Estimated value of pi = {!r}".format(pi) ) print( "Error = {!r}".format(abs(math.pi-pi)) ) print( "Compute time = {0:.3f} seconds".format(diff_time) )
Let's execute this code like any other OpenMP or multithreaded code, by means of an interactive Slurm session with 4 cores per process:
$ salloc -N 1 -n 1 -c 4 -p debug -t 0:01:00 $ module load python/3.9.7 py-cython/0.29.24 $ export OMP_NUM_THREADS=4 $ srun -c 4 python script.py Estimated value of pi = 3.1415926535896825 Error = 1.1057821325266559e-13 Compute time = 0.724 seconds
High-performance numerical libraries
Numerical libraries like NumPy and SciPy are add-on modules for Python that provide common mathematical and numerical routines in precompiled, performant functions. The NumPy (Numeric Python) package provides basic routines for manipulating large arrays and matrices of numeric data. The SciPy (Scientific Python) package extends the functionality of NumPy with a substantial collection of useful algorithms, like minimisation, Fourier transformation, regression, and other applied mathematical techniques. Developers of Python codes should look into possible replacements of their numerical computations and use numerical libraries like NumPy or SciPy wherever possible.
NumPy
The NumPy module is an extremely powerful package for dealing with numerical calculations. At its core, it provides a way for users to create and manipulate N-dimensional arrays. It also provides a number of mathematical functions (such as exponentials and trigonometric functions) and solvers (such as linear algebra, FFTW). More importantly, NumPy arrays behave much like arrays in C, C++ and Fortran:
- Arrays have a fixed size at creation (no dynamic growth)
- They have a homogenous datatype (for example, all floats)
- They are contiguous in memory (improved performance)
These features allow for better performance, while still maintaining the flexibility and ease of use of Python. Ultimately, NumPy's power comes from being able to couple Python's convenience with the speed of compiled libraries.
NumPy is available on all Pawsey systems as a module; it is required to load the python
module first, and then the numpy
module.
Bad practices
Here is something that you should definitely avoid doing. While NumPy offers great performance improvements in many use cases, the idiomatic way to iterate over elements in a Python built-in sequence (such as list
or tuple
) actually performs very poorly on NumPy lists. Consider the following Python console activity.
>>> import numpy >>> import time >>> >>> def time_iteration(it): ... start = time.perf_counter() ... for x in it: pass ... return time.perf_counter() - start ... >>> >>> my_list = list(range(10000000)) >>> time_iteration(my_list) 0.0792810750008357 >>> >>> np_list = numpy.array(my_list) >>> time_iteration(np_list) 0.5176245779985038
See how iterating through a standard list versus a NumPy array takes 0.079 versus 0.52 seconds. What is happening here?
The reason behind this behaviour is that NumPy packs all integers in a C array. When iterating the NumPy array, the Python interpreter has to create a Python object for every integer extracted from the C array so that it can be used within the Python program. On the other hand, each element in the regular list my_list
is a reference to an already existing Python object, whose creation happened when the list was first populated.
NumPy example
In the example file below, named matrix.py
, we show how to use some basic matrix and matrix-vector operations in NumPy.
import numpy as np # Generate two random matrices A and B A = np.random.randint(1,10,size=(4,4)) B = np.random.randint(1,10,size=(4,4)) # Generate random vector x x = np.random.rand(4) # Compute matrix vector product y = np.dot(A,x) print(y) # Compute the inverse of the matrix B C = np.linalg.inv(B) print(C) # Compute the determinant of the matrix C d = np.linalg.det(C) print(d) # Compute eigenvalues of the matrix C w,v = np.linalg.eig(C) print(w) print(v)
Let's execute this code by means of an interactive Slurm session with one core:
$ salloc -N 1 -n 1 -p debug -t 0:05:00 $ module load python/3.9.7 py-numpy/1.20.3 $ srun python matrix.py [ 11.32995843 15.18201852 13.1489202 12.46920531] [[ -5.52631579e-01 1.13157895e+00 1.57894737e-01 -1.84210526e-01] [ -1.11111111e-01 5.55555556e-01 -6.76965259e-18 -2.22222222e-01] [ 7.30994152e-02 -1.54970760e-01 -5.26315789e-02 1.72514620e-01] [ 3.15789474e-01 -7.89473684e-01 5.26315789e-02 1.05263158e-01]] 0.00292397660819 [ 0.63860313 -0.45638416 0.05517347 -0.18183689] [[ 0.60206445 0.85629788 -0.619295 -0.00817806] [ 0.57685743 -0.0186308 -0.31291905 -0.21176471] [-0.19459178 0.05700502 -0.60213224 0.68340563] [-0.51661198 -0.51298856 -0.39495839 -0.69860259]]
SciPy
SciPy is a free and open-source Python library used for scientific and technical computing. SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers and other tasks that are common in science and engineering.
SciPy builds on top of the NumPy array object and is part of the NumPy stack, which also includes tools packages like Matplotlib, Pandas and SymPy, and an ever growing set of scientific computing libraries. This NumPy stack has similar uses to other applications such as MATLAB, GNU Octave and Scilab.
SciPy is available on all Pawsey systems as a module. To use it you must load the python
module first and then the scipy
module.
SciPy example
In the file below, named fft.py
, we show how to use the FFT implementation available in SciPy.
import numpy as np from scipy.fftpack import fft, ifft x = np.array([1.0, 2.0, 1.0, -1.0, 1.5]) y = fft(x) print(y)
Let's execute this code by means of an interactive Slurm session with one core:
$ salloc -N 1 -n 1 -p debug -t 0:05:00 $ module load python/3.9.7 py-numpy/1.20.3 py-scipy/1.7.1 $ srun python fft.py [ 4.50000000+0.j 2.08155948-1.65109876j -1.83155948+1.60822041j -1.83155948-1.60822041j 2.08155948+1.65109876j]
Mixing Python and C via Cython
Cython is a package that allows running C (or C-like) code in a Python script. The code is compiled prior to runtime, providing significantly faster execution compared to standard Python interpreted execution.
The building and compilation of Cython code must be done in a couple of steps. In the following example we employ a simple Python script that computes an approximation of the mathematical constant pi. We start with a file that contains the code for such computation:
def pi( num_rects ): cdef double step, x, pi, sum = 0.0 cdef int i, n n = num_rects step = 1.0 / n for i in range(0,n): x = (i+0.5)*step sum += 4.0/(1.0+x*x) pi = step * sum return pi
Let's call this file PI.pyx
(the .pyx
extension is important here).
Now, the C-enabled Python code must be compiled into a shared-object library that will be called later by the main code. This compilation requires an auxiliary Python script, which we are calling setup.py
:
from distutils.core import setup from distutils.extension import Extension from Cython.Build import cythonize ext_modules = [ Extension( "PI", ["PI.pyx"], extra_compile_args=['-O3'], extra_link_args=['-O3'], ) ] setup( name='pi', ext_modules = cythonize(ext_modules) )
Most of this is standard Cython compilation rules, with the addition of the -O3
flag to set a higher C compiler optimisation level. Be sure to name the correct pyx file in setup.py
so that the required shared object library is built. Let's build the library:
$ salloc -N 1 -n 1 -p debug -t 0:05:00 $ module load python/3.9.7 py-cython/0.29.24 $ srun python setup.py build_ext --inplace
As a result, a shared-object library file is built, named PI.cpython-36m-x86_64-linux-gnu.so
. We can now call the pi function, defined herein, from a standard Python script, script.py
:
## ## Calculating Pi with using midpoint rectangle method integral 4.0/(1+x^2) dx = pi ## or ## summation (for i=0 to n) 4.0/(1+x_i^2) where x_i = (i + 1/2) * delta x ## Serial Version ## import timeit import PI import math num_rects=100000000 start = timeit.default_timer() pi = PI.pi( num_rects ) diff_time = timeit.default_timer() - start print( "Estimated value of pi = {!r}".format(pi) ) print( "Error = {!r}".format(abs(math.pi-pi)) ) print( "Compute time = {0:.3f} seconds".format(diff_time) )
Let's execute this code by means of an interactive Slurm session with one core:
$ salloc -N 1 -n 1 -p debug -t 0:01:00 $ module load python/3.9.7 py-cython/0.29.24 $ srun python script.py Estimated value of pi = 3.1415926535904264 Error = 6.332712132461893e-13 Compute time = 0.547 seconds