This page describes how to run Pangeo and Dask with JupyterHub on Setonix, using a Python virtual environment and Slurm. This involves launching a Jupyter server as a batch job and then connecting to it from your local machine.
Installation
Here we use a Python virtual environment for the installation of Pangeo and Dask. This allows users to add new Python packages as required during their analysis. The installation steps are described below:
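The steps below are a minimal sketch, assuming the system Python module and an installation location under /software; the module version, installation path, and package list (a typical Pangeo/Dask stack) are illustrative and should be adapted to your project.
module load python/<version>                                    # list available versions with "module avail python"
python3 -m venv /software/projects/pawseyXXXX/$USER/pangeo-env  # create the virtual environment
source /software/projects/pawseyXXXX/$USER/pangeo-env/bin/activate
pip install --upgrade pip
pip install dask distributed dask-jobqueue xarray jupyterlab    # core Pangeo/Dask packages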
Setting up the batch script
Once you have the Python virtual environment installed, you can launch your Jupyter notebook server on Setonix with a batch script such as the one below. The script activates the Python virtual environment and launches Jupyter on a node in the work partition. You will need to edit the Slurm parameters and working directory to suit your needs.
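The script below is a minimal sketch of such a batch script; the account, resource requests, virtual environment path, and working directory are placeholders that you should adjust. It also prints the SSH tunnel command referred to in the next section into the Slurm output file.
#!/bin/bash -l
#SBATCH --account=pawseyXXXX
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
#SBATCH --job-name=jupyter

# Activate the Python virtual environment created earlier (path is a placeholder)
source /software/projects/pawseyXXXX/$USER/pangeo-env/bin/activate

# Move to your working directory on /scratch (placeholder)
cd /scratch/pawseyXXXX/$USER/your_working_dir

# Write the SSH tunnel command to the Slurm output file so you can copy it later
node=$(hostname)
port=8888
echo "To connect, run on your local machine:"
echo "  ssh -N -f -L ${port}:${node}:${port} ${USER}@setonix.pawsey.org.au"

# Start JupyterLab on the compute node without opening a browser
jupyter lab --no-browser --ip=${node} --port=${port}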
Please note that the port forwarding will not work correctly if you run Jupyter on the login node.
Run your Jupyter notebook server
To start, submit the Slurm jobscript. The job may take a few minutes to start, depending on how busy the queue is. Once the job starts, a Slurm output file will appear in your directory; the instructions on how to connect are at the end of that file.
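For example, assuming the batch script above was saved as jupyter_pangeo.slurm (the file name is a placeholder):
sbatch jupyter_pangeo.slurm    # submit the job
squeue -u $USER                # check when the job starts running
cat slurm-<job_id>.out         # the connection instructions are at the end of this file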
Now open a separate terminal window on your local computer and execute the ssh command to set up the tunnel between the local and remote ports. In this case:
ssh -N -f -L 8888:nid002024:8888 <username>@setonix.pawsey.org.au
Supply your Setonix password if requested. Then open your web browser and navigate to the Jupyter address (e.g. http://127.0.0.1:8888/lab?token=a8135a22fab1a3f97214fa1424eefb25c4e415f6caaab030 in the above example). This will take you to JupyterLab, where you can run Pangeo and Dask.
When selecting the URL to use in your browser, ensure you use the address with 127.0.0.1 and not nidXXXXX.
Example Dask usage from Jupyter
The following is provided as an example of how you might use Dask from within the Jupyter session.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    cores=24,
    memory='100GB',
    shebang='#!/bin/bash -l',  # default is bash
    processes=6,
    local_directory='/scratch/your_working_dir',
    job_extra_directives=['--account=pawseyXXXX'],  # additional job-specific options
    walltime='02:00:00',
    queue='work',
)

cluster.scale(jobs=2)             # launch 2 jobs, each of which starts 6 worker processes
# cluster.scale(cores=48)         # or specify cores directly
# cluster.scale(memory="200 GB")  # or specify memory directly

# Print the job script for you to review
print(cluster.job_script())

# Connect the cluster to the notebook
client = Client(cluster)
client

# You should then see the workers spawn and the dashboard start up. You can also check
# the jobs spawning on the Setonix terminal with `watch squeue -u username -l`
Cleaning up when you are finished
Once you have finished:
- On the remote Pawsey cluster, cancel your job with
scancel <job_id>
- In your local computer terminal, kill the SSH tunnel using the command displayed in the output file:
kill $( ps x | grep 'ssh.*-L *8888:nid002024:8888' | awk '{print $1}' )
External Links
These external links may be useful for you in making the most of Dask:
- More information on how to submit Dask jobs to the Slurm queue from Jupyter, along with related examples: https://jobqueue.dask.org/en/latest/examples.html
- Talk from SciPy 2019 conference, "Turning HPC Systems into Interactive Data Analysis Platforms" by A. Banihirwe. This gives a tutorial about using Dask on HPC. https://www.youtube.com/watch?v=vhawO8fgD64