This page gives an example of how to handle a data workflow using Acacia and Setonix, with three different jobs connected by dependencies.
Example workflow on Setonix
The data workflow will consist of:
- A first job submitted to the data transfer nodes (copy partition) that will stage initial data from Acacia into the working directory on /scratch.
- A second job submitted to the compute nodes (work partition) that will execute a supercomputing job making use of the staged data. We assume that this job will produce new results/data in the same working directory on /scratch.
- A third job submitted to the data transfer nodes (copy partition) that will handle the new results/data and store them into Acacia.
Script for staging of initial data into /scratch
Staging can be performed with a slurm job script that makes use of the data-transfer nodes (copy partition) to handle initial data originally stored in Acacia and to stage it into the working directory on /scratch. Note that the original data is packed within a .tar file, so the script also performs the "untarring" of the data:
#!/bin/bash --login
#---------------
#About this script
#stageFromAcaciaTar.sh : copies a tar object from Acacia and extracts it in the destination path
#---------------
#Requested resources:
#SBATCH --account=[yourProjectName]
#SBATCH --job-name=stageTar.rclone
#SBATCH --partition=copy
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=[requiredTime]
#SBATCH --export=NONE
#-----------------
#Loading the required modules
module load rclone/<version> #This example is performed with the use of the rclone client.
#-----------------
#Defining variables that will handle the names related to your access, buckets and stored objects in Acacia
profileName=<profileNameGivenToYourProfileOfAccessToAcacia>
bucketName=<bucketInAcaciaContainingTheData>
prefixPath=<prefixPathInBucketUsedToOrganiseTheData>
fullPathInAcacia="${profileName}:${bucketName}/${prefixPath}" #Note the colon(:) when using rclone
#-----------------
#Name of the file to be transferred and auxiliary dir to temporarily place it
tarFileName=<nameOfTheTarFileContainingInitialData>
auxiliaryDirForTars="$MYSCRATCH/tars"
echo "Checking that the auxiliary directory exists"
if ! [ -d $auxiliaryDirForTars ]; then
    echo "Trying to create the auxiliary directory as it does not exist"
    mkdir -p $auxiliaryDirForTars; exitcode=$?
    if [ $exitcode -ne 0 ]; then
        echo "The auxiliary directory $auxiliaryDirForTars does not exist and can't be created"
        echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
        exit 1
    fi
fi
#-----------------
#Working directory in scratch for the supercomputing job
workingDir="$MYSCRATCH/<workingDirectoryForSupercomputingJob>"
echo "Checking that the working directory exists"
if ! [ -d $workingDir ]; then
    echo "Trying to create the working directory as it does not exist"
    mkdir -p $workingDir; exitcode=$?
    if [ $exitcode -ne 0 ]; then
        echo "The working directory $workingDir does not exist and can't be created"
        echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
        exit 1
    fi
fi
#-----------------
#Check if Acacia definitions make sense, and if the object to transfer exists
echo "Checking that the profile exists"
rclone config show | grep "${profileName}" > /dev/null; exitcode=$?
if [ $exitcode -ne 0 ]; then
    echo "The given profileName=$profileName seems not to exist in the user configuration of rclone"
    echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
    exit 1
fi
echo "Checking that bucket exists and that you have read access"
rclone lsd "${profileName}:${bucketName}" > /dev/null; exitcode=$? #Note the colon(:) when using rclone
if [ $exitcode -ne 0 ]; then
    echo "The bucket name or the profile name may be wrong: ${profileName}:${bucketName}"
    echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
    exit 1
fi
echo "Checking if the file can be listed in Acacia"
listResult=$(rclone lsl "${fullPathInAcacia}/${tarFileName}")
if [ -z "$listResult" ]; then
    echo "Problems occurred during the listing of the file ${tarFileName}"
    echo "Check that the file exists in the fullPathInAcacia: ${fullPathInAcacia}/"
    echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
    exit 1
fi
#-----------------
#Perform the transfer of the tar file into the auxiliary directory and check for the transfer
echo "Performing the transfer ... "
srun rclone copy "${fullPathInAcacia}/${tarFileName}" "${auxiliaryDirForTars}/"; exitcode=$?
if [ $exitcode -ne 0 ]; then
    echo "Problems occurred during the transfer of file ${tarFileName}"
    echo "Check that the file exists in the fullPathInAcacia: ${fullPathInAcacia}/"
    echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
    exit 1
fi
#-----------------
#Perform untarring with desired options into the working directory
echo "Performing the untarring ... "
#tarOptions=( --strip-components 8 ) #Avoiding creation of some directories in the path
srun tar -xvzf "${auxiliaryDirForTars}/${tarFileName}" -C $workingDir "${tarOptions[@]}"; exitcode=$?
if [ $exitcode -ne 0 ]; then
    echo "Problems occurred during the untarring of file ${auxiliaryDirForTars}/${tarFileName}"
    echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
    exit 1
else
    echo "Removing the tar file as it has been successfully untarred"
    rm $auxiliaryDirForTars/$tarFileName #comment this line when debugging workflow
fi
#-----------------
## Final checks ....

#---------------
#Successfully finished
echo "Done"
exit 0
Client support
- Note the use of variables to store the names of directories, files, buckets, prefixes, objects etc.
- Also note the several checks at the different parts of the script and the redirection to /dev/null in most of the commands used for checking correctness (as we are not interested in their output).
- As messages from the mc client are too verbose when transferring files (even with the --quiet option), we make use of a redirection of the output messages to /dev/null when performing transfers with this client in scripts, as in the sketch after this list. For this reason we often find rclone a better choice of client for use in scripts.
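As a minimal sketch of how that redirection could look in practice (the mc alias variable aliasName is an assumption, and the other variables follow the staging script above), a staging transfer with the mc client inside a script could be written as:

#Hypothetical staging transfer with the mc client, with its verbose output silenced; only the exit code is checked
srun mc cp "${aliasName}/${bucketName}/${prefixPath}/${tarFileName}" "${auxiliaryDirForTars}/" > /dev/null 2>&1; exitcode=$?
if [ $exitcode -ne 0 ]; then
    echo "Problems occurred during the transfer of file ${tarFileName}"
    echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
    exit 1
fi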
Script for performing a supercomputing job
The slurm job script for performing the supercomputing job makes use of the initial data staged in the previous step. This job requests execution on the compute nodes (work partition).
#!/bin/bash -l
#SBATCH --account=[yourProject]
#SBATCH --job-name=superExecution
#SBATCH --partition=work
#SBATCH --ntasks=[numberOfCoresToUse]
#SBATCH --time=[requiredTime]
#SBATCH --export=none
#--------------
#Load required modules here ...

#---------------
#Defining the working dir
workingDir="$MYSCRATCH/<pathAndNameOfWorkingDirectory>"
#---------------
#Entering the working dir
cd $workingDir
#---------------
#Check for the correct staging of the initial conditions if needed ...

#---------------
#Supercomputing execution
srun <theTool> <neededArguments>
#---------------
#Successfully finished
echo "Done"
exit 0
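As a sketch of the optional staging check mentioned in the script above (the file name initialData.dat is only a hypothetical example), the job can exit with a non-zero code so that the dependent storing job is not released:

#Hypothetical check that the staged input file exists before running the supercomputing tool
stagedFile="$workingDir/initialData.dat"    #initialData.dat is an assumed example name
if ! [ -f "$stagedFile" ]; then
    echo "The expected input file $stagedFile is not present in the working directory"
    echo "Exiting the script with non-zero code in order to inform job dependencies not to continue."
    exit 1
fi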
Script for storing of results into Acacia
Storing can be performed with a slurm job script that makes use of the data-transfer nodes (copy partition) to handle the new results/data in the working directory on /scratch and store them in Acacia. Note that the data is first packed into a .tar file and transferred to Acacia afterwards.
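A minimal sketch of such a storing script is given below. It mirrors the staging script above; the script name storeIntoAcaciaTar.sh (also used in the master script further down) and the exact variable names are assumptions to be adapted to your case:

#!/bin/bash --login
#---------------
#About this script
#storeIntoAcaciaTar.sh : packs the results into a tar file and copies it into Acacia
#(minimal sketch following the staging script above; adapt names and checks to your case)
#---------------
#Requested resources:
#SBATCH --account=[yourProjectName]
#SBATCH --job-name=storeTar.rclone
#SBATCH --partition=copy
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=[requiredTime]
#SBATCH --export=NONE
#-----------------
#Loading the required modules
module load rclone/<version>
#-----------------
#Defining variables that will handle the names related to your access, buckets and stored objects in Acacia
profileName=<profileNameGivenToYourProfileOfAccessToAcacia>
bucketName=<bucketInAcaciaWhereResultsWillBeStored>
prefixPath=<prefixPathInBucketUsedToOrganiseTheResults>
fullPathInAcacia="${profileName}:${bucketName}/${prefixPath}" #Note the colon(:) when using rclone
#-----------------
#Working directory with the new results and name of the tar file to be created
workingDir="$MYSCRATCH/<workingDirectoryForSupercomputingJob>"
tarFileName=<nameOfTheTarFileToContainTheResults>
auxiliaryDirForTars="$MYSCRATCH/tars"
mkdir -p $auxiliaryDirForTars
#-----------------
#Pack the results and check for success
echo "Performing the tarring ... "
srun tar -cvzf "${auxiliaryDirForTars}/${tarFileName}" -C "$workingDir" .; exitcode=$?
if [ $exitcode -ne 0 ]; then
    echo "Problems occurred during the tarring of the results in $workingDir"
    exit 1
fi
#-----------------
#Transfer the tar file into Acacia and check for success
echo "Performing the transfer ... "
srun rclone copy "${auxiliaryDirForTars}/${tarFileName}" "${fullPathInAcacia}/"; exitcode=$?
if [ $exitcode -ne 0 ]; then
    echo "Problems occurred during the transfer of file ${tarFileName} into ${fullPathInAcacia}/"
    exit 1
fi
#---------------
#Successfully finished
echo "Done"
exit 0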
Coordinating the different steps (scripts) with job dependencies
To coordinate the stage, supercomputing, and store steps, the job scripts are submitted to the scheduler with job dependencies. This is easier to manage with a master script that submits the jobs:
#!/bin/bash
scriptsDir=<pathToWhereTheScriptsAre>

echo "Using dependencies to connect execution of several jobs on different partitions"
#----------------------
echo "Submitting the staging job and saving the jobID"
stageJobID=$(sbatch --parsable \
             $scriptsDir/stageFromAcaciaTar.sh \
             | cut -d ";" -f 1)
echo "stageJobID=$stageJobID"
#----------------------
echo "Submitting the supercomputing job and saving the jobID"
superJobID=$(sbatch --parsable \
             --dependency=afterok:$stageJobID \
             $scriptsDir/superExecutionSetonix.sh \
             | cut -d ";" -f 1)
echo "superJobID=$superJobID"
#----------------------
echo "Submitting the storing job and saving the jobID"
storeJobID=$(sbatch --parsable \
             --dependency=afterok:$superJobID \
             $scriptsDir/storeIntoAcaciaTar.sh \
             | cut -d ";" -f 1)
echo "storeJobID=$storeJobID"
#----------------------
echo "Displaying the current status of the jobs"
squeue --jobs="$stageJobID,$superJobID,$storeJobID" \
       --Format "JobID:8 ,UserName:10 ,Account:11 ,Partition:10 ,Name:14 ,State:8 ,Reason:12 ,Dependency:30"
Client support
- Note that this is not a slurm job script but a simple bash script, so you should not submit it to the scheduler. Simply execute it on the command line; the script itself then submits the scripts for the different steps to the scheduler.
- Note that the script makes use of the cut command to grab the first field of the output of the sbatch --parsable command (which is the jobID) and assign it to a variable. In the following submission, the variable with the corresponding jobID is used in the --dependency option. In this case, the dependent jobs will only start execution if the "parent" job ends with success. A minimal illustration of the jobID extraction is given after this list.
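As that illustration (the job ID shown in the trailing comment is just a hypothetical value), sbatch --parsable prints the job ID, possibly followed by ;clusterName, and cut keeps only the numeric ID:

#Minimal illustration of grabbing the jobID returned by sbatch --parsable
jobID=$(sbatch --parsable $scriptsDir/stageFromAcaciaTar.sh | cut -d ";" -f 1)
echo "jobID=$jobID"    #e.g. jobID=5534757 (hypothetical value)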
When executing the script, you can save the output to a log file for your records by piping the output through the tee command:
$ ./masterWorkflow.sh | tee log.master.out
Using dependencies to connect execution of several jobs on different partitions
Submitting the staging job and saving the jobID
stageJobID=5534757
Submitting the supercomputing job and saving the jobID
superJobID=5534758
Submitting the storing job and saving the jobID
storeJobID=5534759
Displaying the current status of the jobs
JOBID    USER      ACCOUNT    PARTITION  NAME            STATE    REASON       DEPENDENCY
5534759  username  pawsey     copy       storeTar        PENDING  Dependency   afterok:5534758(unfulfilled)
5534758  username  pawsey     work       superExecution  PENDING  Dependency   afterok:5534757(unfulfilled)
5534757  username  pawsey     copy       stageTar        PENDING  Priority     (null)
Client support
Note that the final command in the script is the use of the squeue command to display the information of your jobs in the queue and their dependencies.
Example workflow on other systems
In principle the same scripts can be used on other systems, but they will need to be adapted accordingly. The main adaptation arises from the impossibility of coordinating the different steps through dependencies when the scripts are to be executed on different clusters; for example, using the data mover nodes on Zeus (copyq partition) and the compute nodes on Magnus (workq partition) for the supercomputing job. A possible manual workaround is sketched below.
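In that cross-cluster case, one possible manual replacement for the dependency mechanism is to guard each script with a check for the output of the previous step and to submit each job by hand once the previous one has finished. A minimal sketch of such a guard at the top of the supercomputing script (the file name initialData.dat is a hypothetical example):

#Hypothetical guard replacing the afterok dependency when the jobs run on different clusters
stagedFile="$MYSCRATCH/<workingDirectoryForSupercomputingJob>/initialData.dat"    #assumed example name
if ! [ -f "$stagedFile" ]; then
    echo "The staged data $stagedFile is not present yet; submit this job again once staging has finished"
    exit 1
fi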