Script Fails or Misbehaves When Submitting From One Cluster to Another

Problem

A script runs fine when submitted from within the intended cluster, but fails or misbehaves when submitted from another cluster using the -M or --clusters option. For example, this can happen when submitting a job from Zeus to be run on Magnus with

--clusters=magnus
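
The full submission command from the Zeus login node would then look similar to the following (jobscript.slurm is a placeholder name):

$ sbatch --clusters=magnus jobscript.slurm
$ sbatch -M magnus jobscript.slurm

Both forms are equivalent, as -M is just the short form of --clusters.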


Solution

We have found that, when using the multi-cluster operation, some environment variables that the scheduler sets by default on one cluster are not set to the same values when the job has been submitted remotely from another cluster.

One solution that works for some cases is to explicitly set the offending parameters within the job script header. Usually, the offending parameters are not set explicitly, so the user relies on their default values being correct. Unfortunately, default values may differ when the submission starts from a different cluster. For example, the following setting for the number of tasks per node may be necessary when the job is submitted from Zeus, even though it is the default on Magnus:

#SBATCH --ntasks-per-node=24
#SBATCH --clusters=magnus
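
As a minimal sketch (the job name, resource requests and executable below are illustrative placeholders), a job script prepared for remote submission would spell out every parameter whose default might differ between clusters:

#!/bin/bash -l
#SBATCH --job-name=remote_test       # placeholder job name
#SBATCH --clusters=magnus            # target cluster for the remote submission
#SBATCH --nodes=2                    # request resources explicitly ...
#SBATCH --ntasks-per-node=24         # ... rather than relying on remote defaults
#SBATCH --time=00:10:00

srun -n 48 ./my_program              # placeholder executable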


Another solution, which works for some other cases, is to unset the offending environment variables within the job script, for example:

unset SLURM_NTASKS_PER_NODE


before the sbatch submission to the remote cluster is performed. Alternatively, unset all of the SLURM variables at once before running sbatch:

$ unset $(compgen -e SLURM_)
$ sbatch magnus_script
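
The same approach can be used from within a job script that runs on the local cluster and hands work over to the remote one; the resources and script name below are placeholders:

#!/bin/bash -l
#SBATCH --ntasks=1                   # placeholder resources for the local job
#SBATCH --time=00:05:00

# Clear the SLURM_* variables inherited from the current job so that the
# remote submission starts from clean defaults on the target cluster.
unset $(compgen -e SLURM_)

sbatch --clusters=magnus magnus_script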


When debugging this kind of problem, it is always useful to check the values of the SLURM environment variables at the different stages of the workflow, in order to identify which parameters are causing the problem:

$ printenv | grep "SLURM" > vars_at_state_A.txt


Or echo the values of specific variables, for example:

$ echo $SLURM_JOB_NUM_NODES
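
To compare stages directly, one way (the file names below are placeholders) is to save a sorted snapshot of the SLURM variables at each point and diff them afterwards:

# On the submitting cluster, just before running sbatch:
$ printenv | grep "^SLURM" | sort > vars_before_submit.txt

# As one of the first commands inside the job script on the target cluster:
printenv | grep "^SLURM" | sort > vars_inside_job.txt

# Once the job has started, compare the two snapshots:
$ diff vars_before_submit.txt vars_inside_job.txt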

