Number of Nodes Is Not Set Correctly When Submitting From One Cluster to Another

Problem

A job script intended to be executed on Magnus works fine when submitted from Magnus itself. However, when the job script is submitted from Zeus using the option:

-M magnus

or

--clusters=magnus

then the number of nodes is not set correctly and the script fails or misbehaves.

Solution

We have found that, in multi-cluster operation, some variables that are set by default on one cluster are not set in the same way when the job is submitted remotely from another cluster. Usually the offending parameters are not set explicitly in the script, and the user relies on their default values. Unfortunately, those defaults may change when the job is submitted from a different cluster. In this case, when the job is submitted from Zeus, the variable SLURM_HINT is not set properly, which creates a problem with the number of tasks to be executed per node.

Our proposed solution in this case is to set the number of tasks per node explicitly in the job script header:

#SBATCH --ntasks-per-node=24
#SBATCH --clusters=magnus
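
For reference, a minimal sketch of a complete Magnus job script built around these directives might look as follows; the project name, node count, walltime and executable are placeholders, not part of the original example, and should be replaced with your own values:

#!/bin/bash -l
# Example values only: replace the project name, node count, walltime
# and executable with your own.
#SBATCH --clusters=magnus
#SBATCH --account=<project>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24
#SBATCH --time=00:10:00

srun -n 48 ./my_program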

Note

This is the recommended practice for every job script, even one that will always be submitted from Magnus itself.

Another solution is to explicitly set the offending parameter to the value it would assume on Magnus by default; --hint=nomultithread instructs Slurm not to place tasks on the extra hardware threads within each core:

#SBATCH --hint=nomultithread
#SBATCH --clusters=magnus
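
To confirm that the submitted job picked up the intended values, it can also be inspected from the submitting cluster; scontrol accepts the same --clusters option, so from Zeus one could run something along these lines (the job ID is a placeholder):

$ scontrol --clusters=magnus show job <jobid>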

When debugging this kind of problem, it is always useful to check the values of Slurm variables at the different stages of the workflow in order to identify which parameters are creating the problem:

$ printenv | grep "SLURM" > vars_at_state_A.txt

Or echo the values of specific variables, for example:

$ echo $SLURM_JOB_NUM_NODES
$ echo $SLURM_HINT
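
For instance, one way to identify the offending parameter is to dump the Slurm environment from inside the job script itself (the output file name below is only illustrative):

printenv | grep "SLURM" | sort > vars_job_${SLURM_JOB_ID}.txt

Then submit the same script once from Magnus and once from Zeus, and compare the two resulting files:

$ diff vars_job_<magnus_jobid>.txt vars_job_<zeus_jobid>.txt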

