srun Fails When Submitting From One Cluster to Another

Problem

When a job running on Magnus or Galaxy submits a second job to another cluster, srun within that second job fails with the message:

srun: fatal: SLURM_MEM_PER_CPU, SLURM_MEM_PER_GPU, and SLURM_MEM_PER_NODE are mutually exclusive.

A typical scenario is a batch job on Magnus submitting a job to the Zeus copyq to move data after a large simulation completes.

Solution

This occurs because jobs on the Crays are node-exclusive, so SLURM sets memory limits at the node level. On Zeus most of the queues are shared, so SLURM sets memory limits at the per-core/per-task level. These environment variables are set automatically by SLURM, so there is no elegant solution. The simple workaround is to unset the offending environment variable in the initial job script, either before calling the sbatch that submits the job to the other cluster, or before calling srun within the affected job script:

unset SLURM_MEM_PER_NODE

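For context, the following is a minimal sketch of what the initial Magnus job script might look like; the partition names, the --clusters option for cross-cluster submission, and the stage_data.sh script are illustrative assumptions rather than a prescribed setup:

#!/bin/bash -l
#SBATCH --partition=workq        # Magnus work queue (assumed name)
#SBATCH --nodes=1
#SBATCH --time=00:10:00

# ... main simulation steps here ...

# Clear the node-level memory limit inherited from the Cray
# before handing work to the shared Zeus queues.
unset SLURM_MEM_PER_NODE

# Submit the data-staging job to the Zeus copyq.
sbatch --clusters=zeus --partition=copyq stage_data.sh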

Another possibility is to set the memory request explicitly within the affected job script. For example, for a serial job in the Zeus copyq, you can set the memory at the node level in your job script with:

#SBATCH --mem=2G

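As a sketch, the affected copyq job script might then look like the following, where the time limit and data paths are placeholders:

#!/bin/bash -l
#SBATCH --clusters=zeus          # run on Zeus (illustrative)
#SBATCH --partition=copyq
#SBATCH --ntasks=1
#SBATCH --mem=2G                 # explicit node-level memory request
#SBATCH --time=01:00:00

# Example data-movement step; paths are placeholders.
cp -r /scratch/projectname/run_output /group/projectname/archive/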

In short, the user needs to take care that the combined job script settings do not end up setting two of these mutually exclusive variables at the same time.
A very useful trick when debugging this kind of problem is to echo the values of these variables within all the job scripts involved:

echo "SLURM_CLUSTER_NAME=$SLURM_CLUSTER_NAME"
echo "SLURM_MEM_PER_CPU=$SLURM_MEM_PER_CPU"
echo "SLURM_MEM_PER_GPU=$SLURM_MEM_PER_GPU"
echo "SLURM_MEM_PER_NODE=$SLURM_MEM_PER_NODE"

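A quicker, equivalent check (a suggested shortcut, not part of the original recipe) is to dump every memory-related SLURM variable at once:

# Print all SLURM memory variables currently set in the environment;
# the fallback message appears when none of them are exported.
env | grep '^SLURM_MEM' || echo "no SLURM_MEM_* variables set"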
