Slurm Job Cancelled Due to Time Limit

Problem

A job ends with a slurmstepd error message stating that the job was cancelled due to its time limit. The message appears in the Slurm output file and may be preceded by a message from srun saying that the job step was aborted.

slurmstepd: error: *** JOB 3501970 ON nid00161 CANCELLED AT 2018-01-08T15:17:49 DUE TO TIME LIMIT ***
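
If you are not sure which job produced the message, a search over the output files can help. The following is a minimal sketch: the function name find_time_limit_jobs is hypothetical, and it assumes the output files use Slurm's default name slurm-<jobid>.out (adjust the pattern if your job script sets --output).

```shell
# Hypothetical helper: list Slurm output files that contain the time-limit message.
# Assumption: output files use the default name slurm-<jobid>.out in the given directory.
find_time_limit_jobs() {
    dir="${1:-.}"
    # -l prints only the names of the files that contain the message
    grep -l "DUE TO TIME LIMIT" "$dir"/slurm-*.out 2>/dev/null
}
```

Calling find_time_limit_jobs with no argument searches the current directory.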

Solution

There are two likely causes. Firstly, the job may have reached the maximum time it was allowed to run. Secondly, if you have a fixed allocation then your allocation may have been used up.

  1. Slurm allocates resources to a job for a fixed amount of time. This time limit is either specified in the job request or, if none is specified, set to the partition's default. Every Slurm partition also has a maximum time limit. If your job requested less than the maximum, try increasing its time limit with the --time= flag to sbatch or salloc.

    #SBATCH --time=12:00:00

    To see the maximum and default time limits, use sinfo:

    Terminal 1. View the time limits for a queue
    $ sinfo -o "%.10P %.5a %.10l %.15L %.6D %.6t" -p workq
     PARTITION AVAIL  TIMELIMIT     DEFAULTTIME  NODES  STATE
        workq*    up 1-00:00:00         1:00:00      1  drain
        workq*    up 1-00:00:00         1:00:00     20   resv
        workq*    up 1-00:00:00         1:00:00     34    mix
        workq*    up 1-00:00:00         1:00:00     25  alloc
  2. Usually, if your allocation is not sufficient to support a job running to completion, Slurm will not start the job. However, if multiple jobs start at the same time, each job on its own may fit within the remaining allocation while together they exceed it. When this happens they will all start, but will be terminated when the allocation is used up. You can tell this is the case if the elapsed time is shorter than the job's time limit.

    $ sacct -j 2954681 -o jobid,elapsed,time
           JobID    Elapsed  Timelimit
    ------------ ---------- ----------
    2954681        05:54:30 1-00:00:00
    2954681.bat+   05:54:31
    2954681.ext+   05:54:31
    2954681.0      05:54:30

    If this is the case, check whether your allocation is used up. If it is, contact the Pawsey help desk. See the Project Accounting section of Submitting and Monitoring Jobs for more information about project accounting.

    $ pawseyAccountBalance
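
The comparison between elapsed time and time limit described in cause 2 can be sketched as a small shell helper. This is an illustration only: the function names slurm_time_to_seconds and ended_before_limit are hypothetical, and the sketch assumes the durations have the form [D-]HH:MM:SS or MM:SS as shown in the sacct output above. Values such as UNLIMITED are not handled.

```shell
# Convert a duration printed by sacct ([D-]HH:MM:SS or MM:SS) to seconds.
# Assumption: non-numeric values such as UNLIMITED or Partition_Limit are not handled.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; t = $0
        # Split off the optional leading day count, e.g. "1-00:00:00"
        if (index(t, "-") > 0) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]
        else             exit 1
        print days * 86400 + secs
    }'
}

# Succeed (exit status 0) when the job stopped before reaching its time limit.
ended_before_limit() {
    elapsed=$(slurm_time_to_seconds "$1") || return 2
    limit=$(slurm_time_to_seconds "$2") || return 2
    [ "$elapsed" -lt "$limit" ]
}

# Values taken from the sacct example above (job 2954681)
if ended_before_limit 05:54:30 1-00:00:00; then
    echo "Job stopped before its time limit: check the project allocation"
fi
```

The two arguments could be taken from a command such as sacct -j <jobid> -X -n -o elapsed,timelimit, which prints one line for the job allocation without a header. If the elapsed time instead reaches the time limit, cause 1 is the more likely explanation.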
