Exceeded Job Memory Limit

Problem

A job on the Pawsey supercomputers fails with "slurmstepd: error: Exceeded job memory limit at some point."

Solution

This error means the job has exhausted the memory available to it on a core or node. It can appear shortly after the job starts, or much later in execution, depending on how the application's memory demand grows.

There are three options to solve this problem:

  1. Explicitly request more memory per task or thread, using the directive #SBATCH --mem-per-cpu=10G (10 GB in this example). This works best in combination with the --cpus-per-task option, which gives full control over the resources allocated to each task (see the first sketch below).
  2. Increase the memory available to each task by reducing the number of tasks per node, while still requesting all the CPUs on the node so that another job is not allocated alongside yours (see the second sketch below).
  3. Reduce the memory requirement of the application. This may mean reducing the problem size, but it may also mean checking for memory leaks if you are developing your own code (see the third sketch below).
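A minimal batch-script sketch for option 1 follows. The task count, walltime, and the executable name my_application are placeholders, not Pawsey defaults; only the --mem-per-cpu value comes from the example above.

  #!/bin/bash -l
  #SBATCH --ntasks=8             # placeholder task count
  #SBATCH --cpus-per-task=1      # one core per task, for explicit control
  #SBATCH --mem-per-cpu=10G      # 10 GB per core, as in the example above
  #SBATCH --time=01:00:00        # placeholder walltime

  srun ./my_application          # hypothetical executable name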
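For option 2, here is a sketch assuming a 32-core compute node (check the core count of the partition you use). Half the usual number of tasks are launched, but each task claims two cores, so all CPUs on the node are requested and each task has roughly twice the default memory. The node size and executable name are assumptions.

  #!/bin/bash -l
  #SBATCH --nodes=1
  #SBATCH --ntasks-per-node=16   # half the tasks of a fully packed 32-core node
  #SBATCH --cpus-per-task=2      # claim the idle cores so no other job shares the node
  #SBATCH --time=01:00:00        # placeholder walltime

  srun ./my_application          # each task now has about twice the per-task memory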
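For option 3, if you suspect a leak in code you are developing yourself, a heap checker such as Valgrind is one way to confirm it; this tool is a suggestion rather than a Pawsey-specific requirement, and the executable name is again a placeholder.

  valgrind --leak-check=full ./my_application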
