Comment: Update known SLURM issues

...

  • There is a known bug in Slurm related to memory requests, which will be fixed by a future patch. The total memory on a node is calculated incorrectly when requesting 67 or more MPI processes per node, that is, with --ntasks-per-node=67 or more.
    • Workaround: Provide the total memory required with --mem=<desired_value>
  • For shared node access, pinning of CPUs to MPI processes or OpenMP threads will be poor.
    • Workaround: Run srun with -m block:block:block
  • Email notifications implemented through the --mail-type and --mail-user options of Slurm are currently not working. The issue will be investigated soon.
  • Using --ntasks-per-node together with an explicit memory request (including --mem=0 and --exclusive) can cause some job requests to be rejected with an error message of the form: error: Job submit/allocate failed: Requested node configuration is not available
    • Workaround: Include only total job resources (e.g. --ntasks and --nodes) in the resource request (sbatch script or salloc), and distribute tasks across nodes when invoking srun, i.e. srun --ntasks-per-node=<desired_value>
  • When requesting GPUs with the option --gpus-per-node=<desired_value>, the number of GPUs visible to processes launched by an srun call does not always match the number requested.
    • Workaround: Use --gres=gpu:<desired_value> instead of --gpus-per-node, as recommended throughout our documentation.
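
The task-distribution and pinning workarounds above might be combined as in the following sketch. It only prints a job script and does not submit anything. The helper name print_job_script, the program name ./my_program and the node and task counts are placeholders chosen for illustration, not values taken from Pawsey documentation.

```shell
#!/bin/bash
# Sketch: print a job script that applies two of the workarounds listed above.
# print_job_script, ./my_program and the numbers used are placeholders.
print_job_script() {
    local nodes="$1" tasks_per_node="$2"
    local total_tasks=$(( nodes * tasks_per_node ))
    cat <<EOF
#!/bin/bash
# Request only job totals: no --ntasks-per-node and no --mem here.
#SBATCH --nodes=${nodes}
#SBATCH --ntasks=${total_tasks}

# Distribute tasks per node at launch time; block distribution is the
# pinning workaround for shared node access.
srun --ntasks-per-node=${tasks_per_node} -m block:block:block ./my_program
EOF
}

# Example: 2 nodes with 64 tasks each.
print_job_script 2 64
```

Redirecting the output to a file and passing that file to sbatch would submit the job. For GPU jobs, the resource request would additionally carry --gres=gpu:<desired_value>, as described in the last workaround.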

Quota issues on /software

To prevent the metadata servers of the /software filesystem from being overwhelmed by too many inodes, Pawsey imposes a quota of 100k files per user on that filesystem. However, we acknowledge that this limit may be too strict for some use cases, such as weather models and large Git repositories. We are working on a solution and will provide an update once we have one that meets the requirements of both Pawsey and user applications.
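
To get a rough idea of how close a directory tree is to the limit, the entries under it can be counted as in the sketch below. It assumes that every file and directory counts as one entry against the quota, which may differ from how the quota is actually accounted. The function name count_entries and the QUOTA_LIMIT variable are illustrative only.

```shell
#!/bin/bash
# Sketch: count files and directories under a directory tree.
# Assumption: each file or directory counts as one entry against the quota.
count_entries() {
    # -xdev keeps find on the same filesystem as the starting directory.
    find "$1" -xdev 2>/dev/null | wc -l | tr -d ' '
}

QUOTA_LIMIT=100000      # the 100k figure quoted above
target="${1:-.}"        # directory to inspect, defaults to the current one
used="$(count_entries "$target")"
echo "${target}: ${used} entries (limit ${QUOTA_LIMIT})"
```

Running the script with a directory on /software as its argument reports the entry count for that tree only, not the total across all of a user's directories.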

...


Maintenance and Incidents



...