...
- There is a known bug in Slurm related to memory requests, which will be fixed in the future with a patch. The total memory on a node is incorrectly calculated when requesting 67 or more MPI processes per node (`--ntasks-per-node=67` or more).
  - Workaround: Provide the total memory required with `--mem=<desired_value>`.
- For shared node access, pinning of CPUs to MPI processes or OpenMP threads will be poor.
  - Workaround: `srun` should be run with `-m block:block:block`.
- Email notifications implemented through the `--mail-type` and `--mail-user` options of Slurm are currently not working. The issue will be investigated soon.
- The use of both `--ntasks-per-node` and an explicit memory request (including `--mem=0` and `--exclusive`) can lead to some job requests being rejected with an error message of the form `error: Job submit/allocate failed: Requested node configuration is not available`.
  - Workaround: Only include total job resources (e.g. `--ntasks` and `--nodes`) in the resource request (sbatch script or `salloc`), and distribute tasks across nodes when invoking `srun`, i.e. `srun --ntasks-per-node=<desired_value>`.
- When asking for a number of GPUs with the option `--gpus-per-node=<desired_value>`, the number of GPUs visible to processes launched during an `srun` call does not always match what is asked for.
  - Workaround: Use `--gres=gpu:<desired_value>` instead of `--gpus-per-node`, as recommended throughout our documentation.
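As an illustration of the workaround for rejected job requests in the list above, the following is a minimal batch-script sketch, not a tested Pawsey recipe. The node and task counts and the executable name `./my_program` are hypothetical placeholders. The script requests only totals in the `#SBATCH` header and passes `--ntasks-per-node` to `srun`. The `srun` call is guarded so that the script only prints the command on machines without Slurm.

```shell
#!/bin/bash
# Request only total job resources here: no --ntasks-per-node and no
# explicit memory request, since that combination can be rejected.
#SBATCH --nodes=2     # hypothetical value
#SBATCH --ntasks=8    # hypothetical value

# Distribute the tasks across nodes at launch time instead.
tasks_per_node=4      # hypothetical: 8 tasks spread over 2 nodes

if command -v srun >/dev/null 2>&1; then
    # ./my_program is a placeholder for the real executable.
    srun --ntasks-per-node="${tasks_per_node}" ./my_program
else
    # Outside a Slurm environment, only show what would be run.
    echo "srun not found; would run: srun --ntasks-per-node=${tasks_per_node} ./my_program"
fi
```

The same pattern should apply to interactive sessions: request totals with `salloc`, then give the per-node distribution to `srun`.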
Quota issues on /software
To avoid the metadata servers of the `/software` filesystem being overwhelmed with too many inodes, Pawsey imposes a 100k quota on the number of files each user can have on said filesystem. However, we acknowledge the chosen limit may be too strict for some software such as weather models, large Git repositories, etc. We are working on a solution and will update you as soon as we have found one that meets the requirements of Pawsey and user applications.
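To see how close a directory tree is to the limit, a rough count can be made with standard tools. The sketch below is a hypothetical helper, not a Pawsey-provided command. It assumes that every entry (file, directory, or link) counts towards the quota, and the function names `count_entries` and `report_usage` are invented for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the filesystem entries under a directory and
# compare the total with the 100k limit described above.

file_limit=100000   # the 100k quota stated above

count_entries() {
    # Assumption: every entry (file, directory, link) counts towards the quota.
    # Counting NUL separators stays correct for names that contain newlines.
    find "$1" -print0 | tr -dc '\0' | wc -c
}

report_usage() {
    # Arithmetic expansion strips any padding that wc may add.
    entries=$(( $(count_entries "$1") ))
    echo "$1: ${entries} of ${file_limit} entries"
    # Exit status is 0 while the count is below the limit.
    [ "${entries}" -lt "${file_limit}" ]
}
```

For example, `report_usage <path_to_your_directory>` prints the count for that tree and returns a non-zero status once the count reaches the limit. The path is a placeholder. Walking a large tree also puts load on the metadata servers, so such a check is best run sparingly.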
...
Maintenance and Incidents
...