Known Issues on Setonix
On this page:
- 1 Known ongoing issues
- 1.1 Parallel IO within Containers
- 1.2 Issues with Slingshot network
- 1.3 Issues with libfabric affecting MPI and multi-node applications
- 1.4 MPI jobs hanging
- 1.5 Performance of the Cray environment
- 1.6 Linking to Cray libraries different than the default ones
- 1.7 Threads and processes placement on Zen3 cores
- 1.8 The default project for a user on Setonix may be incorrect
- 1.9 Slurm
- 1.10 Quota issues on /software
- 1.11 Cannot find software installed with Spack
- 1.12 Problems with native Tensorflow distribution strategies
- 2 Resolved issues and incidents
- 3 Current status and live incidents
Known ongoing issues
Parallel IO within Containers
Currently there are issues running MPI-enabled software that makes use of parallel IO from within a container being run by the Singularity container engine. The error message seen will be similar to:
Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14ac6c37cc4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14ac6bd2e684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14ac6c3ad775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14ac6c3e91c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14ac6c38e625]
Currently it is unclear exactly what is causing this issue. Investigations are ongoing.
Workaround:
There is no workaround that does not require a change in the workflow. Either the container needs to be rebuilt to not make use of parallel IO libraries (e.g. the container was built using parallel HDF5) or if that is not possible, the software stack must be built "bare-metal" on Setonix itself (see How to Install Software).
Issues with Slingshot network
MPI jobs failing to run on compute nodes shared with other jobs
MPI jobs running on compute nodes shared with other jobs have been observed failing at MPI_Init providing error messages similar to:
MPICH ERROR [Rank 0] [job id 957725.1] [Wed Feb 22 01:53:50 2023] [nid001012] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......:
MPID_Init(501)..............:
MPIDI_OFI_mpi_init_hook(814):
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......:
MPID_Init(501)..............:
MPIDI_OFI_mpi_init_hook(814):
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)
This error is not strictly an MPI problem; it originates in some components of the Slingshot network.
Workaround:
Firstly, allocate as few nodes as possible for the requested tasks and ideally use exclusive access
Users should avoid spreading MPI ranks across more nodes than necessary and should favour exclusive access where possible. This improves the performance of the user's own code as well as that of the other jobs running on the supercomputer.
Users should follow these best practices to avoid the unnecessary spread of MPI ranks across many nodes:
Any job allocation should be defined so that communication among MPI tasks is as close and effective as possible. If the job requires fewer than 128 tasks, users should explicitly request a single node (using -N 1) and not allow Slurm to spread tasks beyond one node.
If the job requires, for example, 192 tasks, users should explicitly request only 2 nodes (using -N 2).
For jobs with higher task counts, please follow the same approach.
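The rule above amounts to rounding the task count up to whole nodes. The following is a minimal sketch, assuming 128 cores per node (implied by the 128-task threshold above) and one task per core. `min_nodes` is a hypothetical helper, not a Pawsey-provided tool.

```shell
#!/bin/sh
# Hypothetical helper: smallest number of nodes that can hold a given
# number of MPI tasks, assuming 128 cores per node and one task per core.
CORES_PER_NODE=128

min_nodes() {
    ntasks=$1
    # Integer ceiling division: (ntasks + cores - 1) / cores
    echo $(( (ntasks + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

# Example: 192 tasks fit on 2 nodes, so request them with "-N 2 -n 192"
echo "192 tasks -> -N $(min_nodes 192)"
```

The resulting value is what would be passed to -N alongside the task count, so that Slurm does not spread the tasks over more nodes than needed.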
Secondly, use the FI_CXI_DEFAULT_VNI variable when requesting jobs in shared access
Very important: if jobs still need to use compute nodes in shared access, HPE has provided a workaround that avoids the communication problems among shared nodes. (This workaround can be used until the Slingshot settings/libraries are properly fixed.) The workaround consists of setting the Slingshot environment variable FI_CXI_DEFAULT_VNI to a unique value before the execution of each srun step in a Slurm job script. Therefore, before any srun in the job script, users should set this variable to a "unique" random value generated from /dev/urandom, as indicated in the example scripts for shared access.
The same FI_CXI_DEFAULT_VNI workaround should be applied when executing multiple parallel srun steps, even on an exclusive compute node, as in the examples in: Multiple Parallel Job Steps in a Single Main Job
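A minimal sketch of this workaround follows. Reading four bytes from /dev/urandom with od is one possible way to obtain the random value; whether every 32-bit value is acceptable to Slingshot is not stated here and is an assumption. The helper name `new_vni`, the task counts and the program names are hypothetical, and the srun lines are commented out so that the fragment can be read and tried outside a Slurm job.

```shell
#!/bin/sh
# Hypothetical helper: print a random unsigned 32-bit integer read from
# /dev/urandom, with the padding that od adds stripped off.
new_vni() {
    od -An -N4 -tu4 /dev/urandom | tr -d '[:space:]'
}

# Set a fresh value immediately before each srun step.
FI_CXI_DEFAULT_VNI=$(new_vni)
export FI_CXI_DEFAULT_VNI
# srun -N 1 -n 64 ./step_one    # hypothetical first job step

FI_CXI_DEFAULT_VNI=$(new_vni)
export FI_CXI_DEFAULT_VNI
# srun -N 1 -n 64 ./step_two    # hypothetical second job step
```

Regenerating the value before every step, rather than once per script, matches the requirement that each srun step sees its own value.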
Issues with libfabric affecting MPI and multi-node applications
Since 11-Nov-2022, problems with the update to the libfabric library on Setonix have been leading to further instances of job failure for some applications run over multiple nodes.
Errors have been noted specifically with the following applications:
Multicore VASP on a single node has encountered spurious memory access errors on long runs.
We have seen out of memory errors in some multinode NAMD runs.
A full resolution of these issues is expected with the completion of the Setonix Phase 2 installation in the following months. Some of these issues have been resolved (see resolved issues section).
MPI jobs hanging
MPI jobs with large comm-worlds using asynchronous point-to-point (pt2pt) communication can sometimes hang. A root cause for this hang has not yet been identified; however, it appears to be affected by several factors, including rank distribution across nodes (total number of ranks, nodes, and ranks per node), the amount of data each rank sends and receives, and the amount of data in individual sends and receives.
Workaround:
At this point in time, no full workaround has been identified. One current recommendation to anyone experiencing this hang is to try to adjust the distribution of ranks across nodes by making it more compact if possible. Pawsey staff testing has shown that when fewer nodes are used, more total ranks are needed to trigger the hang, provided that other variables (such as the amount of data being sent) are the same.
Performance of the Cray environment
At the moment, vendor-provided libraries such as cray-libsci and cray-fftw do not provide the best performance, and Pawsey staff will test them during the coming months. Moreover, the Cray C/C++ compiler does not seem to optimise code very well. For this reason, you should avoid using the Cray stack other than cray-mpich and crayftn, and use the alternatives provided by the Pawsey software stack instead. To compile C/C++ code, we suggest using the GCC compilers.
Linking to Cray libraries different than the default ones
Cray modulefiles do not set the LD_LIBRARY_PATH environment variable despite the fact that Cray compilers now default to dynamic linking. As a result, programs will link to the libraries found in /opt/cray/pe/lib64, which are symlinks to the deployment of the latest version available.
$ cat /etc/ld.so.conf.d/cray-pe.conf
/opt/cray/pe/lib64
/opt/cray/pe/lib64/cce
To avoid this behaviour, once the desired version of the Cray libraries has been loaded with the module command, set the appropriate environment variables accordingly.
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export LIBRARY_PATH=$CRAY_LIBRARY_PATH:$LIBRARY_PATH
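If LD_LIBRARY_PATH is empty beforehand, the export above leaves a trailing colon, and the dynamic linker treats an empty entry as the current working directory. A slightly more defensive sketch is shown below. It assumes that the desired Cray modules are already loaded, so that CRAY_LD_LIBRARY_PATH and CRAY_LIBRARY_PATH are set. `prepend_path` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print "$1" prepended to "$2", adding the colon
# separator only when the existing value "$2" is non-empty.
prepend_path() {
    echo "$1${2:+:$2}"
}

# Only touch the variables when the Cray modules have set the CRAY_* ones.
if [ -n "${CRAY_LD_LIBRARY_PATH:-}" ]; then
    LD_LIBRARY_PATH=$(prepend_path "$CRAY_LD_LIBRARY_PATH" "${LD_LIBRARY_PATH:-}")
    export LD_LIBRARY_PATH
fi
if [ -n "${CRAY_LIBRARY_PATH:-}" ]; then
    LIBRARY_PATH=$(prepend_path "$CRAY_LIBRARY_PATH" "${LIBRARY_PATH:-}")
    export LIBRARY_PATH
fi
```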
Threads and processes placement on Zen3 cores
Currently, Slurm presents unwanted behaviours that have an impact on the performance of a job when it is submitted without the --exclusive sbatch flag. In particular, Slurm loses awareness of the Zen3 architecture, and threads and/or processes are placed onto cores with no reasonable mapping.
To avoid the issue, pass the -m block:block:block flag to srun within an sbatch script or an interactive session.
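For illustration, a job step using the flag could be launched as sketched below. The task counts and the program name ./my_program are hypothetical. The wrapper `run_step` is also hypothetical; it only prints the command when srun is not available, so that the fragment can be tried outside a Slurm allocation.

```shell
#!/bin/sh
# Distribution flag recommended above for jobs submitted without --exclusive.
DIST_FLAG="-m block:block:block"

# Hypothetical wrapper: run a job step with the flag, or just print the
# command line when srun is not on the PATH.
run_step() {
    if command -v srun >/dev/null 2>&1; then
        # DIST_FLAG is left unquoted on purpose so it splits into two words
        srun $DIST_FLAG "$@"
    else
        echo "srun $DIST_FLAG $*"
    fi
}

# Hypothetical step: 8 tasks on one node, 1 core per task
run_step -N 1 -n 8 -c 1 ./my_program
```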
The default project for a user on Setonix may be incorrect
A user's default project is determined by the content of the first line of the $HOME/.pawsey_project (sometimes referred to as: ~/.pawsey_project) file,
which is read on login and used to populate the $PAWSEY_PROJECT environmental variable (EnvVar).
The PAWSEY_PROJECT environmental variable is used as a part of some other EnvVars, for example MYSCRATCH and MYSOFTWARE.
(A full list of EnvVars that may affect your experience on Pawsey systems can be found on this wiki page: Setonix Software Environment)
The contents of the $HOME/.pawsey_project file may not be set to what you want as your current default, especially if you are a member of multiple projects or have moved from one project to another.
To set the contents of the $HOME/.pawsey_project file so that it contains ONLY the project that you want to be your default, run the following command, replacing <project> with your project:
echo <project> > $HOME/.pawsey_project
After you have done that, log out and log back in so that the change takes effect, then type the following commands to check that the file and the EnvVar are correct:
cat $HOME/.pawsey_project
echo $PAWSEY_PROJECT
If you wish to store more than one project name in your $HOME/.pawsey_project file, so that you can be reminded that you are in more than one project, then you can do the following, where <project1> is the project you want to be your default:
echo <project1> > $HOME/.pawsey_project
echo <project2> >> $HOME/.pawsey_project
Note the use of the double ">>" in the second command to append to, rather than overwrite, the contents of the file.
You can, of course, also edit the file directly so that its first line is the project you want as your default, rather than explicitly echo-ing lines into it.
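As a quick consistency check after logging back in, the first line of the file can be compared with the environment variable. This is a minimal sketch. `default_project` is a hypothetical helper, and it assumes that only the first line of the file determines the default, as described above.

```shell
#!/bin/sh
# Hypothetical helper: print the first line of a project file
# (defaults to $HOME/.pawsey_project), i.e. the default project.
default_project() {
    head -n 1 "${1:-$HOME/.pawsey_project}"
}

# Compare the file with the variable populated at login.
if [ -r "$HOME/.pawsey_project" ]; then
    if [ "$(default_project)" = "${PAWSEY_PROJECT:-}" ]; then
        echo "Default project is consistent: ${PAWSEY_PROJECT:-}"
    else
        echo "Mismatch: file says '$(default_project)', PAWSEY_PROJECT is '${PAWSEY_PROJECT:-}'"
    fi
fi
```

A mismatch usually just means the file was edited after login; logging out and back in should bring the two into agreement.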