Known Issues on Setonix

Known ongoing issues

Parallel IO within Containers

MPI-enabled software that uses parallel IO currently fails when run from within a container under the Singularity container engine. The error message will be similar to:

Example of error message
Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14ac6c37cc4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14ac6bd2e684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14ac6c3ad775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14ac6c3e91c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14ac6c38e625]

The exact cause of this issue is not yet known. Investigations are ongoing.

Workaround:

There is no workaround that does not require a change in the workflow. Either the container needs to be rebuilt so that it does not use parallel IO libraries (for example, if the container was built using parallel HDF5) or, if that is not possible, the software stack must be built "bare-metal" on Setonix itself (see How to Install Software).

Issues with Slingshot network

MPI jobs failing to run on compute nodes shared with other jobs

MPI jobs running on compute nodes shared with other jobs have been observed to fail at MPI_Init with error messages similar to:

Example of error message
MPICH ERROR [Rank 0] [job id 957725.1] [Wed Feb 22 01:53:50 2023] [nid001012] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......: 
MPID_Init(501)..............: 
MPIDI_OFI_mpi_init_hook(814): 
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......: 
MPID_Init(501)..............: 
MPIDI_OFI_mpi_init_hook(814): 
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)

Strictly speaking, this error is not a problem with MPI itself but with some components of the Slingshot network.



Workaround:

Firstly, allocate as few nodes as possible for the requested tasks and prefer exclusive access

Users should avoid spreading MPI ranks unnecessarily across many nodes and favor exclusive access where possible. This improves the performance of the user's own code as well as that of the other jobs running on the supercomputer.

Users should follow these best practices to avoid spreading MPI ranks unnecessarily across many nodes:

  • Any job allocation should be defined to keep communication among MPI tasks as close and effective as possible. If the job requires fewer than 128 tasks, users should explicitly request a single node (using -N 1) and not let Slurm spread the tasks beyond one node.

    • For example, if the srun step initially requires 96 tasks:

      • Users should consider modifying their code settings so that it can use the whole compute node (128 tasks) with exclusive access.

      • If modifying the code settings is not possible, users should consider using the node in exclusive access without using all of the cores available in the node.

        Leaving some cores idle in this way makes sense only if it gives considerable performance improvements over shared access. Jobs that require 64 cores or more are the usual candidates for this approach. (Take into account that the job will be charged for the use of the full node.)

      • If exclusive access to a compute node gives no considerable performance gain, request resources in shared access, but explicitly ask for all tasks to be allocated on a single node.

        (Jobs with shared access should also set the variable FI_CXI_DEFAULT_VNI. See note below.)



  • If, for example, 192 tasks are required, users should explicitly request only 2 nodes (using -N 2) and follow a similar approach:

    • Users should consider modifying their code settings so that it can use whole compute nodes (256 tasks) with exclusive access.

    • If modifying the code settings is not possible, users should consider using the compute nodes in exclusive access without using all of the cores available in each node.

      Leaving some cores idle in this way makes sense only if it gives considerable performance improvements over shared access. Jobs that require 64 cores or more per node are the usual candidates for this approach. (Take into account that the job will be charged for the use of the full nodes.)

    • If exclusive access to the compute nodes gives no considerable performance gain, request resources in shared access, but explicitly ask for all tasks to be allocated on only 2 nodes.

      (Jobs with shared access should also set the variable FI_CXI_DEFAULT_VNI. See note below.)

  • For jobs with higher task counts, please follow the same approach.
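The node counts in these examples follow from the 128 cores per compute node mentioned above. A minimal sketch of that arithmetic, assuming one core per MPI task (the function name min_nodes and the variable CORES_PER_NODE are illustrative only and are not part of Slurm or the Pawsey environment):

```shell
# Cores per compute node, taken from the 128-task figure used in the text.
CORES_PER_NODE=128

# min_nodes NTASKS: smallest number of nodes that can hold NTASKS
# single-core tasks (integer division, rounded up).
min_nodes() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

min_nodes 96     # prints 1, so request -N 1
min_nodes 192    # prints 2, so request -N 2
```

The result is the value to pass to -N. Adding Slurm's --exclusive option to the same request would give the exclusive-access variants described above.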



Secondly, use the FI_CXI_DEFAULT_VNI variable when requesting jobs in shared access

Importantly, if jobs still need to use compute nodes in shared access, HPE has provided a workaround that avoids the communication problems on shared nodes. (This workaround can be used until the Slingshot settings/libraries are properly fixed.) The workaround consists of setting the Slingshot environment variable FI_CXI_DEFAULT_VNI to a unique value before each srun step in a Slurm job script. Therefore, before any srun in the job script, users should set this variable to a "unique" random value generated from /dev/urandom.

The same FI_CXI_DEFAULT_VNI workaround should be applied when executing multiple parallel srun steps, even on an exclusive compute node, as in the examples in: Multiple Parallel Job Steps in a Single Main Job
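As an illustration of setting the variable before a job step, the sketch below draws four random bytes from /dev/urandom and prints them as one unsigned integer. The exact command recommended by HPE is not given here, so the od invocation, the helper name unique_vni and the executable name in the commented srun line are all assumptions:

```shell
# unique_vni: print one random unsigned 32-bit integer read from /dev/urandom.
# (Hypothetical helper; only the variable name FI_CXI_DEFAULT_VNI comes from the text.)
unique_vni() {
    od -vAn -N4 -tu4 < /dev/urandom | tr -d '[:space:]'
}

# Set a fresh value immediately before every srun step in the job script.
FI_CXI_DEFAULT_VNI=$(unique_vni)
export FI_CXI_DEFAULT_VNI

# The job step would follow here, for example (hypothetical executable name):
# srun -N 1 -n 96 -c 1 ./my_mpi_program
```

When a script launches several srun steps in parallel, the assignment would be repeated before each step so that every step receives its own value.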

Issues with libfabric affecting MPI and multi-node applications

Since 11-Nov-2022, problems with the update to the libfabric library on Setonix have been causing further job failures for some applications run over multiple nodes.

Errors have been noted specifically with the following applications:

  • Multicore VASP on a single node has encountered spurious memory access errors on long runs.

  • We have seen out-of-memory errors in some multi-node NAMD runs.

A full resolution of these issues is expected with the completion of the Setonix Phase 2 installation in the coming months. Some of these issues have already been resolved (see the resolved issues section).

MPI jobs hanging

MPI jobs with large comm-worlds using asynchronous pt2pt communication can sometimes hang. A root cause for this hang has not yet been identified; however, it appears to be affected by several factors, including rank distribution across nodes (total number of ranks, nodes, and ranks per node), the amount of data each rank sends and receives, and the amount of data in individual sends and receives.

Workaround:

No full workaround has been identified so far. The current recommendation for anyone experiencing this hang is to try making the distribution of ranks across nodes more compact if possible. Testing by Pawsey staff has shown that when fewer nodes are used, more total ranks are needed to trigger the hang, provided the other variables (such as the amount of data being sent) stay the same.

Performance of the Cray environment

At the moment, vendor-provided libraries such as cray-libsci and cray-fftw do not provide the best performance, and Pawsey staff will test them during the coming months. Moreover, the Cray C/C++ compiler does not seem to optimise code very well. For this reason, you should avoid using the Cray stack other than cray-mpich and crayftn, and use the alternatives provided by the Pawsey software stack instead. To compile C/C++ code, we suggest using the GCC compilers.

Linking to Cray libraries other than the default ones

Cray modulefiles do not set the LD_LIBRARY_PATH environment variable, even though Cray compilers now default to dynamic linking. As a result, programs will link to the libraries found in /opt/cray/pe/lib64, which are symlinks to the deployment of the latest version available.

Terminal 1. Content of the LD config file.
$ cat /etc/ld.so.conf.d/cray-pe.conf
/opt/cray/pe/lib64
/opt/cray/pe/lib64/cce



To avoid this behaviour, once the desired version of the Cray libraries has been loaded with the module command, set the following environment variables:

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

$ export LIBRARY_PATH=$CRAY_LIBRARY_PATH:$LIBRARY_PATH
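If LD_LIBRARY_PATH is empty when these commands are run, the result ends with a trailing colon, and the dynamic loader treats an empty entry as the current directory. A slightly more defensive sketch of the same step is shown below; the helper name prepend_path is hypothetical, and the snippet assumes the Cray modules have already set CRAY_LD_LIBRARY_PATH and CRAY_LIBRARY_PATH:

```shell
# prepend_path NAME DIRS: put DIRS in front of the variable called NAME,
# adding a ':' separator only when NAME already holds a value.
prepend_path() {
    [ -n "$2" ] || return 0          # nothing to prepend, leave NAME unchanged
    eval "_old=\${$1:-}"             # read the current value of the named variable
    if [ -n "$_old" ]; then
        eval "$1=\$2:\$_old"
    else
        eval "$1=\$2"
    fi
    export "$1"
}

# Same effect as the two export commands above, without a trailing ':'.
prepend_path LD_LIBRARY_PATH "${CRAY_LD_LIBRARY_PATH:-}"
prepend_path LIBRARY_PATH "${CRAY_LIBRARY_PATH:-}"
```

Running ldd on the resulting executable is one way to confirm which library versions will actually be loaded.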

Threads and processes placement on Zen3 cores