Known Issues on Setonix

Known ongoing issues

Parallel IO within Containers

Currently there are issues running MPI-enabled software that uses parallel IO from within a container run by the Singularity container engine. The error message will be similar to:

Example of error message
Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14ac6c37cc4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14ac6bd2e684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14ac6c3ad775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14ac6c3e91c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14ac6c38e625]

The exact cause of this issue is currently unclear; investigations are ongoing.

Workaround:

There is no workaround that does not require a change to the workflow. Either the container needs to be rebuilt without parallel IO libraries (e.g. if it was built using parallel HDF5), or, if that is not possible, the software stack must be built "bare-metal" on Setonix itself (see How to Install Software).
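
As a quick check before rebuilding, you can inspect whether the HDF5 inside the container was built with MPI (parallel) support. The following is a minimal sketch only: my_container.sif is a placeholder image name, and the h5cc wrapper is present only if the image ships the HDF5 development tools.

# Placeholder image name; h5cc may not exist or may live elsewhere in your image
singularity exec my_container.sif h5cc -showconfig | grep -i parallel
# Output reporting "Parallel HDF5: yes" indicates the container uses parallel IO
# and is affected by this issue.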

Issues with the Slingshot network

MPI jobs failing to run on compute nodes shared with other jobs

MPI jobs running on compute nodes shared with other jobs have been observed to fail at MPI_Init, producing error messages similar to:

Example of error message
MPICH ERROR [Rank 0] [job id 957725.1] [Wed Feb 22 01:53:50 2023] [nid001012] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......: 
MPID_Init(501)..............: 
MPIDI_OFI_mpi_init_hook(814): 
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......: 
MPID_Init(501)..............: 
MPIDI_OFI_mpi_init_hook(814): 
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)

This error is not strictly a problem with MPI itself, but with some components of the Slingshot network.


Workaround:

First, allocate as few nodes as possible for the requested tasks and prefer exclusive access

Users should avoid spreading MPI ranks unnecessarily across many nodes and favour exclusive access where possible. This promotes better performance not only for the user's own code, but also for the rest of the jobs running on the supercomputer.

Users should follow these best practices to avoid spreading MPI ranks unnecessarily across many nodes:

  • Any job allocation should be defined so that communication among MPI tasks is as close and effective as possible. If the job requires fewer than 128 tasks, users should explicitly request a single node (using -N 1) rather than letting Slurm spread tasks beyond one node.
    • For example, if the srun step initially requires 96 tasks:
      • Users should consider modifying their code settings so that it can use the whole compute node (128 tasks) with exclusive access:

        Batch script requesting 128 cores (1 Node) in an exclusive allocation
        #!/bin/bash --login
        #SBATCH --account=[your-project]
        #SBATCH --partition=work
        #SBATCH --ntasks=128
        #SBATCH --nodes=1
        #SBATCH --cpus-per-task=1
        #SBATCH --exclusive
        #SBATCH --time=5:00:00
        
        # ---
        # Load here the needed modules
        
        # ---
        # Note we avoid any inadvertent OpenMP threading by setting
        # OMP_NUM_THREADS=1
        export OMP_NUM_THREADS=1
        
        # ---
        # Set MPI related environment variables. (Not all need to be set)
        # Main variables for multi-node jobs (activate for multinode jobs)
        # export MPICH_OFI_STARTUP_CONNECT=1
        # export MPICH_OFI_VERBOSE=1
        #Ask MPI to provide useful runtime information (activate if debugging)
        #export MPICH_ENV_DISPLAY=1
        #export MPICH_MEMORY_REPORT=1
        
        # ---
        # Run the desired code:
        srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./code_mpi.x
        
        
      • If modifying the code settings is not possible, users should consider using the node with exclusive access while not using all of the cores available in the node:

        Batch script requesting 96 cores (1 Node) in an exclusive allocation
        #!/bin/bash --login
        #SBATCH --account=[your-project]
        #SBATCH --partition=work
        #SBATCH --ntasks=96
        #SBATCH --nodes=1
        #SBATCH --cpus-per-task=1
        #SBATCH --exclusive
        #SBATCH --time=5:00:00
        
        # ---
        # Load here the needed modules
        
        # ---
        # Note we avoid any inadvertent OpenMP threading by setting
        # OMP_NUM_THREADS=1
        export OMP_NUM_THREADS=1
        
        # ---
        # Set MPI related environment variables. (Not all need to be set)
        # Main variables for multi-node jobs (activate for multinode jobs)
        # export MPICH_OFI_STARTUP_CONNECT=1
        # export MPICH_OFI_VERBOSE=1
        #Ask MPI to provide useful runtime information (activate if debugging)
        #export MPICH_ENV_DISPLAY=1
        #export MPICH_MEMORY_REPORT=1
        
        # ---
        # Run the desired code:
        srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./code_mpi.x
        
        

        This practice of leaving some cores idle makes sense only if it brings considerable performance improvements compared with shared access. Jobs requiring 64 cores or more are the usual candidates for this approach. (Take into account that the job will be charged for the use of the full node.)

      • If exclusive access to a compute node does not bring considerable performance gains, then request resources in shared access, but explicitly ask for all tasks to be allocated on a single node:

        Batch script requesting 96 cores (1 Node) in a shared allocation
        #!/bin/bash --login
        #SBATCH --account=[your-project]
        #SBATCH --partition=work
        #SBATCH --ntasks=96
        #SBATCH --nodes=1
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=176640M              #(96*1840M)
        #SBATCH --time=5:00:00
        
        # ---
        # Load here the needed modules
        
        # ---
        # Note we avoid any inadvertent OpenMP threading by setting
        # OMP_NUM_THREADS=1
        export OMP_NUM_THREADS=1
        
        # ---
        # Set MPI related environment variables. (Not all need to be set)
        # Main variables for multi-node jobs (activate for multinode jobs)
        # export MPICH_OFI_STARTUP_CONNECT=1
        # export MPICH_OFI_VERBOSE=1
        #Ask MPI to provide useful runtime information (activate if debugging)
        #export MPICH_ENV_DISPLAY=1
        #export MPICH_MEMORY_REPORT=1
        
        # ---
        # Temporal workaround for avoiding Slingshot issues on shared nodes:
        export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
        # Run the desired code:
        srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS -m block:block:block ./code_mpi.x
        
        

        (Note the use of the variable FI_CXI_DEFAULT_VNI for jobs with shared access. See note below.)


  • If, for example, 192 tasks are required, users should explicitly request only 2 nodes (using -N 2) and follow a similar approach:
    • Users should consider modifying their code settings so that it can use whole compute nodes (256 tasks) with exclusive access:

      Batch script requesting 256 cores (2 Nodes) in an exclusive allocation
      #!/bin/bash --login
      #SBATCH --account=[your-project]
      #SBATCH --partition=work
      #SBATCH --ntasks=256
      #SBATCH --nodes=2
      #SBATCH --cpus-per-task=1
      #SBATCH --exclusive
      #SBATCH --time=5:00:00
      
      # ---
      # Load here the needed modules
      
      # ---
      # Note we avoid any inadvertent OpenMP threading by setting
      # OMP_NUM_THREADS=1
      export OMP_NUM_THREADS=1
      
      # ---
      # Set MPI related environment variables. (Not all need to be set)
      # Main variables for multi-node jobs (activate for multinode jobs)
      export MPICH_OFI_STARTUP_CONNECT=1
      export MPICH_OFI_VERBOSE=1
      #Ask MPI to provide useful runtime information (activate if debugging)
      #export MPICH_ENV_DISPLAY=1
      #export MPICH_MEMORY_REPORT=1
      
      # ---
      # Run the desired code:
      srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./code_mpi.x
      
      
    • If modifying the code settings is not possible, users should consider using the compute nodes with exclusive access while not using all of the cores available in each node:

      Batch script requesting 192 cores (2 Nodes) in an exclusive allocation
      #!/bin/bash --login
      #SBATCH --account=[your-project]
      #SBATCH --partition=work
      #SBATCH --ntasks=192
      #SBATCH --nodes=2
      #SBATCH --ntasks-per-node=96
      #SBATCH --cpus-per-task=1
      #SBATCH --exclusive
      #SBATCH --time=5:00:00
      
      # ---
      # Load here the needed modules
      
      # ---
      # Note we avoid any inadvertent OpenMP threading by setting
      # OMP_NUM_THREADS=1
      export OMP_NUM_THREADS=1
      
      # ---
      # Set MPI related environment variables. (Not all need to be set)
      # Main variables for multi-node jobs (activate for multinode jobs)
      export MPICH_OFI_STARTUP_CONNECT=1
      export MPICH_OFI_VERBOSE=1
      #Ask MPI to provide useful runtime information (activate if debugging)
      #export MPICH_ENV_DISPLAY=1
      #export MPICH_MEMORY_REPORT=1
      
      # ---
      # Run the desired code:
      srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./code_mpi.x
      
      

      This practice of leaving some cores idle makes sense only if it brings considerable performance improvements compared with shared access. Jobs requiring 64 cores or more per node are the usual candidates for this approach. (Take into account that the job will be charged for the use of the full nodes.)

    • If exclusive access to the compute nodes does not bring considerable performance gains, then request resources in shared access, but explicitly ask for all tasks to be allocated on only 2 nodes:

      Batch script requesting 192 cores (2 Nodes) in a shared allocation
      #!/bin/bash --login
      #SBATCH --account=[your-project]
      #SBATCH --partition=work
      #SBATCH --ntasks=192
      #SBATCH --nodes=2
      #SBATCH --ntasks-per-node=96
      #SBATCH --cpus-per-task=1
      #SBATCH --mem=176640M              #(96*1840M) per node
      #SBATCH --time=5:00:00
      
      # ---
      # Load here the needed modules
      
      # ---
      # Note we avoid any inadvertent OpenMP threading by setting
      # OMP_NUM_THREADS=1
      export OMP_NUM_THREADS=1
      
      # ---
      # Set MPI related environment variables. (Not all need to be set)
      # Main variables for multi-node jobs (activate for multinode jobs)
      export MPICH_OFI_STARTUP_CONNECT=1
      export MPICH_OFI_VERBOSE=1
      #Ask MPI to provide useful runtime information (activate if debugging)
      #export MPICH_ENV_DISPLAY=1
      #export MPICH_MEMORY_REPORT=1
      
      # ---
      # Temporal workaround for avoiding Slingshot issues on shared nodes:
      export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
      # Run the desired code:
      srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS -m block:block:block ./code_mpi.x
      
      

      (Note the use of the variable FI_CXI_DEFAULT_VNI for jobs with shared access. See note below.)

  • For jobs requiring higher task counts, please follow the same approach.


Second, use the FI_CXI_DEFAULT_VNI variable when requesting jobs in shared access

Very importantly, if jobs still need to use compute nodes in shared access, HPE has provided a workaround that avoids the problems with communications among shared nodes. (This workaround can be used until the Slingshot settings/libraries are properly fixed.) The workaround consists of setting the Slingshot environment variable FI_CXI_DEFAULT_VNI to a unique value before the execution of each srun step in a Slurm job script. Therefore, before any srun in the job script, users should assign this variable a "unique" random value using /dev/urandom, as indicated in the two example scripts above for shared access.

The same FI_CXI_DEFAULT_VNI workaround should be applied when executing multiple parallel srun steps, even on an exclusive compute node, as in the examples in: Multiple Parallel Job Steps in a Single Main Job.
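
As an illustration, the following is a minimal sketch of this pattern for two background srun steps; the executable names (./code_step1.x, ./code_step2.x) and task counts are placeholders.

# Assign a fresh random VNI before launching each srun step
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N 1 -n 32 -c 1 --exact ./code_step1.x &

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N 1 -n 32 -c 1 --exact ./code_step2.x &

# Wait for both background steps to finish
wait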

Issues with libfabric affecting MPI and multi-node applications

Since 11-Nov-2022, problems with the update to the libfabric library on Setonix have been leading to further instances of job failure for some applications run over multiple nodes.

Errors have been noted specifically with the following applications:

  • Multicore VASP on a single node has encountered spurious memory access errors on long runs.
  • We have seen out of memory errors in some multinode NAMD runs.

A full resolution of these issues is expected with the completion of the Setonix Phase 2 installation in the following months. Some of these issues have already been resolved (see the Resolved issues section below).

MPI jobs hanging

MPI jobs with large comm-worlds using asynchronous point-to-point communication can sometimes hang. A root cause for this hang has not yet been identified; however, it appears to be affected by several factors, including rank distribution across nodes (total number of ranks, nodes, and ranks per node), the amount of data each rank sends and receives, and the amount of data in individual sends and receives.

Workaround:

At this point in time, no full workaround has been identified. One current recommendation for anyone experiencing this hang is to try to make the distribution of ranks across nodes more compact, if possible. Pawsey staff testing has shown that when fewer nodes are used, more total ranks are needed to trigger the hang, given that the other variables (such as the amount of data being sent) are the same.
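
As an illustration of what "more compact" could look like (the rank counts below are placeholders, not tested thresholds), a 512-rank job spread thinly over many nodes could instead be packed onto the minimum number of whole nodes:

# Thinly spread: 512 ranks over 8 nodes (64 ranks per node)
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=64

# More compact alternative: the same 512 ranks packed onto 4 nodes (128 ranks per node)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128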

Performance of the Cray environment

At the moment, vendor-provided libraries such as cray-libsci and cray-fftw do not provide the best performance, and Pawsey staff will test them during the coming months. Moreover, the Cray C/C++ compiler does not seem to optimise code very well. For this reason, you should avoid using the Cray stack other than cray-mpich and crayftn, and instead use the alternatives provided by the Pawsey software stack. To compile C/C++ code, we suggest using the GCC compilers.
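
A minimal sketch of this suggestion, assuming the standard HPE Cray PrgEnv-gnu module and the cc/CC compiler wrappers (exact module names, versions, and source file names below are illustrative only):

$ module swap PrgEnv-cray PrgEnv-gnu     # only needed if PrgEnv-cray is loaded; PrgEnv-gnu may already be the default
$ cc -O2 -o code_mpi_c.x code_mpi.c      # C: the wrapper uses GCC and links cray-mpich
$ CC -O2 -o code_mpi_cpp.x code_mpi.cpp  # C++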

Linking to Cray libraries other than the default ones

Cray modulefiles do not set the LD_LIBRARY_PATH environment variable, despite the fact that Cray compilers now default to dynamic linking. As a result, programs will link to the libraries found in /opt/cray/pe/lib64, which are symlinks to the latest available deployment.

Terminal 1. Content of the LD config file.
$ cat /etc/ld.so.conf.d/cray-pe.conf
/opt/cray/pe/lib64
/opt/cray/pe/lib64/cce


To avoid this behaviour, once the desired version of the Cray libraries has been loaded with the module command, set the appropriate environment variables accordingly.

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

$ export LIBRARY_PATH=$CRAY_LIBRARY_PATH:$LIBRARY_PATH
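
Putting it together, a minimal sketch of the full sequence (the cray-fftw module version shown is illustrative only; substitute the version you actually need):

$ module load cray-fftw/3.3.10.3    # illustrative version number
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export LIBRARY_PATH=$CRAY_LIBRARY_PATH:$LIBRARY_PATH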

Thread and process placement on Zen3 cores

Currently, Slurm presents unwanted behaviours that have an impact on the performance of a job when it is submitted without the --exclusive sbatch flag. In particular, Slurm loses awareness of the Zen3 architecture, and threads and/or processes are placed onto cores with no reasonable mapping.

To avoid the issue, pass the -m block:block:block flag to srun within an sbatch script or an interactive session.
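
For example, in a shared-access job step (task counts and executable name are placeholders):

srun -N 1 -n 32 -c 1 -m block:block:block ./code_mpi.x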

The default project for a user on Setonix may be incorrect

A user's default project is determined by the content of the ~/.pawsey_project file, which is read on login and used to populate the $PAWSEY_PROJECT environment variable (EnvVar).
The PAWSEY_PROJECT environment variable is used as part of some other EnvVars.
(A full list of EnvVars that may affect your experience on Pawsey systems can be found on this wiki page: Setonix Software Environment.)

The contents of the ~/.pawsey_project file may not be set to what you want as your current default, especially if you are a member of multiple projects or have moved from one project to another.

To set the contents of the ~/.pawsey_project file so that it contains the project you want as your default, run the following command, where <project> should be replaced by your project:

echo <project> > $HOME/.pawsey_project

After you have done that, log out and then log back in so that the change can take effect, then run the following commands to check that the file and the EnvVar are correct:

cat $HOME/.pawsey_project

echo $PAWSEY_PROJECT


Slurm

  • There is a known bug in Slurm related to memory requests, which will be fixed with a future patch. The total amount of memory on a node is incorrectly calculated when requesting 67 or more MPI processes per node (--ntasks-per-node=67 or higher).
    • Workaround: Provide the total memory required with --mem=<desired_value>
  • For shared node access, the pinning of CPUs to MPI processes or OpenMP threads will be poor.
    • Workaround: srun should be run with -m block:block:block
  • Email notifications implemented through the --mail-type and --mail-user options of Slurm are currently not working. The issue will be investigated soon.
  • The use of both --ntasks-per-node and an explicit memory request (including --mem=0 and --exclusive) can lead to some job requests being rejected with an error message of the form error: Job submit/allocate failed: Requested node configuration is not available.
    • Workaround: Only include total job resources (e.g. --ntasks and --nodes) in the resource request (sbatch script or salloc), and distribute tasks across nodes when invoking srun, i.e. srun --ntasks-per-node=<desired_value> (a sketch is given after this list).
  • When asking for a number of GPUs with the option --gpus-per-node=<desired_value>, the number of GPUs visible to processes launched during an srun call does not always match what is asked for.
    • Workaround: Use --gres=gpu:<desired_value> instead of --gpus-per-node, as recommended throughout our documentation.
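
As referenced above, a minimal sketch of requesting only total job resources in the header and distributing tasks at srun time (node, task, and time values are placeholders):

#!/bin/bash --login
#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --exclusive
#SBATCH --time=1:00:00

# Distribute tasks across nodes at the srun level rather than with --ntasks-per-node in the header
srun -N $SLURM_JOB_NUM_NODES --ntasks-per-node=128 ./code_mpi.x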

Quota issues on /software

To avoid the metadata servers of the /software filesystem being overwhelmed with too many inodes, Pawsey imposes a 100k quota on the number of files each user can have on said filesystem. However, we acknowledge the chosen limit may be too strict for some software such as weather models, large Git repositories, etc. We are working on a solution and will update you as soon as we have found one that meets the requirements of Pawsey and user applications.

You can run the following command to check how many files you own on /software:

lfs quota -u $USER -h /software 

If the high file count comes from a Conda environment that is mature and won't see many changes, you should consider installing it within a Singularity or Docker container. In this way, it will count as a single file. To learn more about containers, check out the following documentation page: Containers.
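
To gauge whether a Conda environment is a good candidate, you can first count how many files it contains. A minimal sketch, assuming the environment lives under $MYSOFTWARE (the path below is a placeholder; adjust it to your setup):

# Placeholder path to your Conda environment
conda_env_dir=$MYSOFTWARE/conda/envs/myenv
find "$conda_env_dir" -type f | wc -l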

Cannot find software installed with Spack

If you used Spack prior to the October and November maintenance, you are affected by a configuration bug: further software installations using Spack will be placed under the gcc/11.2.0 directory tree of your software installation path (both personal and project-wide), instead of the gcc/12.1.0 tree that is present by default in the MODULEPATH environment variable. The issue happens because, although we have updated all the necessary configuration files to reflect the move to a newer version of GCC, the previous locations for modules and software installations are cached, on a per-user basis, within the ~/.spack directory. We were not aware of this behaviour and will ensure it does not happen in the future. For the moment, you can either remove the ~/.spack directory or add the gcc/11.2.0 tree back to the MODULEPATH environment variable using the following command:

$ module use $MYSOFTWARE/setonix/modules/zen3/gcc/11.2.0

Resolved issues and incidents


MPI Issues (01-Feb-2023)

Poor performance

MPI-enabled jobs showed poor performance in point-to-point communication. An update to the libfabric library has addressed this issue.

Jobs Hanging

Jobs could hang during asynchronous communication where there were large time differences between a send and the corresponding receive. The code would hang with an MPI rank waiting to receive a message; if the code produces logging information while running, the log would not have been updated for a while. An update to the libfabric library has addressed this issue.

Memory Leak

There was a memory leak in the underlying libfabric used by MPI, which resulted in large MPI jobs crashing and could even impact single-node jobs with a large memory footprint. An update to the libfabric library has addressed this issue.

Low-memory nodes

The aforementioned memory leak also had a knock-on effect: even if MPI jobs ran successfully, the leak would gradually reduce the amount of memory available to users over time. An update to the libfabric library has addressed this issue.

Faulty nodes

The memory leak and the resulting MPI crash could leave a node unable to properly connect to the network, resulting in MPI jobs failing to initialise.

How it presents itself: MPI initialisation on the affected node will fail, with error messages similar to:

MPI Initialisation error
aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......:
MPID_Init(501)..............:
MPIDI_OFI_mpi_init_hook(647): OFI fi_open domain failed (ofi_init.c:647:MPIDI_OFI_mpi_init_hook:Invalid argument)
MPICH ERROR [Rank 0] [job id 315213.0] [Wed Nov 30 20:01:43 2022] [nid001023] - Abort(1091983) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......:
MPID_Init(501)..............:
MPIDI_OFI_mpi_init_hook(647): OFI fi_open domain failed (ofi_init.c:647:MPIDI_OFI_mpi_init_hook:Invalid argument)

An update to the libfabric library has addressed this specific issue. However, there may be other related issues with MPI initialisation, depending on the job being run. Please contact help@pawsey.org.au if you encounter MPI initialisation-related issues.

Large jobs with asynchronous point-to-point communication crash

Jobs with a large number of ranks could crash, even with a small memory footprint, if the code had portions with lots of asynchronous communication. The crash could occur when there were more than approximately 10,000 active messages being sent and received, and was observed when setting export MPICH_OFI_STARTUP_CONNECT=1.

How it presents itself: There are a variety of error messages, but all will contain a mention of OFI (relating to libfabric).

OFI error message
MPICH ERROR [Rank 231] [job id 141125.0] [Thu Aug  4 13:05:08 2022] [nid001064] - Abort(740983695) (rank 231 in comm 0): Fatal error in PMPI_Waitall: Other MPI error, error stack:
PMPI_Waitall(339)..............: MPI_Waitall(count=255, req_array=0x65ea40, status_array=0x1) failed
MPIR_Waitall(167)..............:
MPIR_Waitall_impl(51)..........:
MPID_Progress_wait(186)........:
MPIDI_Progress_test(80)........:
MPIDI_OFI_handle_cq_error(1059): OFI poll failed (ofi_events.c:1061:MPIDI_OFI_handle_cq_error:Input/output error - transport
retry counter exceeded)

An update to the libfabric library has addressed this specific issue. However, there may be other related issues with large numbers of messages being sent, depending on the job being run. Please contact help@pawsey.org.au if you encounter such issues.

Hardware failure (25-Nov-2022)

A failure on worker nodes affected the filesystem service DVS. This caused a large number of nodes to go out of service and induced failures in the SLURM scheduler. Jobs submitted with SLURM may have failed immediately or have been paused indefinitely.

A fix and reboot of the faulty nodes was completed on 30-Nov-2022. For a detailed description of the incident, see /wiki/spaces/US/pages/51928992

Internal DNS failure (24-Nov-2022)

Partial disk failure led to intermittent failures in internal name resolution services. Jobs would terminate with errors relating to getaddrinfo() or related library functions, for example:

srun: error: slurm_set_addr: Unable to resolve "nid001084-nmn"

This issue was resolved 25-Nov-2022.

Current status and live incidents

Live status of Pawsey systems is available at https://status.pawsey.org.au/

Maintenance and Incidents