Known and Resolved Issues

This page lists the known issues and those that have been resolved in this upgrade.

Known Issues 

Parallel IO within Containers

Currently there are issues running MPI-enabled software that makes use of parallel IO from within a container run by the Singularity container engine. The error message seen will be similar to:

-

HPE Cray has not yet provided a fix for this issue. Pawsey is testing several possible solutions.

Workaround:

There is no workaround that does not require a change in the workflow. Either the container needs to be rebuilt so that it does not make use of parallel IO libraries (e.g. if the container was built using parallel HDF5), or, if that is not possible, the software stack must be built "bare metal" on Setonix itself (see How to Install Software).
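A quick way to check whether the HDF5 installation inside a container was built with MPI support is to query its h5cc wrapper, if the image provides one; the container name my_container.sif below is a placeholder:

$ singularity exec my_container.sif h5cc -showconfig | grep "Parallel HDF5"
                    Parallel HDF5: yes

If the output reports "yes", the container falls into the affected category.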

Issues with the Slingshot network

MPI jobs failing to run on compute nodes shared with other jobs

MPI jobs running on compute nodes shared with other jobs have been observed to fail at MPI_Init, producing error messages similar to:

Example of error message
MPICH ERROR [Rank 0] [job id 957725.1] [Wed Feb 22 01:53:50 2023] [nid001012] - Abort(1616271) (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......: 
MPID_Init(501)..............: 
MPIDI_OFI_mpi_init_hook(814): 
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)

aborting job:
Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(170).......: 
MPID_Init(501)..............: 
MPIDI_OFI_mpi_init_hook(814): 
create_endpoint(1359).......: OFI EP enable failed (ofi_init.c:1359:create_endpoint:Address already in use)

This error is not strictly a problem with MPI itself, but with some components of the Slingshot network.


Workarounds:

There are several workarounds. 

  • Define the environment variable FI_CXI_DEFAULT_VNI in the sbatch script. Specifically, before any srun in the job script, assign this variable a "unique" random value generated from /dev/urandom (see the example script below). The same FI_CXI_DEFAULT_VNI workaround should also be applied when executing multiple parallel srun steps, even on an exclusive compute node, as in the examples in: Multiple Parallel Job Steps in a Single Main Job

    Batch script requesting 96 cores (1 Node) in a shared allocation
    #!/bin/bash --login
    #SBATCH --account=[your-project]
    #SBATCH --partition=work
    #SBATCH --ntasks=96
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=176640M              #(96*1840M)
    #SBATCH --time=5:00:00
    
    # ---
    # Load here the needed modules
    
    # ---
    # Note we avoid any inadvertent OpenMP threading by setting
    # OMP_NUM_THREADS=1
    export OMP_NUM_THREADS=1
    
    # ---
    # Set MPI related environment variables. (Not all need to be set)
    # Main variables for multi-node jobs (activate for multinode jobs)
    # export MPICH_OFI_STARTUP_CONNECT=1
    # export MPICH_OFI_VERBOSE=1
    #Ask MPI to provide useful runtime information (activate if debugging)
    #export MPICH_ENV_DISPLAY=1
    #export MPICH_MEMORY_REPORT=1
    
    # ---
    # Temporary workaround for avoiding Slingshot issues on shared nodes:
    export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
    # Run the desired code:
    srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS -m block:block:block ./code_mpi.x
    
    
  • Underutilise a node and request --exclusive access (see the sketch below). Leaving some cores idle only makes sense if considerable performance improvements are observed compared to the shared-access approach. Jobs requiring 64 or more cores per node are the usual candidates for this approach. (Take into account that the job will be charged for the use of the full node.)
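    A minimal sketch of this approach, with an illustrative task count of 64 (the remaining cores of the 128-core node are left idle; adjust to your application):

    Batch script requesting an exclusive node with 64 tasks
    #!/bin/bash --login
    #SBATCH --account=[your-project]
    #SBATCH --partition=work
    #SBATCH --nodes=1
    #SBATCH --exclusive                #(the whole node is allocated and charged)
    #SBATCH --ntasks=64                #(illustrative: adjust to your application)
    #SBATCH --cpus-per-task=1
    #SBATCH --time=5:00:00

    export OMP_NUM_THREADS=1

    # A single srun step on an exclusive node does not need FI_CXI_DEFAULT_VNI,
    # as the node is not shared with other jobs
    srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS -m block:block:block ./code_mpi.x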

MPI jobs hanging

MPI jobs with large communicators using asynchronous point-to-point communication can sometimes hang. A root cause for this hang has not yet been identified; however, it appears to be affected by several factors, including the distribution of ranks across nodes (total number of ranks, number of nodes, and ranks per node), the total amount of data each rank sends and receives, and the amount of data in individual sends and receives.

Workaround:

At this point in time, no full workaround has been identified. The current recommendation for anyone experiencing this hang is to make the distribution of ranks across nodes more compact, if possible. Testing by Pawsey staff has shown that when fewer nodes are used, more total ranks are needed to trigger the hang, provided the other variables (such as the amount of data being sent) remain the same.
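As an illustrative sketch only (the rank counts are hypothetical; Setonix work nodes have 128 cores), a request that spreads ranks thinly across many nodes can be repacked onto fewer, fully populated nodes:

Original request, more prone to the hang (512 ranks over 16 nodes):
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=32

More compact request for the same 512 ranks, less likely to hang in Pawsey testing:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=128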


ANSYS FLUENT

Multi-node MPI issue

ANSYS Fluent cannot run multi-node jobs on Setonix with cray-mpich as the MPI implementation, because its bundled binaries are incompatible (they are not built for the Cray EX system with the Slingshot interconnect).

The issue has been raised with Ansys and a resolution is pending. Currently, ANSYS Fluent can run on a single node with 128 cores on the work partition, for example with the following command:

fluent 3ddp -g -t${SLURM_NTASKS} -mpi=intel inputfile.jou 
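A minimal single-node batch script sketch is shown below. The module name and journal file are placeholders (check module avail ansys for the version installed on Setonix):

#!/bin/bash --login
#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --time=5:00:00

# Placeholder module name: check "module avail ansys" for the installed version
module load ansys/<version>

# Single-node run using the Intel MPI bundled with Fluent (cray-mpich is not used)
fluent 3ddp -g -t${SLURM_NTASKS} -mpi=intel inputfile.jou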

Building Software

Performance of the Cray environment

At the moment, vendor-provided libraries such as cray-libsci and cray-fftw do not provide the best performance; Pawsey staff will be testing them over the coming months. Moreover, the Cray C/C++ compiler does not seem to optimise code very well. For these reasons, you should avoid using the Cray stack other than cray-mpich and crayftn, and use the alternatives provided by the Pawsey software stack instead. To compile C/C++ code, we suggest using the GCC compilers. As cray-libsci is automatically loaded by the PrgEnv-* modules, we recommend unloading it explicitly when building and running software, e.g. module unload cray-libsci. This may be particularly necessary if you are using SLATE or other packages that provide GPU-enabled interfaces to BLAS and LAPACK.
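For example, a typical build sequence might look like the following; the commands and file names are illustrative, and on Setonix the GNU environment may already be loaded by default:

$ module load PrgEnv-gnu        # use GCC via the cc/CC/ftn compiler wrappers
$ module unload cray-libsci     # avoid linking against cray-libsci
$ cc -O2 -o my_app my_app.c     # the cc wrapper invokes gcc and links cray-mpich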

Linking to Cray libraries other than the default ones

Cray modulefiles do not set the LD_LIBRARY_PATH environment variable, despite the fact that Cray compilers now default to dynamic linking. As a result, programs will link at run time to the libraries found in /opt/cray/pe/lib64, which are symlinks to the latest deployed version.

Terminal 1. Content of the LD config file.
$ cat /etc/ld.so.conf.d/cray-pe.conf
/opt/cray/pe/lib64
/opt/cray/pe/lib64/cce

To avoid this behaviour, once the desired version of the Cray libraries has been loaded with the module command, set the appropriate environment variables accordingly.

Terminal 2. Update library paths
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export LIBRARY_PATH=$CRAY_LIBRARY_PATH:$LIBRARY_PATH
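To confirm which versions a binary actually resolves at run time after setting these variables, ldd can be used (my_app is a placeholder); the Cray libraries should now be picked up from the versioned installation directories rather than from /opt/cray/pe/lib64:

$ ldd ./my_app | grep cray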

Slurm

  • There is a known bug in Slurm related to memory requests, which will be fixed by a future patch: the total amount of memory on a node is calculated incorrectly when requesting 67 or more MPI processes per node (--ntasks-per-node=67 or higher).
    • Workaround: provide the total memory required explicitly with --mem=<desired_value> (see the sketch below).
  • For shared node access, the pinning of CPUs to MPI processes or OpenMP threads can be poor.
    • Workaround: run srun with -m block:block:block within an sbatch script or an interactive session (see the sketch below).
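A minimal sketch combining both workarounds; the task count and memory values are illustrative, using 1840M per task as in the example above:

#!/bin/bash --login
#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=80       #(67 or more tasks per node hits the memory bug)
#SBATCH --mem=147200M              #(workaround: state the total memory explicitly, 80*1840M)
#SBATCH --time=1:00:00

# Shared node: also apply the Slingshot workaround described above
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)

# Workaround for poor pinning on shared nodes: explicit block distribution
srun -N 1 -n 80 -m block:block:block ./code_mpi.x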

Resolved Issues

Slurm

  • Email notifications now work.