...

Currently there are issues running MPI-enabled software that uses parallel I/O from within a container run by the Singularity container engine. The error message will be similar to:

Code Block
titleExample of error message

Assertion failed in file ../../../../src/mpi/romio/adio/ad_cray/ad_cray_adio_open.c at line 520: liblustreapi != NULL
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPL_backtrace_show+0x26) [0x14ac6c37cc4b]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x1ff3684) [0x14ac6bd2e684]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x2672775) [0x14ac6c3ad775]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(+0x26ae1c1) [0x14ac6c3e91c1]
/opt/cray/pe/mpich/default/ofi/gnu/9.1/lib-abi-mpich/libmpi.so.12(MPI_File_open+0x205) [0x14ac6c38e625]

HPE Cray has not yet provided a fix for this issue. Pawsey is testing several possible solutions.

...

No full workaround has been identified at this time. One current recommendation to anyone experiencing this hang is to make the distribution of ranks across nodes more compact, if possible. Pawsey staff testing has shown that when fewer nodes are used, more total ranks are needed to trigger the hang, provided other variables (such as the amount of data being sent) are kept the same.
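As a sketch only, the recommendation above could be applied in a Slurm batch script by requesting fewer nodes and packing more ranks onto each of them. The node and rank counts, partition name, and executable name below are illustrative assumptions, not Pawsey-endorsed values:

Code Block
languagebash
titleSketch: packing ranks onto fewer nodes

#!/bin/bash
# Illustrative sketch: pack 256 MPI ranks onto 2 nodes (128 per node)
# instead of spreading them across 4 nodes. Counts, partition, and the
# executable name (./my_mpi_app) are placeholders.
#SBATCH --nodes=2
#SBATCH --ntasks=256
#SBATCH --ntasks-per-node=128
#SBATCH --partition=work

srun -N 2 -n 256 ./my_mpi_app
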


ANSYS FLUENT

Multi-node MPI issue

Ansys Fluent cannot run multi-node jobs on Setonix with cray-mpich as the MPI implementation, due to incompatible binaries (they are not built for the Cray EX system with the Slingshot interconnect).

The issue has been raised with Ansys and a resolution is pending. Currently, Ansys Fluent can run on a single node with 128 cores on the work partition.

Code Block
fluent 3ddp -g -t${SLURM_NTASKS} -mpi=intel inputfile.jou 
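For context, a single-node Fluent job of the kind described above might be submitted with a batch script along these lines. This is a sketch only: the module name, version, and journal file name are assumptions and should be replaced with site-specific values:

Code Block
languagebash
titleSketch: single-node Fluent batch script

#!/bin/bash
# Illustrative single-node Fluent job: all 128 cores on one node of the
# work partition, avoiding the multi-node cray-mpich issue described above.
# The module name/version and journal file are placeholders.
#SBATCH --nodes=1
#SBATCH --ntasks=128
#SBATCH --partition=work

module load ansys/<version>   # placeholder module name

fluent 3ddp -g -t${SLURM_NTASKS} -mpi=intel inputfile.jou
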





Building Software

Performance of the Cray environment

...