Example Slurm Batch Scripts for Setonix on CPU Compute Nodes

Defining a batch script that correctly requests and uses supercomputing resources is easier if you are provided with a good starting point. This page collects examples of batch scripts for the most common scenarios for jobs on CPU compute nodes, such as multi-node MPI jobs.

To better understand the code presented on this page, you should be familiar with the concepts presented in Job Scheduling.

Important

It is highly recommended that you specify values for the --ntasks, --cpus-per-task, --nodes (or --ntasks-per-node) and --time options that are optimal for the job.


Also use --mem for jobs that will share node resources with other jobs (shared access), or --exclusive to allocate all of a node's resources to a single job (exclusive access).

Important

In principle, jobs perform better when running on nodes with exclusive access, so it is recommended to plan for jobs that use a number of tasks that is a multiple of 128. (Users may also consider using --exclusive for jobs with fewer than 128 cores per node if the core count cannot be changed but exclusive access still gives better performance, keeping in mind that the allocation is charged for the full 128 cores of each exclusive node.)


If users cannot run their jobs with exclusive access to the compute nodes and prefer to run with shared access (either because the number of tasks is not adjustable or because there is no performance advantage that justifies the allocation charge for the rest of the cores in the node), it is very important to request the minimum number of nodes that can provide the needed resources. So, if the number of tasks is less than 128, explicitly ask for --nodes=1; if 128 < ntasks <= 256, use --nodes=2, and so on. This reduces network traffic when tasks on different nodes communicate, which allows for better performance for all jobs running on the cluster.
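As a rough illustration of that rule (a sketch only; the task count below is a made-up example), the minimum node count is simply ntasks divided by 128 and rounded up:

# Hedged example: compute the minimum number of shared-access nodes,
# assuming one core per task and 128 cores per Setonix CPU node.
ntasks=300                 # hypothetical task count for this illustration
cores_per_node=128
nodes=$(( (ntasks + cores_per_node - 1) / cores_per_node ))   # ceiling division
echo "--ntasks=${ntasks} --nodes=${nodes}"                    # prints: --ntasks=300 --nodes=3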

Getting started - a simple batch script

The first example batch script is a very simple one. It is meant to illustrate the Slurm options that are usually used when running jobs on Setonix. In this instance, the script executes the hostname  command,  which reports the hostname of the compute node executing it.

Listing 0. A very simple batch script
#!/bin/bash -l
#SBATCH --account=<project>
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1840M
#SBATCH --time=00:05:00

srun -N 1 -n 1 -c 1 hostname

Slurm batch scripts are Bash scripts. They must start with the #!/bin/bash shebang line. Furthermore, note the -l (or --login) option: it must be there for the correct system modules to be loaded properly.

Next, the #SBATCH lines specify to Slurm the computational resources we want for our job. In this example:

  • The --account option tells the system which allocation to charge for the compute time.
  • The --ntasks option specifies the maximum number of processes (task is MPI terminology for a process) your program will execute in parallel at any time. This value is used to determine the physical resources needed to accommodate the job requirements.
  • The --ntasks-per-node option specifies how many tasks per node you want to run. With this information, Slurm calculates the number of nodes to be reserved for your job.
  • The --cpus-per-task option specifies how many CPU cores per task you need. The total number of requested CPU cores per node is then ntasks-per-node*cpus-per-task. On Setonix a maximum of 128 CPU cores per node is available when simultaneous multithreading is not used (the default), or 256 otherwise. In this case, only 1 CPU core per task was requested because hostname is a serial program.
  • The --mem option specifies how much memory to use on each node allocated for the job, and it must be indicated for proper allocation of jobs sharing node resources (shared access). In this case, only one core out of the 128 in a node is to be utilised, so it makes sense to share the node's resources instead of reserving the whole node for this single-core job. We ask for the amount of memory that corresponds to a single core (total node RAM / 128); the arithmetic is sketched just after this list. Note that only integer values are recognised, so we use 1840M (instead of the invalid non-integer value 1.84G). We currently recommend --mem over --mem-per-cpu, as with the current version of Slurm on Setonix the indication of memory per CPU is creating some allocation problems.
  • The --time option sets the maximum allowable running time for your job (that is, the wall-clock limit). This job will be cut off by Slurm at the 5-minute mark.
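The arithmetic behind the 1840M figure is just the usable node memory divided by the 128 cores. A quick sketch (the ~235 GB usable figure is the one quoted for the CPU nodes later on this page):

# Rough check of the per-core memory share used above (illustration only):
#   usable node RAM ~= 235 GB ~= 235520 MB ; 235520 MB / 128 cores = 1840 MB per core
echo $(( 235520 / 128 ))   # prints 1840, hence --mem=1840M for a single-core job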

Finally, the srun command launches the hostname executable, in what is known as a job step:

  • -N 1 (or --nodes=1) tells srun to use only 1 node.
  • -n 1 (or --ntasks=1) tells srun to spawn only a single task.
  • -c 1 (or --cpus-per-task=1) tells srun to assign one CPU (one core) per task.

The above configuration is the typical one for a serial job, that is, a job executing a serial program. A serial program is one that does not use multiple processes and/or multiple threads. Note that nodes on Setonix are shared by default, so the example job will run on a single CPU core allocated specifically for this job, with the rest of the node shared with other jobs.

Multithreaded jobs (OpenMP, pthreads, etc ...)

Multithreaded jobs run a program that launches multiple threads to perform a computation, with each thread assigned a different CPU core for execution. A program may use directives (OpenMP), threading libraries (pthreads), or third-party libraries to take advantage of thread parallelism.

Shared access to the node

Shared access to the compute nodes is the default on Setonix and is the recommended use, unless sharing the node affects the performance of your code. Even though this is the default, users still need to request the required memory (--mem) and take care of some specific Slurm options to reserve cores in a more packed form and promote better resource utilisation.

Listing 1 shows an example of a single process using OpenMP to distribute the work over only 32 cores of a compute node. The --cpus-per-task option assigns a number of CPU cores to each task. A value greater than one will allow threads of a process to run in parallel.

Listing 1. A single process using 32 cores on a node for multithreaded job.
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want an OpenMP job with 32 threads
# a wall-clock time limit of one hour.
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123). 

#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=58880M      # Amount of memory per node; needed when asking for shared resources
#SBATCH --time=01:00:00

# ---
# Load here the needed modules

# ---
# OpenMP settings
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Number of threads; in this case it will be 32
export OMP_PLACES=cores     # Bind threads to cores
export OMP_PROC_BIND=close  # Fix each thread to its place, packing threads as close as possible (works together with OMP_PLACES above)

# ---
# Temporary workaround for avoiding Slingshot issues on shared nodes:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
# Run the desired code:
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS -m block:block:block ./code_omp.x

Note the use of SLURM variables to avoid the repetition of numbers used in the settings, which is prone to errors.

Also note the explicit use of the -N, -n, and -c options in the srun command, of which the -c option is strictly necessary for the correct allocation of multithreaded jobs.

The --mem option is needed for correct allocation of shared access jobs and specifies how much memory to use on each node allocated for the job. In this case the amount of memory corresponding to 1/4 of the available resources of the node is requested. Note that Slurm only accepts integer values, hence the use of 58880M (instead of the invalid non-integer value 58.88G). We currently recommend --mem over --mem-per-cpu, as with the current version of Slurm on Setonix the indication of memory per CPU is creating some allocation problems.

Note the use of the -m block:block:block option of srun. This option is not very self-explanatory, but it ensures that threads are packed together on contiguous cores. Furthermore, our recommendation is to use a number of threads that is a multiple of 8 (the number of cores per chiplet of the AMD processor) for best L3 cache utilisation.

Note that the OMP_PLACES and OMP_PROC_BIND variables are used to control thread affinity in OpenMP jobs (the settings above are recommended, but many other options for these variables are possible and may be tested to improve performance).
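If you want to verify where the threads actually land, one hedged option is to ask the OpenMP runtime itself to report its placement. OMP_DISPLAY_AFFINITY is a standard OpenMP 5.0 variable (the exact output format depends on the compiler's runtime), and it can simply be exported before the srun line of Listing 1:

# Optional placement check (assumes the OpenMP runtime supports OpenMP 5.0):
export OMP_DISPLAY_AFFINITY=true   # each thread prints one line describing its binding at startup
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS -m block:block:block ./code_omp.x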

Exclusive access to the node

Exclusive access to the compute nodes is NOT the default on Setonix (contrary to previous Cray systems). Therefore, the use of --exclusive is needed to guarantee exclusive use by a single job whenever it is required. The request for exclusive use also helps Slurm to place threads and/or processes onto cores with an efficient mapping. The major drawback is that the full node resources are charged to your allocation balance (even if some cores remain idle during the job), so this option needs to be used with care (also taking into account that idle resources may prevent other jobs from running on the supercomputer).

Listing 2 shows an example of a single process using OpenMP to distribute the work over all the cores on a node. The --cpus-per-task option assigns a number of CPU cores to each task. A value greater than one will allow threads of a process to run in parallel.

Listing 2. A single process using all cores on a node for multithreaded job.
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want an OpenMP job with 128 threads
# a wall-clock time limit of one hour.
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123). 

#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=128
#SBATCH --exclusive
#SBATCH --time=01:00:00

# ---
# Load here the needed modules

# ---
# OpenMP settings
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Number of threads; in this case it will be 128
export OMP_PLACES=cores     # Bind threads to cores
export OMP_PROC_BIND=close  # Fix each thread to its place, packing threads as close as possible (works together with OMP_PLACES above)

# ---
# Run the desired code:
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./code_omp.x

Note the use of SLURM variables to avoid the repetition of numbers used in the settings, which is prone to errors.

Also note the explicit use of the -N, -n, and -c options in the srun command, of which the -c option is strictly necessary for the correct allocation of multithreaded jobs.

The --exclusive option overrides the default shared access to the node, giving the job exclusive access. It is still needed for proper allocation of resources even when requesting all 128 cores available per node.

Note that the OMP_PLACES and OMP_PROC_BIND variables are used to control thread affinity in OpenMP jobs (the settings above are recommended, but many other options for these variables are possible and may be tested to improve performance).

MPI jobs

This section presents examples of batch scripts designed to run parallel and distributed computations making use of MPI. In this scenario, the use of the srun command is critical for the creation of many tasks on multiple nodes.

Important

It is highly recommended that you set the following environment variables in your batch script when running multinode jobs:

MPI Environment variables
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1

Shared access to the node

Shared access to the compute nodes is the default on Setonix and is the recommended use, unless sharing the node affects the performance of your code. Even though this is the default, users still need to request the required memory (--mem) and take care of some specific Slurm options to reserve cores in a more packed form and promote better resource utilisation.

Listing 3 shows an example where a total of 64 MPI tasks are created from an executable named code_mpi.x. The objective is to use all the cores of only one socket of the compute node; that is, 64 tasks per node and per socket and one MPI task per core.

Listing 3. Batch script requesting 64 cores in a single socket of a compute node for a pure MPI job.
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want to execute 64 tasks
# for an MPI job that will share the rest of the node with other jobs.
# The plan is to utilise fully 1 of the two sockets available (64 cores) and
# a wall-clock time limit of 24 hours
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123)

#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=117G
#SBATCH --time=24:00:00

# ---
# Load here the needed modules

# ---
# Note we avoid any inadvertent OpenMP threading by setting
# OMP_NUM_THREADS=1
export OMP_NUM_THREADS=1

# ---
# Set MPI related environment variables. (Not all need to be set)
# Main variables for multi-node jobs (activate for multinode jobs)
#export MPICH_OFI_STARTUP_CONNECT=1
#export MPICH_OFI_VERBOSE=1
#Ask MPI to provide useful runtime information (activate if debugging)
#export MPICH_ENV_DISPLAY=1
#export MPICH_MEMORY_REPORT=1

# ---
# Temporary workaround for avoiding Slingshot issues on shared nodes:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
# Run the desired code:
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS -m block:block:block ./code_mpi.x

Note the use of SLURM variables to avoid the repetition of numbers used in the settings, which is prone to errors.

Also note the explicit use of -N, -n, and -c options in the srun command.

The --mem option is needed for correct allocation of shared access jobs and specifies how much memory to use on each node allocated for the job. In this case the amount of memory corresponding to 1/2 of the available resources of the node is requested. Note that Slurm only accepts integer values, hence the use of the rounded integer 117G (instead of the invalid non-integer value 117.5G). We currently recommend --mem over --mem-per-cpu, as with the current version of Slurm on Setonix the indication of memory per CPU is creating some allocation problems.

Note the use of the -m block:block:block option of srun. This option is not very self-explanatory, but it ensures that MPI tasks are placed on contiguous cores. Furthermore, our recommendation is to use a number of MPI tasks that is a multiple of 8 (the number of cores per chiplet of the AMD processor) for best L3 cache utilisation.
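If you want to confirm the resulting task placement, srun can report the CPU binding it applies; adding verbose to the standard --cpu-bind option is enough (the exact lines printed depend on the Slurm version). A sketch based on Listing 3:

# Optional placement check: ask srun to print the CPU binding of each task.
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS \
     -m block:block:block --cpu-bind=verbose ./code_mpi.x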

Temporary workaround for avoiding issues with Slingshot

Note the setting of the environment variable FI_CXI_DEFAULT_VNI before each srun.
This avoids a problem we have identified when multiple jobs or srun steps run at the same time on a compute node.
Please check the further explanation in: Issues with Slingshot network

Exclusive access to the node

Exclusive access to the compute nodes is NOT the default on Setonix (contrary to previous Cray systems). Therefore, the use of --exclusive is needed to guarantee exclusive use by a single job whenever it is required. The request for exclusive use also helps Slurm to place threads and/or processes onto cores with an efficient mapping. The major drawback is that the full node resources are charged to your allocation balance (even if some cores remain idle during the job), so this option needs to be used with care (also taking into account that idle resources may prevent other jobs from running on the supercomputer).

Listing 4 shows an example where a total of 512 MPI tasks are created from an executable named code_mpi.x. The objective is to use all the cores of each requested node; that is, 128 tasks per node and one MPI task per core.

Listing 4. Batch script requesting 512 cores on 4 nodes for a pure MPI job.
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want 512 tasks
# distributed by 128 tasks per node (using all available cores on 4 nodes)
# a wall-clock time limit of 24 hours
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=pawsey00XX)

#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=512
#SBATCH --ntasks-per-node=128
#SBATCH --exclusive
#SBATCH --time=24:00:00


# ---
# Load here the needed modules

# ---
# Set MPI related environment variables. (Not all need to be set)
# Main variables for multi-node jobs (activate for multinode jobs)
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1
#Ask MPI to provide useful runtime information (activate if debugging)
#export MPICH_ENV_DISPLAY=1
#export MPICH_MEMORY_REPORT=1

# ---
# Run the desired code:
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS ./code_mpi.x

Note the use of SLURM variables to avoid the repetition of numbers used in the settings, which is prone to errors.

Also note the explicit use of -N and -n options in the srun command.

The --exclusive option overrides the default shared access to the node, giving the job exclusive access. It is still needed for proper allocation of resources even when requesting all 128 cores available per node.

Also note the use of MPI variables to improve performance in multinode jobs, as recommended above in this section.


There are cases where you may not be able to use all cores of a node because you are limited by the amount of memory available. Each CPU-only node has 128 cores and 256 GB of memory (~235 GB usable in reality, as part of the memory is used by the system). If your MPI job requires, for instance, 3.5 GB of RAM per task (that is, 448 GB of RAM per 128 tasks), it will not fit in the resources available on a single compute node. You then need to distribute tasks across more nodes, leaving some cores unused on each of them. To do that, you can modify the example above to use:

#SBATCH --ntasks-per-node=64

which will place 64 tasks per node, allowing more memory per task. In this case, 8 nodes will be allocated instead of the 4 of the previous example (see the sketch below). You still need to use --exclusive to avoid sharing the nodes, since your job needs all the available memory, even if half of the cores remain idle. And, as mentioned above, the job will be charged as if it used 128 cores per node, because the nodes are allocated with exclusive access.
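A possible set of directives for this memory-limited variant (a sketch only, reusing the names of Listing 4; adjust account, time limit and executable to your own job):

#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=512
#SBATCH --ntasks-per-node=64   # only half of the cores per node are used, so each task can use more memory
#SBATCH --exclusive            # the whole node (and all its memory) is still allocated and charged
#SBATCH --time=24:00:00

With 512 tasks at 64 tasks per node, Slurm will allocate 8 nodes.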

Hybrid MPI and OpenMP jobs

This is a mixed-mode job creating an MPI task for each socket; each task spawns 64 OpenMP threads to use all the cores within its assigned socket. The job spans 2 compute nodes, so 4 MPI tasks are created.

Listing 5. Hybrid MPI and OpenMP job using 2 nodes.
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want 4 MPI tasks, 2 per node (1 per socket),
# each spawning 64 OpenMP threads, with a wall-clock time limit of five hours.
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123).

#SBATCH --account=[your-project]
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=64
#SBATCH --exclusive
#SBATCH --time=05:00:00

# ---
# Load here the needed modules

# ---
# OpenMP settings
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Number of threads per task; in this case it will be 64
export OMP_PLACES=cores     # Bind threads to cores
export OMP_PROC_BIND=close  # Fix each thread to its place, packing threads as close as possible (works together with OMP_PLACES above)

# ---
# Set MPI related environment variables. (Not all need to be set)
# Main variables for multi-node jobs (activate for multinode jobs)
export MPICH_OFI_STARTUP_CONNECT=1
export MPICH_OFI_VERBOSE=1
#Ask MPI to provide useful runtime information (activate if debugging)
#export MPICH_ENV_DISPLAY=1
#export MPICH_MEMORY_REPORT=1

# ---
# Run the desired code:
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./code_hybrid.x

Note the use of SLURM variables to avoid the repetition of numbers used in the settings, which is prone to errors.

Also note the explicit use of the -N, -n, and -c options in the srun command, of which the -c option is strictly necessary for the correct allocation of multithreaded jobs.

The --exclusive option overrides the default shared access to the node, giving the job exclusive access. It is still needed for proper allocation of resources even when requesting all 128 cores available per node.

With --ntasks-per-socket=1, a maximum of 1 MPI task will be allocated per socket.

Note that the OMP_PLACES and OMP_PROC_BIND variables are used to control thread affinity in OpenMP jobs (the settings above are recommended, but many other options for these variables are possible and may be tested to improve performance).

Note the MPI environment variables needed for multinode jobs.

Hyper-threading jobs

All codes that have a significant fraction of their compute in the form of logic should benefit from hyper-threading. Gadget can, as can codes that use oct-trees, binary trees, etc. For codes dominated by FLOPs, performance gets worse due to contention for the arithmetic units. Hyper-threading, or hardware threading, is disabled by default. You can enable it by using the sbatch option --threads-per-core=2.
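As an illustration only (not an official recipe; the exact combination of --threads-per-core and --cpus-per-task accepted by Slurm may need adjusting on the system), a multithreaded job opting in to hyper-threading on an exclusive node could be sketched as follows, assuming an OpenMP executable named ./code_omp.x:

#!/bin/bash --login

#SBATCH --account=[your-project]
#SBATCH --partition=work
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=256      # 128 physical cores x 2 hardware threads (assumed upper limit with SMT on)
#SBATCH --threads-per-core=2     # enable hyper-threading for this job
#SBATCH --exclusive
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # 256 threads in this sketch
export OMP_PLACES=threads    # bind to hardware threads rather than whole cores
export OMP_PROC_BIND=close

srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $OMP_NUM_THREADS ./code_omp.x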

Multiple parallel job steps in a single main job

You can run multiple job steps within a job, each of which may be a parallel computation. Furthermore, job steps can be run sequentially or in parallel themselves.

Listing 6 shows a job encompassing multiple job steps. Each job step has to terminate before the next can start its execution. For this reason, each one of them can use all the allocated resources.

Listing 6. A batch script containing multiple job steps.
#!/bin/bash --login

# SLURM directives
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123)

#SBATCH --account=[your-project]
#SBATCH --ntasks=64
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=117G
#SBATCH --time=10:00:00

# ---
# Load here the needed modules


# ---
# Set MPI related environment variables. (Not all need to be set)
# Main variables for multi-node jobs (activate for multinode jobs)
#export MPICH_OFI_STARTUP_CONNECT=1
#export MPICH_OFI_VERBOSE=1
#Ask MPI to provide useful runtime information (activate if debugging)
#export MPICH_ENV_DISPLAY=1
#export MPICH_MEMORY_REPORT=1

# ---
# Each of the sruns below will block the execution of the script
# until the current parallel job step completes:

# (Temporary workaround for avoiding Slingshot issues on shared nodes:)
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
# First srun-step:
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK -m block:block:block ./code1.x


# Rest:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK -m block:block:block ./code2.x

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK -m block:block:block ./code3.x

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK -m block:block:block ./code4.x

In this case we are assuming that each of the steps runs an MPI job (for further explanation of the settings see the listings above for pure MPI jobs with shared access, as this job does not use the full node).

Temporary workaround for avoiding issues with Slingshot

Note the setting of the environment variable FI_CXI_DEFAULT_VNI before each srun.
This avoids a problem we have identified when multiple jobs or srun steps run at the same time on a compute node.
Please check the further explanation in: Issues with Slingshot network

In Listing 7, the ampersand symbol (&) is used to execute each srun command in a non-blocking way, so that the batch script can progress and launch all of them at about the same time. The wait command prevents the batch script from exiting before all the simultaneous srun commands have completed. Note that for all job steps to run in parallel, you must allocate a fraction of the total resources to each of them using srun command options. This is useful when a single job step cannot use all the allocated resources; also note that there may be a limit on the number of job steps you can run on a node. This example is a type of job packing; for more versatile ways of job packing, check Example Workflows.


Listing 7. Running multiple job steps in parallel.
#!/bin/bash --login

# SLURM directives
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123)


#SBATCH --account=[your-project]
#SBATCH --ntasks=128
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --time=02:00:00

# ---
# Load here the needed modules

# ---
# Set MPI related environment variables. (Not all need to be set)
# Main variables for multi-node jobs (activate for multinode jobs)
#export MPICH_OFI_STARTUP_CONNECT=1
#export MPICH_OFI_VERBOSE=1
#Ask MPI to provide useful runtime information (activate if debugging)
#export MPICH_ENV_DISPLAY=1
#export MPICH_MEMORY_REPORT=1

# ---
# "&" is used to execute multiple parallel jobs simultaneously
# "wait" is used to prevent natch script from exiting before all jobs complete

# (Temporary workaround for avoiding Slingshot issues on shared nodes:)
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
# First srun-step:
srun -N 1 -n 32 -c 1 --mem=58G --exact -m block:block:block ./code1.x &

#Rest:
export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N 1 -n 32 -c 1 --mem=58G --exact -m block:block:block ./code2.x &

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N 1 -n 32 -c 1 --mem=58G --exact -m block:block:block ./code3.x &

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N 1 -n 32 -c 1 --mem=58G --exact -m block:block:block ./code4.x &

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N 1 -n 64 -c 1 --mem=117G --exact -m block:block:block ./code5.x &

export FI_CXI_DEFAULT_VNI=$(od -vAn -N4 -tu < /dev/urandom)
srun -N 1 -n 64 -c 1 --mem=117G --exact -m block:block:block ./code6.x &

#Important wait command:
wait

In this case we are assuming that each of the steps runs an MPI job (for further explanation of the settings see the listings above for pure MPI jobs with exclusive access).

As mentioned above, the "&" sign allows steps to be launched simultaneously, and the wait command literally keeps the job script waiting for the completion of all the sub-steps.

Note the use of the --mem option to indicate the memory needed by each of the srun sub-steps; different memory units are allowed, but the given number must be an integer. In this case, the ~235 GB of memory available for calculations on a compute node has been divided among the sub-steps in proportion to the number of cores each one uses (roughly one quarter of it for the 32-task steps and one half for the 64-task steps).

The --exact option indicates that each step has access only to the resources requested in each srun command.

Note that the settings of MPI variables for multinode jobs are not needed in this single node job.

Temporary workaround for avoiding issues with Slingshot

Note the setting of the environment variable FI_CXI_DEFAULT_VNI before each srun.
This avoids a problem we have identified when multiple jobs or srun steps run at the same time on a compute node.
Please check the further explanation in: Issues with Slingshot network

Plan for balanced execution times between sruns

Be very aware that all the allocated resources remain allocated until the last srun command finishes its execution. No partial resources are released for other users when an individual job step finishes. Therefore, you should plan this kind of job very carefully and aim for all job steps to have very similar execution times. For example, if many of the job steps finish quickly but just one keeps executing until it reaches the walltime, most of the resources will remain idle for a long time. Even though your project is still charged for the resources that remained idle, creating idle allocations is very bad practice and should be avoided at all costs.
