MPI jobs / binaries via salloc vs sbatch subtle differences

Problem

This article explains the following:

  • Why MPI binaries will fail on the login node
  • Why MPI binaries will run in a salloc job without using srun
  • Difference between sbatch and salloc

Solution

This gets a bit involved and can look messy, but I'll try to explain it.

When you get a Slurm session, Slurm basically sets up the environment both in the background and in the foreground, i.e. in your terminal.

I'm going to demonstrate it with a hello world MPI example:

achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> cat hello.c
#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
    MPI_Finalize();
}
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> cc hello.c 
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> ls -al |grep -i a.out
-rwxr-xr-x 1 achew achew 10671528 Feb 12 17:28 a.out
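
Note that cc here is the Cray compiler wrapper, which links MPI in automatically. On a generic cluster you would normally compile through an MPI wrapper instead; a rough equivalent (assuming mpicc is available) would be:

mpicc hello.c -o a.out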


So if a Slurm session is not set up, you get the following error when running an MPI binary directly:

achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> ./a.out 
[Wed Feb 12 17:29:10 2020] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537): 
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1 
Aborted (core dumped)


The MPI environment is only set up when you launch the binary via the srun command; srun does the magic in the background.
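
A quick way to see this is to launch the binary through srun and to dump the environment that srun gives each task. Roughly something like the commands below (assuming your site lets you call srun directly from the login node; the exact SLURM_/PMI_ variable names depend on the MPI stack and Slurm version):

srun -n 2 ./a.out
srun -n 1 env | grep -E '^(SLURM|PMI)' | sort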

Now, having said that, when you run salloc it is really a wrapper calling an srun command, i.e.

achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> cat /etc/opt/slurm/slurm.conf |grep -i salloc
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"

So it's really calling "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"
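
If you want to check what your own site has configured, you can also query the running slurmctld instead of reading slurm.conf directly (note that on newer Slurm releases this parameter may no longer exist):

scontrol show config | grep -i salloc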

The difference with an sbatch shell script:

  • you have to call srun yourself to actually bootstrap / set up the MPI environment; srun then runs the binary (see the sketch right after this list)
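
A minimal sbatch script for the same hello world binary might look like the sketch below (the script name, task count and time limit are just examples, adjust for your site):

#!/bin/bash
#SBATCH --job-name=hello-mpi
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=00:05:00

# srun is what bootstraps the PMI/MPI environment and launches the tasks;
# running ./a.out here without srun would hit the same PMI init failure as on the login node
srun ./a.out

You would submit it with something like "sbatch hello.sbatch".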

While using salloc:

  • The interactive terminal you get is already running inside that srun command
  • So the MPI environment is already there, because you are working live inside the job step

For example:

achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> salloc
Enter passphrase for key '/home/achew/.ssh/id_rsa': 
salloc: Granted job allocation 8946110
salloc: Waiting for resource configuration
salloc: Nodes nid00160 are ready for job

Notice that the number of tasks is 1, basically because of the predefined srun command (-n1 -N1) in slurm.conf:

achew@nid00160:~/jobs/hello.mpi.jobs/galaxy> ./a.out 
Process 0 on nid00160 out of 1

If I want to change the number of tasks, I have to spawn another job step, i.e. via srun, which sets up / allocates the resources that were requested from the scheduler:

achew@nid00160:~/jobs/hello.mpi.jobs/galaxy> srun --ntasks=8 ./a.out 
Process 4 on nid00160 out of 8
Process 5 on nid00160 out of 8
Process 6 on nid00160 out of 8
Process 0 on nid00160 out of 8
Process 1 on nid00160 out of 8
Process 3 on nid00160 out of 8
Process 7 on nid00160 out of 8
Process 2 on nid00160 out of 8

achew@nid00160:~/jobs/hello.mpi.jobs/galaxy> exit
exit
salloc: Relinquishing job allocation 8946110


You can confirm this by looking at the job output summary; notice that each srun is essentially a job step:


achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> sacct --job=8946110 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- --------------- 
    achew      8946110         sh      workq  COMPLETED 1-00:00:00            17:41:53            17:43:45   00:01:52                              1         40        nid00160 
          8946110.ext+     extern             COMPLETED                       17:41:53            17:43:45   00:01:52       676K      4204K        1         40        nid00160 
             8946110.0       bash             COMPLETED                       17:41:56            17:43:45   00:01:49      6456K    346456K        1          1        nid00160 
             8946110.1      a.out             COMPLETED                       17:43:17            17:43:17   00:00:00      1916K    279896K        1          8        nid00160 


So it's working as intended; the difference is that with salloc your shell is already inside an srun, whereas with sbatch the MPI environment is not set up until you call srun yourself.

