Subtle differences between salloc and sbatch for MPI jobs / binaries
Problem
This article will explain the following:
- Why MPI binaries fail on the login node
- Why MPI binaries run in a salloc job without using srun
- The difference between sbatch and salloc
Solution
This gets complicated and pretty messy, but I'll try to explain it.
When you get a Slurm session, it basically sets up the environment both in the background and in the foreground, i.e. in your terminal.
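For example, inside an allocation you can list the variables Slurm has injected into your shell, such as SLURM_JOB_ID, SLURM_NTASKS and SLURM_JOB_NODELIST (the exact set varies by site and Slurm version):

# run inside a Slurm session: shows the environment Slurm has set up
env | grep '^SLURM_'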
I'm going to demonstrate this with a hello-world MPI example:
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> cat hello.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
    MPI_Finalize();
}
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> cc hello.c
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> ls -al | grep -i a.out
-rwxr-xr-x 1 achew achew 10671528 Feb 12 17:28 a.out
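Note that on this Cray system, cc is the compiler driver wrapper, which links in the MPI library automatically. On a generic cluster you would typically compile with the MPI wrapper instead (using mpicc here is an assumption about your toolchain):

# non-Cray equivalent of the cc step above (assumes an mpicc wrapper is installed)
mpicc hello.c -o a.out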
So if a Slurm session has not been set up, you get the following error when running an MPI binary:
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> ./a.out
[Wed Feb 12 17:29:10 2020] [unknown] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
MPID_Init(647).......: PMI2 init failed: 1
Aborted (core dumped)
The MPI environment is only set up when you launch through the srun command; srun does the magic (the PMI initialisation that failed above).
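If you're curious which bootstrap plugins your srun has available, you can list them (the set of types shown, e.g. none / pmi2 / pmix, depends on how Slurm was built):

# list the MPI/PMI plugin types this srun supports
srun --mpi=list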
Now, having said that, when you do salloc on this system it is really a wrapper that calls the srun command:
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> cat /etc/opt/slurm/slurm.conf | grep -i salloc
SallocDefaultCommand = "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL"
So salloc is really calling "srun -n1 -N1 --mem-per-cpu=0 --gres=craynetwork:0 --pty --preserve-env --mpi=none $SHELL" to give you your interactive shell.
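One quick way to convince yourself that the salloc shell is itself a job step: step-level variables should be set inside it (a sketch; the exact variable name can differ between Slurm versions):

# inside the salloc shell: the shell is a task of a job step,
# so step-level variables like this should be present
echo $SLURM_STEP_ID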
The difference with an sbatch shell script:
- you have to call srun yourself to actually bootstrap / set up the MPI environment and then run the binary (see the sketch after this list)
While using salloc:
- the interactive terminal you get is already inside the srun command
- so the MPI environment is already there, because you are using it live inside
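For comparison, a minimal sbatch script for the same binary might look like this (a sketch only; the job name and task count are assumptions for illustration):

#!/bin/bash
#SBATCH --job-name=hello-mpi
#SBATCH --nodes=1
#SBATCH --ntasks=8

# srun bootstraps the MPI (PMI) environment and launches the tasks;
# running ./a.out directly here would fail just like on the login node
srun ./a.out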
Here is salloc in action:
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> salloc
Enter passphrase for key '/home/achew/.ssh/id_rsa':
salloc: Granted job allocation 8946110
salloc: Waiting for resource configuration
salloc: Nodes nid00160 are ready for job
Notice that the number of tasks is 1, because of the predefined srun command (-n1 -N1) in slurm.conf:
achew@nid00160:~/jobs/hello.mpi.jobs/galaxy> ./a.out
Process 0 on nid00160 out of 1
If I wanted to change the number of tasks, I would have to spawn another job step via srun, which sets up / allocates the resources you asked the scheduler for:
achew@nid00160:~/jobs/hello.mpi.jobs/galaxy> srun --ntasks=8 ./a.out
Process 4 on nid00160 out of 8
Process 5 on nid00160 out of 8
Process 6 on nid00160 out of 8
Process 0 on nid00160 out of 8
Process 1 on nid00160 out of 8
Process 3 on nid00160 out of 8
Process 7 on nid00160 out of 8
Process 2 on nid00160 out of 8
achew@nid00160:~/jobs/hello.mpi.jobs/galaxy> exit
exit
salloc: Relinquishing job allocation 8946110
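Alternatively, if you know the task count up front, you can ask for it at allocation time, and a plain srun inside the allocation will inherit it (standard Slurm behaviour, sketched here):

# request 8 tasks when allocating...
salloc --ntasks=8
# ...then srun inherits the allocation's task count
srun ./a.out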
You can confirm this by looking at the job accounting summary; notice that each srun is essentially a separate step:
achew@galaxy-1:~/jobs/hello.mpi.jobs/galaxy> sacct --job=8946110 --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist
     User        JobID    JobName  Partition      State  Timelimit    Start      End   Elapsed     MaxRSS  MaxVMSize   NNodes  NCPUS  NodeList
--------- ------------ ---------- ---------- ---------- ---------- -------- -------- --------- ---------- ---------- -------- ------ ---------
    achew 8946110              sh      workq  COMPLETED 1-00:00:00 17:41:53 17:43:45  00:01:52                              1     40  nid00160
          8946110.ext+     extern             COMPLETED            17:41:53 17:43:45  00:01:52       676K      4204K        1     40  nid00160
          8946110.0          bash             COMPLETED            17:41:56 17:43:45  00:01:49      6456K    346456K        1      1  nid00160
          8946110.1         a.out             COMPLETED            17:43:17 17:43:17  00:00:00      1916K    279896K        1      8  nid00160
So it's all working as intended; the subtle difference is that with salloc you are already inside an srun, whereas with sbatch the MPI environment is not set up until you call srun yourself.