
Use Slurm to run MPI jobs

posted Mar 9, 2020, 7:21 AM by Teng-Yok Lee   [ updated Mar 9, 2020, 8:34 AM ]
MPI is a message-passing library for running parallel jobs across multiple computers, and it is used by some distributed machine learning libraries such as Horovod [1]. This page briefly explains how to run MPI programs on a Slurm cluster:

To run jobs interactively, use salloc:
$ salloc <resources> mpirun -n <#tasks> <cmd> <arg> ...
where <#tasks> matches the number of tasks requested via the salloc options.
For instance, the following command requests 2 nodes with 2 CPUs each (4 tasks in total, 2 tasks per node); hostname is then run 4 times, so 2 distinct host names should appear in the output:
$ salloc -n 4 --ntasks-per-node 2 mpirun -n 4 hostname
NOTE: This is different from srun, which uses -c to specify the number of CPUs per task.
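For comparison, the same hostname test can be launched with srun directly (a rough sketch, not from the original post); here -n is the number of tasks, -c is the number of CPUs per task, and no mpirun is needed because srun launches the tasks itself:
$ srun -n 4 -c 1 --ntasks-per-node 2 hostname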
To request generic consumable resources such as GPUs, use --gres to specify the resources per node. For instance, the following command requests 2 nodes with 2 GPUs each:
$ salloc -n 4 --gres gpu:2 --ntasks-per-node 2 mpirun -n 4 hostname
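To quickly check the GPU allocation, a sketch like the following (not part of the original post, and assuming nvidia-smi is installed on the compute nodes) lists the GPUs seen by each task:
$ salloc -n 4 --gres gpu:2 --ntasks-per-node 2 mpirun -n 4 nvidia-smi -L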
To submit an MPI job in batch mode, use sbatch [2]. Unlike salloc, sbatch expects a job script rather than a bare command, so either wrap the mpirun command with --wrap or put it into a job script (see the sketch below):
$ sbatch <resources> --wrap "mpirun -n <#tasks> <cmd> <arg> ..."
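Alternatively, the resources and the mpirun command can be written into a job script. The following minimal sketch (run_mpi.sb and my_mpi_prog are placeholder names) reproduces the GPU example above:
#!/bin/bash
# 4 MPI tasks in total, 2 tasks per node (i.e. 2 nodes), 2 GPUs per node.
#SBATCH -n 4
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
mpirun -n 4 ./my_mpi_prog
The script is then submitted with:
$ sbatch run_mpi.sb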

References
  1. https://github.com/horovod/horovod
  2. https://slurm.schedmd.com/sbatch.html
