Using the Slurm job scheduler

Queues

The following queues (Slurm partitions) are defined; an example of selecting a partition follows the list.

  • defq: n[001-004]: dual-processor AMD EPYC nodes, 64 cores, with 2 TB ECC DRAM and 4 TB NVMe /work
  • dgx2q: g001: NVIDIA DGX-2, dual-processor Xeon Scalable 8168, 48 cores, with 1.5 TB RAM and 30 TB NVMe /work
  • armq: n[005-009]: dual-processor Cavium ThunderX2 nodes, 64 cores, with 1 TB DRAM and 8 TB SSD (12 Gbps) /work
  • slowq: n[041-048]: single-processor Xeon Silver nodes, 8 cores
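
A job is directed to one of these partitions with the -p option, either on the sbatch command line or as an #SBATCH directive in the job script. A minimal sketch (my_job.sbatch is a placeholder file name, not an existing script):

torel@srl-login1:~$ sbatch -p dgx2q my_job.sbatch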

Use the command “sinfo” to get information on the availability of the various nodes:

torel@srl-login1:~$ sinfo
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*          up   infinite      4   idle n[001-004]
dgx2q          up   infinite      1   idle g001
xeongold16q    up   infinite      4   idle srl-adm2,srl-login1,srl-mds[1-2]
armq           up   infinite      0    n/a 
slowq          up   infinite      8   idle n[041-048]

Simple sbatch script to run on nodes n041-n048 (slowq)

Example running over the IB2 InfiniBand fabric using RDMA (OpenFabrics/openib, MXM):

torel@srl-login1:~/workspace/MPI$ cat run-osu-mbw-mr-srun-w_mxm-slowq.sbatch
 
#!/bin/bash
#SBATCH -p slowq # partition (queue)
#SBATCH -N 2 # number of nodes
#SBATCH -n 2 # number of tasks (MPI ranks)
#SBATCH --mem 1G # memory per node
#SBATCH -t 0-4:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR
 
ulimit -s 10240 # set the stack size limit to 10 MB (value in KB)
 
module load slurm 
module load openmpi/gcc/64/4.0.1 
 
export OMPI_MCA_btl_openib_warn_no_device_params_found=0 # silence warnings about missing device parameters
export OMPI_MCA_btl_openib_if_include=mlx5_0:1 # use port 1 of the mlx5_0 InfiniBand HCA
export OMPI_MCA_btl=self,openib # use only the self and openib (RDMA) transports
export OMPI_MCA_btl_tcp_if_exclude=lo,dis0,enp113s0f0 # keep the TCP transport off these interfaces
 
# Alternative method
#mpirun -np  $SLURM_NTASKS numactl --cpunodebind=0 --localalloc  /home/torel/workspace/Benchmarks/MPI/OSU-Micro-Benchmarks/osu-micro-benchmarks-5.5/mpi/pt2pt/osu_mbw_mr 
 
# Preferred method using srun
#
srun --mpi=pmi2 -n $SLURM_NTASKS /home/torel/workspace/Benchmarks/MPI/OSU-Micro-Benchmarks/osu-micro-benchmarks-5.5/mpi/pt2pt/osu_mbw_mr
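
For jobs that do not need MPI, a much shorter script is enough. A minimal sketch (assuming an executable ./my_program in the submission directory):

#!/bin/bash
#SBATCH -p defq # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH -n 1 # number of tasks
#SBATCH --mem 1G # memory per node
#SBATCH -t 0-1:00 # time (D-HH:MM)
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

module load slurm
./my_program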

To submit a script to the queue, use the command “sbatch [filename]”:

marikkes@srl-login1:~/STREAM/STREAM$ sbatch streamrun.sh
Submitted batch job 2080
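
Standard output and standard error from the job are written to the files named by the -o and -e directives in the submitted script. Assuming streamrun.sh uses the same directives as the example scripts above, the output of job 2080 can be inspected with:

marikkes@srl-login1:~$ cat slurm.*.2080.out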

The command “squeue” shows the current job queue:

marikkes@srl-login1:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2080 xeongold1   stream marikkes  R       2:14      1 srl-adm2
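
To list only your own jobs, restrict squeue to a user name:

marikkes@srl-login1:~$ squeue -u marikkes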

The command “scancel [JOBID]” cancels a job from the queue.
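
For example, to cancel the job submitted above:

marikkes@srl-login1:~$ scancel 2080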

For more commands and information, see the Slurm Quick Start User Guide (https://slurm.schedmd.com/quickstart.html).
