Workload manager: SLURM

by Redouane Bouchouirbat
Announcement, Documentation

SLURM (Simple Linux Utility for Resource Management) is a scalable open-source scheduler used on a number of world-class clusters.

This page gives some hints and guidance to help users launch a job on the platform.

Once connected to the platform, you have to load the slurm module:

$ module load slurm

There are two ways to launch a job.

1. Interactive jobs

Method 1:

You need to allocate some resources.

$ salloc  -N2 -t 00:30:00
salloc: Granted job allocation 7397

squeue is used to check the job state:

$ squeue -j 7397
JOBID PARTITION NAME USER      ST TIME NODES   NODELIST(REASON)
7397     defq   bash      bouchoui R   1:05      2    miriel[007-008]

squeue can also display more information using a custom output format:

$ squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %.3C %.20R"
JOBID PARTITION NAME USER    ST TIME NODES CPUS NODELIST(REASON)
11411  defq       bash    bouchoui R  0:33    2    25       miriel[004-005]

To get information about a specific job, scontrol show job <JobId> can be used
(see man scontrol for more help).
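
For example, applied to the job allocated above:

$ scontrol show job 7397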

squeue header details:

JOBID: the job identifier.
PARTITION: the partition on which the job is running; use the sinfo command to display all partitions in the cluster (see the example after this list).
NAME: the name of the job; to define or change the name (in batch mode), use -J <name_of_job>.
USER: the login of the job owner.
ST: the state of the submitted job; some states: PENDING, RUNNING, FAILED, COMPLETED, etc.
TIME: the time used by the job so far (NOTE: if the user doesn't define a time limit for the job, the default time limit of the partition is applied).
NODES: the number of allocated nodes.
NODELIST: the list of nodes used.
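
For example, to list the partitions mentioned above (the exact output depends on the cluster configuration):

$ sinfo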

 

There is a compact format for job states:

  • PD (pending): Job is awaiting resource allocation,
  • R (running): Job currently has an allocation,
  • CA (cancelled): Job was explicitly cancelled by the user or system administrator,
  • CF (configuring): Job has been allocated resources, but is waiting for them to become ready for use,
  • CG (completing): Job is in the process of completing. Some processes on some nodes may still be active,
  • CD (completed): Job has terminated all processes on all nodes,
  • F (failed): Job terminated with a non-zero exit code or other failure condition,
  • TO (timeout): Job terminated upon reaching its time limit,
  • NF (node failure): Job terminated due to failure of one or more allocated nodes.
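
These compact codes can also be passed to squeue to filter the display; for example, to list only pending jobs:

$ squeue -t PD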

 

Job 7397 is in the running (R) state, on the default partition defq, and 2 nodes are used: miriel007 and miriel008.

In the same shell terminal, run srun <your_executable> (all commands will use the allocated resources):

$ srun hostname
miriel007
miriel008

You can also log in to one of the allocated nodes using ssh; however, SLURM's environment variables will not be set in that session.

$ ssh miriel007

If you want to run your command/binary on all the allocated resources from that node, run the srun command with the --jobid option set to the id associated to your job.

@miriel007~$ srun --jobid=7397 hostname
miriel007
miriel008

Method 2:

Launch the interactive session from one of the devel nodes.

$ srun -N1 --exclusive --time=30:00 --pty bash -i
$ hostname
miriel006

This command launches a shell on a node in interactive mode using the defq partition and the allocated resources.

NB: To export the X11 display on the first, last, or all allocated node(s), add the option --x11=[batch|first|last|all].
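
For example, a minimal sketch combining this option with the interactive command above, to forward the display from the first allocated node:

$ srun -N1 --time=30:00 --x11=first --pty bash -i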

2. Batch jobs

$ cat script-slurm.sh
#!/usr/bin/env bash
#Job name
#SBATCH -J TEST_Slurm
# Asking for one node
#SBATCH -N 1
# Output results message
#SBATCH -o slurm.sh%j.out
# Output error message
#SBATCH -e slurm.sh%j.err
module purge
module load slurm/14.03.0
echo "=====my job informations ==== "echo "Node List: " $SLURM_NODELIST
echo "my jobID: " $SLURM_JOB_ID
echo "Partition: " $SLURM_JOB_PARTITION
echo "submit directory:" $SLURM_SUBMIT_DIR
echo "submit host:" $SLURM_SUBMIT_HOST
echo "In the directory: `pwd`"
echo "As the user: `whoami`"

Launch the job using the sbatch command:

$ sbatch script-slurm.sh
Submitted batch job 7421

To get information about the running jobs:

$ squeue

and for more details:

$ scontrol show job <jobid>

To cancel a running job:

$ scancel <jobid>
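
scancel can also target jobs by name or by owner (a small sketch; the job name is the one set with -J in the script above, and <your_login> is a placeholder):

$ scancel -n TEST_Slurm
$ scancel -u <your_login>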

To view the output of job 7421:

$ cat slurm.sh7421.out
===== my job information =====
Node List: miriel003
my jobID: 7449
Partition: defq
submit directory: /home/bouchoui/Tests
submit host: devel12
In the directory: /home/bouchoui/Tests
As the user: bouchoui
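
When the defaults are not enough, further #SBATCH directives can be added. Below is a minimal sketch; the node count, task count, time limit and executable name a.out are illustrative:

#!/usr/bin/env bash
# Job name
#SBATCH -J TEST_Slurm_parallel
# Asking for two nodes and 48 tasks (illustrative values)
#SBATCH -N 2
#SBATCH -n 48
# Wall-clock limit; otherwise the partition default applies
#SBATCH -t 00:30:00
# Output and error messages
#SBATCH -o slurm.sh%j.out
#SBATCH -e slurm.sh%j.err
module purge
module load slurm/14.03.0
# run the executable on all allocated tasks
srun ./a.out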

3. Job accounting

The command sacct displays accounting data for all jobs and job steps in the SLURM job accounting log or SLURM database.

By default, it displays some information (JobID, JobName, Partition, Account, AllocCPUS, State, ExitCode) about the user's jobs.

The accounting information can be displayed for all jobs within a specified interval of time (option -S or --starttime, and option -E or --endtime), or for a specified job(.step) or list of job(.step)s: sacct -j <jobid1,jobid2,...>.

For example:

sacct -j 7397
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7397               bash       defq                    48  COMPLETED      0:0
7397.0         hostname                                 2  COMPLETED      0:0

To display only jobs with a particular state (completed, timeout, cancelled, etc.), use the -s <state> (or --state=<state>) option, for an elapsed period (e.g., from 10 September 2015 to 12 September 2015):

sacct -S 09/10 -E 09/12 -s timeout
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13584              bash       defq    sed-bdx          1    TIMEOUT      0:1
13619              bash       defq    sed-bdx         12    TIMEOUT      0:1

You can choose which accounting fields to display for your jobs with the --format option.

For example:

sacct -s completed -S 09/10 -E 09/15 \
  --format=jobid,jobname,partition,maxvmsize,maxvmsizenode,avevmsize,averss,avecpu,ntasks,alloccpus,elapsed,state,exitcode,submit,end

For more details about the sacct command, use man sacct in your shell (or see the online documentation at http://slurm.schedmd.com/sacct.html).

SLURM epilog

After a job exits a node, an epilog script is run which kills all processes belonging to users who are not authorized to be running on that node. This has three useful effects:

  • to clean up jobs which may declare themselves done without actually killing all their sub-processes,
  • to terminate programs started on the node through means other than the SLURM manager,
  • to clean up all directories and files found in /tmp created by the user's job.

Advanced usage:

Parallel programming (MPI):

MPI usage depends on the type of MPI being used (for more details, see http://slurm.schedmd.com/mpi_guide.html).

Hereafter, we describe how to use the three most used MPI libraries on the PlaFRIM platform (Open MPI, Intel MPI and MVAPICH2).

OpenMPI:

Choose your build and execution environment using modules. To do so, load one of the available Open MPI libraries.

Currently, we provide the following MPI implementations (you can list the modules by running module avail):

module avail mpi/openmpi
mpi/openmpi/gcc/1.10.0-tm mpi/openmpi/gcc/1.8.1 mpi/openmpi/gcc/1.8.5-tm
mpi/openmpi/gcc/1.10.0-tm-mlx mpi/openmpi/gcc/1.8.4-tm mpi/openmpi/gcc/1.8.6-tm

Using openmpi/gcc/1.8.5-tm (the Open MPI library with the multi-threaded option):

module load mpi/openmpi/gcc/1.8.5-tm

The job is then run using mpirun.

$ salloc -n 3          # allocate 3 tasks for the job
> mpirun ./a.out       # launch with mpirun
> exit                 # ending the job

NOTE: To activate InfiniBand on nodes like sirocco or mistral, you have to switch to/load the openmpi-1.10.0-tm-mlx library:

module load mpi/openmpi/gcc/1.10.0-tm-mlx

  • to compile the code:

mpicc -o a.out  programme.c

  • to run code:

mpirun --mca btl openib,self ./a.out

All the commands used above also work in batch mode.
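
For instance, a minimal Open MPI batch-script sketch combining the pieces above (the node count and the executable name a.out are illustrative):

#!/usr/bin/env bash
#SBATCH -J openmpi_test
#SBATCH -N 2
#SBATCH -o slurm.sh%j.out
#SBATCH -e slurm.sh%j.err
module purge
module load slurm/14.03.0
module load mpi/openmpi/gcc/1.8.5-tm
# launch on the allocated resources with mpirun
mpirun ./a.out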

Intel MPI:

Choose your build and execution environment using modules. To do so, load one of the available Intel MPI libraries.

In order to use the Intel MPI library for your application, you have to load the Intel MPI library and a compiler (Intel and/or gnu-gcc).

For example:

module load compiler/gcc/4.9.0
module add compiler/intel/64/2015.3.187
module add mpi/intel-mpi

Using mpiexec.hydra to launch your application:

  • Create a file with the names of the machines that you want to run your job on:

srun hostname -s | sort -u > mpd.hosts

  • To run your application on these nodes, use mpiexec.hydra, and choose the fabrics for intra-node and inter-node MPI communication:

export I_MPI_FABRICS=shm:tmi
mpiexec.hydra -f mpd.hosts -n $SLURM_NPROCS ./a.out

Launching with srun (using SLURM):

  • Set the I_MPI_PMI_LIBRARY environment variable to point to the Slurm Process Management Interface (PMI) library:

export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/14.03.0/lib64/libpmi.so

  • To run your application on these nodes, use srun, and also choose the fabrics for intra-node and inter-node communication:

export I_MPI_FABRICS=shm:tmi
srun -n $SLURM_NPROCS  ./a.out

NB: $SLURM_NPROCS indicates the number of processes.
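
Putting these pieces together, a minimal Intel MPI batch-script sketch (the module versions, library path and executable name follow the examples above and may differ on your installation):

#!/usr/bin/env bash
#SBATCH -J intelmpi_test
#SBATCH -N 2
#SBATCH -o slurm.sh%j.out
#SBATCH -e slurm.sh%j.err
module purge
module load slurm/14.03.0
module add compiler/intel/64/2015.3.187
module add mpi/intel-mpi
# point Intel MPI at the SLURM PMI library and select the fabrics
export I_MPI_PMI_LIBRARY=/cm/shared/apps/slurm/14.03.0/lib64/libpmi.so
export I_MPI_FABRICS=shm:tmi
# run on all allocated processes
srun -n $SLURM_NPROCS ./a.out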

Array of Jobs
Staged jobs
Debugging and analysis
Reservation session
