Submitting Jobs - Batch Scripts
The SLURM sbatch command allows automatic and persistent execution of commands. The list of commands sbatch performs are defined in a job batch (or submission) script, a BaSH shell script with some specialized cluster environment variables and commands. BaSH itself is a general purpose interpreted language which can perform almost any algorithm, but it is usually used to execute more efficient compiled programs designed to run on a cluster. Although many users' scripts are more complicated, typical users' batch scripts perform 3 tasks:
- Initialize resources
- Copy and move files to scratch
- Run preprocessing
- Calculate input variables
- Execute task
- Specialized program execution
- Data collection (this is usually done in parallel with execution)
- Finalize resources
- Run postprocessing
- Move files to job directory
- Remove files
Overview
Batch Submission
To simplify things, a number of template batch scripts have been created, in Examples Scripts section below, that most users can fill in required commands and resources. In addition to the steps defined above, there is a specialized set of comments, prefixed with #SBATCH, that act as input to the sbatch command options, as shown in the following example:
[username@log001 ~]# cat example.sh #This is our batch script.
#!/bin/bash
#SBATCH --cpus-per-task 2 --nodes 2
srun hostname
sleep 120
exit 0
[username@log01 ~]# sbatch example.sh --partition computeq #Note that ordering matters here!
sbatch: error: Batch job submission failed: No partition specified or system default partition
[username@master1 ~]# sbatch --partition=acomputeq example.sh
Submitted batch job 114499
[username@log01 ~]# squeue -j 114499 #We can use squeue to see that our script is running.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
114499 acomputeq example. username R 0:06 2 c[077-078]
[username@log001 ~]# cat slurm-114499.out #This is the output of our batch script.
c077
c078
sbatch Options
sbatch options can be viewed by running man sbatch in the cluster terminal. Some of the more common options/switches in sbatch:
Long Form | Short Form | Description |
---|---|---|
--ntasks N |
-n N |
Number of cores, N, required by job, assuming C is default (1 CPU per task). The scheduler will distribute all tasks on 1 to N nodes if --nodes is undefined. Up to 1000 can be defined. |
--cpus-per-task C |
-c C |
Number of CPUs per task, C. These are always shared memory CPUs on one node. Up to 40 can be defined, and 1 CPU per task is the default. |
--nodes minM --nodes minM-maxM |
-N minM -N minM-maxM |
Number of nodes, from minM to maxM, to execute tasks on. Three combinations are usually used:
|
--partition P |
-p P |
Execute script on a node in partition P. We have three partitions:
|
--gres=gpu:G |
Used in combination with gpuq partition. Use G GPUs per node. Can be 1 or 2. |
|
--mem-per-cpu E |
Use E megabytes of memory per cpu. Default is 4096 megabytes. Can include units K (kilobytes), M (megabytes), G (gigabytes), and T (terabytes). For example "--mem-per-cpu 4G" would use 4 gigabytes of memory per CPU. |
|
--array A |
--a A |
Execute the script many different times with A indexing. SLURM_ARRAY_TASK_ID environment variable provides access to one index. Indexing can be as follows:
|
Example Scripts
Batch script examples are presented below (coming soon). Using the slurm sbatch command can be used to submit jobs using these scripts, and the prefilled options can be modified to suite your job's need.
Script | Description |
---|---|
Executing a job on 1 node with 1 CPU. |
|
Executing a job on 1 node with many CPUs. |
|
Executing a job on multiple nodes with many CPUs. |
|
scratch.sh |
Executing a job with scratch space. |
Executing many jobs with indexing. |
Execution and Data Models
The HPC's cluster supports a variety of execution and data models. For many users, simply executing a lot of concurrent serial jobs is enough, but some users may require different levels of parallelism. Some common modes of parallelism used on the cluster rely on the distinction of program and memory access locale1:
Shared Access Memory | Distributed or Simultaneous Access Memory | |
---|---|---|
Single CPU | Single thread or Serial | Stream processing or General Purpose GPU (GPGPU) |
Many CPUs | Single Node, Symmetric Multiprocessing (SMP) | Single Node plus Memory (or Disk) Locale swapping (Scratch Space) |
Distributed CPUs | Distributed Global Address Space (DGAS) | Message Passing Interface (MPI) |
1. Note that this particular taxonomy is not exhaustive and there are many mixtures of all these techniques.
Serial
The most common jobs on the HPC's cluster. Most users execute individual scripts, with sbatch, for each of these jobs. If a user needs a lot of jobs, a generator script can create and submit many of these jobs. A simple method for generating jobs is shown below:
[username@log01 ~]$ cat generate.sh #The generator script.
#!/bin/bash
input=("bill" "ted" "lisa" "42")
for i in ${input[*]}
do
cp example.sh example_${i}.sh
echo "echoMe $i" >> example_${i}.sh
sbatch example_${i}.sh
done
[username@log01 ~]$ cat example.sh #A base submission script.
#!/bin/bash
#SBATCH --partition acomputeq
#SBATCH --cpus-per-task 1
function echoMe {
echo "Hello $1 from job $SLURM_JOB_ID"
sleep 120
exit 0
}
[username@log01 ~]$ ./generate.sh #Multiple submissions all in one command.
Submitted batch job 114761
Submitted batch job 114762
Submitted batch job 114763
Submitted batch job 114764
[username@log01 ~]$ squeue -u username
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
114761 acomputeq example_ username R 0:07 1 ac10
114762 acomputeq example_ username R 0:07 1 ac05
114763 acomputeq example_ username R 0:07 1 ac05
114764 acomputeq example_ username R 0:07 1 ac09
[username@log01 ~]$ cat slurm-11476* #Concatenated output from all the jobs.
Hello bill from job 114761
Hello ted from job 114762
Hello lisa from job 114763
Hello 42 from job 114764
[username@log01 example]$ cat example_bill.sh #Each derived submission script looks similar to this.
#!/bin/bash
#SBATCH --partition acomputeq
#SBATCH --cpus-per-task 1
function echoMe {
echo "Hello $1 from job $SLURM_JOB_ID"
sleep 120
exit 0
}
echoMe bill
Another way to submit many jobs is to provide your program with indexing, also known as an array job. This isn't always possible if jobs have dependencies on other jobs or require unusual input. Currently the limit for array jobs is indexes from 0 to 32768. An example of an array job based on the one above:
[username@log01 ~]$ cat example_array.sh #Submission script for our array job.
#!/bin/bash
#SBATCH --partition acomputeq
#SBATCH --cpus-per-task 1
#SBATCH --array 0-3
input=("bill" "ted" "lisa" "42")
function echoMe {
echo "Hello $1 from job $SLURM_JOB_ID"
sleep 120
exit 0
}
echoMe ${input[SLURM_ARRAY_TASK_ID]}
[username@log01 ~]$ sbatch example_array.sh #Submitting our array job.
Submitted batch job 114773
[username@log01 ~]$ squeue -u username
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
114773_0 acomputeq example_ username R 0:03 1 ac01
114773_1 acomputeq example_ username R 0:03 1 ac02
114773_2 acomputeq example_ username R 0:03 1 ac02
114773_3 acomputeq example_ username R 0:03 1 ac05
[username@log01 ~]$ cat slurm-114773_* #Output from our array job.
Hello bill from job 114773
Hello ted from job 114774
Hello lisa from job 114775
Hello 42 from job 114776
GPGPU (CUDA and OpenCL)
GPUs utilize many lightweight cores to process data. The paradigm is called stream computing, and it is similar to another paradigm called Single Instruction Multiple Data (SIMD). The idea with SIMD is that a lot of distinct data is processed by a single instruction, creating new data that can be processed further. CUDA and OpenCL extend this with special functions called kernels that act as single instructions. Executing CUDA and OpenCL programs is pretty simple as long as --partition gpuq and --gres gpu:G sbatch options are used. Also, if you use CUDA, make sure to load the appropriate modules in your submission script.
SMP
Sometimes referred to as multi-threading, this type of job is extremely popular on the HPC's cluster. These jobs require the maximum number of CPUs to be known ahead of time. Common implementations are POSIX threads, OpenMP, C++11 threads, BaSH Job Control, Python multiprocessing, and MATLAB Parallel Computing Toolbox. There are many more, but a common requirement is to define sbatch's --cpus-per-task option and passing the number of CPUs to the program being executed. With OpenMP, the environment variable OMP_NUM_THREADS should be defined in the job's submission script:
#This is a submission script excerpt!
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
Other programs may require the user to pass the environment variable SLURM_CPUS_PER_TASK to the program; for example, MATLAB has an internally defined function called maxNumCompThreads which can be defined in your MATLAB script:
%This is a matlab script excerpt!
maxNumCompThreads(str2num(getenv('SLURM_CPUS_PER_TASK')))
Scratch Space
In some cases, the largest nodes on a cluster cannot provide enough memory to load all data for a job. To accommodate more memory, a job can utilize a work directory (/home/scratch/$USER). The general idea is for a program to grab some data, do work, store the modified data to the work directory, and then repeat. This is similar to the page file in Microsoft Windows and the swap partition on Linux or Linux-like operating systems. Like page file and swap partitions, this technique is significantly slower than having all the data in memory. On the other hand, scratch space has two advantages over using swap space on the cluster. First, it allows more efficient algorithms to manage data specific to the program. Second, it allows a program to use much less memory than it would otherwise. To utilize it in this manner, the program must be aware of where the work directory is located. For many programs, this is defined by the TMPDIR (or TMP, TEMP, TMP_DIR, TEMP_DIR, etc...) environment variable:
#This is a submission script excerpt!
export TMPDIR=/scratch/$USER/$SLURM_JOB_ID/
mkdir $TMPDIR
#Run your program here
rm -r $TMPDIR
In MATLAB, a parcluster member, JobStorageLocation can define a scratch space within the MATLAB script:
%This is a matlab script excerpt!
maxNumCompThreads(str2num(getenv('SLURM_CPUS_PER_TASK')))
cluster=parcluster('local')
cluster.JobStorageLocation = getenv('TMPDIR')
In Java, the tmpdir member from java.io can define the scratch upon execution in your submission script:
#This is a submission script excerpt!
export TMPDIR=/scratch/$USER/$SLURM_JOB_ID/
mkdir $TMPDIR
java -Djava.io.tmpdir=$TMPDIR #...
rm -r $TMPDIR
Another use case for scratch space is to upload large chunks of data to the cluster that don't need to be backed up. If you have several terabytes of data, like genomic or stock data, that only need to be on the cluster while your job is running, it is best to take the following steps:
- Define a directory name, such as MYDATA ("export MYDATA=/scratch/username/directoryName" from the cluster terminal).
- Make a directory for your data in the cluster's scratch directory ("mkdir $MYDATA" from the cluster terminal).
- Upload your data to your newly created directory in scratch (use WinSCP, MobaXTerm, or "scp -r data username@bigblue.memphis.edu:/scratch/username/directoryName" from your workstation).
- Use sbatch to submit your submission scripts (make sure you include the MYDATA export from above in your submission script).
- When your jobs complete and you no longer need that data copy the output back to your home directory and delete the data with the directory ("cp -r $MYDATA/myoutput $HOME/myoutput; rm $MYDATA").
MPI
"Message Passing Interface", the name really tells you how this model works. When started, an MPI program defines a master CPU and worker CPUs in a group or many groups. The CPUs in the group starts executing instructions until all CPUs reach finalize. The groups can be synchronized and check for messages from other CPUs. Each CPU can pass messages to other CPUs within a group.
Starting an MPI program can be done by using SLURM's srun command or OpenMPI/MPICH/Intel's mpirun command within a submission script. There are 3 common option combinations for submitting MPI jobs with sbatch:
- "--cpus-per-task C --nodes M": Use C CPUs per node on M nodes giving C by M total CPUs.
- This gives a big block of fixed CPUs across fixed nodes.
- The advantage is increased speed from CPU-CPU locality and shared memory on single tasks.
- The disadvantage is potentially increased job pending time and disallowing the scheduler to make space for other jobs.
- If your job requires symmetry among nodes, this will force it.
- "--ntasks N": Use N CPUs on at most N nodes.
- This can be allocated anywhere on the cluster where CPUs are available.
- The advantage is quick job submission and allowing the scheduler to make space for other jobs.
- The disadvantage is potential slowdown due to decreased CPU-CPU locality and inability to used shared memory.
- This will not guarantee symmetry among nodes.
- "--ntasks N --cpus-per-task C": Use C CPUS per task on up to N nodes giving C by N total CPUs.
- This gives lots of small CPU chunks across many nodes.
- This is a best of both worlds approach compared to the first 2 option combinations.
- This can provide symmetry among nodes.
DGAS
This model is just an abstraction that creates a "shared" memory using pinned memory (memory that is bound to a particular processor) and 1-sided communication (MPI uses 2-sided communication: send/receive). UPC, UPC++, Chapel, DEGAS, Co-array Fortran, Global Arrays, and Titanium have this capability. While this topic requires advanced knowledge to implement, utilizing it is almost identical to MPI on the HPC's cluster. An example of software that utilizes this paradigm is NWCHEM.
Software
Most of the software is located in /public/apps on the cluster:
[username@log001 ~]$ ls /public/apps/ #List everything in this directory.
abyss caffe gmt moose_orig quip
adf cap3 gnuplot mopac R
admb c-blosc gromacs mrbayes raxml
admixtools cd-hit gsl mtmetis rlwrap
admixture charm hdf4 mumps rosetta
allpaths-lg cmake hdf5 namd sac
amber cns hmetis nccl salmon
anaconda comsol hmmer ncl samtools
angsd connectome htslib netcdf sas
ansys converge hybpiper novoalign scalapack
apbs cope hyphy2 nwchem scons
arrayfire curl imsl openblas simulia
atlas espresso intel openbugs slurmOther
augustus exabayes jellyfish openfoam snappy
autoconf exonerate jmodeltest2 openmm spades
autodock faiss julia openmpi sqlite
autodock_vina fastqc juliaa OpenSees sumo
autogrid fastsimcoal2 lammps orthograph superlu_dist
automake fastText lapack pandaseq tie-array-packed
bamtools FastTree lastz parallelgnu treemix
bcftools ffmpeg latte paraview trimal
beagle fftw leveldb parmetis trimmomatic
beagle-lib flash-modified libint partitionfinder trinity
beast fox mafft petsc upcxx
bedtools gamess matlab phyml vcftools
blacs garli md++ picard velvet
blast gaussian metis plink vmd
blat gcc miniconda pnetcdf weka
boost gd modeller progressiveCactus yasra
bowtie2 gflags moe psi4
bwa git molpro pylith
bzip2 glog molro python
cactus gmp moose qt
Modules
We have a variety of modules on the cluster for different software. To see these just use module avail as the following shows:
[username@log001 ~]$ module avail
---------------------------- /cm/local/modulefiles -----------------------------
cluster-tools-dell/8.1 lua/5.3.4
cluster-tools/8.1 module-git
cmd module-info
dot null
freeipmi/1.5.7 openldap
gcc/7.2.0 (L) openmpi/mlnx/gcc/64/3.0.0rc6
ipmitool/1.8.18 shared (L)
---------------------------- /usr/share/modulefiles ----------------------------
DefaultModules (L)
----------------------------- /public/modulefiles ------------------------------
FastTree/2.1.10
OpenSees/6.1.mc
OpenSees/6.1 (D)
R/3.5.1/gcc.8.2.0
R/3.5.2/gcc.8.2.0 (D)
R/3.5.0/gcc.8.2.0
abyss/2.1.0
lines 1-21
Then you can use module load in your submission script:
#This is a submission script excerpt!
module load R
#Run your program here
You can also use module load on the cluster terminal to compile, build, or just to see the effect:
[username@log001 ~]$ module load R
The following have been reloaded with a version change:
1) gcc/7.2.0 => gcc/8.2.0
[username@log001 ~]$ R --version #Notice that this is the default (D) version from module avail
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit) #R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.
Helpful Scheduler Commands
squeue
This command is useful if you want to see the status of your jobs:
[username@log001 ~]$ squeue -u username
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
114773_0 computeq example_ username R 0:03 1 c025
114773_1 computeq example_ username R 0:03 1 c037
114773_2 computeq example_ username R 0:03 1 c055
114773_3 computeq example_ username R 0:03 1 c075
Typically, you will see one of several states (ST):
- BF BOOT_FAIL: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
- CA CANCELLED: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
- CD COMPLETED: Job has terminated all processes on all nodes with an exit code of zero.
- CF CONFIGURING: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting).
- CG COMPLETING: Job is in the process of completing. Some processes on some nodes may still be active.
- DL DEADLINE: Job terminated on deadline.
- F FAILED: Job terminated with non-zero exit code or other failure condition.
- NF NODE_FAIL: Job terminated due to failure of one or more allocated nodes.
- OOM OUT_OF_MEMORY: Job experienced out of memory error.
- PD PENDING: Job is awaiting resource allocation.
- PR PREEMPTED: Job terminated due to preemption.
- R RUNNING: Job currently has an allocation.
- RD RESV_DEL_HOLD: Job is held.
- RF REQUEUE_FED: Job is being requeued by a federation.
- RH REQUEUE_HOLD: Held job is being requeued.
- RQ REQUEUED: Completing job is being requeued.
- RS RESIZING: Job is about to change size.
- RV REVOKED: Sibling was removed from cluster due to other cluster starting the job.
- SI SIGNALING: Job is being signaled.
- SE SPECIAL_EXIT: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value.
- SO STAGE_OUT: Job is staging out files.
- ST STOPPED: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job.
- S SUSPENDED: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
- TO TIMEOUT: Job terminated upon reaching its time limit.
More information can be found by running man squeue in the cluster terminal.
sacct
As a user, you would normally see your jobs submitted over the past day:
[jspngler@log001 ~]$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114190_19 10NM_Free+ computeq mlaradjil+ 12 COMPLETED 0:0
114190_19.b+ batch mlaradjil+ 12 COMPLETED 0:0
114190_19.e+ extern mlaradjil+ 12 COMPLETED 0:0
114191_15 10NM_Free+ computeq mlaradjil+ 12 COMPLETED 0:0
114191_15.b+ batch mlaradjil+ 12 COMPLETED 0:0
114191_15.e+ extern mlaradjil+ 12 COMPLETED 0:0
114761 example_b+ computeq mlaradjil+ 1 COMPLETED 0:0
114761.batch batch mlaradjil+ 1 COMPLETED 0:0
114761.exte+ extern mlaradjil+ 1 COMPLETED 0:0
114762 example_t+ computeq mlaradjil+ 1 COMPLETED 0:0
114762.batch batch mlaradjil+ 1 COMPLETED 0:0
114762.exte+ extern mlaradjil+ 1 COMPLETED 0:0
114763 example_l+ computeq mlaradjil+ 1 COMPLETED 0:0
114763.batch batch mlaradjil+ 1 COMPLETED 0:0
114763.exte+ extern mlaradjil+ 1 COMPLETED 0:0
114764 example_4+ computeq mlaradjil+ 1 COMPLETED 0:0
114764.batch batch mlaradjil+ 1 COMPLETED 0:0
114764.exte+ extern mlaradjil+ 1 COMPLETED 0:0
114773_4 example_a+ computeq mlaradjil+ 1 COMPLETED 0:0
114773_4.ba+ batch mlaradjil+ 1 COMPLETED 0:0
114773_4.ex+ extern mlaradjil+ 1 COMPLETED 0:0
114773_0 example_a+ computeq mlaradjil+ 1 COMPLETED 0:0
114773_0.ba+ batch mlaradjil+ 1 COMPLETED 0:0
114773_0.ex+ extern mlaradjil+ 1 COMPLETED 0:0
114773_1 example_a+ computeq mlaradjil+ 1 COMPLETED 0:0
114773_1.ba+ batch mlaradjil+ 1 COMPLETED 0:0
114773_1.ex+ extern mlaradjil+ 1 COMPLETED 0:0
114773_2 example_a+ computeq mlaradjil+ 1 COMPLETED 0:0
114773_2.ba+ batch mlaradjil+ 1 COMPLETED 0:0
114773_2.ex+ extern mlaradjil+ 1 COMPLETED 0:0
114773_3 example_a+ computeq mlaradjil+ 1 COMPLETED 0:0
114773_3.ba+ batch mlaradjil+ 1 COMPLETED 0:0
114773_3.ex+ extern mlaradjil+ 1 COMPLETED 0:0
114797 bash computeq mlaradjil+ 1 RUNNING 0:0
114797.exte+ extern mlaradjil+ 1 RUNNING 0:0
114797.0 bash mlaradjil+ 1 RUNNING 0:0
The output can be controlled with other options specifying users, accounts, output, dates, and states. To see all of them, run man sacct in the cluster terminal.
scancel
Cancel a job. Usually scancel jobid will do, but if you want to cancel all your jobs, do scancel -u $USER. To see all the options, run man scancel in the cluster terminal.
scontrol
Output and modify your jobs. "scontrol show job 26943" gives some pretty detailed information about job with jobid 26943. "scontrol update jobId=26943 numtasks=1" would set the number of tasks for jobid 26943 to 1. Most update commands can only be executed on pending jobs or by an administrator. To see all the options, run man scontrol in the cluster terminal.
srun
The SLURM run command is used to run job steps in a batch job. If it is used with a non-parallel program, it will proceed to run it for every task. For example, if --ntasks 4, and srun hostname is in a script, you might see the list of the 4 nodes srun ran on with duplicates if a node allocation has more than one task. To make sure srun only runs once in a step, you can do srun --ntasks 1 hostname. For MPI jobs, srun can replace mpirun in most instances. To see all the options, run man srun in the cluster terminal.