Advanced SLURM Options

Environment Variables

Tip: if you are passing environment variables, there are a couple of gotchas.

1) In order for all of your MPI ranks to see an environment variable, you must add an option to the mpirun command line so the variable is passed through properly. For example, if you run sbatch --export=MYVARIABLE scriptfile, then inside scriptfile you would call mpirun -x MYVARIABLE parallel_executable_file (see the first sketch after this list).

2) In certain situations (e.g. multi-node jobs), the scriptfile itself will not see your environment variable. To get around this, separate your submission into a control file that you submit to sbatch and a script file that is actually run within each task slot of your job. Whether you run sbatch --export=MYVARIABLE controlfile, or already have MYVARIABLE set in your environment and simply run sbatch controlfile, the controlfile should contain your regular #SBATCH headers and a single command: srun scriptfile. This ensures that your entire environment is transferred to scriptfile on every job step, inside every task (see the second sketch below).
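Minimal sketches of both gotchas follow; the file names, MYVARIABLE, and parallel_executable_file are placeholders for your own names.

For 1), submit with the variable exported and re-export it to the ranks inside the script:

$ sbatch --export=MYVARIABLE scriptfile

$ cat scriptfile
#!/bin/sh
# -x tells mpirun to pass MYVARIABLE along to every MPI rank
mpirun -x MYVARIABLE parallel_executable_file

For 2), split the submission in two:

$ cat controlfile
#!/bin/sh
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
# srun copies the full submission environment into every task on every node
srun scriptfile

$ cat scriptfile
#!/bin/sh
echo "MYVARIABLE = $MYVARIABLE"

$ sbatch --export=MYVARIABLE controlfile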

InfiniBand

If you are using IB on a cluster with multiple IB islands (e.g. a separate IB switch per rack), add the following option to ensure all nodes of your job are located on the same IB switch:

 -C "[ib1 | ib2 | ib3 | ib4]"

For example, you can put this at the top of your script:

#SBATCH -C "[ib1|ib2|ib3|ib4]"

or you can use it on the command line directly. In the following example, we are requesting two nodes that are both on the same InfiniBand switch:

$ srun -C "[ib1|ib2|ib3|ib4]" -N 2 --pty bash

Alternatively, if a particular job does not require the fastest interconnect, you can fall back to Ethernet-only networking by adding

--mca btl self,vader,tcp

to your mpirun command line. If you are accustomed to using the sm BTL, use its replacement vader instead, which is faster.
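For example, assuming Open MPI (parallel_executable_file is a placeholder):

$ mpirun --mca btl self,vader,tcp parallel_executable_file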

Command Line Arguments

sbatch will pass along any command-line arguments that follow the submission script filename. For example:

$ cat submit.sh
#!/bin/sh
echo argument1 = $1

$ sbatch submit.sh greetings
Submitted batch job 3505

$ cat slurm-3505.out
argument1 = greetings

Multiple core jobs: how to specify?

One full 32-core node, 32 individual subtasks each using 1 core

#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=32
#SBATCH --cpus-per-task=1
#
# by default, standard output and standard error are merged into same output file
#SBATCH --output="dist-%j.out"
#
# reserve one entire 32-core node for this job (32 tasks x 1 cpu per task)

# this job step will run 32 times on same node; output will all be assembled together in outfile
srun env | grep SLURM_PROCID

# so will this job step, and we will see each individual core being assigned to each task; output
#  will be assembled together, and will all show up after the output of the above job step
srun hwloc-bind --get

One full 32-core node, 1 individual subtask using all 32 cores

#!/bin/sh
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#
# by default, standard output and standard error are merged into same output file
#SBATCH --output="dist-%j.out"
#
# reserve one entire 32-core node for this job (1 task x 32 cpus per task)
# it will be up to your script to make use of all the cores allocated to it

# this job step will run once on the node; output goes to the output file
srun env | grep SLURM_PROCID

# so will this job step, and we will see all cores being assigned to the single task; output
#  will all show up after the output of the above job step
srun hwloc-bind --get

GPU usage

GPU: NODES / CARDS / GPUS / CUDA CORES

Each node has one or more GPU cards, and each GPU card contains one or more GPUs. Each GPU has multiple Streaming Multiprocessors (SMs), and each SM has multiple CUDA cores. In addition to CPU cores, the scheduler also manages GPU utilization. If your faculty sponsor purchased N GPU cards, please manually restrict your utilization at this time to just those cards across the GPU nodes; for Tesla K80s this means 2 * N GPUs, since each K80 card contains two GK210GL GPUs.

Use --constraint=gpu (or -C gpu) with sbatch to explicitly select a GPU node from your partition, and --constraint=nogpu to explicitly avoid selecting one. In addition, use --gres=gpu:gk210gl:1 to request one of your GPUs, and the scheduler should manage GPU resources for you automatically. (On NLPGPU these constraints are unnecessary, since all of its nodes are GPU nodes.)
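For example, a minimal batch script requesting one GK210GL GPU on a GPU node (the final command is just an illustration):

#!/bin/sh
#SBATCH --constraint=gpu
#SBATCH --gres=gpu:gk210gl:1
#SBATCH --output="gpu-%j.out"

# if the cluster constrains devices via cgroups, this should list only
#  the GPU allocated to this job
nvidia-smi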

If you want to specify large-memory GPUs, check which constraints can be specified with scontrol show node | egrep -ie 'nodename|activefeatures'. You may find features named "48GBgpu", for example, which indicate the GPU memory available; you could then request -C 48GBgpu to require scheduling on nodes with that feature defined. If there are nodes with both 24GB and 48GB GPUs and either would do, specify -C "24GBgpu|48GBgpu".
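For example, this header accepts either GPU memory size (a sketch; it assumes your cluster actually defines the 24GBgpu and 48GBgpu features):

#!/bin/sh
#SBATCH -C "24GBgpu|48GBgpu"
#SBATCH --gres=gpu:1
#SBATCH --output="gpu-%j.out"

# report the name and memory size of the GPU we actually received
nvidia-smi --query-gpu=name,memory.total --format=csv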