Resource Management with Slurm
Slurm (Simple Linux Utility for Resource Management) is a resource manager for running compute jobs across multiple servers. Although this has the benefit of additional control, it imposes constraints on compute resources and restricts interaction with the servers to the Slurm interface, which can be a pain. This post aims to be a useful go-to guide to common Slurm commands, with examples of how Slurm may be used without the pain.
Background
Common commands
- Remote Bash (debug): typically one may access a command-line interface on remote machines through the debug partition. This is the equivalent of ssh’ing into a remote machine.
srun --pty -t 0:30:00 --partition=<machine>-debug bash
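If the interactive session needs more than the default allocation, the same srun call takes the usual resource flags. For example (the amounts and the partition placeholder below are illustrative, not site defaults):

# interactive shell with 4 CPUs, 8 GB of memory and one GPU
srun --pty -t 0:30:00 --partition=<machine>-debug --cpus-per-task=4 --mem=8G --gres=gpu:1 bash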
- Check partitions
sinfo
greytail{thornton}% sinfo
PARTITION            AVAIL  TIMELIMIT   NODES  STATE  NODELIST
swan01-debug*        up     30:00       1      idle   swan01.cpu.stats.ox.ac.uk
swan02-debug         up     30:00       1      idle   swan02.cpu.stats.ox.ac.uk
swan03-debug         up     30:00       1      mix    swan03.cpu.stats.ox.ac.uk
swan11-debug         up     30:00       1      idle   swan11.cpu.stats.ox.ac.uk
swan12-debug         up     30:00       1      idle   swan12.cpu.stats.ox.ac.uk
grey01-debug         up     30:00       1      idle   grey01.cpu.stats.ox.ac.uk
greyheron-debug      up     30:00       1      idle   greyheron.stats.ox.ac.uk
greyplover-debug     up     30:00       1      idle   greyplover.stats.ox.ac.uk
greywagtail-debug    up     30:00       1      idle   greywagtail.stats.ox.ac.uk
greypartridge-debug  up     30:00       1      idle   greypartridge.stats.ox.ac.uk
greyostrich-debug    up     30:00       1      mix    greyostrich.stats.ox.ac.uk
grey-standard        up     7-00:00:00  4      idle   greyheron.stats.ox.ac.uk,greypartridge.stats.ox.ac.uk,greyplover.stats.ox.ac.uk,greywagtail.stats.ox.ac.uk
grey-fast            up     7-00:00:00  1      idle   grey01.cpu.stats.ox.ac.uk
grey-gpu             up     7-00:00:00  1      mix    greyostrich.stats.ox.ac.uk
swan-1hr             up     1:00:00     1      mix    swan03.cpu.stats.ox.ac.uk
swan-1hr             up     1:00:00     2      idle   swan01.cpu.stats.ox.ac.uk,swan02.cpu.stats.ox.ac.uk
swan-6hrs            up     6:00:00     1      mix    swan03.cpu.stats.ox.ac.uk
swan-6hrs            up     6:00:00     1      idle   swan02.cpu.stats.ox.ac.uk
swan-2day            up     2-00:00:00  1      mix    swan03.cpu.stats.ox.ac.uk
swan-large           up     7-00:00:00  2      idle   swan11.cpu.stats.ox.ac.uk,swan12.cpu.stats.ox.ac.uk
stats-7day           up     7-00:00:00  1      idle   emu.stats.ox.ac.uk
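The default sinfo listing can get long; it also accepts a partition filter and a custom format string if you only care about a few columns. For instance (the columns chosen here are just one possible selection):

# show only the grey-gpu partition
sinfo -p grey-gpu
# custom columns: partition, availability, time limit, node count, state
sinfo -o "%P %a %l %D %t"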
- Check running jobs
squeue
greytail{thornton}% squeue
JOBID   PARTITION  NAME      ST  TIME        NODES  NODELIST(REASON)
845457  swan03-de  bash      R   14:54       1      swan03.cpu.stats.ox.ac.uk
845455  swan03-de  bash      R   17:22       1      swan03.cpu.stats.ox.ac.uk
845215  swan-2day  SCI       R   6:33:10     1      swan03.cpu.stats.ox.ac.uk
845400  grey-gpu   job01     R   3:06:28     1      greyostrich.stats.ox.ac.uk
845397  grey-gpu   job01     R   3:10:17     1      greyostrich.stats.ox.ac.uk
841508  grey-gpu   eff_n_12  R   1-07:35:22  1      greyostrich.stats.ox.ac.uk
838246  grey-gpu   eff_n     R   2-18:29:05  1      greyostrich.stats.ox.ac.uk
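To restrict the listing to your own jobs, or to stop one, the usual squeue and scancel options apply (the job ID below is taken from the example output above):

# only my jobs
squeue -u $USER
# cancel a job by its JOBID
scancel 845457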
Running Scripts
- Create a file launch.sh on the head node
- Populate the file with a preamble specifying the resources required
#!/bin/bash
#SBATCH -A oxwasp
#SBATCH --time=20:00:00
#SBATCH --mail-user=james.thornton@stats.ox.ac.uk
#SBATCH --mail-type=ALL
#SBATCH --partition=grey-standard
#SBATCH --nodelist="greyheron.stats.ox.ac.uk"
#SBATCH --output="/tmp/slurm-JT-output"
#SBATCH --mem "15G"
#SBATCH --cpus-per-task 10
#SBATCH --gres=gpu:1
- Add commands to run something in the same file, after the preamble (see below for an example)
- Launch slurm job
sbatch launch.sh
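sbatch replies with the new job ID, after which progress can be checked from the head node. Note that the --output path in the preamble above is under /tmp on the compute node, so it is only readable from that node unless you point it at a shared filesystem. Two quick checks (the job ID is illustrative):

# list your own queued and running jobs
squeue -u $USER
# full details for one job: node, state, output path, etc.
scontrol show job 845457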
Hosting a Jupyter Notebook
#!/bin/bash
#SBATCH -A oxwasp # Account to be used, e.g. academic, acadrel, aims, bigbayes, opig, oxcsml, oxwasp, rstudent, statgen, statml, visitors
#SBATCH -J job01 # Job name, can be useful but optional
#SBATCH --time=7-00:00:00 # Walltime - up to 7 days
#SBATCH --mail-user=james.thornton@stats.ox.ac.uk # set email address for notifications, change to your own
#SBATCH --mail-type=ALL # Caution: fine for debug, but not if handling hundreds of jobs!
#SBATCH --partition=grey-gpu # Select the grey GPU partition
#SBATCH --nodelist=greyostrich.stats.ox.ac.uk
#SBATCH --output="/tmp/slurm-JT-output"
#SBATCH --mem 20g
#SBATCH --cpus-per-task 5
#SBATCH --gres=gpu:1
cd /data/greyostrich/oxwasp/oxwasp18/thornton
source ./miniconda3/bin/activate bridge
pip install tornado
python -m ipykernel install --user --name=bridge
python -m jupyter notebook --ip greyostrich.stats.ox.ac.uk --no-browser --port 8888
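The notebook listens on port 8888 on greyostrich, so to open it in a local browser one option is an SSH tunnel from your own machine, assuming you can ssh to the head node greytail and that local port 8888 is free:

# run on your laptop: forward local port 8888 to the notebook on greyostrich
ssh -N -L 8888:greyostrich.stats.ox.ac.uk:8888 greytail

Then browse to http://localhost:8888 and paste the token printed in the job's slurm output file.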
Python Interface with Paramiko
Set-Up
- Install the paramiko library for SSH utilities
- Connect to the slurm head node, e.g. greytail, via paramiko

import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(hostname='greytail')
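If your local SSH keys or username are not picked up automatically, connect also accepts them explicitly; the username and key path below are placeholders:

client.connect(hostname='greytail', username='your_username', key_filename='/path/to/private_key')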
Launch individual commands
command = 'sinfo'
stdin, stdout, stderr = client.exec_command(command)
lines = stdout.readlines()
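exec_command does not raise an error when the remote command fails, so it can be worth wrapping it and checking the exit status. A minimal sketch (the run_remote helper is my own name, not part of paramiko):

def run_remote(client, command):
    # run a command on the head node and fail loudly if it returns non-zero
    stdin, stdout, stderr = client.exec_command(command)
    exit_status = stdout.channel.recv_exit_status()  # blocks until the command finishes
    if exit_status != 0:
        raise RuntimeError(stderr.read().decode())
    return stdout.read().decode()

print(run_remote(client, 'sinfo'))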
Launch scripts
- Create the sbatch file contents in Python
preamble = """#!/bin/bash
#SBATCH -A oxwasp
#SBATCH --time=20:00:00
#SBATCH --mail-user=james.thornton@stats.ox.ac.uk
#SBATCH --mail-type=ALL
#SBATCH --partition=grey-standard
#SBATCH --nodelist="greyheron.stats.ox.ac.uk"
#SBATCH --output="/tmp/slurm-JT-output"
#SBATCH --mem "15G"
#SBATCH --cpus-per-task 10
"""

command = preamble + "\n" + """
cd /data/localhost/oxwasp/oxwasp18/thornton
touch test_new_file2.txt
"""
- Create a new file on the head node and write the sbatch commands to it
slurm_wd = '/data/thornton'
slurm_file = 'test_batch.sh'

ftp = client.open_sftp()
ftp.chdir(slurm_wd)
file = ftp.file(slurm_file, "w", -1)
file.write(command)
file.flush()
ftp.close()
- Launch the slurm sbatch job remotely
import os

sbatch_cmd = 'sbatch {0}'.format(os.path.join(slurm_wd, slurm_file))
stdin, stdout, stderr = client.exec_command(sbatch_cmd)
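sbatch replies with a line of the form "Submitted batch job <id>", so the job ID can be pulled out of stdout and polled with squeue. A sketch, assuming the submission succeeded:

# parse the job ID from sbatch's "Submitted batch job <id>" reply
job_id = stdout.read().decode().strip().split()[-1]

# check whether that job is still queued or running
stdin, stdout, stderr = client.exec_command('squeue -j {0}'.format(job_id))
print(stdout.read().decode())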