Resource Management with Slurm

Slurm (Simple Linux Utility for Resource Management) is a resource manager for running compute jobs across multiple servers. It gives additional control over how shared machines are used, but it also imposes limits on compute resources and restricts interaction with the servers to the Slurm interface, which can be a pain. This post aims to be a go-to guide to common Slurm commands, with examples of how Slurm can be used without the pain.


Common commands

  • Remote Bash (debug). An interactive command line on a remote machine is typically accessed through that machine's debug partition. This is the equivalent of ssh’ing into the remote machine.
    • srun --pty -t 0:30:00 --partition=<machine>-debug bash
  • Check partitions
    • sinfo
      greytail{thornton}% sinfo
      swan01-debug*          up      30:00      1   idle
      swan02-debug           up      30:00      1   idle
      swan03-debug           up      30:00      1    mix
      swan11-debug           up      30:00      1   idle
      swan12-debug           up      30:00      1   idle
      grey01-debug           up      30:00      1   idle
      greyheron-debug        up      30:00      1   idle
      greyplover-debug       up      30:00      1   idle
      greywagtail-debug      up      30:00      1   idle
      greypartridge-debug    up      30:00      1   idle
      greyostrich-debug      up      30:00      1    mix
      grey-standard          up 7-00:00:00      4   idle
      grey-fast              up 7-00:00:00      1   idle
      grey-gpu               up 7-00:00:00      1    mix
      swan-1hr               up    1:00:00      1    mix
      swan-1hr               up    1:00:00      2   idle
      swan-6hrs              up    6:00:00      1    mix
      swan-6hrs              up    6:00:00      1   idle
      swan-2day              up 2-00:00:00      1    mix
      swan-large             up 7-00:00:00      2   idle
      stats-7day             up 7-00:00:00      1   idle
  • Check running jobs
    • squeue
      greytail{thornton}% squeue
              845457 swan03-de     bash   R      14:54      1
              845455 swan03-de     bash   R      17:22      1
              845215 swan-2day      SCI   R    6:33:10      1
              845400  grey-gpu    job01   R    3:06:28      1
              845397  grey-gpu    job01   R    3:10:17      1
              841508  grey-gpu eff_n_12   R 1-07:35:22      1
              838246  grey-gpu    eff_n   R 2-18:29:05      1
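
The sinfo listing is easy to scan by eye, but it can also be filtered in a script. The following is a minimal sketch, assuming the five-column layout shown above (partition, availability, time limit, node count, state) and time limits written as MM:SS, H:MM:SS or D-HH:MM:SS. The function names slurm_time_to_seconds and idle_partitions are made up for illustration and are not part of Slurm.

```python
def slurm_time_to_seconds(text):
    """Convert a duration such as '30:00', '6:00:00' or '7-00:00:00' to seconds.

    Assumes colon-separated forms only; values like 'infinite' are not handled.
    """
    days, _, clock = text.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    # Pad to [hours, minutes, seconds]; 'MM:SS' has no hours field.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


def idle_partitions(sinfo_text, min_seconds=0):
    """Return {partition: time limit in seconds} for partitions with idle nodes."""
    found = {}
    for line in sinfo_text.splitlines():
        fields = line.split()
        # Skip prompts, headers and anything that is not a full sinfo row.
        if len(fields) < 5 or fields[0] == "PARTITION":
            continue
        name, avail, limit, _nodes, state = fields[:5]
        if avail != "up" or state != "idle":
            continue
        limit_seconds = slurm_time_to_seconds(limit)
        if limit_seconds >= min_seconds:
            # The default partition is marked with a trailing '*'.
            found[name.rstrip("*")] = limit_seconds
    return found


# A few rows copied from the listing above.
SAMPLE = """\
swan01-debug*          up      30:00      1   idle
swan03-debug           up      30:00      1    mix
grey-standard          up 7-00:00:00      4   idle
swan-1hr               up    1:00:00      1    mix
swan-1hr               up    1:00:00      2   idle
swan-6hrs              up    6:00:00      1   idle
"""

if __name__ == "__main__":
    # Partitions with idle nodes that allow at least one hour of run time.
    print(idle_partitions(SAMPLE, min_seconds=3600))
```

The same slurm_time_to_seconds helper also works on the elapsed-time column of squeue, for example to spot jobs that have been running for more than a day.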

Running Scripts

  • Create a file on head node
  • Populate file with preamble to specify resources required
    #!/bin/bash
    #SBATCH -A oxwasp
    #SBATCH --time=20:00:00
    #SBATCH --mail-type=ALL
    #SBATCH --partition=grey-standard
    #SBATCH --nodelist=""
    #SBATCH --output="/tmp/slurm-JT-output"
    #SBATCH --mem "15G"
    #SBATCH --cpus-per-task 10
    #SBATCH --gres=gpu:1
  • Add commands to run something in the same file, after preamble (see below for example)
  • Launch the Slurm job with sbatch <file>
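
When many similar jobs are needed, the preamble can be generated instead of typed by hand. The following is a rough sketch of that idea, not a fixed recipe: build_sbatch_script is a hypothetical helper, and it assumes every option is passed by its long name (--account, --time, --partition, --mem, --cpus-per-task), with underscores standing in for hyphens.

```python
def build_sbatch_script(commands, **options):
    """Return the text of an sbatch script.

    `options` maps long sbatch option names to values, e.g.
    cpus_per_task=10 becomes '#SBATCH --cpus-per-task=10'.
    `commands` is a list of shell lines to run after the preamble.
    """
    lines = ["#!/bin/bash"]
    for key, value in options.items():
        flag = key.replace("_", "-")
        lines.append(f"#SBATCH --{flag}={value}")
    lines.append("")  # blank line between preamble and commands
    lines.extend(commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = build_sbatch_script(
        ["cd /data/localhost/oxwasp/oxwasp18/thornton", "touch test_new_file2.txt"],
        account="oxwasp",
        time="20:00:00",
        partition="grey-standard",
        mem="15G",
        cpus_per_task=10,
    )
    print(script)
```

The returned text can be saved to a file on the head node and submitted with sbatch as described above.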

Hosting a Jupyter Notebook

#!/bin/bash
#SBATCH -A oxwasp                       # Account to be used, e.g. academic, acadrel, aims, bigbayes, opig, oxcsml, oxwasp, rstudent, statgen, statml, visitors
#SBATCH -J job01                          # Job name, optional but can be useful
#SBATCH --time=7-00:00:00                   # Walltime: maximum run time of 7 days
#SBATCH --mail-user=<your-email>          # Email address for notifications, change to your own
#SBATCH --mail-type=ALL                   # Caution: fine for debug, but not if handling hundreds of jobs!
#SBATCH --partition=grey-gpu                # Select the grey GPU partition
#SBATCH --output="/tmp/slurm-JT-output"
#SBATCH --mem 20g
#SBATCH --cpus-per-task 5
#SBATCH --gres=gpu:1

cd /data/greyostrich/oxwasp/oxwasp18/thornton

source ./miniconda3/bin/activate bridge
pip install tornado
python -m ipykernel install --user --name=bridge

python -m jupyter notebook --ip=<node-address> --no-browser --port 8888
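
A notebook running on a compute node is usually reached from a local machine through an SSH tunnel via the head node. The sketch below only builds the ssh command and does not run it. The host names greytail and greyostrich and the user thornton are borrowed from the examples in this post and are assumptions about the setup, as is the notebook being reachable from the head node on port 8888. build_tunnel_command is a hypothetical helper name.

```python
import shlex


def build_tunnel_command(compute_node, head_node, user, port=8888):
    """Build an ssh command that forwards a local port to the notebook.

    -N: do not run a remote command, only forward ports.
    -L: forward localhost:<port> to <compute_node>:<port> via the head node.
    """
    return [
        "ssh",
        "-N",
        "-L",
        f"{port}:{compute_node}:{port}",
        f"{user}@{head_node}",
    ]


if __name__ == "__main__":
    cmd = build_tunnel_command("greyostrich", "greytail", "thornton")
    # Print a copy-pasteable command line.
    print(" ".join(shlex.quote(part) for part in cmd))
```

With the tunnel open, the notebook should be available at localhost:8888 in a local browser, using the token that Jupyter writes to the job's output file.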

Python Interface with Paramiko


  • Install the paramiko library for SSH utilities
  • Connect to the Slurm head node, e.g. greytail, via paramiko
    import paramiko
    client = paramiko.SSHClient()

    Launch individual commands

    command = 'sinfo'
    stdin, stdout, stderr = client.exec_command(command)
    lines = stdout.readlines()
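
The snippets above create a client and run a command, but leave out the connection step and any error checking. The following is a minimal sketch of both, assuming key-based SSH access and a head node already listed in the local known_hosts file. The helper names connect and run_command are made up for illustration, and the host and user in the commented usage are assumptions taken from the prompts shown earlier.

```python
def connect(hostname, username):
    """Open an SSH connection to the head node (needs `pip install paramiko`)."""
    import paramiko  # imported here so run_command can be used without it

    client = paramiko.SSHClient()
    # Trust only hosts already present in ~/.ssh/known_hosts.
    client.load_system_host_keys()
    client.connect(hostname, username=username)
    return client


def run_command(client, command):
    """Run one command remotely; return (exit_status, stdout_lines, stderr_text)."""
    _stdin, stdout, stderr = client.exec_command(command)
    # read() returns bytes; read both streams fully before asking for the status.
    out_lines = stdout.read().decode().splitlines()
    err_text = stderr.read().decode()
    status = stdout.channel.recv_exit_status()
    return status, out_lines, err_text


# Example usage (requires network access to the cluster):
# client = connect("greytail", "thornton")
# status, lines, err = run_command(client, "sinfo")
# if status != 0:
#     raise RuntimeError(err)
# client.close()
```

Checking the exit status matters because exec_command does not raise when the remote command fails; the failure only shows up in the status and in stderr.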

    Launch scripts

  • Create sbatch file in Python
    preamble = """#!/bin/bash
    #SBATCH -A oxwasp
    #SBATCH --time=20:00:00
    #SBATCH --mail-type=ALL
    #SBATCH --partition=grey-standard
    #SBATCH --nodelist=""
    #SBATCH --output="/tmp/slurm-JT-output"
    #SBATCH --mem "15G"
    #SBATCH --cpus-per-task 10
    """
    command = preamble + "\n" + """
    cd /data/localhost/oxwasp/oxwasp18/thornton
    touch test_new_file2.txt
    """
  • Create new file on head node and write sbatch commands to file
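    import os
    slurm_wd = '/data/thornton'
    slurm_file = ''  # name of the sbatch file to create
    ftp = client.open_sftp()
    # Write the script to the same path that sbatch is pointed at afterwards
    with ftp.file(os.path.join(slurm_wd, slurm_file), "w", -1) as f:
        f.write(command)
    ftp.close()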
  • Launch slurm sbatch remotely
    sbatch_cmd = 'sbatch {0}'.format(os.path.join(slurm_wd, slurm_file))
    stdin, stdout, stderr = client.exec_command(sbatch_cmd)
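
On success, sbatch prints a line of the form "Submitted batch job 845457", so the job ID can be recovered from stdout and used in later calls such as squeue -j. A small sketch, where parse_job_id is a hypothetical helper name:

```python
import re


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from the text printed by sbatch."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in sbatch output: {sbatch_output!r}")
    return int(match.group(1))


if __name__ == "__main__":
    print(parse_job_id("Submitted batch job 845457\n"))

# With the paramiko client from above (stdout.read() returns bytes):
# job_id = parse_job_id(stdout.read().decode())
# client.exec_command("squeue -j {0}".format(job_id))
```

If the submission fails, stdout is empty and the reason is on stderr, which is why the helper raises rather than returning None.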