Resource Management with Slurm


slurm (Simple Linux Utility for Resource Management) is a resource manager for running compute jobs across multiple servers. Although this has the benefit of additional control, it imposes constraints on compute resources and confines interaction with the servers to the slurm interface, which can be a pain. This post aims to be a useful go-to guide to common slurm commands, with examples of how slurm may be used without the pain.

Background

Common commands

  • Remote Bash (debug). Typically one accesses a command-line interface on a remote machine through its debug partition; this is the equivalent of ssh’ing into the machine (see the sketch after this list for requesting specific resources).
    • srun --pty -t 0:30:00 --partition=<machine>-debug bash
  • Check partitions
    • sinfo
      greytail{thornton}% sinfo
      PARTITION           AVAIL  TIMELIMIT  NODES  STATE NODELIST
      swan01-debug*          up      30:00      1   idle swan01.cpu.stats.ox.ac.uk
      swan02-debug           up      30:00      1   idle swan02.cpu.stats.ox.ac.uk
      swan03-debug           up      30:00      1    mix swan03.cpu.stats.ox.ac.uk
      swan11-debug           up      30:00      1   idle swan11.cpu.stats.ox.ac.uk
      swan12-debug           up      30:00      1   idle swan12.cpu.stats.ox.ac.uk
      grey01-debug           up      30:00      1   idle grey01.cpu.stats.ox.ac.uk
      greyheron-debug        up      30:00      1   idle greyheron.stats.ox.ac.uk
      greyplover-debug       up      30:00      1   idle greyplover.stats.ox.ac.uk
      greywagtail-debug      up      30:00      1   idle greywagtail.stats.ox.ac.uk
      greypartridge-debug    up      30:00      1   idle greypartridge.stats.ox.ac.uk
      greyostrich-debug      up      30:00      1    mix greyostrich.stats.ox.ac.uk
      grey-standard          up 7-00:00:00      4   idle greyheron.stats.ox.ac.uk,greypartridge.stats.ox.ac.uk,greyplover.stats.ox.ac.uk,greywagtail.stats.ox.ac.uk
      grey-fast              up 7-00:00:00      1   idle grey01.cpu.stats.ox.ac.uk
      grey-gpu               up 7-00:00:00      1    mix greyostrich.stats.ox.ac.uk
      swan-1hr               up    1:00:00      1    mix swan03.cpu.stats.ox.ac.uk
      swan-1hr               up    1:00:00      2   idle swan01.cpu.stats.ox.ac.uk,swan02.cpu.stats.ox.ac.uk
      swan-6hrs              up    6:00:00      1    mix swan03.cpu.stats.ox.ac.uk
      swan-6hrs              up    6:00:00      1   idle swan02.cpu.stats.ox.ac.uk
      swan-2day              up 2-00:00:00      1    mix swan03.cpu.stats.ox.ac.uk
      swan-large             up 7-00:00:00      2   idle swan11.cpu.stats.ox.ac.uk,swan12.cpu.stats.ox.ac.uk
      stats-7day             up 7-00:00:00      1   idle emu.stats.ox.ac.uk
      
  • Check running jobs
    • squeue
      greytail{thornton}% squeue
               JOBID PARTITION     NAME   ST       TIME  NODES NODELIST(REASON)
              845457 swan03-de     bash   R      14:54      1 swan03.cpu.stats.ox.ac.uk
              845455 swan03-de     bash   R      17:22      1 swan03.cpu.stats.ox.ac.uk
              845215 swan-2day      SCI   R    6:33:10      1 swan03.cpu.stats.ox.ac.uk
              845400  grey-gpu    job01   R    3:06:28      1 greyostrich.stats.ox.ac.uk
              845397  grey-gpu    job01   R    3:10:17      1 greyostrich.stats.ox.ac.uk
              841508  grey-gpu eff_n_12   R 1-07:35:22      1 greyostrich.stats.ox.ac.uk
              838246  grey-gpu    eff_n   R 2-18:29:05      1 greyostrich.stats.ox.ac.uk
      

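For an interactive session with specific resources, the usual flags may be passed to srun directly, and squeue may be filtered to your own jobs. A minimal sketch (the time, memory and partition values here are illustrative):

srun --pty -t 2:00:00 --mem=8G --cpus-per-task=4 --partition=swan-6hrs bash
squeue -u $USER          # list only your own jobs
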
Running Scripts

  • Create a file launch.sh on the head node
  • Populate the file with a preamble specifying the resources required
    #!/bin/bash
    #SBATCH -A oxwasp
    #SBATCH --time=20:00:00
    #SBATCH --mail-user=james.thornton@stats.ox.ac.uk
    #SBATCH --mail-type=ALL
    #SBATCH --partition=grey-standard
    #SBATCH --nodelist=greyheron.stats.ox.ac.uk
    #SBATCH --output=/tmp/slurm-JT-output
    #SBATCH --mem=15G
    #SBATCH --cpus-per-task=10
    #SBATCH --gres=gpu:1
    
    
  • Add the commands to run in the same file, after the preamble (see the sketch after this list, and the fuller Jupyter notebook example below)
  • Launch the slurm job: sbatch launch.sh
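Putting this together, a minimal launch.sh might look as follows (the partition, paths and the script being run are illustrative placeholders):

#!/bin/bash
#SBATCH -A oxwasp
#SBATCH --time=1:00:00
#SBATCH --partition=swan-1hr
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2

cd /data/thornton        # hypothetical working directory
python my_script.py      # hypothetical payload script

On submission, sbatch replies with "Submitted batch job <id>"; the id can be captured for bookkeeping:

jobid=$(sbatch launch.sh | awk '{print $NF}')
squeue -j "$jobid"       # check the job's status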

Hosting a Jupyter Notebook


#!/bin/bash
#SBATCH -A oxwasp                       # Account to be used, e.g. academic, acadrel, aims, bigbayes, opig, oxcsml, oxwasp, rstudent, statgen, statml, visitors
#SBATCH -J job01                          # Job name, can be useful but optional
#SBATCH --time=7-00:00:00                   # Walltime: a maximum run time of 7 days
#SBATCH --mail-user=james.thornton@stats.ox.ac.uk     # set to your own email address
#SBATCH --mail-type=ALL                   # Caution: fine for debug, but not if handling hundreds of jobs!
#SBATCH --partition=grey-gpu                # Select the grey GPU partition
#SBATCH --nodelist=greyostrich.stats.ox.ac.uk
#SBATCH --output="/tmp/slurm-JT-output"
#SBATCH --mem=20G
#SBATCH --cpus-per-task 5
#SBATCH --gres=gpu:1

cd /data/greyostrich/oxwasp/oxwasp18/thornton

# activate the conda environment and register it as a jupyter kernel
source ./miniconda3/bin/activate bridge
pip install tornado
python -m ipykernel install --user --name=bridge

# serve the notebook from the compute node, without trying to open a browser
python -m jupyter notebook --ip greyostrich.stats.ox.ac.uk --no-browser --port 8888
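
The notebook is now served from the compute node rather than your local machine. One way to reach it is an ssh tunnel via the head node (a sketch, assuming greytail is reachable from your machine and the port matches the one above):

ssh -N -L 8888:greyostrich.stats.ox.ac.uk:8888 <username>@greytail.stats.ox.ac.uk

The notebook is then available locally at http://localhost:8888.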

Python Interface with Paramiko

Set-Up

  • Install the paramiko library for ssh utilities: pip install paramiko
  • Connect to the slurm head node, e.g. greytail, via paramiko
    import os        # used later when building the sbatch command path
    import paramiko

    # connect to the head node using the system's ssh keys and host config
    client = paramiko.SSHClient()
    client.load_system_host_keys()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname='greytail')
    

Launch individual commands

    # run a single command on the head node and collect its output
    command = 'sinfo'
    stdin, stdout, stderr = client.exec_command(command)
    lines = stdout.readlines()
    

Launch scripts

  • Create sbatch file in Python
    preamble = """#!/bin/bash
    #SBATCH -A oxwasp
    #SBATCH --time=20:00:00
    #SBATCH --mail-user=james.thornton@stats.ox.ac.uk
    #SBATCH --mail-type=ALL
    #SBATCH --partition=grey-standard
    #SBATCH --nodelist=greyheron.stats.ox.ac.uk
    #SBATCH --output=/tmp/slurm-JT-output
    #SBATCH --mem=15G
    #SBATCH --cpus-per-task=10
    """
    
    
    command = preamble + "\n" + """
    
    cd /data/localhost/oxwasp/oxwasp18/thornton
    touch test_new_file2.txt
    """
    
  • Create a new file on the head node and write the sbatch commands to it
    slurm_wd = '/data/thornton'
    slurm_file = 'test_batch.sh'

    # open an sftp session and write the sbatch script to the head node
    ftp = client.open_sftp()
    ftp.chdir(slurm_wd)
    remote_file = ftp.file(slurm_file, "w", -1)
    remote_file.write(command)
    remote_file.flush()
    ftp.close()
    
  • Launch the sbatch job remotely
    sbatch_cmd = 'sbatch {0}'.format(os.path.join(slurm_wd, slurm_file))
    
    stdin, stdout, stderr = client.exec_command(sbatch_cmd)
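  • On success, sbatch replies with a line of the form "Submitted batch job <id>"; the job id can be read back from stdout to track the job, e.g.
    # recover the job id from sbatch's reply and poll the queue for it
    job_id = stdout.read().decode().strip().split()[-1]
    stdin, stdout, stderr = client.exec_command('squeue -j {0}'.format(job_id))
    print(stdout.read().decode())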