Expanse: Getting Started with Batch Job Scheduling: Slurm Edition
Most high-performance computing (HPC) systems are specialized resources in high demand, shared simultaneously by many researchers across all domains of science, engineering, and beyond. To distribute the compute resources of an HPC system fairly among these researchers, whose compute demands vary over time, most computational workloads on these systems are run as batch jobs: scripted sets of commands executed on a specified type or set of compute resources for a given amount of time. Researchers submit these batch job scripts to a batch job scheduler, the software that controls and tracks where and when each submitted job will eventually run and execute its commands. However, if this is your first time using an HPC system and interacting with a batch job scheduler like Slurm, writing your first batch job scripts and submitting them to the scheduler can be intimidating. Moreover, batch job schedulers can be configured in many different ways and often have features and options unique to each system that you will need to consider when writing your batch jobs.
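As a concrete illustration, a minimal Slurm batch job script might look like the sketch below. The partition, account, and resource values are placeholders; the options actually available (and required) depend on how a given system, such as Expanse, is configured.

```bash
#!/bin/bash
#SBATCH --job-name=hello-world        # a name for the job
#SBATCH --partition=compute           # placeholder partition name; system-specific
#SBATCH --account=abc123              # placeholder allocation/account; system-specific
#SBATCH --nodes=1                     # number of compute nodes requested
#SBATCH --ntasks-per-node=1           # number of tasks (processes) per node
#SBATCH --cpus-per-task=1             # number of CPU cores per task
#SBATCH --mem=1G                      # memory requested for the job
#SBATCH --time=00:05:00               # maximum wall-clock time (HH:MM:SS)
#SBATCH --output=hello-world.%j.out   # job output file; %j expands to the job ID

# The scripted set of commands the scheduler will run on the allocated node(s)
echo "Hello from $(hostname)"
sleep 30
```

Submitted with `sbatch hello-world.sh`, the script waits in the queue until the requested resources become available and then runs unattended; `squeue -u $USER` shows its status while it is pending or running.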
In this webinar, we will teach you how to write your first batch job script and submit it to a Slurm batch job scheduler. We will also discuss best practices for structuring your batch job scripts, show you how to leverage Slurm environment variables, and share tips on requesting resources from the scheduler so your work gets done faster. Finally, we will introduce advanced features such as Slurm job arrays and job dependencies for building more structured computational workflows.
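To give a flavor of these topics, here is a hedged sketch of a job array script that uses Slurm-provided environment variables; the partition and account values are placeholders, and `my_program` and the `input_N.dat` files are hypothetical stand-ins for your own application and inputs.

```bash
#!/bin/bash
#SBATCH --job-name=param-sweep
#SBATCH --partition=compute            # placeholder; use the partition appropriate for your system
#SBATCH --account=abc123               # placeholder allocation/account
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --array=1-10                   # run 10 array tasks, indexed 1 through 10
#SBATCH --output=sweep.%A_%a.out       # %A = array job ID, %a = array task index

# Slurm sets environment variables describing the job at run time
echo "Job ID:        ${SLURM_JOB_ID}"
echo "Array task ID: ${SLURM_ARRAY_TASK_ID}"
echo "Node list:     ${SLURM_JOB_NODELIST}"

# Use the array task index to select this task's input, e.g. input_1.dat ... input_10.dat
./my_program "input_${SLURM_ARRAY_TASK_ID}.dat"
```

Job dependencies can then chain work together: for example, `sbatch --dependency=afterok:<jobid> postprocess.sh` (with `<jobid>` and `postprocess.sh` as placeholders) holds a follow-up job until the first job completes successfully.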
Instructor
Marty Kandes
Computational & Data Science Research Specialist, HPC User Services Group - SDSC
Marty Kandes is a Computational and Data Science Research Specialist in the High-Performance Computing User Services Group at SDSC. He currently helps manage user support for Expanse, SDSC’s NSF-funded supercomputer, and maintains the Singularity containers supported on the system.