March 5, 2026 5 min read

SLURM Job Scheduling: Submit & Monitor HPC Jobs

SLURM powers 60% of the world's HPC clusters, but cryptic path errors and resource misconfiguration trip up most researchers. Learn the exact patterns for writing single-CPU, parallel array, and GPU job scripts—plus how to interpret logs and debug common failures.

SLURM HPC job scheduling batch scripts resource management sbatch GPU computing cluster computing

Submit & Monitor HPC Jobs Using SLURM — Without Losing Your Data to Path Errors

You’ve just logged into your university’s HPC cluster for the first time. You write what you think is a perfect job script, submit it with sbatch, and 10 minutes later you find a cryptic error in the log file: “file does not exist.” Your absolute path was wrong. Or you forgot to specify GPU resources and your job waited in the queue for 3 hours doing nothing.

This is the moment most researchers abandon SLURM and go back to running jobs on their laptop.

The good news: SLURM job submission is learnable in one sitting—if you know the exact patterns to follow and the gotchas to avoid.

What SLURM Actually Does

SLURM (Simple Linux Utility for Resource Management) powers 60% of the world’s HPC clusters. Instead of running code directly on shared hardware, you write a batch script that tells SLURM:

How many CPUs or GPUs you need
How much memory and wall-clock time
Which compute partition (CPU-only, GPU, high-memory, etc.)
What command to run

SLURM queues your job and runs it when resources are available. You get structured logs, job IDs, and the ability to submit 100 jobs without crashing anything.

In this guide, you’ll write three working scripts—a single-CPU job, a parallel CPU array, and a GPU job—then interpret the output logs to debug the most common errors.

Prerequisites

Access & Software

SSH access to an HPC cluster with SLURM installed
A text editor on the cluster (nano, vim, or VS Code remote SSH)
Bash shell (default on Linux/Mac; use WSL2 on Windows)

Cluster Info You Need to Gather

Before writing your first script, get this from your cluster admin or documentation:

Available partition names (e.g., cpu, gpu, high-mem) — run sinfo to list them
CPU/GPU node specs (cores per node, GPU type, memory) — run sinfo -N -l
Max job duration allowed per partition
GPU types available (e.g., A100 40GB, RTX 3090 24GB)

A small dataset or script you’ve already tested locally (to avoid debugging SLURM + code bugs at once) is also helpful.

Setup in 4 Commands

SLURM is already installed on your cluster. Just set up your working directory:

ssh username@cluster.university.edu
mkdir -p ~/hpc_workshop/{scripts,logs,data}
cd ~/hpc_workshop
which sbatch && sinfo

The last command verifies SLURM is available and shows you which partitions you can access.

⚠️ Critical rule: Always use absolute paths in SLURM scripts. Relative paths are the #1 cause of “file does not exist” errors.

The Standard Workflow

Every SLURM job follows this cycle:

Write a batch script (.sh file with SLURM directives + code)
Submit it with sbatch
Check status with squeue
Read the output/error logs
Debug and resubmit

Let’s build three examples that cover 90% of real-world use cases.

Example 1: Single CPU Job

Create job_single_cpu.sh:

#!/bin/bash
#SBATCH --job-name=cpu_demo
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=00:10:00
#SBATCH --output=logs/cpu_demo_%j.out
#SBATCH --error=logs/cpu_demo_%j.err

python scripts/my_analysis.py

What each directive does:

--job-name: Human-readable label (appears in squeue)
--partition: Which compute nodes to use (ask your admin for names)
--ntasks: Number of parallel tasks (usually 1 for single jobs)
--cpus-per-task: CPU cores per task (increase for multithreading)
--mem: Total RAM allocated (4G, 16G, 64G, etc.)
--time: Max runtime in HH:MM:SS (job terminates if exceeded)
--output / --error: Log file paths (%j = job ID, %a = array index)

Submit and Monitor

sbatch job_single_cpu.sh

SLURM responds with your job ID:

Submitted batch job 12345

Check status:

squeue -u $USER

Watch in real-time (updates every 2 seconds):

watch -n 2 squeue -u $USER

Key columns: JOBID (your job number), ST (status: R = running, PD = pending, CA = cancelled, F = failed), TIME (elapsed time), NODES (which compute node).

Read the Logs

Once the job finishes:

cat logs/cpu_demo_12345.out    # Standard output
cat logs/cpu_demo_12345.err    # Errors and warnings

If you see “file does not exist,” verify all paths are absolute (start with /):

pwd  # Print working directory
ls -lh /absolute/path/to/file

Example 2: CPU Array Job (Parallel Runs)

To run the same script 5 times with different inputs, use --array:

#!/bin/bash
#SBATCH --job-name=cpu_array_demo
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G
#SBATCH --time=00:15:00
#SBATCH --array=0-4
#SBATCH --output=logs/cpu_array_%j_%a.out
#SBATCH --error=logs/cpu_array_%j_%a.err

python scripts/my_analysis.py --task_id $SLURM_ARRAY_TASK_ID

Submit once; SLURM launches 5 parallel instances:

sbatch job_cpu_array.sh

Each instance gets a unique array index (%a) in the log filename, so logs don’t overwrite each other.

Run in Batches

To limit concurrent jobs (e.g., only 2 running at once):

#SBATCH --array=0-19%2

This runs jobs 0–19 with a max of 2 simultaneously. Once 2 finish, the next 2 start.

Example 3: GPU Job

#!/bin/bash
#SBATCH --job-name=gpu_demo
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=logs/gpu_demo_%j.out
#SBATCH --error=logs/gpu_demo_%j.err

python scripts/my_gpu_script.py

Key differences from CPU jobs:

--partition=gpu: Use a GPU partition (not cpu)
--gres=gpu:1: Request 1 GPU (gres = generic resources)
--cpus-per-task=4: CPU cores to support GPU (typically 4–8)
--mem=16G: Increase memory for GPU workloads

Request Specific GPU Types

If your cluster has multiple GPU types (A100 40GB, RTX 3090 24GB, etc.), specify which you want:

#SBATCH --gres=gpu:a100:1

Or allow any GPU:

#SBATCH --gres=gpu:1

Common Issues & Fixes

“File Does Not Exist”

Cause: You used a relative path (e.g., data/myfile.csv) instead of an absolute path.

Fix: Always use absolute paths. Verify with:

pwd
ls -lh /absolute/path/to/file

Job Stays in “PD” (Pending) for Hours

Cause: You requested more resources than available, or the partition is overloaded.

Fix: Check available resources:

sinfo -N -l

Reduce --mem, --cpus-per-task, or --time. Or switch partitions.

GPU Job Fails with “GPU Not Found”

Cause: Missing --gres=gpu:1 or wrong partition.

Fix: Add --gres=gpu:1 and ensure --partition=gpu.

Job Killed After Timeout

Cause: --time limit was too short.

Fix: Increase --time. Start generous and reduce after seeing actual runtime in logs.

Next Steps

Submit your first single-CPU job with a simple test (e.g., python -c "print('Hello from HPC')")
Create an array job that runs 5 instances and logs results
Request GPU resources and verify CUDA is accessible
Use watch -n 2 squeue -u $USER to see jobs launch and complete in real-time
Name jobs meaningfully — job_lr0.001_batch32 beats job_1 when you’re running 100 experiments

What’s your first SLURM job going to be—a quick CPU test, or do you have a GPU model ready to train? Reply and let me know.

What’s the most frustrating SLURM error you’ve hit, and how did you finally debug it?

Apptainer & Conda: Reproducible HPC Python Environments

Mar 5, 2026

SSH & SLURM: Transfer Code to HPC Clusters