Building a Slurm HPC Cluster (Part 1) - Single Node Setup and Fundamentals

Thanh-Giang Tan Nguyen · Founder at RIVER · 8 min read

Building a High-Performance Computing (HPC) cluster can seem daunting, but with the right approach, you can create a robust system for managing computational workloads. This is Part 1 of a 3-part series where we'll build a complete Slurm cluster from scratch.

In this first post, we'll cover the fundamentals by setting up a single-node Slurm cluster and understanding the core concepts.

Series Overview

  • Part 1 (This Post): Introduction, Architecture, and Single Node Setup
  • Part 2: Scaling to Production with Ansible
  • Part 3: Administration and Best Practices

Why Slurm?

When it comes to job scheduling in HPC environments, several options exist including PBS, Grid Engine, and IBM's LSF. However, Slurm (Simple Linux Utility for Resource Management) stands out for several compelling reasons:

  • Open Source: Free to use with a large, active community
  • Scalability: Designed to scale from small clusters to the world's largest supercomputers
  • Flexibility: Fine-grained control over job scheduling, resource allocation, and priority settings
  • Integration: Works seamlessly with MPI, distributed computing frameworks (Spark, Ray, Dask), and monitoring tools
  • Performance: Optimized for high throughput with minimal overhead

Understanding Slurm Architecture

Before diving into the implementation, it's crucial to understand the key components of a Slurm cluster:

HPC Architecture

source: https://www.marquette.edu/high-performance-computing/architecture.php

Core Components

  1. slurmctld (Controller Daemon): The brain of the cluster, running on the controller node. It handles job scheduling, resource tracking, and communicates with compute nodes.

  2. slurmd (Node Daemon): Runs on compute nodes to execute jobs and report status back to the controller.

  3. slurmdbd (Database Daemon): Optional but recommended for storing job accounting data, resource usage tracking, and fair-share scheduling.

Node Types

Node Type    Services                   Purpose
Controller   slurmctld                  Manages job scheduling and resources
Compute      slurmd                     Executes submitted jobs
Login        Slurm clients              User access point for job submission
Database     slurmdbd, MySQL/MariaDB    Stores accounting data

Slurm Architecture

source: https://www.schedmd.com/

For a deeper understanding of Slurm architecture, check our Slurm Architecture documentation.

Single Node Setup - Understanding the Fundamentals

Starting with a single-node setup helps you understand how Slurm works before scaling up. This approach is perfect for learning and local development.

info

This setup runs on Ubuntu 20.04 and includes all standard Slurm features. Note that this configuration is for learning purposes - for production environments, you'll want the multi-node setup covered in Part 2.

Basic Installation

First, install the required Slurm components:

sudo apt-get update -y && sudo apt-get install -y slurmd slurmctld

Verify the installation:

# Locate slurmd and slurmctld
which slurmd
# Output: /usr/sbin/slurmd

which slurmctld
# Output: /usr/sbin/slurmctld
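
You can also confirm which Slurm version the packages installed (on Ubuntu 20.04 this is typically the 19.05 series, but verify the exact version on your own system):

# Print the installed Slurm version
slurmd -V
slurmctld -V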

Configuring slurm.conf

The slurm.conf file is the heart of your Slurm configuration. It must be identical across every node in a cluster (for now, we have just the one).

Create your slurm.conf:

cat <<EOF > slurm.conf
# slurm.conf for a single-node Slurm cluster
ClusterName=localcluster
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=2
SlurmctldPidFile=/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none

# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0

# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core

# ACCOUNTING (not enabled yet)
AccountingStorageType=accounting_storage/none
JobAcctGatherType=jobacct_gather/none
JobAcctGatherFrequency=30

# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log

# COMPUTE NODES (adjust CPUs and RealMemory to match your system)
NodeName=localhost CPUs=2 Sockets=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=1024 State=UNKNOWN

# PARTITION CONFIGURATION
PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
EOF

sudo mv slurm.conf /etc/slurm-llnl/slurm.conf
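
If you are unsure what to put on the NodeName line, slurmd can report the hardware it detects in slurm.conf syntax. This is a useful sanity check; the example output below is only indicative and will differ on your machine:

# Ask slurmd to describe this machine in slurm.conf format
slurmd -C
# Example output (yours will differ):
# NodeName=yourhost CPUs=2 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=3931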

Starting Services

Start the Slurm daemons:

# Start slurmd (compute daemon)
sudo service slurmd start
sudo service slurmd status

# Start slurmctld (controller daemon)
sudo service slurmctld start
sudo service slurmctld status

slurmd status

slurmctld status

Test your setup by submitting a simple interactive job:

srun --mem 500MB -c 1 --pty bash

# Check job details
squeue -o "%i %P %u %T %M %l %D %C %m %R %Z %N" | column -t
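
You can also confirm that the node and partition registered correctly, and submit a minimal batch job as a second sanity check (the file name hello.sbatch is just an example):

# Check node and partition state - LocalQ should list localhost as idle
sinfo

# Create and submit a minimal batch job
cat <<EOF > hello.sbatch
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --mem=100M
#SBATCH --cpus-per-task=1
hostname
EOF

sbatch hello.sbatch

# Once the job finishes, its output lands in slurm-<jobid>.out by default
cat slurm-*.out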

Critical: Resource Limiting with cgroups

warning

This is a critical step that's often overlooked!

Without proper cgroup configuration, jobs can exceed their allocated resources, potentially causing system instability or crashes. The job scheduler will accept resource limits, but won't actually enforce them.

Let's test this problem first. Submit a job requesting 500MB and try to allocate much more:

srun --mem 500MB -c 1 --pty bash

# Allocate memory in 100MB chunks, quickly exceeding the 500MB limit
# (the loop runs until interrupted - stop it with Ctrl+C once you are well past 1GB)
declare -a mem
i=0
while :; do
mem[$i]=$(head -c 100M </dev/zero | tr '\000' 'x')
((i++))
echo "Allocated: $((i * 100)) MB"
done

Before submitting the job, memory usage is less than 200MB:

Memory before stress

After allocating well over 1GB, the job is still not killed, because there is no control group (cgroup) configuration to enforce the limit:

Over resource limit

You'll notice the job continues running even after exceeding 500MB - that's the problem!
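
If you want to watch this happen, open a second terminal; the commands below are one way to observe it (replace <jobid> with the ID shown by squeue):

# Watch overall memory use climb well past the 500MB request
watch -n 2 free -h

# Inspect what Slurm thinks the job was actually granted
scontrol show job <jobid>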

Now let's fix it with cgroups:

cat <<EOF > cgroup.conf
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
ConstrainSwapSpace=yes
MaxSwapPercent=5
MemorySwappiness=0
EOF

sudo mv cgroup.conf /etc/slurm-llnl/cgroup.conf

Update slurm.conf to use cgroup plugins:

sudo sed -i -e "s|ProctrackType=proctrack/linuxproc|ProctrackType=proctrack/cgroup|" \
-e "s|TaskPlugin=task/none|TaskPlugin=task/cgroup|" /etc/slurm-llnl/slurm.conf

Enable cgroup in GRUB and reboot:

sudo sed -i 's/^GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1 /' /etc/default/grub
sudo update-grub
sudo reboot

After reboot, restart Slurm services:

sudo service slurmctld restart
sudo service slurmd restart
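
Before re-running the memory test, confirm that the kernel actually picked up the new boot parameters:

# The kernel command line should now include the cgroup flags
cat /proc/cmdline
# Look for: cgroup_enable=memory swapaccount=1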

Now test again with the same memory allocation script - this time, the job will be killed when it exceeds the limit!

Out of Memory

Enabling Accounting

Job accounting is essential for:

  • Tracking who is using resources
  • Monitoring job completion and failures
  • Enforcing resource limits per user/group
  • Fair-share scheduling

Accounting disabled

Install the required packages:

sudo apt-get install slurmdbd mariadb-server -y

Create the database and user:

sudo service mysql start

sudo mysql -e "CREATE DATABASE slurm_acct_db;"
sudo mysql -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurm';"
sudo mysql -e "GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';"
sudo mysql -e "FLUSH PRIVILEGES;"

Verify the database was created:

sudo mysql -e "SHOW DATABASES;" 
sudo mysql -e "SELECT User, Host FROM mysql.user;"

Add database

Configure slurmdbd:

cat <<EOF > slurmdbd.conf
PidFile=/run/slurmdbd.pid
LogFile=/var/log/slurm-llnl/slurmdbd.log
DebugLevel=error
DbdHost=localhost
DbdPort=6819

# DB connection data
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=slurm
StorageLoc=slurm_acct_db
SlurmUser=slurm
EOF

sudo mv slurmdbd.conf /etc/slurm-llnl/slurmdbd.conf
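# Depending on your Slurm version, slurmdbd may warn or refuse to start unless
# slurmdbd.conf is owned by the SlurmUser and not readable by other users
sudo chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
sudo chmod 600 /etc/slurm-llnl/slurmdbd.conf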
sudo service slurmdbd start

Update slurm.conf to enable accounting:

sudo sed -i -e "s|AccountingStorageType=accounting_storage/none|AccountingStorageType=accounting_storage/slurmdbd\nAccountingStorageEnforce=associations,limits,qos\nAccountingStorageHost=localhost\nAccountingStoragePort=6819|" /etc/slurm-llnl/slurm.conf 

sudo sed -i -e "s|JobAcctGatherType=jobacct_gather/none|JobAcctGatherType=jobacct_gather/cgroup|" /etc/slurm-llnl/slurm.conf

sudo systemctl restart slurmctld slurmd
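
At this point slurmctld should be registered with slurmdbd. One way to confirm that the accounting settings took effect is to query the running configuration:

# The running config should now reference slurmdbd
scontrol show config | grep -i accountingstorage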

Add your cluster and user to accounting:

# Add cluster
sudo sacctmgr -i add cluster localcluster

# Add account for your user
sudo sacctmgr -i add account $USER Cluster=localcluster

# Add your user to the account
sudo sacctmgr -i add user $USER account=$USER DefaultAccount=$USER

sudo systemctl restart slurmctld slurmd

Add account
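
You can verify that the cluster and the user association were actually recorded (the format fields below are just a readable subset):

# Confirm the cluster was registered in the accounting database
sacctmgr show cluster

# Confirm the account/user association exists
sacctmgr list associations format=Cluster,Account,User,Partition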

Now test accounting by submitting a job and viewing its details:

# Submit a test job
srun --mem 500MB -c 1 hostname

# View accounting information
sacct

Account usage
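
sacct's default columns are fairly terse; you can request the fields you care about explicitly, for example:

# A more informative accounting summary for recent jobs
sacct --format=JobID,JobName,Partition,Account,AllocCPUS,State,ExitCode,Elapsed,MaxRSS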

Key Takeaways

In this first part of our series, we've covered:

  1. Why Slurm: Understanding the advantages of Slurm over alternatives
  2. Architecture: Core components (slurmctld, slurmd, slurmdbd) and their roles
  3. Basic Setup: Installing and configuring a single-node cluster
  4. Critical cgroups: Why resource limiting is essential (and how to enable it)
  5. Accounting: Setting up job tracking and resource monitoring

info

What's Next?

In Part 2, we'll take this knowledge and scale to a multi-node production cluster using Ansible automation. We'll add monitoring with Grafana, alerting via Slack, and shared storage with NFS.

This is Part 1 of the RiverXData series on building Slurm HPC clusters. Continue to Part 2 to learn about production deployment with Ansible.