Building a Slurm HPC Cluster (Part 2) - Scaling to Production with Ansible

9 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In Part 1, we learned the fundamentals by building a single-node Slurm cluster. Now it's time to scale up to a production-ready, multi-node cluster with automated deployment, monitoring, and alerting.

In this post, we'll use Ansible to automate the entire deployment process, making it reproducible and maintainable.

Series Overview

  • Part 1: Introduction, Architecture, and Single Node Setup
  • Part 2 (This Post): Scaling to Production with Ansible
  • Part 3: Administration and Best Practices

Why Ansible for HPC Clusters?

Moving from a single-node setup to a multi-node production cluster involves:

  • Configuring multiple machines identically
  • Managing dependencies and installation order
  • Keeping configurations synchronized
  • Handling user management across nodes
  • Setting up monitoring and logging infrastructure

Doing this manually is error-prone and time-consuming. Infrastructure automation tools like Ansible, Puppet, or Terraform solve this problem. We chose Ansible because:

  • Agentless: No software to install on managed nodes
  • Declarative: Describe the desired state, not the steps
  • Idempotent: Safe to run multiple times
  • YAML-based: Easy to read and version control
  • Large ecosystem: Many pre-built roles available

What is Ansible? (Quick Primer)

If you're new to Ansible, watch this excellent 100-second introduction:

(Embedded video: a 100-second introduction to Ansible)
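
To get a feel for the agentless, idempotent model, here is a small ad-hoc sketch (it assumes Ansible is already installed, as in Step 2 below, and an inventory file like the one we create in Step 4). Ansible connects to each node over plain SSH; re-running the same task reports "ok" instead of "changed" when nothing needs doing.

# Reach every node over SSH - no agent required on the managed machines
ansible -i inventories/hosts all -m ping

# Ensure a package is present; running this twice changes nothing the second time
ansible -i inventories/hosts all --become \
  -m ansible.builtin.apt -a "name=nfs-common state=present"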

Production Cluster Architecture

Our production setup includes these components:

Small HPC Cluster

Source: https://cs.phenikaa-uni.edu.vn/vi/post/gioi-thieu/co-so-vat-chat/he-thong-tinh-toan-hieu-nang-cao-phenikaa-hpc

Head Node (Controller + Login)

The head node manages the cluster and provides user access:

  • slurmctld: Job scheduling and resource management
  • slurmdbd: Accounting database
  • NFS Server: Shares directories to compute nodes
  • Prometheus: Metrics collection
  • Grafana: Monitoring dashboards
  • Alertmanager: Slack notifications

Compute Nodes

Worker nodes that execute jobs:

  • slurmd: Job execution daemon
  • NFS Client: Mounts shared storage
  • Node Exporter: Exposes system metrics
  • Slurm Exporter: Exposes Slurm-specific metrics

Shared Storage (NFS)

NFS provides unified file system access across all nodes:

NFS Architecture

Source: https://thuanbui.me/cai-dat-nfs-server-va-nfs-client-tren-ubuntu-22-04/

  • /home: User home directories (fast SSD/NVMe storage)
  • /mnt/data: Large datasets (high-capacity HDD)
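
As a rough sketch of how this fits together (the export options and subnet below are illustrative, not necessarily what the playbook configures), the head node exports both directories and each compute node mounts them:

# On the head node: example /etc/exports entries (illustrative)
# /home      192.168.58.0/24(rw,sync,no_subtree_check)
# /mnt/data  192.168.58.0/24(rw,sync,no_subtree_check)

# On a compute node: confirm both shares are mounted via NFS
df -hT /home /mnt/data
mount | grep nfs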

Monitoring Stack

Complete observability for your cluster:

Grafana Dashboard

Source: https://swsmith.cc/posts/grafana-slurm.html

  • Prometheus: Time-series metrics database
  • Grafana: Beautiful dashboards for visualization
  • Alertmanager: Sends alerts to Slack when issues occur
  • Node Exporter: System-level metrics (CPU, memory, disk)
  • Slurm Exporter: Slurm-specific metrics (jobs, partitions, nodes)
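
Once the stack is deployed, you can poke at the raw metrics yourself. The ports below are the usual defaults (Node Exporter on 9100, Prometheus on 9090); the Slurm exporter's port depends on how the playbook configures it.

# System metrics straight from a compute node's Node Exporter
curl -s http://worker-01:9100/metrics | grep node_load1

# Ask Prometheus which targets it is currently scraping
curl -s http://localhost:9090/api/v1/targets | head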

Setting Up the RiverXData Slurm Cluster

We've created a comprehensive Ansible playbook that automates everything. Let's get started!

Slurm Deployment Architecture

Prerequisites

  • Multiple Ubuntu 20.04 or 24.04 machines (or VMs)
  • SSH access to all nodes
  • Sudo privileges on all nodes
  • A Slack workspace (for alerts)
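
Before going further, it's worth confirming that you can reach every node non-interactively. A quick hand-rolled check (the IPs and username are placeholders, adjust them to your machines) looks like this:

# Confirm key-based SSH access to each node; BatchMode fails instead of prompting
for host in 192.168.58.10 192.168.58.11 192.168.58.12; do
  ssh -o BatchMode=yes your_username@"$host" hostname
done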

Step 1: Clone the Repository

git clone https://github.com/riverxdata/river-slurm.git -b 1.0.0
cd river-slurm

Step 2: Install Ansible and Dependencies

# Ubuntu 24.04, without Vagrant
bash scripts/setup.sh 24.04 false

# For Ubuntu 20.04
bash scripts/setup.sh 20.04 false

# For developers: Install with Vagrant support
bash scripts/setup.sh 24.04 true

This script installs:

  • Ansible and required Python packages
  • Community Ansible collections
  • Galaxy roles (geerlingguy.docker, etc.)
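
A quick sanity check once the script finishes (exact versions will differ):

# Confirm Ansible is on the PATH and the Galaxy content was installed
ansible --version
ansible-galaxy role list
ansible-galaxy collection list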

Step 3: Set Up Slack Alerts

Info: In production environments, even small teams benefit from proactive monitoring. Slack is perfect for this: you'll get notifications when nodes go down, jobs fail, or resources run low.

Create a Slack App

  1. Go to Slack API and create a new app
  2. Choose "From scratch"
  3. Name it (e.g., "Slurm Cluster Monitor")
  4. Select your workspace

Create Slack App

Enable Incoming Webhooks

  1. Navigate to "Incoming Webhooks" in your app settings
  2. Activate incoming webhooks
  3. Click "Add New Webhook to Workspace"
  4. Select the channel for notifications (e.g., #cluster-alerts)
  5. Copy the webhook URL

Slack Config Step 1

Slack Config Step 2

Slack Config Step 3

Test Your Webhook

curl -X POST -H 'Content-type: application/json' \
--data '{"text":"Hello from Slurm cluster!"}' \
https://hooks.slack.com/services/YOUR/WEBHOOK/URL

You should see the message appear in your Slack channel!

Step 4: Configure Your Inventory

Create inventories/hosts (or copy from inventories/hosts.example):

[slurm_master]
controller-01 ansible_host=192.168.58.10

[slurm_worker]
worker-01 ansible_host=192.168.58.11
worker-02 ansible_host=192.168.58.12

[slurm:children]
slurm_master
slurm_worker

[all:vars]
ansible_user=your_username
slurm_password=secure_munge_password
slurm_account_db_pass=secure_db_password
slack_api_url=https://hooks.slack.com/services/YOUR/WEBHOOK/URL
slack_channel=#cluster-alerts
admin_user=admin
admin_password=secure_grafana_password
Warning: As a security best practice, use Ansible Vault to encrypt sensitive variables such as passwords and API keys.

# Create encrypted vault
ansible-vault create inventories/vault.yml

# Or encrypt existing file
ansible-vault encrypt inventories/hosts
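
If you do move the secrets into an encrypted file, remember to load it and supply the vault password at run time. For example, assuming the variables live in inventories/vault.yml:

# Load the encrypted variables and prompt for the vault password
ansible-playbook -i inventories/hosts river_cluster.yml \
  -e @inventories/vault.yml --ask-vault-pass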

Optional Parameters

You can also set these optional variables in your inventory:

default_password=temporary_user_password  # Forces a password change on first login
users=alice,bob,charlie                   # Comma-separated list of users to create

Step 5: Deploy the Cluster

Now for the magic moment - deploy your entire cluster with one command!

# If you have passwordless sudo configured
ansible-playbook -i inventories/hosts river_cluster.yml

# If you need to enter sudo password
ansible-playbook -i inventories/hosts river_cluster.yml --ask-become-pass
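
Two standard Ansible flags are worth knowing here: --check performs a dry run without modifying the nodes (not every role behaves perfectly in check mode, but it's a useful smoke test), and --limit re-runs the playbook against a subset of hosts.

# Dry run: report what would change without touching the nodes
ansible-playbook -i inventories/hosts river_cluster.yml --check

# Re-run against a single host, e.g. after adding a new worker
ansible-playbook -i inventories/hosts river_cluster.yml --limit worker-02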

What this playbook does:

  1. Prepares all nodes:
    • Updates packages
    • Installs dependencies
    • Configures firewalls
  2. Sets up the controller:
    • Installs slurmctld and slurmdbd
    • Configures MariaDB for accounting
    • Sets up NFS server
    • Installs monitoring stack
  3. Configures compute nodes:
    • Installs slurmd
    • Mounts NFS shares
    • Configures metrics exporters
  4. Deploys monitoring:
    • Prometheus for metrics collection
    • Grafana with pre-configured dashboards
    • Alertmanager with Slack integration
  5. Synchronizes configurations:
    • Copies slurm.conf to all nodes
    • Sets up Munge authentication
    • Configures log aggregation

Step 6: Add Users

ansible-playbook -i inventories/hosts river_users.yml

This creates Linux users on all nodes with:

  • Synchronized UID/GID across nodes
  • Home directories on shared NFS
  • Slurm accounting associations
Info: Note on user management: for production, consider integrating with LDAP or Active Directory. However, NIS and LDAP setup can be complex on Ubuntu. Our Ansible approach provides a simpler alternative that works well for small to medium clusters.
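
After the play finishes, you can spot-check that UIDs line up across nodes and that Slurm accounting knows about the new accounts (alice is a placeholder for one of your users):

# The UID/GID should be identical on every node
ansible -i inventories/hosts all -m ansible.builtin.command -a "id alice"

# On the controller: list the Slurm accounting associations
sacctmgr show associations format=Cluster,Account,User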

Step 7: Verify the Setup

SSH into the controller node and run:

# Check cluster status
sinfo

# Expected output:
# PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
# compute* up infinite 2 idle worker-01,worker-02

# View job queue
squeue

# Submit a test job
srun --nodes=1 --ntasks=1 hostname

# Check accounting
sacct

# View cluster configuration
scontrol show config | head -20

Step 8: Access Grafana Dashboards

Grafana runs on the controller node at port 3000. To access it securely from your local machine:

# Create SSH tunnel
ssh -N -L 3001:localhost:3000 your_user@controller_ip

# Now open in browser: http://localhost:3001
# Login: admin / your_grafana_password
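
If the page doesn't load, first make sure Grafana is actually answering on the controller before debugging the tunnel:

# Run on the controller node: a 200 response means Grafana is up
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:3000/login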

You'll see pre-configured dashboards showing:

Node Metrics Dashboard

  • CPU usage per node
  • Memory utilization
  • Disk I/O
  • Network traffic
  • System load

Node Grafana Dashboard

Slurm Metrics Dashboard

  • Active jobs
  • Job queue length
  • Node states (idle, allocated, down)
  • CPU allocation
  • Memory usage
  • Job completion rates

Slurm Grafana Dashboard

What About Alerts?

Alertmanager is configured to send Slack notifications for:

  • Node down: When a compute node becomes unresponsive
  • Node resumed: When a node comes back online
  • High CPU usage: Sustained high CPU across cluster
  • High memory usage: Memory pressure warnings
  • Disk space low: Storage running out

Example alert in Slack when a node goes down:

Node Resume Alert

For detailed information, check the Grafana dashboard:

Grafana Node Down
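
If you want to watch the whole pipeline fire, you can simulate a failure on one worker. The exact trigger depends on the alert rules the playbook ships, but assuming the node-down alert is driven by the exporter's up metric, briefly stopping Node Exporter is usually enough:

# On a worker node; the service name may differ (e.g. prometheus-node-exporter)
sudo systemctl stop node_exporter
sleep 300   # give Prometheus and Alertmanager time to fire; the delay depends on the rules
sudo systemctl start node_exporter
# Expect a "node down" message in Slack, followed by a resolved/resumed notification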

Testing Your Cluster

Let's run some tests to ensure everything works:

Test 1: Simple Job

srun hostname

Test 2: Multi-node Job

srun --nodes=2 --ntasks=2 hostname

Test 3: Interactive Session

srun --nodes=1 --cpus-per-task=2 --mem=2G --pty bash

# Inside the session
hostname
nproc
free -h
exit

Test 4: Batch Job

Create test_job.sh:

#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_%j.out
#SBATCH --error=test_%j.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=1G
#SBATCH --time=00:05:00

echo "Job started at $(date)"
echo "Running on node: $(hostname)"
echo "CPUs allocated: $SLURM_CPUS_PER_TASK"
echo "Memory allocated: $SLURM_MEM_PER_NODE MB"

# Do some work
sleep 60

echo "Job finished at $(date)"

Submit it:

sbatch test_job.sh

# Check status
squeue

# When done, view output
cat test_*.out

Test 5: Resource Limits

# Submit job requesting more resources than available
srun --mem=999999 --pty bash

# Should fail with:
# srun: error: Unable to allocate resources: Requested node configuration is not available

Test 6: Accounting

# View your jobs
sacct

# Detailed accounting info
sacct --format=JobID,JobName,User,State,Start,End,Elapsed,CPUTime,MaxRSS

# Cluster usage summary
sreport cluster utilization

Architecture Diagram

Here's what you've built:

┌───────────────────────────────┐
│    Users SSH to Controller    │
└───────────────┬───────────────┘
                │
┌───────────────▼───────────────┐
│    Controller Node (Head)     │
│  slurmctld                    │  Job scheduling
│  slurmdbd + MariaDB           │  Accounting
│  NFS Server                   │  Shared storage
│  Prometheus + Grafana         │  Monitoring
│  Alertmanager                 │  Alerts
└────────┬─────────────┬────────┘
         │             │
   ┌─────▼─────┐ ┌─────▼─────┐
   │ worker-01 │ │ worker-02 │
   │ slurmd    │ │ slurmd    │
   │ NFS client│ │ NFS client│
   └───────────┘ └───────────┘

For Developers: Local Testing with Vagrant

If you want to test the deployment locally using VMs:

# Install Vagrant with libvirt provider
bash scripts/setup.sh 24.04 true

# Create local VMs and deploy cluster
vagrant up

# SSH to controller
vagrant ssh controller-01

# Destroy VMs when done
vagrant destroy -f

Key Takeaways

In this post, we've covered:

  1. Why Automation: The benefits of using Ansible for cluster management
  2. Production Architecture: Multi-node setup with monitoring and alerting
  3. Slack Integration: Proactive monitoring with notifications
  4. Automated Deployment: One command to deploy the entire cluster
  5. Verification: Testing your cluster thoroughly

What's Next?

In Part 3, we'll cover daily administration tasks, troubleshooting, security best practices, and advanced resource management.

Resources

Contact

Questions about the deployment? Reach out at: nttg8100@gmail.com


This is Part 2 of the RiverXData series on building Slurm HPC clusters. Continue to Part 3 for administration and best practices.