
Building a Slurm HPC Cluster (Part 3) - Administration and Best Practices

13 min read
Thanh-Giang Tan Nguyen
Founder at RIVER

In Part 1 and Part 2, we built a complete Slurm HPC cluster from a single node to a production-ready multi-node system. Now let's learn how to manage, maintain, and secure it effectively.

This final post covers daily administration tasks, troubleshooting, security hardening, and integration with data processing frameworks.

Series Overview

  • Part 1: Introduction, Architecture, and Single Node Setup
  • Part 2: Scaling to Production with Ansible
  • Part 3 (This Post): Administration and Best Practices

Administration Overview

Managing a Slurm cluster involves several key areas:

  • Cluster Management: Build, maintain, and update the cluster via Ansible
  • User Management: Synchronize users across nodes with proper permissions
  • Login Security: Implement SSH hardening with 2FA or key pairs
  • Resource Management: Enforce limits and fair-share policies
  • Monitoring: Track performance and resource utilization
  • Troubleshooting: Diagnose and resolve issues

User and Resource Management

Adding Users and Groups

Slurm uses accounts (groups) to organize users and apply resource policies:

# Add a new account/group
sacctmgr add account research_team Description="Research Team"

# Add a user to an account
sacctmgr add user john account=research_team

# Add user with multiple accounts
sacctmgr add user alice account=research_team,dev_team DefaultAccount=research_team

# View accounts
sacctmgr show account

# View users
sacctmgr show user

Setting Resource Limits

Account-Level Limits

Control resources for entire groups:

# Note: limits only take effect if AccountingStorageEnforce includes "limits" in slurm.conf

# Limit total CPU minutes (prevents monopolizing the cluster)
sacctmgr modify account research_team set GrpTRESMins=cpu=100000

# Limit memory (in MB)
sacctmgr modify account research_team set GrpTRES=mem=500000

# Limit total jobs in the system (pending + running)
sacctmgr modify account research_team set GrpSubmitJobs=50

# Limit concurrently running jobs
sacctmgr modify account research_team set GrpJobs=20

# Limit number of allocated nodes
sacctmgr modify account research_team set GrpTRES=node=10
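
To confirm the limits took effect, you can query the stored associations (a quick check; the format fields follow sacctmgr's association output options):

sacctmgr show assoc account=research_team format=account,user,grptres,grpjobs,grpsubmit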

User-Level Limits

Control individual user behavior:

# Limit jobs in the queue (pending + running)
sacctmgr modify user john set MaxSubmitJobs=10

# Limit concurrently running jobs
sacctmgr modify user john set MaxJobs=5

# Limit wall time per job (in minutes)
sacctmgr modify user john set MaxWall=1440 # 24 hours

# Limit CPUs per job
sacctmgr modify user john set MaxTRESPerJob=cpu=32

# View user limits
sacctmgr show user john withassoc format=user,account,maxjobs,maxsubmit,maxwall

Quality of Service (QoS)

QoS allows you to create service tiers with different priorities:

# Create QoS levels (a 'normal' QoS exists by default, so adjust it rather than re-adding it)
sacctmgr modify qos normal set priority=100
sacctmgr add qos high priority=500 MaxWall=2-00:00:00 MaxJobsPerUser=5
sacctmgr add qos low priority=50

# Assign QoS to account
sacctmgr modify account research_team set qos=normal,high

# Users can specify QoS when submitting
sbatch --qos=high job_script.sh
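
To review the QoS definitions after creating them (a simple verification; adjust the format fields to taste):

sacctmgr show qos format=name,priority,maxwall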

Fair-Share Scheduling

Ensure equitable resource distribution:

# Set fair-share values (higher = more priority)
sacctmgr modify account research_team set fairshare=100
sacctmgr modify account dev_team set fairshare=50

# View fair-share tree
sshare -a

# View detailed fair-share info
sshare -A research_team --all
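
To see how fair-share actually feeds into job priority, sprio breaks the priority of pending jobs into its components:

sprio -l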

Node Management

Checking Node Status

# View all nodes
sinfo

# Detailed node information
sinfo -Nel

# Show node states
sinfo -N -o "%N %T %C %m %e %f"

# View specific node details
scontrol show node worker-01

Node States

Node states you'll encounter (the snippet after this list shows how to filter for them):

  • IDLE: Available for jobs
  • ALLOCATED: Running jobs
  • MIXED: Partially allocated
  • DRAIN: Won't accept new jobs (draining)
  • DRAINED: Fully drained
  • DOWN: Not responding
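
For example, you can filter nodes by state to spot problems quickly (a small illustrative snippet; adjust the state list as needed):

# Nodes that are unavailable, with the recorded reason
sinfo -N -t drain,down -o "%N %T %E"

# Nodes currently usable
sinfo -N -t idle,mixed,allocated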

Node Maintenance

Draining a Node

When you need to perform maintenance:

# Drain node (won't accept new jobs, allows running jobs to finish)
scontrol update NodeName=worker-01 State=drain Reason="Hardware upgrade"

# Take the node down immediately (running jobs are killed or requeued)
scontrol update NodeName=worker-01 State=down Reason="Emergency maintenance"

# Check drain reason
sinfo -R

Resuming a Node

After maintenance:

# Resume node
scontrol update NodeName=worker-01 State=resume

# Verify it's back
sinfo -n worker-01

Forcing Node Down

If a node is misbehaving:

# Mark node as down
scontrol update NodeName=worker-01 State=down Reason="Hardware failure"

# When fixed, resume
scontrol update NodeName=worker-01 State=resume

Adding New Compute Nodes

  1. Update the Ansible inventory (inventories/hosts):

[slurm_worker]
worker-01 ansible_host=192.168.58.11
worker-02 ansible_host=192.168.58.12
worker-03 ansible_host=192.168.58.13 # NEW

  2. Run the Ansible playbook:

ansible-playbook -i inventories/hosts river_cluster.yml --limit worker-03

  3. Update slurm.conf on the controller and all nodes (Ansible handles this).

  4. Restart slurmctld:

sudo systemctl restart slurmctld

  5. Verify the new node:

sinfo
scontrol show node worker-03

Monitoring and Troubleshooting

Checking Slurm Logs

Logs are essential for diagnosing issues:

# Controller logs
sudo tail -f /var/log/slurm/slurmctld.log

# Worker node logs (on compute nodes)
sudo tail -f /var/log/slurm/slurmd.log

# Database logs
sudo tail -f /var/log/slurm/slurmdbd.log

# Filter for errors
sudo grep "error" /var/log/slurm/*.log

# Filter for specific node
sudo grep "worker-01" /var/log/slurm/slurmctld.log

# Last 100 lines with context
sudo tail -100 /var/log/slurm/slurmctld.log
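
If your daemons also log to the systemd journal (common on Ubuntu installs), journalctl gives the same view without knowing the log paths:

# Follow controller logs via the journal
sudo journalctl -u slurmctld -f

# Worker logs on a compute node
sudo journalctl -u slurmd --since "1 hour ago"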

Common Issues and Solutions

Issue: Node Shows as DOWN

Diagnosis:

sinfo
# OUTPUT: worker-01 down ...

scontrol show node worker-01
# Check "Reason" field

Solutions:

# 1. Check if slurmd is running
ssh worker-01 "sudo systemctl status slurmd"

# 2. Restart slurmd
ssh worker-01 "sudo systemctl restart slurmd"

# 3. Check network connectivity
ping worker-01

# 4. Check logs
ssh worker-01 "sudo tail -50 /var/log/slurm/slurmd.log"

# 5. Resume the node
scontrol update NodeName=worker-01 State=resume

Issue: Jobs Stuck in Pending

Diagnosis:

squeue
# See jobs in PD (pending) state

# Check why job is pending
squeue --start -j JOB_ID

# View detailed job info
scontrol show job JOB_ID

Common reasons:

  • Resources: Not enough resources available
  • Priority: Lower priority than other jobs
  • Dependency: Waiting for another job to complete
  • QOSMaxJobsPerUser: User has too many jobs running

Solutions:

# 1. Check available resources
sinfo -o "%P %a %l %D %N %C"

# 2. View job requirements
scontrol show job JOB_ID | grep -E "Partition|NumNodes|MinMemory"

# 3. Cancel job if needed
scancel JOB_ID

# 4. Modify pending job
scontrol update JobId=JOB_ID NumNodes=1

Issue: Jobs Failing Immediately

Diagnosis:

# Check job status
sacct -j JOB_ID

# View job output files
cat slurm-JOB_ID.out
cat slurm-JOB_ID.err

Common causes:

  • Script errors (check shebang line)
  • Missing executables
  • Resource limits exceeded
  • Permission issues
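
To narrow down which of these applies, the exit code and peak memory recorded by accounting are a quick first check (JOB_ID is a placeholder):

sacct -j JOB_ID --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS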

Issue: Accounting Database Not Working

Diagnosis:

# Check slurmdbd status
sudo systemctl status slurmdbd

# Test database connection
sudo mysql -u slurm -p slurm_acct_db -e "SHOW TABLES;"

# Check slurmdbd logs
sudo tail -50 /var/log/slurm/slurmdbd.log

Solutions:

# 1. Restart slurmdbd
sudo systemctl restart slurmdbd

# 2. Verify database credentials in slurmdbd.conf
sudo cat /etc/slurm-llnl/slurmdbd.conf

# 3. Check database permissions
sudo mysql -e "SHOW GRANTS FOR 'slurm'@'localhost';"

# 4. Restart slurmctld to reconnect
sudo systemctl restart slurmctld
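
Once slurmdbd and slurmctld are back, a quick way to confirm accounting is flowing again (assuming accounting was already configured):

# The cluster should list a ControlHost again
sacctmgr show cluster

# Fresh job records should appear
sacct --allusers --starttime=midnight --format=JobID,User,State,End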

System Logs with rsyslog

Our Ansible setup configures centralized logging:

# On controller (rsyslog server)
sudo tail -f /var/log/syslog

# Filter by hostname
sudo grep "worker-01" /var/log/syslog

# Filter by service
sudo grep "slurmd" /var/log/syslog

# Check authentication logs
sudo tail -f /var/log/auth.log

Security Best Practices

SSH Hardening

Warning: Secure your login nodes! HPC clusters are attractive targets for attackers.

For detailed SSH security setup, see our SSH Remote Server documentation.

Key recommendations:

  1. Disable password authentication:

# /etc/ssh/sshd_config
PasswordAuthentication no
PubkeyAuthentication yes

  2. Implement 2FA with Google Authenticator or similar.

  3. Use SSH key pairs:

# Generate key on your machine
ssh-keygen -t ed25519 -C "your_email@example.com"

# Copy to cluster
ssh-copy-id user@controller-node

  4. Limit SSH access:

# /etc/ssh/sshd_config
AllowUsers alice bob charlie
AllowGroups cluster_users

# Or deny specific users
DenyUsers baduser

  5. Change the default port (security through obscurity):

# /etc/ssh/sshd_config
Port 2222
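
After editing sshd_config, validate the file and reload the daemon before closing your session (a small safety check; the service name is ssh on Ubuntu, sshd on some other distros):

# Check the config for syntax errors
sudo sshd -t

# Reload if the check passes
sudo systemctl reload ssh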

Munge Authentication

Munge provides authentication between Slurm components:

# Verify munge is running
sudo systemctl status munge

# Test munge
munge -n | unmunge

# Generate new key (do this on controller, then distribute)
sudo /usr/sbin/create-munge-key

# Copy key to all nodes (Ansible does this automatically)
sudo scp /etc/munge/munge.key worker-01:/etc/munge/

# Restart munge on all nodes
sudo systemctl restart munge

Warning: The munge key must be identical on all nodes and have proper permissions (0400, owned by munge:munge).
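
One way to check that the key really is identical and correctly owned everywhere is an Ansible ad-hoc command against the same inventory used earlier (identical checksums mean identical keys):

ansible all -i inventories/hosts -m shell -a "md5sum /etc/munge/munge.key && ls -l /etc/munge/munge.key" --become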

Docker Security

Danger: Users in the docker group can gain root privileges!

# DON'T DO THIS (unless they're admins)
sudo usermod -aG docker regular_user

Why? Because they can run:

docker run -v /:/hostfs --privileged -it ubuntu bash
# Now they have root access to the host filesystem!

Solutions:

  1. Use Docker rootless mode:

# Install rootless Docker
curl -fsSL https://get.docker.com/rootless | sh

  2. Use Apptainer/Singularity (designed for HPC; see the job sketch after this list):

# Install Apptainer (use the upstream .deb if it isn't in your distro's repositories)
sudo apt-get install apptainer

# Run containers without root
apptainer run docker://ubuntu:latest

  3. Restrict the docker group: only add administrators to it
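
As a sketch of the Apptainer route inside a batch job (the image and command here are placeholders):

#!/bin/bash
#SBATCH --job-name=container-test
#SBATCH --cpus-per-task=4
#SBATCH --mem=4G

# Runs as an unprivileged user; no docker group membership required
apptainer exec docker://python:3.11-slim python3 -c "print('hello from a container')"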

Firewall Configuration

Restrict access to Slurm ports:

# Allow Slurm ports only from cluster network
sudo ufw allow from 192.168.58.0/24 to any port 6817 # slurmctld
sudo ufw allow from 192.168.58.0/24 to any port 6818 # slurmd
sudo ufw allow from 192.168.58.0/24 to any port 6819 # slurmdbd

# Allow SSH from anywhere
sudo ufw allow 22/tcp

# Enable firewall
sudo ufw enable
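
Double-check the resulting rule set before walking away:

sudo ufw status verbose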

Shared Storage Best Practices

NFS Performance Tuning

Optimize NFS for your workload:

# /etc/fstab on compute nodes
controller-01:/home /home nfs4 rw,soft,rsize=262144,wsize=262144,timeo=14,intr 0 0

Parameters explained:

  • soft: Return an error after retries time out (vs hard, which retries forever); soft mounts can cause I/O errors under heavy load, so use with care
  • rsize/wsize: Read/write buffer size in bytes (larger generally means better throughput)
  • timeo: Time before a retransmission, in tenths of a second
  • intr: Allow interrupts (ignored by modern kernels, harmless to keep)
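
To verify which options were actually negotiated after mounting (server defaults can override what you request):

# Show NFS mounts with their effective options
nfsstat -m

# Or filter the mount table
mount -t nfs4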

Storage Layout

Recommended directory structure:

/home/          # User home directories (SSD/NVMe)
├─ alice/
├─ bob/
└─ charlie/

/mnt/data/      # Large datasets (HDD or object storage)
├─ shared/      # Common datasets
├─ projects/    # Project-specific data
└─ scratch/     # Temporary data (auto-cleanup)

/opt/           # Shared software/modules
├─ anaconda/
├─ modules/
└─ apps/
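
The scratch area above implies some automated cleanup. A minimal sketch, assuming a 30-day retention policy (path and retention are illustrative; run it from root's crontab or a systemd timer):

# Delete anything under scratch not modified in the last 30 days
find /mnt/data/scratch -mindepth 1 -mtime +30 -delete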

Quotas

Prevent users from filling up shared storage:

# Set user quotas
sudo setquota -u alice 50G 60G 0 0 /home
sudo setquota -u alice 500G 550G 0 0 /mnt/data

# Check quotas
quota -u alice

# View all quotas
sudo repquota -a

Integration with Data Processing Frameworks

One of Slurm's greatest strengths is integration with modern computing frameworks:

Apache Spark

Submit Spark jobs to Slurm:

#!/bin/bash
#SBATCH --job-name=spark-job
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G

# Load Spark module
module load spark/3.5.0

# Start a standalone Spark cluster inside the allocation
# (Slurm is the resource manager here, so don't submit to YARN)
MASTER_URL=spark://$(hostname):7077
spark-class org.apache.spark.deploy.master.Master &
srun --ntasks=$SLURM_NNODES --ntasks-per-node=1 \
  spark-class org.apache.spark.deploy.worker.Worker $MASTER_URL &
sleep 10

# Run the Spark application against the standalone master
spark-submit \
  --master $MASTER_URL \
  --total-executor-cores 32 \
  --executor-memory 28G \
  my_spark_app.py

Ray (Distributed ML)

#!/bin/bash
#SBATCH --job-name=ray-job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --gpus-per-node=2

# The batch script runs on the first node of the allocation, which becomes the Ray head
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Start the Ray head here, then a worker on the remaining node
ray start --head --port=6379
srun --nodes=1 --ntasks=1 --exclude="$HEAD_NODE" ray start --address=$HEAD_NODE:6379 --block &
sleep 10

# Run Ray application
python ray_train.py

Dask

from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    cores=8,
    memory="16GB",
    processes=2,
    walltime="02:00:00",
    queue="compute",
)

cluster.scale(jobs=10)  # Request 10 jobs
client = Client(cluster)

# Your Dask code here

Nextflow (Bioinformatics)

// nextflow.config
process {
    executor = 'slurm'
    queue = 'compute'
    memory = '8 GB'
    time = '2h'
}

Run with:

nextflow run nf-core/rnaseq -profile slurm

Maintenance Tasks

Regular Updates

# Update cluster via Ansible
ansible-playbook -i inventories/hosts river_cluster.yml --tags update

# Update specific nodes
ansible-playbook -i inventories/hosts river_cluster.yml --limit worker-01,worker-02

Backup Critical Data

# Backup Slurm configuration
sudo cp /etc/slurm-llnl/slurm.conf /backup/slurm.conf.$(date +%Y%m%d)

# Backup accounting database
sudo mysqldump -u slurm -p slurm_acct_db > slurm_acct_backup_$(date +%Y%m%d).sql

# Backup user data (use rsync for efficiency)
sudo rsync -av /home/ /backup/home/

Monitoring Disk Space

# Check disk usage on all nodes
ansible all -i inventories/hosts -m shell -a "df -h"

# Check specific directory
ansible all -i inventories/hosts -m shell -a "du -sh /var/log/slurm"

# Find large files
find /home -type f -size +1G -exec ls -lh {} \;

Performance Optimization Tips

1. Tune Scheduler Parameters

# /etc/slurm-llnl/slurm.conf

# Use the backfill scheduler
SchedulerType=sched/backfill

# Run the main scheduling loop more often and keep a 24-hour backfill window
# (sched_interval is in seconds, bf_window in minutes)
SchedulerParameters=sched_interval=30,bf_window=1440

# Weight fair-share heavily relative to queue wait time (requires priority/multifactor)
PriorityType=priority/multifactor
PriorityWeightAge=1000
PriorityWeightFairshare=10000
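
Most SchedulerParameters and priority weight changes can be applied without downtime; changing SchedulerType itself requires restarting slurmctld:

sudo scontrol reconfigure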

2. Optimize Job Packing

# Use CR_CPU for CPU-bound jobs
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU

# Or CR_Memory for memory-bound jobs
SelectTypeParameters=CR_Memory

# Or CR_Core for mixed workloads
SelectTypeParameters=CR_Core
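
Note that changing SelectType or SelectTypeParameters requires restarting the Slurm daemons; a reconfigure is not enough (the Ansible group name matches the inventory from earlier):

sudo systemctl restart slurmctld
ansible slurm_worker -i inventories/hosts -m systemd -a "name=slurmd state=restarted" --become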

3. Create Multiple Partitions

# /etc/slurm-llnl/slurm.conf

# Fast partition for short jobs
PartitionName=quick Nodes=worker-[01-02] Default=NO MaxTime=01:00:00 State=UP Priority=100

# Standard partition
PartitionName=standard Nodes=worker-[01-04] Default=YES MaxTime=2-00:00:00 State=UP Priority=50

# Long partition for extended jobs
PartitionName=long Nodes=worker-[03-04] Default=NO MaxTime=7-00:00:00 State=UP Priority=25

# GPU partition
PartitionName=gpu Nodes=gpu-[01-02] Default=NO MaxTime=1-00:00:00 State=UP Priority=75
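
Users then choose a partition at submission time, for example:

sbatch --partition=quick --time=00:30:00 job_script.sh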

4. Enable Job Arrays for Batch Processing

#!/bin/bash
#SBATCH --array=1-100%10 # 100 tasks, max 10 concurrent

# Process task based on array index
python process.py --input data_${SLURM_ARRAY_TASK_ID}.txt

Conclusion

Congratulations! You now have the knowledge to build, deploy, and manage a production Slurm HPC cluster. Let's recap the journey:

Part 1: Foundations

  • Understanding Slurm architecture
  • Single-node setup for learning
  • Critical cgroup configuration
  • Job accounting basics

Part 2: Production Deployment

  • Ansible automation
  • Multi-node cluster setup
  • Monitoring with Grafana
  • Slack alerting

Part 3: Administration (This Post)

  • User and resource management
  • Node maintenance and troubleshooting
  • Security hardening
  • Performance optimization
  • Framework integration

Key Takeaways

  1. Start Simple, Scale Smart: Master single-node before going multi-node
  2. Automate Everything: Use Ansible for reproducible deployments
  3. Monitor Proactively: Set up alerting before problems occur
  4. Security First: SSH hardening, proper permissions, Docker caution
  5. Regular Maintenance: Backups, updates, and log monitoring
  6. Documentation: Document your cluster configuration and procedures

What's Next?

Consider these advanced topics:

  • High Availability: Redundant controllers with failover
  • LDAP Integration: Centralized authentication for large organizations
  • GPU Scheduling: Optimize for machine learning workloads
  • Cloud Bursting: Expand to cloud resources during peak demand
  • Custom Plugins: Extend Slurm with custom scheduling policies

Contact

Have questions or need help with your cluster? Reach out at: nttg8100@gmail.com


This concludes the RiverXData series on building Slurm HPC clusters. Thank you for following along! We hope this guide helps you build and manage effective HPC infrastructure.