Overview

info

Summary on how to scale up the SLURM HPC using ansible to configure on multiple nodes, has the monitor and alert system.
You can config by yourself using this setup: https://github.com/riverxdata/river-slurm
Contact for work: nttg8100@gmail.com

The head node is responsible for managing the SLURM scheduler, job queue, and user authentication. Key services running on the head node include:

SLURM Controller (slurmctld): Centralized SLURM job manager.
SLURM Database (slurmdbd): Stores accounting and job history.
NFS Server: Exposes shared directories to all compute nodes.

Compute Nodes

Compute nodes are responsible for executing jobs submitted to the SLURM scheduler. Each compute node runs:

SLURM Compute Daemon (slurmd): Handles job execution.
NFS Client: Mounts shared storage from the NFS server.

Shared File System (NFS)

Source: https://thuanbui.me/cai-dat-nfs-server-va-nfs-client-tren-ubuntu-22-04/

NFS provides a shared directory structure to ensure uniform access to data across nodes. Key directories:

/home: User home directories with frequently and quick access (SSD NVMe)
/mnt/: Possible hard drive with large storage (like HDD)

User Management

User authentication and identity management should be centralized using the user mangement systems like NIS/LDAP. However, these set up are not easy to configure. To simplify the cluster, the linux users, the slurm users will be created using the ansible playbook with matched UID and GID.

Monitor

Source: https://swsmith.cc/posts/grafana-slurm.html

All of the nodes need to be monitored via the prometheus (metrics) and grafana (visualization). Using the alertmanager with slack api for notifications

Head Node (Controller-Login)​

Compute Nodes​

Shared File System (NFS)​

User Management​

Monitor​

Head Node (Controller-Login)

Compute Nodes

Shared File System (NFS)

User Management

Monitor