Administration
Quick review:
- Cluster: Build, maintain, and update the cluster via the RIVERXADATA SLURM cluster
- Users: Synchronize users between nodes (see Sync User, section 5)
- Login: Set up the SSH server that allows users to log in with 2FA or required key pairs
- Permissions: Set the appropriate permissions for users and groups
If a user is in the docker group, they can gain root permissions. Do not add users to this group unless they are administrators. Consider using docker-rootless/apptainer instead.
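To audit who currently holds this privilege, a small filter over `getent group docker` output can list the members one per line. This is a minimal sketch: `group_members` is a hypothetical helper name, and it only reports supplementary members (field 4 of the group entry), so a user whose primary group is docker would not appear.

```shell
# Hypothetical helper: read group entries (name:passwd:gid:members) on stdin
# and print each member on its own line.
group_members() {
    # Field 4 holds the comma-separated member list; drop empty lines
    # so a group with no members prints nothing.
    cut -d: -f4 | tr ',' '\n' | sed '/^$/d'
}

# Example (run on a node):
#   getent group docker | group_members
```

Any name in the output that is not an administrator is a candidate for removal from the group.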
Monitor management
According to the cluster manag
Computing resource management
To add users and groups and limit their resources with sacctmgr, follow these steps:
- Add a new group:
  sacctmgr add account <group_name>
- Add a new user to a group:
  sacctmgr add user <username> account=<group_name>
- Set resource limits for a group:
  sacctmgr modify account <group_name> set GrpCPUMins=<cpu_minutes> GrpMem=<memory_limit>
- Set resource limits for a user:
  sacctmgr modify user <username> set MaxJobs=<max_jobs> MaxSubmitJobs=<max_submit_jobs>
Replace <group_name>, <username>, <cpu_minutes>, <memory_limit>, <max_jobs>, and <max_submit_jobs> with the appropriate values.
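The steps above can be chained for a single new user. The sketch below is illustrative only: `onboard_user` and the `DRY_RUN` variable are hypothetical names, and the `-i` (immediate) flag, which makes sacctmgr commit without asking for confirmation, is an addition not present in the commands above. By default the function only prints the commands so they can be reviewed first.

```shell
# Hypothetical helper: create a group, add a user to it, and cap the user's jobs.
# Arguments: <group_name> <username> <max_jobs> <max_submit_jobs>
# With DRY_RUN=1 (the default) the commands are printed, not executed.
onboard_user() {
    group=$1
    user=$2
    max_jobs=$3
    max_submit=$4
    dry_run=${DRY_RUN:-1}
    for cmd in \
        "sacctmgr -i add account $group" \
        "sacctmgr -i add user $user account=$group" \
        "sacctmgr -i modify user $user set MaxJobs=$max_jobs MaxSubmitJobs=$max_submit"
    do
        if [ "$dry_run" = "1" ]; then
            echo "$cmd"
        else
            # Unquoted on purpose: the string is split into command and arguments.
            $cmd
        fi
    done
}

# Example: review the commands, then apply them.
#   onboard_user physics alice 10 20
#   DRY_RUN=0 onboard_user physics alice 10 20
```

After applying the changes, `sacctmgr show assoc user=<username>` can be used to confirm that the association and its limits were recorded.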
Node management
Checking SLURM Logs
To manage and troubleshoot SLURM components (slurmdbd, slurmd, slurmctld), follow these steps to check their logs, located in /var/log/slurm/:
- Access the Node: SSH into the node where the SLURM component is running.
- Locate SLURM Log Files: SLURM log files are typically located in the /var/log/slurm/ directory. Common log files include:
  - /var/log/slurm/slurmdbd.log: SLURM database daemon log.
  - /var/log/slurm/slurmd.log: SLURM node daemon log.
  - /var/log/slurm/slurmctld.log: SLURM controller daemon log.
- View SLURM Logs: Use commands like cat, less, more, or tail to view the log files. For example:
  tail -f /var/log/slurm/slurmctld.log
- Filter SLURM Logs: Use grep to filter logs for specific keywords. For example:
  grep "error" /var/log/slurm/slurmd.log
- Check SLURM Configuration: Ensure that the SLURM configuration files (/etc/slurm/slurm.conf and related files) are correctly set up.
- Verify SLURM Daemon Status: Check the status of SLURM daemons to ensure they are running properly. For example:
  systemctl status slurmctld
By following these steps, you can effectively manage and troubleshoot SLURM components by checking their logs.
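As a quick first pass over the three daemon logs, the locate and filter steps can be combined into one summary. This is a rough sketch under stated assumptions: `slurm_log_errors` is a hypothetical helper, the log file names are the ones listed above, and matching the word "error" case-insensitively is only a coarse signal, not a full health check.

```shell
# Hypothetical helper: count lines containing "error" (case-insensitive)
# in each SLURM daemon log found under a log directory.
# Argument: log directory (defaults to /var/log/slurm).
slurm_log_errors() {
    log_dir=${1:-/var/log/slurm}
    for name in slurmdbd slurmd slurmctld; do
        file="$log_dir/$name.log"
        # Not every node runs every daemon, so skip logs that are absent.
        [ -f "$file" ] || continue
        # grep -c prints 0 and exits non-zero when nothing matches;
        # "|| true" keeps that from aborting scripts that use "set -e".
        count=$(grep -ci "error" "$file" || true)
        echo "$name: $count"
    done
}

# Example (run on the node being checked):
#   slurm_log_errors
#   slurm_log_errors /var/log/slurm
```

A non-zero count points to the log worth opening with less or tail first.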
Checking Logs
To manage and troubleshoot nodes effectively, it's essential to check the logs. Assuming the rsyslog server and client are already installed, follow these steps to check the logs:
- Access the Node: SSH into the node you want to check the logs for.
- Locate Log Files: Log files are typically located in the /var/log/ directory. Common log files include:
  - /var/log/syslog: General system log.
  - /var/log/auth.log: Authentication log.
  - /var/log/kern.log: Kernel log.
- View Logs: Use commands like cat, less, more, or tail to view the log files. For example:
  tail -f /var/log/syslog
- Filter Logs: Use grep to filter logs for specific keywords. For example:
  grep "error" /var/log/syslog
- Check rsyslog Configuration: Ensure that the rsyslog configuration files (/etc/rsyslog.conf and files in /etc/rsyslog.d/) are correctly set up to forward logs to the rsyslog server.
- Verify Log Forwarding: On the rsyslog server, check the logs to ensure that logs from the client nodes are being received. Logs are usually stored in /var/log/ on the server as well.
By following these steps, you can effectively manage and troubleshoot nodes by checking their logs.
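One way to check forwarding on the rsyslog server is to count log lines per sending host. The sketch below assumes the traditional syslog line layout ("Mon DD HH:MM:SS host program: message"), where the hostname is the fourth field; if the server writes a different timestamp format, the field number needs adjusting. `syslog_hosts` is a hypothetical helper name.

```shell
# Hypothetical helper: print "<hostname> <line count>" for every host
# that appears in a syslog-format file, sorted by hostname.
# Assumes the hostname is field 4 (traditional timestamp layout).
syslog_hosts() {
    awk '{ count[$4]++ } END { for (h in count) print h, count[h] }' "$1" | sort
}

# Example (run on the rsyslog server):
#   syslog_hosts /var/log/syslog
```

A client node missing from the output has probably not forwarded anything to this file. Running `logger "forwarding test"` on that client and then searching for the message on the server is a simple way to confirm.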