Administration

Quick review:

  • Cluster: Build, maintain, and update the cluster via the RIVERXADATA SLURM cluster
  • Users: Synchronize users between nodes (see Sync User, section 5)
  • Login: Set up the SSH server that allows users to log in with 2FA or required key pairs
  • Permissions: Set the appropriate permissions for users and groups
Warning

Any user in the docker group can gain root permissions on the host. Add only administrators to this group. For other users, consider rootless Docker or Apptainer instead.
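
To see who currently has this access, the members of the docker group can be listed. The following is a minimal sketch, assuming membership is recorded in group(5) format (name:password:GID:members). The helper name list_group_members is hypothetical and not part of any cluster tooling.

```shell
#!/bin/sh
# list_group_members GROUP [FILE]
# Print one supplementary member per line for GROUP, reading
# group(5)-format lines from FILE (default: /etc/group; "-" means stdin).
# Hypothetical helper for illustration only.
list_group_members() {
    group=$1
    file=${2:-/etc/group}
    awk -F: -v g="$group" '
        $1 == g && $4 != "" {
            n = split($4, m, ",")
            for (i = 1; i <= n; i++) print m[i]
        }' "$file"
}
```

On a live node, `getent group docker | list_group_members docker -` reads the entry through the name service, so it also covers users that do not come from /etc/group. A user whose primary group is docker does not appear in the member field and would need a separate check.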

Monitor management

According to the cluster manag

Computing resource management

To add users and groups and limit their resources with sacctmgr, follow these steps:

  1. Add a new group:

    sacctmgr add account <group_name>
  2. Add a new user to a group:

    sacctmgr add user <username> account=<group_name>
  3. Set resource limits for a group:

    sacctmgr modify account <group_name> set GrpCPUMins=<cpu_minutes> GrpMem=<memory_limit>
  4. Set resource limits for a user:

    sacctmgr modify user <username> set MaxJobs=<max_jobs> MaxSubmitJobs=<max_submit_jobs>

Replace <group_name>, <username>, <cpu_minutes>, <memory_limit>, <max_jobs>, and <max_submit_jobs> with the appropriate values.
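
The four steps can be collected into one script. The sketch below only prints the commands (a dry run) so they can be reviewed before anything is changed in the accounting database. The group lab_a, the user alice, and all limit values are hypothetical examples, and the option names are copied unchanged from the steps above.

```shell
#!/bin/sh
# Dry run: print the sacctmgr commands for one group and one user.
# Every value below is a hypothetical example; replace before use.
GROUP=lab_a
USER_NAME=alice
CPU_MINUTES=100000
MEMORY_LIMIT=64000
MAX_JOBS=10
MAX_SUBMIT_JOBS=20

# print_sacctmgr_plan writes one command per line, in the order of the steps.
# Nothing is executed; sacctmgr does not need to be installed to run this.
print_sacctmgr_plan() {
    echo "sacctmgr add account $GROUP"
    echo "sacctmgr add user $USER_NAME account=$GROUP"
    echo "sacctmgr modify account $GROUP set GrpCPUMins=$CPU_MINUTES GrpMem=$MEMORY_LIMIT"
    echo "sacctmgr modify user $USER_NAME set MaxJobs=$MAX_JOBS MaxSubmitJobs=$MAX_SUBMIT_JOBS"
}

print_sacctmgr_plan
```

After reviewing the output, the commands can be run one at a time. sacctmgr asks for confirmation before committing each change unless it is called with -i (immediate), and `sacctmgr show assoc` lists the resulting associations and limits.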

Node management

Checking SLURM Logs

To troubleshoot the SLURM daemons (slurmdbd, slurmd, slurmctld), check their logs in /var/log/slurm/ as follows:

  1. Access the Node:

    • SSH into the node where the SLURM component is running.
  2. Locate SLURM Log Files:

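    • SLURM log files are typically located in the /var/log/slurm/ directory. The actual paths are set by SlurmctldLogFile and SlurmdLogFile in slurm.conf and by LogFile in slurmdbd.conf. Common log files include: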
      • /var/log/slurm/slurmdbd.log - SLURM database daemon log.
      • /var/log/slurm/slurmd.log - SLURM node daemon log.
      • /var/log/slurm/slurmctld.log - SLURM controller daemon log.
  3. View SLURM Logs:

    • Use commands like cat, less, more, or tail to view the log files. For example:
      tail -f /var/log/slurm/slurmctld.log
  4. Filter SLURM Logs:

    • Use grep to filter logs for specific keywords. For example:
      grep "error" /var/log/slurm/slurmd.log
  5. Check SLURM Configuration:

    • Ensure that the SLURM configuration files (/etc/slurm/slurm.conf and related files) are correctly set up.
  6. Verify SLURM Daemon Status:

    • Check the status of SLURM daemons to ensure they are running properly. For example:
      systemctl status slurmctld

By following these steps, you can effectively manage and troubleshoot SLURM components by checking their logs.
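
Steps 3 and 4 can be combined into a small helper that reports how many error lines each daemon log contains. This is a minimal sketch, assuming the log file names listed above. The function name count_log_errors is hypothetical, and the case-insensitive match on "error" is a simplification that also counts harmless lines mentioning the word.

```shell
#!/bin/sh
# count_log_errors [DIR]
# For each SLURM daemon log in DIR (default: /var/log/slurm), print
# "<file>: <n> error line(s)". Logs absent on this node are reported as
# missing, because the three daemons may run on different nodes.
count_log_errors() {
    dir=${1:-/var/log/slurm}
    for name in slurmctld slurmdbd slurmd; do
        file="$dir/$name.log"
        if [ -r "$file" ]; then
            # grep -c prints the count but exits 1 when it is 0, hence "|| true"
            n=$(grep -ci 'error' "$file" || true)
            echo "$file: $n error line(s)"
        else
            echo "$file: missing or unreadable"
        fi
    done
}

count_log_errors /var/log/slurm
```

A non-zero count is only a pointer: the matching lines still need to be read, for example with the grep command from step 4.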

Checking System Logs

To manage and troubleshoot nodes effectively, it's essential to check the logs. Assuming the rsyslog server and client are already installed, follow these steps to check the logs:

  1. Access the Node:

    • SSH into the node you want to check the logs for.
  2. Locate Log Files:

    • Log files are typically located in the /var/log/ directory. Common log files (Debian/Ubuntu names; RHEL-based systems use /var/log/messages and /var/log/secure instead) include:
      • /var/log/syslog - General system log.
      • /var/log/auth.log - Authentication log.
      • /var/log/kern.log - Kernel log.
  3. View Logs:

    • Use commands like cat, less, more, or tail to view the log files. For example:
      tail -f /var/log/syslog
  4. Filter Logs:

    • Use grep to filter logs for specific keywords. For example:
      grep "error" /var/log/syslog
  5. Check rsyslog Configuration:

    • Ensure that the rsyslog configuration files (/etc/rsyslog.conf and files in /etc/rsyslog.d/) are correctly set up to forward logs to the rsyslog server.
  6. Verify Log Forwarding:

    • On the rsyslog server, check the logs to ensure that logs from the client nodes are being received. Logs are usually stored in /var/log/ on the server as well.

By following these steps, you can effectively manage and troubleshoot nodes by checking their logs.
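
Step 5 can be partly automated by searching the client configuration for forwarding rules. This is a minimal sketch, assuming forwarding is configured either with the legacy selector syntax (@host for UDP, @@host for TCP) or with an omfwd action. The function name find_forward_rules is hypothetical, and the pattern is a heuristic, not a full rsyslog parser.

```shell
#!/bin/sh
# find_forward_rules FILE...
# Print non-comment lines that look like rsyslog forwarding rules:
#   legacy syntax:  *.* @host:514   (UDP)   or   *.* @@host:514   (TCP)
#   action syntax:  action(type="omfwd" target="host" ...)
# Prints nothing if no rule is found or a file cannot be read.
find_forward_rules() {
    grep -hE '^[^#]*([[:space:]]@@?[^[:space:]@]|type="omfwd")' "$@" 2>/dev/null || true
}
```

A typical call would be `find_forward_rules /etc/rsyslog.conf /etc/rsyslog.d/*.conf`. For step 6, a test message sent from a client with `logger -t fwdtest "hello from $(hostname)"` can then be searched for in the server's logs.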