Basic Installation

Install the packages:

sudo apt install -y munge libmunge-dev

Create the munge key:

sudo /usr/sbin/create-munge-key
sudo chown munge: /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key

If the above fails, do it manually with:

sudo mungekey --create --force --keyfile /etc/munge/munge.key
sudo chown -R munge: /etc/munge /var/log/munge /var/lib/munge /run/munge
sudo chmod 0700 /etc/munge /var/log/munge /var/lib/munge
sudo chmod 0755 /run/munge
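
A quick sanity check (assuming the paths above): the key should be owned by munge and readable only by its owner.

ls -l /etc/munge/munge.key
# expected: -r-------- 1 munge munge ... /etc/munge/munge.key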

Start and check munge

sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl status munge
munge -n | unmunge

The result should look like:

STATUS:          Success (0)
ENCODE_HOST:     user-G4101 (127.0.1.1)
ENCODE_TIME:     2025-08-25 15:05:03 -0700 (1756159503)
DECODE_TIME:     2025-08-25 15:05:03 -0700 (1756159503)
TTL:             300
CIPHER:          aes128 (4)
MAC:             sha256 (5)
ZIP:             none (0)
UID:             subho (1002)
GID:             subho (1002)
LENGTH:          0
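
For a multi-node cluster, every node must share the same key. A minimal sketch, assuming a compute node reachable as node01 with working sudo over ssh; adjust for your own ssh/sudo setup:

sudo cat /etc/munge/munge.key | ssh node01 'sudo tee /etc/munge/munge.key > /dev/null'
ssh node01 'sudo chown munge: /etc/munge/munge.key && sudo chmod 400 /etc/munge/munge.key && sudo systemctl restart munge'
# a credential encoded here should decode there with STATUS: Success (0)
munge -n | ssh node01 unmunge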

Install SLURM on Ubuntu/WSL from the distribution package. Although this is the easiest route, the packaged version is old and its GPU features are partial and often fail; compile from source if you need advanced features:

sudo apt install -y slurm-wlm
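
To see which version the distribution shipped (and decide whether compiling from source is worth it):

slurmd --version
sinfo --version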

Commands

  • Show fairshare usage for various queues and partitions: sacctmgr show qos
  • Show a user's fairshare score: sshare -l | grep usernameOrGrpName
  • Obtain basic partition info: sinfo -p sulcgpu3 -o "%P %a %l %D %t %N" or sinfo -Nel | grep "sulcgpu3"
  • Get information on a particular node: scontrol show node nodeName
  • Obtain queue info: squeue -p partitionName or squeue -u userName
  • Submit an empty job: sbatch -n 20 --gres=gpu:1 --time=4:00:00 -q sulcgpu1 -p sulcgpu1 --output=/dev/null --error=/dev/null $@ _interactiveScript

where _interactiveScript can be a very simple script, for example one that just starts a detached screen session:


#!/bin/sh
# Simple batch script that starts SCREEN.

exec screen -Dm -S slurm$SLURM_JOB_ID
  • Empty jobs can also be submitted using a SLURM batch script, as given below:
#!/bin/sh
#SBATCH -q private
#SBATCH -p general
#SBATCH -G a100:1
#SBATCH -t 5-00:00
#SBATCH -c 20
#SBATCH -o empty.out
#SBATCH -e empty.err
#SBATCH -J PHBdnaD

exec screen -Dm -S slurm$SLURM_JOB_ID

To enter the submitted job, use srun --pty --jobid=######## /bin/bash or simply ssh to the allocated node (ssh nodeName).
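
Putting the pieces together, an end-to-end flow might look like this (the script name, job ID, and node are placeholders):

# submit the screen-based batch script shown above (hypothetical filename)
sbatch empty_job.sh
# find the job ID and the node it was allocated
squeue -u $USER
# open a shell inside the job's allocation (replace 12345 with the real job ID)
srun --pty --jobid=12345 /bin/bash
# or, on the allocated node, reattach to the screen session the job started
screen -r slurm12345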

Setting up slurmdbd

  • Install MariaDB
    sudo apt-get install mariadb-server
    sudo systemctl enable mariadb
    sudo systemctl start mariadb
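    Optionally harden the fresh install and confirm the service is up before creating the SLURM database (mysql_secure_installation is interactive):
    sudo mysql_secure_installation
    sudo systemctl status mariadb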
    
  • Create a database for SLURM accounting:

sudo mysql -u root -p
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurmuser'@'localhost' IDENTIFIED BY 'slurmPass';
GRANT ALL ON slurm_acct_db.* TO 'slurmuser'@'localhost';
FLUSH PRIVILEGES;
EXIT;

  • The default port for MariaDB is 3306; use sudo ss -tulpn to confirm it.
  • Change the following parameters in the MariaDB server config (sudo nvim /etc/mysql/mariadb.conf.d/50-server.cnf):

innodb_buffer_pool_size = 1G
innodb_lock_wait_timeout = 300
  • Restart MariaDB and confirm it is running: sudo systemctl restart mariadb; sudo systemctl status mariadb
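    To verify that the new InnoDB settings took effect after the restart:
    sudo mysql -u root -p -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; SHOW VARIABLES LIKE 'innodb_lock_wait_timeout';"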

  • A sample slurmdbd.conf at /etc/slurm/slurmdbd.conf is given below:

#
# Sample /etc/slurmdbd.conf
#
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=localhost
DbdPort=6819
DebugLevel=info
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
SlurmUser=slurm
StoragePass=slurmPass
StorageType=accounting_storage/mysql
StorageUser=slurmuser
StoragePort=3306
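
  • slurmdbd refuses to start if slurmdbd.conf is not owned by SlurmUser and readable only by that user. A minimal sketch of the remaining setup, assuming the paths in the sample config above:

sudo chown slurm: /etc/slurm/slurmdbd.conf
sudo chmod 600 /etc/slurm/slurmdbd.conf
sudo mkdir -p /var/log/slurm /var/run/slurm
sudo chown slurm: /var/log/slurm /var/run/slurm
# note: /var/run is recreated at boot, so a systemd tmpfiles.d entry may be needed for the PID directory
sudo systemctl enable --now slurmdbd
sudo systemctl status slurmdbd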
  • A sample slurm.conf at /etc/slurm/slurm.conf is given below:

ClusterName=ringtail
SlurmctldHost=ringtail

MailProg=/usr/bin/mail

ProctrackType=proctrack/cgroup

ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
AuthType=auth/munge

StateSaveLocation=/var/spool/slurmctld

TaskPlugin=task/affinity,task/cgroup

InactiveLimit=0
KillWait=30

MinJobAge=300

SlurmctldTimeout=120
SlurmdTimeout=300

Waittime=0

SchedulerType=sched/backfill
SelectType=select/cons_tres

AccountingStorageEnforce=limits,qos,safe
AccountingStorageHost=localhost

AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd

JobCompLoc=/var/log/slurm/job_completions.log

JobCompType=jobcomp/filetxt

JobAcctGatherFrequency=30

SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log

# COMPUTE NODES
GresTypes=gpu,mps
NodeName=gpu CPUs=128 Sockets=1 CoresPerSocket=128 ThreadsPerCore=1 Gres=gpu:rtx4060:1,gpu:rtx4080:1,mps:200 NodeHostname=ringtail NodeAddr=ringtail State=idle RealMemory=96092
PartitionName=general Nodes=ALL Default=YES MaxTime=INFINITE State=UP
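
  • Since the node definition above declares GPU GRES, slurmd also needs a matching /etc/slurm/gres.conf. A minimal sketch; the device paths are assumptions, so verify them with nvidia-smi -L and ls /dev/nvidia* (the mps GRES declared above would need its own entries as well):

# /etc/slurm/gres.conf
Name=gpu Type=rtx4060 File=/dev/nvidia0
Name=gpu Type=rtx4080 File=/dev/nvidia1

With the configs in place, create the spool directories and start the daemons:

sudo mkdir -p /var/spool/slurmctld /var/spool/slurmd
sudo chown slurm: /var/spool/slurmctld
sudo systemctl enable --now slurmctld slurmd
sinfo    # the general partition should list the node as idle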
  • sudo sacctmgr add account generic cluster=ringtail description="Generic Account" organization="SulcLab"

  • sacctmgr list account

  • sudo sacctmgr add user subho account=generic cluster=ringtail
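
  • To confirm the associations reached the database, use e.g. sacctmgr show associations format=cluster,account,user,qos or sacctmgr show user subho withassoc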

Conclusion

SLURM is one of the best open-source workload managers and is widely used on modern HPC clusters.

References

  • https://slurm.schedmd.com/classic_fair_share.html
  • https://slurm.schedmd.com/fair_tree.html