SLURM
Basic Installation
Install the packages:
sudo apt install -y munge libmunge-dev
Create munge keys
sudo /usr/sbin/create-munge-key
sudo chown munge: /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
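If the above fails (for example, on releases where `create-munge-key` is not available), create the key and fix the permissions manually with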
sudo mungekey --create --force --keyfile /etc/munge/munge.key
sudo chown -R munge: /etc/munge /var/log/munge /var/lib/munge /run/munge
sudo chmod 0700 /etc/munge /var/log/munge /var/lib/munge
sudo chmod 0755 /run/munge
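MUNGE refuses to start when the key or its directories are too permissive, so it can help to confirm the modes set above before starting the service. The following is a minimal sketch, assuming GNU `stat` (the default on Ubuntu); `check_mode` is a hypothetical helper name, not part of MUNGE.

```bash
# check_mode PATH EXPECTED_MODE
# Succeeds when PATH has exactly the expected octal permission bits.
# Assumes GNU stat, which understands -c '%a'.
check_mode() {
    local path=$1 expected=$2 actual
    actual=$(stat -c '%a' "$path") || return 2
    if [ "$actual" != "$expected" ]; then
        echo "$path: mode $actual, expected $expected" >&2
        return 1
    fi
}

# Example (run as root, because /etc/munge is not world-readable):
#   check_mode /etc/munge/munge.key 400
#   check_mode /etc/munge 700
#   check_mode /run/munge 755
```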
Start and check munge
sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl status munge
munge -n | unmunge
The result should look like this:
STATUS: Success (0)
ENCODE_HOST: user-G4101 (127.0.1.1)
ENCODE_TIME: 2025-08-25 15:05:03 -0700 (1756159503)
DECODE_TIME: 2025-08-25 15:05:03 -0700 (1756159503)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: subho (1002)
GID: subho (1002)
LENGTH: 0
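When this check is run from a provisioning script, reading the output by eye is not practical. A small sketch of how the `STATUS` line could be tested automatically; `munge_ok` is a hypothetical helper name and only looks at the first field shown above.

```bash
# munge_ok: read unmunge output on stdin and succeed only if the
# STATUS line reports "Success (0)".
munge_ok() {
    grep -Eq '^STATUS:[[:space:]]+Success \(0\)'
}

# Usage on a host where munged is running:
#   munge -n | unmunge | munge_ok && echo "munge credential round-trip OK"
```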
Install Ubuntu's packaged Slurm (`slurm-wlm`) with the command below. It is the easiest way to install, but the packaged version is old and some GPU features often fail; for advanced features, compile Slurm from source.
sudo apt install -y slurm-wlm
Commands
- Show fairshare usage for various queues and partitions
sacctmgr show qos
- Show a user's fairshare score
sshare -l | grep usernameOrGrpName
- To obtain basic partition info
sinfo -p sulcgpu3 -o "%P %a %l %D %t %N"
or
sinfo -Nel | grep "sulcgpu3"
- For information on a particular node
scontrol show node nodeName
- To obtain queue info use
squeue -p partitionName
or
squeue -u userName
- To submit an empty job
sbatch -n 20 --gres=gpu:1 --time=4:00:00 -q sulcgpu1 -p sulcgpu1 --output=/dev/null --error=/dev/null $@ _interactiveScript
where _interactiveScript can be a very simple script, for example one that just starts a detached screen session:
#!/bin/sh
# Simple batch script that starts SCREEN.
exec screen -Dm -S slurm$SLURM_JOB_ID
- Empty jobs can also be submitted using a Slurm batch script, as given below
#!/bin/sh
#SBATCH -q private
#SBATCH -p general
#SBATCH -G a100:1
#SBATCH -t 5-00:00
#SBATCH -c 20
#SBATCH -o empty.out
#SBATCH -e empty.err
#SBATCH -J PHBdnaD
exec screen -Dm -S slurm$SLURM_JOB_ID
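The `$@` in the `sbatch` one-liner earlier in this list suggests that it lives inside a wrapper, so that extra options can be passed through at submission time. One way such a wrapper might look is sketched below; the function name `empty_job` and the `DRY_RUN` switch are assumptions added for illustration, while the options themselves are copied from the command above.

```bash
# empty_job [extra sbatch options...]
# Builds the sbatch command from the one-liner above and forwards any
# extra options given by the caller in place of "$@".
# DRY_RUN is a hypothetical switch: when set, the command is printed
# instead of being submitted.
empty_job() {
    set -- -n 20 --gres=gpu:1 --time=4:00:00 \
        -q sulcgpu1 -p sulcgpu1 \
        --output=/dev/null --error=/dev/null \
        "$@" _interactiveScript
    if [ -n "${DRY_RUN:-}" ]; then
        printf 'sbatch %s\n' "$*"
    else
        sbatch "$@"
    fi
}

# Preview the command without submitting anything:
#   DRY_RUN=1 empty_job --job-name=hold
```

The QOS and partition names are site-specific, so they would need to be changed to ones that exist on your cluster.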
To enter the submitted job, use `srun --pty --jobid=######## /bin/bash`, or simply `ssh nodeid` to log in to the node the job is running on.
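The job ID for that `srun` call can be looked up by job name instead of being copied by hand. A minimal sketch, assuming the job was submitted with the name `PHBdnaD` as in the batch script above; `job_ids_by_name` is a hypothetical helper, and the `squeue` format codes `%i` (job ID) and `%j` (job name) are what it expects on stdin.

```bash
# job_ids_by_name NAME
# Reads "JOBID NAME" pairs on stdin and prints the IDs whose job name
# matches NAME exactly, one per line.
job_ids_by_name() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Usage:
#   jobid=$(squeue -h -u "$USER" -o '%i %j' | job_ids_by_name PHBdnaD | head -n 1)
#   srun --pty --jobid="$jobid" /bin/bash
```

On newer Slurm releases the `srun` call may also need `--overlap` to share resources with the step that is already running; that depends on the installed version, so treat it as something to check rather than a given.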
Setting up slurmdbd
- Install mariadb
sudo apt-get install mariadb-server
sudo systemctl enable mariadb
sudo systemctl start mariadb
- Generate a database to use with Slurm by opening a MariaDB shell and running the SQL statements at its prompt
```bash
sudo mysql -u root -p
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurmuser'@'localhost' IDENTIFIED BY 'slurmPass';
GRANT ALL ON slurm_acct_db.* TO 'slurmuser'@'localhost';
FLUSH PRIVILEGES;
EXIT;
```
- The default port for MariaDB is 3306; use `sudo ss -tulpn` to confirm that it is listening there.
- Change the following parameters in the MariaDB server config, `sudo nvim /etc/mysql/mariadb.conf.d/50-server.cnf`
```bash
innodb_buffer_pool_size = 1G
innodb_lock_wait_timeout = 300
```
- Restart MariaDB and check its status
sudo systemctl restart mariadb; sudo systemctl status mariadb
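Before moving on to slurmdbd, it is worth confirming that MariaDB is actually listening and that the new database user can log in. The sketch below scans `ss` output for a listening port; `port_listening` is a hypothetical helper name, and the database user is the `slurmuser` created above.

```bash
# port_listening PORT
# Reads `ss -tln` or `ss -tulpn` style output on stdin and succeeds if
# some socket is in LISTEN state on the given local port.
port_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ (":" port "$")) found = 1
        }
        END { exit !found }
    '
}

# Usage:
#   sudo ss -tulpn | port_listening 3306 && echo "MariaDB is listening on 3306"
#   mysql -u slurmuser -p -e 'SHOW DATABASES;'   # should list slurm_acct_db
```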
- A sample slurmdbd.conf at `/etc/slurm/slurmdbd.conf` is given below:
#
# Sample /etc/slurmdbd.conf
#
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=no
ArchiveSuspend=no
ArchiveTXN=no
ArchiveUsage=no
#ArchiveScript=/usr/sbin/slurm.dbd.archive
AuthInfo=/var/run/munge/munge.socket.2
AuthType=auth/munge
DbdHost=localhost
DbdPort=6819
DebugLevel=info
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month
PurgeTXNAfter=12month
PurgeUsageAfter=24month
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
SlurmUser=slurm
StoragePass=slurmPass
StorageType=accounting_storage/mysql
StorageUser=slurmuser
StoragePort=3306
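The `StorageUser` and `StoragePass` values have to match the MariaDB user created earlier, and because the file holds a password it should be readable only by the `slurm` user. A rough sketch for checking both; `conf_get` is a hypothetical helper that matches keys case-sensitively (Slurm itself is more lenient), and the commented commands assume the slurmdbd daemon is installed as a systemd service named `slurmdbd`.

```bash
# conf_get KEY FILE
# Prints the value of KEY from a Key=Value style config file,
# skipping comment lines. Only the first match is printed.
conf_get() {
    awk -F= -v key="$1" '
        /^[ \t]*#/ { next }
        $1 == key  { sub(/^[^=]*=/, ""); print; exit }
    ' "$2"
}

# Usage:
#   conf_get StorageUser /etc/slurm/slurmdbd.conf    # expect: slurmuser
#   sudo chown slurm: /etc/slurm/slurmdbd.conf
#   sudo chmod 600 /etc/slurm/slurmdbd.conf
#   sudo systemctl enable --now slurmdbd
```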
- A sample slurm.conf at `/etc/slurm/slurm.conf` is given below:
ClusterName=ringtail
SlurmctldHost=ringtail
MailProg=/usr/bin/mail
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_tres
AccountingStorageEnforce=limits,qos,safe
AccountingStorageHost=localhost
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
JobCompLoc=/var/log/slurm/job_completions.log
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
# COMPUTE NODES
GresTypes=gpu,mps
NodeName=gpu CPUs=128 Sockets=1 CoresPerSocket=128 ThreadsPerCore=1 Gres=gpu:rtx4060:1,gpu:rtx4080:1,mps:200 NodeHostname=ringtail NodeAddr=ringtail State=idle RealMemory=96092
PartitionName=general Nodes=ALL Default=YES MaxTime=INFINITE State=UP
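The hardware values on the `NodeName` line should agree with what slurmd detects on the machine, which `slurmd -C` prints in the same `Key=Value` form. The sketch below pulls a single attribute out of such a line so the two can be compared; `node_attr` is a hypothetical helper name.

```bash
# node_attr KEY
# Reads a space-separated line of Key=Value pairs on stdin (such as a
# NodeName line) and prints the value of the first matching KEY.
node_attr() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print $2; exit }'
}

# Usage: compare the configured memory with what slurmd detects.
#   grep '^NodeName=' /etc/slurm/slurm.conf | node_attr RealMemory
#   slurmd -C | node_attr RealMemory
```

If the configured `RealMemory` or CPU counts are higher than the detected ones, the node is likely to be marked invalid or drained, so the configured values should not exceed what `slurmd -C` reports.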
- Add an account for the cluster
sudo sacctmgr add account generic cluster=ringtail description="Generic Account" organization="SulcLab"
- List the accounts
sacctmgr list account
- Add a user to the account
sudo sacctmgr add user subho account=generic cluster=ringtail
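To confirm that the user ended up associated with the account on the right cluster, the association table can be queried in parsable form and checked for the expected row. A minimal sketch; `has_assoc` is a hypothetical helper, and it assumes the pipe-separated output produced by `sacctmgr -n -P`.

```bash
# has_assoc CLUSTER ACCOUNT USER
# Reads "cluster|account|user" rows on stdin and succeeds if the exact
# triple is present.
has_assoc() {
    grep -Fxq "$1|$2|$3"
}

# Usage:
#   sacctmgr -n -P show assoc format=cluster,account,user \
#       | has_assoc ringtail generic subho && echo "association exists"
```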
Conclusion
SLURM is one of the best open-source workload managers and is widely used on modern servers and compute clusters.