Environment preparation

1. Cluster architecture

OS version     IP address      Hostname  Role                Slurm version
Ubuntu 20.04   192.168.8.194   master    control + compute   19.05.5
Ubuntu 20.04   192.168.8.195   node01    compute node        19.05.5
Ubuntu 20.04   192.168.8.196   node02    compute node        19.05.5

2. Hosts settings (add to /etc/hosts on all three machines)

192.168.8.194  master 
192.168.8.195  node01
192.168.8.196  node02
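
The entries above can also be added with a small helper so that re-running the setup does not duplicate lines. This is a minimal sketch, not part of the original procedure: the function name add_host_entry and its optional third argument (the hosts file to edit) are assumptions made for illustration.

```shell
# Append "IP  hostname" to a hosts file only if the hostname is not mapped yet.
# The third argument defaults to /etc/hosts; pass a scratch file to try it out.
add_host_entry() {
    local ip="$1" name="$2" file="${3:-/etc/hosts}"
    # Look for the hostname as a whole field on any non-comment line.
    if ! awk -v n="$name" '
            !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$file"; then
        printf '%s  %s\n' "$ip" "$name" >> "$file"
    fi
}

# Example (run as root on every node):
#   add_host_entry 192.168.8.194 master
#   add_host_entry 192.168.8.195 node01
#   add_host_entry 192.168.8.196 node02
```

Calling the function twice with the same hostname leaves the file unchanged the second time.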

Deploying MUNGE on the cluster

1. Install Slurm on all nodes (MUNGE is pulled in as a dependency)

sudo apt install slurm-wlm slurm-wlm-doc -y

2. Start MUNGE on all three machines

systemctl start munge

3. Create the key on master

root@master:~/Slurm# create-munge-key 
The munge key /etc/munge/munge.key already exists
Do you want to overwrite it? (y/N) y
Generating a pseudo-random key using /dev/urandom completed.

4. Copy the key to the compute nodes

scp /etc/munge/munge.key root@node01:/etc/munge/
scp /etc/munge/munge.key root@node02:/etc/munge/
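
MUNGE only authenticates across nodes when every node holds a byte-identical key, so it can be worth comparing checksums after the copy. The sketch below is one possible way to do that. The helper names file_digest and check_remote_key are hypothetical, and it assumes root SSH access to the compute nodes, as the scp commands above do.

```shell
# Print only the SHA-256 digest of a file.
file_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Compare the local key with the copy on a remote node (needs root SSH access).
check_remote_key() {
    local node="$1" key="${2:-/etc/munge/munge.key}"
    local local_sum remote_sum
    local_sum=$(file_digest "$key")
    remote_sum=$(ssh "root@$node" "sha256sum $key" | awk '{ print $1 }')
    if [ "$local_sum" = "$remote_sum" ]; then
        echo "$node: key matches"
    else
        echo "$node: key differs" >&2
        return 1
    fi
}

# Example, run on master:
#   for n in node01 node02; do check_remote_key "$n"; done
# On each compute node the key should stay private to the munge user
# (assuming the packaged default user name "munge"):
#   chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key
```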

5. Restart MUNGE on all three machines so the new key takes effect

systemctl restart munge

6. Test that the MUNGE service works

root@master:~# munge -n | unmunge
STATUS:           Success (0)
ENCODE_HOST:      master (192.168.8.194)
ENCODE_TIME:      2024-06-06 14:55:18 +0800 (1717656918)
DECODE_TIME:      2024-06-06 14:55:18 +0800 (1717656918)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

root@master:~# munge -n | ssh node01 unmunge
root@node01's password: 
STATUS:           Success (0)
ENCODE_HOST:      node01 (192.168.8.194)
ENCODE_TIME:      2024-06-06 14:54:36 +0800 (1717656876)
DECODE_TIME:      2024-06-06 14:54:40 +0800 (1717656880)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0

root@master:~# munge -n | ssh node02 unmunge
root@node02's password: 
STATUS:           Success (0)
ENCODE_HOST:      node02 (192.168.8.194)
ENCODE_TIME:      2024-06-06 14:55:36 +0800 (1717656936)
DECODE_TIME:      2024-06-06 14:55:40 +0800 (1717656940)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
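
The three manual checks above can be wrapped in a loop that only looks at the STATUS line. This is a sketch under the same assumptions as the transcripts (SSH access from master to each node). The function names munge_ok and check_munge_nodes are made up for this example.

```shell
# Succeed only if the unmunge output on stdin reports "STATUS: Success (0)".
munge_ok() {
    grep -Eq '^STATUS:[[:space:]]+Success \(0\)'
}

# Encode a credential locally and decode it on each node given as an argument.
check_munge_nodes() {
    local node
    for node in "$@"; do
        if munge -n | ssh "$node" unmunge | munge_ok; then
            echo "$node: OK"
        else
            echo "$node: FAILED" >&2
        fi
    done
}

# Example, run on master:
#   check_munge_nodes node01 node02
```

A FAILED result usually points to a key mismatch or to clock drift between the nodes, since credentials carry a limited TTL (300 seconds in the output above).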

Deploying Slurm

1. Verify that the installation succeeded

root@master:~# slurmd --version
slurm-wlm 19.05.5

2. Create the main Slurm configuration file

By default, Slurm reads its configuration from /etc/slurm-llnl/slurm.conf.

cat > /etc/slurm-llnl/slurm.conf << EOF
ClusterName=cool
ControlMachine=master
MailProg=/usr/bin/s-nail
SlurmUser=root
SlurmctldPort=6817

SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=0
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill

SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
JobCompPass=Sec@2024...

PartitionName=gpu Nodes=node[01-02] Default=NO MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=node[01-02] Default=NO MaxTime=INFINITE State=UP
PartitionName=memory Nodes=node[01-02] Default=NO MaxTime=INFINITE State=UP
NodeName=master CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=32063
NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=32020
NodeName=node02 CPUs=20 Boards=1 SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=15620
EOF

Note: the three NodeName lines at the end of the file must be taken from the output of slurmd -C, run on the corresponding machine.

root@master:~# slurmd -C
NodeName=master CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=32063
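
Before pasting a NodeName line into slurm.conf, it can help to check that its CPUs value equals Boards × SocketsPerBoard × CoresPerSocket × ThreadsPerCore, which holds for all three lines in the configuration above. The helper below is an illustrative sketch. The name node_cpu_summary is an assumption, not a Slurm tool.

```shell
# For each NodeName line on stdin, print "NAME CPUS EXPECTED", where EXPECTED is
# Boards * SocketsPerBoard * CoresPerSocket * ThreadsPerCore.
node_cpu_summary() {
    awk '/^NodeName=/ {
        # Split every KEY=VALUE field into the lookup table v.
        for (i = 1; i <= NF; i++) { split($i, kv, "="); v[kv[1]] = kv[2] }
        print v["NodeName"], v["CPUs"], \
              v["Boards"] * v["SocketsPerBoard"] * v["CoresPerSocket"] * v["ThreadsPerCore"]
    }'
}

# Examples:
#   slurmd -C | node_cpu_summary
#   grep '^NodeName=' /etc/slurm-llnl/slurm.conf | node_cpu_summary
```

For the master line shown above this prints "master 40 40". A mismatch between the second and third column suggests the line was edited by hand or copied from a different machine.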

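3. Create the directory where the slurmctld controller saves its state and the local spool and working directory of the slurmd worker daemon (on all three machines)

mkdir -p /var/spool/slurmctld /var/spool/slurmd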

4. Copy /etc/slurm-llnl/slurm.conf to the worker nodes

scp /etc/slurm-llnl/slurm.conf root@node01:/etc/slurm-llnl/
scp /etc/slurm-llnl/slurm.conf root@node02:/etc/slurm-llnl/

5. Start the services

5.1 Start the services on master

sudo systemctl enable slurmctld --now
sudo systemctl enable slurmd --now

5.2 Start the service on node01 and node02

sudo systemctl enable slurmd --now

6. View the Slurm partition information

root@master:~# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up   infinite      2   idle node[01-02]
cpu          up   infinite      2   idle node[01-02]
memory       up   infinite      2   idle node[01-02]
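
Once sinfo reports the nodes, a quick filter can flag partitions whose nodes are not usable, and a trivial job confirms that scheduling works end to end. The following is a sketch. The function name unhealthy_rows is hypothetical, and it assumes the default sinfo column layout shown above, with STATE in the fifth column.

```shell
# Print sinfo rows whose STATE column marks nodes as down, drained, draining or failed.
unhealthy_rows() {
    awk 'NR > 1 && $5 ~ /^(down|drain|drng|fail)/'
}

# Example:
#   sinfo | unhealthy_rows
# Every partition in the configuration above has Default=NO, so a test job
# has to name a partition explicitly:
#   srun -p cpu -N 2 hostname
```

With ReturnToService=0, a node that went down stays down until an administrator resumes it, for example with scontrol update NodeName=node01 State=RESUME.
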
Last modified: June 12, 2024