Environment Preparation
1. Cluster Architecture
OS | IP Address | Hostname | Role | Slurm Version |
---|---|---|---|---|
Ubuntu 20.04 | 192.168.8.194 | master | control + compute | 19.05.5 |
Ubuntu 20.04 | 192.168.8.195 | node01 | compute node | 19.05.5 |
Ubuntu 20.04 | 192.168.8.196 | node02 | compute node | 19.05.5 |
2. Hosts Setup
The following name-to-IP mappings are used throughout and must be present in /etc/hosts on all three machines:
192.168.8.194 master
192.168.8.195 node01
192.168.8.196 node02
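A minimal way to apply these mappings is to append them to /etc/hosts on each machine (run on master, node01 and node02; adjust if the file already contains these names):

# Append the cluster name-resolution entries to /etc/hosts
cat >> /etc/hosts << 'EOF'
192.168.8.194 master
192.168.8.195 node01
192.168.8.196 node02
EOF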
Munge Cluster Deployment
1. Install Slurm on all nodes (munge is pulled in as a dependency)
sudo apt install slurm-wlm slurm-wlm-doc -y
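To confirm the packages actually landed on each node, a quick package check such as the following can be used:

# List the installed slurm-wlm and munge packages
dpkg -l | grep -E 'slurm-wlm|munge'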
2. Start munge on all three machines
systemctl start munge
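It is usually also worth enabling the service so it comes back after a reboot, for example:

# Enable munge at boot (optional but recommended)
systemctl enable munge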
3. Create the key on master
root@master:~/Slurm# create-munge-key
The munge key /etc/munge/munge.key already exists
Do you want to overwrite it? (y/N) y
Generating a pseudo-random key using /dev/urandom completed.
4. Copy the key to the compute nodes
scp /etc/munge/munge.key root@node01:/etc/munge/
scp /etc/munge/munge.key root@node02:/etc/munge/
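munged expects /etc/munge/munge.key to be owned by the munge user and not be readable by anyone else; since scp as root leaves the copied file owned by root, restoring the ownership and mode on node01 and node02 is usually required, e.g.:

# On node01 and node02: restore the ownership and permissions munged expects
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key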
5. Restart munge on all three machines so the new key takes effect
systemctl restart munge
6. Test that the Munge service works correctly
root@master:~# munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: master (192.168.8.194)
ENCODE_TIME: 2024-06-06 14:55:18 +0800 (1717656918)
DECODE_TIME: 2024-06-06 14:55:18 +0800 (1717656918)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
root@master:~# munge -n | ssh node01 unmunge
root@node01's password:
STATUS: Success (0)
ENCODE_HOST: node01 (192.168.8.194)
ENCODE_TIME: 2024-06-06 14:54:36 +0800 (1717656876)
DECODE_TIME: 2024-06-06 14:54:40 +0800 (1717656880)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
root@master:~# munge -n | ssh node02 unmunge
root@node02's password:
STATUS: Success (0)
ENCODE_HOST: node02 (192.168.8.194)
ENCODE_TIME: 2024-06-06 14:55:36 +0800 (1717656936)
DECODE_TIME: 2024-06-06 14:55:40 +0800 (1717656940)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: root (0)
GID: root (0)
LENGTH: 0
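Munge credentials carry an expiry (the TTL of 300 seconds shown above), so decoding can fail if the node clocks drift apart; a quick sanity check on each machine is to compare the clock and NTP status, e.g.:

# Check system time and NTP synchronization status on every node
timedatectl status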
Deploy Slurm
1. Verify the installation
root@master:~# slurmd --version
slurm-wlm 19.05.5
2. Create the main Slurm configuration file
By default, Slurm (as packaged by Ubuntu's slurm-wlm) reads its configuration from /etc/slurm-llnl/slurm.conf.
cat > /etc/slurm-llnl/slurm.conf << EOF
ClusterName=cool
ControlMachine=master
MailProg=/usr/bin/s-nail
SlurmUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
ReturnToService=0
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
SchedulerType=sched/backfill
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
JobCompType=jobcomp/none
JobCompPass=Sec@2024...
PartitionName=gpu Nodes=node[01-02] Default=NO MaxTime=INFINITE State=UP
PartitionName=cpu Nodes=node[01-02] Default=NO MaxTime=INFINITE State=UP
PartitionName=memory Nodes=node[01-02] Default=NO MaxTime=INFINITE State=UP
NodeName=master CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=32063
NodeName=node01 CPUs=16 Boards=1 SocketsPerBoard=1 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=32020
NodeName=node02 CPUs=20 Boards=1 SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=15620
EOF
Note: the three NodeName=... lines at the end of the configuration must be generated by running slurmd -C on the corresponding machine, for example:
root@master:~# slurmd -C
NodeName=master CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=32063
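If SSH from master to the compute nodes is convenient, their NodeName lines can be collected in one sweep (a sketch; head -n 1 drops the UpTime line that slurmd -C also prints):

# Gather the hardware definitions reported by each compute node
for h in node01 node02; do
    ssh root@$h 'slurmd -C | head -n 1'
done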
3. Create the location where the slurmctld controller saves its state information and the local spool/working directory for the slurmd daemons (on all three machines; the paths must match StateSaveLocation and SlurmdSpoolDir in slurm.conf)
mkdir -p /var/spool/slurmctld /var/spool/slurmd
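The same directories are needed on every machine; one way to create them everywhere from master in a single loop:

# Create the slurmctld state directory and slurmd spool directory on every node
for h in master node01 node02; do
    ssh root@$h 'mkdir -p /var/spool/slurmctld /var/spool/slurmd'
done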
4. Copy /etc/slurm-llnl/slurm.conf to the worker nodes
scp /etc/slurm-llnl/slurm.conf root@node01:/etc/slurm-llnl/
scp /etc/slurm-llnl/slurm.conf root@node02:/etc/slurm-llnl/
5. Start the services
5.1 Start the services on master
sudo systemctl enable slurmctld --now
sudo systemctl enable slurmd --now
5.2 Start the service on node01 and node02
sudo systemctl enable slurmd --now
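To confirm the daemons started cleanly, their unit status and the log files configured above can be checked, for example:

# On master: controller and local compute daemon
systemctl status slurmctld slurmd --no-pager
tail -n 20 /var/log/slurm-llnl/slurmctld.log

# On node01 / node02: compute daemon only
systemctl status slurmd --no-pager
tail -n 20 /var/log/slurm-llnl/slurmd.log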
6. Check the Slurm partition and queue information
root@master:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu up infinite 2 idle node[01-02]
cpu up infinite 2 idle node[01-02]
memory up infinite 2 idle node[01-02]
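As a final smoke test, a trivial job can be submitted to one of the partitions defined above (cpu here; note that none of the partitions is marked Default=YES, so -p must be given explicitly):

# Run `hostname` on both compute nodes via the cpu partition
srun -p cpu -N 2 hostname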