###### tags: `Taiwania 3` `HPC` `slurm` `MPI`
Revision history:
2021/01/14 Created
2021/01/27 Added Hybrid MPI/OpenMP example
2021/01/28 Fixed the I_MPI_FABRICS variable in the IntelMPI example
2021/02/09 Added GPU job script
# SLURM Usage Guide and Common MPI Examples
## 1. Common Commands and Options
* __sinfo__
Shows node and partition (queue) status; the partition marked with "*" is the default partition.
```
[spiraea7@lgn301 jobs]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu down infinite 900 idle cpn[3001-3900]
test* up infinite 64 idle cpn[3001-3064]
bgm down infinite 4 idle bgm[3001-3004]
gpu up infinite 12 idle gpn[3001-3012]
```
* -l
```
[spiraea7@lgn301 jobs]$ sinfo -l
Tue Jan 12 19:29:51 2021
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
cpu down infinite 1-infinite no NO all 900 idle cpn[3001-3900]
test* up infinite 1-infinite no NO all 64 idle cpn[3001-3064]
bgm down infinite 1-infinite no NO all 4 idle bgm[3001-3004]
gpu up infinite 1-infinite no NO all 12 idle gpn[3001-3012]
```
* -N shows status per node
```
[spiraea7@lgn301 jobs]$ sinfo -N -l
Tue Jan 12 19:34:41 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
bgm3001 1 bgm idle 112 4:28:1 619111 0 1 (null) none
bgm3002 1 bgm idle 112 4:28:1 619111 0 1 (null) none
bgm3003 1 bgm idle 112 4:28:1 619111 0 1 (null) none
bgm3004 1 bgm idle 112 4:28:1 619111 0 1 (null) none
cpn3001 1 test* idle 56 2:28:1 191880 0 1 (null) none
cpn3001 1 cpu idle 56 2:28:1 191880 0 1 (null) none
cpn3002 1 test* idle 56 2:28:1 191880 0 1 (null) none
cpn3002 1 cpu idle 56 2:28:1 191880 0 1 (null) none
cpn3003 1 test* idle 56 2:28:1 191880 0 1 (null) none
....(truncated)
```
* __squeue__
Shows job status; in the ST column, PD means Pending and R means Running.
```
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1334 test test spiraea7 PD 0:00 N/A 6 n/a
1335 test test spiraea7 PD 0:00 N/A 6 n/a
1336 test test spiraea7 PD 0:00 N/A 6 n/a
1337 test test spiraea7 PD 0:00 N/A 6 n/a
1338 test test spiraea7 PD 0:00 N/A 6 n/a
1339 test test spiraea7 PD 0:00 N/A 6 n/a
1340 test test spiraea7 PD 0:00 N/A 6 n/a
1341 test test spiraea7 PD 0:00 N/A 6 n/a
1324 test test spiraea7 R 2:03 N/A 6 cpn3001
1325 test test spiraea7 R 2:00 N/A 6 cpn3007
1326 test test spiraea7 R 2:00 N/A 6 cpn3013
```
* __scancel__
Cancels or suspends a job.
* -i interactive mode
```
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1356 test hello_wo spiraea7 R 6:14 N/A 3 cpn3001
1358 test hello_wo spiraea7 R 6:07 N/A 3 cpn3010
1360 test hello_wo spiraea7 R 6:04 N/A 3 cpn3019
1362 test hello_wo spiraea7 R 6:00 N/A 3 cpn3028
1365 test hello_wo spiraea7 R 5:38 N/A 3 cpn3061
1366 test test spiraea7 R 0:37 N/A 24 cpn3004
1367 test hello_wo spiraea7 R 0:37 N/A 3 cpn3039
[spiraea7@lgn301 jobs]$ scancel -i 1366
Cancel job_id=1366 name=test partition=test [y/n]? y
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1356 test hello_wo spiraea7 R 6:39 N/A 3 cpn3001
1358 test hello_wo spiraea7 R 6:32 N/A 3 cpn3010
1360 test hello_wo spiraea7 R 6:29 N/A 3 cpn3019
1362 test hello_wo spiraea7 R 6:25 N/A 3 cpn3028
1365 test hello_wo spiraea7 R 6:03 N/A 3 cpn3061
1367 test hello_wo spiraea7 R 1:02 N/A 3 cpn3039
```
* -u specify the user
```
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1356 test hello_wo spiraea7 R 9:15 N/A 3 cpn3001
1358 test hello_wo spiraea7 R 9:08 N/A 3 cpn3010
1360 test hello_wo spiraea7 R 9:05 N/A 3 cpn3019
1362 test hello_wo spiraea7 R 9:01 N/A 3 cpn3028
1365 test hello_wo spiraea7 R 8:39 N/A 3 cpn3061
1367 test hello_wo spiraea7 R 3:38 N/A 3 cpn3039
[spiraea7@lgn301 jobs]$ scancel -u spiraea7
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
[spiraea7@lgn301 jobs]$
```
* __sbatch__
Submits a batch job to the system. For the available options, see the man page or the job script examples below.
```
[spiraea7@lgn301 jobs]$ ls -la
total 35
drwxr-xr-x 2 spiraea7 TRI107056 16384 Jan 12 21:27 .
drwxr-xr-x 4 spiraea7 TRI107056 4096 Jan 12 18:05 ..
-rw-r--r-- 1 spiraea7 TRI107056 809 Jan 12 21:11 intel_ior.sh
-rw-r--r-- 1 spiraea7 TRI107056 773 Jan 12 21:14 intel.sh
-rw-r--r-- 1 spiraea7 TRI107056 610 Jan 12 21:14 openmpi.sh
[spiraea7@lgn301 jobs]$ sbatch openmpi.sh
Submitted batch job 1368
```
* __scontrol__
Mainly used to view and modify slurm configuration; it can also show detailed information about a job, such as the pending reason, dependencies, and so on.
```
[spiraea7@lgn301 jobs]$ scontrol show job 1383
JobId=1383 JobName=test
UserId=spiraea7(10102) GroupId=TRI107056(3094) MCS_label=N/A
Priority=4294901665 Nice=0 Account=ent108161 QOS=alpha_test
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2021-01-12T21:32:46 EligibleTime=2021-01-12T21:32:46
AccrueTime=2021-01-12T21:32:46
StartTime=2021-01-12T21:56:00 EndTime=2021-01-12T22:01:00 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-01-12T21:35:12
Partition=test AllocNode:Sid=lgn301:286156
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cpn[3001-3024]
NumNodes=24-24 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,node=24,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/spiraea7/test/jobs/openmpi.sh
WorkDir=/home/spiraea7/test/jobs
StdErr=/home/spiraea7/test/jobs/1383.log
StdIn=/dev/null
StdOut=/home/spiraea7/test/jobs/1383.log
Power=
MailUser=(null) MailType=NONE
```
* __sacct__
Shows job accounting information: execution state and resource usage.
* With no options
```
[spiraea7@lgn301 jobs]$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1291 test cpu ent108161 24 CANCELLED+ 0:0
1292 test test ent108161 672 FAILED 1:0
1292.batch batch ent108161 56 FAILED 1:0
1293 test test ent108161 672 COMPLETED 0:0
1293.batch batch ent108161 56 COMPLETED 0:0
1294 test test ent108161 672 COMPLETED 0:0
1294.batch batch ent108161 56 COMPLETED 0:0
1295 test test ent108161 112 FAILED 1:0
1295.batch batch ent108161 56 FAILED 1:0
1295.0 hello ent108161 24 FAILED 1:0
1296 test test ent108161 112 FAILED 1:0
1296.batch batch ent108161 56 FAILED 1:0
1297 test test ent108161 112 COMPLETED 0:0
1297.batch batch ent108161 56 COMPLETED 0:0
1297.0 orted ent108161 1 COMPLETED 0:0
1298 test test ent108161 1344 COMPLETED 0:0
```
* -A, --account=$account_list
Restricts the output to the given accounts; on this system the account is the project ID, e.g. ENT108161.
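A minimal usage sketch (the project ID below is the one used elsewhere in this guide; substitute your own and adjust the start time as needed):
```
sacct -A ENT108161 -S 2021-01-01
```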
* -e, --helpformat
Lists all the fields that can be queried:
```
Account AdminComment AllocCPUS AllocNodes
AllocTRES AssocID AveCPU AveCPUFreq
AveDiskRead AveDiskWrite AvePages AveRSS
AveVMSize BlockID Cluster Comment
Constraints ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DBIndex DerivedExitCode Elapsed
ElapsedRaw Eligible End ExitCode
Flags GID Group JobID
JobIDRaw JobName Layout MaxDiskRead
MaxDiskReadNode MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode
MaxDiskWriteTask MaxPages MaxPagesNode MaxPagesTask
MaxRSS MaxRSSNode MaxRSSTask MaxVMSize
MaxVMSizeNode MaxVMSizeTask McsLabel MinCPU
MinCPUNode MinCPUTask NCPUS NNodes
NodeList NTasks Priority Partition
QOS QOSRAW Reason ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqCPUS
ReqMem ReqNodes ReqTRES Reservation
ReservationId Reserved ResvCPU ResvCPURAW
Start State Submit Suspended
SystemCPU SystemComment Timelimit TimelimitRaw
TotalCPU TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode
TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask
TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode
TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask
TRESUsageOutTot UID User UserCPU
WCKey WCKeyID WorkDir
```
See man sacct for the meaning of each field.
* -S, --starttime
* -E, --endtime=end_time
The accepted time formats are:
```
HH:MM[:SS][AM|PM]
MMDD[YY][-HH:MM[:SS]]
MM.DD[.YY][-HH:MM[:SS]]
MM/DD[/YY][-HH:MM[:SS]]
YYYY-MM-DD[THH:MM[:SS]]
```
For example, 2021-01-02T13:06:00 means 1:06 PM on January 2, 2021.
```
[spiraea7@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1368 test test ent108161 1344 TIMEOUT 0:0
1368.batch batch ent108161 56 CANCELLED 0:15
1368.0 orted ent108161 23 COMPLETED 0:0
1369 test test ent108161 1344 TIMEOUT 0:0
1369.batch batch ent108161 56 CANCELLED 0:15
1369.0 orted ent108161 23 COMPLETED 0:0
1370 hello_wor+ test ent108161 168 COMPLETED 0:0
1370.batch batch ent108161 56 COMPLETED 0:0
1370.0 pmi_proxy ent108161 3 COMPLETED 0:0
1371 test test ent108161 1344 TIMEOUT 0:0
1371.batch batch ent108161 56 CANCELLED 0:15
1371.0 orted ent108161 23 COMPLETED 0:0
1372 hello_wor+ test ent108161 168 COMPLETED 0:0
```
* -o, --format
Customizes the output fields. Field names are those listed by the -e option and are case-insensitive. You can also set the SACCT_FORMAT environment variable in advance so the fields do not need to be given on the command line, for example:
```
[spiraea7@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50 --format=jobid,user,NCPUS,Submit,Start,End,State,Elapsed,ElapsedRaw,account
JobID User NCPUS Submit Start End State Elapsed ElapsedRaw Account
------------ --------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ---------- ----------
1368 spiraea7 1344 2021-01-12T21:31:09 2021-01-12T21:31:10 2021-01-12T21:36:36 TIMEOUT 00:05:26 326 ent108161
1368.batch 56 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:36:36 CANCELLED 00:05:26 326 ent108161
1368.0 23 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:31:10 COMPLETED 00:00:00 0 ent108161
1369 spiraea7 1344 2021-01-12T21:32:11 2021-01-12T21:32:12 2021-01-12T21:37:36 TIMEOUT 00:05:24 324 ent108161
1369.batch 56 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:37:36 CANCELLED 00:05:24 324 ent108161
1369.0 23 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:32:12 COMPLETED 00:00:00 0 ent108161
1370 spiraea7 168 2021-01-12T21:32:19 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.batch 56 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.0 3 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:32:21 COMPLETED 00:00:01 1 ent108161
1371 spiraea7 1344 2021-01-12T21:32:22 2021-01-12T21:36:36 2021-01-12T21:41:36 TIMEOUT 00:05:00 300 ent108161
1371.batch 56 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:41:36 CANCELLED 00:05:00 300 ent108161
1371.0 23 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:36:36 COMPLETED 00:00:00 0 ent108161
1372 spiraea7 168 2021-01-12T21:32:25 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.batch 56 2021-01-12T21:32:41 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.0 3 2021-01-12T21:32:43 2021-01-12T21:32:43 2021-01-12T21:32:43 COMPLETED 00:00:00 0 ent108161
```
```
[spiraea7@lgn301 jobs]$ export SACCT_FORMAT=jobid,user,NCPUS,Submit,Start,End,State,Elapsed,ElapsedRaw,account
[spiraea7@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50
JobID User NCPUS Submit Start End State Elapsed ElapsedRaw Account
------------ --------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ---------- ----------
1368 spiraea7 1344 2021-01-12T21:31:09 2021-01-12T21:31:10 2021-01-12T21:36:36 TIMEOUT 00:05:26 326 ent108161
1368.batch 56 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:36:36 CANCELLED 00:05:26 326 ent108161
1368.0 23 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:31:10 COMPLETED 00:00:00 0 ent108161
1369 spiraea7 1344 2021-01-12T21:32:11 2021-01-12T21:32:12 2021-01-12T21:37:36 TIMEOUT 00:05:24 324 ent108161
1369.batch 56 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:37:36 CANCELLED 00:05:24 324 ent108161
1369.0 23 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:32:12 COMPLETED 00:00:00 0 ent108161
1370 spiraea7 168 2021-01-12T21:32:19 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.batch 56 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.0 3 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:32:21 COMPLETED 00:00:01 1 ent108161
1371 spiraea7 1344 2021-01-12T21:32:22 2021-01-12T21:36:36 2021-01-12T21:41:36 TIMEOUT 00:05:00 300 ent108161
1371.batch 56 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:41:36 CANCELLED 00:05:00 300 ent108161
1371.0 23 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:36:36 COMPLETED 00:00:00 0 ent108161
1372 spiraea7 168 2021-01-12T21:32:25 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.batch 56 2021-01-12T21:32:41 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.0 3 2021-01-12T21:32:43 2021-01-12T21:32:43 2021-01-12T21:32:43 COMPLETED 00:00:00 0 ent108161
```
* __salloc__
Requests the specified compute resources from the system, which reserves them according to the request. salloc is usually used together with srun; its options are the same as sbatch's (see man salloc).
* -A, --account= required
* -N, --nodes= number of nodes
* -p, --partition= partition name
Once salloc has been granted the requested resources, accounting starts based on the resources actually allocated, and it stops only when the work is finished and you exit the salloc session.
* __srun__
Runs interactive jobs, or launches tasks from within a slurm job script; its options are similar to those of sbatch and salloc (see man srun for details).
```
[spiraea7@lgn301 ~]$ salloc -A ENT108161 -p test -N 2
salloc: Granted job allocation 1398
[spiraea7@lgn301 ~]$ srun uname -a
Linux cpn3003.nchc-2020-hpc 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux cpn3002.nchc-2020-hpc 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[spiraea7@lgn301 ~]$ module load intel/2020
[spiraea7@lgn301 ~]$ mpirun -n 8 ~/bin/intel-hello
Hello, world, I am 0 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 2 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 6 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 4 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 5 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 3 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 7 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 1 of 8, cpn3002.nchc-2020-hpc
```
```
[spiraea7@lgn301 ~]$ salloc -A ENT108161 -p test -n 8 -N 4
salloc: Granted job allocation 1400
[spiraea7@lgn301 ~]$ module load intel/2020
Message from TWCC HPC admin
-----------------------
loading Intel(R) Parallel Studio XE
Cluster Edition 2020 Update 1 for Linux*
setting I_MPI_OFI_PROVIDER=mlx (with ucx 1.8)
-----------------------
[spiraea7@lgn301 ~]$ export I_MPI_FABRICS=shm:ofi
[spiraea7@lgn301 ~]$ export UCX_TLS=rc,ud,sm,self
[spiraea7@lgn301 ~]$ mpiexec.hydra -bootstrap slurm -n 8 ~/bin/intel-hello
Hello, world, I am 4 of 8, cpn3004.nchc-2020-hpc
Hello, world, I am 5 of 8, cpn3004.nchc-2020-hpc
Hello, world, I am 6 of 8, cpn3005.nchc-2020-hpc
Hello, world, I am 7 of 8, cpn3005.nchc-2020-hpc
Hello, world, I am 3 of 8, cpn3003.nchc-2020-hpc
Hello, world, I am 2 of 8, cpn3003.nchc-2020-hpc
Hello, world, I am 0 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 1 of 8, cpn3002.nchc-2020-hpc
[spiraea7@lgn301 ~]$ exit
exit
salloc: Relinquishing job allocation 1400
[spiraea7@lgn301 ~]
```
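As mentioned above, srun can also be used inside a batch script submitted with sbatch. A minimal non-MPI sketch, reusing the account, partition, and options shown elsewhere in this guide:
```
#!/bin/bash
#SBATCH -A ENT108161   # Account name/project number
#SBATCH -J srun_demo   # Job name
#SBATCH -p test        # Partition name
#SBATCH -N 2           # Two nodes
#SBATCH -n 2           # Two tasks in total
#SBATCH -t 00:05:00    # Wall time limit
#SBATCH -o %j.log      # Path to the standard output file
srun uname -a          # srun runs inside the allocation created by sbatch
```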
---
## 2. Common MPI Examples
* IntelMPI
* OpenMPI
* Hybrid MPI/OpenMP
* GPU jobs
---
* __IntelMPI__
1. Set up the environment ([module usage](https://man.twcc.ai/@twnia3/BJxhYk2CD))
```
[user@lgn301 jobs]$ module load compiler/intel/2020u4 IntelMPI/2020
```
2. Compile the program
```
[user@lgn301 test]$ which mpiicc
/opt/ohpc/Taiwania3/pkg/intel/2020/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpiicc
[user@lgn301 test]$ mpiicc -o ../../bin/intel-hello ./hello.c
```
3. Write the job script
```
# Simple version
#!/bin/bash
#SBATCH -A ENT108161 # Account name/project number
#SBATCH -J hello_world # Job name
#SBATCH -p test # Partition name
#SBATCH -n 24 # Number of MPI tasks (i.e. processes)
#SBATCH -c 1 # Number of cores per MPI task
#SBATCH -N 3 # Maximum number of nodes to be allocated
#SBATCH -o %j.out # Path to the standard output file
#SBATCH -e %j.err # Path to the standard error output file
module load compiler/intel/2020u4 IntelMPI/2020
mpiexec.hydra -bootstrap slurm -n 24 /home/user/bin/intel-hello
```
*If no standard error file is specified, stderr is written to the same file as stdout.*
```
# Detailed version
#!/bin/bash
#SBATCH --account=ENT108161 # (-A) Account/project number
#SBATCH --job-name=hello_world # (-J) Job name
#SBATCH --partition=test # (-p) Specific slurm partition
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=user@mybox.mail # Where to send mail. Set this to your email address
#SBATCH --ntasks=24 # (-n) Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=1 # (-c) Number of cores per MPI task
#SBATCH --nodes=2 # (-N) Maximum number of nodes to be allocated
#SBATCH --ntasks-per-node=12 # Maximum number of tasks on each node
#SBATCH --ntasks-per-socket=6 # Maximum number of tasks on each socket
#SBATCH --distribution=cyclic:cyclic # (-m) Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --mem-per-cpu=600mb # Memory (i.e. RAM) per processor
#SBATCH --time=00:05:00 # (-t) Wall time limit (days-hrs:min:sec)
#SBATCH --output=%j.log # (-o) Path to the standard output file relative to the working directory
#SBATCH --error=%j.err # (-e) Path to the standard error output
#SBATCH --nodelist=cpn[3001-3002] # (-w) specific list of nodes
module load compiler/intel/2020u4 IntelMPI/2020
mpiexec.hydra -bootstrap slurm -n 24 /home/user/bin/intel-hello
```
4. Submit the job
```
[spiraea7@lgn301 jobs]$ sbatch intel.sh
Submitted batch job 1302
```
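After the job has run, you can check how it was allocated with sacct; a sketch using the job ID returned above and fields taken from the sacct field list in section 1:
```
sacct -j 1302 --format=JobID,JobName,NNodes,NTasks,NodeList,State,Elapsed
```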
* __OpenMPI__
This example uses the OpenMPI installed along with the Mellanox driver.
1. List the available MPI libraries
```
[user@tchead job]$ mpi-selector --list
openmpi-4.0.3rc4
```
2. Select the MPI library
```
[user@tchead job]$ mpi-selector --set openmpi-4.0.3rc4
```
3. Log out and log back in
```
[user@tchead ~]$ which mpirun
/usr/mpi/gcc/openmpi-4.0.3rc4/bin/mpirun
```
4. Compile the program
```
[user@tchead user]$ mpicc -o hello_world ./src/hello_c.c
```
5. Write the job script
Because this slurm build was compiled without pmix support, launch the program with mpirun.
```
#!/bin/bash
#SBATCH -A ENT108161 # Account name/project number
#SBATCH -J test # Job name
#SBATCH -p test
#SBATCH -n 24 # Number of MPI tasks (i.e. processes)
#SBATCH -c 1 # Number of cores per MPI task
#SBATCH -N 6 # Maximum number of nodes to be allocated
#SBATCH -t 05:00 # Wall time limit (days-hrs:min:sec)
#SBATCH -o %j.log # Path to the standard output and error files relative to the working directory
mpirun -n 24 /home/user/bin/hello ## or
mpirun -n 24 --mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_0:1 /home/user/bin/hello
```
6. Submit the job
```
[user@lgn301 jobs]$ sbatch openmpi.sh
Submitted batch job 1300
```
* __Hybrid MPI/OpenMP example__
1. This example uses Intel MPI; [sample program](https://rcc.uchicago.edu/docs/running-jobs/hybrid/index.html)
2. Load the runtime environment
You can run ml compiler/intel/2021 first, then ml av to see IntelMPI/2021.
```
[spiraea7@lgn301 ~]$ ml compiler/intel/2021 IntelMPI/2021
```
3. Compile the program
```
[spiraea7@lgn301 test]$ mpiicc -fopenmp ./hybridhello.c -o hybrid_hello
```
4. Write the job script
```
#!/bin/bash
#SBATCH -A ENT108161 # Account name/project number
#SBATCH -J hybridMPI # Job name
#SBATCH -p test # Partition name
#SBATCH -n 16 # Number of MPI tasks (i.e. processes)
#SBATCH -c 4 # Number of cores per MPI task
#SBATCH -N 4 # Maximum number of nodes to be allocated
#SBATCH -o %j.out # Path to the standard output file
#SBATCH -e %j.err # Path to the standard error output file
module load compiler/intel/2021 IntelMPI/2021
# Set OMP_NUM_THREADS to the number of CPUs per task we asked for.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpiexec.hydra -bootstrap slurm -n 16 /home/spiraea7/bin/hybrid_hello
```
__Notes:__
* #SBATCH -n 16 runs 16 MPI processes
* #SBATCH -c 4 gives each MPI process 4 cores
* #SBATCH -N 4 spreads the job across 4 nodes. For a pure OpenMP program that does not use an MPI library to run across nodes, -N should be 1 or omitted (see the sketch after this list).
* $OMP_NUM_THREADS must be set (4 in this example); it can be taken from the \$SLURM_CPUS_PER_TASK variable.
* This example runs 16 MPI processes on 4 nodes, each using 4 cores, so 64 cores are required in total.
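A minimal sketch of the pure-OpenMP (single node, no MPI) case mentioned above, reusing the account and partition from this guide; the binary name is a placeholder:
```
#!/bin/bash
#SBATCH -A ENT108161    # Account name/project number
#SBATCH -J openmp_only  # Job name
#SBATCH -p test         # Partition name
#SBATCH -n 1            # A single task (no MPI)
#SBATCH -c 8            # Number of cores for that task
#SBATCH -o %j.out       # Path to the standard output file
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program     # placeholder for your OpenMP binary
```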
5. Submit the job
```
[spiraea7@lgn301 jobs]$ sbatch intel-hybrid.sh
Submitted batch job 2647
[spiraea7@lgn301 jobs]$ sacct -j 2647
JobID User NCPUS AllocCPUS AllocNodes AllocTRES Elapsed
------------ --------- ---------- ---------- ---------- ------------------------------ ----------
2647 spiraea7 64 64 4 billing=64,cpu=64,mem=741.50G+ 00:01:11
2647.batch 16 16 1 cpu=16,mem=189824M,node=1 00:01:11
2647.0 16 16 4 cpu=16,mem=741.50G,node=4 00:00:01
```
6. Results
```
[spiraea7@lgn301 jobs]$ less 2647.out
Hello from thread 0 out of 4 from process 2 out of 16 on cpn3024
Hello from thread 2 out of 4 from process 2 out of 16 on cpn3024
Hello from thread 3 out of 4 from process 2 out of 16 on cpn3024
Hello from thread 0 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 1 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 3 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 2 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 2 out of 4 from process 3 out of 16 on cpn3024
Hello from thread 3 out of 4 from process 3 out of 16 on cpn3024
Hello from thread 0 out of 4 from process 3 out of 16 on cpn3024
Hello from thread 1 out of 4 from process 3 out of 16 on cpn3024
...(truncated)
```
You can see that each process has 4 threads.
* __GPU job example__
1. This example uses the n-body sample code from the CUDA samples.
Path: `/opt/qct/ohpc/pub/nvidia/cuda/cuda-11.0/samples`
[Sample code](https://developer.nvidia.com/cuda-code-samples)
2. Compile the sample program with CUDA 11.0
`module load cuda/11.0`
3. Edit the corresponding parts of the Makefile in the nbody folder, e.g. the CUDA-related variable around line 37
`CUDA_PATH ?= /opt/qct/ohpc/pub/nvidia/cuda/cuda-11.0`
4. Compile the program
`make clean && make`
5. Prepare the job script
```
#!/bin/bash
#SBATCH --job-name nbody_test # Job name
#SBATCH --output %x-%j.out # Name of stdout output file (%x expands to jobname, %j expands to jobId)
#SBATCH --nodes=1 #Controls the number of nodes allocated to the job
#SBATCH --cpus-per-task=1 #Controls the number of CPUs allocated per task
#SBATCH --gres=gpu:1
#SBATCH --partition gpu
#SBATCH --account GOV109199
module purge
module load nvidia/cuda/11.0
echo "Your nbody job starts at `date`"
./nbody -benchmark -fp64 -numbodies=200000
echo "Your nbody job completed at `date` "
```
__Notes:__
* #SBATCH --gres=gpu:1
In slurm, GPUs are treated as a generic resource (GRES) and are requested with the --gres option. The resource name is "gpu"; the number after the colon specifies how many of that resource to request. This example requests one GPU; a sketch of requesting more than one follows below.
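A minimal sketch of a script requesting two GPUs instead of one, under the same partition and account as the example above (whether a single node actually provides that many GPUs depends on the gpu partition's hardware):
```
#!/bin/bash
#SBATCH --job-name gpu2_test       # Job name
#SBATCH --output %x-%j.out         # Name of stdout output file
#SBATCH --nodes=1                  # One node
#SBATCH --cpus-per-task=1          # One CPU per task
#SBATCH --gres=gpu:2               # Request two GPUs on that node
#SBATCH --partition gpu            # GPU partition
#SBATCH --account GOV109199        # Project number from the example above
nvidia-smi                         # should list the two allocated GPUs
```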
6. Submit the job
```
20:49:42 p00acy00@lgn301:nbody$ sbatch nbody.sh
Submitted batch job 55098
20:50:08 p00acy00@lgn301:nbody$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
55098 nbody_test gpu gov109199 1 COMPLETED 0:0
55098.batch batch gov109199 1 COMPLETED 0:0
```
7. Results
`cat nbody_test-55098.out`
```
Your nbody job starts at Tue Feb 9 20:50:09 CST 2021
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Double precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Volta" with compute capability 7.0
> Compute 7.0 CUDA device: [Tesla V100-SXM2-32GB]
Warning: "number of bodies" specified 200000 is not a multiple of 256.
Rounding up to the nearest multiple: 200192.
200192 bodies, total time for 10 iterations: 2133.370 ms
= 187.857 billion interactions per second
= 5635.709 double-precision GFLOP/s at 30 flops per interaction
Your nbody job completed at Tue Feb 9 20:50:12 CST 2021
```