###### tags: `Taiwania 3` `HPC` `slurm` `MPI`
Revision history:
2021/01/14 Created
2021/01/27 Added Hybrid MPI/OpenMP example
2021/01/28 Fixed the I_MPI_FABRICS variable in the IntelMPI example
2021/02/09 Added GPU job script
# SLURM Usage Guide and Common MPI Examples
## 1. Common Commands and Options
* __sinfo__
Shows node and partition (queue) status; the partition marked with "*" is the default partition.
```
[spiraea7@lgn301 jobs]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu down infinite 900 idle cpn[3001-3900]
test* up infinite 64 idle cpn[3001-3064]
bgm down infinite 4 idle bgm[3001-3004]
gpu up infinite 12 idle gpn[3001-3012]
```
* -l
```
[spiraea7@lgn301 jobs]$ sinfo -l
Tue Jan 12 19:29:51 2021
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
cpu down infinite 1-infinite no NO all 900 idle cpn[3001-3900]
test* up infinite 1-infinite no NO all 64 idle cpn[3001-3064]
bgm down infinite 1-infinite no NO all 4 idle bgm[3001-3004]
gpu up infinite 1-infinite no NO all 12 idle gpn[3001-3012]
```
* -N shows status per node
```
[spiraea7@lgn301 jobs]$ sinfo -N -l
Tue Jan 12 19:34:41 2021
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
bgm3001 1 bgm idle 112 4:28:1 619111 0 1 (null) none
bgm3002 1 bgm idle 112 4:28:1 619111 0 1 (null) none
bgm3003 1 bgm idle 112 4:28:1 619111 0 1 (null) none
bgm3004 1 bgm idle 112 4:28:1 619111 0 1 (null) none
cpn3001 1 test* idle 56 2:28:1 191880 0 1 (null) none
cpn3001 1 cpu idle 56 2:28:1 191880 0 1 (null) none
cpn3002 1 test* idle 56 2:28:1 191880 0 1 (null) none
cpn3002 1 cpu idle 56 2:28:1 191880 0 1 (null) none
cpn3003 1 test* idle 56 2:28:1 191880 0 1 (null) none
....(truncated)
```
* __squeue__
Shows job status; in the ST column, PD means Pending and R means Running.
```
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1334 test test spiraea7 PD 0:00 N/A 6 n/a
1335 test test spiraea7 PD 0:00 N/A 6 n/a
1336 test test spiraea7 PD 0:00 N/A 6 n/a
1337 test test spiraea7 PD 0:00 N/A 6 n/a
1338 test test spiraea7 PD 0:00 N/A 6 n/a
1339 test test spiraea7 PD 0:00 N/A 6 n/a
1340 test test spiraea7 PD 0:00 N/A 6 n/a
1341 test test spiraea7 PD 0:00 N/A 6 n/a
1324 test test spiraea7 R 2:03 N/A 6 cpn3001
1325 test test spiraea7 R 2:00 N/A 6 cpn3007
1326 test test spiraea7 R 2:00 N/A 6 cpn3013
```
* __scancel__
Cancels or suspends a job.
* -i interactive mode
```
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1356 test hello_wo spiraea7 R 6:14 N/A 3 cpn3001
1358 test hello_wo spiraea7 R 6:07 N/A 3 cpn3010
1360 test hello_wo spiraea7 R 6:04 N/A 3 cpn3019
1362 test hello_wo spiraea7 R 6:00 N/A 3 cpn3028
1365 test hello_wo spiraea7 R 5:38 N/A 3 cpn3061
1366 test test spiraea7 R 0:37 N/A 24 cpn3004
1367 test hello_wo spiraea7 R 0:37 N/A 3 cpn3039
[spiraea7@lgn301 jobs]$ scancel -i 1366
Cancel job_id=1366 name=test partition=test [y/n]? y
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1356 test hello_wo spiraea7 R 6:39 N/A 3 cpn3001
1358 test hello_wo spiraea7 R 6:32 N/A 3 cpn3010
1360 test hello_wo spiraea7 R 6:29 N/A 3 cpn3019
1362 test hello_wo spiraea7 R 6:25 N/A 3 cpn3028
1365 test hello_wo spiraea7 R 6:03 N/A 3 cpn3061
1367 test hello_wo spiraea7 R 1:02 N/A 3 cpn3039
```
* -u specify the user
```
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
1356 test hello_wo spiraea7 R 9:15 N/A 3 cpn3001
1358 test hello_wo spiraea7 R 9:08 N/A 3 cpn3010
1360 test hello_wo spiraea7 R 9:05 N/A 3 cpn3019
1362 test hello_wo spiraea7 R 9:01 N/A 3 cpn3028
1365 test hello_wo spiraea7 R 8:39 N/A 3 cpn3061
1367 test hello_wo spiraea7 R 3:38 N/A 3 cpn3039
[spiraea7@lgn301 jobs]$ scancel -u spiraea7
[spiraea7@lgn301 jobs]$ squeue
JOBID PARTITION NAME USER ST TIME TRES_P NODES EXEC_HOST
[spiraea7@lgn301 jobs]$
```
* __sbatch__
Submits a batch job to the system. For the available options, see the man page or the job script examples below.
```
[spiraea7@lgn301 jobs]$ ls -la
total 35
drwxr-xr-x 2 spiraea7 TRI107056 16384 Jan 12 21:27 .
drwxr-xr-x 4 spiraea7 TRI107056 4096 Jan 12 18:05 ..
-rw-r--r-- 1 spiraea7 TRI107056 809 Jan 12 21:11 intel_ior.sh
-rw-r--r-- 1 spiraea7 TRI107056 773 Jan 12 21:14 intel.sh
-rw-r--r-- 1 spiraea7 TRI107056 610 Jan 12 21:14 openmpi.sh
[spiraea7@lgn301 jobs]$ sbatch openmpi.sh
Submitted batch job 1368
```
* __scontrol__
Mainly used to view and modify slurm configuration; it can also show detailed information about a job, such as the pending reason, dependencies, and so on.
```
[spiraea7@lgn301 jobs]$ scontrol show job 1383
JobId=1383 JobName=test
UserId=spiraea7(10102) GroupId=TRI107056(3094) MCS_label=N/A
Priority=4294901665 Nice=0 Account=ent108161 QOS=alpha_test
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=00:05:00 TimeMin=N/A
SubmitTime=2021-01-12T21:32:46 EligibleTime=2021-01-12T21:32:46
AccrueTime=2021-01-12T21:32:46
StartTime=2021-01-12T21:56:00 EndTime=2021-01-12T22:01:00 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-01-12T21:35:12
Partition=test AllocNode:Sid=lgn301:286156
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null) SchedNodeList=cpn[3001-3024]
NumNodes=24-24 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,node=24,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/spiraea7/test/jobs/openmpi.sh
WorkDir=/home/spiraea7/test/jobs
StdErr=/home/spiraea7/test/jobs/1383.log
StdIn=/dev/null
StdOut=/home/spiraea7/test/jobs/1383.log
Power=
MailUser=(null) MailType=NONE
```
* __sacct__
Shows job accounting information: execution state and resource usage.
* With no options
```
[spiraea7@lgn301 jobs]$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1291 test cpu ent108161 24 CANCELLED+ 0:0
1292 test test ent108161 672 FAILED 1:0
1292.batch batch ent108161 56 FAILED 1:0
1293 test test ent108161 672 COMPLETED 0:0
1293.batch batch ent108161 56 COMPLETED 0:0
1294 test test ent108161 672 COMPLETED 0:0
1294.batch batch ent108161 56 COMPLETED 0:0
1295 test test ent108161 112 FAILED 1:0
1295.batch batch ent108161 56 FAILED 1:0
1295.0 hello ent108161 24 FAILED 1:0
1296 test test ent108161 112 FAILED 1:0
1296.batch batch ent108161 56 FAILED 1:0
1297 test test ent108161 112 COMPLETED 0:0
1297.batch batch ent108161 56 COMPLETED 0:0
1297.0 orted ent108161 1 COMPLETED 0:0
1298 test test ent108161 1344 COMPLETED 0:0
```
* -A, --account=$account_list
Restricts the output to the given accounts; on this system the account is the project ID, e.g. ENT108161.
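A minimal usage sketch (the project ID below is the one used elsewhere in this guide; substitute your own and adjust the start time as needed):
```
sacct -A ENT108161 -S 2021-01-01
```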
* -e, --helpformat
Lists all the fields that can be queried:
```
Account AdminComment AllocCPUS AllocNodes
AllocTRES AssocID AveCPU AveCPUFreq
AveDiskRead AveDiskWrite AvePages AveRSS
AveVMSize BlockID Cluster Comment
Constraints ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DBIndex DerivedExitCode Elapsed
ElapsedRaw Eligible End ExitCode
Flags GID Group JobID
JobIDRaw JobName Layout MaxDiskRead
MaxDiskReadNode MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode
MaxDiskWriteTask MaxPages MaxPagesNode MaxPagesTask
MaxRSS MaxRSSNode MaxRSSTask MaxVMSize
MaxVMSizeNode MaxVMSizeTask McsLabel MinCPU
MinCPUNode MinCPUTask NCPUS NNodes
NodeList NTasks Priority Partition
QOS QOSRAW Reason ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqCPUS
ReqMem ReqNodes ReqTRES Reservation
ReservationId Reserved ResvCPU ResvCPURAW
Start State Submit Suspended
SystemCPU SystemComment Timelimit TimelimitRaw
TotalCPU TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode
TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask
TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode
TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask
TRESUsageOutTot UID User UserCPU
WCKey WCKeyID WorkDir
```
See man sacct for the meaning of each field.
* -S, --starttime
* -E, --endtime=end_time
The accepted time formats are:
```
HH:MM[:SS][AM|PM]
MMDD[YY][-HH:MM[:SS]]
MM.DD[.YY][-HH:MM[:SS]]
MM/DD[/YY][-HH:MM[:SS]]
YYYY-MM-DD[THH:MM[:SS]]
```
For example, 2021-01-02T13:06:00 means 1:06 PM on January 2, 2021.
```
[spiraea7@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1368 test test ent108161 1344 TIMEOUT 0:0
1368.batch batch ent108161 56 CANCELLED 0:15
1368.0 orted ent108161 23 COMPLETED 0:0
1369 test test ent108161 1344 TIMEOUT 0:0
1369.batch batch ent108161 56 CANCELLED 0:15
1369.0 orted ent108161 23 COMPLETED 0:0
1370 hello_wor+ test ent108161 168 COMPLETED 0:0
1370.batch batch ent108161 56 COMPLETED 0:0
1370.0 pmi_proxy ent108161 3 COMPLETED 0:0
1371 test test ent108161 1344 TIMEOUT 0:0
1371.batch batch ent108161 56 CANCELLED 0:15
1371.0 orted ent108161 23 COMPLETED 0:0
1372 hello_wor+ test ent108161 168 COMPLETED 0:0
```
* -o, --format
Customizes the output fields. Field names are those listed by the -e option and are case-insensitive. You can also set the SACCT_FORMAT environment variable in advance so the fields do not need to be given on the command line, for example:
```
[spiraea7@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50 --format=jobid,user,NCPUS,Submit,Start,End,State,Elapsed,ElapsedRaw,account
JobID User NCPUS Submit Start End State Elapsed ElapsedRaw Account
------------ --------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ---------- ----------
1368 spiraea7 1344 2021-01-12T21:31:09 2021-01-12T21:31:10 2021-01-12T21:36:36 TIMEOUT 00:05:26 326 ent108161
1368.batch 56 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:36:36 CANCELLED 00:05:26 326 ent108161
1368.0 23 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:31:10 COMPLETED 00:00:00 0 ent108161
1369 spiraea7 1344 2021-01-12T21:32:11 2021-01-12T21:32:12 2021-01-12T21:37:36 TIMEOUT 00:05:24 324 ent108161
1369.batch 56 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:37:36 CANCELLED 00:05:24 324 ent108161
1369.0 23 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:32:12 COMPLETED 00:00:00 0 ent108161
1370 spiraea7 168 2021-01-12T21:32:19 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.batch 56 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.0 3 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:32:21 COMPLETED 00:00:01 1 ent108161
1371 spiraea7 1344 2021-01-12T21:32:22 2021-01-12T21:36:36 2021-01-12T21:41:36 TIMEOUT 00:05:00 300 ent108161
1371.batch 56 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:41:36 CANCELLED 00:05:00 300 ent108161
1371.0 23 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:36:36 COMPLETED 00:00:00 0 ent108161
1372 spiraea7 168 2021-01-12T21:32:25 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.batch 56 2021-01-12T21:32:41 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.0 3 2021-01-12T21:32:43 2021-01-12T21:32:43 2021-01-12T21:32:43 COMPLETED 00:00:00 0 ent108161
```
```
[spiraea7@lgn301 jobs]$ export SACCT_FORMAT=jobid,user,NCPUS,Submit,Start,End,State,Elapsed,ElapsedRaw,account
[spiraea7@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50
JobID User NCPUS Submit Start End State Elapsed ElapsedRaw Account
------------ --------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ---------- ----------
1368 spiraea7 1344 2021-01-12T21:31:09 2021-01-12T21:31:10 2021-01-12T21:36:36 TIMEOUT 00:05:26 326 ent108161
1368.batch 56 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:36:36 CANCELLED 00:05:26 326 ent108161
1368.0 23 2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:31:10 COMPLETED 00:00:00 0 ent108161
1369 spiraea7 1344 2021-01-12T21:32:11 2021-01-12T21:32:12 2021-01-12T21:37:36 TIMEOUT 00:05:24 324 ent108161
1369.batch 56 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:37:36 CANCELLED 00:05:24 324 ent108161
1369.0 23 2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:32:12 COMPLETED 00:00:00 0 ent108161
1370 spiraea7 168 2021-01-12T21:32:19 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.batch 56 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED 00:20:01 1201 ent108161
1370.0 3 2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:32:21 COMPLETED 00:00:01 1 ent108161
1371 spiraea7 1344 2021-01-12T21:32:22 2021-01-12T21:36:36 2021-01-12T21:41:36 TIMEOUT 00:05:00 300 ent108161
1371.batch 56 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:41:36 CANCELLED 00:05:00 300 ent108161
1371.0 23 2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:36:36 COMPLETED 00:00:00 0 ent108161
1372 spiraea7 168 2021-01-12T21:32:25 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.batch 56 2021-01-12T21:32:41 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED 00:20:02 1202 ent108161
1372.0 3 2021-01-12T21:32:43 2021-01-12T21:32:43 2021-01-12T21:32:43 COMPLETED 00:00:00 0 ent108161
```
* __salloc__
Requests the specified compute resources from the system, which reserves them according to the request. salloc is usually used together with srun; its options are the same as sbatch's (see man salloc).
* -A, --account= required
* -N, --nodes= number of nodes
* -p, --partition= partition name
Once salloc has been granted the requested resources, accounting starts based on the resources actually allocated, and it stops only when the work is finished and you exit the salloc session.
* __srun__
Runs interactive jobs, or launches tasks from within a slurm job script; its options are similar to those of sbatch and salloc (see man srun for details).
```
[spiraea7@lgn301 ~]$ salloc -A ENT108161 -p test -N 2
salloc: Granted job allocation 1398
[spiraea7@lgn301 ~]$ srun uname -a
Linux cpn3003.nchc-2020-hpc 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Linux cpn3002.nchc-2020-hpc 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
[spiraea7@lgn301 ~]$ module load intel/2020
[spiraea7@lgn301 ~]$ mpirun -n 8 ~/bin/intel-hello
Hello, world, I am 0 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 2 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 6 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 4 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 5 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 3 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 7 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 1 of 8, cpn3002.nchc-2020-hpc
```
```
[spiraea7@lgn301 ~]$ salloc -A ENT108161 -p test -n 8 -N 4
salloc: Granted job allocation 1400
[spiraea7@lgn301 ~]$ module load intel/2020
Message from TWCC HPC admin
-----------------------
loading Intel(R) Parallel Studio XE
Cluster Edition 2020 Update 1 for Linux*
setting I_MPI_OFI_PROVIDER=mlx (with ucx 1.8)
-----------------------
[spiraea7@lgn301 ~]$ export I_MPI_FABRICS=shm:ofi
[spiraea7@lgn301 ~]$ export UCX_TLS=rc,ud,sm,self
[spiraea7@lgn301 ~]$ mpiexec.hydra -bootstrap slurm -n 8 ~/bin/intel-hello
Hello, world, I am 4 of 8, cpn3004.nchc-2020-hpc
Hello, world, I am 5 of 8, cpn3004.nchc-2020-hpc
Hello, world, I am 6 of 8, cpn3005.nchc-2020-hpc
Hello, world, I am 7 of 8, cpn3005.nchc-2020-hpc
Hello, world, I am 3 of 8, cpn3003.nchc-2020-hpc
Hello, world, I am 2 of 8, cpn3003.nchc-2020-hpc
Hello, world, I am 0 of 8, cpn3002.nchc-2020-hpc
Hello, world, I am 1 of 8, cpn3002.nchc-2020-hpc
[spiraea7@lgn301 ~]$ exit
exit
salloc: Relinquishing job allocation 1400
[spiraea7@lgn301 ~]
```
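As mentioned above, srun can also be used inside a batch script submitted with sbatch. A minimal non-MPI sketch, reusing the account, partition, and options shown elsewhere in this guide:
```
#!/bin/bash
#SBATCH -A ENT108161   # Account name/project number
#SBATCH -J srun_demo   # Job name
#SBATCH -p test        # Partition name
#SBATCH -N 2           # Two nodes
#SBATCH -n 2           # Two tasks in total
#SBATCH -t 00:05:00    # Wall time limit
#SBATCH -o %j.log      # Path to the standard output file
srun uname -a          # srun runs inside the allocation created by sbatch
```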
---
## 2. Common MPI Examples
* IntelMPI
* OpenMPI
* Hybrid MPI/OpenMP
* GPU jobs
---
* __IntelMPI__
1. Set up the environment ([module usage](https://man.twcc.ai/@twnia3/BJxhYk2CD))
```
[user@lgn301 jobs]$ module load compiler/intel/2020u4 IntelMPI/2020
```
2. Compile the program
```
[user@lgn301 test]$ which mpiicc
/opt/ohpc/Taiwania3/pkg/intel/2020/compilers_and_libraries_2020.4.304/linux/mpi/intel64/bin/mpiicc
[user@lgn301 test]$ mpiicc -o ../../bin/intel-hello ./hello.c
```
3. Write the job script
```
# Simple version
#!/bin/bash
#SBATCH -A ENT108161 # Account name/project number
#SBATCH -J hello_world # Job name
#SBATCH -p test # Partition name
#SBATCH -n 24 # Number of MPI tasks (i.e. processes)
#SBATCH -c 1 # Number of cores per MPI task
#SBATCH -N 3 # Maximum number of nodes to be allocated
#SBATCH -o %j.out # Path to the standard output file
#SBATCH -e %j.err # Path to the standard error output file
module load compiler/intel/2020u4 IntelMPI/2020
mpiexec.hydra -bootstrap slurm -n 24 /home/user/bin/intel-hello
```
*If no standard error file is specified, stderr is written to the same file as stdout.*
```
# Detailed version
#!/bin/bash
#SBATCH --account=ENT108161 # (-A) Account/project number
#SBATCH --job-name=hello_world # (-J) Job name
#SBATCH --partition=test # (-p) Specific slurm partition
#SBATCH --mail-type=END,FAIL # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=user@mybox.mail # Where to send mail. Set this to your email address
#SBATCH --ntasks=24 # (-n) Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=1 # (-c) Number of cores per MPI task
#SBATCH --nodes=2 # (-N) Maximum number of nodes to be allocated
#SBATCH --ntasks-per-node=12 # Maximum number of tasks on each node
#SBATCH --ntasks-per-socket=6 # Maximum number of tasks on each socket
#SBATCH --distribution=cyclic:cyclic # (-m) Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --mem-per-cpu=600mb # Memory (i.e. RAM) per processor
#SBATCH --time=00:05:00 # (-t) Wall time limit (days-hrs:min:sec)
#SBATCH --output=%j.log # (-o) Path to the standard output file relative to the working directory
#SBATCH --error=%j.err # (-e) Path to the standard error output
#SBATCH --nodelist=cpn[3001-3002] # (-w) specific list of nodes
module load compiler/intel/2020u4 IntelMPI/2020
mpiexec.hydra -bootstrap slurm -n 24 /home/user/bin/intel-hello
```
4. Submit the job
```
[spiraea7@lgn301 jobs]$ sbatch intel.sh
Submitted batch job 1302
```
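After the job has run, you can check how it was allocated with sacct; a sketch using the job ID returned above and fields taken from the sacct field list in section 1:
```
sacct -j 1302 --format=JobID,JobName,NNodes,NTasks,NodeList,State,Elapsed
```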
* __OpenMPI__
This example uses the OpenMPI installed along with the Mellanox driver.
1. List the available MPI libraries
```
[user@tchead job]$ mpi-selector --list
openmpi-4.0.3rc4
```
2. Select the MPI library
```
[user@tchead job]$ mpi-selector --set openmpi-4.0.3rc4
```
3. Log out and log back in
```
[user@tchead ~]$ which mpirun
/usr/mpi/gcc/openmpi-4.0.3rc4/bin/mpirun
```
4. Compile the program
```
[user@tchead user]$ mpicc -o hello_world ./src/hello_c.c
```
5. Write the job script
Because this slurm build was compiled without pmix support, launch the program with mpirun.
```
#!/bin/bash
#SBATCH -A ENT108161 # Account name/project number
#SBATCH -J test # Job name
#SBATCH -p test
#SBATCH -n 24 # Number of MPI tasks (i.e. processes)
#SBATCH -c 1 # Number of cores per MPI task
#SBATCH -N 6 # Maximum number of nodes to be allocated
#SBATCH -t 05:00 # Wall time limit (days-hrs:min:sec)
#SBATCH -o %j.log # Path to the standard output and error files relative to the working directory
mpirun -n 24 /home/user/bin/hello ## or
mpirun -n 24 --mca pml ucx --mca btl ^vader,tcp,openib,uct -x UCX_NET_DEVICES=mlx5_0:1 /home/user/bin/hello
```
6. Submit the job
```
[user@lgn301 jobs]$ sbatch openmpi.sh
Submitted batch job 1300
```
* __Hybrid MPI/OpenMP example__
1. This example uses Intel MPI; [sample program](https://rcc.uchicago.edu/docs/running-jobs/hybrid/index.html)
2. Load the runtime environment
You can run ml compiler/intel/2021 first, then ml av to see IntelMPI/2021.
```
[spiraea7@lgn301 ~]$ ml compiler/intel/2021 IntelMPI/2021
```
3. Compile the program
```
[spiraea7@lgn301 test]$ mpiicc -fopenmp ./hybridhello.c -o hybrid_hello
```
4. Write the job script
```
#!/bin/bash
#SBATCH -A ENT108161 # Account name/project number
#SBATCH -J hybridMPI # Job name
#SBATCH -p test # Partition name
#SBATCH -n 16 # Number of MPI tasks (i.e. processes)
#SBATCH -c 4 # Number of cores per MPI task
#SBATCH -N 4 # Maximum number of nodes to be allocated
#SBATCH -o %j.out # Path to the standard output file
#SBATCH -e %j.err # Path to the standard error output file
module load compiler/intel/2021 IntelMPI/2021
# Set OMP_NUM_THREADS to the number of CPUs per task we asked for.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
mpiexec.hydra -bootstrap slurm -n 16 /home/spiraea7/bin/hybrid_hello
```
__Notes:__
* #SBATCH -n 16 runs 16 MPI processes
* #SBATCH -c 4 gives each MPI process 4 cores
* #SBATCH -N 4 spreads the job across 4 nodes. For a pure OpenMP program that does not use an MPI library to run across nodes, -N should be 1 or omitted (see the sketch after this list).
* $OMP_NUM_THREADS must be set (4 in this example); it can be taken from the \$SLURM_CPUS_PER_TASK variable.
* This example runs 16 MPI processes on 4 nodes, each using 4 cores, so 64 cores are required in total.
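A minimal sketch of the pure-OpenMP (single node, no MPI) case mentioned above, reusing the account and partition from this guide; the binary name is a placeholder:
```
#!/bin/bash
#SBATCH -A ENT108161    # Account name/project number
#SBATCH -J openmp_only  # Job name
#SBATCH -p test         # Partition name
#SBATCH -n 1            # A single task (no MPI)
#SBATCH -c 8            # Number of cores for that task
#SBATCH -o %j.out       # Path to the standard output file
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_openmp_program     # placeholder for your OpenMP binary
```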
5. Submit the job
```
[spiraea7@lgn301 jobs]$ sbatch intel-hybrid.sh
Submitted batch job 2647
[spiraea7@lgn301 jobs]$ sacct -j 2647
JobID User NCPUS AllocCPUS AllocNodes AllocTRES Elapsed
------------ --------- ---------- ---------- ---------- ------------------------------ ----------
2647 spiraea7 64 64 4 billing=64,cpu=64,mem=741.50G+ 00:01:11
2647.batch 16 16 1 cpu=16,mem=189824M,node=1 00:01:11
2647.0 16 16 4 cpu=16,mem=741.50G,node=4 00:00:01
```
6. Results
```
[spiraea7@lgn301 jobs]$ less 2647.out
Hello from thread 0 out of 4 from process 2 out of 16 on cpn3024
Hello from thread 2 out of 4 from process 2 out of 16 on cpn3024
Hello from thread 3 out of 4 from process 2 out of 16 on cpn3024
Hello from thread 0 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 1 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 3 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 2 out of 4 from process 10 out of 16 on cpn3026
Hello from thread 2 out of 4 from process 3 out of 16 on cpn3024
Hello from thread 3 out of 4 from process 3 out of 16 on cpn3024
Hello from thread 0 out of 4 from process 3 out of 16 on cpn3024
Hello from thread 1 out of 4 from process 3 out of 16 on cpn3024
...(truncated)
```
You can see that each process has 4 threads.
* __GPU job example__
1. This example uses the n-body sample code from the CUDA samples.
Path: `/opt/qct/ohpc/pub/nvidia/cuda/cuda-11.0/samples`
[Sample code](https://developer.nvidia.com/cuda-code-samples)
2. Compile the sample program with CUDA 11.0
`module load cuda/11.0`
3. Edit the corresponding parts of the Makefile in the nbody folder, e.g. the CUDA-related variable around line 37
`CUDA_PATH ?= /opt/qct/ohpc/pub/nvidia/cuda/cuda-11.0`
4. Compile the program
`make clean && make`
5. Prepare the job script
```
#!/bin/bash
#SBATCH --job-name nbody_test # Job name
#SBATCH --output %x-%j.out # Name of stdout output file (%x expands to jobname, %j expands to jobId)
#SBATCH --nodes=1 #Controls the number of nodes allocated to the job
#SBATCH --cpus-per-task=1 #Controls the number of CPUs allocated per task
#SBATCH --gres=gpu:1
#SBATCH --partition gpu
#SBATCH --account GOV109199
module purge
module load nvidia/cuda/11.0
echo "Your nbody job starts at `date`"
./nbody -benchmark -fp64 -numbodies=200000
echo "Your nbody job completed at `date` "
```
__Notes:__
* #SBATCH --gres=gpu:1
In slurm, GPUs are treated as a generic resource (GRES) and are requested with the --gres option. The resource name is "gpu"; the number after the colon specifies how many of that resource to request. This example requests one GPU; a sketch of requesting more than one follows below.
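A minimal sketch of a script requesting two GPUs instead of one, under the same partition and account as the example above (whether a single node actually provides that many GPUs depends on the gpu partition's hardware):
```
#!/bin/bash
#SBATCH --job-name gpu2_test       # Job name
#SBATCH --output %x-%j.out         # Name of stdout output file
#SBATCH --nodes=1                  # One node
#SBATCH --cpus-per-task=1          # One CPU per task
#SBATCH --gres=gpu:2               # Request two GPUs on that node
#SBATCH --partition gpu            # GPU partition
#SBATCH --account GOV109199        # Project number from the example above
nvidia-smi                         # should list the two allocated GPUs
```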
6. Submit the job
```
20:49:42 p00acy00@lgn301:nbody$ sbatch nbody.sh
Submitted batch job 55098
20:50:08 p00acy00@lgn301:nbody$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
55098 nbody_test gpu gov109199 1 COMPLETED 0:0
55098.batch batch gov109199 1 COMPLETED 0:0
```
7. Results
`cat nbody_test-55098.out`
```
Your nbody job starts at Tue Feb 9 20:50:09 CST 2021
Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
-fullscreen (run n-body simulation in fullscreen mode)
-fp64 (use double precision floating point values for simulation)
-hostmem (stores simulation data in host memory)
-benchmark (run benchmark to measure performance)
-numbodies=<N> (number of bodies (>= 1) to run in simulation)
-device=<d> (where d=0,1,2.... for the CUDA device to use)
-numdevices=<i> (where i=(number of CUDA devices > 0) to use for simulation)
-compare (compares simulation results running once on the default GPU and once on the CPU)
-cpu (run n-body simulation on the CPU)
-tipsy=<file.bin> (load a tipsy model file for simulation)
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
> Windowed mode
> Simulation data stored in video memory
> Double precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Volta" with compute capability 7.0
> Compute 7.0 CUDA device: [Tesla V100-SXM2-32GB]
Warning: "number of bodies" specified 200000 is not a multiple of 256.
Rounding up to the nearest multiple: 200192.
200192 bodies, total time for 10 iterations: 2133.370 ms
= 187.857 billion interactions per second
= 5635.709 double-precision GFLOP/s at 30 flops per interaction
Your nbody job completed at Tue Feb 9 20:50:12 CST 2021
```