---
title: Slurm 指令 | zh
tags: Guide, TWNIA3, TW
GA:
---

{%hackmd @docsharedstyle/default %}

# Slurm Commands

Slurm provides a number of commands for checking on jobs. The most commonly used ones are introduced below:

::: success
::: spoiler <b>1. Job submission - `sbatch`</b>
<br>

In a <b>terminal</b>, run `sbatch <name.sh>` (here, `intel.sh`) to submit a job. Slurm responds with a **Job ID** (84684 in this example).

```
[u***@lgn302 ~]$ sbatch intel.sh
Submitted batch job 84684
```
:::

::: success
::: spoiler <b>2. Job status query - `scontrol`</b>
<br>

In a <b>terminal</b>, run `scontrol show job <Job ID>` (84242 in this example) to display detailed information about that job. `scontrol show` can also report on job steps, nodes, partitions, reservations, and the system configuration.

```
[u***@lgn302 ~]$ scontrol show job 84242
JobId=84242 JobName=hello_world
   UserId=u***(22153) GroupId=GOV109199(55727) MCS_label=N/A
   Priority=110004 Nice=0 Account=acd110078 QOS=normal
   JobState=PENDING Reason=Nodes_required_for_job_are_DOWN,_DRAINED_or_reserved_for_jobs_in_higher_priority_partitions Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=12:00:00 TimeMin=N/A
   SubmitTime=2021-06-15T16:50:54 EligibleTime=2021-06-15T16:50:54
   AccrueTime=2021-06-15T16:50:54
   StartTime=2021-06-16T01:48:48 EndTime=2021-06-16T13:48:48 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-06-15T17:23:04
   Partition=normal AllocNode:Sid=lgn302:124318
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null) SchedNodeList=cpn[3541-3543]
   NumNodes=3-3 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,node=3,billing=24
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/u5418422/intel.sh
   WorkDir=/home/u5418422
   StdErr=/home/u5418422/84242.err
   StdIn=/dev/null
   StdOut=/home/u5418422/84242.out
   Power=
   MailUser=(null) MailType=NONE
```

`scontrol` is mainly used to view and change the <b>Slurm</b> system configuration. It can also show the details of a compute job, such as the reason it is waiting, its dependencies, and so on.

:::

::: success
::: spoiler <b>3. 
Status of managed partitions and nodes - `sinfo`</b>
<br>

`sinfo` reports the state of the partitions and nodes that <b>Slurm</b> manages, and it offers a variety of filtering, sorting, and formatting options. It is also commonly used to check that <b>Slurm</b> commands are working; for example, <b>`sinfo -V`</b> prints the <b>Slurm</b> version.

* `sinfo` shows the state of nodes and partitions (queues). The partition marked with "*" is the default partition.

```
[***@lgn301 jobs]$ sinfo
PARTITION AVAIL TIMELIMIT  NODES STATE NODELIST
--------- ----- ---------- ----- ----- -------------
cpu       down  infinite   900   idle  cpn[3001-3900]
test*     up    infinite   64    idle  cpn[3001-3064]
bgm       down  infinite   4     idle  bgm[3001-3004]
gpu       up    infinite   12    idle  gpn[3001-3012]
```

* `-l` shows more detailed (long-format) output

```
[***@lgn301 jobs]$ sinfo -l
Tue Jan 12 19:29:51 2021
PARTITION AVAIL TIMELIMIT JOB_SIZE   ROOT OVERSUBS GROUPS NODES STATE NODELIST
--------- ----- --------- ---------- ---- -------- ------ ----- ----- -------------
cpu       down  infinite  1-infinite no   NO       all    900   idle  cpn[3001-3900]
test*     up    infinite  1-infinite no   NO       all    64    idle  cpn[3001-3064]
bgm       down  infinite  1-infinite no   NO       all    4     idle  bgm[3001-3004]
gpu       up    infinite  1-infinite no   NO       all    12    idle  gpn[3001-3012]
```

* `-N` shows the status of each node

```
[***@lgn301 jobs]$ sinfo -N -l
Tue Jan 12 19:34:41 2021
NODELIST  NODES PARTITION STATE CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
--------- ----- --------- ----- ---- ------------- -------- ------- -------- ------
bgm3001   1     bgm       idle  112  4:28:1 619111 0        1      (null)   none
bgm3002   1     bgm       idle  112  4:28:1 619111 0        1      (null)   none
bgm3003   1     bgm       idle  112  4:28:1 619111 0        1      (null)   none
bgm3004   1     bgm       idle  112  4:28:1 619111 0        1      (null)   none
cpn3001   1     test*     idle  56   2:28:1 191880 0        1      (null)   none
cpn3001   1     cpu       idle  56   2:28:1 191880 0        1      (null)   none
cpn3002   1     test*     idle  56   2:28:1 191880 0        1      (null)   none
cpn3002   1     cpu       idle  56   2:28:1 191880 0        1      (null)   none
cpn3003   1     test*     idle  56   2:28:1 191880 0        1      (null)   none
....(omitted)
```
:::

::: success
::: spoiler <b>4. 
Show the status of jobs or job sets - `squeue`</b>
<br>

`squeue` shows the status of jobs or sets of jobs, with a variety of filtering, sorting, and formatting options. By default it lists running jobs in priority order, followed by pending jobs in priority order. It is one of the most commonly used commands for checking jobs. For example:

* `squeue` shows job status; `PD` means pending and `R` means running.

```
[***@lgn301 jobs]$ squeue
JOBID | PARTITION | NAME | USER ST | TIME TRES_P | NODES | EXEC_HOST
1334 test test spiraea7 PD 0:00 N/A 6 n/a
1335 test test spiraea7 PD 0:00 N/A 6 n/a
1336 test test spiraea7 PD 0:00 N/A 6 n/a
1337 test test spiraea7 PD 0:00 N/A 6 n/a
1338 test test spiraea7 PD 0:00 N/A 6 n/a
1339 test test spiraea7 PD 0:00 N/A 6 n/a
1340 test test spiraea7 PD 0:00 N/A 6 n/a
1341 test test spiraea7 PD 0:00 N/A 6 n/a
1324 test test spiraea7 R  2:03 N/A 6 cpn3001
1325 test test spiraea7 R  2:00 N/A 6 cpn3007
1326 test test spiraea7 R  2:00 N/A 6 cpn3013
```
:::

::: success
::: spoiler <b>5. List the status of past jobs or job sets - `sacct`</b>
<br>

`sacct` lists the status of the jobs or job sets associated with an account, for example running, terminated, or completed. It is the most basic command for reviewing jobs.

On this system, the records of jobs scheduled through <b>Slurm</b> are stored in logs and in a database. By default, `sacct` displays jobs, job steps, their states, and their exit codes. The `--format` option selects which fields to print.

Note that the <b>Slurm</b> database stores and maintains its information in lowercase letters by default, so lowercase input is recommended when specifying job names and related parameters. For example:

```
[***@lgn301 jobs]$ sacct
JobID      JobName    Partition Account   AllocCPUS State      ExitCode
---------  ---------  --------- --------- --------- ---------- ----------
1291       test       cpu       ent108161 24        CANCELLED+ 0:0
1292       test       test      ent108161 672       FAILED     1:0
1292.batch batch                ent108161 56        FAILED     1:0
1293       test       test      ent108161 672       COMPLETED  0:0
1293.batch batch                ent108161 56        COMPLETED  0:0
1294       test       test      ent108161 672       COMPLETED  0:0
1294.batch batch                ent108161 56        COMPLETED  0:0
1295       test       test      ent108161 112       FAILED     1:0
1295.batch batch                ent108161 56        FAILED     1:0
1295.0     hello                ent108161 24        FAILED     1:0
1296       test       test      ent108161 112       FAILED     1:0
1296.batch batch                ent108161 56        FAILED     1:0
1297       test       test      ent108161 112       COMPLETED  0:0
1297.batch batch                ent108161 56        COMPLETED  0:0
1297.0     orted                ent108161 1         COMPLETED  0:0
1298       test       test      ent108161 1344      COMPLETED  0:0
```

`-A`, `--accounts=<account_list>` restricts the output to the listed accounts. On this system an account is a project ID, such as GOV109199.

`-e`, `--helpformat` lists all the fields that can be queried:

```
Account 
AdminComment AllocCPUS AllocNodes
AllocTRES AssocID AveCPU AveCPUFreq
AveDiskRead AveDiskWrite AvePages AveRSS
AveVMSize BlockID Cluster Comment
Constraints ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DBIndex DerivedExitCode Elapsed
ElapsedRaw Eligible End ExitCode
Flags GID Group JobID
JobIDRaw JobName Layout MaxDiskRead
MaxDiskReadNode MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode
MaxDiskWriteTask MaxPages MaxPagesNode MaxPagesTask
MaxRSS MaxRSSNode MaxRSSTask MaxVMSize
MaxVMSizeNode MaxVMSizeTask McsLabel MinCPU
MinCPUNode MinCPUTask NCPUS NNodes
NodeList NTasks Priority Partition
QOS QOSRAW Reason ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqCPUS
ReqMem ReqNodes ReqTRES Reservation
ReservationId Reserved ResvCPU ResvCPURAW
Start State Submit Suspended
SystemCPU SystemComment Timelimit TimelimitRaw
TotalCPU TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode
TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask
TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode
TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask
TRESUsageOutTot UID User UserCPU
WCKey WCKeyID WorkDir
```

`-S`, `--starttime=<start_time>` and `-E`, `--endtime=<end_time>` limit the output to a time window. The accepted time formats are:

```
HH:MM[:SS][AM|PM]
MMDD[YY][-HH:MM[:SS]]
MM.DD[.YY][-HH:MM[:SS]]
MM/DD[/YY][-HH:MM[:SS]]
YYYY-MM-DD[THH:MM[:SS]]
```

For example, `2021-01-02T13:06:00` means 1:06 PM on January 2, 2021.

```
[***@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50
JobID      JobName    Partition Account   AllocCPUS State      ExitCode
---------  ---------  --------- --------- --------- ---------- ----------
1368       test       test      ent108161 1344      TIMEOUT    0:0
1368.batch batch                ent108161 56        CANCELLED  0:15
1368.0     orted                ent108161 23        COMPLETED  0:0
1369       test       test      ent108161 1344      TIMEOUT    0:0
1369.batch batch                ent108161 56        CANCELLED  0:15
1369.0     orted                ent108161 23        COMPLETED  0:0
1370       hello_wor+ test      ent108161 168       COMPLETED  0:0
1370.batch batch                ent108161 56        COMPLETED  0:0
1370.0     pmi_proxy            ent108161 3         COMPLETED  0:0
1371       test       test      ent108161 
1344      TIMEOUT    0:0
1371.batch batch                ent108161 56        CANCELLED  0:15
1371.0     orted                ent108161 23        COMPLETED  0:0
1372       hello_wor+ test      ent108161 168       COMPLETED  0:0
```

`-o`, `--format` defines custom output fields. Choose the fields you need; the field names are those listed by the `-e` option, and they are not case-sensitive. Alternatively, set the <b>SACCT_FORMAT</b> environment variable once so that the fields need not be given on the command line each time. For example:

```
[***@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50 --format=jobid,user,NCPUS,Submit,Start,End,State,Elapsed,ElapsedRaw,account
JobID       User        NCPUS      Submit              Start               End                 State      Elapsed     ElapsedRaw    Account
----------- ----------- ---------- ------------------- ------------------- ------------------- ---------- ----------- ------------- ----------
1368        spiraea7    1344       2021-01-12T21:31:09 2021-01-12T21:31:10 2021-01-12T21:36:36 TIMEOUT    00:05:26    326           ent108161
1368.batch              56         2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:36:36 CANCELLED  00:05:26    326           ent108161
1368.0                  23         2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:31:10 COMPLETED  00:00:00    0             ent108161
1369        spiraea7    1344       2021-01-12T21:32:11 2021-01-12T21:32:12 2021-01-12T21:37:36 TIMEOUT    00:05:24    324           ent108161
1369.batch              56         2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:37:36 CANCELLED  00:05:24    324           ent108161
1369.0                  23         2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:32:12 COMPLETED  00:00:00    0             ent108161
1370        spiraea7    168        2021-01-12T21:32:19 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED  00:20:01    1201          ent108161
1370.batch              56         2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED  00:20:01    1201          ent108161
1370.0                  3          2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:32:21 COMPLETED  00:00:01    1             ent108161
1371        spiraea7    1344       2021-01-12T21:32:22 2021-01-12T21:36:36 2021-01-12T21:41:36 TIMEOUT    00:05:00    300           ent108161
1371.batch              56         2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:41:36 CANCELLED  00:05:00    300           ent108161
1371.0                  23         2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:36:36 COMPLETED  00:00:00    0             ent108161
1372        spiraea7    168        2021-01-12T21:32:25 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED  00:20:02    1202          
ent108161
1372.batch              56         2021-01-12T21:32:41 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED  00:20:02    1202          ent108161
1372.0                  3          2021-01-12T21:32:43 2021-01-12T21:32:43 2021-01-12T21:32:43 COMPLETED  00:00:00    0             ent108161
```

```
[***@lgn301 jobs]$ export SACCT_FORMAT=jobid,user,NCPUS,Submit,Start,End,State,Elapsed,ElapsedRaw,account
[***@lgn301 jobs]$ sacct -S 2021-01-12T21:30 -E 2021-01-12T21:50
JobID        User      NCPUS      Submit              Start               End                 State      Elapsed    ElapsedRaw    Account
------------ --------- ---------- ------------------- ------------------- ------------------- ---------- ---------- ------------- ----------
1368         spiraea7  1344       2021-01-12T21:31:09 2021-01-12T21:31:10 2021-01-12T21:36:36 TIMEOUT    00:05:26   326           ent108161
1368.batch             56         2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:36:36 CANCELLED  00:05:26   326           ent108161
1368.0                 23         2021-01-12T21:31:10 2021-01-12T21:31:10 2021-01-12T21:31:10 COMPLETED  00:00:00   0             ent108161
1369         spiraea7  1344       2021-01-12T21:32:11 2021-01-12T21:32:12 2021-01-12T21:37:36 TIMEOUT    00:05:24   324           ent108161
1369.batch             56         2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:37:36 CANCELLED  00:05:24   324           ent108161
1369.0                 23         2021-01-12T21:32:12 2021-01-12T21:32:12 2021-01-12T21:32:12 COMPLETED  00:00:00   0             ent108161
1370         spiraea7  168        2021-01-12T21:32:19 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED  00:20:01   1201          ent108161
1370.batch             56         2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:52:21 COMPLETED  00:20:01   1201          ent108161
1370.0                 3          2021-01-12T21:32:20 2021-01-12T21:32:20 2021-01-12T21:32:21 COMPLETED  00:00:01   1             ent108161
1371         spiraea7  1344       2021-01-12T21:32:22 2021-01-12T21:36:36 2021-01-12T21:41:36 TIMEOUT    00:05:00   300           ent108161
1371.batch             56         2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:41:36 CANCELLED  00:05:00   300           ent108161
1371.0                 23         2021-01-12T21:36:36 2021-01-12T21:36:36 2021-01-12T21:36:36 COMPLETED  00:00:00   0             ent108161
1372         spiraea7  168        2021-01-12T21:32:25 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED  00:20:02   1202          ent108161
1372.batch             56         2021-01-12T21:32:41 2021-01-12T21:32:41 2021-01-12T21:52:43 COMPLETED  00:20:02   1202          ent108161
1372.0                 3          2021-01-12T21:32:43 2021-01-12T21:32:43 2021-01-12T21:32:43 COMPLETED  00:00:00   0             ent108161
```
:::

::: success
::: spoiler <b>6. Cancel jobs - `scancel`</b>
<br>

`scancel` cancels pending or running jobs or job sets. It can send a signal to, or cancel, specific jobs, job arrays, or job steps. For example:

* `scancel -i [job ID]` asks for confirmation before cancelling the job

```
[***@lgn301 jobs]$ squeue
JOBID | PARTITION | NAME | USER ST | TIME TRES_P | NODES | EXEC_HOST
1365 test hello_wo spiraea7 R 5:38 N/A 3  cpn3061
1366 test test     spiraea7 R 0:37 N/A 24 cpn3004
1367 test hello_wo spiraea7 R 0:37 N/A 3  cpn3039
[spiraea7@lgn301 jobs]$ scancel -i 1366
Cancel job_id=1366 name=test partition=test [y/n]? y
[spiraea7@lgn301 jobs]$ squeue
JOBID | PARTITION | NAME | USER ST | TIME TRES_P | NODES | EXEC_HOST
1365 test hello_wo spiraea7 R 6:03 N/A 3 cpn3061
1367 test hello_wo spiraea7 R 1:02 N/A 3 cpn3039
```

* `-u` cancels the jobs of the specified user

```
[***@lgn301 jobs]$ squeue
JOBID | PARTITION | NAME | USER ST | TIME TRES_P | NODES | EXEC_HOST
1365 test hello_wo spiraea7 R 8:39 N/A 3 cpn3061
1367 test hello_wo spiraea7 R 3:38 N/A 3 cpn3039
[***@lgn301 jobs]$ scancel -u spiraea7
[***@lgn301 jobs]$ squeue
JOBID | PARTITION | NAME | USER ST | TIME TRES_P | NODES | EXEC_HOST
```
:::
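
The commands above are often chained in a script. The following is a minimal sketch rather than an official example: it assumes the Slurm client commands are on `PATH` and reuses `intel.sh` from section 1 as the job script. The helper names `parse_job_id` and `submit_and_report` are hypothetical and not part of Slurm. `parse_job_id` extracts the numeric ID from the `Submitted batch job <id>` line printed by `sbatch`, so the same ID can then be passed to `squeue`, `sacct`, or `scancel`.

```shell
#!/bin/sh
# Sketch only: the helper names below are hypothetical, not Slurm commands.

# Extract the numeric job ID from sbatch's confirmation line,
# e.g. "Submitted batch job 84684" -> 84684.
# Prints nothing if the line does not have that form.
parse_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Submit a job script, then show its queue entry and accounting record.
# Assumes sbatch, squeue, and sacct are available on PATH.
submit_and_report() {
    out=$(sbatch "$1") || return 1
    job_id=$(parse_job_id "$out")
    echo "Job ID: $job_id"
    squeue -j "$job_id"
    sacct -j "$job_id" --format=jobid,jobname,state,exitcode
}
```

For example, `submit_and_report intel.sh` would submit the script and print its status; running `scancel "$job_id"` afterwards would cancel that job.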