# Job Submission and Management Examples

1. Submit a job with sbatch
:::success
```
# Create an sbatch job script (sample-job.sh)
# This example allocates two compute nodes, runs 112 tasks on each node, and gives each task 1 CPU core
# Because no memory option is specified, all available memory on each node is allocated by default (482582 MB * 2 in total)
# If --mem=450G is specified, the total memory is 450G * 2 (nodes)
[user@ilgn01 ~]$ vim sample-job.sh
#!/bin/bash
#SBATCH --account=<PROJECT_ID>         # (-A) iService Project ID
#SBATCH --job-name=sbatch              # (-J) Job name
#SBATCH --partition=development        # (-p) Slurm partition
#SBATCH --nodes=2                      # (-N) Number of nodes to be allocated
#SBATCH --cpus-per-task=1              # (-c) Number of cores per MPI task
#SBATCH --ntasks-per-node=112          # Maximum number of tasks on each node
#SBATCH --time=00:30:00                # (-t) Wall time limit (days-hrs:min:sec)
#SBATCH --output=job-%j.out            # (-o) Path to the standard output file
#SBATCH --error=job-%j.err             # (-e) Path to the standard error file
#SBATCH --mail-type=END,FAIL           # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=user@example.com   # Where to send mail. Set this to your email address

module purge
module load intel/....

mpiexec ./hello
```

```
# Submit the job script
[user@ilgn01 ~]$ sbatch sample-job.sh
```

```
# Check the status of the job queue
[user@ilgn01 ~]$ squeue -u $UID
```

```
# Check the execution status of a job
[user@ilgn01 ~]$ scontrol show job <job_id>
```
:::

2. Submit a job with salloc
:::success
```
# Submit an interactive job that uses a single core on a single CPU node
# Once the allocation succeeds, the output shows that the Job ID is 6938 and the allocated compute node is icpnp305.
[user@ilgn01 ~]$ salloc --partition=development --account=<PROJECT_ID> --ntasks=1 --ntasks-per-node=1
salloc: Granted job allocation 6938
salloc: Waiting for resource configuration
salloc: Nodes icpnp305 are ready for job

# You are now inside the dedicated salloc shell
# Until you leave this shell, Job 6938 stays in the RUNNING state and continues to be billed
[user@ilgn01 salloc_6938 ~]$

# To leave the salloc shell, enter the exit command
[user@ilgn01 salloc_6938 ~]$
```

```
# List the Slurm environment variables to see information about the job
[user@ilgn01 salloc_6938 ~]$ env | grep -i slurm
```

```
# Inside the salloc shell you can run srun; each srun invocation is one job step
[user@ilgn01 salloc_6938 ~]$ srun hostname
icpnp305
```

```
# You can ssh directly into the compute node allocated to this job to run programs or commands
[user@ilgn01 salloc_6938 ~]$ ssh icpnp305
[user@icpnp305 ~]$ hostname
icpnp305
[user@icpnp305 ~]$ exit
logout
Connection to icpnp305 closed.
[user@ilgn01 salloc_6938 ~]$

# Alternatively, you can use srun to open a shell on the compute node allocated to this job
[user@ilgn01 salloc_6938 ~]$ srun --pty /bin/bash
[user@icpnp305 salloc_6938 ~]$ hostname
icpnp305
[user@icpnp305 salloc_6938 ~]$ exit
exit
[user@ilgn01 salloc_6938 ~]$
```

```
# You can use the sacct command to list the job steps and their results
# 6938.0 and 6938.1 are the steps executed above; the numbers 0 and 1 are the step IDs
[user@ilgn01 salloc_6938 ~]$ sacct -j $SLURM_JOBID
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
6938         interacti+ developme+  govXXXXXX          1    RUNNING      0:0
6938.extern      extern             govXXXXXX          1    RUNNING      0:0
6938.0         hostname             govXXXXXX          1  COMPLETED      0:0
6938.1             bash             govXXXXXX          1  COMPLETED      0:0
```

```
# To leave the interactive job, enter the exit command
[user@ilgn01 salloc_6938 ~]$ exit
salloc: Relinquishing job allocation 6938
salloc: Job allocation 6938 has been revoked.
[user@ilgn01 ~]$
```
:::
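
The resource totals quoted in the comments of the sbatch example (section 1) follow from simple multiplication. The sketch below reproduces that arithmetic in plain bash. It does not call Slurm, and the variable names are illustrative assumptions rather than anything Slurm defines. The per-node memory figure (482582 MB) and the `--mem=450G` case are the values stated in the example.

```shell
#!/bin/bash
# Values copied from the sample-job.sh header in section 1.
# Variable names are hypothetical and chosen only for this illustration.
nodes=2                   # --nodes
ntasks_per_node=112       # --ntasks-per-node
cpus_per_task=1           # --cpus-per-task
mem_per_node_mb=482582    # default per-node memory when no memory option is given
mem_opt_gb=450            # per-node memory if --mem=450G were specified

# Total MPI tasks and CPU cores across the whole allocation.
total_tasks=$(( nodes * ntasks_per_node ))
total_cpus=$(( total_tasks * cpus_per_task ))

# --mem is a per-node limit, so total memory scales with the node count.
total_mem_mb=$(( nodes * mem_per_node_mb ))
total_mem_with_opt_gb=$(( nodes * mem_opt_gb ))

echo "tasks=${total_tasks} cpus=${total_cpus}"
echo "default memory: ${total_mem_mb} MB, with --mem=${mem_opt_gb}G: ${total_mem_with_opt_gb} GB"
```

Changing `nodes` or `ntasks_per_node` in this sketch is a quick way to sanity-check a request before editing the `#SBATCH` lines.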