TWCC Parabricks Quickstart & Tutorial

--- tags: tutorial, HPC, TWCC --- # TWCC Parabricks Quickstart & Tutorial ## 關於 Parabricks Parabricks 是一套分析次世代定序資料的軟體，主要利用GPU去加速基因體變異點位分析工具如BWA、GATK4 及 Deepvariant。 Parabricks除了包含分析中會用到的多個工具，也包含了常用的分析流程。目前在國網中心的測試結果中，使用TWCC跑Parabricks 比在 Tawania1 的 CPU 上跑 GATK4 快約13倍。 ![](https://i.imgur.com/PKXpjN8.png) ### Parabricks 官方文件 https://www.nvidia.com/en-us/docs/parabricks/ https://www.nvidia.com/parabricks ## 登入 ln01.twcc.ai 詳情可參考 https://www.twcc.ai/doc?page=hpc_cli - 登入方式 `ssh iService主機帳號@ln01.twcc.ai` - 登入密碼 `iService主機密碼加上OTP動態碼(ex: $iservice_pw$otp`) ## 操作簡述 HPC 操作需在登入節點撰寫 `slurm job script` ，並提交 job 才能運行計算。使用 `environment module` 管理軟體環境。 - slurm job script 在腳本開頭定義需索取資源，讓排程系統 slurm 了解你的 job 需求，並安排適合你 job 的節點運行。底下會有範例。 - environment module 管理 HPC 軟體環境變數的工具。主要指令有 `module list`(列出可用)、`module load $工具名稱`、`module purge`(卸載已載入)。 - 資源限制 job 需符合 1 GPU : 4 CPUs : 90 GB Memory 比例 - queue 不同排隊的隊列。slurm 排程系統稱作 partition。目前增設 `covid19`，當 job 提交到此 queue 會獲得較高的優先權(需被允許的計畫)。其他現有的 queue 有 `gtest`(測試用)、`gp1d`(最長跑 1 天)、`gp2d`(最長跑 2 天)、`gp4d` (最長跑 4 天)、`express` (企業用戶，最長跑 4 天)、`covid19`(抗疫專案，最長跑 4 天)。 ## slurm job script 該 job 運行 parabricks 的[官網範例](https://www.nvidia.com/en-us/docs/parabricks/quickstart-guide/) (這裡以Alignment工作為範例，如要做variant calling 請見[變異點位分析](#變異點位分析（Variant-Calling）)）。 job script 索取 1 節點、 8 張 GPUs、 32 CPUs。 `$parabricks_sample` 是一環境變數，是已經把 parabricks 提供範例解壓縮至此路徑。 ```bash= #!/bin/bash #SBATCH --job-name=Hello_twcc ## job name #SBATCH --nodes=1 ## 索取 1 節點 #SBATCH --cpus-per-task=32 ## 該 task 用 32 CPUs #SBATCH --gres=gpu:8 ## 每個節點索取 8 GPUs #SBATCH --time=00:10:00 ## 最長跑 10 分鐘 (測試完這邊記得改掉，或是測試完直接刪除該行) #SBATCH --account=iService_ID ## iService_ID 請填入計畫ID(ex: MST108XXX)，扣款也會根據此計畫ID #SBATCH --partition=covid19 ## 抗疫專案專用 queue module purge module load parabricks echo parabricks sample path is $parabricks_sample pbrun fq2bam --ref $parabricks_sample/Ref/Homo_sapiens_assembly38.fasta --in-fq $parabricks_sample/Data/sample_1.fq.gz $parabricks_sample/Data/sample_2.fq.gz --out-bam output.bam ``` ## 提交 slurm job script ``` sbatch slurm_job.sh ``` 可以使用 `tail -f slurm-$jobid.out` 觀看運行狀態，以及使用 slurm 指令`squeue -u $user`(查看正在運行的 job)、`sacct -X` (查看今日運行的 job 及狀態)，查看是否還在運行 or 運算結束了。 ## 執行結果 ```bash= cat slurm-$job_id.out ``` ```bash= Message from HPC admin ----------------------- For Fighting COVID-19, Parabricks provided by NVIDIA. See more: https://www.nvidia.com/parabricks Simple tutorial https://hackmd.io/@kmo/twcc_hpc_parabricks ----------------------- parabricks sample path is /opt/ohpc/twcc/parabricks/parabricks_sample Please visit https://www.nvidia.com/en-us/docs/parabricks/ for detailed documentation [Parabricks Options Mesg]: Checking argument compatibility [Parabricks Options Mesg]: Automatically generating ID prefix [Parabricks Options Mesg]: Read group created for /opt/ohpc/twcc/parabricks/parabricks_sample/Data/sample_1.fq.gz and /opt/ohpc/twcc/parabricks/parabricks_sample/Data/sample_2.fq.gz [Parabricks Options Mesg]: @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1 ------------------------------------------------------------------------------ || Parabricks accelerated Genomics Pipeline || || Version v3.1.6 || || GPU-BWA mem, Sorting Phase-I || || Contact: Parabricks-Support@nvidia.com || ------------------------------------------------------------------------------ [M::bwa_idx_load_from_disk] read 0 ALT contigs WARNING The system has 36 threads, however recommended number of threads with 8 GPU is 48. The run might not finish or might have less than expected performance. GPU-BWA mem ProgressMeter Reads Base Pairs Aligned [11:43:47] 5043564 720000000 [11:43:52] 10087128 1160000000 [11:43:58] 15130692 1760000000 [11:43:58] 15130692 1780000000 [11:44:03] 20348172 2400000000 [11:44:09] 25391736 2930000000 [11:44:15] 30435300 3520000000 [11:44:21] 35478864 4070000000 [11:44:26] 40522428 4710000000 [11:44:32] 45565992 5250000000 [11:44:38] 50609556 5810000000 GPU-BWA Mem time: 118.623977 seconds GPU-BWA Mem is finished. GPU Sorting, Marking Dups, BQSR ProgressMeter SAM Entries Completed Total GPU-BWA Mem + Sorting + MarkingDups + BQSR Generation + BAM writing Processing time: 118.636142 seconds [main] CMD: PARABRICKS mem -Z ./pbOpts.txt /opt/ohpc/twcc/parabricks/parabricks_sample/Ref/Homo_sapiens_assembly38.fasta /opt/ohpc/twcc/parabricks/parabricks_sample/Data/sample_1.fq.gz /opt/ohpc/twcc/parabricks/parabricks_sample/Data/sample_2.fq.gz @RG\tID:HK3TJBCX2.1\tLB:lib1\tPL:bar\tSM:sample\tPU:HK3TJBCX2.1 [main] Real time: 121.890 sec; CPU: 2158.981 sec ------------------------------------------------------------------------------ || Program: GPU-BWA mem, Sorting Phase-I || || Version: v3.1.6 || || Start Time: Mon Nov 9 11:43:18 2020 || || End Time: Mon Nov 9 11:45:20 2020 || || Total Time: 2 minutes 2 seconds || ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ || Parabricks accelerated Genomics Pipeline || || Version v3.1.6 || || Sorting Phase-II || || Contact: Parabricks-Support@nvidia.com || ------------------------------------------------------------------------------ progressMeter - Percentage [11:45:23] 0.0 0.00 GB Sorting and Marking: 10.001 seconds ------------------------------------------------------------------------------ || Program: Sorting Phase-II || || Version: v3.1.6 || || Start Time: Mon Nov 9 11:45:23 2020 || || End Time: Mon Nov 9 11:45:33 2020 || || Total Time: 10 seconds || ------------------------------------------------------------------------------ ------------------------------------------------------------------------------ || Parabricks accelerated Genomics Pipeline || || Version v3.1.6 || || Marking Duplicates, BQSR || || Contact: Parabricks-Support@nvidia.com || ------------------------------------------------------------------------------ progressMeter - Percentage [11:45:43] 73.3 1.58 GB [11:45:53] 100.0 0.00 GB BQSR and writing final BAM: 20.051 seconds ------------------------------------------------------------------------------ || Program: Marking Duplicates, BQSR || || Version: v3.1.6 || || Start Time: Mon Nov 9 11:45:33 2020 || || End Time: Mon Nov 9 11:45:53 2020 || || Total Time: 20 seconds || ------------------------------------------------------------------------------ Please visit https://www.nvidia.com/en-us/docs/parabricks/ for detailed documentation ``` ## 取消正在運行的 job 若想停止正在執行的 job，可以執行此指令 ```bash= scancel $jobid ``` --- # 變異點位分析（Variant Calling） ## Germline Pipeline Parabrikcs Germline Pipeline 與 GATK4 best practices pipeline類似。輸入檔案有 BWA-indexed reference files、fastq檔案、給BQSR計算的knownSites,輸出檔案有： * Aligned, co-ordinate sorted, duplicated marked bam * BQSR report * Variants in vcf/g.vcf/g.vcf.gz format ### 流程內容這個流程串接好幾個Parabricks工具在一起，Parabricks也可以獨立執行這些工具。這裡我們直接使用Parabricks串接好的Germline pipeline。 ![](https://i.imgur.com/TmxexM6.png) 圖片來源：[https://www.nvidia.com/en-us/docs/parabricks/germline/](https://www.nvidia.com/en-us/docs/parabricks/germline/) ### 指令 Parabricks germline pipeline的指令如下，請把檔案部分替換成您自己的檔案即可。如何上傳檔案請參考 [https://www.twcc.ai/doc?page=hfs](https://www.twcc.ai/doc?page=hfs)。 ```shell pbrun germline --ref Homo_sapiens_assembly38.fasta \ --in-fq sample0_0.fq.gz sample_0_1.fq.gz \ --knownSites known.vcf.gz \ --out-bam output.bam \ --out-variants output.vcf \ --out-recal-file report.txt ``` ::: info #### 參數解釋 | 參數 | 檔案 | 內容 | | ------------ | ---------------------------- | -------------- | | \-\-ref | Homo_sapiens_assembly38.fasta | 參考序列 | | \-\-in-fq | sample0_0.fq.gz | 序列檔案 | | \-\-knownSites | known.vcf.gz | 已知變異點位檔案 | | \-\-out-bam | output.bam | 序列比對序列檔案 | | \-\-out-variants | output.vcf | 變異點位分析結果 | | \-\-out-recal-file | report.txt | Recalibraiton 報告 | 其他參數解釋，請參考 [使用手冊](https://www.nvidia.com/en-us/docs/parabricks/somatic/)。或直接在登入後執行`module load parabricks； pbrun germline -h` 可以直接看使用手冊。 ::: ### 送出工作 #### 編輯Slurm工作指令稿若要送出germline pipeline工作，請修改先前提到的 [slurm_job.sh](#slurm-job-script) 裡面的內容，將最後一個指令修改如下，然後送出工作： ```shell= #!/bin/bash #SBATCH --job-name=Hello_twcc ## job name #SBATCH --nodes=1 ## 索取 1 節點 #SBATCH --cpus-per-task=32 ## 該 task 用 32 CPUs #SBATCH --gres=gpu:8 ## 每個節點索取 8 GPUs #SBATCH --time=00:10:00 ## 最長跑 10 分鐘 (測試完這邊記得改掉，或是測試完直接刪除該行) #SBATCH --account=iService_ID ## iService_ID 請填入計畫ID(ex: MST108XXX)，扣款也會根據此計畫ID #SBATCH --partition=covid19 ## 抗疫專案專用 queue module purge module load parabricks echo parabricks sample path is $parabricks_sample ##以下為 germline pipeline 程式 pbrun germline --ref Homo_sapiens_assembly38.fasta \ --in-fq sample0_0.fq.gz sample_0_1.fq.gz \ --knownSites known.vcf.gz \ --out-bam output.bam \ --out-variants output.vcf \ --out-recal-file report.txt ``` ## SOMATIC PIPELINE Somatic pipeline 可分析腫瘤定序資料，可選擇加入正常細胞定序資料一起分析。輸出檔案格式為VCF。如要送出工作請參考 [Germline Pipline](https://hackmd.io/Sg3_rfIQT8y2kOsJ_3YrrQ?both#Germline-Pipeline) 做法。 ### 流程內容 ![](https://i.imgur.com/60rons9.png) 圖片來源：[https://www.nvidia.com/en-us/docs/parabricks/germline/](https://www.nvidia.com/en-us/docs/parabricks/germline/) ### 指令 #### 1. 如果只跑tumor sample： ```bash= pbrun somatic --ref Ref/Homo_sapiens_assembly38.fasta \ --in-tumor-fq sample1-0.fq.gz sample1-1.fq.gz \ --out-vcf output.vcf \ --out-tumor-bam tumor.bam \ --out-tumor-recal-file recal.txt ``` #### 2. 如果要加入正常細胞定序資料比對： ```bash= pbrun somatic --ref Ref/Homo_sapiens_assembly38.fasta \ --knownSites knownsites.vcf.gz \ --in-tumor-fq tumor0.fq.gz tumor1.fq.gz "@RG\tID:sm_tumor_rg1\tLB:lib1\tPL:bar\tSM:sm_tumor\tPU:sm_tumor_rg1" \ --out-vcf output.vcf \ --out-tumor-bam tumor.bam \ --out-tumor-recal-file recal.txt \ --in-normal-fq normal0.fq.gz normal1.fq.gz "@RG\tID:sm_normal_rg1\tLB:lib1\tPL:bar\tSM:sm_normal\tPU:sm_normal_rg1" \ --out-normal-bam normal.bam ``` :::info #### 參數解釋 | 參數 | 檔案 | 內容 | | ------------------------ | -------------------------------------------------------------------------------------------- | ----------------------------------------------------------- | | \-\-ref | Homo_sapiens_assembly38.fasta | 參考序列 | | \-\-in-tumor-fq | sample0_0.fq.gz | 腫瘤序列資料 | | | "@RG\tID:**sm_tumor_rg1**\tLB:**lib1**\tPL:**bar**\tSM:**sm_tumor**\tPU:**sm_tumor_rg1**" | Read group 標籤,請置換粗體部分為自己的資料,請見下方詳細解釋 | | \-\-in-normal-fq | normal0.fq.gz normal1.fq.gz | 正常細胞序列資料 | | | "@RG\tID:**sm_normal_rg1**\tLB:**lib1**\tPL:**bar**\tSM:**sm_normal**\tPU:**sm_normal_rg1**" | Read group 標籤,請置換粗體部分為自己的資料,請見下方詳細解釋 | | \-\-knownSites | known.vcf.gz | 已知變異點位檔案 | | \-\-out-tumor-bam | output.bam | 腫瘤序列mapping結果 | | \-\-out-normal-bam | output.bam | 正常細胞序列mapping結果 | | \-\-out-vcf | output.vcf | 變異點位分析結果 | | \-\-out-tumor-recal-file | report.txt | Recalibraiton 報告 | 其他參數請見 [https://www.nvidia.com/en-us/docs/parabricks/somatic/](https://www.nvidia.com/en-us/docs/parabricks/somatic/) 或直接在登入後執行`module load parabricks； pbrun somatic -h` 可以直接看使用手冊。 ::: :::spoiler **Read Group 標籤** [資料來源GATK Documents](https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups) **ID = Read group identifier** This tag identifies which read group each read belongs to, so each read group's ID must be unique. It is referenced both in the read group definition line in the file header (starting with @RG) and in the RG:Z tag for each read record. Note that some Picard tools have the ability to modify IDs when merging SAM files in order to avoid collisions. In Illumina data, read group IDs are composed using the flowcell name and lane number, making them a globally unique identifier across all sequencing data in the world. Use for BQSR: ID is the lowest denominator that differentiates factors contributing to technical batch effects: therefore, a read group is effectively treated as a separate run of the instrument in data processing steps such as base quality score recalibration (unless you have PU defined), since they are assumed to share the same error model. **PU = Platform Unit** The PU holds three types of information, the {FLOWCELL_BARCODE}.{LANE}.{SAMPLE_BARCODE}. The {FLOWCELL_BARCODE} refers to the unique identifier for a particular flow cell. The {LANE} indicates the lane of the flow cell and the {SAMPLE_BARCODE} is a sample/library-specific identifier. Although the PU is not required by GATK but takes precedence over ID for base recalibration if it is present. In the example shown earlier, two read group fields, ID and PU, appropriately differentiate flow cell lane, marked by .2, a factor that contributes to batch effects. **SM = Sample** The name of the sample sequenced in this read group. GATK tools treat all read groups with the same SM value as containing sequencing data for the same sample, and this is also the name that will be used for the sample column in the VCF file. Therefore it is critical that the SM field be specified correctly. When sequencing pools of samples, use a pool name instead of an individual sample name. **PL = Platform/technology used to produce the read** This constitutes the only way to know what sequencing technology was used to generate the sequencing data. Valid values: ILLUMINA, SOLID, LS454, HELICOS and PACBIO. **LB = DNA preparation library identifier** MarkDuplicates uses the LB field to determine which read groups might contain molecular duplicates, in case the same DNA library was sequenced on multiple lanes. ::: ### 送出工作請參考 [Germline Pipeline](#送出工作0) 做法。 ## 其他Tools以及Piplines Parabricks 除了上述兩個比較常用的 pipelines 以外，還提供的 stand-alone 的工具以及其他分析 pipelines。如有需要可以參考其網頁的說明： [https://www.nvidia.com/en-us/docs/parabricks/](https://www.nvidia.com/en-us/docs/parabricks/)