# Singularity: Build Your Own Working Environment

[TOC]

You can use Singularity to package the software and programs you need, building an environment for running compute jobs on the 晶創 host (command-line interface) service.

## Example 1: Using a pre-built Singularity image on the 晶創 host

### 1. Path to the .sif image

A pre-built .sif image is already available on the system at the path below; no installation is required.

```
[user@cbi-lgn01 ]$ ls /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
/work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
```

[NVIDIA official documentation](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-22-09.html): this example uses the official NVIDIA release (22.09).

### 2. Horovod benchmark script

This section uses a benchmark script written with Horovod as the example to edit and test. Use wget to download the file from GitHub into your directory:

```
[user@cbi-lgn01 ]$ wget https://raw.githubusercontent.com/horovod/horovod/v0.20.3/examples/pytorch/pytorch_synthetic_benchmark.py
[user@cbi-lgn01 ]$ ls
pytorch_synthetic_benchmark.py
```

[Horovod GitHub](https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py): the upstream benchmark .py source.

### 3. Write the Slurm job script (pytorch_synthetic_benchmark.sh as the example)

```
[user@cbi-lgn01 ]$ vi pytorch_synthetic_benchmark.sh

#!/bin/bash
#SBATCH --job-name=singularity   ## job name
#SBATCH --nodes=2                ## request 2 nodes
#SBATCH --ntasks-per-node=4      ## run 4 srun tasks per node
#SBATCH --cpus-per-task=4        ## request 4 CPUs per srun task
#SBATCH --gres=gpu:4
#SBATCH --account=GOV113XXX      ## PROJECT_ID: fill in your project ID
#SBATCH --partition=normal
#SBATCH -o %j_mine.out           ## Path to the standard output file
#SBATCH -e %j_mine.err           ## Path to the standard error output file

module purge
ml singularity

export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_GPU_DIRECT_RDMA=1
export HOROVOD_CPU_OPERATIONS=CCL
export NCCL_DEBUG=WARN

SIF=/work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
SINGULARITY=(singularity run --nv --no-home -B .:/data "$SIF")
HOROVOD=(python /data/pytorch_synthetic_benchmark.py --batch-size 256)

srun --mpi=pmi2 "${SINGULARITY[@]}" "${HOROVOD[@]}"
```

::::spoiler Script walkthrough
:::info
1. `ml singularity`
   `module load singularity` is the first step in using Singularity: it tells the system we are about to use Singularity and prepares the required environment.
2. `export` environment variables
   These variables tune the performance of distributed deep-learning training: they pin a specific network device, enable GPUDirect RDMA, select an efficient communication backend, and set a suitable debug level.
3. Load the SIF image from its path
   `/work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif`
4. `/data` for sharing files
   Files on the host (training datasets, model weights, and so on) can be accessed directly inside the container. Changes you make under `/data` inside the container are also reflected on the host, which is convenient for experimenting and tuning.
   `Bind the current directory (.) to /data inside the container.`
   SINGULARITY=(singularity run --nv --no-home -B .:/data "$SIF")
   `The Python program to execute lives under /data inside the container.`
   HOROVOD=(python /data/pytorch_synthetic_benchmark.py --batch-size 256)
5. Bash arrays for the command
   `SINGULARITY` and `HOROVOD` are defined as bash arrays so that `"${SINGULARITY[@]}"` and `"${HOROVOD[@]}"` expand into separate command-line words when passed to `srun`; plain strings would be passed as a single (invalid) command name.
:::
::::

### 4. Submit the Slurm job

```
[user@cbi-lgn01]$ sbatch pytorch_synthetic_benchmark.sh
Submitted batch job 10925
```
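While the job is queued or running, you can check its state before looking at the output files. A minimal sketch using the standard Slurm commands `squeue` and `sacct`; the job ID 10925 comes from the submission above, and the exact columns shown vary by site configuration:

```
# Show the job while it is pending or running
[user@cbi-lgn01]$ squeue -j 10925

# After the job ends, query its accounting record (state, elapsed time, exit code)
[user@cbi-lgn01]$ sacct -j 10925 --format=JobID,JobName,Partition,State,Elapsed,ExitCode
```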
### 5. View the job output

```
[user@cbi-lgn01 ]$ ll
-rw-r--r-- 1 user TRI1123529     0 Jan 23 11:21 10925_mine.err
-rw-r--r-- 1 user TRI1123529 36082 Jan 23 11:21 10925_mine.out
[user@cbi-lgn01 ]$ cat 10925_mine.out

=============
== PyTorch ==
=============

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
...
...
COMPLETE
Model: resnet50
Batch size: 256
Number of GPUs: 4
Running warmup...
Running benchmark...
Iter #0: 1701.0 img/sec per GPU
Iter #1: 1699.9 img/sec per GPU
Iter #2: 1701.7 img/sec per GPU
Iter #3: 1701.7 img/sec per GPU
Iter #4: 1701.3 img/sec per GPU
Iter #5: 1703.5 img/sec per GPU
Iter #6: 1703.4 img/sec per GPU
Iter #7: 1703.1 img/sec per GPU
Iter #8: 1703.3 img/sec per GPU
Iter #9: 1702.2 img/sec per GPU
Img/sec per GPU: 1702.1 +-2.3
Total img/sec on 4 GPU(s): 6808.4 +-9.0
hgpn06:4110637:4111521 [2] NCCL INFO comm 0x153fa49c1370 rank 2 nranks 4 cudaDev 2 busId b9000 - Destroy COMPLETE
hgpn06:4110639:4111514 [1] NCCL INFO comm 0x1521909ad4b0 rank 1 nranks 4 cudaDev 1 busId 87000 - Destroy COMPLETE
hgpn06:4110634:4111516 [0] NCCL INFO comm 0x14861c9dd490 rank 0 nranks 4 cudaDev 0 busId 86000 - Destroy COMPLETE
hgpn06:4110636:4111519 [3] NCCL INFO comm 0x1551cc9c0930 rank 3 nranks 4 cudaDev 3 busId bc000 - Destroy COMPLETE
```

## Example 2: Building a .sif image on your own virtual machine

An example that uses the CUDA toolkit to run Python with the h2o4gpu package.

### 1. Build the image (.sif)

:::info
:warning: For security reasons, `--fakeroot` is disabled on the 晶創 HPC host. Build the .sif image on your own virtual machine first, then upload it to the 晶創 host.

```
[user@cbi-lgn01]$ singularity build --fakeroot h2o4gpuPy.sif h2o4gpuPy.def
FATAL:   could not use fakeroot: no mapping entry found in /etc/subuid for user
```
:::

Create the definition file on your own virtual machine:

```
[user@localhost ]$ vi h2o4gpuPy.def

BootStrap: docker
From: nvidia/cuda:12.1.0-devel-ubuntu18.04

# Note: This container will have only the Python API enabled

%environment
# -----------------------------------------------------------------------------------
    export PYTHON_VERSION=3.6
    export CUDA_HOME=/usr/local/cuda
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64/:$CUDA_HOME/lib/:$CUDA_HOME/extras/CUPTI/lib64
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    export LC_ALL=C

%post
# -----------------------------------------------------------------------------------
    # this will install all necessary packages and prepare the container
    export PYTHON_VERSION=3.6
    export CUDA_HOME=/usr/local/cuda
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64/:$CUDA_HOME/lib/:$CUDA_HOME/extras/CUPTI/lib64

    apt-get -y update
    apt-get install -y --no-install-recommends build-essential
    apt-get install -y --no-install-recommends git
    apt-get install -y --no-install-recommends vim
    apt-get install -y --no-install-recommends wget
    apt-get install -y --no-install-recommends ca-certificates
    apt-get install -y --no-install-recommends libjpeg-dev
    apt-get install -y --no-install-recommends libpng-dev
    apt-get install -y --no-install-recommends libpython3.6-dev
    apt-get install -y --no-install-recommends libopenblas-dev pbzip2
    apt-get install -y --no-install-recommends libcurl4-openssl-dev libssl-dev libxml2-dev
    apt-get install -y --no-install-recommends python3-pip

    ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
    ln -s /usr/bin/pip3 /usr/bin/pip

    pip3 install setuptools
    pip3 install --upgrade pip

    wget https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.4-cuda10/rel-0.4.0/h2o4gpu-0.4.0-cp36-cp36m-linux_x86_64.whl
    pip install h2o4gpu-0.4.0-cp36-cp36m-linux_x86_64.whl
```

Build the image:

```
[user@localhost ]$ singularity build --fakeroot h2o4gpuPy.sif h2o4gpuPy.def
INFO:    Starting build...
...
INFO:    Adding environment to container
INFO:    Creating SIF file...
INFO:    Build complete: h2o4gpuPy.sif
[user@localhost ]$ ls
h2o4gpuPy.def  h2o4gpuPy.sif
```
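Before uploading a multi-gigabyte image, it can save a round trip to smoke-test it on the build machine. A minimal sketch, assuming the VM you built on still has Singularity available; importing the package does not require a GPU, so `--nv` is omitted here:

```
# Sanity check: confirm the container starts and the h2o4gpu Python API imports
[user@localhost ]$ singularity exec h2o4gpuPy.sif python3 -c "import h2o4gpu; print('h2o4gpu import OK')"
h2o4gpu import OK
```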
Upload the image to the 晶創 host:

```
[user@localhost ]$ sftp user@140.110.148.5
(user@140.110.148.5) Please select the 2FA login method.
1. Mobile APP OTP
2. Mobile APP PUSH
3. Email OTP
Login method: 2                    # two-factor authentication
(user@140.110.148.5) Password:     # enter your password
Connected to 140.110.148.5.
sftp> put h2o4gpuPy.sif            # upload the file (h2o4gpuPy.sif)
Uploading h2o4gpuPy.sif to /home/user/h2o4gpuPy.sif
h2o4gpuPy.sif                                 100% 3503MB 243.4MB/s   00:14
sftp> exit
[user@localhost ]$
```

### 2. Sample program for the h2o4gpu package

```
[user@cbi-lgn01 ]$ vi h2o4gpu_sample.py

import h2o4gpu
import numpy as np

# Sample data (1000 samples with 2 features)
X = np.random.rand(1000, 2)

# Create and fit the KMeans model
model = h2o4gpu.KMeans(n_clusters=10, random_state=1234).fit(X)

# Get the cluster centers
cluster_centers = model.cluster_centers_
print("Cluster Centers:", cluster_centers)
```

### 3. Write the Slurm job script (h2o4gpuPy.sh as the example)

```
[user@cbi-lgn01]$ vi h2o4gpuPy.sh

#!/bin/bash
#SBATCH --job-name=singularity   ## job name
#SBATCH --nodes=1                ## request 1 node
#SBATCH --ntasks-per-node=1      ## run 1 srun task per node
#SBATCH --cpus-per-task=1        ## request 1 CPU per srun task
#SBATCH --gres=gpu:1
#SBATCH --account=GOV1XXXXX      ## PROJECT_ID: fill in your project ID
#SBATCH --partition=dev          ## choose the partition you need
#SBATCH -o %j_h2o4gpuPy.out      ## Path to the standard output file
#SBATCH -e %j_h2o4gpuPy.err      ## Path to the standard error output file

module purge
ml load singularity
ml load cuda

singularity exec --nv h2o4gpuPy.sif python3 h2o4gpu_sample.py
```

### 4. Submit the Slurm job

```
[user@cbi-lgn01]$ sbatch h2o4gpuPy.sh
Submitted batch job 15212
```

### 5. View the job output

```
[user@cbi-lgn01]$ cat 15212_h2o4gpuPy.out
Cluster Centers: [[0.60558998 0.49576345]
 [0.83222364 0.80290834]
 [0.1857559  0.84386754]
 [0.14520802 0.17203146]
 [0.80368887 0.13313996]
 [0.34662795 0.44191851]
 [0.5055868  0.82354161]
 [0.45088371 0.15640011]
 [0.88583178 0.45586675]
 [0.10908311 0.54831662]]

[user@cbi-lgn01]$ cat 15212_h2o4gpuPy.err
----------------- loading CUDA 12.2  -----------------
```
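If the batch job misbehaves, it can help to run the same command interactively on a compute node. A minimal sketch using standard Slurm and Singularity commands; the compute-node prompt `[user@cgpn01]` is illustrative only, and the account and partition values follow the batch script above:

```
# Request an interactive shell on one GPU in the dev partition
[user@cbi-lgn01]$ srun --account=GOV1XXXXX --partition=dev --gres=gpu:1 --pty bash

# On the compute node, load the modules and run the sample directly
[user@cgpn01]$ ml load singularity
[user@cgpn01]$ ml load cuda
[user@cgpn01]$ singularity exec --nv h2o4gpuPy.sif python3 h2o4gpu_sample.py
```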