# Building your own environment with Singularity
[TOC]
You can use Singularity to package the libraries and programs you need, creating an environment for running compute jobs on the 晶創 host (command-line interface) service.
## Example 1: Using a pre-built Singularity image on the 晶創 host
### 1. Path to the .sif image
A pre-built .sif image is already available on the system at the following path; no installation is required.
```
[user@cbi-lgn01 ]$ /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
```
[NVIDIA official documentation](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-22-09.html): this example uses the official NVIDIA release (22.09).
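Before writing a job script, you can optionally check what the image contains. A minimal sanity check (assuming the `singularity` module is available on the login node; `inspect` and `exec` are standard Singularity subcommands):
```
[user@cbi-lgn01 ]$ ml singularity
# Show the image's labels and build metadata
[user@cbi-lgn01 ]$ singularity inspect /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
# Print the PyTorch version baked into the image
[user@cbi-lgn01 ]$ singularity exec /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif python -c "import torch; print(torch.__version__)"
```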
### 2. Horovod benchmark script
This section uses a benchmark script written with Horovod as the editing and testing example.
Use wget to download the file from GitHub into your directory:
```
[user@cbi-lgn01 ]$ wget https://raw.githubusercontent.com/horovod/horovod/v0.20.3/examples/pytorch/pytorch_synthetic_benchmark.py
[user@cbi-lgn01 ]$ ls
pytorch_synthetic_benchmark.py
```
[Horovod GitHub](https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py): the official benchmark .py source.
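To see which parameters the benchmark accepts before editing the job script, you can print its argparse help from inside the container. A sketch (run on the login node; Singularity binds the current directory by default, and `--help` exits before any GPU work starts):
```
[user@cbi-lgn01 ]$ ml singularity
[user@cbi-lgn01 ]$ singularity exec /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif \
      python pytorch_synthetic_benchmark.py --help
```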
### 3. Write the Slurm job script (using pytorch_synthetic_benchmark.sh as an example)
```
[user@cbi-lgn01 ]$ vi pytorch_synthetic_benchmark.sh
#!/bin/bash
#SBATCH --job-name=singularity ## job name
#SBATCH --nodes=2              ## request 2 nodes
#SBATCH --ntasks-per-node=4    ## run 4 srun tasks per node
#SBATCH --cpus-per-task=4      ## request 4 CPUs per srun task
#SBATCH --gres=gpu:4
#SBATCH --account=GOV113XXX    ## PROJECT_ID: fill in your project ID
#SBATCH --partition=normal
#SBATCH -o %j_mine.out         ## Path to the standard output file
#SBATCH -e %j_mine.err         ## Path to the standard error output file
module purge
ml singularity
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_GPU_DIRECT_RDMA=1
export HOROVOD_CPU_OPERATIONS=CCL
export NCCL_DEBUG=WARN
SIF=/work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
SINGULARITY="singularity run --nv --no-home -B .:/data $SIF"
HOROVOD="python /data/pytorch_synthetic_benchmark.py --batch-size 256"
srun --mpi=pmi2 $SINGULARITY $HOROVOD
```
::::spoiler Script walkthrough
:::info
1. ml singularity
`module load singularity` is the first step: it tells the system we want to use Singularity and prepares the required environment.
2. export environment variables
These variables mainly tune deep-learning training performance: pinning a specific network device, enabling GPUDirect RDMA, choosing an efficient communication library, and setting an appropriate debug level.
3. Load the SIF image from its path
/work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
4. /data is used for sharing data
Files on the host, such as training datasets and model weights, can be accessed directly inside the container.
In other words, changes you make to files under /data inside the container are reflected on the host, which is convenient for experimenting and tuning (see the sketch after this walkthrough).
`The current directory (.) is bound to /data inside the container.`
SINGULARITY="singularity run --nv --no-home -B .:/data $SIF"
`The Python program to run lives under /data inside the container.`
HOROVOD="python /data/pytorch_synthetic_benchmark.py --batch-size 256"
:::
::::
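To see the `-B .:/data` bind mount in action before submitting a batch job, you can list the bound directory interactively with the same image (run from the directory that holds the benchmark script):
```
[user@cbi-lgn01 ]$ singularity exec -B .:/data /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif ls /data
pytorch_synthetic_benchmark.py
```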
### 4. Submit the Slurm job
```
[user@cbi-lgn01]$ sbatch pytorch_synthetic_benchmark.sh
Submitted batch job 10925
```
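While the job is queued or running, you can track it with standard Slurm commands (`10925` is the job ID returned by `sbatch`):
```
# Show this job only
[user@cbi-lgn01 ]$ squeue -j 10925
# Or show all of your jobs
[user@cbi-lgn01 ]$ squeue -u $USER
```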
### 5. Check the output
```
[user@cbi-lgn01 ]$ ll
-rw-r--r-- 1 user TRI1123529 0 Jan 23 11:21 10925_mine.err
-rw-r--r-- 1 user TRI1123529 36082 Jan 23 11:21 10925_mine.out
[user@cbi-lgn01 ]$ cat 10925_mine.out
=============
== PyTorch ==
=============
NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
...
...
COMPLETE
Model: resnet50
Batch size: 256
Number of GPUs: 4
Running warmup...
Running benchmark...
Iter #0: 1701.0 img/sec per GPU
Iter #1: 1699.9 img/sec per GPU
Iter #2: 1701.7 img/sec per GPU
Iter #3: 1701.7 img/sec per GPU
Iter #4: 1701.3 img/sec per GPU
Iter #5: 1703.5 img/sec per GPU
Iter #6: 1703.4 img/sec per GPU
Iter #7: 1703.1 img/sec per GPU
Iter #8: 1703.3 img/sec per GPU
Iter #9: 1702.2 img/sec per GPU
Img/sec per GPU: 1702.1 +-2.3
Total img/sec on 4 GPU(s): 6808.4 +-9.0
hgpn06:4110637:4111521 [2] NCCL INFO comm 0x153fa49c1370 rank 2 nranks 4 cudaDev 2 busId b9000 - Destroy COMPLETE
hgpn06:4110639:4111514 [1] NCCL INFO comm 0x1521909ad4b0 rank 1 nranks 4 cudaDev 1 busId 87000 - Destroy COMPLETE
hgpn06:4110634:4111516 [0] NCCL INFO comm 0x14861c9dd490 rank 0 nranks 4 cudaDev 0 busId 86000 - Destroy COMPLETE
hgpn06:4110636:4111519 [3] NCCL INFO comm 0x1551cc9c0930 rank 3 nranks 4 cudaDev 3 busId bc000 - Destroy COMPLETE
```
## Example 2: Building a .sif image on your own virtual machine
An example that runs Python with the CUDA toolkit and the h2o4gpu package.
### 1. Build the image (.sif)
:::info
:warning: For security reasons, --fakeroot is disabled on the HPC 晶創 host; attempting it there fails as shown below.
We recommend building the .sif image on your own virtual machine and then uploading it to the 晶創 host (see the sketch after this note).
```
[user@cbi-lgn01]$ singularity build --fakeroot h2o4gpuPy.sif h2o4gpuPy.def
FATAL: could not use fakeroot: no mapping entry found in /etc/subuid for user
```
:::
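On your own virtual machine, `--fakeroot` works once your user has an entry in /etc/subuid (exactly what the error above says is missing on the host). If you have root on the VM, a plain root build is an equivalent alternative; a sketch, assuming Singularity is installed on the VM:
```
[user@localhost ]$ sudo singularity build h2o4gpuPy.sif h2o4gpuPy.def
```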
```
# Build the .sif file on your own virtual machine
[user@localhost ]$ vi h2o4gpuPy.def
BootStrap: docker
From: nvidia/cuda:12.1.0-devel-ubuntu18.04
# Note: This container will have only the Python API enabled
%environment
# -----------------------------------------------------------------------------------
export PYTHON_VERSION=3.6
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64/:$CUDA_HOME/lib/:$CUDA_HOME/extras/CUPTI/lib64
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LC_ALL=C
%post
# -----------------------------------------------------------------------------------
# this will install all necessary packages and prepare the container
export PYTHON_VERSION=3.6
export CUDA_HOME=/usr/local/cuda
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64/:$CUDA_HOME/lib/:$CUDA_HOME/extras/CUPTI/lib64
apt-get -y update
apt-get install -y --no-install-recommends build-essential
apt-get install -y --no-install-recommends git
apt-get install -y --no-install-recommends vim
apt-get install -y --no-install-recommends wget
apt-get install -y --no-install-recommends ca-certificates
apt-get install -y --no-install-recommends libjpeg-dev
apt-get install -y --no-install-recommends libpng-dev
apt-get install -y --no-install-recommends libpython3.6-dev
apt-get install -y --no-install-recommends libopenblas-dev pbzip2
apt-get install -y --no-install-recommends libcurl4-openssl-dev libssl-dev libxml2-dev
apt-get install -y --no-install-recommends python3-pip
ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
ln -s /usr/bin/pip3 /usr/bin/pip
pip3 install setuptools
pip3 install --upgrade pip
wget https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.4-cuda10/rel-0.4.0/h2o4gpu-0.4.0-cp36-cp36m-linux_x86_64.whl
pip install h2o4gpu-0.4.0-cp36-cp36m-linux_x86_64.whl
[user@localhost ]$ singularity build --fakeroot h2o4gpuPy.sif h2o4gpuPy.def
INFO: Starting build...
...
INFO: Adding environment to container
INFO: Creating SIF file...
INFO: Build complete: h2o4gpuPy.sif
[user@localhost ]$ ls
h2o4gpuPy.def h2o4gpuPy.sif
# Upload to the 晶創 host
[user@localhost ]$ sftp user@140.110.148.5
(user@140.110.148.5) Please select the 2FA login method.
1. Mobile APP OTP
2. Mobile APP PUSH
3. Email OTP
Login method: 2 # two-factor authentication
(user@140.110.148.5) Password: # enter your password
Connected to 140.110.148.5.
sftp> put h2o4gpuPy.sif # upload the file (h2o4gpuPy.sif)
Uploading h2o4gpuPy.sif to /home/user/h2o4gpuPy.sif
h2o4gpuPy.sif 100% 3503MB 243.4MB/s 00:14
sftp> exit
[user@localhost ]$
```
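To confirm the upload arrived intact, you can compare file sizes or checksums on both machines (md5sum is shown as one option; any checksum tool works):
```
# Run on the VM and again on the 晶創 host; the checksums should match
[user@localhost ]$ md5sum h2o4gpuPy.sif
[user@cbi-lgn01 ]$ md5sum h2o4gpuPy.sif
[user@cbi-lgn01 ]$ ls -lh h2o4gpuPy.sif
```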
### 2. Sample program for the h2o4gpu package
```
[user@cbi-lgn01 ]$ vi h2o4gpu_sample.py
import h2o4gpu
import numpy as np
# Sample data: 1000 points with 2 features
X = np.random.rand(1000, 2)
# Create and fit the KMeans model
model = h2o4gpu.KMeans(n_clusters=10, random_state=1234).fit(X)
# Get cluster centers
cluster_centers = model.cluster_centers_
print("Cluster Centers:", cluster_centers)
```
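Before submitting, a quick import check confirms that the package inside the uploaded image is usable (a sketch; run on the 晶創 host, and if the import insists on a GPU, repeat the check inside a GPU job instead):
```
[user@cbi-lgn01 ]$ ml singularity
[user@cbi-lgn01 ]$ singularity exec h2o4gpuPy.sif python3 -c "import h2o4gpu; print('h2o4gpu import OK')"
```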
### 3. Write the Slurm job script (using h2o4gpuPy.sh as an example)
```
[user@cbi-lgn01]$ vi h2o4gpuPy.sh
#!/bin/bash
#SBATCH --job-name=singularity ## job name
#SBATCH --nodes=1              ## request 1 node
#SBATCH --ntasks-per-node=1    ## run 1 srun task per node
#SBATCH --cpus-per-task=1      ## request 1 CPU per srun task
#SBATCH --gres=gpu:1
#SBATCH --account=GOV1XXXXX    ## PROJECT_ID: fill in your project ID
#SBATCH --partition=dev        ## select the desired partition
#SBATCH -o %j_h2o4gpuPy.out    ## Path to the standard output file
#SBATCH -e %j_h2o4gpuPy.err    ## Path to the standard error output file
module purge
ml load singularity
ml load cuda
singularity exec --nv h2o4gpuPy.sif python3 h2o4gpu_sample.py
```
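If you are unsure which partition to pass to `--partition`, `sinfo` lists the partitions available on the system (standard Slurm; partition names and limits vary by site):
```
[user@cbi-lgn01 ]$ sinfo -s
```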
### 4. Submit the Slurm job
```
[user@cbi-lgn01]$ sbatch h2o4gpuPy.sh
Submitted batch job 15212
```
### 5. Check the output
```
[user@cbi-lgn01]$ cat 15212_h2o4gpuPy.out
Cluster Centers: [[0.60558998 0.49576345]
[0.83222364 0.80290834]
[0.1857559 0.84386754]
[0.14520802 0.17203146]
[0.80368887 0.13313996]
[0.34662795 0.44191851]
[0.5055868 0.82354161]
[0.45088371 0.15640011]
[0.88583178 0.45586675]
[0.10908311 0.54831662]]
[user@cbi-lgn01]$ cat 15212_h2o4gpuPy.err
-----------------
loading CUDA 12.2
-----------------
```