# Singularity: Building Your Own Environment

[TOC]

You can use Singularity to package the libraries and programs you need and build an environment for running compute jobs on the 晶創主機 (command-line interface) service.

## Example 1: Using a Singularity image that already exists on 晶創主機

### 1. Path of the .sif image

A prebuilt .sif image is available on the system at the path below, so nothing needs to be installed.

```
[user@cbi-lgn01 ]$ ls /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
```

[NVIDIA official site](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-22-09.html): the image used in this example is based on NVIDIA's official 22.09 release.

### 2. The Horovod benchmark script

This section uses the benchmark script written by Horovod as the example to edit and test. Use wget to download the file from GitHub into your directory:

```
[user@cbi-lgn01 ]$ wget https://raw.githubusercontent.com/horovod/horovod/v0.20.3/examples/pytorch/pytorch_synthetic_benchmark.py
[user@cbi-lgn01 ]$ ls
pytorch_synthetic_benchmark.py
```

[Horovod GitHub](https://github.com/horovod/horovod/blob/master/examples/pytorch/pytorch_synthetic_benchmark.py): the performance benchmark .py script provided by the Horovod project.

### 3. Write the Slurm job (using pytorch_synthetic_benchmark.sh as the example)

```
[user@cbi-lgn01 ]$ vi pytorch_synthetic_benchmark.sh
#!/bin/bash
#SBATCH --job-name=singularity    ## job name
#SBATCH --nodes=2                 ## request 2 nodes
#SBATCH --ntasks-per-node=4       ## run 4 srun tasks on each node
#SBATCH --cpus-per-task=4         ## request 4 CPUs per srun task
#SBATCH --gres=gpu:4
#SBATCH --account=GOV113XXX       ## PROJECT_ID: fill in your project ID
#SBATCH --partition=normal
#SBATCH -o %j_mine.out            # Path to the standard output file
#SBATCH -e %j_mine.err            # Path to the standard error output file

module purge
ml singularity

export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_GPU_DIRECT_RDMA=1
export HOROVOD_CPU_OPERATIONS=CCL
export NCCL_DEBUG=WARN

SIF=/work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
# Bash arrays, so that "${NAME[@]}" expands to separate arguments for srun
SINGULARITY=(singularity run --nv --no-home -B .:/data "$SIF")
HOROVOD=(python /data/pytorch_synthetic_benchmark.py --batch-size 256)

srun --mpi=pmi2 "${SINGULARITY[@]}" "${HOROVOD[@]}"
```

::::spoiler Script walkthrough
:::info
1. `ml singularity`
   Shorthand for `module load singularity`. This is the first step in using Singularity: it tells the system that we are going to use Singularity and prepares the required environment.
2. `export` environment variables
   The environment variables are mainly used to tune the performance of deep learning training: they select a specific network device, enable RDMA, choose an efficient communication library, and set an appropriate debug level.
3. Load the SIF image file from its path
   /work/hpc_sys/sifs/pytorch_22.09-py3_horovod.sif
4. `/data` is used to share data
   Files on the host, such as training datasets and model weights, can be accessed directly from inside the container. In other words, you can modify files under /data inside the container and the changes are reflected on the host, which is convenient for experiments and adjustments.

   `The current directory (.) is bound to /data inside the container.`
   SINGULARITY=(singularity run --nv --no-home -B .:/data "$SIF")

   `The Python program to run is located under /data inside the container.`
   HOROVOD=(python /data/pytorch_synthetic_benchmark.py --batch-size 256)
:::
::::

### 4. Submit the Slurm job

```
[user@cbi-lgn01]$ sbatch pytorch_synthetic_benchmark.sh
Submitted batch job 10925
```

### 5. View the output

```
[user@cbi-lgn01 ]$ ll
-rw-r--r-- 1 user TRI1123529     0 Jan 23 11:21 10925_mine.err
-rw-r--r-- 1 user TRI1123529 36082 Jan 23 11:21 10925_mine.out
[user@cbi-lgn01 ]$ cat 10925_mine.out

=============
== PyTorch ==
=============

NVIDIA Release 22.09 (build 44877844)
PyTorch Version 1.13.0a0+d0d6b1f

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
...
... COMPLETE
Model: resnet50
Batch size: 256
Number of GPUs: 4
Running warmup...
Running benchmark...
Iter #0: 1701.0 img/sec per GPU
Iter #1: 1699.9 img/sec per GPU
Iter #2: 1701.7 img/sec per GPU
Iter #3: 1701.7 img/sec per GPU
Iter #4: 1701.3 img/sec per GPU
Iter #5: 1703.5 img/sec per GPU
Iter #6: 1703.4 img/sec per GPU
Iter #7: 1703.1 img/sec per GPU
Iter #8: 1703.3 img/sec per GPU
Iter #9: 1702.2 img/sec per GPU
Img/sec per GPU: 1702.1 +-2.3
Total img/sec on 4 GPU(s): 6808.4 +-9.0
hgpn06:4110637:4111521 [2] NCCL INFO comm 0x153fa49c1370 rank 2 nranks 4 cudaDev 2 busId b9000 - Destroy COMPLETE
hgpn06:4110639:4111514 [1] NCCL INFO comm 0x1521909ad4b0 rank 1 nranks 4 cudaDev 1 busId 87000 - Destroy COMPLETE
hgpn06:4110634:4111516 [0] NCCL INFO comm 0x14861c9dd490 rank 0 nranks 4 cudaDev 0 busId 86000 - Destroy COMPLETE
hgpn06:4110636:4111519 [3] NCCL INFO comm 0x1551cc9c0930 rank 3 nranks 4 cudaDev 3 busId bc000 - Destroy COMPLETE
```

## Example 2: Building a .sif image on your own virtual machine

An example of running Python with h2o4gpu using the CUDA toolkit.

### 1. Build the image (.sif)

:::info
:warning: For security reasons, --fakeroot is not allowed on the HPC 晶創主機. We recommend building the .sif image on your own virtual machine and then uploading it to 晶創主機.
```
[user@cbi-lgn01]$ singularity build --fakeroot h2o4gpuPy.sif h2o4gpuPy.def
FATAL: could not use fakeroot: no mapping entry found in /etc/subuid for user
```
:::

```
# Build the .sif file on your own virtual machine
[user@localhost ]$ vi h2o4gpuPy.def
BootStrap: docker
From: nvidia/cuda:12.1.0-devel-ubuntu18.04

# Note: This container will have only the Python API enabled

%environment
# -----------------------------------------------------------------------------------

    export PYTHON_VERSION=3.6
    export CUDA_HOME=/usr/local/cuda
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64/:$CUDA_HOME/lib/:$CUDA_HOME/extras/CUPTI/lib64
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
    export LC_ALL=C

%post
# -----------------------------------------------------------------------------------
# this will install all necessary packages and prepare the container

    export PYTHON_VERSION=3.6
    export CUDA_HOME=/usr/local/cuda
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64/:$CUDA_HOME/lib/:$CUDA_HOME/extras/CUPTI/lib64

    apt-get -y update
    apt-get install -y --no-install-recommends build-essential
    apt-get install -y --no-install-recommends git
    apt-get install -y --no-install-recommends vim
    apt-get install -y --no-install-recommends wget
    apt-get install -y --no-install-recommends ca-certificates
    apt-get install -y --no-install-recommends libjpeg-dev
    apt-get install -y --no-install-recommends libpng-dev
    apt-get install -y --no-install-recommends libpython3.6-dev
    apt-get install -y --no-install-recommends libopenblas-dev pbzip2
    apt-get install -y --no-install-recommends libcurl4-openssl-dev libssl-dev libxml2-dev
    apt-get install -y --no-install-recommends python3-pip

    ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
    ln -s /usr/bin/pip3 /usr/bin/pip

    pip3 install setuptools
    pip3 install --upgrade pip

    wget https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.4-cuda10/rel-0.4.0/h2o4gpu-0.4.0-cp36-cp36m-linux_x86_64.whl
    pip install h2o4gpu-0.4.0-cp36-cp36m-linux_x86_64.whl

[user@localhost ]$ singularity build --fakeroot h2o4gpuPy.sif h2o4gpuPy.def
INFO:    Starting build...
...
INFO:    Adding environment to container
INFO:    Creating SIF file...
INFO:    Build complete: h2o4gpuPy.sif
[user@localhost ]$ ls
h2o4gpuPy.def  h2o4gpuPy.sif

# Upload to 晶創主機
[user@localhost ]$ sftp user@140.110.148.5
(user@140.110.148.5) Please select the 2FA login method.
1. Mobile APP OTP
2. Mobile APP PUSH
3. Email OTP
Login method: 2                  # two-factor authentication
(user@140.110.148.5) Password:   # enter your password
Connected to 140.110.148.5.
sftp> put h2o4gpuPy.sif          # upload the file (h2o4gpuPy.sif)
Uploading h2o4gpuPy.sif to /home/user/h2o4gpuPy.sif
h2o4gpuPy.sif                    100% 3503MB 243.4MB/s   00:14
sftp> exit
[user@localhost ]$
```

### 2. Sample program for the h2o4gpu package

```
[user@cbi-lgn01 ]$ vi h2o4gpu_sample.py
import h2o4gpu
import numpy as np

# Larger sample data (1000 samples with 2 features)
X = np.random.rand(1000, 2)

# Create and fit the KMeans model
model = h2o4gpu.KMeans(n_clusters=10, random_state=1234).fit(X)

# Get cluster centers
cluster_centers = model.cluster_centers_
print("Cluster Centers:", cluster_centers)
```

### 3. Write the Slurm job (using h2o4gpuPy.sh as the example)

```
[user@cbi-lgn01]$ vi h2o4gpuPy.sh
#!/bin/bash
#SBATCH --job-name=singularity    ## job name
#SBATCH --nodes=1                 ## request 1 node
#SBATCH --ntasks-per-node=1       ## run 1 srun task on each node
#SBATCH --cpus-per-task=1         ## request 1 CPU per srun task
#SBATCH --gres=gpu:1
#SBATCH --account=GOV1XXXXX       ## PROJECT_ID: fill in your project ID
#SBATCH --partition=dev           ## choose the partition you need
#SBATCH -o %j_h2o4gpuPy.out       ## Path to the standard output file
#SBATCH -e %j_h2o4gpuPy.err       ## Path to the standard error output file

module purge
ml load singularity
ml load cuda

singularity exec --nv h2o4gpuPy.sif python3 h2o4gpu_sample.py
```

### 4. Submit the Slurm job

```
[user@cbi-lgn01]$ sbatch h2o4gpuPy.sh
Submitted batch job 15212
```

### 5. View the output

```
[user@cbi-lgn01]$ cat 15212_h2o4gpuPy.out
Cluster Centers: [[0.60558998 0.49576345]
 [0.83222364 0.80290834]
 [0.1857559  0.84386754]
 [0.14520802 0.17203146]
 [0.80368887 0.13313996]
 [0.34662795 0.44191851]
 [0.5055868  0.82354161]
 [0.45088371 0.15640011]
 [0.88583178 0.45586675]
 [0.10908311 0.54831662]]
[user@cbi-lgn01]$ cat 15212_h2o4gpuPy.err
-----------------
loading CUDA 12.2
-----------------
```
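
As an optional sanity check on the output of Example 2, the sketch below parses the printed cluster centers and confirms that there are 10 of them and that each one lies inside the unit square. This is expected because the sample data comes from `np.random.rand`, which draws values in [0, 1), and k-means centers are averages of the points assigned to them. The names `parse_cluster_centers`, `centers_look_valid`, and `SAMPLE_LOG` are hypothetical helpers made up for this illustration; they are not part of h2o4gpu, Singularity, or Slurm, and the parsing assumes the plain decimal format shown in the output above.

```python
import re

# Text copied from the 15212_h2o4gpuPy.out listing shown in step 5 of Example 2.
SAMPLE_LOG = """Cluster Centers: [[0.60558998 0.49576345]
 [0.83222364 0.80290834]
 [0.1857559  0.84386754]
 [0.14520802 0.17203146]
 [0.80368887 0.13313996]
 [0.34662795 0.44191851]
 [0.5055868  0.82354161]
 [0.45088371 0.15640011]
 [0.88583178 0.45586675]
 [0.10908311 0.54831662]]
"""


def parse_cluster_centers(log_text, n_features=2):
    """Return the cluster centers found in log_text as a list of tuples.

    Hypothetical helper: assumes the values are printed as plain decimals
    after the "Cluster Centers:" label, as in the sample output.
    """
    _, found, after = log_text.partition("Cluster Centers:")
    if not found:
        return []
    values = [float(v) for v in re.findall(r"\d+\.\d+", after)]
    if len(values) % n_features != 0:
        raise ValueError("number of values is not a multiple of n_features")
    return [
        tuple(values[i:i + n_features])
        for i in range(0, len(values), n_features)
    ]


def centers_look_valid(centers, n_clusters=10):
    """Check the center count and that every coordinate is within [0, 1]."""
    return len(centers) == n_clusters and all(
        0.0 <= value <= 1.0 for center in centers for value in center
    )


if __name__ == "__main__":
    centers = parse_cluster_centers(SAMPLE_LOG)
    print(len(centers), "centers parsed; valid:", centers_look_valid(centers))
```

To check a real job, read the `.out` file into a string and pass it to `parse_cluster_centers` instead of `SAMPLE_LOG`. A result that fails the check may indicate that the job did not finish or that the output format differs from the one assumed here.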
