---
tags: tutorial, 生醫, T3, AlphaFold, GPU
---

# How to use AlphaFold v2 on Taiwania 3

2023/02/23 Edited by nchcgaoqz

## Introduction of AlphaFold v2

AlphaFold v2 is a tool that predicts the structure of a protein from its amino acid sequence. When provided with multiple amino acid sequences, it can predict the structure of a protein complex.

The process of predicting protein structures with AlphaFold v2 can be broadly divided into three time-consuming stages:

1. Multiple sequence alignment (MSA)
2. Artificial intelligence inference (AI)
3. Protein structure refinement (Relaxation, Relax)

![](https://i.imgur.com/kfsxFio.png)

The MSA stage runs entirely on the CPU, while the AI and Relaxation stages can be accelerated with a GPU. AlphaFold v2 employs five slightly different AI models to infer protein structures; each produces a different result from the same MSA output. The final structures are ranked by an internally computed quality estimate, where rank 1 is the best outcome.

## Format Requirements for Input Files

AlphaFold v2 accepts only the single-letter codes of the 20 standard amino acids. Amino acid sequence information must be saved in the fasta format with the file extension `.fasta`. For a protein complex, all of the amino acid sequences must be written in fasta format in the same fasta file.
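Because only the 20 standard single-letter codes are accepted, it can be worth checking a sequence before submitting it. The following is a minimal sketch (not part of the Taiwania 3 tooling; the function name is only an example):

```python
# The 20 standard amino acid one-letter codes accepted by AlphaFold v2.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def is_valid_sequence(seq: str) -> bool:
    """Return True if every residue uses one of the 20 standard one-letter codes."""
    seq = seq.replace("\n", "").upper()
    return len(seq) > 0 and set(seq) <= STANDARD_AA

print(is_valid_sequence("MADQLTEEQIAEFKEAFSLFD"))  # True
print(is_valid_sequence("MADQXLT"))                 # False: 'X' is not a standard code
```

Non-standard letters such as `X` or `B` would have to be removed or replaced before running the prediction.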
###### Fasta format example of a single peptide chain:
```bash=1
>AAD45181.1 calmodulin [Homo sapiens]
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVQMMTAK
```

###### Fasta format example of two peptide chains:
```bash=1
>1A5F_1|Chain A[auth L]|MONOCLONAL ANTI-E-SELECTIN 7A9 ANTIBODY (LIGHT CHAIN)|Mus musculus (10090)
DIVMTQSPSSLTVTTGEKVTMTCKSSQSLLNSGAQKNYLTWYQQ
KPGQSPKLLIYWASTRESGVPDRFTGSGSGTDFTLSISGVQAEDLAVYYCQNNYNYPLTFGAGTK
LELKRADAAPTVSIFPPSSEQLTSG
GASVVCFLNNFYPKDINVKWKIDGSERQNGVLNSWTDQDSKD
STYSMSSTLTLTKDEYERHNSYTCEATHKTSTSPIVKSFNRNEC
>1A5F_2|Chain B[auth H]|MONOCLONAL ANTI-E-SELECTIN 7A9 ANTIBODY (HEAVY CHAIN)|Mus musculus (10090)
EVALQQSGAELVKPGASVKLS
CAASGFTIKDAYM
HWVKQKPEQGLEWIGRIDSGSSNT
NYDPTFKGKATITADDSSNTAYLQMSSLTSEDTAVYYCARVGLSYWYAMDYWGQGTSVTVSS
AKTTPPSVYPLAPGSAAQTNSMVTLGCLV
KGYFPEPVTVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVSVPTSTETVTCNVAHAPSSTKVDKKIVPR
```

###### Fasta format example of three peptide chains:
```bash=1
>2II7_1|Chains A, B, C, D, E, F, G, H|Anabaena sensory rhodopsin transducer protein|Anabaena sp. (1167)
MSLSIGRTCWAIAEGYIPPYGNGPEPQFISHETVCILNAG
DEDAHVEITIYYSDKEPVGPYRLTVPARRTKHVRFNDLND
PAPIPHDTDFASVIQSNVPIVVQHTRLDSRQAENALLSTIAYANTHHHHHH
>2II7_1|Chains A, B, C, D, E, F, G, H|Anabaena sensory rhodopsin transducer protein|Anabaena sp. (1167)
MSLSIGRTCWAIAEGYIPPYGNGPEPQFISHETVCILNAG
DEDAHVEITIYYSDKEPVGPYRLTVPARRTKHVRFNDLND
PAPIPHDTDFASVIQSNVPIVVQHTRLDSRQAENALLSTIAYANTHHHHHH
>2II7_1|Chains A, B, C, D, E, F, G, H|Anabaena sensory rhodopsin transducer protein|Anabaena sp. (1167)
MSLSIGRTCWAIAEGYIPPYGNGPEPQFISHETVCILNAG
DEDAHVEITIYYSDKEPVGPYRLTVPARRTKHVRFNDLND
PAPIPHDTDFASVIQSNVPIVVQHTRLDSRQAENALLSTIAYANTHHHHHH
```

## Output Files

The final output is stored in PDB format, with the file extension `.pdb`.
In addition to protein structural information, the Temperature factor field (also known as the B factor) of each `ATOM` record (characters 61 to 66, as defined in the [PDB file format page](http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html)) stores the predicted Local Distance Difference Test (pLDDT) score, a per-residue confidence value ranging from 0 to 100, where 100 represents the highest confidence in the position of that residue.

Each of the three main stages of the AlphaFold v2 process has its own output files; some of them can be large and may be deleted if not required.

## How to execute the optimized Taiwania 3 version of AlphaFold v2

### Path of the executable

The executables are stored in the following directory on Taiwania 3:

`/opt/ohpc/Taiwania3/pkg/biology/alphafold/slurmFiles`

The default executable, `twn3af2batchs`, is a link to the latest and fastest version of AlphaFold v2. If the project ID is `ABC123456`, the name of the fasta file is `seqNames.fasta`, and the number of executions is `1`, the command to be executed is as follows:

```bash=1
export PATH="/opt/ohpc/Taiwania3/pkg/biology/alphafold/slurmFiles:$PATH"
twn3af2batchs ABC123456 seqNames 1
```

The number of executions is the number of times the AI stage is repeated: entering `1` generates and ranks 5 structures, entering `2` generates and ranks 10 structures, and so on.

During execution, the user is prompted with a few questions. Once they are answered without error, the job is submitted to SLURM for execution.
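As described in the Output Files section, the pLDDT confidence of each residue is stored in the B-factor field of the resulting pdb files. A minimal sketch for reading those values back out (the helper name and example record are illustrative only):

```python
def read_plddt(pdb_lines):
    """Map residue number -> pLDDT, read from the Temperature factor field
    (characters 61-66, Python slice [60:66]) of each ATOM record."""
    plddt = {}
    for line in pdb_lines:
        if line.startswith("ATOM"):
            resseq = int(line[22:26])           # residue sequence number (chars 23-26)
            plddt[resseq] = float(line[60:66])  # B factor holds the pLDDT score
    return plddt

# A fixed-column ATOM record with pLDDT 92.50 in the B-factor field:
line = "ATOM      1  N   MET A   1      11.104   6.134  -6.504  1.00 92.50"
print(read_plddt([line]))  # {1: 92.5}
```

Residues with low pLDDT (for example below 50) are typically regions the model is unsure about.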
### Examples of execution

If the name of the fasta file is `8.fasta`, as shown in the following figure:

![](https://i.imgur.com/A5Z1Gwq.png)

After the variable `PATH` has been set, an executable command may look as follows (the project ID has been blurred):

![](https://i.imgur.com/ZuuXHq5.png)

The first question in the figure above asks the user for the required resources. In this example, `1gpu` is answered, as shown in the following figure:

![](https://i.imgur.com/jP4o1qv.png)

AF2Complex is another tool for protein complex prediction based on AlphaFold v2. Because it has not been updated to the latest version, please do not select it. In this example, `1` is answered, as shown in the following figure:

![](https://i.imgur.com/bhYav8D.png)

This question asks whether to save the intermediate files. Because the size of the intermediate files grows quadratically with the length of the sequence, longer sequences consume more disk space. If the files are not needed, select option `1`. If the same sequence may be used for multiple predictions (besides specifying a larger number of executions at the beginning, it may also be necessary to rerun the prediction on the same sequence later), select option `2` to preserve the MSA result, which saves the time of repeating the MSA when running the prediction again. If disk space is sufficient and you expect to run the same sequence again and rank all results, or if there are other requirements, select option `3`. In this example, `1` is answered, as shown in the following figure:

![](https://i.imgur.com/yqGjVjf.png)

First, `dos2unix` is used to normalize the line endings of the fasta file, and the file is then checked and converted as necessary. Next, the length of the amino acid sequence that has been read is displayed. Finally, the SLURM Job ID is provided.
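The quadratic growth of the intermediate files can be illustrated with a small back-of-the-envelope calculation (the baseline size below is a made-up example, not a measured value):

```python
def scaled_size(base_len, base_size_gib, new_len):
    """Estimate intermediate file size, assuming it grows
    quadratically with sequence length."""
    return base_size_gib * (new_len / base_len) ** 2

# If a hypothetical 500-residue run produced 2 GiB of intermediate files,
# doubling the length to 1000 residues would need about 4x the space:
print(scaled_size(500, 2.0, 1000))  # 8.0
```

This is why option `1` (do not save intermediate files) is the safe default for long sequences.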
By entering `ls`, the sub-directories and files in this directory are shown in the following figure:

![](https://i.imgur.com/iV4ZOu7.png)

The file `8.err` contains detailed information about the execution process, as shown in the following figure (only part of the content is shown):

![](https://i.imgur.com/bBhYu5O.png)

Information related to this SLURM job, including the total execution time, is stored in the file `8.log`, as shown in the following figure (the project ID has been blurred):

![](https://i.imgur.com/hkAOz9r.png)

The batch script passed to SLURM is stored in `8.sbatch`. The command for executing AlphaFold v2 is saved in a file whose name contains a randomly generated numerical portion, such as `af664136064842j`. The file `8_top.log` records the CPU and main memory usage during execution. The file `8_nv.log` records information about the execution process only when a GPU is used.

The results of all AlphaFold v2 executions are stored in the `result` directory, as shown in the following figure:

![](https://i.imgur.com/kXHLTjE.png)

The pdb files whose names start with `ranked` are the final results.

### Meaning of other executable names

If the user wants to try other versions of AlphaFold v2, several other versions are supported, as shown in the following figure:

![](https://i.imgur.com/mX73drF.png)

Apart from the shared prefix `twn3af2batchs`, the dot-separated numbers in each name represent the version numbers of AlphaFold, TensorFlow, and jaxlib, respectively.

## The effect of sequence length on execution time in each stage

To give users a rough estimate of the required computational resources and execution time, we measured the execution time for different single-sequence lengths, averaging the results of nine experiments. Note that, during testing, other users may have been using other GPUs on the same node, which has eight GPUs in total.
If there is significant memory or disk traffic from those users, the test results may be affected, and this effect cannot be eliminated by averaging nine experiments.

Currently, it is not possible to use multiple GPUs to accelerate the AI stage, and GPU memory is limited to that of a single V100 GPU (32 GiB). If the sequence is too long and requires more than 32 GiB, main memory is used instead, which significantly reduces the execution speed as main memory usage grows. Since resource allocation on Taiwania 3 is limited by the number of CPU cores or GPUs, there is a corresponding upper limit on the amount of main memory that can be allocated. Therefore, when the main memory usage exceeds the upper limit allocated for `1gpu` (90 GiB), it is necessary to choose `2gpu` to obtain a larger upper limit (180 GiB), even though the additional V100 GPUs and their GPU memory cannot be used.

Apart from the AlphaFold program itself, the software versions and optimizations used in the MSA stage were the same for all tested sequences. The AI stage is affected by the versions of TensorFlow and jaxlib, and the Relax stage is affected by the version of jaxlib. These effects include execution speed, memory usage, and whether the program can run at all.

###### AlphaFold 2.3.1, TensorFlow 2.9.3, jaxlib 0.3.25

![](https://i.imgur.com/TAP5Ooa.png)

Blue, red, and orange correspond to the three stages respectively. The text in the upper left corner of each column is the ratio of the execution time to that of the previous sequence length. The execution time of each stage, in minutes, is noted in the upper right corner of each column. Green represents the sum of the AI and Relax stages, both of which use the GPU for acceleration. If the GPU-accelerated stages additionally use main memory, the column is marked with a light red background. For this test, no extra main memory was used.
However, there was an error during the test with a sequence length of 512 amino acids, which may have been caused by the HH-suite software or the database it uses. For more details, please refer to [HH-suite issue #277](https://github.com/soedinglab/hh-suite/issues/277). Since the database used by AlphaFold 2.1-2.2 is different from that used by AlphaFold 2.3.x, and the previous version of the database did not produce an error in this test (although errors may still occur with other sequences), the required execution time can be estimated from the test results of AlphaFold 2.2.4 below.

###### AlphaFold 2.3.1, TensorFlow 2.5.3, jaxlib 0.1.70

![](https://i.imgur.com/ychP55y.png)

The jaxlib version affects the efficiency of the AI and Relax stages, as well as the GPU memory usage. Jaxlib 0.1.70 uses more memory than 0.3.25, and the longest sequence that does not spill into main memory is 2011 amino acids. Beyond this length, main memory is used.

###### AlphaFold 2.2.4, TensorFlow 2.5.3, jaxlib 0.1.70

![](https://i.imgur.com/QN47N1g.png)

###### The old test result of AlphaFold 2.1.x

![](https://i.imgur.com/Ahj4d90.png)

As shown in the graph above, once main memory is used (when the sequence length exceeds 2011 amino acids), the execution time increases sharply. For example, when the sequence length is roughly twice that limit, at 4011 amino acids, the execution time increases by approximately 60 times; after removing potential interference from other computing tasks, the increase is still approximately 30 times. When the sequence length exceeds approximately 3211 amino acids, `2gpu` is needed to obtain more than 90 GiB of main memory. Because of the large memory usage, the execution efficiency is easily affected by other tasks on the same node.
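The thresholds above (2011 amino acids before main memory is used, roughly 3211 before the 90 GiB `1gpu` limit is exceeded) can be condensed into a rough rule of thumb. This is only a sketch distilled from the tests above and applies to the tested versions on a single V100:

```python
def suggest_resource(seq_len: int) -> str:
    """Rough rule of thumb from the AlphaFold 2.1.x test results above."""
    if seq_len <= 2011:
        # Fits entirely in a single V100's 32 GiB of GPU memory; fastest case.
        return "1gpu"
    if seq_len <= 3211:
        # Main memory is used: still within 1gpu's 90 GiB limit,
        # but expect a sharp slowdown.
        return "1gpu"
    # More than 90 GiB of main memory is needed.
    return "2gpu or more"

print(suggest_resource(1500))  # 1gpu
print(suggest_resource(4011))  # 2gpu or more
```

Very long sequences may need even more main memory than the 180 GiB available with `2gpu`, as in the example below.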
In an actual research project, conducted by a professor at China Medical University, the sequence length was 5431 amino acids; `4gpu` was used to obtain more than 180 GiB of main memory, and the execution time was one month.

###### Last updated: 2023/02/23 08:27