バイオインフォマティクス武者修行 #1 "EnTAP"

おにぎりベンゼン

2022年11月22日 10:43

こんにちは。ベンゼンです。
アブストラクトの記事が全然進んでいないのは思ったより編集が大変だったからです。あまり更新しないと思います(笑)

今日からは、僕が研究の中で作ったいろいろなソフトウェアの注意点をまとめたテキストを共有していきたいと思います。

最初はべた張りですが、直に更新していきたいと思っています。頻度はご愛敬。

それではどうぞ。

記念すべき１回目はEnTAPと呼ばれる相同性検索のワークフローです。

EnTAPを用いたゲノム・トランスクリプトームの大規模アノテーション

参考：

https://entap.readthedocs.io/_/downloads/en/latest/pdf/

#はじめに
Eukaryotic Non-Model Transcriptome Annotation Pipeline:EnTAP

これまでのアノテーション作業で推定された遺伝子配列に対して、種々のデータベースに登録されている遺伝子配列と比較して機能アノテーションを付与してくれるツール。

※その他のアノテーションツールはかなり難しいようです
⑴かずさ DNA 研がリリースしている Hayai-Annotation-Plants

・文字通り植物用のツールであること
・Mac ユーザーには簡単に使えても、WSL 環境のユーザーにはかなりハードルが高い
⑵NextFlowというワークフローのコマンド(多分DockerとかSingularityとかと同じような類と思われる)を使ったEnTAP-nf
・そもそもNextFlow自体難しい。調べた情報でもよくわからない。221023時点では、調べ学習で一旦引き上げる。

参考サイト

https://qiita.com/yuifu/items/3699d89152f2e39fdc26

NextflowからCWLで書かれたワークフローを呼び出す

　厳選！ラボ活性化ツール

　遺伝研スパコンでRNAseqを最速・自動で行う

　nextflowを使ったGATKのバリアントコールパイプライン

　雑に使う Nextflowワークフロー言語　　　　(´・ω・｀)

　Nextflowを使ってバイオインフォマティクスのツールを動かす

　ゼロから触ってみるNextflow

　Tracing & visualisationExecution log

　遺伝研スパコンでRNAseqを最速・自動で行う

　Pipeline sharing

　Get started

EnTAP のインストールはanaconda で一発とはいかないのでセットアップのやり方を書いておく。

#準備
#EnTAPの環境を構築する
conda create -n ENTAP rsem TransDecoder diamond interproscan

localのRを使っている場合、諸々入っていないパッケージが見つかるが、その際はそのたびにBiocManager::install("パッケージ名")でインストールすること。
not foundとかが出てくる。

Rでのパッケージインストールの詳しい方法については以下を参照。

(CRANもしくはBiocManagerから）※(BiocManager::install("パッケージ名")のこと。

　(install.packages("パッケージ名")

GCC 4.8.1以上と CMake 3.00以上が必要
gcc --version
cmake --version

gccは基本的に4.8.1以上あるはず
無ければcondaでインストール
conda install -c "conda-forge/label/cf202003" gcc
※すでにlocal環境でインストールしている場合は、競合が起こるためエラーが生じる。

cmakeはlocalにインストールすることはあまりないので、condaでインストールしても問題ない。
conda install -c anaconda cmake

ちなみに、local環境下にインストールする場合は、
「cmakeのlocal環境下へのインストール」
を参照

データベースのダウンロード
cd ~/blast/database_for_blast/sequences

#Uniprot_Swiss -prot
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz

 #Uniprot_TrEMBL 
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz #注意 ！半日くらいかかります #RefSeqの植物だけのデータ 
wget https://ftp.ncbi.nlm.nih.gov/refseq/release/plant/*.faa.gz
pigz -d -p 16 *.faa.gz
cat *.faa > plants.faa #NCBI  non redundant data
wget https://ftp.ncbi.nlm.nih.gov/blast/db/v5/v5/FASTA/nr.gz
cd ~/blast #InterProScan 
wget http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.59-91.0/interproscan-5.59-91.0-64-bit.tar.gz #30分くらいかかる 
tar xzf interproscan-5.59-91.0-64-bit.tar.gz

展開したら interproscan-5.59-91.0 というディレクトリが出来ていて、
その中に入っている data というディレクトリを anaconda の ENTAP 環境に移す。
移すべき場所は　~/anaconda3/envs/ENTAP/share/ である。
しかしそこにはテストデータが格納された data ディレクトリが既にあるので、先にそれを削除してから移す必要がある。

rm -rf ~/anaconda3/envs/ENTAP/share/InterProScan/data/
mv interproscan-5.59-91.0/data ~/anaconda3/envs/ENTAP/share/InterProScan/ #Pantherのデータ 
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-14.1.tar.gz

※いずれ使うであろうSTRINGのデータもダウンロードしておく
wget --no-check-certificate https://stringdb-static.org/download/protein.sequences.v11.0.fa.gz
wget --no-check-certificate https://stringdb-static.org/download/protein.links.full.v11.0.txt.gz #注意 ！3時間くらいかかります
wget --no-check-certificate https://stringdb-static.org/download/protein.info.v11.0.txt.gz

#GeneMarkS -Tのインストール
まずは http://topaz.gatech.edu/GeneMark/license_download.cgi からいちばん下にある GeneMarkS-T をダウンロードする。
ユーザー情報を適宜入力して、"I agree to the terms of this license agreement" ボタンをクリックすると、ダウンロードに進む。

その後、sftpを使って ~/local/gmst_linux_64 に入れておく

sftp "サーバー名"@gw.ddbj.nig.ac.jp
mkdir /home/iceplant4561/local/gmst_linux_64
cd /home/iceplant4561/local/gmst_linux_64
put ./Downloads/gmst_linux_64.tar.gz #サーバーに戻る 
cd /home/iceplant4561/local/gmst_linux_64
tar xzvf gmst_linux_64.tar.gz
rm gmst_linux_64.tar.gz

※なお、例によってライセンスは200日限りなので、200日経ったらまた上記のサイトに行って、登録情報を入力し直し、
ライセンスに同意し、Please download key 32_bit or 64_bit のリンクからライセンスキー (gm_key_64.gz) をダウンロード・解凍したら、
sftpで/home/iceplant4561/Agarie_group/ice_plant_genome_from_GSA/GeMoMA/gmst_linux_64に移動させる。最後に、ホームディレクトリにコピーしておく

sftp "サーバー名"@gw.ddbj.nig.ac.jp
cd /home/iceplant4561/Agarie_group/ice_plant_genome_from_GSA/GeMoMA/gmst_linux_64
put ./Downloads/gm_key_64.gz

 #サーバーに戻る 
cd /home/iceplant4561/Agarie_group/ice_plant_genome_from_GSA/GeMoMA/gmst_linux_64
guznip gm_key_64.gz
cp gm_key_64 ~/.gm_key

#EnTAPのインストール
参考：https://qiita.com/drk0311/items/9bbe56dc0669376eb98f#2-entap-のインストールとデータのダウンロード

このページを参考にし、EnTAPをlocal下にインストールする

cd ~/local
wget https://github.com/harta55/EnTAP/archive/refs/tags/v0.10.8-beta.tar.gz
tar xzvf v0.10.8-beta.tar.gz

ダウンロードが終わったら、EnTAP の中に入っている entap_config.ini に書かれている種々のコマンドパスを編集する。
編集すべきなのは、以下の14行（行頭の数字はファイル中の行番号）。

22 entap-db-bin=/bin/entap_database.bin
25 entap-db-sql=/databases/entap_database.db
28 entap-graph=/src/entap_graphing.py
46 rsem-calculate-expression=/libs/RSEM-1.3.3//rsem-calculate-expression
50 rsem-sam-validator=/libs/RSEM-1.3.3//rsem-sam-validator
54 rsem-prepare-reference=/libs/RSEM-1.3.3//rsem-prepare-reference
58 convert-sam-for-rsem=/libs/RSEM-1.3.3//convert-sam-for-rsem
77 genemarkst-exe=gmst.pl
83 transdecoder-long-exe=TransDecoder.LongOrfs
86 transdecoder-predict-exe=TransDecoder.Predict
128 eggnog-sql=/databases/eggnog.db
131 eggnog-dmnd=/bin/eggnog_proteins.dmnd
137 interproscan-exe=interproscan.sh
164 diamond-exe=/libs/diamond-0.9.9/bin/diamond

これらのパスはあっていないので、それぞれあるパスを入力する
local下にインストールする場合は以下のようになる。make後はentap_outfiles下に置かれる。
※注意※
なぜかわからないが、"~"(チルダ)を認識してくれない。さらに言うと、$HOMEすら認識しない。
なので、/home/もしくは/lustre7/から書くこと！認識しないから一向に進まない。（これで1日つぶした）
また、condaで導入したものに関しても一応正式なパスを書く方がよいと思う。

configファイルの全文は別ファイル「EnTAPのconfigファイル(entap_config.ini)」にまとめた(221025)
また、テスト用にも別に作る。こちらも別ファイル「EnTAPのテスト用configファイル(entap_config_test.ini)」にまとめた。

実際は、本番用は/home/iceplant4561/local/EnTAP/　下、テスト用は/home/iceplant4561/local/EnTAP/test_data/　下に置くとよい。

ちなみに、entap_database.bin, entap_database.db, eggnog.db, eggnog_proteins.dmndはconfiguration時に導入されるので安心。

22 entap-db-bin=/lustre7/home/iceplant4561/local/EnTAP/entap_outfiles/bin/entap_database.bin
25 entap-db-sql=/lustre7/home/iceplant4561/local/EnTAP/entap_outfiles/databases/entap_database.db
28 entap-graph=/lustre7/home/iceplant4561/local/EnTAP/src/entap_graphing.py
46 rsem-calculate-expression=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/rsem-calculate-expression
50 rsem-sam-validator=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/rsem-sam-validator
54 rsem-prepare-reference=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/rsem-prepare-reference
58 convert-sam-for-rsem=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/convert-sam-for-rsem
77 genemarkst-exe=/lustre7/home/iceplant4561/local/EnTAP/gmst_linux_64/gmst.pl
83 transdecoder-long-exe=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/TransDecoder.LongOrfs
86 transdecoder-predict-exe=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/TransDecoder.Predict
128 eggnog-sql=/lustre7/home/iceplant4561/local/EnTAP/entap_outfiles/databases/eggnog.db
131 eggnog-dmnd=/lustre7/home/iceplant4561/local/EnTAP/entap_outfiles/bin/eggnog_proteins.dmnd
137 interproscan-exe=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/interproscan.sh
164 diamond-exe=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/diamond

cd ~/local/EnTAP-0.10.8-beta
cmake CMakeLists.txt -DCMAKE_INSTALL_PREFIX=${PWD}
make
make install
mv ~/local/EnTAP-0.10.8-beta ~/local/EnTAP #ディレクトリ名を変えるecho export PATH=$PATH:$HOME/local/EnTAP/bin >> ~/.bashrc #bashrcに書き加える
source ~/.bashrc

①EnTAPのConfiguration

conda activate EnTAP
nohup
~/local/EnTAP/bin/EnTAP --config
-d ~/blast/database_for_blast/sequences/uniprot_sprot.fasta
-d ~/blast/database_for_blast/sequences/uniprot_trembl.fasta
-d ~/blast/database_for_blast/sequences/plants.faa
-d ~/blast/database_for_blast/sequences/nr.fasta
-t 16
--ini ~/local/EnTAP/entap_config.ini &
②テスト用トランスクリプトームデータを用いたテスト操作
cd /home/iceplant4561/local/EnTAP/test_data

※ここでも、指定する場合はlustre7からしてあげた方がよい
(1) トランスクリプトーム配列を用いたアノテーション
簡単にまとめると、TransDecoderを用いて最長ORFを取り出し、そのアミノ酸配列を使って種々のデータベースとの相同性検索を行う。
注意！
以下のようなエラーが生じる

EnTAP --runP -i trinity.fnn -d /home/iceplant4561/local/EnTAP/test_data/bin/swiss_prot_test.dmnd -t 50 --ini entap_config_test.ini

Here are a few ways to help diagnose some general issues:
1. Check the (detailed) printed error message below
2. Review the .err files of the execution stage (they will
be in the directory for whatever stage you failed
3. Check the debug.txt file that is printed after execution
4. Ensure your paths/inputs are correct in entap_config.ini
and log_file.txt (this will show your inputs)

Error code: 105
Error in running TransDecoder.Predict
TransDecoder Error: * Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/get_top_longest_fasta_entries.pl trinity.fnn.transdecoder_dir/longest_orfs.cds 5000 5000 > trinity.fnn.transdecoder_dir/longest_orfs.cds.top_longest_5000
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/get_top_longest_fasta_entries.pl trinity.fnn.transdecoder_dir/longest_orfs.cds.top_longest_5000.nr 500 > trinity.fnn.transdecoder_dir/longest_orfs.cds.top_500_longest
PCT_GC: 41.8
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/seq_n_baseprobs_to_loglikelihood_vals.pl trinity.fnn.transdecoder_dir/longest_orfs.cds.top_500_longest trinity.fnn.transdecoder_dir/base_freqs.dat > trinity.fnn.transdecoder_dir/hexamer.scores
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/score_CDS_likelihood_all_6_frames.pl trinity.fnn.transdecoder_dir/longest_orfs.cds trinity.fnn.transdecoder_dir/hexamer.scores > trinity.fnn.transdecoder_dir/longest_orfs.cds.scores
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/select_best_ORFs_per_transcript.pl --gff3_file trinity.fnn.transdecoder_dir/longest_orfs.gff3 --cds_scores trinity.fnn.transdecoder_dir/longest_orfs.cds.scores --min_length_auto_accept 645 --single_best_orf > trinity.fnn.transdecoder_dir/longest_orfs.cds.best_candidates.gff3
Selecting best orfs
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/train_start_PWM.pl --transcripts /lustre7/home/iceplant4561/local/EnTAP/test_data/entap_outfiles/transcriptomes//trinity.fnn --selected_orfs trinity.fnn.transdecoder_dir/longest_orfs.cds.top_500_longest --out_prefix trinity.fnn.transdecoder_dir/start_refinement
Training start codon pattern recognition* Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/build_atgPWM_+-.pl --transcripts /lustre7/home/iceplant4561/local/EnTAP/test_data/entap_outfiles/transcriptomes//trinity.fnn --selected_orfs trinity.fnn.transdecoder_dir/longest_orfs.cds.top_500_longest --out_prefix trinity.fnn.transdecoder_dir/start_refinement --pwm_left 20 --pwm_right 10
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/feature_scoring.+-.pl --features_plus trinity.fnn.transdecoder_dir/start_refinement.+.features --features_minus trinity.fnn.transdecoder_dir/start_refinement.-.features --atg_position 20 > trinity.fnn.transdecoder_dir/start_refinement.feature.scores
-round: 1
-round: 2
-round: 3
-round: 4
-round: 5
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/feature_scores_to_ROC.pl trinity.fnn.transdecoder_dir/start_refinement.feature.scores > trinity.fnn.transdecoder_dir/start_refinement.feature.scores.roc
-parsing scores
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/plot_ROC.Rscript trinity.fnn.transdecoder_dir/start_refinement.feature.scores.roc || :
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/compute_AUC.pl trinity.fnn.transdecoder_dir/start_refinement.feature.scores.roc
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/make_seqLogo.Rscript trinity.fnn.transdecoder_dir/start_refinement.+.pwm || :
Loading required package: grid
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/make_seqLogo.Rscript trinity.fnn.transdecoder_dir/start_refinement.-.pwm || :
Loading required package: grid
Running CMD: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/deplete_feature_noise.pl --features_plus trinity.fnn.transdecoder_dir/start_refinement.+.features --pwm_minus trinity.fnn.transdecoder_dir/start_refinement.-.pwm --out_prefix trinity.fnn.transdecoder_dir/start_refinement.enhanced
num features: 2 num_incorporate: 0
Error, len(target_sequence)=33 and pwm length = 0 at /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/../../PerlLib/PWM.pm line 337.
PWM::score_plus_minus_pwm(PWM=HASH(0x555555992d58), "CTCTTAATTTTCCATAATTAATGGTCAGCCGTG", PWM=HASH(0x55555577d488)) called at /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/deplete_feature_noise.pl line 116
Error, cmd: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/PWM/deplete_feature_noise.pl --features_plus trinity.fnn.transdecoder_dir/start_refinement.+.features --pwm_minus trinity.fnn.transdecoder_dir/start_refinement.-.pwm --out_prefix trinity.fnn.transdecoder_dir/start_refinement.enhanced died with ret 65280 No such file or directory at /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/../PerlLib/Pipeliner.pm line 185.
Pipeliner::run(Pipeliner=HASH(0x55555577d488)) called at /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/train_start_PWM.pl line 141
Error, cmd: /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/util/train_start_PWM.pl --transcripts /lustre7/home/iceplant4561/local/EnTAP/test_data/entap_outfiles/transcriptomes//trinity.fnn --selected_orfs trinity.fnn.transdecoder_dir/longest_orfs.cds.top_500_longest --out_prefix trinity.fnn.transdecoder_dir/start_refinement died with ret 512 No such file or directory at /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/opt/transdecoder/PerlLib/Pipeliner.pm line 185.
Pipeliner::run(Pipeliner=HASH(0x555556248738)) called at /lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/TransDecoder.Predict line 379

この場合は2つの方法がある
❶configを書き直す
transdecoder-predict-exe=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/TransDecoder.Predict #Transdecoder  only. Specify the minimum protein length #type :integer
transdecoder-m=100 #Specify  this flag if you would like to pipe the TransDecoder command '--no_refine_starts' when it is executed. Default: False #This  will 'start refinement identifies potential start codons for 5' partial ORFs using a PWM, process on by default.' #type :boolean (true/false)
transdecoder-no-refine-starts=false

これを

transdecoder-predict-exe=/lustre7/home/iceplant4561/anaconda3/envs/EnTAP/bin/TransDecoder.Predict #Transdecoder  only. Specify the minimum protein length #type :integer
transdecoder-m=100 #Specify  this flag if you would like to pipe the TransDecoder command '--no_refine_starts' when it is executed. Default: False #This  will 'start refinement identifies potential start codons for 5' partial ORFs using a PWM, process on by default.' #type :boolean (true/false)
transdecoder-no-refine-starts=true

に書き換えて TransDecoder.Predict の --no_refine_starts フラグを起動する

EnTAP --runP -i trinity.fnn -d /home/iceplant4561/local/EnTAP/test_data/bin/swiss_prot_test.dmnd -t 50 --ini entap_config_test.ini

※これは、自力でTransDecoderを起動し、ちまちま書き換えたことでわかった

configファイルは元の状態(transdecoder-no-refine-starts=false)

流れ
トランスクリプトームデータのある場所に行き、
TransDecoder.LongOrfs -t trinity.fnn

その後
TransDecoder.Predict --no_refine_starts -t trinity.fnn

完成したtrinity.fnn.transdecoder.pepはヘッダーに".p1"等がついており、これが元のトランスクリプトームデータと合致せずエラーが生じる。
そこで、

cut -f 1 -d . trinity.fnn.transdecoder.pep > temp
mv temp trinity.fnn.transdecoder.pep
EnTAP --runP -i trinity.fnn.transdecoder.pep -d /home/iceplant4561/local/EnTAP/test_data/bin/swiss_prot_test.dmnd -t 50 --ini entap_config_test.ini

とすれば、エラーが起こらない

つまり、".p1"が悪さをしていたことになる。

<補足>
TransDecoderの代わりにGeneMarkS-Tを使うことも可能。

方法⇒configを書き直す

[frame_selection]

#------------------------------- #Select  this option if all of your sequences are complete proteins. #At  this point, this option will merely flag the sequences in your output file #type :boolean (true/false)
complete=false #Specify  the Frame Selection software you would like to use. Only one flag can be specified. #Specify  flags as follows:
1. GeneMarkS-T
2. Transdecoder (default) #type :integer
frame-selection=2

デフォルトではフレームセレクションの方法が2、すなわちTransDecoderになっているので、
1を選択するために書き換える

[frame_selection]

#------------------------------- #Select  this option if all of your sequences are complete proteins. #At  this point, this option will merely flag the sequences in your output file #type :boolean (true/false)
complete=false #Specify  the Frame Selection software you would like to use. Only one flag can be specified. #Specify  flags as follows:
1. GeneMarkS-T
2. Transdecoder (default) #type :integer
frame-selection=1
EnTAP --runP -i trinity.fnn -d /home/iceplant4561/local/EnTAP/test_data/bin/swiss_prot_test.dmnd -t 50 --ini entap_config_test.ini

※TransDecoderにしろ、GeneMarkにしろ、書き換えないといけない点においては等しくめんどくさい。
しかし、GeneMarkの方がサイト()に行って登録して、ダウンロードして、キーもホームディレクトリにおかないといけないので、簡易性からいうとTransDecoderの方がよい。
221025時点ではどちらがいいとか悪いとかわからなかったので、個人の好みですね。

❷ アミノ酸配列を用いたアノテーション

こちらはシンプルに

EnTAP --runP -i trinity.faa -d /home/iceplant4561/local/EnTAP/test_data/bin/swiss_prot_test.dmnd -t 50 --ini entap_config_test.ini

【結論】結局のところ、パスを正確に記入してあげれば問題ないことがわかった

②EnTAP の実行と結果の解釈
#メモリを大量に消費するので、サーバーの場合はバッチ処理を薦める。

conda activate EnTAP
gffread -g GENOME.fasta -y PROTEINS.fasta final_annotations_1.gff
conda activate ENTAP
qsub -V -cwd -l medium -l s_rt=39:00:00 -l d_rt=39:00:00 -l s_vmem=38G -l mem_req=38G -pe def_slot 13 -b y -e error_log -N EnTAP 
/home/iceplant4561/local/EnTAP/bin/EnTAP --runP -i Protein_GeMoMa_all_braker.fasta 
-d /home/iceplant4561/local/EnTAP/entap_outfiles/bin/uniprot_sprot.dmnd 
-d /home/iceplant4561/local/EnTAP/entap_outfiles/bin/uniprot_trembl.dmnd 
-d /home/iceplant4561/local/EnTAP/entap_outfiles/bin/plants.dmnd 
-d /home/iceplant4561/local/EnTAP/entap_outfiles/bin/nr.dmnd 
-t 10 --ini /home/iceplant4561/local/EnTAP/entap_config.ini

膨大な数の相同性検索が行われるが、１日くらいでおわる

 #logを確認 
tail -n 20 entap_outfiles/log_file_{日付}.txt

Final Annotation Statistics

Total Sequences: 39471
Similarity Search
Total unique sequences with an alignment: 34082
Total unique sequences without an alignment: 5389
Gene Families
Total unique sequences with family assignment: 34043
Total unique sequences without family assignment: 5428
Total unique sequences with at least one GO term: 25941
Total unique sequences with at least one pathway (KEGG) assignment: 6472
Totals
Total unique sequences annotated (similarity search alignments only): 716
Total unique sequences annotated (gene family assignment only): 677
Total unique sequences annotated (gene family and/or similarity search): 34759
Total unique sequences unannotated (gene family and/or similarity search): 4712
EnTAP has completed!
Total runtime (minutes): 1127

#もともと、brakerによって予測された遺伝子というのは、「ORFが組めるから」という理由で得られたもので、「翻訳産物である」保証はない。
#参照データを持たない非翻訳産物がnon -coding RNAとして機能することが分かってきた昨今、これらを完全に無視するのはよくない。
#私の場合は、すべてのデータを参照ゲノム情報として用い、解析の段階で選抜する。

cp entap_outfiles/final_results/final_annotations_lvl1.tsv ./EnTAP_final_annotation.tsv

#GOデータの集約
 #既にEnTAPによってGOデータは集約されているが、RのtopGO及びWeb上のエンリッチメント解析ツール(gProfilerなど)で用いる場合不具合が生じたので、手動のまとめ方を載せておく。

 #UniprotとEggNOGの2つのデータをまとめる 
tail EnTAP_final_annotation.tsv -n +2 | cut -f 1,22 > BP_Uniprot
tail EnTAP_final_annotation.tsv -n +2 | cut -f 1,23 > CC_Uniprot
tail EnTAP_final_annotation.tsv -n +2 | cut -f 1,24 > MF_Uniprot
tail EnTAP_final_annotation.tsv -n +2 | cut -f 1,34 > BP_EggNOG
tail EnTAP_final_annotation.tsv -n +2 | cut -f 1,35 > CC_EggNOG
tail EnTAP_final_annotation.tsv -n +2 | cut -f 1,36 > MF_EggNOG
join BP_Uniprot BP_EggNOG > BP_all
join CC_Uniprot CC_EggNOG > CC_all
join MF_Uniprot MF_EggNOG > MF_all
join BP_all CC_all > tmp
join tmp MF_all > GO_list
rm tmp BP_Uniprot CC_Uniprot MF_Uniprot BP_EggNOG CC_EggNOG MF_EggNOG BP_all CC_all MF_all #中間データを消す 
cat GO_list | awk '$2 != ""{print $0}' > tmp1 #空欄を消す  #このあと 、Rに導入するために、UniprotとEggNOGをくっつけた時に生じたスペース等を変換する
cat tmp1 | sed 's/\s/\t/g' | sed 's/,\t\t//g' | sed 's/,\t//g' | sed 's/\t\t/\t/g'| sed 's/,//g'| sed 's/(\w\W\w)/,/g' > orf2go.map #スぺースをタブに変換する ⇒コンマ＋ダブルタブのみ消す⇒コンマ＋タブを消す⇒残ったダブルタブを消す⇒コンマを消す⇒(L=1)などをコンマで置き換える※最後尾の単独コンマ(,GO:●●,GO:○〇",")はどうせ区切り文字で消えるので無視
rm tmp1 #中間データを消す

#どうやらGOが紐づけられた遺伝子数はlogに載っている値より少し多い

#linuxにおける各文字の正規表現については以下 (https://murashun.jp/article/programming/regular-expression.html)を参考
#これをGOエンリッチメント解析に用いる

#※ただしこの場合、コンマ区切り値内で重複が生じるため、EnTAPのGO数より少し多い。
#この重複の削除方法は221118現在では不明

それではまた次回。
いつになるやら。

この記事が気に入ったらサポートをしてみませんか？