Mz Zhou, Thanks for your response!
But how do I build multiple profile alignments for PF05938, and then generate an HMM profile for each alignment file?
1. In the output folder named At-spada, 31_model_evaluation does not exist. Only two directories, 01_preprocessing and 11_motif_mining, were generated during the SPADA run.
2. In the folder named hmm.slk, there is only 21_all.hmm (actually PF05938seed.hmm). So it does not contain profile alignments or per-alignment HMM files.
Moreover, the approach involving Clustal and build_profile.pl does not work either; it gives the same result.
First, the PF05938 full member sequences were aligned using Clustal, which generated only ONE ALN file in the directory ${DIR_ALN}/11_aln, and
build_profile.pl built only ONE new HMM profile for that alignment.
I do not see multiple HMM files: ${DIR_ALN}/11_aln contains only PF05938_full_length.aln,
and ${SPADA_HMM_DIR}/15_hmm holds only PF05938_full_length.hmm. There is also an HMM at ${SPADA_HMM_DIR}/21_all.hmm, but it has the same size as PF05938_full_length.hmm in ${SPADA_HMM_DIR}/15_hmm.
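File size alone does not show how many profiles an HMM file holds, so a direct count may be a more reliable check. Below is a minimal sketch, assuming the files are in HMMER3 text format, where every profile record carries a NAME line and ends with a line containing only `//`. The function names are my own and are not part of SPADA or HMMER.

```python
def count_hmm_profiles(path):
    """Count profile records in a HMMER text-format file.

    Assumption: each record is terminated by a line containing only '//'.
    """
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.strip() == "//":
                count += 1
    return count


def profile_names(path):
    """Return the value of the NAME field of every profile in the file."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith("NAME"):
                parts = line.split(None, 1)  # split tag from value
                if len(parts) == 2:
                    names.append(parts[1].strip())
    return names
```

For example, calling `count_hmm_profiles("/export/home/tempo001/Data/PF05938/hmm.slk/21_all.hmm")` should return 1 if the file really contains a single profile, and `profile_names` on the same path would show which one it is.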
3. Should we use the default cfg.txt?
program:
cd /usr/local/spada_soft/spada
# cfg.txt is the default cfg.txt; hmm.slk contains PF05938seed.hmm, renamed 21_all.hmm (size: 51 kb)
perl spada.pl --cfg cfg.txt \
--hmm /export/home/tempo001/Data/PF05938/hmm.slk \
--dir /export/home/tempo001/Data/PF05938/At-spada \
--fas /export/home/tempo001/Data/arabidopsis/Athaliana_167.fa \
--gff /export/home/tempo001/Data/Athaliana_167_gene.gff3
log file:
===== setting up environment variables =====
using Athaliana matrix
will run GeneWise_SplicePredictor
will run Augustus_evidence
[17:20:37] ########## Starting pipeline ##########
[17:20:37] ##### Stage 1 [Pre-processing] #####
[17:20:37] Creating symbolic link to FASTA file
[17:20:37] translating sequences in 6 reading frames
[17:23:40] extracting ORFs from translated genomic sequence
[17:36:29] Creating symbolic link to GFF file
gff2gtb.pl -i /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/51_gene.gff -o /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/61_gene.gtb
……
35000 RNA | 27059 gene...
gtb2gff.pl -i /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/61_gene.gtb -o /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/62_gene.gff
……
Gtb -> Gff 35000 RNA | 27059 gene...
[17:37:49] extracting ORFs from predicted protein sequence
……
35000 / 35386 done
[17:40:31] ##### Stage 2 [Motif Mining] #####
[17:40:31] running hmmsearch against 12_orf_genome.fa
[17:40:34] parsing HMM output
[17:40:34] calculating alignment coordinates
[17:40:35] transforming coordinates (Amino Acid -> Nucleotide)
[17:40:35] recovering to global coordinate
[17:40:35] 149 in total
[17:40:35] refining HMM hits
[17:40:35] running hmmsearch against 71_orf_proteome.fa
malformat location string: S1
[17:40:35] parsing HMM output
[17:40:35] calculating alignment coordinates
[17:40:35] transforming coordinates (Amino Acid -> Nucleotide)
[17:40:35] recovering to global coordinate
[17:40:35] 74 in total
[17:40:35] refining HMM hits
[17:40:35] preparing for hit tiling
[17:40:35] tiling HMM hits
[17:40:35] removing noise hits
[17:40:35] 21 removed / 135 passed
[17:40:35] removing hits on wrong reading frames
[17:40:36] 0 removed / 135 passed
[17:40:36] re-formatting hit info
So the old question comes up again: