How to Build our own profile alignments, and then our own HMM profile for each alignment file

boyas...@gmail.com

unread,

Jul 2, 2014, 5:59:29 AM7/2/14

to sp...@googlegroups.com

SPADA paper described that (PF05938) was obtained from the Pfam database. An HMM profile was built from this alignment and then used as input to scan the Arabidopsis genome (TAIR10) by running SPADA. Does this mean one HMM profile of seed sequences is enough for SPADA?

But SPADA can not generate alignment file for each protein member in this family, so the HMM profile for each member will not be produced, and there is nothing output in the end.

So how can we build our own profile alignments, and then our own HMM profile for each alignment file?

I was wondering if we should scan the whole proteome with the HMM profile of seed sequences downloaded from Pfam, and take out the protein sequence for each member. Next, we generate for each protein member in the family by Clustal, then build HMM profile and put them in the directory named Aln.PF05938 and HMM.PF05938, respectively. At last we utilize SPADA. Is this Time-consuming process acceptable? If so, does there exist any automatic tool?

Waiting for your response! Thanks

Peng Zhou

unread,

Jul 2, 2014, 7:38:54 PM7/2/14

to sp...@googlegroups.com

The idea is to start with a seed alignment (such as PF05938 downloaded from Pfam) and then scan the entire genome to search for additional members of the family. So yes, one HMM profile of seed sequences would be adequate for a SPADA run.

You can either use own profile (such as downloaded from Pfam) or build a new one if you have a number of candidate sequences. See here for instructions building your own profile:

https://github.com/orionzhou/SPADA/wiki/Usage#preparing-search-hmm-profiles

Actually, SPADA does produce a multiple sequence alignment for each input profile alignment:

https://github.com/orionzhou/SPADA/wiki/Usage#output

For your last concern I suggest you do a SPADA run with the original HMM profile (downloaded Pfam), this will generate a new multiple sequence alignment including all predicted members in the target genome. Then you can build a new profile with this and start a new round of SPADA, which, hopefully, will pull out even more members of the family.

Let me know if I addressed your concern, thanks,

boyas...@gmail.com

unread,

Jul 3, 2014, 9:00:43 AM7/3/14

to sp...@googlegroups.com

Mz Zhou, Thanks for your response!

But how to Build many profile alignments for PF05938, and then generate HMM profile for each alignment file?

1. In the output folder named At-spada, 31_model_evaluation does not exist. But two directories, 11_motif_mining and 01_preprocessing were generated in the SPADA process.

2. In the folder named hmm.slk, only 21_all.hmm (actually PF05938seed.hmm) is located here. So it do not contain profile alignments and HMM files.

Moreover, the approach involved Clustal and build_profile does not work yet, displaying the same result.

Firstly, the PF05938 full member sequences were aligned using Clustal to generate only ONE ALN file in directory ${DIR_ALN}/11_aln, and build_profile.pl only build ONE new HMM profile for the alignment.

I have not seen many hmm files. So you can imagine that in the ${DIR_ALN}/11_aln only contained PF05938_full_length.aln,

and ${SPADA_HMM_DIR}/15_hmm only held PF05938_full_length.hmm, a HMM is located in ${SPADA_HMM_DIR}/21_all.hmm, but has the same size as PF05938_full_length.hmm in ${SPADA_HMM_DIR}/15_hmm .

3. Should we using the default cft.txt?

program:

cd /usr/local/spada_soft/spada

perl spada.pl --cfg cfg.txt \ # using the default cft.txt

--hmm /export/home/tempo001/Data/PF05938/hmm.slk \ #contain PF05938seed.hmm, renamed 21_all.hmm (size:51 kb)

--dir /export/home/tempo001/Data/PF05938/At-spada \

--fas /export/home/tempo001/Data/arabidopsis/Athaliana_167.fa \

--gff /export/home/tempo001/Data/Athaliana_167_gene.gff3

log file:

===== setting up environment variables =====

using Athaliana matrix

will run GeneWise_SplicePredictor

will run Augustus_evidence

[17:20:37] ########## Starting pipeline ##########

[17:20:37] ##### Stage 1 [Pre-processing] #####

[17:20:37] Creating symbolic link to FASTA file

[17:20:37] translating sequences in 6 reading frames

[17:23:40] extracting ORFs from translated genomic sequence

[17:36:29] Creating symbolic link to GFF file

gff2gtb.pl -i /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/51_gene.gff -o /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/61_gene.gtb

……

35000 RNA | 27059 gene...

gtb2gff.pl -i /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/61_gene.gtb -o /export/home/tempo001/Data/PF05938/At-spada/01_preprocessing/62_gene.gff

……

Gtb -> Gff 35000 RNA | 27059 gene...

[17:37:49] extracting ORFs from predicted protein sequence

……

35000 / 35386 done

[17:40:31] ##### Stage 2 [Motif Mining] #####

[17:40:31] running hmmsearch against 12_orf_genome.fa

[17:40:34] parsing HMM output

[17:40:34] calculating alignment coordinates

[17:40:35] transforming coordinates (Amino Acid -> Nucleotide)

[17:40:35] recovering to global coordinate

[17:40:35] 149 in total

[17:40:35] refining HMM hits

[17:40:35] running hmmsearch against 71_orf_proteome.fa

malformat location string: S1

[17:40:35] parsing HMM output

[17:40:35] calculating alignment coordinates

[17:40:35] transforming coordinates (Amino Acid -> Nucleotide)

[17:40:35] recovering to global coordinate

[17:40:35] 74 in total

[17:40:35] refining HMM hits

[17:40:35] preparing for hit tiling

[17:40:35] tiling HMM hits

[17:40:35] removing noise hits

[17:40:35] 21 removed / 135 passed

[17:40:35] removing hits on wrong reading frames

[17:40:36] 0 removed / 135 passed

[17:40:36] re-formatting hit info

So the old question come again:

I was wondering if we should scan the whole proteome with the HMM profile of seed sequences downloaded from Pfam, and take out the protein sequence for each member. Next, we generate for each protein member in the family by Clustal, then build HMM profile and put them in the directory named Aln.PF05938 and HMM.PF05938, respectively. At last we utilize SPADA. Is this Time-consuming process acceptable? If so, does there exist any automatic tool?

Thanks again

boyas...@gmail.com

unread,

Jul 3, 2014, 9:11:57 AM7/3/14

to sp...@googlegroups.com

SUPPLEMENTARY

1. In the output folder named At-spada, 31_model_evaluation does not exist. But two directories, 11_motif_mining and 01_preprocessing were generated in the SPADA process.

In the directories named 11_motif_mining/11_hmmsearch_x and 11_motif_mining/12_hmmsearch_p, 07_final.htb existed in both folders. Is 07_final.htb the final result SPADA produced that I should focus on?

In the folder 11_motif_mining/21_hits, 01_hit.htb, 02_tiled.htb, 04_nr.htb, 05_resolve_rf.htb, 11.tbl were generated in the end. Thus I have no idea of the final output of SPADA.

Peng Zhou

unread,

Jul 12, 2014, 6:01:18 PM7/12/14

to sp...@googlegroups.com

Hi,

Sorry for the late response - I was busy the entire last week and didn't got a chance to look into this...

If there are only two directories (11_motif_minng and 01_preprocessing) in the output folder, then it is an unfinished run. There should be other directories (including 31_model_evaluation) generated. Could you please please send the complete LOG file so I can check where the program died?

I didn't quite get your second question. A HMM profile has equivalent information contained in a multiple sequence alignment file (ALN). If you run build_profile.pl with PF05938_full_length.aln and generated PF05938_full_length.hmm, then successfully run SPADA, you are supposed to get a NEW multiple sequence alignment file (with both old and new members) in the 31_model_evaluation directory.