step 9e - hmmscan with Dfam database

71 views
Skip to first unread message

MF

unread,
Jun 13, 2022, 3:06:17 PM6/13/22
to EvidentialGene

Dear Dr. Don Gilbert,

I am using for the first time EvidentalGene and I managed to successfully complete the pipeline starting from steps 7 to 12. However, it is important for my dataset to include the transposon analysis (optional step 9e) and I couldn’t find the best way to do this.

So far, I have been generating the scripts, and subsequently executing them, with the command:
env name=Xname species=Xspecies runsteps=start7 ncpu=Xncpu maxmem=Xmaxmem datad=Xdatad ./run_evgsra2genes4v.sh

However, this does not generate a script for step 9e. 
Q1: What would be the proper way to generate and run a script for step 9e? Or should I just run hmmscan with my Dfam database, and then rerun subsequent steps (10-12)?

What would be the correct input file in this step? okayset/xxx.okay.mrna or okayser1st/xxx.okreor.mrna? All these files are now compressed since step 12 was executed.

Q2: On the other hand, in the G1 table of the file publicset/XX.genesum.txt it reads:
173190 gene loci, supported by RNA-seq
  173190 (100%) are protein coding, 0 are putative non-coding
   69023 (40%) of coding loci have large proteins, 104167 have small proteins (smORF < 120 aa)
   NA_n_teloci are protein coding, expressed, loci with transposon domains

I also have some questions about this:
a) Why do I get "0 are putative non-coding" in this report, when the file ncrnaset/XX.ncrna_pub.fa contains 505677 sequences? Would it be appropriate to use the file ncrnaset/XX.ncrna_pub.fa (along with the output of step 9e once I get it) as decoy sequences in a Salmon analysis?


b) I get a lot of small proteins (>50%). If these are longer protein fragments, would you recommend trying to reassemble these fragments into larger proteins? And if so, could you give me some recommendations on how to do it? (eg, selecting these short transcripts, specifications for running the assembler, and how to add these new sequences to the generated dataset).


It is appropriate to mention that I hope to find overexpressed transposons, and that I do not have a reference genome. My overassembly was sourced from rnaSPAdes, Trinity, Trans-ABySS, and MEGAHIT.

I really appreciate your help and any comments or suggestions you can give me.

Don Gilbert

unread,
Jun 13, 2022, 3:47:23 PM6/13/22
to EvidentialGene
However, this does not generate a script for step 9e.

Q1: What would be the proper way to generate and run a script for step 9e? Or should I just run hmmscan with my Dfam database, and then rerun subsequent steps (10-12)?

A1: You need to re-run the Sra2genes script after finishing steps prior to 9e.
Sra2Genes is tricky in that it checks your data set with each call, then writes scripts to process that set.

See at http://arthropods.eugenes.org/EvidentialGene/other/sra2genes_testdrive/sra2genes4v_testdrive/run_plant1kYYPE.txt

1. First call to run_evgsra2genes4v.sh writes scripts for steps 1..7?

env name=plYYPE sratable=plYYPE.srainfo.csv ncpu=8 maxmem=64000 datad=`pwd` $evigene/scripts/run_evgsra2genes4v.sh

Then these generated scripts are run by you on some computer cluster:
  run_s04_soap.plYYPE.sh
  run_s07_tr2aacds.plYYPE.sh

2. Next call  to run_evgsra2genes4v.sh with same options, checks your new data set, writes scripts for steps 8, 9..

env name=plYYPE sratable=plYYPE.srainfo.csv ncpu=8 maxmem=64000 datad=`pwd` $evigene/scripts/run_evgsra2genes4v.sh

THen these new scripts need to be run:
  run_s08_evgblastp.plYYPE.sh
  run_s08_buscoscan.plYYPE.sh
  run_s09a_tr2ncrna.plYYPE.sh

I'm not sure of step 9e script generation; is it this ?
 STEP9b_transposons:  transposon domains with DFAM data and software
Maybe try this, "env runsteps=start9 .. $evigene/scripts/run_evgsra2genes4v.sh"
 
When time permits, I hope to simplify this step-wise operation.  The new Gnodes_pipeline does it better, e.g. makes unix run scripts for your data set all at one time.

----

Q2: On the other hand, in the G1 table of the file publicset/XX.genesum.txt it reads:
A2a: "0 are putative non-coding" is summary software bug, I think, not looking at ncrnaset


b) I get a lot of small proteins (>50%).
A2b: All rna assemblies produce these ambiguous short proteins, some more than others, that depend both on your species and quality/quantity of RNA data.  Whether more assembler runs will help depends on what you have tried, I can't easily answer. Telling the difference between transposons and organismal genes is tricky, hard to do, an area I've not liked to tackle, because there is a lot of overlap, and indeed shifting from one class to another in evolution.

There is an option I recommend to call short proteins below a cut off of 100aa or your choice, as non-coding transcripts (or ambiguous), unless they have other species homology (e.g a few plant genes are 15aa or 20 aa).  These are options in
  tr2aacds4.pl  -smallclass=drop -SMALLMAXAA=100
where drop means leave them out of coding gene set if smaller than 100 aa.

Reply all
Reply to author
Forward
0 new messages