Dear Dr. Don Gilbert,
I am using EvidentialGene for the first time, and I have successfully run the
pipeline from step 7 through step 12.
However, it is important for my dataset to include the transposon analysis
(optional step 9e), and I could not find the right way to do this.
So far, I have been generating the scripts, and then executing them, with the
command:
env name=Xname species=Xspecies runsteps=start7 ncpu=Xncpu maxmem=Xmaxmem datad=Xdatad ./run_evgsra2genes4v.sh
However, this does not generate a script for step 9e.
Q1: What would be the proper way to generate and run a script for step 9e?
Or should I just run hmmscan with my Dfam database and then rerun the
subsequent steps (10-12)?
What would be the correct input file for this step: okayset/xxx.okay.mrna or
okayset1st/xxx.okreor.mrna?
All of these files are now compressed, since step 12 has already been executed.
Q2: The G1 table in the file publicset/XX.genesum.txt reads:
173190 gene loci, supported by RNA-seq
173190 (100%) are protein coding, 0 are putative non-coding
69023 (40%) of coding loci have large proteins, 104167 have small proteins (smORF < 120 aa)
NA_n_teloci are protein coding, expressed, loci with transposon domains
I also have some questions about this:
a) Why does this report say "0 are putative non-coding" when the file
ncrnaset/XX.ncrna_pub.fa contains 505677 sequences? Would it be appropriate to
use ncrnaset/XX.ncrna_pub.fa (along with the output of step 9e, once I have it)
as decoy sequences in a Salmon analysis?
b) I get a lot of small proteins (>50%). If these are fragments of longer
proteins, would you recommend trying to reassemble them into larger proteins?
If so, could you give me some recommendations on how to do it (e.g., how to
select these short transcripts, which settings to use for the assembler, and
how to add the new sequences to the generated dataset)?
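To make the selection step concrete, this is roughly what I have in mind: a minimal Python sketch that lists the IDs of proteins shorter than 120 aa from a protein FASTA. The function names and the idea of reading a protein FASTA from the publicset are my own assumptions and not part of EvidentialGene; the 120 aa cutoff is taken from the smORF threshold in the G1 table.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def short_protein_ids(lines: Iterable[str], max_aa: int = 120) -> List[str]:
    """Return IDs of proteins with fewer than max_aa residues.

    A trailing '*' (stop codon symbol) is not counted as a residue.
    The ID is the first whitespace-delimited token of the header.
    """
    ids = []
    for header, seq in read_fasta(lines):
        if len(seq.rstrip("*")) < max_aa:
            ids.append(header.split()[0])
    return ids
```

I would call it as `short_protein_ids(open("proteins.aa"))`, where `proteins.aa` is a placeholder for the uncompressed protein file, and then use the resulting IDs to pull the matching transcripts for reassembly.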
I should also mention that I hope to find overexpressed transposons and that I
do not have a reference genome. My over-assembly was built from rnaSPAdes,
Trinity, Trans-ABySS, and MEGAHIT assemblies.
I really appreciate your help and any comments or suggestions you can give me.