Dear Charles and others,
My apologies for the long delay in this reply, paper writing takes too long,
esp. in this now complicated field. Find here a first draft paper of
EvidentialGene methods compared with other popular transcriptome builders,
http://arthropods.eugenes.org/EvidentialGene/other/genefilters_compared/I'm still editing and expect a new draft out tonight or tomorrow. If you
are eager to read this, this draft covers all I expect to say (likely more
than it should say). Please do offer comments and suggestions to improve
this, of any sort, as it will help much in getting this across to future readers.
As the paper notes, there is a substantial update to Evigene software in the wings,
not yet packaged for public use. If any of you are willing to spend time with
a still messy version for testing, let me know. The update will change
transcriptome results, for the better, and if you can compare old versus new results,
I especially want to hear of that. The major change is to retain more valid
alternate and paralog transcripts, which prior versions are discarding too many of
(but not as many as other methods). Other changes reduce spurious/redundant
transcripts using added measures (described in paper).
- Don Gilbert
For your questions, Charles ---
# Previous points
# For reducing redundancy, i.e. number of "bad" transcripts, you say
Here is my suggestion, though you already have the idea, and homology results it seems:
a. use gene sequenes of evigene's publicset/ after tr2aacds produces okayset/
evigenes/genes/
trclass2pubset.pl -onlypub -idpre Myspp2EVm -log -class project.trclass
or evigene/scripts/
evgmrna2tsa.pl -onlypub -idpre .., older does about same job
See evigene/docs/evgmrna2tsa_help.txt
As you have protein homology scores from Trinotate, you can add that into publicset data,
by first converting those scores to a 'project.names' table, in same folder as project.trclass,
with 4+ columns:
tr_ID <tab> "ref protein name" <tab> score <tab> ref_ID
where score is blast score or alignment or other value that says how valuable
that ref_ID match is. Like this, where this score col has align% to ref
SsERR789444brvelvLoc7071053ct1 oleosin-B6 20% cow18ncrf:XP_002705078.2
SsSRR6236876brvelvLoc83232ct1 erythropoietin receptor precursor 100% cow18ncrf:NP_001192530.1
b. separate out the short-protein loci with no homology support, as
unsupported, mostly random transcription noise. If you want, use
also some expression level measure to decide if other short-aa
genes are worth keeping. The publicset/project.pubids table has
info on protein sizes, gene ids, homology score.
c. keep all gene loci with >= 140 aa (or 120 or 100 depending on how many
extras you want, there isn't a firm cut off for random noise short-proteins)
These are likely true genes, either novel or well-diverged of your species,
and are unlikely to be random at large sizes, though they could be
transposon proteins, or other contaminant. THose however you can screen
for with various homology tests, i.e. PFAM has set of euk transposon domains.
So your "final realistic" gene set contains those with homology scores, and those with proteins long enough to be statistically real.
New point
# As an aside, my normal pipeline incorporates the Trinotate pipeline for annotation purposes.
Trinity's sequence header information is different format (uglier perhaps :) but
has similar information as Evigene. You figured out the needed CDS-offset in
transcript sequences. In the case of Evigene publicset/ sequences, all are +strand
as those transcripts are mRNA-oriented (5' end first).
evigene >Myspecies_combined000003t1 type=protein; aalen=5166,97%,complete; clen=15872; offs=7-15507; oid=Myspecies_velvLoc10ct1; cov=217.1_g9_i0; evgclass=main,okay,match:Myspecies_k45.R4263767,pct:100/100/.;
offs=7-15507 is CDS-offset into transcript
trinity >TRINITY_ABC123_c0_g1::TRINITY_ABC123_c0_g1_i1::g.17214::m.17214 .. TRINITY_ABC123_c0_g1_i1:3-797(+)
CDS-offset is last word on line
According to your test, this does minimal evigene to trinity reformat:
perl -pe 'BEGIN{ $OR="\(\+\)"; }
if(/^>(\S+)/){ $trid=$1; ($gid=$trid)=~s/t\d+$//; ($ofs)=m/offs=([^\;\s]+)/; ($alen)=m/aalen=(\d+)/;
s/>.*$/>$gid:$trid len:$alen $trid:$ofs$OR/; } ' \
evigene.aa > evigene2trin.aa
>Susscr4EVm094566t2 type=protein; aalen=122,16%,complete-utrbad; clen=2267; offs=432-800; oid=Susscrtrsoap1a_sBn2l2ERR789444soapk25Loc13781t13; organism=Sus_scrofa; evgclass=main,okay,match:Susscrtrsoap1a_sBn2l2ERR789444soapk43Loc394309t1,pct:100/83/.;
MMVKRAGLRNLYKIYNQRTYQAPSCARPWTVSQGLVVLSALSRIPGATQGGNLMVSVWVS
WREVSECAGSVFFGHTHSIWKFQDQGSNPSHSSDNARFLTSRPPGKFRVVFIGLLCGFNV
TO
>Susscr4EVm094566:Susscr4EVm094566t2 len:122 Susscr4EVm094566t2:432-800(+)
MMVKRAGLRNLYKIYNQRTYQAPSCARPWTVSQGLVVLSALSRIPGATQGGNLMVSVWVS
WREVSECAGSVFFGHTHSIWKFQDQGSNPSHSSDNARFLTSRPPGKFRVVFIGLLCGFNV