EviGene pipeline on a single transcriptome assembly?

757 views
Skip to first unread message

Charles Foster

unread,
Jun 20, 2019, 4:20:35 AM6/20/19
to EvidentialGene
Dear Don (and other readers),

Due to several constraints, I'm only able to use one transcriptome assembly from Trinity. As such, I can't do the preferred option of using several assemblers and merging them all into a best assembly.

In any case, I've noticed that my Trinity assembly has many "genes" that probably aren't distinct from one another. As a test, I ran the Evidential Genes pipeline on my single assembly. The reduced assembly (after running tr2aacds.pl and evgmrna2tsa2.pl) has 45746 transcripts corresponding to 41611 genes, which is much less than the 148287 transcripts from 121133 genes Trinity reported (as expected). This seems more reasonable to me.

My question is whether the Evidential Genes pipeline is still valid on a single assembly, despite being intended for merging multiple assemblies from different programs. The results certainly seem better, I just don't want to unwittingly be ruining the transcriptome by using the pipeline in an unsuitable way.

Thanks!

Charles

Don Gilbert

unread,
Jun 20, 2019, 7:36:49 AM6/20/19
to Charles Foster, EvidentialGene
Charles,

The short answer is yes, Evigene's tr2aacds uses self-referential measures to classify genes and alternates,
without regard to how many or sources of your transcript assembly.  It is an objective, absolute measure of "genes",
that is those sequences that share identical sequences must, with caveats, belong to same gene locus.

The more variant models you start with, from reasonable alternate methods of assembly, the more likely you are
to start with a full set of accurate genes, plus the extra partly accurate models.   There can be problems that inflate
gene counts: fragment assemblies especially, if you lack enough RNA coverage, and/or lack the assemblers that can
turn fragments into full genes.

There is more to say about this, in some of my documents.  Deciding on gene set quality by counting totals is one
bug-bear of mine:  one can't know unless one is counting *accurate, complete* gene sets, or a superset of those plus
others.  Most people don't know if they have accurate genes, when deciding by count it is accurate.  Please use homology
tests as your next step. You may want to measure both the large input set, and small okay set output for homology to
see if any was lost.  Busco's measure of the conserved genes is quick and can assess that, in part (please use Busco
protein input mode, as its 'transcriptome' mode is bad).


--
You received this message because you are subscribed to the Google Groups "EvidentialGene" group.
To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To post to this group, send email to evident...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/391c219a-6ab6-4703-9297-7cdb65be03ee%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
don gilbert - www.bio.net - bioinformatics - indiana.u.

Charles Foster

unread,
Jun 20, 2019, 8:28:22 PM6/20/19
to EvidentialGene
Thanks for the prompt reply! In future projects where it is more feasible I'll be assembling with
Velvet/Oases, SOAP-trans etc. as well as Trinity. For now, though, it's good to know that the reduced Trinity assembly is (in theory) fine.

For homology testing, I've run BUSCO (with the vertebrate reference set) on both the Transdecoder-predicted proteins from the large Trinity-assembly input, and on the *.mrna_pub.fa output from the EvidentialGenes pipeline. These are the results:

Trinity input: C:78.3%[S:60.7%,D:17.6%],F:11.0%,M:10.7%,n:2586

EviGenes output: C:77.4%[S:72.2%,D:5.2%],F:11.5%,M:11.1%,n:2586

So, there was a slight decrease in completeness (-0.9%) and a slight increase in fragmentation (+0.5%), but a fair decrease in duplication (-12.4%). This seems pretty good given the substantial decrease in redundant transcripts after running EviGenes (148287 transcripts down to 45746 transcripts).

What do you think about taking the BUSCOs that are present in my search using Transdecoder peptides but missing from my EviGenes search, and then adding them back into the EviGenes assembly?

Thanks again for your thoughts.

Charles


On Thursday, June 20, 2019 at 9:36:49 PM UTC+10, Don Gilbert wrote:
Charles,

The short answer is yes, Evigene's tr2aacds uses self-referential measures to classify genes and alternates,
without regard to how many or sources of your transcript assembly.  It is an objective, absolute measure of "genes",
that is those sequences that share identical sequences must, with caveats, belong to same gene locus.

The more variant models you start with, from reasonable alternate methods of assembly, the more likely you are
to start with a full set of accurate genes, plus the extra partly accurate models.   There can be problems that inflate
gene counts: fragment assemblies especially, if you lack enough RNA coverage, and/or lack the assemblers that can
turn fragments into full genes.

There is more to say about this, in some of my documents.  Deciding on gene set quality by counting totals is one
bug-bear of mine:  one can't know unless one is counting *accurate, complete* gene sets, or a superset of those plus
others.  Most people don't know if they have accurate genes, when deciding by count it is accurate.  Please use homology
tests as your next step. You may want to measure both the large input set, and small okay set output for homology to
see if any was lost.  Busco's measure of the conserved genes is quick and can assess that, in part (please use Busco
protein input mode, as its 'transcriptome' mode is bad).


On Thu, Jun 20, 2019 at 4:20 AM Charles Foster <charles...@gmail.com> wrote:
Dear Don (and other readers),

Due to several constraints, I'm only able to use one transcriptome assembly from Trinity. As such, I can't do the preferred option of using several assemblers and merging them all into a best assembly.

In any case, I've noticed that my Trinity assembly has many "genes" that probably aren't distinct from one another. As a test, I ran the Evidential Genes pipeline on my single assembly. The reduced assembly (after running tr2aacds.pl and evgmrna2tsa2.pl) has 45746 transcripts corresponding to 41611 genes, which is much less than the 148287 transcripts from 121133 genes Trinity reported (as expected). This seems more reasonable to me.

My question is whether the Evidential Genes pipeline is still valid on a single assembly, despite being intended for merging multiple assemblies from different programs. The results certainly seem better, I just don't want to unwittingly be ruining the transcriptome by using the pipeline in an unsuitable way.

Thanks!

Charles

--
You received this message because you are subscribed to the Google Groups "EvidentialGene" group.
To unsubscribe from this group and stop receiving emails from it, send an email to evident...@googlegroups.com.

Don Gilbert

unread,
Jun 22, 2019, 2:14:22 PM6/22/19
to EvidentialGene
Charles,

The most useful way to preserve transcripts with good homology, is to add a homology score table, as input to a new run of tr2aacds2 using same full transcript input set, and table with IDs of those transcripts scored for homology.

Regarding transdecoder, I recommend from measures that you instead use evigene's protein set, this document explains why:
 http://eugenes.org/EvidentialGene/evigene/docs/protein_calcs_compared.txt
 (and in your evigene/docs/ folder if current)
If your large Trinity assembly is the Evigene input, the calculated proteins are all in evigene outputs,  "inputset" and "dropset" for the not-okay ones.

1. Create homology score table from BUSCO output "full_table_XXX.tsv"
You should create a large table with all the transcript IDs you tested
with BUSCO (i.e. evigene set + extras).  Likewise BLAST and other scores,
in same format can be used. Highest scored transcript per reference ID will
be preserved.
Table format is rows of "transcriptID <tab> referenceID <tab> score"

perl -ne \
'if(/^\w/){($bid,$btype,$tid,$score,$len)=split; print join("\t",$tid,$bid,$score)."\n"; }' \
 run_bo_arath16ap_aa/full_table_bo_arath16ap.tsv \
 > arath16ap_busco.tab
# cdnaID        referenceID       score (largest/ref is best)
AT3G02260.1    EOG09360002    7075.1
AT3G02260.2    EOG09360002    7092.4
AT3G02260.3    EOG09360002    7092.4
AT3G02260.4    EOG09360002    7092.9  << top homol score, preserve this transcript
AT5G23110.1    EOG0936000A    7346.3
AT5G40820.1    EOG0936001N    3947.9  << top homol score tied, preserve .1, .3 or .4
AT5G40820.2    EOG0936001N    3947.0
AT5G40820.3    EOG0936001N    3947.9
AT5G40820.4    EOG0936001N    3947.9
...

2. Run transcript assembly reduction with homology scores, to preserve best homol transcripts
$evigene/scripts/prot/tr2aacds2.pl -ablast arath16ap_busco.tab -cdna arath16ap.cdna
Another way, less useful, is to mark some transcripts as *must keep*, with id-keep/drop table,
then use later step in evigene, genes/trclass2pubset.pl, to add back these.

arath16ap.keepdrop file
AT3G02260.4    keep
AT3G02260.1    drop # remove this one example
AT5G23110.1    keep
AT5G40820.1    keep
..

Need also to create new okayset mRNA that adds in extra sequences you want to
keep, if they are not in the okayset from tr2aacds2.pl

  cat okayset/xxx.ok*.cdna added.cdna > newokayset.cdna

Then generate public sequence set with added sequences to keep
$evigene/scripts/genes/trclass2pubset.pl  -keepdrop arath16ap.keep $opts -log -class arath16ap.trclass -cdna newokayset.cdna


On Thursday, June 20, 2019 at 8:28:22 PM UTC-4, Charles Foster wrote:
What do you think about taking the BUSCOs that are present in my search using Transdecoder peptides but missing from my EviGenes search, and then adding them back into the EviGenes assembly?

Thanks again for your thoughts.

Charles


Charles Foster

unread,
Oct 10, 2019, 2:47:55 AM10/10/19
to EvidentialGene
Hi again Don,

Thanks for your reply, and apologies for MY late reply now! Can I please confirm the steps you recommend?

My original input: 148287 transcripts from a Trinity assembly.

1. Run tr2aacds.pl on the assembly

$evigene/prot/tr2aacds.pl -NCPU $ncpu -MAXMEM $maxmem -log -cdna transcripts.tr

2. Get the amino acid sequences for all input transcripts from the EviGene output

I believe these are in inputset/transcripts.aa? The only confusion I have is that my input transcriptome assembly has 148287 transcripts, but transcripts.aa has 150613 sequences. Is this because of multiple open reading frames per sequence?

3. Run Busco on the amino acid sequences

4. Reformat the Busco results tsv file using the perl code you provided above

perl -ne \
'if(/^\w/){($bid,$btype,$tid,$
score,$len)=split; print join("\t",$tid,$bid,$score)."\n"; }' \
 full_table_transcripts_aa.tsv \
 > transcripts_aa_busco.tab

5. Run tr2aacds.pl again, but with the added option to use Busco homology scores

$evigene/prot/tr2aacds.pl -NCPU $ncpu -MAXMEM $maxmem -log -ablast transcripts_aa_busco.tab -cdna transcripts.tr

6. Generate new pubsets etc.

$evigene/scripts/evgmrna2tsa2.pl -onlypubset -idprefix $myspeciesEVm -class $myspecies.trclass

Done!

Does that look correct to you? Any other steps/modifications you would recommend?

Cheers,
Charles

Don Gilbert

unread,
Oct 10, 2019, 8:05:29 AM10/10/19
to Charles Foster, EvidentialGene
Charles,

Yes, I think those steps looks like it will work.  If not, let me know.

PS. I'm working on a draft paper comparing Evigene methods to other popular ones.
It points out the mistakes in common popular methods of gene reconstruction, including
some in Evigene, which I've corrected and will have updated software to release, hopefully
in a few months.

Trinity is still low on the list of assemblers ability to recover accurate genes, but more to the point
is that by  using several assemblers, one overcomes the weaknesses of each one to obtain
the more accurate gene set.   Also the now popular Transdecoder for protein computes from
transcripts is missing many valid proteins, which Evigene methods do not.

- Don Gilbert


--
You received this message because you are subscribed to the Google Groups "EvidentialGene" group.
To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/b7244bbc-574b-493f-a6e6-c3d994cdad9b%40googlegroups.com.

Charles Foster

unread,
Oct 11, 2019, 12:01:48 AM10/11/19
to EvidentialGene
Great, thanks. I'll give it a go and see if any issues pop up. I'm also in the process of using additional assemblers. Once they have (hopefully) completed with no issues I'll try combining all assemblies using the same approach.

Looking forward to the paper too.

Cheers,
Charles
To unsubscribe from this group and stop receiving emails from it, send an email to evident...@googlegroups.com.

Charles Foster

unread,
Oct 13, 2019, 11:06:46 PM10/13/19
to EvidentialGene
Hi again Don,

My additional assemblies finished faster than expected. I've subsequently followed the steps I outlined above, only I've worked with several assemblers:

* Trinity (assembles with 1 kmer size: K25)
* rnaspades (assembles with 2 kmer sizes: K29 and K35)
* transabyss (assembled with 4 kmer sizes: K25, K35, K45, and K55)

I have some follow-up questions:

Work with merged assemblies or single-kmers?
Both rnaspades and trans-abyss merge their single kmer assemblies into a single final assembly. I'm not sure how rnaspades does it, but transabyss uses a function called transabyss-merge. Are you familiar with these programs? If so, would you recommend using the "final" assemblies as input to EviGene, or using the single kmer assemblies as input to EviGene?

Number of transcripts
As discussed in this thread, I originally used EviGene just with a Trinity assembly. The Trinity assembly has 148287 transcripts, but after the EviGene pipeline this number was reduced to just 45746 transcripts corresponding to 41611 genes. The reduced number of genes seemed far more realistic, but still resulted in many unannotated transcripts.

However, after running the EviGene pipeline (tr2aacds --> busco --> tr2aacds + ablast) with multiple assemblies, the final count of transcripts is 131132, corresponding to 89213 genes. There are clearly more genes reconstructed than with only Trinity.

Perhaps naively, I thought/hoped that the EviGene pipeline with multiple assemblers would result in less (but higher quality) gene reconstructions. It's difficult for you to say without being familiar with the data set, but are my results as you would expect? I should note that the BUSCO score increases from ~78%  complete with the Trinity-only assembly (based on both transdecoder and EviGene proteins) to 89% with the multi-assembly EviGene result.

By the way, the -ablast option with BUSCO results didn't change the numbers of reconstructed genes with the multi-assembly EviGene test.

Cheers,
Charles

On Thursday, October 10, 2019 at 11:05:29 PM UTC+11, Don Gilbert wrote:
To unsubscribe from this group and stop receiving emails from it, send an email to evident...@googlegroups.com.

Don Gilbert

unread,
Oct 14, 2019, 4:08:13 PM10/14/19
to Charles Foster, EvidentialGene
Charles,

Thanks for these questions, the draft paper I am working on has some of the answers in detail, but still is missing chunks of text.
To your questions:

A1: I've used rnaspades, collaborators have used transabyss (I found it too hard:), but I haven't a definititive answer.  What other methods have shown is that
these mergers are using a 'longest-rna' approach which collects mistakes more than accuracies.  You can measure this fairly simply: tr2aacds on each (unmerged, merged) set, look at (1) trclass.sum.txt which lists simple, longest protein stats (AA1k statistic), (2) quick BUSCO homology scoring.  I will guess the unmerged sets have somewhat more accurate coding genes, along with the extras.

A2: number of transcripts .. you already have part of this answered:
"BUSCO score increases from ~78%  complete with the Trinity-only assembly to 89% with the multi-assembly.."

The other answer parts are (1) more alternate transcripts, and especially (2) many more short, unclassified proteins (under 120aa-140aa).  Most of those are random noise, but require added evidence (e.g. homology) to detect the real ones.

These over-assemblies are adding many more accurate genes, and indeed for your more complex gene families, many fragments are reduced to a few longer complete genes. But the flip side is this adds also many more short gene models that are not classifiable with the efficient self-referential approach of tr2aacds.  It produces a draft set for you that has more of the good and not-good, but a simple step of protein homology will re-classify the short ones into  1,000 good ones and maybe 50,000 useless ones.  Or the really simple way is to say you don't care about <100aa genes (which some folks do), and drop out the bulk of your transcripts that way.

PS. Transdecoder makes many mistakes of omission (drops ~20-30% of valid genes), you are better off using Evigene's tr2aacds proteins.

The underlying question/answer here is that popular/high-reputation tools in genomics are to be questioned with objective tests, and those tests show flaws in this area of gene reconstruction, flaws that Evigene overcomes.  Part of this answer is that measures used by others are not the biologically relevant ones.

A3: -ablast option with BUSCO results didn't change the numbers.
This is what should happen (ie tr2aacds keeps all the longest protein == most homology).
But you can double-check by transcript ID.  If a shorter protein at gene locus has higher homology/BUSCO score (which happens on occasion), it will be kept using -ablast busco.table. If you check IDs in okayset and they don't contain all IDs in your busco.table, that is a mistake/bug somewhere.  You can triple check by adding false high score in busco.table to some tiny transcript and see what happens.

-- Don Gilbert
Here is some relevant text from my paper.  It isn't as readable or coherent as I'd wish .. can any of you suggest improvements?

D: Explain discrepancies, differences with related work

-- "too many genes/transcripts compared to known/reference gene sets "

Many practitioners of gene informatics use the count of transcripts from an assembly or modeling set to assess quality. The problem with this is that one is often counting errors of omission, and mistaken models, in small transcript sets.  Any count from un-validated gene sets is likely to be spurious.  One needs first to assess quality of each gene/transcript, as with homology and other cross-species measures, with chromosomal and related same-species measures.  This will help identify accurate versus inaccurate models, where the count of accurate models increases with additional assembly sets (eg. Fig 1).  When using many assemblers and options to produce over-assemblies of gene sets, one increases both the number of accurate models, and the number of ambiguous models.   Ambiguous models need evidence beyond self-referential measures to re-classify.  The efficiency of self-referential reduction, where it retains these many additional accurate models, allows one to apply more costly measures of data reduction with external evidence to produce gene sets with superior accuracy.

Two sub-classes of ambiguous transcripts increase with added assembly:
  1. alternate transcripts that may or may not be biologically valid, the distinction needs as much added evidence as feasible.  Known cases of complex genes will include 100s of biological alternates with a few, such as hyper-DSCAM in arthropods having 10,000s of measurable biological alternates.  Added evidence from chromosome-mapping and splice site pattern analyses will re-classify between useful and useless alternate transcripts.
  2. short model proteins, below 120aa-140aa, are very common and very difficult to classify between biological or random sequence results, with a majority from random sequence.  These should be re-classified with homology and any species-group or experimental evidence of short proteins, eliminating a large pile of uninformative expressed sequences with some coding potential.   Expressed transposons, where found, tend also to produce large numbers of slightly different, short proteins, and can be reclassified with transposon domain analyses.  Experience of this author finds expressed transposons are generally rare, but level is species/experiment dependent.

In this report, the Evigene draft transcripts sets for human and plant are 4x-5x above reference set counts, most of the excess are the unclassified short proteins.  This evigene draft set for human has 341,026 putative gene loci, but only 27,240 of these have proteins of 140aa or longer.  The 313,786 short putative loci contain most of the 1,770 found in the human reference set, and most of the spurious loci that can be discarded with further evidence classification.




To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/b1f93d2e-0285-4e58-8c34-73384ce10e49%40googlegroups.com.

Charles Foster

unread,
Oct 14, 2019, 11:57:06 PM10/14/19
to EvidentialGene
Don,

Thanks for the informative reply. It's valuable learning to further understand and improve transcriptome assemblies.

Previous points
I'm re-running the assemblies for a few reasons, but once they're done I'll try merging single k-mer assemblies using EviGene instead of the bundled merging functions of rnaSpades and trans-abyss.

For reducing redundancy, i.e. number of "bad" transcripts, you say "a simple step of protein homology will re-classify the short ones into 1,000 good ones and maybe 50,000 useless ones". Can I confirm, do you mean something as simple as search against a reference database and/or genome (using BLAST+ or similar) and keeping only those transcripts with a good hit?

The reason I ask is that I'm trying to have as realistic a transcriptome as possible from a non-model organism, but good reference genomes are relatively distantly related. When I run downstream differential expression analyses, having "too many" transcripts/genes necessarily impacts on the analyses by increasing the number of comparisons and potential false positives. Consequently, I've considered just keeping annotated genes for these analyses, and/or only using annotated DE genes for eventual GO term enrichment analyses. My motivation for readdressing the analyses using multiple assemblies was to reduce the number of fragments of the same gene that are recovered as differentially expressed - I'd rather just have the complete, well reconstructed gene recovered as differentially expressed.

New point
As an aside, my normal pipeline incorporates the Trinotate pipeline for annotation purposes. I know this is a separately maintained pipeline to EviGene, but I thought you might be able to help with some data wrangling to make it work with Evigene results. The Trinotate pipeline uses Transdecoder ORF predictions, blasts them against SwissProt, then loads the hits into an SQLite database. I've tried just loading in the Evigene peptide predictions, but it throws up an error about not being able to extract orf coordinates from transcripts. Basically, Trinotate expects the header of the sequences to be long and convoluted like so:

>TRINITY_ABC123_c0_g1::TRINITY_ABC123_c0_g1_i1::g.17214::m.17214 TRINITY_ABC123_c0_g1::TRINITY_ABC123_c0_g1_i1::g.17214  ORF type:5prime_partial len:265 (+),score=61.37 TRINITY_ABC123_c0_g1_i1:3-797(+)

The line in the script throwing the error is:

$header =~ /\s+(\S+):(\d+)-(\d+)\(([\+\-])\)/ or die "Error, cannot extract orf coordinates from transcript $header"

So, the script is wishing to find the co-ordinates in the form of ":3-797(+)" at the end of fasta sequence headers.

The Evigene header I have is like so:

>Myspecies_combined000003t1 type=protein; aalen=5166,97%,complete; clen=15872; offs=7-15507; oid=Myspecies_velvLoc10ct1; cov=217.1_g9_i0; evgclass=main,okay,match:Myspecies_k45.R4263767,pct:100/100/.;

As a test, I manually edited the header to be:

>Myspecies_combined000003::Myspecies_combined000003t1 Myspecies_combined000003t1:7-15507(+)

And Trinotate worked as desired. Do you know the best way to wrangle the final *.aa_pub.fa file from Evigene to be in this format, including having the co-ordinates and direction (+ or -)? If not, or if it would be too much work, I can modify my pipeline to skip using Trinotate, but it HAS been pretty handy to use thus far. I mainly use it to generate a nice annotation file with the genes, corresponding isoforms, the annotations, and associated GO terms/KEGG orthologues (automatically mined from Uniprot).

The text from your paper
It reads fine to me, and is informative. We actually do see expressed transposons in our transcriptomes funnily enough. I'd be happy to read through a draft of the paper at some stage if you want a set of non-expert eyes to judge how interpretable it is. I'd be interested in reading it regardless.

Overall, thanks for all the assistance thus far. I appreciate the time you've taken to reply. I ask these questions openly in the hope someone else stumbles across this thread and finds it useful.

Cheers,
Charles

Don Gilbert

unread,
Oct 31, 2019, 2:31:36 PM10/31/19
to Charles Foster, EvidentialGene
Dear Charles and others,

My apologies for the long delay in this reply, paper writing takes too long,
esp. in this now complicated field.  Find here a first draft paper of
EvidentialGene methods compared with other popular transcriptome builders,  
  http://arthropods.eugenes.org/EvidentialGene/other/genefilters_compared/

I'm still editing and expect a new draft out tonight or tomorrow.  If you
are eager to read this, this draft covers all I expect to say (likely more
than it should say).  Please do offer comments and suggestions to improve
this, of any sort, as it will help much in getting this across to future readers.

As the paper notes, there is a substantial update to Evigene software in the wings,
not yet packaged for public use.  If any of you are willing to spend time with
a still messy version for testing, let me know.  The update will change
transcriptome results, for the better, and if you can compare old versus new results,
I especially want to hear of that.  The major change is to retain more valid
alternate and paralog transcripts, which prior versions are discarding too many of
(but not as many as other methods).  Other changes reduce spurious/redundant
transcripts using added measures (described in paper).

- Don Gilbert

For your questions, Charles ---
 
# Previous points
# For reducing redundancy, i.e. number of "bad" transcripts, you say

Here is my suggestion, though you already have the idea, and homology results it seems:

a. use gene sequenes of evigene's publicset/ after tr2aacds produces okayset/
   evigenes/genes/trclass2pubset.pl -onlypub -idpre Myspp2EVm -log -class project.trclass
or  evigene/scripts/evgmrna2tsa.pl  -onlypub -idpre ..,  older does about same job
See evigene/docs/evgmrna2tsa_help.txt

As you have protein homology scores from Trinotate, you can add that into publicset data,
by first converting those scores to a 'project.names' table, in same folder as project.trclass,
with 4+ columns:  
    tr_ID <tab> "ref protein name" <tab> score <tab> ref_ID
where score is blast score or alignment or other value that says how valuable
that ref_ID match is.  Like this, where this score col has align% to ref
  SsERR789444brvelvLoc7071053ct1 oleosin-B6   20% cow18ncrf:XP_002705078.2
  SsSRR6236876brvelvLoc83232ct1  erythropoietin receptor precursor 100% cow18ncrf:NP_001192530.1

b. separate out the short-protein loci with no homology support, as
   unsupported, mostly random transcription noise.  If you want, use
   also some expression level measure to decide if other short-aa
   genes are worth keeping.  The publicset/project.pubids table has
   info on protein sizes, gene ids, homology score.

c. keep all gene loci with >= 140 aa (or 120 or 100 depending on how many
   extras you want, there isn't a firm cut off for random noise short-proteins)  
   These are likely true genes, either novel or well-diverged of your species,
   and are unlikely to be random at large sizes, though they could be
   transposon proteins, or other contaminant.  THose however you can screen
   for with various homology tests, i.e. PFAM has set of euk transposon domains.
     
So your "final realistic" gene set contains those with homology scores, and those with proteins long enough to be statistically real.

New point
# As an aside, my normal pipeline incorporates the Trinotate pipeline for annotation purposes.

Trinity's sequence header information is different format (uglier perhaps :) but
has similar information as Evigene.  You figured out the needed CDS-offset in
transcript sequences.  In the case of Evigene publicset/ sequences, all are +strand
as those transcripts are mRNA-oriented (5' end first).

evigene >Myspecies_combined000003t1 type=protein; aalen=5166,97%,complete; clen=15872; offs=7-15507; oid=Myspecies_velvLoc10ct1; cov=217.1_g9_i0; evgclass=main,okay,match:Myspecies_k45.R4263767,pct:100/100/.;
  offs=7-15507 is CDS-offset into transcript
trinity >TRINITY_ABC123_c0_g1::TRINITY_ABC123_c0_g1_i1::g.17214::m.17214 ..  TRINITY_ABC123_c0_g1_i1:3-797(+)
  CDS-offset is last word on line

According to your test, this does minimal evigene to trinity reformat:
  perl -pe 'BEGIN{ $OR="\(\+\)"; }
  if(/^>(\S+)/){ $trid=$1; ($gid=$trid)=~s/t\d+$//; ($ofs)=m/offs=([^\;\s]+)/; ($alen)=m/aalen=(\d+)/;
  s/>.*$/>$gid:$trid len:$alen $trid:$ofs$OR/; } ' \
     evigene.aa > evigene2trin.aa

  >Susscr4EVm094566t2 type=protein; aalen=122,16%,complete-utrbad; clen=2267; offs=432-800; oid=Susscrtrsoap1a_sBn2l2ERR789444soapk25Loc13781t13; organism=Sus_scrofa; evgclass=main,okay,match:Susscrtrsoap1a_sBn2l2ERR789444soapk43Loc394309t1,pct:100/83/.;
  MMVKRAGLRNLYKIYNQRTYQAPSCARPWTVSQGLVVLSALSRIPGATQGGNLMVSVWVS
  WREVSECAGSVFFGHTHSIWKFQDQGSNPSHSSDNARFLTSRPPGKFRVVFIGLLCGFNV
   TO
  >Susscr4EVm094566:Susscr4EVm094566t2 len:122 Susscr4EVm094566t2:432-800(+)
  MMVKRAGLRNLYKIYNQRTYQAPSCARPWTVSQGLVVLSALSRIPGATQGGNLMVSVWVS
  WREVSECAGSVFFGHTHSIWKFQDQGSNPSHSSDNARFLTSRPPGKFRVVFIGLLCGFNV
  


To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/97ade6a3-7f82-476b-9d51-7206f99e6a9d%40googlegroups.com.

Charles Foster

unread,
Nov 7, 2019, 7:25:06 PM11/7/19
to EvidentialGene
Dear Don,

Thanks again for the extra information and tips - I'll take them on board. I've had to be pragmatic and "finalise" my assembly and annotation for my current project. However, for future projects I'll be happy to test out the newer implementation of Evigene.

FYI: some notes for the current assembly

I stuck with using Trinity, trans-abyss, rna-spades, and spades (in single-cell mode). These four assemblers have been found to do pretty well for assembly: see this recent paper https://academic.oup.com/gigascience/article/8/5/giz039/5488105. Of course, I then reduced the over-assembly using EviGene. Previously I asked whether it would be preferable to use (1) the single k-mer assemblies with EviGene or (1) to use the versions where single k-mer assemblies were automatically merged by the respective assembly programs. I ended up assessing both approaches using BUSCO scores...

(1) Single k-mers w/ Evigene: C:90.1%[S:2.2%,D:87.9%],F:4.1%,M:5.8%,n:2586
(2) Auto-merged assemblies w/ Evigene:  C:89.6%[S:58.1%,D:31.5%],F:4.0%,M:6.4%,n:2586

While the single-kmer reduction approach had a slightly higher BUSCO score, it also had a much higher duplication level and much lower single-copy level. It also had a much larger number of transcripts in the assembly, despite the very similar BUSCO score (although I acknowledge that on its own this isn't the most useful criterion). Ultimately, I chose to progress with the assembly where single k-mer assemblies were automatically merged by the respective assembly programs followed by Evigene reduction.

Evigene bioRxiv paper
I've also had a quick read of the paper you put onto bioRxiv. It's very useful - thanks. In time I will try to provide mroe useful feedback, but for now I had a query about the Transdecoder commands you used. You used:

Trans-Decoder.LongOrfs -m 30 -t human18nc.mrna; followed by TransDecoder.Predict -t human18nc.mrna

From my previous experiences (as a non-expert!), I found better results with transdecoder by including homology searches as ORF retention criteria - see https://github.com/TransDecoder/TransDecoder/wiki#including-homology-searches-as-orf-retention-criteria. The homology searches I used were blastp results (after blasting transdecoder long ORFs against SwissProt) and hmmer results (using hmmer to search transdecoder long ORFs against the PFam database):

TransDecoder.Predict -t Trinity.fasta --retain_pfam_hits pfam_results.txt --retain_blastp_hits blastp.outfmt6

Have you tried doing this and comparing the results to Evigene? I'm not sure how much the results will change if the initial long ORFs prediction is missing valid proteins, but perhaps the accuracy of the method might increase? However, based on the results you present, I'm still confident that Evigene will have superior results.

Cheers,
Charles

Don Gilbert

unread,
Nov 7, 2019, 9:20:40 PM11/7/19
to Charles Foster, EvidentialGene
Charles,

Thanks for your update.  For your two BUSCO scores,
(1) Single k-mers w/ Evigene: C:90.1%[S:2.2%,D:87.9%],F:4.1%,M:5.8%,n:2586
(2) Auto-merged assemblies w/ Evigene:  C:89.6%[S:58.1%,D:31.5%],F:4.0%,M:6.4%,n:2586

the first appears to recover about 15 more conserved genes, that only matters depending on what folks
want of those few.  The S: and D: counts are only meaningful if you are measuring gene loci.  Transcriptomes
include many alternate transcripts per locus, and BUSCO software doesn't recognize that (unless you do
extra steps in classifying .. I've a variant of BUSCO.py that does such given the gene ID table from evigene).

It may well be you end up with more putative gene loci from (1) full set of singles above, after Evigene reduction.
There is a trade-off of more true loci plus more ambiguous loci, versus fewer of both (a topic in draft  paper).

For this paper of mine, the point about TransDecoder is that it doesn't work well at recovering known proteins
unless you already have used those known proteins to inform it.  This isn't what a transcript ORF/protein tool
should be required to do.  I likely should add that as a note somewhere, but in the end, using homology scoring
on very large transcript assemblies is a waste of compute effort as well as getting you only the known proteins,
leaving new, unknowns with poor results.

Don Gilbert


To unsubscribe from this group and stop receiving emails from it, send an email to evidentialgen...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/evidentialgene/808c3e89-0def-4427-b767-fbf5a6a6ed05%40googlegroups.com.
Reply all
Reply to author
Forward
0 new messages