ExNy Statistics results

552 views
Skip to first unread message

Vanessa Guerra

unread,
Jan 13, 2016, 2:22:22 AM1/13/16
to trinityrnaseq-users
Hi Everyone, 

My results for the small values of Ex with ExNy Statistics look a little high. Does this look normal? Or did I do something wrong? (general questions, but any tips would be helpful). 

command
perl $TRINITY_HOME/util/TrinityStats.pl /global/scratch/vguerra/analysis_hungabee_scratch/Trinotate_hungabee_Acanthaster_Trinity206/Abundance_estimation_Trinity-2-0-6-16Dec2015/Trinity_2-0-6_Allsamples_4Dec2015_normalized.fasta

output

output

################################

## Counts of transcripts, etc.

################################

Total trinity 'genes': 74877

Total trinity transcripts: 93375

Percent GC: 43.49


########################################

Stats based on ALL transcript contigs:

########################################


Contig N10: 4229

Contig N20: 3073

Contig N30: 2371

Contig N40: 1813

Contig N50: 1364


Median contig length: 409

Average contig: 789.33

Total assembled bases: 73703575


#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################


Contig N10: 3,755

Contig N20: 2,613

Contig N30: 1,930

Contig N40: 1,409

Contig N50: 1,002


Median contig length: 378

Average contig: 675.94

Total assembled bases: 50,612,458


command
perl $TRINITY_HOME/util/misc/contig_ExN50_statistic.pl /global/scratch/vguerra/analysis_hungabee_scratch/Trinotate_hungabee_Acanthaster_Trinity206/Abundance_estimation_Trinity-2-0-6-16Dec2015/Build_Transcript_and_Gene_Expression_Matrices_29Dec2015/trans_27Dec2015.TMM.EXPR.matrix /global/scratch/vguerra/analysis_hungabee_scratch/Trinotate_hungabee_Acanthaster_Trinity206/Abundance_estimation_Trinity-2-0-6-16Dec2015/Trinity_2-0-6_Allsamples_4Dec2015_normalized.fasta | tee ExN50.stats

results

#E min_expr E-N50 num_transcripts

E2 80876.635 2698 1

E4 43117.687 7180 2

E5 25517.007 7180 3

E6 25517.007 7180 4

...

E98 2.425 1434 74618

E99 1.902 1407 79912

E100 0.008 1380 90197

E100 0 1364 93375





Brian Haas

unread,
Jan 13, 2016, 7:33:37 AM1/13/16
to Vanessa Guerra, trinityrnaseq-users
Hi

If the ExN50 stat doesn't show a nice peak at around E90, then it usually means that you need to sequence more deeply to get a better assembly.  Note this is what we've observed with non-metatranscriptomes.  If you're doing meta, this may or may not apply.

I'd suggest doing assemblies at a range of read depths and examining full length reconstruction stats as a function of read depth.  This will tell you whether you're getting close to saturation or not. 

Our guide to assessment is available here

And I'll add more to it shortly.

Best

-Brian
(by iPhone)

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,
Jan 14, 2016, 8:04:57 AM1/14/16
to Vanessa Guerra, trinityrnaseq-users
Hi all,

The ExN50 documentation is now updated, and I included an additional plot to show how it might look as you sequence more deeply.  We've observed this pattern on several data sets, and I know others have as well - and of course, I'm curious to see any profiles that might differ from this in case it's not as useful as a guide as I'm anticipating.


best wishes,

~brian

Brian Haas

unread,
Jan 14, 2016, 8:19:32 AM1/14/16
to Vanessa Guerra, trinityrnaseq-users
The full-length swissprot protein match analysis is another important indicator of 'did I sequence deeply enough to generate a high quality assembly', as we describe here:
and in our earlier Nature Protocol paper.  This, combined with the ExN50 analysis should hopefully provide some important insights.

best,

~brian

--
--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Mauricio Losilla

unread,
Apr 20, 2016, 3:09:31 PM4/20/16
to trinityrnaseq-users, vguerr...@gmail.com
Hi,

I would also like your opinion on my ExNy stats. I used Trinity v2.2.0, SS_lib RF, min_contig 152, normalization for the assembly. Then kallisto for transcript quantification, via the provided script in Trinity.


These are the basic stats for my assembly:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 2671011
Total trinity transcripts: 3073139
Percent GC: 45.13

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 3585
Contig N20: 2237
Contig N30: 1502
Contig N40: 1062
Contig N50: 769

Median contig length: 282
Average contig: 511.57
Total assembled bases: 1572127561


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 2380
Contig N20: 1368
Contig N30: 957
Contig N40: 713
Contig N50: 543

Median contig length: 262
Average contig: 419.23
Total assembled bases: 1119755961





The ExN50 table (the values for E1-E10, E12-E19, and others were missing from the result Trinity generated):

#E min_expr     E-N50    num_transcripts
E11 307090.87 459 1
E20 99631.298 544 2
E23 46140.779 459 3
E26 29265.267 459 4
E27 29265.267 473 5
E29 21057.728 473 6
E30 13600.389 473 7
.
.
.
E85 0.202 1042 673017
E86 0.193 1018 726271
E87 0.176 998         782933
E88 0.167 976       843060
E89 0.157 958         906802
E90 0.146 946       974934
E91 0.139 934 1048147
E92 0.13        919 1126551
E93 0.119 906 1210238
E94 0.11        898 1300954
E95 0.098 890 1400670
E96 0.09        882 1511959
E97 0.077 872 1638831
E98 0.063 861 1788859
E99 0.045 847 1982674
E100  0.001 867 2384365


And the ExN50 vs Ex plot is attached. This peaks way before E90.

I think I have a very large number of small, lowly expressed genes/transcripts, in part because I chose min_length 152 (reads were 150 bp, I was trying to save potential ncRNAs).


Any thoughts or suggestions?

Thanks
Mau
ExN50.tiff

Brian Haas

unread,
Apr 20, 2016, 6:15:06 PM4/20/16
to Mauricio Losilla, trinityrnaseq-users, Vanessa Guerra
Hi Mau,

We tend to see this trend when insufficient sequencing was done.  I think you would be best served by sequencing much deeper (probably 10x deeper), but you might first explore why this data set was so under-sampled on the whole.  It seems you have only 7 transcripts that are sucking up 30% of your expression data.  If you don't have a way of reducing these somehow experimentally (pre library prep), then just deeper sequencing is the way to go in the brute force strategy (and don't forget in silico normalization before trying to assemble the deep reads).

~b

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Mauricio Losilla

unread,
Apr 21, 2016, 12:32:18 PM4/21/16
to trinityrnaseq-users, mlosi...@gmail.com, vguerr...@gmail.com

Hello Brian,

Thank you for getting back to me. If you don't mind, I would like to ask you more questions and pick your brain.

1) I also got the impression of insufficient sequencing when looking at your plot of profiles of Ex vs N50 by sequencing depth. However, I wasn't expecting this in my case: I started +176 million pairs of trimmed reads, of which ~52 million survived normalization and were used in the assembly. (I did the transcript abundance estimation using all reads, since I had to analyze them per tissue to get the transcripts.TMM.EXPR.matrix. Is this right?)

2) If 7 transcripts suck up 30% of expression data, that sounds to me like rRNA or maybe DNA contamination. I blasted the top 7 expressed transcripts (from the file transcripts.TMM.EXPR.matrix.E-inputs). These are the top matches for them:

--Homo sapiens RNA, 45S pre-ribosomal 5 (RNA45S5), ribosomal RNA
--TPA: Danio rerio SRP RNA for signal recognition particle RNA
--Takifugu rubripes scaffold_54 immunoglobulin light chain genomic sequence (other hits suggest it is a ncRNA)
--PREDICTED: Astyanax mexicanus PDZ domain-containing protein 4-like (LOC103035084), mRNA
--Apteronotus albifrons partial 28S rRNA gene, clone GY21
--Apteronotus albifrons partial 28S rRNA gene, clone GY21
--PREDICTED: Takifugu rubripes vacuolar protein sorting 37 homolog A (S. cerevisiae) (vps37a), transcript variant X2, mRNA

Transcripts 5 and 6 returned the same hit, but they are from different Trinity genes.

Many of these hits seem to be RNAs from ribonucleoproteins. I wasn't involved in the library prep, but shouldn't these have been taken care of then?


3) the num_transcripts value for E100 in the ExNy file is ~600K short of the total # of transcripts. Not sure what to make of this


4) the BUSCO results (default parameters with vertebrata database):

Summarized benchmarks in BUSCO notation:
C:72%[D:54%],F:12%,M:15%,n:3023

Representing:
529 Complete Single-copy BUSCOs
1650 Complete Duplicated BUSCOs
379 Fragmented BUSCOs
465 Missing BUSCOs
3023 Total BUSCO groups searched


6) What are your overall impressions and recommendations?


Thanks

Mau


To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,
Apr 22, 2016, 7:35:27 PM4/22/16
to Mauricio Losilla, trinityrnaseq-users, Vanessa Guerra
Hi Mauricio,

Responses inline below

On Thu, Apr 21, 2016 at 12:32 PM, Mauricio Losilla <mlosi...@gmail.com> wrote:

Hello Brian,

Thank you for getting back to me. If you don't mind, I would like to ask you more questions and pick your brain.

1) I also got the impression of insufficient sequencing when looking at your plot of profiles of Ex vs N50 by sequencing depth. However, I wasn't expecting this in my case: I started +176 million pairs of trimmed reads, of which ~52 million survived normalization and were used in the assembly. (I did the transcript abundance estimation using all reads, since I had to analyze them per tissue to get the transcripts.TMM.EXPR.matrix. Is this right?)

Yes, you did this correctly. The normalized reads are just for the assembly only. All downstream analyses should be done with the original reads (optionally quality trimmed).

 

2) If 7 transcripts suck up 30% of expression data, that sounds to me like rRNA or maybe DNA contamination. I blasted the top 7 expressed transcripts (from the file transcripts.TMM.EXPR.matrix.E-inputs). These are the top matches for them:

--Homo sapiens RNA, 45S pre-ribosomal 5 (RNA45S5), ribosomal RNA
--TPA: Danio rerio SRP RNA for signal recognition particle RNA
--Takifugu rubripes scaffold_54 immunoglobulin light chain genomic sequence (other hits suggest it is a ncRNA)
--PREDICTED: Astyanax mexicanus PDZ domain-containing protein 4-like (LOC103035084), mRNA
--Apteronotus albifrons partial 28S rRNA gene, clone GY21
--Apteronotus albifrons partial 28S rRNA gene, clone GY21
--PREDICTED: Takifugu rubripes vacuolar protein sorting 37 homolog A (S. cerevisiae) (vps37a), transcript variant X2, mRNA

Transcripts 5 and 6 returned the same hit, but they are from different Trinity genes.

Many of these hits seem to be RNAs from ribonucleoproteins. I wasn't involved in the library prep, but shouldn't these have been taken care of then?



If this was a polyA-captured or other mRNA-targeted prep, then I wouldn't expect tons of rRNA to show up.  If you're doing total RNA, then you might try to use some rRNA depletion method. 

 
3) the num_transcripts value for E100 in the ExNy file is ~600K short of the total # of transcripts. Not sure what to make of this

I think it throws out those that have 'zero' expression according to the read mappings.  Note, they weren't assembled out of thin air, though, so there should really be some read support underlying them. Usually it has something to do with mapping issues, but you can usually safely ignore it.
 


4) the BUSCO results (default parameters with vertebrata database):

Summarized benchmarks in BUSCO notation:
C:72%[D:54%],F:12%,M:15%,n:3023

Representing:
529 Complete Single-copy BUSCOs
1650 Complete Duplicated BUSCOs
379 Fragmented BUSCOs
465 Missing BUSCOs
3023 Total BUSCO groups searched


That's actually not bad at all.
 

6) What are your overall impressions and recommendations?


Since BUSCO actually looks pretty good, perhaps there's something else that's accounting for the peculiar ExN50 stats.  If you have some contamination or microbiome included, this could also account for the low ExN50 stats and profile.  Others have had success using MG-RAST to examine microbiome/contamination related issues. Maybe give that a whirl. 

I'd also suggest looking into the library prep methods that were used, and whether it was polyA-capture or whether you're after total RNA.  The ExN50 profiles are based on polyA-capture data.

best,

~b

 


Thanks

Mau



Mauricio Losilla

unread,
Apr 25, 2016, 1:23:47 PM4/25/16
to trinityrnaseq-users, mlosi...@gmail.com, vguerr...@gmail.com
Hi Brian,

Thank you so much for your kind assistance. It turns out that the library was prepared with the RiboZero kit to remove rRNAs. This kit is designed for Human, mouse and rat, so I guess it is not that surprising that it failed to remove some rRNA in my fish.

Based on what you said about the BUSCO stats and specific gene blasts I have done, it looks that the assembly is acceptable, although more fragmented than desired. 

I want to compare gene abundance of a handful of genes, among the tissues I used in the assembly. I estimated abundance with both RSEM and Kallisto, and the results are pretty similar. Which matrix do you think is the most appropriate for the comparison (TPM_not_cross_normalized or TMM – although both yield qualitatively the same result).  Also, do you think the non-depleted rRNA may influence this?

Thanks

Mau

Brian Haas

unread,
Apr 25, 2016, 1:35:53 PM4/25/16
to Mauricio Losilla, trinityrnaseq-users, Vanessa Guerra
Hi Mau,

The TMM-normalized expression matrix is the one you want for general comparisons of expression values (heatmaps).  Note, for DE analysis, be sure to use the raw counts matrix.

best,

~b

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.

To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Caroline Judy

unread,
Jun 6, 2017, 2:23:07 PM6/6/17
to trinityrnaseq-users
Hi everyone,

In response I also have some strange patterns in my ExN50 stats, which I've included below. One question - I used reads that were quality trimmed and filtered (Trim_Galore), rather than the original reads, which, I think is correct, but, would it be better to use the original reads (unfiltered, untrimmed) since those are more representative of sequencing effort/library?

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 154348
Total trinity transcripts: 193916
Percent GC: 48.25

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 4768
Contig N20: 3336
Contig N30: 2480
Contig N40: 1848
Contig N50: 1360

Median contig length: 541
Average contig: 933.96
Total assembled bases: 181109054


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 3699
Contig N20: 2312
Contig N30: 1550
Contig N40: 1090
Contig N50: 814

Median contig length: 482
Average contig: 728.22
Total assembled bases: 112400013

###########
## ExN50 ##
###########

E min_expr E_N50 num_transcripts
E24   355540.002  1587   1
E42   270212.565  1587   2
E52   191234.276  1196   3
E61   187119.110  1587   4
E64   40823.241   1531   5
E67   32339.559   1531   6
E69   29518.941   1587   7
E70   17143.762   1587   8
E71   12417.904   1587   9
E72   10910.900   1644   10
E73   7243.594    1786   12
E74   4922.705    1786   14
E75   3629.061    1644   18
E76   2543.671    1637   23
E77   2042.906    1531   28
E78   1654.466    1437   35
E79   1595.688    1531   42
E80   1254.822    1532   51
E81   975.907     1403   61
E82   788.113     1396   76
E83   582.321     1299   94
E84   465.372     1196   118
E85   350.920     1161   150
E86   273.265     1273   189
E87   201.557     1322   242
E88   140.611     1287   313
E89   101.946     1396   413
E90   64.756      1448   562
E91   39.625      1669   807
E92   22.755      1757   1199
E93   14.179      1845   1859
E94   8.206       1961   2996
E95   4.500       2137   5054
E96   2.348       2263   8923
E97   1.059       2351   17191
E98   0.496       2206   36161
E99   0.256       1844   73255
E100  0.007       1455   162824

############################################
#   analyze_blastPlus_topHit_coverage.pl   #
############################################


#hit_pct_cov_bin  count_in_bin  >bin_below
100               5233          5233
90                1264          6497
80                929           7426
70                945           8371
60                1117          9488
50                1200          10688
40                1342          12030
30                1558          13588
20                1580          15168
10                716           15884
exn50.pdf

Brian Haas

unread,
Jun 6, 2017, 2:32:32 PM6/6/17
to Caroline Judy, trinityrnaseq-users
This doesn't look bad to me...    and using the trimmed reads should be fine.  Matt MacManes published a paper a while back about how very mild trimming is better than rigorous trimming, but overall it shouldn't make a major difference. 

Also, I'd recommend experimenting with other assemblers as well. Usually, it's more productive than experimenting with a bunch of different Trinity parameters.

best,

~b



--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsub...@googlegroups.com.
To post to this group, send email to trinityrnaseq-users@googlegroups.com.

Caroline Judy

unread,
Jun 6, 2017, 2:42:45 PM6/6/17
to Brian Haas, trinityrnaseq-users
Thank you, Brian! Am I correct in my interpretation that 3 transcripts suck up 51% of the expression?

To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-users+unsubscribe...@googlegroups.com.

To post to this group, send email to trinityrnaseq-users@googlegroups.com.
Visit this group at https://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Brian Haas

unread,
Jun 6, 2017, 3:05:17 PM6/6/17
to Caroline Judy, trinityrnaseq-users
ouch.... yes, that's what it looks like.  

Attached are new versions of the script that will go into the upcoming release.  The ExN50 script will generate a file that will list each transcript and its membership/ordering along the Ex value.  From this, you can easily tell what the top transcripts are.  

You may have mentioned before what these are, but can you refresh my memory?  Are these known to take up so much transcription in the cell?   I know w/ plants, RUBSICO is a problem, and in general rRNA is an issue if you don't polyA-select or otherwise deplete it.

best,

~b
contig_ExN50_statistic.pl
plot_ExN50_statistic.Rscript

Caroline Judy

unread,
Jun 6, 2017, 4:22:10 PM6/6/17
to Brian Haas, trinityrnaseq-users
Hi Brian, thanks for those scripts! Very helpful. I just ran the new perl script, and was able to identify the four most highly expressed transcripts. I blasted them and two mapped to rRNA and two mapped to mtDNA sequences (from avian hosts). 

I used mRNA derived from avian muscle tissue extracted using a mRNA direct kit (polyA tail). In theory there shouldn't be rRNA contamination but clearly there was. I performed the assembly using the Agalma pipeline, which supposedly removed ~30% of read pairs that mapped to known RNA sequences. I'm surprised that there were any reads left that map to rRNA, but I guess there are. 

I'll have to experiment with various ways to bioinformatically remove the rRNA and mtDNA reads and then will re-run the quantification and normalization steps. I'm attaching my output diagnostics from the Agalma assembly (which calls Trinity).

Any thoughts on how such excessive contamination might bias the normalization and downstream DE analysis?
Caroline

index.html

Brian Haas

unread,
Jun 6, 2017, 4:58:02 PM6/6/17
to Caroline Judy, trinityrnaseq-users
The rRNA content definitely seems to be higher than expected.  I wonder if the polyA capture didn't work as well as it was supposed to. It should generally be only a couple percent, ideally.

I wouldn't worry about it biasing the DE results since TMM normalization (or other cross-sample normalization methods) should compensate for these extreme values.

best,

~b

Caroline Judy

unread,
Jun 6, 2017, 10:31:56 PM6/6/17
to Brian Haas, trinityrnaseq-users
Thanks for your comments, Brian. Big help.
Reply all
Reply to author
Forward
0 new messages