Cumulative sum scaling in normalize_table.py (Qiime) vs. metagenomeSeq (R)

Ryoko Oono

Apr 25, 2016, 5:36:10 PM
to Qiime 1 Forum
I am having trouble replicating normalization results between Qiime and R. The metagenomeSeq results make sense to me: the OTU reads are divided by the scaling factor for each sample (which I can see with exportStats). When I divide my original OTU read abundances by the normalized OTU read abundances, the ratio is the scaling factor specific to that sample. However, CSS normalization in Qiime gives me different results even though I use the exact same starting OTU table. When I export the stats using '-s' in Qiime, I get exactly the same statistics as I do with metagenomeSeq, which tells me I am indeed working on the same data set and the scaling factors are the same for both. But the normalized OTU abundances are completely different from what I get in metagenomeSeq, and I can't figure out how to back-calculate to get my original OTU read numbers. I just wanted to check whether the CSS code in Qiime is outdated, or whether I'm doing something wrong in R.

Thank you!

Here are my commands:

In Qiime: 

normalize_table.py -i OTUTable.biom -o NormalizedOTUTable_CSS_qiime.biom -a CSS -s


In metagenomeSeq in R:
obj = cumNorm(ObjData, p = cumNormStatFast(ObjData))
exportStats(obj)
NormalizedMatrix <- MRcounts(obj, norm = TRUE)
exportMat(NormalizedMatrix, file = "~/Desktop/NormalizedOTUTable_CSS_R.tsv")
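(Here ObjData is a metagenomeSeq MRexperiment object built from my OTU table; a minimal sketch of how one can be constructed, where counts is a placeholder for the raw OTU count matrix with OTUs as rows and samples as columns:)

library(metagenomeSeq)
ObjData = newMRexperiment(counts)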

Here's a sample of my original OTU table:
OTUId     Sample 1   Sample 2
OTU_1        63460      91939
OTU_2           21      13563
OTU_3         3663        790
OTU_4         7468       8212
OTU_6          375        657
OTU_5         5364       1304
OTU_8         1082        558
OTU_11         520      22982

A sample of normalized OTU table from metagenomeSeq:
Taxa and Samples   Sample 1      Sample 2
OTU_1              68754.06284   442014.4231
OTU_2              22.75189599   65206.73077
OTU_3              3968.580715   3798.076923
OTU_4              8091.007584   39480.76923
OTU_6              406.283857    3158.653846
OTU_5              5811.48429    6269.230769
OTU_8              1172.264355   2682.692308
OTU_11             563.3802817   110490.3846

A sample of normalized OTU table from Qiime's normalize_table.py:
#OTU ID   Sample 1   Sample 2
OTU_1       16.069     18.754
OTU_2        4.57      15.993
OTU_3       11.955     11.891
OTU_4       12.982     15.269
OTU_6        8.6699    11.626
OTU_5       12.505     12.614
OTU_8       10.196     11.39
OTU_11       9.1405    16.754

Here are the stats (same for metagenomeSeq and Qiime):
Subject    Scaling factor   Quantile value   Number of identified features   Library size
Sample 1   923              97               108                             141101
Sample 2   208              10               137                             176123
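As a worked check of how the scaling factor is applied (assuming metagenomeSeq's default scaling constant sl = 1000), for OTU_1 in Sample 1:

63460 / (923 / 1000)   # = 68754.06, matching the metagenomeSeq table above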

Colin Brislawn

Apr 25, 2016, 6:44:11 PM
to Qiime 1 Forum
Hello Ryoko,

As you probably already know, qiime is fully open source, so you can take a look at the code qiime uses to run metagenomeSeq. Based on this line, it looks like qiime is writing log2-transformed counts.

This matches the example data you provided, too:
OTU_1 (metagenomeSeq): 68754.06284
log2(68754.06284) == 16.0691573479
OTU_1 (Qiime): 16.069
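
If you want to reproduce the Qiime numbers directly in R, here is a minimal sketch (assuming obj is the cumNorm'd MRexperiment from your first post; the log argument is metagenomeSeq's own log2 option, which Qiime appears to be using):

library(metagenomeSeq)
# norm = TRUE applies the CSS scaling factors;
# log = TRUE additionally log2-transforms the result
QiimeLike = MRcounts(obj, norm = TRUE, log = TRUE)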


Does that answer your question?
Colin

Ryoko Oono

Apr 25, 2016, 8:23:09 PM
to Qiime 1 Forum
Thank you! Yes, that makes a lot of sense now. 

Jay T

Jul 25, 2016, 11:52:22 AM
to Qiime 1 Forum
Will the log-transformed data affect any downstream processes? I read Paul's paper and I plan on normalizing my raw OTU table (otu_table_wc2_no_pynast_failures.biom) and using it for alpha-diversity analyses. What do you guys think?

Jay T

Jul 25, 2016, 7:13:14 PM
to qiime...@googlegroups.com
Colin - I should add this bit of information... I originally had 171 samples. I filtered my samples so that I only worked with ones with at least 15,000 sequences; however, the variance in library size is large. Anyhow, I ran the normalize_table.py script on my raw OTU table. When I imported the result into phyloseq, it gave this error:

> ## Merge the three objects together in phyloseq
> testdata=merge_phyloseq(biomfile,tree,map)
> print(testdata)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 90862 taxa and 148 samples ]
sample_data() Sample Data:       [ 148 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 90862 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 90862 tips and 90848 internal nodes ]
> # Alpha_diversity #Best Plots
> plot_richness(testdata, x="Description", color="Site_Tissue", measures=c("Chao1", "Shannon"))
Error in estimateR.default(newX[, i], ...) : 
  function accepts only integers (counts)
In addition: Warning message:
In estimate_richness(physeq, split = TRUE, measures = measures) :
  The data you have provided does not have
any singletons. This is highly suspicious. Results of richness
estimates (for example) are probably unreliable, or wrong, if you have already
trimmed low-abundance taxa from the data.

We recommend you find the un-trimmed data and retry.

So the script worked, but it seems to have removed my singletons. I should also mention that I ran the filter_otus_through_otu_table.py script in an old workflow using the setting n=2 for singleton removal, and I noticed my files did not decrease in size. Is it possible I never had many to begin with?

Colin Brislawn

Jul 25, 2016, 7:29:53 PM
to Qiime 1 Forum
Hello Jay,

> Will the log-transformed data affect any downstream processes?
I'm not sure how this transformation will affect the different alpha and beta diversity metrics you plan to calculate. Perhaps a qiime dev can speak more about the implications of a log transform on the stats.

> So the script worked, but it seems to have removed my singletons.
I'm not sure at which step your singletons are removed...  
> I should also mention that I ran the filter_otus_through_otu_table.py script in an old workflow using the setting n=2 for singleton removal, and I noticed my files did not decrease in size. Is it possible I never had many to begin with?
Good question...

I need to mention here that 'singleton' refers to 'reads that appear once', but people use the term in different ways. For example, when phyloseq says
"The data you have provided does not have any singletons"
it means 'reads that appear once in a single sample.' That is pretty suspicious, but not biologically impossible.

When you run the qiime script filter_otus_through_otu_table.py with setting n=2, these singletons are 'reads that appear once in the whole study.' 

It's pretty common to remove 'reads that appear once in the whole study.' In fact, some of the OTU picking scripts in qiime already do this automatically, which is why you may not have removed any reads. It is, however, much more contentious to remove 'reads that appear once in a single sample,' which is why phyloseq returns a warning.
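
If you want to check which kind of singletons you actually have, here is a quick sketch in R (assuming a phyloseq object named testdata built from your raw, unnormalized table):

library(phyloseq)
sum(taxa_sums(testdata) == 1)   # OTUs seen exactly once in the whole study
sum(otu_table(testdata) == 1)   # count-of-1 entries, i.e. 'once in a single sample'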


Unfortunately, I don't know enough about metagenomeSeq to speculate about where 'reads that appear once in a single sample' are going. Have you considered importing your raw OTU table to see if it returns the same error?

Colin

Jay T

Jul 25, 2016, 8:24:06 PM
to qiime...@googlegroups.com
I ended up using the raw OTU table and it worked just fine. The results seem pretty similar to the alpha diversity metrics I produced with core_diversity_analyses.py, which, if I'm not mistaken, rarefies the OTU table to an even sampling depth (I used 15,000 sequences per sample). I'm pretty satisfied and plan on reporting both results. Thanks Colin.

Richard Rodrigues

Feb 7, 2017, 8:17:12 PM
to Qiime 1 Forum
Hi,

A quick question: does it do log2(x) or log2(1+x)? I ask because when I un-log the CSS-normalized data I do not see any 0s; I see 1s. I am wondering if this is because of the log transformation or because the CSS normalization sets some minimum.

Thanks.

-Richard

Colin Brislawn

Feb 7, 2017, 11:47:16 PM
to Qiime 1 Forum
Hello Richard,

I'm looking through the code, and I'm not finding a line where qiime adds a pseudocount of 1. However, the metagenomeSeq paper, which introduces CSS normalization, does discuss this added pseudocount.

Adding pseudocounts is also contentious, and criticised because the choice of pseudocount makes a difference.

So yes, I think it's log2(1+x) as you suggested.
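
As a quick sanity check of the un-log (if y = log2(1 + x), then x = 2^y - 1):

2^16.069 - 1   # ~68745: OTU_1 in Sample 1, up to rounding of the printed 16.069
2^0 - 1        # = 0, so a zero count stays zero; 2^0 alone gives the 1s you saw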
Good catch!
Colin

Richard Rodrigues

Feb 8, 2017, 12:27:29 PM
to Qiime 1 Forum
Colin,

Thanks for confirming about the 1+x and also for the link to an interesting article!

-Rich