Cumulative sum scaling in normalize_table.py (Qiime) vs. metagenomeSeq (R)


Ryoko Oono

Apr 25, 2016, 5:36:10 PM
to Qiime 1 Forum
I am having trouble replicating normalization results between Qiime and R. The metagenomeSeq results make sense to me: the OTU reads are divided by the scaling factor for each sample (which I can see with exportStats). When I divide my original OTU read abundances by the normalized OTU read abundances, they all become the scaling factor specific to that sample. However, CSS in Qiime gives me different results even though I use the exact same starting OTU table. When I export the stats using '-s' in Qiime, I get the exact same statistics as I do with metagenomeSeq, which tells me I am indeed working on the same data set and the scaling factors are the same for both. But the normalized OTU abundances are completely different from what I get in metagenomeSeq, and I can't figure out how to back-calculate to get my original OTU read numbers. I just wanted to check whether the code in Qiime is out of date, or whether I'm doing something wrong in R.

Thank you!

Here are my commands:

In Qiime: 

normalize_table.py -i OTUTable.biom -o NormalizedOTUTable_CSS_qiime.biom -a CSS -s


In metagenomeSeq in R:
obj = cumNorm(ObjData, p = cumNormStatFast(ObjData))
exportStats(obj)
NormalizedMatrix <- MRcounts(obj, norm = TRUE)
exportMat(NormalizedMatrix, file = "~/Desktop/NormalizedOTUTable_CSS_R.tsv")
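
To show what I mean by back-calculating, here is roughly the check I am doing in R (a minimal sketch; if I understand correctly, MRcounts with norm = TRUE divides each sample by normFactors(obj)/1000, i.e. the default sl = 1000; RawMatrix is just a name I'm using for the original counts):

RawMatrix <- MRcounts(obj, norm = FALSE)   # the original, unnormalized counts
RawMatrix[1, ] / NormalizedMatrix[1, ]     # ~0.923 and ~0.208 for my two samples,
                                           # i.e. scaling factor / 1000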

Here's a sample of my original OTU table:
OTUId    Sample 1   Sample 2
OTU_1    63460      91939
OTU_2    21         13563
OTU_3    3663       790
OTU_4    7468       8212
OTU_6    375        657
OTU_5    5364       1304
OTU_8    1082       558
OTU_11   520        22982

A sample of normalized OTU table from metagenomeSeq:
Taxa and Samples   Sample 1      Sample 2
OTU_1              68754.06284   442014.4231
OTU_2              22.75189599   65206.73077
OTU_3              3968.580715   3798.076923
OTU_4              8091.007584   39480.76923
OTU_6              406.283857    3158.653846
OTU_5              5811.48429    6269.230769
OTU_8              1172.264355   2682.692308
OTU_11             563.3802817   110490.3846

A sample of normalized OTU table from Qiime's normalize_table.py:
#OTU ID   Sample 1   Sample 2
OTU_1     16.069     18.754
OTU_2     4.57       15.993
OTU_3     11.955     11.891
OTU_4     12.982     15.269
OTU_6     8.6699     11.626
OTU_5     12.505     12.614
OTU_8     10.196     11.39
OTU_11    9.1405     16.754

Here are the stats (same for metagenomeSeq and Qiime):
Subject    Scaling factor   Quantile value   Number of identified features   Library size
Sample 1   923              97               108                             141101
Sample 2   208              10               137                             176123

Colin Brislawn

Apr 25, 2016, 6:44:11 PM
to Qiime 1 Forum
Hello Ryoko,

As you probably already know, qiime is fully open source, so you can take a look at the code qiime uses to run metagenomeSeq. Based on this line, it looks like qiime is writing log-transformed counts.

This matches the example data you provided, too. 
OTU_1 68754.06284
log2(68754.06284) == 16.0691573479
OTU_1 16.069
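
You can reproduce the whole chain from the stats table in R (a sketch, assuming metagenomeSeq's default scaling of sl = 1000):

css <- 63460 / 923 * 1000   # raw count / scaling factor * 1000 = 68754.06...
log2(css)                   # ~16.069, the value in the qiime table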


Does that answer your question?
Colin

Ryoko Oono

Apr 25, 2016, 8:23:09 PM
to Qiime 1 Forum
Thank you! Yes, that makes a lot of sense now. 

Jay T

Jul 25, 2016, 11:52:22 AM
to Qiime 1 Forum
Will the log-transformed data affect any downstream processes? I read Paul's paper and I plan on normalizing my raw OTU table (otu_table_wc2_no_pynast_failures.biom) and using it for alpha-diversity analyses. What do you guys think?

Jay T

Jul 25, 2016, 7:13:14 PM
to qiime...@googlegroups.com
Colin - I should add this bit of information... I originally had 171 samples. I filtered my samples so that I only worked with ones with at least 15,000 sequences; however, the variance in library size is large. Anyhow, I ran the normalize_table.py script on my raw OTU table. When I imported it into phyloseq, it gave this error:

> ## Merge the three objects together in phyloseq
> testdata=merge_phyloseq(biomfile,tree,map)
> print(testdata)
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 90862 taxa and 148 samples ]
sample_data() Sample Data:       [ 148 samples by 8 sample variables ]
tax_table()   Taxonomy Table:    [ 90862 taxa by 7 taxonomic ranks ]
phy_tree()    Phylogenetic Tree: [ 90862 tips and 90848 internal nodes ]
> # Alpha_diversity #Best Plots
> plot_richness(testdata, x="Description", color="Site_Tissue", measures=c("Chao1", "Shannon"))
Error in estimateR.default(newX[, i], ...) : 
  function accepts only integers (counts)
In addition: Warning message:
In estimate_richness(physeq, split = TRUE, measures = measures) :
  The data you have provided does not have
any singletons. This is highly suspicious. Results of richness
estimates (for example) are probably unreliable, or wrong, if you have already
trimmed low-abundance taxa from the data.

We recommend you find the un-trimmed data and retry.

So the script worked, but it seems to have removed my singletons. I should also mention that I ran the filter_otus_through_otu_table.py script in an old workflow with the setting n=2 for singleton removal, and I noticed my files did not decrease in size. Is it possible I never had many to begin with?

Colin Brislawn

Jul 25, 2016, 7:29:53 PM
to Qiime 1 Forum
Hello Jay,

Will the log transformed data affect any downstream processes? 
I'm not sure how this transformation will affect the different alpha and beta diversity metrics you plan to calculate. Perhaps a qiime dev can speak more about the implications of a log transform on the stats.

So the script worked but it seems to have removed my singletons.
I'm not sure at which step your singletons are removed...  
I should also mention that I ran the filter_otus_through_otu_table.py script in a old workflow using the setting n=2 for singleton removal and I noticed my files did not decrease in size. Is it possible I never had many to begin with?
Good question...

I need to mention here that singleton refers to 'reads that appear once', but people use it in different ways. For example, when phyloseq says
"The data you have provided does not have any singletons"
it means 'reads that appear once in a single sample.' This is pretty suspicious, but not biologically impossible.

When you run the qiime script filter_otus_through_otu_table.py with the setting n=2, these singletons are 'reads that appear once in the whole study.'

It's pretty common to remove 'reads that appear once in the whole study.' In fact, some of the OTU picking scripts in qiime already do this automatically, which is why you may not have removed any reads. It is, however, much more contentious to remove 'reads that appear once in a single sample,' which is why phyloseq returns a warning.
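
If you want to check both senses directly, something like this in phyloseq should work (a minimal sketch; ps stands in for your imported phyloseq object):

sum(taxa_sums(ps) == 1)    # OTUs seen exactly once across the whole study
sum(otu_table(ps) == 1)    # counts of exactly 1 in a single sample (what the warning checks)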


Unfortunately, I don't know enough about metagenomeSeq to speculate as to where 'reads that appear once in a single sample' are going. Have you considered importing your raw OTU table to see if it returns the same error?
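
One more thought: the error itself ("function accepts only integers (counts)") points at the normalization, since CSS-normalized tables are no longer integer counts. A quick check (a sketch, reusing your testdata object from above; mat is just a name for the extracted matrix):

mat <- as(otu_table(testdata), "matrix")   # pull the abundance matrix out of phyloseq
all(mat == round(mat))                     # TRUE for raw counts, FALSE after CSS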

Colin

Jay T

Jul 25, 2016, 8:24:06 PM
to qiime...@googlegroups.com
I ended up using the raw OTU table and it worked just fine. The results seem pretty similar to the alpha diversity metrics I produced with core_diversity_analyses, which, if I'm not mistaken, rarefies the OTU table to an even sampling depth (I used 15,000 sequences per sample). I'm pretty satisfied and plan on reporting both results. Thanks Colin.

Richard Rodrigues

Feb 7, 2017, 8:17:12 PM
to Qiime 1 Forum
Hi,

A quick question: does it do log2(x) or log2(1+x)? I ask because when I unlog the CSS-normalized data, I do not see any 0s; I see 1s. I am wondering if this is because of the log transformation or because the CSS normalization sets some minimum.

Thanks.

-Richard

Colin Brislawn

Feb 7, 2017, 11:47:16 PM
to Qiime 1 Forum
Hello Richard,

I'm looking through the code, and I'm not finding a line where qiime adds a pseudocount of 1. However, the metagenomeSeq paper that introduces CSS normalization does talk about this added pseudocount.

Adding pseudocounts is also contentious; the practice has been criticised because the choice of pseudocount makes a difference.

So yes, I think it's log2(1+x) as you suggested.
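
This is easy to confirm with the OTU_2 / Sample 1 values from earlier in the thread:

css <- 21 / 923 * 1000   # 22.7519, the metagenomeSeq CSS value
log2(css)                # ~4.508 -- does not match qiime's table
log2(1 + css)            # ~4.570 -- matches the 4.57 qiime reports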
Good catch!
Colin

Richard Rodrigues

Feb 8, 2017, 12:27:29 PM
to Qiime 1 Forum
Colin,

Thanks for confirming the 1+x, and also for the link to an interesting article!

-Rich