CSS algorithm as implemented in Qiime 1.9

Paul Czechowski

Sep 1, 2016, 12:19:01 AM

to Qiime 1 Forum

Dear Qiime developers and community,

I have been using Qiime for the last four years for several publications and generally appreciate this rather well-documented scripting environment. I have a question regarding the CSS algorithm for abundance correction as implemented in Qiime. I use CSS, as called by Qiime, to correct abundances of Illumina sequence data, with the aim of comparing multiple samples with different sequencing coverage while avoiding resampling / rarefaction methods. I import the Qiime-derived (CSS-modified) .biom tables into R via the phyloseq package, mainly (for this project) for analyses of abundance matrices in vegan (samples are rows, OTUs are columns). However, the counts are no longer integers, which in itself appears to be a problem for some distance-based analysis methods implemented in vegan and other packages (e.g. mvabund).

It appears that the CSS abundance values are somehow transformed, and I'd like to know how, i.e. what mathematical function is applied to these counts that makes them non-integers? (Is this just the result of the scaling procedure, or is there a log transformation involved? The CSS paper mentions a log transformation on one occasion.) Perhaps I should use resampling / rarefaction methods to maintain raw count values in abundance-corrected OTU observations? Any experience with this, or comments, would be of great help. Of course I have read the CSS paper, but being a paper in a high-ranking journal, it is quite short and dense, and thus hard for me to understand.

I would appreciate a short comment on this.

Kind regards,

Paul Czechowski

Jose Antonio Navas Molina

Sep 1, 2016, 12:34:13 AM

Hi Paul,

I've redirected your question to developers with more experience on this topic.

They'll get back to you soon.

Thanks,

Sophie

Sep 3, 2016, 4:30:38 PM

Hi Paul,

In logUQ scaling, each sample is scaled by the 75th percentile of its count distribution, and the data are then log transformed. CSS is similar, except that it allows a flexible scaling factor for each sample, which depends on the distribution of counts in that sample. Only the segment of each sample's count distribution that is relatively invariant across samples is scaled. This mitigates the influence of higher-abundance OTUs on lower-abundance OTUs when the scaling is done.
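To make the difference concrete, here is a rough Python sketch of both ideas for a single sample. It is not the code Qiime or metagenomeSeq actually runs. The function names, the fixed `pct` percentile (the real method picks the percentile from the data), the `scale` constant of 1000, the use of nonzero counts only, and the `log2(x + 1)` transform are all my assumptions for illustration.

```python
import math


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    xs = sorted(values)
    k = (len(xs) - 1) * pct / 100.0
    lo, hi = math.floor(k), math.ceil(k)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def log_uq(sample):
    """logUQ-style: divide by the 75th percentile of nonzero counts, then log2(x + 1)."""
    nonzero = [c for c in sample if c > 0]
    if not nonzero:  # empty sample: nothing to scale
        return [0.0] * len(sample)
    uq = percentile(nonzero, 75)
    return [math.log2(c / uq + 1) for c in sample]


def css_like(sample, pct=50.0, scale=1000.0):
    """CSS-style: the scaling factor is the sum of counts at or below a percentile.

    pct and scale are hypothetical fixed values chosen for this sketch.
    """
    nonzero = [c for c in sample if c > 0]
    if not nonzero:
        return [0.0] * len(sample)
    q = percentile(nonzero, pct)
    # Only the low-abundance part of the distribution enters the scaling factor,
    # so a few very abundant OTUs cannot dominate it.
    s = sum(c for c in sample if c <= q)
    return [math.log2(c / s * scale + 1) for c in sample]


# Two samples with the same composition but a 10-fold difference in depth.
shallow = [0, 1, 2, 3, 100]
deep = [0, 10, 20, 30, 1000]
print(css_like(shallow))
print(css_like(deep))
```

In this toy case both samples come out with essentially the same values, and those values are not integers: the division by the scaling factor already produces fractions, and the log transform is applied on top of that.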

This scaling does indeed result in non-integers, which are then log transformed. If you use CSS, it is advisable to double-check that, after normalization, the samples are not clustering by original (pre-normalization) library size, e.g. using PCoA or PERMANOVA. This doesn't happen much with weighted UniFrac, but for metrics like unweighted UniFrac it may be best to just rarefy, depending on how different your library sizes are.
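If you want a quick first look before running a proper PCoA or PERMANOVA, something like the sketch below may help. Note that it is a cruder stand-in, not either of those methods: it correlates pairwise Bray-Curtis distances on the normalized table with pairwise differences in raw library size (a Mantel-style statistic without the permutation test). The function names and the choice of Bray-Curtis are my own assumptions. Both tables are plain lists of rows, one row per sample.

```python
import math
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined here; treated as "no signal"
    return cov / math.sqrt(vx * vy)


def library_size_signal(raw_counts, normalized):
    """Correlate normalized-table distances with raw library-size differences."""
    sizes = [sum(row) for row in raw_counts]
    dists, size_diffs = [], []
    for i, j in combinations(range(len(sizes)), 2):
        dists.append(bray_curtis(normalized[i], normalized[j]))
        size_diffs.append(abs(sizes[i] - sizes[j]))
    return pearson(dists, size_diffs)
```

A value close to 1 would suggest that samples with similar sequencing depth also look similar after normalization, which is the pattern to worry about. A value near 0 does not prove the absence of a depth effect, so treat this as a screening step only.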