Issues with Deseq2 normalization on combined dataset

Skip to first unread message

Ben Sikes

Oct 26, 2018, 10:46:13 AM10/26/18
to Qiime 1 Forum
Hello all.

I've used DESeq2 multiple times now as a normalization step following McMurdie and Holmes (2014) for ITS2 fungal community data (generated on MiSeq with 300bp paired end reads). I'm doing so within the Qiime (1.9.1) pipeline on a Linux cluster environment. I realize this is still the old Qiime version, but we haven't all migrated yet and in this case I want to keep analyses consistent among older and newer data. Anyhow, I hope someone might be able to shed some light on an issue we're having with one particular dataset and what DESeq2 is doing to it. To be clear, this has worked well for us on multiple datasets, but I'm getting some weird outputs when I combine several datasets to create a global dataset. I'm not certain if it something about combining the data, something to do with the size and sparseness of the dataset, or perhaps something else. Although a much larger matrix/biom file, the data look very similar to the 5 individual datasets on which DESeq2 normalization runs fine.

Here are the leadup scripts and the call for the DESeq2 normalization in Qiime (1.9.1).

    ## 4.3 Normalizing OTU table using Deseq2
        # The data have inherent sampling biases: variation in the number of sequences for a given sample.
        # Historical solution was to rarefy the data; randomly selecting sequences from each sample to an equal depth (a lower depth to include most samples)
        # This method throws out a lot of data, which we have already done multiple times above through quality filtering and occurence thresholds.
        # New papers (McMurdie and Holmes 2014) suggests algorithms can be used to transform OTU tables and normalize data for sequencing depth variation among samples.
        # This process requires some formatting steps as DeSeq2 is a R package
        # It may also require some tinkering with admins to insure R is properly accessible through your cluster
        ### 4.3a Filter OTU table to remove samples with <X reads
            # i.e. negative control
            # You can look at the summary file to figure out an appropriate cutoff
            # threshold set at 5000 reads here
           -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/otu_table_mc5.biom -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.biom -n 5000
        ### 4.3b Convert biom file from HDF5 to JSON format
        biom convert -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.biom -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.txt --to-tsv --table-type "OTU table"
        biom convert -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.txt -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5kJ.biom --to-json --table-type="OTU table"
    # Because uneven sampling depth across samples use the DESeq2 normalization. #For more info look here:
        #Also paper by McMurdie and Holmes 2014 PLoSOne
       -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5kJ.biom -a DESeq2 --DESeq_negatives_to_zero -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k_Dseq.biom

The script completes correctly with no errors, but the main strangeness is that zeros in the matrix before all get inflated to the same fraction across the board.

Pre-normalization data looks like this (Samples labeled as letters in top row, OTU's in 1st column):

# Constructed from biom file


While post-normalization looks like this:

# Constructed from biom file


Again, I recognize this is the not the core application for the DESeq2 vignette and I'm running it through Qiime (1.9.1), but I thought someone might have encountered this problem before.

Here are the specs from my Qiime config file:

System information
         Platform:      linux2
   Python version:      2.7.13 (default, Jul 13 2017, 11:00:39)  [GCC 4.6.4]
Python executable:      /panfs/pfs.local/work/kbs/software/qiime/1.9.1/bin/pytho

QIIME default reference information
For details on what files are used as QIIME's default references, see here:

Dependency versions
          QIIME library version:        1.9.1
           QIIME script version:        1.9.1
qiime-default-reference version:        0.1.3
                  NumPy version:        1.10.0
                  SciPy version:        0.19.1
                 pandas version:        0.20.3
             matplotlib version:        1.5.3
            biom-format version:        2.1.4
                   h5py version:        2.6.0 (HDF5 version: 1.8.17)
                   qcli version:        0.1.1
                   pyqi version:        0.3.2
             scikit-bio version:        0.2.3
                 PyNAST version:        1.2.2
                Emperor version:        0.9.60
                burrito version:        0.9.1
       burrito-fillings version:        0.1.1
              sortmerna version:        SortMeRNA version 2.0, 29/11/2014
              sumaclust version:        SUMACLUST Version 1.0.00
                  swarm version:        Swarm 1.2.19 [Jul 13 2017 14:06:49]
                          gdata:        Installed.

QIIME config values
For definitions of these settings and to learn how to configure QIIME, see here:

                     blastmat_dir:      /panfs/pfs.local/work/kbs/software/blast
      pick_otus_reference_seqs_fp:      /panfs/pfs.local/work/kbs/software/qiime
                  torque_walltime:      5:00:00
                         sc_queue:      all.q
      topiaryexplorer_project_dir:      None
     pynast_template_alignment_fp:      /panfs/pfs.local/work/kbs/software/qiime
pynast_template_alignment_blastdb:      None
assign_taxonomy_reference_seqs_fp:      /panfs/pfs.local/work/kbs/software/qiime
                     torque_queue:      sixhour
                    jobs_to_start:      1
                       slurm_time:      None
                    torque_memory:      100gb
            denoiser_min_per_core:      50
assign_taxonomy_id_to_taxonomy_fp:      /panfs/pfs.local/work/kbs/software/qiime
                         temp_dir:      /panfs/pfs.local/scratch/kbs/b082s740/
                     slurm_memory:      None
                      slurm_queue:      None
                      blastall_fp:      blastall
                 seconds_to_sleep:      1

Qiime is calling to R version 3.4.1 and here are the version specs on that:

> print(version)
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
major          3
minor          4.1
year           2017
month          06
day            30
svn rev        72865
language       R
version.string R version 3.4.1 (2017-06-30)
nickname       Single Candle

and the SessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

Matrix products: default
BLAS: /usr/lib64/
LAPACK: /usr/lib64/atlas/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.1

And DESeq2 is version 1.16.0

Thanks and please let me know if any more information could help in solving.

Colin Brislawn

Oct 29, 2018, 5:53:04 PM10/29/18
to Qiime 1 Forum
Hello Ben,

As you can see, these forums are not really active anymore, but I wanted to see if I could help out. 

So the normalization step transforms your values as follows:
0 -> 0.58947
1 -> 1.3465
2 -> 1.8364

Is this a problem?
What do you expect these values to look like?

Here is the R code that Qiime is running under the hood. Note that in line 34, a pseudocount of 1 is added to all cells in the table, this removing all zeros. Pseudocounts are contentious, so I just wanted to mention this here.


Ben Sikes

Oct 31, 2018, 12:57:50 AM10/31/18
to Qiime 1 Forum
Hi Colin,

Yes I realize these forums aren't monitored much any more, but hoped there were folks like you that might have a clue about these issues. Thanks very much for your reply!

In my experience, when we've done normalization we don't normally have all our zeros turn into real numbers. As I said above, this dataset is made up of several different datasets and we've run them independently. So here a fraction of one of those before DESeq2:

#OTU ID 4326.Soil 5074.Soil 4609.Soil 4756.Soil 14948.Soil 3416.Soil
SH480534.07FU_LC158598_reps_singleton 14 2 5 3 1 2
SH631310.07FU_KX499282_reps_singleton 0 0 0 0 0 0
0 0 1 0 0 0
SH191714.07FU_FJ515208_reps 0 6 0 0 0 0
SH209461.07FU_FJ827731_reps 0 0 0 0 0 0
0 0 0 0 0 1
SH212606.07FU_KF800128_reps 0 0 0 5 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0

And after:

#OTU ID 4326.Soil 5074.Soil 4609.Soil 4756.Soil 14948.Soil 3416.Soil
SH480534.07FU_LC158598_reps_singleton 4.0284 1.0996 2.4803 1.7009 0.20166 1.107
SH631310.07FU_KX499282_reps_singleton 0 0 0 0 0 0
SH197054.07FU_JN164989_refs 0 0 0.19988 0 0 0
SH191714.07FU_FJ515208_reps 0 2.762 0 0 0 0
SH209461.07FU_FJ827731_reps 0 0 0 0 0 0
0 0 0 0 0
SH212606.07FU_KF800128_reps 0 0 0 2.4787 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0

Just to re-emphasize, some of this data is actually in that larger dataset where we are getting the weirdness. As you can see, normalization of that smaller data doesn't turn all the zeros into actual values. Sounds like what you are saying is because of the pseudocount addition of 1, its possible that the conversion of all these 0's to actual numbers is still solid. I guess I just don't understand why it wouldn't have done this in so many previous cases.

Again, I very much appreciate your feedback and wasn't aware of that element of the DESeq2. I appreciate you pointing me to the underlying code and will keep trying to figure out what is going on. If you have any thoughts, I'd love to hear them.


Colin Brislawn

Oct 31, 2018, 9:42:39 AM10/31/18
to Qiime 1 Forum
Hello Ben,

Thanks for the detail! I think we are finding lots of good clues. 

As you can see, normalization of that smaller data doesn't turn all the zeros into actual values. 
Wait, wait, how did you perform normalization of these two runs? With DESeq2 directly in R, or using the qiime script

I should also point out this line and the DESeq_negatives_to_zero option.

My reading of these lines (not a expert) is that deseq through qiime will always at +1 to all values, then optionally 'zero out' values that are <0 after normalization. This could explain how a pseudocount is added to your data AND THEN you still get zeros in your output.

I know we are kind of going down the rabbit hole here, but I think it's totally worthwhile. Backwards-engineering these statistical methods is illuminating. 

I hope this helps,

P.S. My understanding of deseq2 normalization is that it's designed for differential abundance testing, not necessarily for general normalization. What is your overall goal here? There might be a more elegant way to do this!

Ben Sikes

Oct 31, 2018, 3:19:06 PM10/31/18
to Qiime 1 Forum
Hi Colin,

I performed this step through Qiime's script. Perhaps I should try to kick it out and run through R directly. I had seen that option and have always had it in there. It certainly could explain how all the others were zero, but it still would mean that with the combined data set, the DESeq2 normalized values for those are now no longer negative, but positive. I read that options in the same way that you did though. It's certainly possible, but again the consistency of the converted zero values seems to indicate that something funky is going on to me. I guess my thinking is that there might be high variances among these datasets and therefore the means might be really small relative to the variances. When the DESeq2 algorithm goes to work, I think it might be inflating them in this way, but I'm really not sure. I think I will try to do it in R itself and then potentially post something to the DESeq2 forum if I'm seeing the same behavior. They might be better suited to know what's going on under the hood. I very much appreciate your advice though and will post here if I'm able to suss out the issue, whether it means the transformed data is solid or not.

Thanks very much,

PS This type or normalization was something outlined in this paper as an alternative to rarefying:

This is why it (and edgeR) have been incorporated into the newer microbiome pipelines.

Colin Brislawn

Oct 31, 2018, 9:54:20 PM10/31/18
to Qiime 1 Forum
Hello Ben,

Good news! We have reached a consensus! I agree that "something funky" is going on here, but we don't have enough understanding of the algorithms or statistical distributions to understand it. 

I like your suggestions about what to try next. I think validating this funky behaviour in R then reaching out to the DESeq2 devs is the proper way forward. You have probably found this already, but this is the github repo for the newest version of deseq2. I think I've interacted with Mike Love before and he was cool. 

On the topic of rarefaction, have you read the response to that paper from the Knight Lab? It's lead by Sophie Weiss, a statistician in the Knight Lab, and is a level-headed response to the iconic 'waste not' paper. 

Note how Figure 2 of this second paper replicates Figure 4 the initial paper. These two papers are the most understandable applied stats papers I've read, and beautifully frame the contentious nature of statistical normalization.

I hope you find a convincing solution to this problem,

P.S. For Halloween, I'm going as the ghost of scientific contention! So spooky!

Reply all
Reply to author
0 new messages