Hello all.
I've used DESeq2 several times now as a normalization step, following McMurdie and Holmes (2014), for ITS2 fungal community data (generated on a MiSeq with 300 bp paired-end reads). I'm running it within the Qiime (1.9.1) pipeline on a Linux cluster. I realize this is the old Qiime version, but we haven't all migrated yet, and in this case I want to keep analyses consistent between older and newer data. Anyhow, I hope someone can shed some light on an issue we're having with one particular dataset and what DESeq2 is doing to it. To be clear, this has worked well for us on multiple datasets, but I'm getting some strange output when I combine several datasets into a single global dataset. I'm not certain whether it's something about combining the data, something to do with the size and sparseness of the dataset, or perhaps something else. Although it's a much larger matrix/biom file, the combined data look very similar to the 5 individual datasets on which DESeq2 normalization runs fine.
Here are the lead-up scripts and the call for the DESeq2 normalization in Qiime (1.9.1):
## 4.3 Normalizing OTU table using DESeq2
# The data have inherent sampling biases: variation in the number of sequences for a given sample.
# The historical solution was to rarefy the data: randomly selecting sequences from each sample to an equal depth (a lower depth, to include most samples).
# This method throws out a lot of data, which we have already done multiple times above through quality filtering and occurrence thresholds.
# Newer papers (McMurdie and Holmes 2014) suggest algorithms can be used to transform OTU tables and normalize data for sequencing-depth variation among samples.
# This process requires some formatting steps, as DESeq2 is an R package.
# It may also require some tinkering with admins to ensure R is properly accessible through your cluster.
### 4.3a Filter OTU table to remove samples with <X reads
# i.e. negative control
# You can look at the summary file to figure out an appropriate cutoff
# threshold set at 5000 reads here
filter_samples_from_otu_table.py -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/otu_table_mc5.biom -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.biom -n 5000
### 4.3b Convert biom file from HDF5 to JSON format
biom convert -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.biom -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.txt --to-tsv --table-type "OTU table"
biom convert -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k.txt -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5kJ.biom --to-json --table-type="OTU table"
# Because of uneven sampling depth across samples, use the DESeq2 normalization.
# For more info, see: http://qiime.org/scripts/normalize_table.html
# Also the paper by McMurdie and Holmes 2014, PLoS ONE
normalize_table.py -i $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5kJ.biom -a DESeq2 --DESeq_negatives_to_zero -o $WORK/working/FireFun_comb/FF14_16_5OTU_sosu/FF14_16_5k_Dseq.biom
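For context on what the DESeq2 step is doing under the hood: as I understand it, DESeq2's default size-factor estimate is the median-of-ratios method, which only uses OTUs with nonzero counts in every sample, so on a very sparse table almost nothing is left to estimate from. A toy sketch of that calculation in plain Python (made-up counts, not the actual DESeq2 code):

```python
import math

# Hypothetical OTU-by-sample count matrix (rows = OTUs, columns = samples).
counts = [
    [10, 20, 30],
    [ 5, 10, 15],
    [ 0,  8,  4],   # the zero excludes this OTU from the reference set
]

def size_factors(mat):
    """Median-of-ratios sketch: for each sample, the median ratio of its
    counts to the per-OTU geometric mean, using only OTUs that are
    nonzero in every sample."""
    usable = [row for row in mat if all(c > 0 for c in row)]
    if not usable:
        # this is the failure mode a very sparse table can hit
        raise ValueError("no OTU has nonzero counts in every sample")
    gmeans = [math.exp(sum(math.log(c) for c in row) / len(row))
              for row in usable]
    factors = []
    for j in range(len(mat[0])):
        ratios = sorted(row[j] / g for row, g in zip(usable, gmeans))
        mid = len(ratios) // 2
        med = (ratios[mid] if len(ratios) % 2
               else (ratios[mid - 1] + ratios[mid]) / 2)
        factors.append(med)
    return factors

print(size_factors(counts))
```

With more zeros sprinkled in, the `usable` set shrinks toward empty, which is why I wonder whether the sparseness of the combined table is the culprit.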
The script completes with no errors, but the main strangeness is that all the zeros in the input matrix get inflated to the same fraction across the board.
Pre-normalization the data look like this (samples labeled as letters in the top row, OTUs in the 1st column):
# Constructed from biom file
#OTU ID | A | B | C | D | E | F | G | H | I | J | K | L | M
SH480534.07FU_LC158598_reps_singleton | 2 | 1 | 1 | 1 | 1 | 29 | 2 | 4 | 2 | 1 | 1 | 2 | 1
SH631310.07FU_KX499282_reps_singleton | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 15 | 0 | 0
SH010074.07FU_HE820663_reps_singleton | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SH086742.07FU_AF033454_refs_singleton | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SH197054.07FU_JN164989_refs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 3 | 0
SH191714.07FU_FJ515208_reps | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 3 | 0 | 0 | 2 | 0 | 0
SH209461.07FU_FJ827731_reps | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
SH211410.07FU_JX317229_reps | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SH013048.07FU_AY909004_reps_singleton | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0
SH194982.07FU_AY015437_refs | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0
While post-normalization looks like this:
# Constructed from biom file
#OTU ID | A | B | C | D | E | F | G | H | I | J | K | L | M
SH480534.07FU_LC158598_reps_singleton | 1.8364 | 1.3465 | 1.3452 | 1.3446 | 1.3441 | 4.9581 | 1.8343 | 2.4869 | 1.8354 | 1.3449 | 1.3439 | 1.835 | 1.3449
SH631310.07FU_KX499282_reps_singleton | 0.58947 | 0.58913 | 0.58804 | 0.58756 | 0.58711 | 0.58829 | 0.58777 | 0.58711 | 0.58863 | 0.58776 | 4.0705 | 0.58829 | 0.58776
SH010074.07FU_HE820663_reps_singleton | 0.58947 | 0.58913 | 0.58804 | 0.58756 | 0.58711 | 0.58829 | 0.58777 | 0.58711 | 0.58863 | 0.58776 | 0.58692 | 0.58829 | 0.58776
SH086742.07FU_AF033454_refs_singleton | 0.58947 | 0.58913 | 0.58804 | 0.58756 | 0.58711 | 0.58829 | 0.58777 | 0.58711 | 0.58863 | 0.58776 | 0.58692 | 0.58829 | 0.58776
SH197054.07FU_JN164989_refs | 0.58947 | 0.58913 | 0.58804 | 0.58756 | 0.58711 | 0.58829 | 0.58777 | 1.3441 | 1.8354 | 0.58776 | 0.58692 | 2.1987 | 0.58776
SH191714.07FU_FJ515208_reps | 0.58947 | 0.58913 | 0.58804 | 0.58756 | 0.58711 | 1.3455 | 2.198 | 2.1972 | 0.58863 | 0.58776 | 1.8333 | 0.58829 | 0.58776
SH209461.07FU_FJ827731_reps | 0.58947 | 0.58913 | 0.58804 | 0.58756 | 0.58711 | 0.58829 | 0.58777 | 0.58711 | 1.3459 | 0.58776 | 0.58692 | 0.58829 | 0.58776
SH211410.07FU_JX317229_reps | 0.58947 | 0.58913 | 0.58804 | 0.58756 | 0.58711 | 0.58829 | 0.58777 | 0.58711 | 0.58863 | 0.58776 | 0.58692 | 0.58829 | 0.58776
SH013048.07FU_AY909004_reps_singleton | 1.3469 | 0.58913 | 0.58804 | 0.58756 | 1.8335 | 0.58829 | 0.58777 | 0.58711 | 0.58863 | 1.8343 | 0.58692 | 1.835 | 0.58776
SH194982.07FU_AY015437_refs | 0.58947 | 0.58913 | 1.3452 | 0.58756 | 0.58711 | 0.58829 | 1.3449 | 0.58711 | 0.58863 | 0.58776 | 0.58692 | 0.58829 | 0.58776
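My rough understanding of the pattern above: if the transform behaves like log2((count + pseudocount) / size_factor), then every zero within a sample maps to exactly the same value, differing only slightly between samples through the size factor, which is what I'm seeing in each column. A toy sketch (plain Python with hypothetical size factors, not DESeq2's actual variance-stabilizing transform):

```python
import math

# Hypothetical per-sample size factors and raw counts for three samples.
size_factors = {"A": 0.95, "B": 1.00, "C": 1.05}
counts = {"A": [0, 0, 2], "B": [0, 1, 0], "C": [0, 0, 0]}

def shifted_log(count, sf, pseudo=0.5):
    # log2 of the count plus a pseudocount, scaled by the size factor,
    # so that zeros stay defined but become a fixed per-sample constant
    return math.log2((count + pseudo) / sf)

transformed = {s: [round(shifted_log(c, size_factors[s]), 4) for c in cs]
               for s, cs in counts.items()}
# Every zero within a sample lands on the same value; it differs only
# slightly across samples because the size factors differ.
print(transformed)
```

If that reading is right, the uniform ~0.58 values would just be "zero, transformed" for each sample rather than a bug per se, but I'd like confirmation.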
Again, I recognize this is not the core application covered in the DESeq2 vignette and I'm running it through Qiime (1.9.1), but I thought someone might have encountered this problem before.
Here are the specs from my Qiime config file:
System information
==================
Platform: linux2
Python version: 2.7.13 (default, Jul 13 2017, 11:00:39) [GCC 4.6.4]
Python executable: /panfs/pfs.local/work/kbs/software/qiime/1.9.1/bin/python
QIIME default reference information
===================================
For details on what files are used as QIIME's default references, see here:
https://github.com/biocore/qiime-default-reference/releases/tag/0.1.3

Dependency versions
===================
QIIME library version: 1.9.1
QIIME script version: 1.9.1
qiime-default-reference version: 0.1.3
NumPy version: 1.10.0
SciPy version: 0.19.1
pandas version: 0.20.3
matplotlib version: 1.5.3
biom-format version: 2.1.4
h5py version: 2.6.0 (HDF5 version: 1.8.17)
qcli version: 0.1.1
pyqi version: 0.3.2
scikit-bio version: 0.2.3
PyNAST version: 1.2.2
Emperor version: 0.9.60
burrito version: 0.9.1
burrito-fillings version: 0.1.1
sortmerna version: SortMeRNA version 2.0, 29/11/2014
sumaclust version: SUMACLUST Version 1.0.00
swarm version: Swarm 1.2.19 [Jul 13 2017 14:06:49]
gdata: Installed.
QIIME config values
===================
For definitions of these settings and to learn how to configure QIIME, see here:
http://qiime.org/install/qiime_config.html
http://qiime.org/tutorials/parallel_qiime.html
blastmat_dir: /panfs/pfs.local/work/kbs/software/blast/2.2.22/data
pick_otus_reference_seqs_fp: /panfs/pfs.local/work/kbs/software/qiime/1.9.1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
torque_walltime: 5:00:00
sc_queue: all.q
topiaryexplorer_project_dir: None
pynast_template_alignment_fp: /panfs/pfs.local/work/kbs/software/qiime/1.9.1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set_aligned/85_otus.pynast.fasta
cluster_jobs_fp: start_parallel_jobs_torque.py
pynast_template_alignment_blastdb: None
assign_taxonomy_reference_seqs_fp: /panfs/pfs.local/work/kbs/software/qiime/1.9.1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
torque_queue: sixhour
jobs_to_start: 1
slurm_time: None
torque_memory: 100gb
denoiser_min_per_core: 50
assign_taxonomy_id_to_taxonomy_fp: /panfs/pfs.local/work/kbs/software/qiime/1.9.1/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt
temp_dir: /panfs/pfs.local/scratch/kbs/b082s740/
slurm_memory: None
slurm_queue: None
blastall_fp: blastall
seconds_to_sleep: 1
Qiime is calling out to R version 3.4.1; here are the version specs on that:
> print(version)
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
and the sessionInfo():
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)
Matrix products: default
BLAS: /usr/lib64/libblas.so.3.2.1
LAPACK: /usr/lib64/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.1
And DESeq2 is version 1.16.0
Thanks, and please let me know if any more information would help in solving this.
Ben