plotHeatmap not sorting regions with missing data correctly


thirdan...@gmail.com

Feb 7, 2017, 2:48:30 PM
to deepTools
Hi all-
I am aligning MNase-seq data around TSSs using computeMatrix and plotHeatmap. In regions where I have reads, values range from 0 to 1. Of my ~28000 RefSeq genes, ~13000 have varying degrees of missing data (positions with no reads).
I am using the default setting, which sorts by mean value across the region (+/- 3 kb of the TSS). Where I have complete coverage (no missing data), the sorting is correct. However, in the top half of my heatmap (where the ~13000 regions with missing data reside), the sorting seems almost random. A number of these regions have very little missing data, and a few (~800) are completely missing.
I would like to sort by mean value so that regions are displayed in descending order regardless of whether they have missing data.
I can send along the heatmap to illustrate what I'm talking about, though I don't seem to be able to upload it here. Does anyone have suggestions?

Devon Ryan

Feb 7, 2017, 3:09:13 PM
to thirdan...@gmail.com, deepTools
This is related to NAs being sorted above other numbers by default in
numpy. At the moment, the only way to prevent this sort of behavior is
by specifying --missingDataAsZero in computeMatrix.
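
A toy illustration of the NumPy behaviour in question (this is not deepTools code; the matrix and variable names are made up for the example): a single NaN makes a row's plain mean NaN, and NaN means are placed at one end of the sort regardless of the real signal in that row, whereas a NaN-ignoring mean ranks the row by the data it does have.

```python
import numpy as np

# Hypothetical matrix: one row per region, one column per bin.
# NaN marks a bin with no data.
matrix = np.array([
    [0.9, 0.8, 0.7],
    [0.2, np.nan, 0.4],
    [0.5, 0.6, 0.4],
])

plain_means = matrix.mean(axis=1)        # the row with a NaN gets a NaN mean
nan_means = np.nanmean(matrix, axis=1)   # NaN bins are ignored instead

# argsort places NaN after every real number, so the partly missing row
# ends up at the extreme of the ordering even though its signal is low.
order_plain = np.argsort(plain_means)
order_nan = np.argsort(nan_means)

print(order_plain)
print(order_nan)
```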

Having said that, I've always found this sort of behavior a bit weird.
There are drop-in replacement functions that handle NAs in the way
you're expecting and those of us on the deepTools team should just sit
down and think whether it really makes sense to continue with the
current behavior or to switch to the aforementioned functions.

BTW, if anyone on this list is strongly in favor of keeping the current
behavior then please speak up! We'd prefer not to overlook use cases.

Devon
--
Devon Ryan, PhD
Bioinformatician / Data manager
Bioinformatics Core Facility
Max Planck Institute for Immunobiology and Epigenetics
Email: dpry...@gmail.com

thirdan...@gmail.com

Feb 7, 2017, 3:14:36 PM
to deepTools, thirdan...@gmail.com
Ah, I see. Would you be willing to share those functions and let me know where I could put them in the code, so that I could change the code running on my local system to handle this the way I described above? Thanks!

Devon Ryan

Feb 7, 2017, 4:13:48 PM
to thirdan...@gmail.com, deepTools
You'll need to change the "sort_groups" function in heatmapper.py.
Either the array isn't properly masked there or
"np.ma.__getattribute__(sort_using)" isn't doing what was originally
expected. If nothing else, slap a "matrix_avgs =
np.nanmean(matrix,axis=1)" in there, presuming you want to sort by mean.
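
A minimal standalone sketch of that idea, assuming a plain 2D NumPy array with NaN for missing bins. The function name order_by_nanmean is hypothetical and this is not the actual body of sort_groups, so it would need adapting to whatever that function does with the resulting order.

```python
import numpy as np

def order_by_nanmean(matrix):
    """Return row indices sorted by descending mean, ignoring NaN bins."""
    matrix = np.asarray(matrix, dtype=float)
    matrix_avgs = np.nanmean(matrix, axis=1)
    # Negating the means gives a descending sort. Rows that are entirely
    # NaN still yield NaN here (with a RuntimeWarning) and are placed last.
    return np.argsort(-matrix_avgs, kind="stable")

# Hypothetical data: the first row has one missing bin.
example = np.array([
    [0.2, np.nan, 0.4],
    [0.9, 0.8, 0.7],
    [0.5, 0.6, 0.4],
])
order = order_by_nanmean(example)
sorted_example = example[order]
```

Indexing the matrix with the returned order keeps each region's bins together while rearranging the rows.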

Devon



Devon Ryan

Feb 8, 2017, 4:16:57 AM
to thirdan...@gmail.com, deepTools
A little update on this: we've come to an internal consensus that this
needs to be changed. The next release will have the behaviour you're
looking for. If you'd like, you can follow my progress in implementing
this here: https://github.com/fidelram/deepTools/issues/478

Devon
--
Devon Ryan, Ph.D.
Email: dpr...@dpryan.com
Data Manager/Bioinformatician
Max Planck Institute of Immunobiology and Epigenetics
Stübeweg 51
79108 Freiburg
Germany

thirdan...@gmail.com

Feb 8, 2017, 1:30:17 PM
to deepTools, thirdan...@gmail.com, dpr...@dpryan.com
Great, thanks for the help!

thirdan...@gmail.com

Mar 8, 2017, 12:42:31 PM
to deepTools, dpr...@dpryan.com, thirdan...@gmail.com
Hey all-
A bit of follow-up on this. I am now using the dev version of deepTools where plotHeatmap essentially ignores missing data for the purposes of sorting by mean, and it's working great.
However, I have some entire regions (6 kb) which have no data, and these still show up at the bottom of the heatmap as bright yellow (missingDataColor) streaks. It would be nice if I could eliminate this without manually going through the regions files and finding the regions where I have no data.
When I have these regions present during plotHeatmap, I receive the following:
/mnt/home/user/anaconda2/lib/python2.7/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
I assume that this is related to the empty regions. Is there a way to modify plotHeatmap such that these regions are simply left out? I know that I could set missing data to zero, then choose to skipZeros. However, setting missing data to zero may drastically alter the mean signal at regions where I have some (but not much) missing data. Is there a workaround for this?
Thanks!
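
For reference, one way to express the filtering described above outside of deepTools, as a rough sketch only: it assumes the per-region values are available as a 2D NumPy array with NaN for missing bins, and the function name drop_empty_regions and the region labels are hypothetical. Only rows with no data at all are dropped, so partly covered regions keep their NaN bins and their means are not pulled toward zero.

```python
import numpy as np

def drop_empty_regions(matrix, region_names):
    """Remove rows in which every bin is NaN; keep partly covered rows."""
    matrix = np.asarray(matrix, dtype=float)
    keep = ~np.isnan(matrix).all(axis=1)   # True where a row has any data
    kept_names = [name for name, k in zip(region_names, keep) if k]
    return matrix[keep], kept_names

# Hypothetical data: the second region has no coverage at all.
values = np.array([
    [0.2, np.nan, 0.4],
    [np.nan, np.nan, np.nan],
    [0.9, 0.8, 0.7],
])
names = ["geneA", "geneB", "geneC"]

filtered, kept = drop_empty_regions(values, names)
```

With the empty rows gone, np.nanmean over the remaining rows no longer hits an all-NaN slice.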

Devon Ryan

Mar 8, 2017, 2:42:58 PM
to thirdan...@gmail.com, deepTools, dpr...@dpryan.com
I'll have to think about whether there's a good way to implement this.
You can follow the status of this here:
https://github.com/fidelram/deepTools/issues/490