Read Mapping effect on Peak Calling

664 views
Skip to first unread message

ara...@bu.edu

unread,
Nov 4, 2014, 5:13:23 PM11/4/14
to macs-ann...@googlegroups.com
MACS users,

I was hoping to get some of your insight into an issue regarding read mapping and peak calling.  For short-read mapping to the reference genome, do you only use the uniquely mapped reads or do you use uniquely mapped reads plus multi-mapped reads?

For typical DHS samples (40bp reads), we get about 70% of reads map to exactly 1 location (unique mapping) while about 25% align to more than 1 location (multi-mapping).  For Bowtie2, for multi-mapped reads, it randomly chooses the alignment.  This may introduce a bias for repetitive/ambiguous genomic regions that may translate into peak calling.

What is your rationale and general practice regarding read mapping in relation to peak calling?  Is it better to only use the uniquely mapped reads or
use uniquely mapped reads plus multi-mapped reads as input for peak calling?

Thanks,
Andy

Ian Donaldson

unread,
Nov 6, 2014, 4:21:43 AM11/6/14
to macs-announcement
Until recently I have used single-end mapping of 50bp reads using bowtie (v1) using default setting and -m1 (uniquely mapped reads), despite everything I think this method gives good results. 

However, our sequencing facility is now churning out long paired-end reads (HiSeq2500) by default so I am using bowtie2. Bowtie2 does not allow you to explicitly output uniquely mapped reads, so after lots of trials and input from an interested colleague I am now doing the following: using default parameters for paired end mapping and then filtering the resulting SAM file using samtools.  I use 'samtools view -f2 -q30'; -f2 retains concordant reads, whereas -q30 only retains reads with a minimum quality score.  In theory reads with good mapping quality should out weight any issues with multi-mapping reads.  I have seen some nice examples on UCSC sessions where "bogus" peaks sat on repetitive sequence disappear when Q is increased.  The following blog post by Heng Li was pivotal: 'http://lh3lh3.users.sourceforge.net/mapuniq.shtml'.

Importantly mapping is done using reads where adapters have been removed and have been trimmed for quality, in my case using trimmomatic.

I hope this helps.
Ian

--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcem...@googlegroups.com.
To post to this group, send email to macs-ann...@googlegroups.com.
Visit this group at http://groups.google.com/group/macs-announcement.
For more options, visit https://groups.google.com/d/optout.

ara...@bu.edu

unread,
Dec 5, 2014, 12:36:46 PM12/5/14
to macs-ann...@googlegroups.com
Thanks for the response Ian.

I ended up re-analyzing my samples with only uniquely mapped reads and compared MACS2 peak results.  As I mentioned, I now have about 70% of my reads mapping uniquely. 

Comparing the counts of MACS2 peaks; I'm finding that the uniquely mapped read data sets do find fewer peaks (on the order of 100 to a few thousand, representing about 1% to 10% of the total peak count).

Comparing the percentage of reads in MACS2 peaks; I'm finding that the uniquely mapped read data sets also find slightly lower percentages (between 1% and 5%).

I did a bunch of downstream analysis with these datasets (differential peak identification, PCA of samples with differential sites, etc...) and I found no difference in my results (compared to my data using uniquely mapped reads plus multi-mapped reads).

My conclusion is that there's not much of a difference between only using the uniquely mapped reads or using uniquely mapped reads plus multi-mapped reads as input for peak calling.  But if you want greater specificity of your peak profile and avoid bias from repetitive/ambiguous genomic regions, it's better to use uniquely mapped reads only.

Andy
Reply all
Reply to author
Forward
0 new messages