predicted fragment size

1,547 views
Skip to first unread message

Dietmar Rieder

unread,
Feb 12, 2013, 8:11:48 AM2/12/13
to macs-ann...@googlegroups.com
Hi,

I'm analyzing ChIP-seq data (unfortunately no control/input ) and I wonder how reliable the fragment size estimation with this data is:
The Bioanalyzer data shows that most of the fragments are arround 120bps (ranging from 55-200) so I set bw to 120 (is this reasonable?).

MACS determines the following values:
tag size is determined as 40 bps
tags after filtering in treatment: 6376008
number of paired peaks: 524
predicted fragment length is 42 bps

So, the # of paired peaks is rather low, and I wonder how I should interpret the 42bps fragment size? When I look at the aligned/mapped reads (before shifting) at some of the peak regions I can see that many reads on the + strand indeed partly overlap with reads from the - strand and cover regions that extend about 80 - <200 bps.


Thanks
  Dietmar

TRypdal

unread,
Feb 12, 2013, 1:12:05 PM2/12/13
to macs-ann...@googlegroups.com
Hi, are you using MACS14? If so you are likely to be experiencing a known problem MACS14 has wrt fragment estimation. I recommend you to look into some of the earlier posts here. Also double check your results with a cross-correlation estimation method, for example the one proposed in the tool PhantomPeakSeq. Also repeat the analysis using MACS2.

Cheers

Dietmar Rieder

unread,
Feb 12, 2013, 1:29:27 PM2/12/13
to macs-ann...@googlegroups.com
Hi,

Thank you very much for your suggestions.
I'm using MACS14 but I got the same predicted fragment lenght with MACS2. I'll try to verify this by the suggested cross-correlation method.

Dietmar

Anshul Kundaje

unread,
Feb 12, 2013, 2:58:01 PM2/12/13
to macs-ann...@googlegroups.com
For your reference

Here is my post on the issue with the fragment length estimation
https://groups.google.com/d/msg/macs-announcement/XawMJuBLYrc/ErL5oWVUWdYJ

And here is the package http://code.google.com/p/phantompeakqualtools/ that
you can use to re-evaluate the fragment length distribution ..

-Anshul.


--
You received this message because you are subscribed to the Google Groups "MACS announcement" group.
To unsubscribe from this group and stop receiving emails from it, send an email to macs-announcem...@googlegroups.com.
To post to this group, send email to macs-ann...@googlegroups.com.
Visit this group at http://groups.google.com/group/macs-announcement?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Tao Liu

unread,
Feb 12, 2013, 3:09:26 PM2/12/13
to macs-ann...@googlegroups.com
If you got this result from MACS2, try to generate _model.pdf file from _model.r, and check the page 2. If you got this result from MACS1 (I bet so), you can confirm it through MACS2… There is a subcommand named 'predictd'. The current version MACS2 on github site switches to a method similar to SPP's cross-correlation, but instead of looking at all tags along a chromosome, it will call a set of putative enriched regions then investigate the tag distribution there so it 'hopefully' won't be fooled by background. Still there are some tricks, for example, whether you should remove duplicates before predicting fragment size.

Best,
Tao Liu

--
Assistant Professor
Deptartment of Biochemistry
SUNY at Buffalo
NY State Center of Excellence in Bioinformatics & Life Sciences

B2-163 COEBLS
(O) 716-829-2749
tl...@buffalo.edu

Mailing address:
SUNY at Buffalo-COEBLS
701 Ellicott St, B2-163
Buffalo, NY 14203-1221

Anshul Kundaje

unread,
Feb 12, 2013, 3:17:18 PM2/12/13
to macs-ann...@googlegroups.com
Hi Tao,

The MACS2 fragment length estimator is significantly more robust than MACS1.4 but still often makes the same mistake i.e. predicts read length as the fragment length. I would suggest a change in the code: Use the read-length information to explicitly filter the cross-correlation phantom peak at read length (if it is the strongest) and use the second mode if this is the case. This will avoid these problems altogether.

Thanks,
-A

Dietmar Rieder

unread,
Feb 12, 2013, 4:41:58 PM2/12/13
to macs-ann...@googlegroups.com
Hi Tao, Hi Anshul,

I now run macs2 predictd from github and got the following result:
INFO  @ Tue, 12 Feb 2013 22:19:12: # predicted fragment length is 119 bps 
INFO  @ Tue, 12 Feb 2013 22:19:12: # alternative fragment length(s) may be 119 bps

I also run the spp cross-correlation and got the following fragment length estimates:
-15,125,140

The results from both tools are matching the Bioanalyzer data.

So it seem that I should use ~120 bp as fragment size in peak calling and not the estimate from macs14.

Thanks to both of you for your help
  Didi

Dietmar Rieder

unread,
Feb 13, 2013, 7:39:14 AM2/13/13
to macs-ann...@googlegroups.com
Hi Tao,

one more question:

How can it be that I get different fragment length estimates with macs2 predictd when using either sam or bam formated alignments.
In my case, the sam file was generated out of the bwa sam file by the following command:

            samtools view -S -h -F0x4 -q 1 aln.sam -o aln_unique.sam

and the bam file was generated out of this sam file.

           samtools view -bS aln_unique.sam > aln_unique.bam

now the estimates using either aln_unique.sam or aln_unique.bam are quite different (119 vs 46). What did I miss?

Dietmar

Tao Liu

unread,
Feb 13, 2013, 12:02:06 PM2/13/13
to macs-ann...@googlegroups.com
Hi Dietmar,

Hmm. I don't think the difference would be that much. Since you use pipe '> alb_unique.bam', my first suggestion is to check whether your aln_unique.bam file is complete. I usually do this instead:

samtools view -bS aln_unique.sam -o aln_unique.bam

Best,

Tao Liu

--
Assistant Professor
Deptartment of Biochemistry
SUNY at Buffalo
NY State Center of Excellence in Bioinformatics & Life Sciences

B2-163 COEBLS
(O) 716-829-2749
tl...@buffalo.edu

Mailing address:
SUNY at Buffalo-COEBLS
701 Ellicott St, B2-163
Buffalo, NY 14203-1221

Dietmar Rieder

unread,
Feb 13, 2013, 12:43:22 PM2/13/13
to macs-ann...@googlegroups.com
Hi,

from what I see the bam file should be complete, since the same # of mapped reads are present.

samtools view -S -c aln_unique.sam
6427798

samtools view  -c aln_unique.bam
6427798

samtools view -bS aln_unique.sam -o aln_unique.bam
did not change the outcome

It is also interesting that the number of paired peaks is different when using either bam or sam:

macs2 predictd -i  aln_unique.bam -g 1.357e9 
INFO  @ Wed, 13 Feb 2013 18:35:36: tag size is determined as 40 bps 
INFO  @ Wed, 13 Feb 2013 18:35:36: # tag size = 40 
INFO  @ Wed, 13 Feb 2013 18:35:36: # total tags in alignment file: 6427798 
INFO  @ Wed, 13 Feb 2013 18:35:36: # Build Peak Model... 
INFO  @ Wed, 13 Feb 2013 18:35:40: #2 number of paired peaks: 821 
[...]
INFO  @ Wed, 13 Feb 2013 18:36:16: # predicted fragment length is 46 bps 
INFO  @ Wed, 13 Feb 2013 18:36:16: # alternative fragment length(s) may be -400,46 bps 

macs2 predictd -i  aln_unique.sam -g 1.357e9 
INFO  @ Wed, 13 Feb 2013 18:37:41: tag size is determined as 40 bps 
INFO  @ Wed, 13 Feb 2013 18:37:41: # tag size = 40 
INFO  @ Wed, 13 Feb 2013 18:37:41: # total tags in alignment file: 6427798 
INFO  @ Wed, 13 Feb 2013 18:37:41: # Build Peak Model... 
INFO  @ Wed, 13 Feb 2013 18:37:44: #2 number of paired peaks: 568
[...]
INFO  @ Wed, 13 Feb 2013 18:38:19: # predicted fragment length is 119 bps 
INFO  @ Wed, 13 Feb 2013 18:38:19: # alternative fragment length(s) may be 119 bps 

Dietmar

Tao Liu

unread,
Feb 13, 2013, 5:07:19 PM2/13/13
to macs-ann...@googlegroups.com
Hi Dietmar,

Need to apologize first, that I found there is a bug in SAM file parser.  Due to this bug, minus strand tags will be incorrectly move towards left side ( chromosome coordination). So d from SAM file will be larger than actual value. I would trust the result from BAM file. This bug affects SAM and BOWTIE standard output only, and doesn't affect BAM, ELAND, BED formats. I stick to BAM and don't use SAM and BOWTIE nowadays, so I haven't noticed this problem, sorry again. 

There is a difference between fragment size from your sequencing library and the fragment size associated with the binding sites. Theoretically,  if you don't have enough ChIP enrichment in the library, by doing a chromosome-wide cross-correlation like SPP will recover the size you selected during gel cut (= what bioanalyzer gave you). And meanwhile, MACS2 would find fewer candidate regions to do a region-specific cross-correlation, and what it can recover is in most cases something only related to amplification artifact. I can't say the number from bioanalyzer wouldn't  help peak detection since it still provides a way to smooth the data. But it will affect the accuracy of peak summits. I prefer the fragment size associated with the binding sites, or actually I should call it the size of DNA protected by target protein. 

Since you only get hundreds of enriched regions to calculate fragment size, I would suggest you tweak MFOLD -- try maybe "-m 2 50". However, in terms of quality control, when you see fewer pairs from MACS2 to build model *AND* a small fragment size, it's a flag that you need to check your data quality.  For your information, if I run MACS2 on ChIP-EXO, it will get over ten thousands pairs although the fragment size in the end might be only ~10bps -- I would still trust the prediction. 

Best,
Tao Liu

--
Assistant Professor
Deptartment of Biochemistry
SUNY at Buffalo
NY State Center of Excellence in Bioinformatics & Life Sciences

B2-163 COEBLS
(O) 716-829-2749
tl...@buffalo.edu

Mailing address:
SUNY at Buffalo-COEBLS
701 Ellicott St, B2-163
Buffalo, NY 14203-1221

Dietmar Rieder

unread,
Feb 16, 2013, 5:25:48 AM2/16/13
to macs-ann...@googlegroups.com
Hi Tao,

thank you for finding the bug. I switched to bam files as default now too, just to be on the save site :-)
Thanks also for the explanation regarding the fragment size. After all, it seems like that using the shorter fragment size for building the shift model returns better results. When I compare the shifted reads with the mapped reads, the short shiftsize reflects the peak better than the long estimate which broadens the peaks by extending to far on the 3' end.

Dietmar
Reply all
Reply to author
Forward
0 new messages