SAM/BAM to BED

234 views
Skip to first unread message

Assaf Gordon

unread,
Mar 30, 2010, 6:08:47 PM3/30/10
to bedtools...@googlegroups.com
Hello Aaron,

Does the "bamToBed" program knows to properly handle reads with insertions or deletions in the CIGAR field ?

Meaning:
SAM files contains only the start position, the sequence itself and the CIGAR string.

The naive way to calculate the END position is to add the length of the sequence to the start coordinate.
But if the read was mapped with insertions or deletions (as indicated by the CIGAR field), the end coordinate will be incorrect.

Does the program takes the CIGAR into account ?

thanks,
-gordon

Aaron Quinlan

unread,
Mar 30, 2010, 6:34:02 PM3/30/10
to bedtools...@googlegroups.com
It _intends_ to, yes.  For most of the cases I have seen it works great.




--
To unsubscribe, reply using "remove me" as the subject.

Aaron Quinlan

unread,
Mar 30, 2010, 7:47:00 PM3/30/10
to bedtools...@googlegroups.com
Hi Gordon,
I should have taken a second before I responded. The old version did add pads to the sequence and shift the end coordinate accordingly. The new version does not. The rationale was that I wanted the reference coordinates to be "unpadded", instead of "padded". That said, a more proper way to do this would be to optionally allow padded or unpadded.

I will do this for the next release. The current release only obeys the M and N CIGAR operations.

A quick way to fix this if you need to would be to modify the ParseCigarBed() function to increment currPosition accordingly when I or D operations are encountered. One would change the main loop in that function to something like:

for (; cigItr != cigEnd; ++cigItr) {
switch (cigItr->type) {
case('M'): currPosition += cigItr->Length;
case('I'): currPosition += cigItr->Length;
case('D'): currPosition += cigItr->Length;
}
}

Sorry if I confused things...long day.

Aaron

freeseek

unread,
Sep 18, 2013, 11:25:10 AM9/18/13
to bedtools...@googlegroups.com
It would be also nice to have an option to split each read in multiple intervals when there are deletions (especially long deletions). For now I am using this simple script (but I don't know what the "N" and the "P" in the CIGAR mean) instead of bedtools bamtobed:

samtools view in.bam |
  awk '{split ($6,a,"[MIDNSHP]"); bp=$4-1; n=0;
    for (i=1; i<=length(a); i++) {
      n+=1+length(a[i]);
      if (substr($6,n,1)=="M") print $3"\t"bp"\t"(bp+=a[i]);
      if (substr($6,n,1)=="D") bp+=a[i];
    }
  }' > out.bed

It would be nice to have this as an option already coded in bedtools bamtobed. Notice that the code only takes into account columns 3 (chromosome), 4 (base pair start), and 6 (CIGAR) of the BAM file, and it does not use at all column 10 (sequence).

Aaron Quinlan

unread,
Sep 19, 2013, 5:19:24 PM9/19/13
to bedtools...@googlegroups.com
Currently, bamtobed splits on N (spliced alignments, e.g., RNA-seq) CIGAR ops only.  Are you asking for it to split on both N and D ops?
--
You received this message because you are subscribed to the Google Groups "bedtools-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bedtools-discu...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

giulio....@gmail.com

unread,
Sep 19, 2013, 8:22:44 PM9/19/13
to bedtools...@googlegroups.com
On Thursday, September 19, 2013 5:19:24 PM UTC-4, Aaron Quinlan wrote:
Currently, bamtobed splits on N (spliced alignments, e.g., RNA-seq) CIGAR ops only.  Are you asking for it to split on both N and D ops?

Exactly. Though now that I looked into it, what is the real difference between N and D? Anyhow, some option (like "-d") to split on D would be useful.

giulio....@gmail.com

unread,
Sep 27, 2013, 12:06:09 PM9/27/13
to bedtools...@googlegroups.com, giulio....@gmail.com
On Thursday, September 19, 2013 8:22:44 PM UTC-4, giulio....@gmail.com wrote:

Exactly. Though now that I looked into it, what is the real difference between N and D? Anyhow, some option (like "-d") to split on D would be useful.

I actually only now noticed (as of bedtools v2.16.1) that the "-split" option in "bedtools bamtobed" only splits on the "N" while the "-split" option in "bedtools coverage" splits both on "N" and on "D". This seems a bit inconsistent.

Aaron Quinlan

unread,
Oct 1, 2013, 10:30:54 AM10/1/13
to bedtools...@googlegroups.com, giulio....@gmail.com
Hi,

Thanks for your suggestions. Last night I added a new -splitD option to bamtobed that will split both on N and D operators.  You can find this change in the github repository: https://github.com/arq5x/bedtools

Reply all
Reply to author
Forward
0 new messages