BED to SAM/BAM

1,600 views
Skip to first unread message

Assaf Gordon

unread,
Apr 9, 2010, 10:47:14 AM4/9/10
to bedtools...@googlegroups.com
Hello Aaron,

Thanks for your detailed reply regarding the SAM-to-BED.

I now in need for the exact opposite:
I have a 12-column BED file (with blocks/exons > 1), and would like to convert it to SAM/BAM.

Do you (or other mailing-list members) know of a tool to convert such a BED file to SAM/BAM ?
(while preserving the exons, probably with the correct CIGAR string encoding).


Thanks,
-gordon

Aaron Quinlan

unread,
Apr 9, 2010, 1:09:33 PM4/9/10
to bedtools...@googlegroups.com
Hi Gordon,
I wrote a script to convert BED6 to SAM, but have not yet written one for BED12 to SAM. The main challenge for this sort of script is whether or not one wants to include the original sequence for the SAM record. If so, there's more work involved. If not, it's basically a matter of converting the BED blocks to "M" CIGAR operations and the gaps between the blocks to "N" CIGAR operations. I could try to knock this out next week, but I can' promise anything. Alternatively, I'd certainly be happy to have someone contribute such a script... :)

Aaron

> --
> To unsubscribe, reply using "remove me" as the subject.

Aaron Quinlan

unread,
Apr 9, 2010, 2:31:16 PM4/9/10
to bedtools...@googlegroups.com
Another note:

Strictly speaking, BED conversion to BAM would not require the sequence and should be straight-forward. My comment about including the original sequence/quality is more applicable to PSL format (the BLAT output format), which is very similar to BED. The samtools distribution has a Perl script for converting PSL to SAM. I imagine it might be easy to modify it for blocked BED.

Aaron

James M. Ward

unread,
Apr 16, 2010, 11:58:47 AM4/16/10
to bedtools-discuss
Aaron,

First, thanks again for making BEDtools openly available -- I find it
extremely efficient and useful, and has turned into almost a game-
changer in terms of how I organize my workflows now!

I noticed the new bedToBam tool and it works very well for what it
does! Thanks for putting it together. I also noted that it didn't
yet link multiple exons together, as you mentioned, but which I really
need. I was going to write a script to convert Bam-to-SAM, sort by
name, chromosome, start coordinate, then collapse records by name,
then convert back to BAM. I say this in case (1) it could be useful
for you or others, or (2) you are writing code to do this sort of
thing directly within the bedToBam tool (in which case I won't write a
script!) :-)

At this point, my simple script would likely be Perl, so wouldn't be
the most efficient thing in the world. But probably not prohibitory.

Best regards,

James M. Ward

Aaron Quinlan

unread,
Apr 16, 2010, 1:29:22 PM4/16/10
to bedtools...@googlegroups.com
Hi James,

I am still tinkering with bedToBam. By "linking multiple exons together", do you mean that you'd like them to appear like genes on UCSC or IGV, yet in BAM format? If so, have you tried the -bed12 option? I think this should link multiple exons in the way you want. The default behavior assumes "unblocked"/BED6 BED entries.

Let me know if this is not what you mean...it's still very mucha work in progress, but my hope is that it will ultimately become a foundation for converting other similar formats (e.g., PSL, BLAST8, AXT, etc.) to BAM

Best,
Aaron

James M. Ward

unread,
Apr 23, 2010, 12:40:31 PM4/23/10
to bedtools-discuss
Aaron,

Sorry it took a while to reply, it took a while for me to work through
this topic and get time to respond. Many thanks for your promptness,
in fact I did try the -bed12 option the day of your response. The
trouble was that bed4 (or any bed <12) doesn't have aggregated
features, so I basically had to write a script to aggregate multiple
rows of bed4 together into singular rows in bed12 ... ultimately
describing a pseudo-feature which was the total length of the aligned
segments and the gaps between them. It worked, but feels like a
kludge. Obviously it's better to start with something already having
the aggregated feature information, in my case I tend to use PSL since
I have a smooth funnel from PSL to sequence-SAM, to sequence-BAM. And
I like saying BAM!

But this workflow is still good, and needed, in cases where I've used
BEDtools to merge two files together, then aggregate a new set of
feature segments. I don't know how one could automate that process,
since the aggregation step seems situation-dependent. I could
*envision* a way, but that's another topic. :-) Anyway, I wrote a
script which does bed4ToBed12, it just needs to know how to join
things.

The other issue I encountered was logistical wrt my downstream use in
GBrowse 2. It may be similar for other genome browsers, which is why
I mention it. If my newly-aggregated bed12 feature is too huge,
performance is effectively lost when using BAM. I suspect it
supercedes the point of using BAM files to have features 500 kb long.
The only suggestion I can think of is similar to the maxIntron flags
used in other toolsets, since most of these cases have just one
ginormous gap. Then again, it could just be a GBrowse 2 issue.

If I may ask, what do you do with multi-segment features? I love
BEDtools, so I find myself encoding information into column 4 "name"
field to keep track of exons and multi-segment features, like
"exon_1", "exon_2", etc.

I appreciate your proactiveness!

Best,

James

Aaron Quinlan

unread,
Apr 26, 2010, 3:10:16 PM4/26/10
to bedtools...@googlegroups.com
Hi James,

> Sorry it took a while to reply, it took a while for me to work through
> this topic and get time to respond. Many thanks for your promptness,
> in fact I did try the -bed12 option the day of your response. The
> trouble was that bed4 (or any bed <12) doesn't have aggregated
> features, so I basically had to write a script to aggregate multiple
> rows of bed4 together into singular rows in bed12 ... ultimately
> describing a pseudo-feature which was the total length of the aligned
> segments and the gaps between them. It worked, but feels like a
> kludge. Obviously it's better to start with something already having
> the aggregated feature information, in my case I tend to use PSL since
>
I now understand you dilemma; I previously misunderstood what you were getting at. Your script to "stitch" BED4 to BED12 sounds very useful. I assume it requires that the BED4 entries are sorted by name, then by start?

> I have a smooth funnel from PSL to sequence-SAM, to sequence-BAM. And
> I like saying BAM!

Really? With proper SAM CIGAR strings, sequence and quals from the PSL alignments? If so, I would love to see that, as I have been putting off writing something like this for a week or so.

> Anyway, I wrote a
> script which does bed4ToBed12, it just needs to know how to join
> things.
Would you be willing to share this? I could ultimately lift it to C++ and perhaps speed it up. I think this could be very useful for folks.

> The other issue I encountered was logistical wrt my downstream use in
> GBrowse 2. It may be similar for other genome browsers, which is why
> I mention it. If my newly-aggregated bed12 feature is too huge,
> performance is effectively lost when using BAM. I suspect it
> supercedes the point of using BAM files to have features 500 kb long.
> The only suggestion I can think of is similar to the maxIntron flags
> used in other toolsets, since most of these cases have just one
> ginormous gap. Then again, it could just be a GBrowse 2 issue.
I haven't tested this directly in IGV, but I've had good luck with it while using my BAM representation of genes in BAM format. Perhaps you should try it out...

> If I may ask, what do you do with multi-segment features? I love
> BEDtools, so I find myself encoding information into column 4 "name"
> field to keep track of exons and multi-segment features, like
> "exon_1", "exon_2", etc.

I mainly use it as a compact, indexed method for displaying gene annotations and structural rearrangements.

> I appreciate your proactiveness!
My reply took 4 days. Don't speak too soon!

All the best,
Aaron

Aaron Quinlan

unread,
Apr 27, 2010, 8:46:42 AM4/27/10
to bedtools...@googlegroups.com
Hi James,
With respect to your last question and for the sake of others, I should also mention that another convenient use of bedToBam is for loading _large_ BED files into viewers such as Gambit, Tablet or IGV. For example, I recently converted dbSNP130 to BAM and loaded it into IGV in order to interpret some results. Once sorted and indexed with samtools, such large BAM files play much more nicely with viewers.

Aaron

James M. Ward

unread,
May 3, 2010, 6:44:10 PM5/3/10
to bedtools-discuss
Aaron,

I promise to send code as I mentioned, honest! :-)

In the meantime I wante to add that I did need to alter the cpp code
due to a problem I observed. It started creating files tens of
gigabytes in size, mostly or entirely with N's. I added a "-noseq"
option so it would insert a * for sequence and for quality. Otherwise
I'm not sure how bedToBam would know the sequence to use, because even
if it did subseq the genome (which may have had an issue in my case)
the genome sequence wouldn't be the one I'm interested in. Again
though, I didn't need a sequence in the file, and I can't speak for
what other viewers may need.

Code coming in the next day or so. Any good way to send it to you,
Email or post attachment?

Aaron Quinlan

unread,
May 3, 2010, 7:45:04 PM5/3/10
to bedtools...@googlegroups.com
Hi James,

> I promise to send code as I mentioned, honest! :-)
No rush or pressure at all. Up to you...your contribution will be properly credited.

> In the meantime I wante to add that I did need to alter the cpp code
> due to a problem I observed. It started creating files tens of
> gigabytes in size, mostly or entirely with N's. I added a "-noseq"
> option so it would insert a * for sequence and for quality. Otherwise
> I'm not sure how bedToBam would know the sequence to use, because even
> if it did subseq the genome (which may have had an issue in my case)
> the genome sequence wouldn't be the one I'm interested in. Again
> though, I didn't need a sequence in the file, and I can't speak for
> what other viewers may need.
This is a good point. There was no intent for bedToBam to extract relevant sequence for a given range from a genome. The "auto-filling" of Ns and default qualities was based on an (perhaps naive) understanding that viewing programs needed this to display the features correctly. In retrospect, this may be incorrect as it is largely a result of very early attempts at viewing BED->BAM conversions in IGV. I will test it tonight and if indeed the seq and qual are not required, I will remove them from the resulting BAM output. If they are required, I will add your -noseq option. Thanks for bringing this up.

> Code coming in the next day or so. Any good way to send it to you,
> Email or post attachment?
You can just send me an offline email. Again, no rush.

Best,
Aaron

Aaron Quinlan

unread,
May 3, 2010, 8:42:25 PM5/3/10
to bedtools...@googlegroups.com
Hi James,
Looks like there is no need to include "dummy" seq and qual entries as the alpha version of bedToBam does. I will remove this from the version that will be in the next release of BEDTools.

Thanks again for bringing this up.
Aaron
Reply all
Reply to author
Forward
0 new messages