BAM Mapping-counts and Colors

Assaf Gordon

unread,

Oct 1, 2010, 4:48:26 PM10/1/10

to bedtools...@googlegroups.com

Hi Aaron,

I have two programs that deal with BAM files, I'm wondering if you'll be willing to add them to BEDTools.

The first program, bamMapCount, reads an entire BAM file and counts how many times each read mapped, then adds a tag to the output bam file with this count. This program can help with down-stream analysis of repeated elements.
Tags look like "XM:i:N" where N is the map-count (e.g 1 is a unique mapper that appears only once in the BAM file).

The second program, bamColorTag, adds a color tag to the BAM file.
The UCSC Genome Browser has a nice feature where one can specify custom BAM tags for the browser to color the reads.
Tags look like "YC:Z:255,0,128" where the value is an RGB triplet.

Combining these two programs, one can visualize BAM files with different colors for unique-mappers or multi-mappers (or many other coloring possibilities).

Example:

I took the UCSC Example Bam file from:
http://genome.ucsc.edu/goldenPath/help/examples/bamExample.bam

And added three tags:
XM:i:N shows how many times the read mapped (in practice, 1 if one end mapped, 2 if both end mapped).
YC:Z:r,g,b purple for negative strand, green for positive strand
YD:Z:r,g,b purple if a single end mapped, gray of both end mapped.

After loading a custom track with this BAM file, one can easily change the coloring by changing the custom track options.

Using the "YC" color tag:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Cdnatata&hgS_otherUserSessionName=bamColorTag_YC

Using the "YD" color tag:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=Cdnatata&hgS_otherUserSessionName=bamColorTag_YD

Comments are welcomed,
-gordon

Ivan Gregoretti

unread,

Oct 1, 2010, 5:08:49 PM10/1/10

to bedtools...@googlegroups.com

Hello Assaf,

The programs sound awesome. I have three questions

1) Would the outputs be compatible with the BAM standard?

2) Would it be possible to create a functionality that will erase the
added tags?

3) Do the programs accept standard input and output?

I have used colour in BED files and it is extremely helpful.

Thanks,

Ivan

Ivan Gregoretti, PhD
National Institute of Diabetes and Digestive and Kidney Diseases
National Institutes of Health
5 Memorial Dr, Building 5, Room 205.
Bethesda, MD 20892. USA.
Phone: 1-301-496-1016 and 1-301-496-1592
Fax: 1-301-496-9878

Assaf Gordon

unread,

Oct 1, 2010, 5:27:45 PM10/1/10

to bedtools...@googlegroups.com

Hello Ivan,

Ivan Gregoretti wrote, On 10/01/2010 05:08 PM:
> 1) Would the outputs be compatible with the BAM standard?

Yes, It's fully compatible, the BAM specs include custom tags.

> 2) Would it be possible to create a functionality that will erase the
> added tags?

Yes, quite easily.

> 3) Do the programs accept standard input and output?

bamMapCount supports STDOUT, but not STDIN - it needs to scan the BAM file twice:
once to count the mappings, once to add the tags.

bamColorTag works with STDIN and OUT.

If anyone is interested, the source (based on BEDTools 0.10) is available here:
http://cancan.cshl.edu/labmembers/gordon/files/BEDTools_MapCount_ColorTag.tar.gz

but the programs are far from final.

-gordon

Aaron Quinlan

unread,

Oct 2, 2010, 8:41:39 AM10/2/10

to bedtools...@googlegroups.com

Hey Gordon,
This looks really good and I'd be happy to include it. I have, however, been working on (ca. 80% complete) a new tool called "tagBam" that is geared for adding, updating, and removing tags. Would you be open to me add your proposed functionality to it? The colorTag functionality would be directly supported as it is now; one merely runs it as follows:

$ tagBam -i <BAM> -t YC -v 255,0,128

The counting functionality is not currently supported, but could easily be added. I would propose to require that the input file be sorted by name, so as to remove the need to do two passes and keep track of the counts for each read. This could use an intractable amount of memory for very large (e.g. TCGA or 1000 Genomes) BAM files. This would also allow stdin and stdout support. Would you be okay with that?

Gratefully,
Aaron

Assaf Gordon

unread,

Oct 4, 2010, 10:58:06 AM10/4/10

to bedtools...@googlegroups.com

Hi Aaron,

Aaron Quinlan wrote, On 10/02/2010 08:41 AM:
> Hey Gordon, This looks really good and I'd be happy to include it.
> I have, however, been working on (ca. 80% complete) a new tool
> called "tagBam" that is geared for adding, updating, and removing
> tags. Would you be open to me add your proposed functionality to it?
>

Of course.

> The colorTag functionality would be directly supported as it is now;
> one merely runs it as follows:
>
> $ tagBam -i <BAM> -t YC -v 255,0,128

Not exactly - because the trick is that the RGB value is not fixed - it is based on other properties of the tag.
My program currently supports three options:
1. Fixed color - just like your tagBam
2. Color based on strand
3. Color based on other numeric tag (and allows the user to specify mapping between integers and RGBs)

In the future more coloring schemes can be added.
This functionally is pretty easy to add, it's encapsulated in its own classes.

>
> The counting functionality is not currently supported, but could
> easily be added. I would propose to require that the input file be
> sorted by name, so as to remove the need to do two passes and keep
> track of the counts for each read. This could use an intractable
> amount of memory for very large (e.g. TCGA or 1000 Genomes) BAM
> files. This would also allow stdin and stdout support. Would you be
> okay with that?
>

Technically it's OK, but that would require sorting the BAM files twice - once be name (for the map-count),
then by chrom/start for the visualization and indexing (or other operations).

I'm worried that two sort operations (if the files are indeed huge) will take a lot of time.

At least in my case, I have enough memory on big machines two allow map-count to hash everything - making it much quicker.
I would guesstimate that most labs (if they deal with huge amount of data) have access to a machine with lots of memory.
An example: running bamMapCount on a compressed BAM file (5.7GB) with 109M unique reads took 7.9GB of RAM. For many common cases (perhaps not 1000 Genoms or TCGA), 100M reads is a good amount to handle.
And while 8GB is a lot, it's not intractable.
I'm quite certain many people have access to these kind of machines.

I would at least ask to keep in-memory counting as an option in the program (maybe not the default operation mode), so that people who can afford running it would benefit from its speed.
I think the common case (even an entire flowcell from a HiSeq Illumina machine, I think is roughly 80M reads that are mappable) is very doable in-memory with today's common hardware.

Thanks again,
-gordon

Aaron Quinlan

unread,

Oct 5, 2010, 10:48:15 AM10/5/10

to bedtools...@googlegroups.com

Hi Gordon,

Thanks for pointing out the subtleties of colorTag; I didn't initially appreciate that. The concept is great; unfortunately, it will require a bit of work to generalize conditional tags based on rules, etc. I have been working on a rule system for adding tags for specific read groups or libraries. I will look closely at your code and see if I can come up with clever ways to generalize the concept. As for optionally counting without requiring a namesorted BAM, I am completely open to that, but I think the RAM usage may be a bot higher as I'd like to stick with vanilla STL constructs (instead of specialized hash_maps, etc.) for licensing concerns and simplicity. That said, you are right that some people have mega-RAM boxes and can afford to skip sorting.

Thanks as always for the great suggestions,
Aaron

Reply all

Reply to author

Forward