Questions regarding CCDS -- 900 million bases?


Luo, Yiming (NIH/NIAMS) [E]

May 28, 2021, 11:55:06 AM
to genome...@soe.ucsc.edu

Dear UCSC support staff,

 

I am a research trainee at NIH and have a question about the CCDS BED files.

 

I have a BAM file (hg19) and need to calculate the depth of coverage at each base in the exonic regions. I downloaded the CCDS BED file from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=1109848749_gHXtAmBqKaR1UOA191fS812ba8pn&clade=mammal&org=Human&db=hg19&hgta_group=genes&hgta_track=ccdsGene&hgta_table=0&hgta_regionType=genome&position=chrX%3A15%2C578%2C261-15%2C621%2C068&hgta_outputType=bed&hgta_outFileName=ccds.bed) and used it with samtools depth (http://www.htslib.org/doc/samtools-depth.html).

 

However, I found over 900 million bases in the CCDS file, far more than the expected size of the exonic regions (roughly 1% of the genome, or about 30 million bases). I checked the BED file manually in R and got a similar result: 995030607 bases across non-overlapping regions.

 

I wonder what the problem might be?

 

Thank you very much!

 

Yiming

 

 

 

----

Yiming Luo, M.D.

Clinical Fellow, Rheumatology

National Institute of Arthritis and Musculoskeletal and Skin Diseases

National Institutes of Health

9000 Rockville Pike, Building 10 Room 10N-311

Bethesda, MD 20850

Tel: 301-480-1819

 

Matthew Speir

May 28, 2021, 4:51:02 PM
to Luo, Yiming (NIH/NIAMS) [E], genome...@soe.ucsc.edu
Hello, Yiming.

Thank you for your question about downloading BED files from the UCSC Genome Browser. 

It sounds like the samtools depth utility is not taking into account the exon/intron structure encoded in the BED format. The 10th, 11th, and 12th columns of the BED format hold the number of exons/blocks, the sizes of those exons/blocks, and the start positions of those exons/blocks relative to the start of the BED item. A tool that reads only the first three columns treats each item as one span from transcript start to transcript end, introns included.
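As a rough illustration of how those three columns describe the blocks, here is a minimal Python sketch that expands one BED12 line into one interval per block. The function name `bed12_to_blocks` and the sample record are made up for this example and are not taken from the CCDS table.

```python
def bed12_to_blocks(line):
    """Expand one tab-separated BED12 line into (chrom, start, end) tuples, one per block."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]
    chrom_start = int(fields[1])
    block_count = int(fields[9])  # column 10: number of blocks
    # Columns 11 and 12 are comma-separated lists that may end with a trailing comma.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    if not (block_count == len(block_sizes) == len(block_starts)):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # Block starts are offsets from chromStart, so add chromStart to get genome coordinates.
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size)
        for size, offset in zip(block_sizes, block_starts)
    ]


# Hypothetical record: a 4000-base span that contains only 600 bases of blocks.
example = "chr1\t1000\t5000\texample\t0\t+\t1000\t5000\t0\t3\t100,200,300,\t0,1500,3700,"
blocks = bed12_to_blocks(example)
print(blocks)                          # [('chr1', 1000, 1100), ('chr1', 2500, 2700), ('chr1', 4700, 5000)]
print(sum(e - s for _, s, e in blocks))  # 600
```

In this made-up record, summing column 3 minus column 2 gives 4000 bases, while the blocks themselves cover 600, which is the same kind of inflation seen when whole-gene records are counted as solid intervals.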

When outputting a BED file, you may want to select the option to create one BED record per coding exon. Note that because the CCDS table contains overlapping transcripts, you may still get multiple entries that cover the same coordinates.

After outputting a BED entry per coding exon in the CCDS table, one of our engineers found that this was fairly close to the 30 million you expected (though I don't think their command accounted for overlapping entries):

awk '{total += $3 - $2;}  END{print total;}' ccds.bed
49525029

That is ~49.5 million bases. Accounting for overlapping entries would likely bring the total closer to your expected 30 million bases.
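One way to account for overlapping entries is to merge overlapping intervals on each chromosome before summing their lengths. The sketch below is a minimal Python version of that idea, not the command our engineer ran; the function name `merged_base_count` and the sample intervals are hypothetical.

```python
from collections import defaultdict


def merged_base_count(intervals):
    """Count bases covered by (chrom, start, end) intervals, counting overlaps once."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    total = 0
    for spans in by_chrom.values():
        spans.sort()  # sort by start so overlapping intervals are adjacent
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current merged interval.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current merged interval and start a new one.
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total


# Hypothetical intervals: the first two on chr1 overlap by 50 bases.
sample = [("chr1", 100, 200), ("chr1", 150, 250), ("chr1", 300, 400), ("chr2", 100, 200)]
print(sum(end - start for _, start, end in sample))  # 400 (naive sum, like the awk command)
print(merged_base_count(sample))                     # 350 (overlap counted once)
```

The naive sum corresponds to what the awk command above computes; the merged count is the non-overlapping total.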

I hope this is helpful. If you have any further questions, please reply to genome...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Training videos & resources: http://genome.ucsc.edu/training/index.html

Want to share the Browser with colleagues? Host a workshop: http://bit.ly/ucscTraining

---

Matthew Speir

UCSC Cell Browser, Quality Assurance and Data Wrangler

Human Cell Atlas, User Experience Researcher

UCSC Genome Browser, User Support

UC Santa Cruz Genomics Institute

Revealing life’s code.




Luo, Yiming (NIH/NIAMS) [E]

Jun 1, 2021, 11:43:28 AM
to Matthew Speir, genome...@soe.ucsc.edu

Yes, it works now. Thank you so much for your support!

 

Yiming
