retrieving entire CDS for human reference genome


Max Shpak

Aug 5, 2013, 7:06:10 PM
to gen...@soe.ucsc.edu
In order to estimate dN/dS for various genes, I need the entire coding sequence of each gene. I have been working with the list of CDS exon sequences provided by the Table Browser for the human reference genome. One problem I'm facing is that if I concatenate them into a single sequence for PAML, HyPhy, etc., I have to deal with the fact that each exon is in a potentially different reading frame.

Therefore, I need to know whether there is an efficient way to extract the entire CDS with the exons already concatenated and adjusted into a single reading frame (starting in frame 0, with a total length that is a multiple of 3). I don't see a CDS option as such listed, although the tables provide coordinates for CDS start/end.

--
=======================
Max Shpak, Ph.D.
NeuroTexas Institute
St. David's Medical Center
1015 East 32nd Street, Suite 404
Austin, TX 78705
(512) 544-8077

Jonathan Casper

Aug 6, 2013, 1:38:44 PM
to Max Shpak, gen...@soe.ucsc.edu

Hello Max,

Thank you for your question about retrieving coding sequences. If you select only coding sequence for your Table Browser output, the reading frame should not matter. Reading frame information is about how many bases of intron lie between adjacent exons and whether that number is a multiple of 3. If you omit the intron regions, the end of the last (possibly partial) codon of one exon lines up exactly with the start of the coding bases of the next exon, so the concatenated sequence stays in a single frame.
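To illustrate the point about omitting introns, here is a minimal Python sketch. The function names (`join_cds`, `codons`) and the toy exon sequences are made up for illustration and are not part of any UCSC tool. It assumes the exon pieces contain only coding bases and are given as plus-strand sequence in ascending genomic order, so a minus-strand gene needs a reverse complement after joining.

```python
# Hypothetical helpers; the exon sequences below are toy data, not real genes.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def join_cds(exon_seqs, strand="+"):
    """Concatenate CDS-only exon pieces given in ascending genomic order.

    Assumes each piece is plus-strand sequence with introns and UTRs
    already removed. Minus-strand genes are reverse complemented.
    """
    joined = "".join(exon_seqs)
    return reverse_complement(joined) if strand == "-" else joined


def codons(cds):
    """Split a CDS into codons, refusing lengths that are not a multiple of 3."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length %d is not a multiple of 3" % len(cds))
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


# Three toy exons of 5 coding bases each: two codons are split across
# exon boundaries, yet the joined sequence reads in one frame.
plus_exons = ["ATGGC", "TGAAC", "GTTAA"]
plus_cds = join_cds(plus_exons, "+")
print(codons(plus_cds))

# The same toy gene as it would appear on the minus strand.
minus_exons = ["TTAAC", "GTTCA", "GCCAT"]
minus_cds = join_cds(minus_exons, "-")
print(codons(minus_cds))
```

In this toy case no exon has a length that is a multiple of 3, but the total is 15 bases, so the joined sequence splits cleanly into five codons.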

You can obtain your desired output from the Table Browser by selecting your preferred genes track and gene identifiers and setting the output type to "sequence". On the next page you can select genomic, protein, or mRNA output. If you want the actual DNA sequence, select genomic. On the final page, deselect everything except CDS exons and choose "One FASTA record per gene". The resulting sequence for each gene will be just the coding regions, one codon after another.
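Before passing the downloaded FASTA to PAML or HyPhy, it may be worth checking each record. The sketch below is one way to do that in plain Python. The function names and the inline example records are hypothetical, and it assumes the standard stop codons TAA, TAG, and TGA. A record that fails a check is not necessarily a download error, since some annotations have incomplete coding regions.

```python
import io

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file-like object."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def check_cds(seq):
    """Return simple sanity checks for one concatenated CDS."""
    seq = seq.upper()
    complete = len(seq) % 3 == 0
    # Only whole triplets are inspected; trailing bases are ignored here.
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    ends_with_stop = complete and seq[-3:] in STOP_CODONS
    body = triplets[:-1] if ends_with_stop else triplets
    return {
        "length_multiple_of_3": complete,
        "starts_with_ATG": seq.startswith("ATG"),
        "ends_with_stop": ends_with_stop,
        "internal_stops": sum(c in STOP_CODONS for c in body),
    }


# Toy records standing in for a Table Browser download (hypothetical names).
example = io.StringIO(">geneA\nATGGCTGAA\nCGTTAA\n>geneB\nATGGCTTAAGC\n")
report = {name: check_cds(seq) for name, seq in read_fasta(example)}
for name, result in report.items():
    print(name, result)
```

To run it on a real download, replace the `io.StringIO` example with `open(...)` on your own file.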

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group


