Extracting just the cds regions of a human genome (hg19)


Manasa Lanka

Mar 3, 2015, 2:27:02 PM
to gen...@soe.ucsc.edu
Hi,

My name is Manasa and I am working on a project at LIAI, San Diego, CA, that requires me to extract just the protein-coding regions from the human genome. I need to extract the CDS start and CDS end positions, along with the corresponding sequences from the reference genome.

I used UCSC's knownGene table to extract the CDS start and end positions (and exon starts and exon ends), and I was planning to extract the sequences by writing code myself, but I figured it must have already been done. I am looking to extract the sequences in the coding region, along with the other start and end position data from the knownGene table. Is there a command-line program that I could use, or can I use the Table Browser directly? If so, how do I use it to achieve these results? I apologize if this is a naive question, but I am stuck on this problem and would really appreciate any help!

Thank you,
Manasa Lanka

Jonathan Casper

Mar 3, 2015, 2:57:17 PM
to Manasa Lanka, gen...@soe.ucsc.edu

Hello Manasa,

Thank you for your question about downloading the CDS sequence of the GRCh37/hg19 human genome assembly. There are command-line programs that will extract DNA sequences for you. We provide one called twoBitToFa, which is available from our download server at http://hgdownload.soe.ucsc.edu/admin/exe/. You can run twoBitToFa without any arguments to see a usage message. In addition to a 2bit file containing the sequence of the source genome, twoBitToFa accepts an optional "-bed=filename" argument, which allows you to provide a BED file that describes the regions to extract (e.g., the CDS exons that you have identified). twoBitToFa respects the full BED12 specification (http://genome.ucsc.edu/FAQ/FAQformat.html#format1), and will both omit introns and reverse complement the sequence of regions on the '-' strand in the output file. Please note, however, that twoBitToFa will also output sequence for UTR regions (regions listed in the exon coordinates, but outside of the cdsStart-cdsEnd range). If this is unacceptable, you may prefer to use the UCSC Table Browser as described below.

A 2bit file containing the sequence of the hg19 genome assembly is provided at http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ (or by clicking "Human" on http://hgdownload.soe.ucsc.edu and following the "Full data set" link for hg19).
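One way to keep UTR sequence out of the twoBitToFa output is to clip each exon to the cdsStart-cdsEnd range before writing the BED file. The following is only a minimal sketch of that idea, not an official UCSC tool. It assumes a tab-separated dump of the hg19 knownGene table with the columns name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds in that order; the function names and file names are hypothetical.

```python
def knowngene_to_cds_bed12(line):
    """Turn one knownGene row into a BED12 line covering only the CDS.

    Returns None for non-coding transcripts (cdsStart == cdsEnd).
    Assumed column order: name, chrom, strand, txStart, txEnd,
    cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...
    """
    f = line.rstrip("\n").split("\t")
    name, chrom, strand = f[0], f[1], f[2]
    cds_start, cds_end = int(f[5]), int(f[6])
    if cds_start >= cds_end:
        return None  # no coding region annotated

    # exonStarts / exonEnds are comma-separated lists with a trailing comma
    starts = [int(x) for x in f[8].rstrip(",").split(",")]
    ends = [int(x) for x in f[9].rstrip(",").split(",")]

    # Clip every exon to the CDS range and drop exons that are pure UTR
    blocks = []
    for s, e in zip(starts, ends):
        s, e = max(s, cds_start), min(e, cds_end)
        if s < e:
            blocks.append((s, e))
    if not blocks:
        return None

    # BED12 requires the first block to start at chromStart and the
    # last block to end at chromEnd, so take both from the clipped blocks
    chrom_start, chrom_end = blocks[0][0], blocks[-1][1]
    sizes = ",".join(str(e - s) for s, e in blocks)
    rel_starts = ",".join(str(s - chrom_start) for s, _ in blocks)
    return "\t".join([
        chrom, str(chrom_start), str(chrom_end), name, "0", strand,
        str(chrom_start), str(chrom_end), "0",
        str(len(blocks)), sizes, rel_starts,
    ])


def convert(in_path, out_path):
    """Write a CDS-only BED12 file from a knownGene dump (hypothetical paths)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            bed = knowngene_to_cds_bed12(line)
            if bed is not None:
                dst.write(bed + "\n")


# Example (hypothetical file names):
# convert("knownGene.txt", "cds.bed")
```

The resulting file could then be passed to twoBitToFa with the "-bed" option, for example as "twoBitToFa -bed=cds.bed hg19.2bit cds.fa", where cds.bed and cds.fa are again placeholder names. The coordinates stay 0-based and half-open, as in both knownGene and BED, so no offset correction is applied.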

Alternatively, you can obtain CDS sequence using the UCSC Table Browser as described in the answer to this question: https://groups.google.com/a/soe.ucsc.edu/d/topic/genome/m4jwD6zITsU/discussion.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group