Download protein-coding gene coordinates

Kostas Tsirigos

Sep 20, 2017, 10:52:50 AM

to gen...@soe.ucsc.edu

Hello,

I am new to the UCSC genome browser and I would like to ask the following:

To download all LINE elements annotated in the Genome Browser, I go to the Table Browser, select "Repeats" under "group", and then filter for LINEs.

Is there a way to download the same information (BED file) for protein-coding genes?

I tried Table Browser -> Genes and Gene Predictions and then selected "UniProt" as the track. This gives me almost what I need, about 23,000 unique genes, but with one problem: I get multiple lines for each protein, e.g.:

chr6	46139583	46139628	Signal peptide	1000	46139583	46139628	0,0,0	1	45	0	signal peptide	amino acids 1-15 on protein Q9Y6X5		Q9Y6X5
chr6	46139628	46143499	Extracellular	1000	46139628	46143499	100,0,0	3	78,1171,224	0,1423,3647	topological domain	amino acids 16-407 on protein Q9Y6X5	Extracellular	Q9Y6X5
chr6	46139628	46143637	Bis(5'-adenosy...	1000	46139628	46143637	0,0,0	3	78,1171,362	0,1423,3647	chain	amino acids 16-453 on protein Q9Y6X5	Bis(5'-adenosyl)-triphosphatase ENPP4	Q9Y6X5
chr6	46139682	46139685	ion-binding	1000	46139682	46139685	0,0,0	1	3	0	metal ion-binding site	amino acid 34 on protein Q9Y6X5	Zinc 1; catalytic	Q9Y6X5
chr6	46139790	46139793	enzyme act site	1000	46139790	46139793	0,0,0	1	3	0	active site	amino acid 70 on protein Q9Y6X5	AMP-threonine intermediate	Q9Y6X5
chr6	46139790	46139793	ion-binding	1000	46139790	46139793	0,0,0	1	3	0	metal ion-binding site	amino acid 70 on protein Q9Y6X5	Zinc 1; catalytic	Q9Y6X5
chr6	46139853	46139856	bind	1000	46139853	46139856	0,0,0	1	3	0	binding site	amino acid 91 on protein Q9Y6X5	Substrate	Q9Y6X5
chr6	46140042	46140045	bind	1000	46140042	46140045	0,0,0	1	3	0	binding site	amino acid 154 on protein Q9Y6X5	Substrate	Q9Y6X5
chr6	46140045	46140048	glyco	1000	46140045	46140048	100100	1	3	0	glycosylation site	amino acid 155 on protein Q9Y6X5	N-linked (GlcNAc...	Q9Y6X5
chr6	46140078	46140081	glyco	1000	46140078	46140081	100100	1	3	0	glycosylation site	amino acid 166 on protein Q9Y6X5	N-linked (GlcNAc...	Q9Y6X5
chr6	46140147	46140150	bind	1000	46140147	46140150	0,0,0	1	3	0	binding site	amino acid 189 on protein Q9Y6X5	Substrate	Q9Y6X5
chr6	46140147	46140150	ion-binding	1000	46140147	46140150	0,0,0	1	3	0	metal ion-binding site	amino acid 189 on protein Q9Y6X5	Zinc 2; catalytic	Q9Y6X5
chr6	46140159	46140162	ion-binding	1000	46140159	46140162	0,0,0	1	3	0	metal ion-binding site	amino acid 193 on protein Q9Y6X5	Zinc 2; catalytic	Q9Y6X5
chr6	46140291	46140294	ion-binding	1000	46140291	46140294	0,0,0	1	3	0	metal ion-binding site	amino acid 237 on protein Q9Y6X5	Zinc 1; catalytic	Q9Y6X5
chr6	46140294	46140297	ion-binding	1000	46140294	46140297	0,0,0	1	3	0	metal ion-binding site	amino acid 238 on protein Q9Y6X5	Zinc 1; catalytic	Q9Y6X5
chr6	46140342	46140345	disulf bond	1000	46140342	46140345	100100100	1	3	0	disulfide bond	amino acid 254 on protein Q9Y6X5	disulfide bond to position 287	Q9Y6X5
chr6	46140408	46141053	glyco	1000	46140408	46141053	100100	2	1,2	643	glycosylation site	amino acid 276 on protein Q9Y6X5	N-linked (GlcNAc...	Q9Y6X5
chr6	46141083	46141086	disulf bond	1000	46141083	46141086	100100100	1	3	0	disulfide bond	amino acid 287 on protein Q9Y6X5	disulfide bond to position 254	Q9Y6X5
chr6	46143283	46143286	ion-binding	1000	46143283	46143286	0,0,0	1	3	0	metal ion-binding site	amino acid 336 on protein Q9Y6X5	Zinc 2; catalytic	Q9Y6X5
chr6	46143433	46143436	glyco	1000	46143433	46143436	100100	1	3	0	glycosylation site	amino acid 386 on protein Q9Y6X5	N-linked (GlcNAc...	Q9Y6X5
chr6	46143457	46143460	disulf bond	1000	46143457	46143460	100100100	1	3	0	disulfide bond	amino acid 394 on protein Q9Y6X5	disulfide bond to position 401	Q9Y6X5
chr6	46143478	46143481	disulf bond	1000	46143478	46143481	100100100	1	3	0	disulfide bond	amino acid 401 on protein Q9Y6X5	disulfide bond to position 394	Q9Y6X5
chr6	46143499	46143562	Transmembrane	1000	46143499	46143562	0,0,100	1	63	0	transmembrane region	amino acids 408-428 on protein Q9Y6X5	Helical	Q9Y6X5
chr6	46143562	46143637	Cytoplasmic	1000	46143562	46143637	100,0,0	1	75	0	topological domain	amino acids 429-453 on protein Q9Y6X5	Cytoplasmic	Q9Y6X5

Is there a way to have 1 line per protein-coding gene?

Thank you,

Kostas

Christopher Lee

Sep 27, 2017, 4:42:14 PM

to Kostas Tsirigos, UCSC Genome Browser Discussion List

Hi Kostas,

Thank you for your question about downloading protein-coding genes. The currently recommended method is to select your gene track of interest and then filter for "cdsStart!=cdsEnd" in the free-form query section. This limits the output to the coding transcripts in that track. However, because a single gene can have multiple alternatively spliced transcripts, the result will still contain multiple entries per locus. Luckily, we also create a table called "knownCanonical", which contains one "canonical" transcript for each locus. You can therefore filter for transcripts that have cdsStart!=cdsEnd and are in the knownCanonical table to get roughly one transcript per locus:

1. Navigate to the Table Browser: https://genome.ucsc.edu/cgi-bin/hgTables.
2. Select your organism and assembly of interest. In the example below I use Human Dec. 2013 GRCh38/hg38 (hg38).
3. Make the following selections:
- group: Genes and Gene Predictions
- track: GENCODE V24
- table: knownGene
4. Click the "create" button next to "filter".
5. In the free-form query box in the "Filter on Fields from hg38.knownGene" section, enter "cdsStart!=cdsEnd" without the quotes.
6. Scroll down to the "Linked Tables" section and check the box next to knownCanonical. Scroll down all the way to the bottom of the page and click "allow filtering using fields in checked tables".
7. In the "hg38.knownCanonical based filters" free-form query box, enter "1" without quotes.
8. Click "submit".
9. Select your output format of interest (BED, custom track, etc.) and whether you would like the results saved to a file.
10. Click "get output".
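
For anyone who prefers to post-process downloaded tables instead, the logic of the steps above can be sketched in Python. This is only an illustration under stated assumptions, not an official UCSC tool: the `Transcript` record, the function name `canonical_coding_bed`, and the example rows are all made up, and it assumes you already have the knownGene fields (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd) and the set of transcript IDs from knownCanonical in memory. Entering "1" in the knownCanonical filter box is treated here as "keep only transcripts that have a row in knownCanonical".

```python
from typing import Iterable, Iterator, NamedTuple, Set, Tuple


class Transcript(NamedTuple):
    """Hypothetical record holding the knownGene fields used below."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int


def canonical_coding_bed(
    known_gene: Iterable[Transcript], canonical_ids: Set[str]
) -> Iterator[Tuple[str, int, int, str, int, str]]:
    """Yield BED6-style tuples for coding transcripts listed as canonical."""
    for tx in known_gene:
        # Same test as the free-form filter cdsStart!=cdsEnd:
        # transcripts with an empty CDS are treated as non-coding.
        if tx.cds_start == tx.cds_end:
            continue
        # Same effect as requiring a match in knownCanonical.
        if tx.name not in canonical_ids:
            continue
        yield (tx.chrom, tx.tx_start, tx.tx_end, tx.name, 0, tx.strand)


# Made-up example rows, not real annotations.
rows = [
    Transcript("tx_coding_canonical", "chr1", "-", 100, 900, 150, 850),
    Transcript("tx_coding_alt", "chr1", "-", 100, 700, 150, 650),
    Transcript("tx_noncoding", "chr1", "+", 2000, 2500, 2500, 2500),
]
canonical = {"tx_coding_canonical", "tx_noncoding"}

bed_lines = [
    "\t".join(str(field) for field in record)
    for record in canonical_coding_bed(rows, canonical)
]
print("\n".join(bed_lines))
```

Of the three example rows, only the first passes both tests: the alternative transcript is coding but not canonical, and the last one is canonical but has an empty CDS.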

Here is a session containing a custom track created with the above steps. Even though the GENCODE V24 track shows multiple coding transcripts for the genes in view (MTOR and ANGPTL7), the Table Browser query limits the output to roughly one protein-coding transcript per gene:
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg38_proteinCodingOnly

Since you are new to the UCSC Genome Browser, you may also find the following pages helpful:
- UCSC Genome Browser - Training
http://genome.ucsc.edu/training/index.html

- UCSC Genome Browser Tutorials by OpenHelix:
http://www.openhelix.com/ucsc

- UCSC Genome Browser Videos:
https://www.youtube.com/channel/UCQnUJepyNOw0p8s2otX4RYQ/videos

Lastly, all emails sent to gen...@soe.ucsc.edu are publicly archived in a Google Group, which you can search for topics of interest, for example:
https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/find$20TSS

Please let us know if you have any further questions!

Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute
