Download protein-coding gene coordinates

Kostas Tsirigos

Sep 20, 2017, 10:52:50 AM
to gen...@soe.ucsc.edu
Hello,

I am new to the UCSC genome browser and I would like to ask the following:

If I want to download all LINE elements annotated in the Genome Browser, I go to the Table Browser, select "Repeats" under "group", and then filter for LINEs.

Is there a way to download the same information (a BED file) for protein-coding genes?
I tried Table Browser -> Genes and Gene Predictions and then selected "UniProt" under "track". This gives almost what I need, which is ~23,000 unique genes, but with one problem: I get multiple lines for each protein, e.g.:

chr6 46139583 46139628 Signal peptide 1000 0 46139583 46139628 0,0,0 1 45 0 signal peptide amino acids 1-15 on protein Q9Y6X5 Q9Y6X5
chr6 46139628 46143499 Extracellular 1000 0 46139628 46143499 100,0,0 3 781,171,224 0,1423,3647 topological domain amino acids 16-407 on protein Q9Y6X5 Extracellular Q9Y6X5
chr6 46139628 46143637 Bis(5'-adenosy... 1000 0 46139628 46143637 0,0,0 3 781,171,362 0,1423,3647 chain amino acids 16-453 on protein Q9Y6X5 Bis(5'-adenosyl)-triphosphatase ENPP4 Q9Y6X5
chr6 46139682 46139685 ion-binding 1000 0 46139682 46139685 0,0,0 1 3 0 metal ion-binding site amino acid 34 on protein Q9Y6X5 Zinc 1; catalytic Q9Y6X5
chr6 46139790 46139793 enzyme act site 1000 0 46139790 46139793 0,0,0 1 3 0 active site amino acid 70 on protein Q9Y6X5 AMP-threonine intermediate Q9Y6X5
chr6 46139790 46139793 ion-binding 1000 0 46139790 46139793 0,0,0 1 3 0 metal ion-binding site amino acid 70 on protein Q9Y6X5 Zinc 1; catalytic Q9Y6X5
chr6 46139853 46139856 bind 1000 0 46139853 46139856 0,0,0 1 3 0 binding site amino acid 91 on protein Q9Y6X5 Substrate Q9Y6X5
chr6 46140042 46140045 bind 1000 0 46140042 46140045 0,0,0 1 3 0 binding site amino acid 154 on protein Q9Y6X5 Substrate Q9Y6X5
chr6 46140045 46140048 glyco 1000 0 46140045 46140048 100100 1 3 0 glycosylation site amino acid 155 on protein Q9Y6X5 N-linked (GlcNAc... Q9Y6X5
chr6 46140078 46140081 glyco 1000 0 46140078 46140081 100100 1 3 0 glycosylation site amino acid 166 on protein Q9Y6X5 N-linked (GlcNAc... Q9Y6X5
chr6 46140147 46140150 bind 1000 0 46140147 46140150 0,0,0 1 3 0 binding site amino acid 189 on protein Q9Y6X5 Substrate Q9Y6X5
chr6 46140147 46140150 ion-binding 1000 0 46140147 46140150 0,0,0 1 3 0 metal ion-binding site amino acid 189 on protein Q9Y6X5 Zinc 2; catalytic Q9Y6X5
chr6 46140159 46140162 ion-binding 1000 0 46140159 46140162 0,0,0 1 3 0 metal ion-binding site amino acid 193 on protein Q9Y6X5 Zinc 2; catalytic Q9Y6X5
chr6 46140291 46140294 ion-binding 1000 0 46140291 46140294 0,0,0 1 3 0 metal ion-binding site amino acid 237 on protein Q9Y6X5 Zinc 1; catalytic Q9Y6X5
chr6 46140294 46140297 ion-binding 1000 0 46140294 46140297 0,0,0 1 3 0 metal ion-binding site amino acid 238 on protein Q9Y6X5 Zinc 1; catalytic Q9Y6X5
chr6 46140342 46140345 disulf bond 1000 0 46140342 46140345 100100100 1 3 0 disulfide bond amino acid 254 on protein Q9Y6X5 disulfide bond to position 287 Q9Y6X5
chr6 46140408 46141053 glyco 1000 0 46140408 46141053 100100 2 1,2 643 glycosylation site amino acid 276 on protein Q9Y6X5 N-linked (GlcNAc... Q9Y6X5
chr6 46141083 46141086 disulf bond 1000 0 46141083 46141086 100100100 1 3 0 disulfide bond amino acid 287 on protein Q9Y6X5 disulfide bond to position 254 Q9Y6X5
chr6 46143283 46143286 ion-binding 1000 0 46143283 46143286 0,0,0 1 3 0 metal ion-binding site amino acid 336 on protein Q9Y6X5 Zinc 2; catalytic Q9Y6X5
chr6 46143433 46143436 glyco 1000 0 46143433 46143436 100100 1 3 0 glycosylation site amino acid 386 on protein Q9Y6X5 N-linked (GlcNAc... Q9Y6X5
chr6 46143457 46143460 disulf bond 1000 0 46143457 46143460 100100100 1 3 0 disulfide bond amino acid 394 on protein Q9Y6X5 disulfide bond to position 401 Q9Y6X5
chr6 46143478 46143481 disulf bond 1000 0 46143478 46143481 100100100 1 3 0 disulfide bond amino acid 401 on protein Q9Y6X5 disulfide bond to position 394 Q9Y6X5
chr6 46143499 46143562 Transmembrane 1000 0 46143499 46143562 0,0,100 1 63 0 transmembrane region amino acids 408-428 on protein Q9Y6X5 Helical Q9Y6X5
chr6 46143562 46143637 Cytoplasmic 1000 0 46143562 46143637 100,0,0 1 75 0 topological domain amino acids 429-453 on protein Q9Y6X5 Cytoplasmic Q9Y6X5

Is there a way to have 1 line per protein-coding gene?

Thank you,
Kostas

Christopher Lee

Sep 27, 2017, 4:42:14 PM
to Kostas Tsirigos, UCSC Genome Browser Discussion List

Hi Kostas,

Thank you for your question about downloading protein-coding genes. The current recommended method is to select your gene track of interest and then filter for "cdsStart!=cdsEnd" in the free-form query section. This limits the output to the coding transcripts in that track, since non-coding transcripts have cdsStart equal to cdsEnd. However, because a single gene can have multiple alternatively spliced transcripts, the result will still contain multiple entries per locus. Fortunately, we provide a table called "knownCanonical", which contains one "canonical" transcript for each locus. You can therefore filter for transcripts that have cdsStart!=cdsEnd and are also in the knownCanonical table to get roughly one transcript per locus:

1. Navigate to the Table Browser: https://genome.ucsc.edu/cgi-bin/hgTables.
2. Select your organism and assembly of interest. In the example below I use Human Dec. 2013 GRCh38/hg38 (hg38).
3. Make the following selections:
- group: Genes and Gene Predictions
- track: GENCODE V24
- table: knownGene
4. Click the "create" button next to "filter".
5. In the free-form query box in the "Filter on Fields from hg38.knownGene" section, enter "cdsStart!=cdsEnd" without the quotes.
6. Scroll down to the "Linked Tables" section and check the box next to knownCanonical. Then scroll to the bottom of the page and click "allow filtering using fields in checked tables".
7. In the "hg38.knownCanonical based filters" free-form query box, enter "1" without quotes.
8. Click "submit".
9. Select your output format of interest (BED, custom track, etc.) and whether you would like the results saved to a file.
10. Click "get output".
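
If you would rather script the same two filters, the steps above can be sketched in Python on tab-separated dumps of the two tables. This is only a minimal sketch under stated assumptions, not an official UCSC tool: the column orders are assumed from the knownGene and knownCanonical table schemas (please check them against your own download), the function names are hypothetical, and the toy rows and transcript IDs are invented for illustration.

```python
import csv
import io

# Assumed column orders for the two tables; verify against the table schema.
KNOWN_GENE_COLS = ["name", "chrom", "strand", "txStart", "txEnd", "cdsStart",
                   "cdsEnd", "exonCount", "exonStarts", "exonEnds",
                   "proteinID", "alignID"]
KNOWN_CANONICAL_COLS = ["chrom", "chromStart", "chromEnd", "clusterId",
                        "transcript", "protein"]


def read_table(handle, columns):
    """Yield one dict per tab-separated row, keyed by the given column names."""
    for row in csv.reader(handle, delimiter="\t"):
        if row and not row[0].startswith("#"):
            yield dict(zip(columns, row))


def canonical_coding_bed(known_gene, known_canonical):
    """Return BED6 tuples for transcripts that are coding and canonical."""
    # Equivalent of checking the knownCanonical linked table with filter "1".
    canonical = {r["transcript"]
                 for r in read_table(known_canonical, KNOWN_CANONICAL_COLS)}
    bed = []
    for tx in read_table(known_gene, KNOWN_GENE_COLS):
        # Equivalent of the free-form filter cdsStart!=cdsEnd.
        is_coding = int(tx["cdsStart"]) != int(tx["cdsEnd"])
        if is_coding and tx["name"] in canonical:
            bed.append((tx["chrom"], int(tx["txStart"]), int(tx["txEnd"]),
                        tx["name"], 0, tx["strand"]))
    return bed


# Toy data (invented): txA.2 is coding but not canonical,
# txB.1 is canonical but non-coding (cdsStart == cdsEnd).
KNOWN_GENE_TEXT = (
    "txA.1\tchr1\t+\t100\t900\t200\t800\t1\t100,\t900,\tP1\tuc1\n"
    "txA.2\tchr1\t+\t100\t700\t200\t600\t1\t100,\t700,\tP1\tuc2\n"
    "txB.1\tchr1\t-\t2000\t2500\t2000\t2000\t1\t2000,\t2500,\t\tuc3\n"
)
KNOWN_CANONICAL_TEXT = (
    "chr1\t100\t900\t1\ttxA.1\ttxA.1\n"
    "chr1\t2000\t2500\t2\ttxB.1\ttxB.1\n"
)

result = canonical_coding_bed(io.StringIO(KNOWN_GENE_TEXT),
                              io.StringIO(KNOWN_CANONICAL_TEXT))
for record in result:
    print("\t".join(str(field) for field in record))
```

With real data you would pass open file handles for the downloaded tables instead of the io.StringIO toy inputs. As with the Table Browser query, the output is one line per canonical coding transcript, so it is roughly, not exactly, one line per gene.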

Here is a session containing a custom track created via the above steps. Even though the GENCODE V24 track shows multiple coding transcripts, the Table Browser query limits the output to a single canonical coding transcript for each of the protein-coding genes in view (MTOR and ANGPTL7):
http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg38_proteinCodingOnly

Since you are new to the UCSC Genome Browser, you may also find the following pages helpful:
- UCSC Genome Browser - Training
http://genome.ucsc.edu/training/index.html

- UCSC Genome Browser Tutorials by OpenHelix:
http://www.openhelix.com/ucsc

- UCSC Genome Browser Videos:
https://www.youtube.com/channel/UCQnUJepyNOw0p8s2otX4RYQ/videos

Lastly, all emails sent to gen...@soe.ucsc.edu are publicly archived in the following Google Group, which you can search for topics of interest, for example:
https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/find$20TSS

Please let us know if you have any further questions!

Thank you again for your inquiry and using the UCSC Genome Browser. If
you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a
publicly-accessible forum. If your question includes sensitive data,
you may send it instead to genom...@soe.ucsc.edu.

Christopher Lee
UCSC Genomics Institute

