Downloading Ensembl canonical transcripts with Gene IDs

948 views
Skip to first unread message

Hossein Asgharian

unread,
Dec 21, 2017, 10:43:51 AM12/21/17
to gen...@soe.ucsc.edu
Dear UCSC genome browser team,

I have downloaded the Human GRCh38 canonical transcripts fasta file from the table browser with the Gencode v22 track (snapshot attached). My problem is that the first line (header line) only contains chromosomal coordinates of the transcript and not the transcript ID or the parent gene ID. Is there an option to include this ID information in the header line? It would be particularly helpful if the annotation were consistent with the Ensembl .gtf file. I could try and extract the IDs from the .gtf using the coordinates from the fasta file but I am not sure how consistent the two files are.   

Thank you very much for your help.

Best,
Hossein
download.png

Cath Tyner

unread,
Dec 22, 2017, 12:12:30 PM12/22/17
to Hossein Asgharian, UCSC Genome Browser Public Help Forum

Hello Hossein,

At this time, we don't have a way to add the transcript or gene ID to the Table Browser sequence output.
However, it may be possible for you to achieve this through scripting.

The knownCanonical table does include an ENSG id, where ENSG id is the "protein" column.

Using the Table Browser, you could output all of the hg38.knownCanonical ENSG IDs along with the associated regions.
You could then get other IDs from related fields, based on the ENSG ID. Finally, you could do some scripting to match
the regions in your sequence output such that you add in the IDs that you would like to appear in the header.

Generally, scripting advice is outside the scope of this mailing list, but I do have some scripting steps that should work for you.
You might want to try the steps below with a smaller region such as chr1 and then do the same steps for the genome.

With the same Table Browser settings you have in your screenshot,
(changing from 'genome' to 'chr1' if you want a smaller example test):

1. Change the output format to "selected fields from primary and related tables".
2. Click "get output" to go to the next step.
3. Select the following checkboxes:

hg38.knownCanonical

chrom
chromStart
chromEnd
transcript (UCSC ID)
protein (ENSG ID)

hg38.kgXref fields

geneSymbol

Linked Tables

knownToEnsembl

4. Click 'allow selection from checked tables' (at the bottom of the form) to go to the next step.

hg38.knownToEnsembl

value (ENST)

5. Click 'get output'

This will give you output like this:

#hg38.knownCanonical.chrom    hg38.knownCanonical.chromStart    hg38.knownCanonical.chromEnd    hg38.knownCanonical.transcript    hg38.knownCanonical.protein    hg38.kgXref.geneSymbol    hg38.knownToEnsembl.value
chr1    169853073    169893959    uc001ggs.5    ENSG00000000457.13    SCYL3    ENST00000367772.8
chr1    169795048    169854080    uc001ggp.4    ENSG00000000460.16    C1orf112    ENST00000359326.8
chr1    27612063    27635277    uc001bom.4    ENSG00000000938.12    FGR    ENST00000374005.7
chr1    196651877    196747504    uc001gtj.4    ENSG00000000971.15    CFH    ENST00000367429.8
chr1    24357004    24413725    uc001bja.4    ENSG00000001460.17    STPG1    ENST00000003583.12

We now have 2 files, one is the sequence file that you already have (I'll call this file "knownCanonical.sequence") where you would like to add the transcript/gene names.
The other file is the one we just created from the Table Browser, which includes all of the related IDs that you would like (I'll call this file "knownCanonical.IDs"). The common
values in both of these 2 files are the regions, so we can use these as a mapping key value after some massaging.

The coords in each file are in different formats (0-start BED vs 1-start positional), which are described here:
http://genome.ucsc.edu/blog/the-ucsc-genome-browser-coordinate-counting-systems/

The knownCanonical.sequence file has regions in the positional format (note the start coord):

$ head -1 knownCanonical.sequence | cut -d " " -f2
range=chr1:17369-17436

This same region in the "knownCanonical.IDs" file is in BED format (note the off-by-one start region): 

$ cat knownCanonical.IDs | grep 17436
chr1    17368    17436    uc031tla.1    ENSG00000278267.1    MIR6859-1    ENST00000619216.1

Let's convert all the regions in the "knownCanonical.IDs" file from BED to positional format by adding 1 to the start.
This will give us the matching key value to make a join with the two files.

First let's look at our grep example from above and make sure the awk command (below) does what we want.
Yes, looks good, 1 was added to the start and the regions are now in positional format, thus matching the regions in the sequence file. 

$ awk '{print $1 ":" ($2 + 1) "-" $3 " " $4 "," $5 "," $6 "," $7}' knownCanonical.IDs | grep 17436
chr1:17369-17436 uc031tla.1,ENSG00000278267.1,MIR6859-1,ENST00000619216.1
Let's go ahead and make the converted file:
$ awk '{print $1 ":" ($2 + 1) "-" $3 " " $4 "," $5 "," $6 "," $7}' knownCanonical.IDs > knownCanonical.IDs.positional

At this point you now have 2 files with a matching key value for positional-formatted region. You will need one final scripting step which
will replace any matching region in knownCanonical.sequence with the matched region in knownCanonical.IDs.positional, replacing it with
the entire line from the matched row in knownCanonical.IDs.positional. To do this, let's change the header in the sequence file so that it
only contains the region. 

cat knownCanonical.sequence| cut -d " " -f2 | sed 's/range=//g' > knownCanonical.sequence.regionOnlyHeader

Example output, note that everything except the region was stripped from the header:
$ head knownCanonical.sequence.regionOnlyHeader
chr1:17369-17436
CTACAGAGGCGACATGGGGGTCAGGCAAGCTGACACCCGCTGTCCTGAGC
CCATGTTCCTCTCCCACA
chr1:29554-31097
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACC
CGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCC
ATCGCTGCCTTCCCCTCAAGCTCAGGGCCAAGCTGTCCGCCAACCTCGGC
TCCTCCGGGCAGCCCTCGCCCGGGGTGCGCCCCGGGGCAGGACCCCCAGC
CCACGCCCAGGGCCCGCCCCTGCCCTCCAGCCCTACGCCTTGACCCGCTT
TCCTGCGTCTCTCAGCCTACCTGACCTTGTCTTTACCTCTGTGGGCAGCT

Now that we have a tidy sequence file for which to match up the key value, let's do the final step of replacing the region-only header
in the sequence file with the matching line in our knownCanonical.IDs.positional file:

awk 'FNR==NR{a[$1]=OFS $2;next} {print $0 a[$1]}' knownCanonical.IDs.positional knownCanonical.sequence.regionOnlyHeader > knownCanonical.sequence.headerWithIDs

Example output, you should now have a sequence file with all of the IDs that you would like:
$ head knownCanonical.sequence.headerWithIDs 
chr1:17369-17436 uc031tla.1,ENSG00000278267.1,MIR6859-1,ENST00000619216.1
CTACAGAGGCGACATGGGGGTCAGGCAAGCTGACACCCGCTGTCCTGAGC
CCATGTTCCTCTCCCACA
chr1:29554-31097 uc057aty.1,ENSG00000243485.3,RP11-34P13.3,ENST00000473358.1
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACC
CGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCC
ATCGCTGCCTTCCCCTCAAGCTCAGGGCCAAGCTGTCCGCCAACCTCGGC
TCCTCCGGGCAGCCCTCGCCCGGGGTGCGCCCCGGGGCAGGACCCCCAGC
CCACGCCCAGGGCCCGCCCCTGCCCTCCAGCCCTACGCCTTGACCCGCTT
TCCTGCGTCTCTCAGCCTACCTGACCTTGTCTTTACCTCTGTGGGCAGCT

At this point you'll want to so some thorough checking.
A quick check on my example file looks good, as I compare my original sequence file with my knownCanonical.sequence.headerWithIDs file:

Comparing the original sequence file:

$ cat knownCanonical.sequence | grep ":" -A2 | head
>hg38_knownCanonical_0 range=chr1:17369-17436 5'pad=0 3'pad=0 strand=+ repeatMasking=none
CTACAGAGGCGACATGGGGGTCAGGCAAGCTGACACCCGCTGTCCTGAGC
CCATGTTCCTCTCCCACA
--
>hg38_knownCanonical_1 range=chr1:29554-31097 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACC
CGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCC
--
>hg38_knownCanonical_2 range=chr1:30366-30503 5'pad=0 3'pad=0 strand=+ repeatMasking=none
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGC

as compared to the final outcome: 
$ cat knownCanonical.sequence.headerWithIDs | grep ":" -A2 | head
chr1:17369-17436 uc031tla.1,ENSG00000278267.1,MIR6859-1,ENST00000619216.1
CTACAGAGGCGACATGGGGGTCAGGCAAGCTGACACCCGCTGTCCTGAGC
CCATGTTCCTCTCCCACA
--
chr1:29554-31097 uc057aty.1,ENSG00000243485.3,RP11-34P13.3,ENST00000473358.1
GTGCACACGGCTCCCATGCGTTGTCTTCCGAGCGTCAGGCCGCCCCTACC
CGTGCTTTCTGCTCTGCAGACCCTCTTCCTAGACCTCCGTCCTTTGTCCC
--
chr1:30366-30503 uc031tlb.1,ENSG00000274890.1,MIR1302-2,ENST00000607096.1
GGATGCCCAGCTAGTTTGAATTTTAGATAAACAACGAATAATTTCGTAGC
There may be better/other scripting solutions out there that better meet your needs, and of course you can always change the scripting steps to rearrange your output.

Also of possible interest:
http://genomewiki.ucsc.edu/index.php/Genes_in_gtf_or_gff_format

Thank you for contacting the UCSC Genome Browser support team. 
Please send new and follow-up questions to one of our mailing lists below:

  * Post to the Public Help Forum: E
mail 
gen...@soe.ucsc.edu
​ or search the Public Archives
​  * Post to the Mirror Help Forum: Email
 
genome...@soe.ucsc.edu 
or search the Mirror Archives​
​  * Confidential/private help: Email
 
genom...@soe.ucsc.edu

Join us on Social Media! FacebookTwitter, Wordpress BlogYouTube
UCSC Genome Browser Announcements List (for new data & software)
Request on-site training & workshops at your institution

​Enjoy,​
Cath
. . .
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute


--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/CAFZPC-%2BD_yrLowDo_BYJt3TcTc-1g3wf_s99Uq30HsJgxS56gQ%40mail.gmail.com.
For more options, visit https://groups.google.com/a/soe.ucsc.edu/d/optout.

Reply all
Reply to author
Forward
0 new messages