extracting promoter, 5'UTR, exons, 3'UTR positions for all genes


Tuna, Salih

Mar 26, 2014, 8:22:45 AM
to gen...@soe.ucsc.edu
Hi,
I would like to extract the following position information for each gene, or for a certain list of genes.
This is for mouse (mm10).

promoter
5’UTR
1st exon
Penultimate exon
Last exon (all exons, basically)
3’UTR
Intron

I was looking at the tables but could not figure out how to do it.

Can you please advise on the best way to get this information?

Best,
Salih

Steve Heitner

Mar 26, 2014, 2:16:49 PM
to Tuna, Salih, gen...@soe.ucsc.edu
Hello, Salih.

You can do this using the Table Browser. If you are unfamiliar with the
Table Browser, please see the User's Guide at
http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html.

The best thing to do would be to select BED output, which allows you to query
coordinates for 5' UTR, 3' UTR, exons, introns, etc., but it doesn't allow
you to select multiple items per query, so you will have to perform multiple
queries. Perform the following steps:

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables

2. Select the following options:
Clade: Mammal
Genome: Mouse
Assembly: Dec. 2011 (GRCm38/mm10)
Group: Genes and Gene Predictions
Track: UCSC Genes or RefSeq Genes
Output format: BED - browser extensible data

3. On the "region" line, select "genome" for the entire genome, select
"position" and enter a position for a single locus, or click the "define
regions" button to specify multiple loci. If you want to search a list of
genes, select "genome" on the "region" line and then click the "paste list"
or "upload list" button on the "identifiers" line to enter your gene IDs.
Note that when you click one of these buttons, you are presented with
examples of the format your identifiers are expected to be in. If your
identifiers don't match these formats, your query will not be successful.

4. Click the "get output" button.

5. The next screen allows you to specify whether you want 5' UTR regions, 3'
UTR regions, etc. Choose your desired options and click the "get BED"
button.

Please contact us again at gen...@soe.ucsc.edu if you have any further
questions. All messages sent to that address are archived on a
publicly-accessible Google Groups forum. If your question includes
sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group


Tuna, Salih

Apr 9, 2014, 7:47:19 AM
to st...@soe.ucsc.edu, Tuna, Salih, gen...@soe.ucsc.edu
Hi Steve,
Thanks for the detailed answer.

I do not understand the difference between the following two entries, which are in the exon list I extracted:

chr17 14399370 14399645 uc008amy.2_exon_10_0_chr17_14399371_f 0 +
chr17 14399370 14399645 uc008ana.2_exon_1_0_chr17_14399371_f 0 +

Can you please explain?

Best,
Salih

Steve Heitner

Apr 9, 2014, 12:46:42 PM
to Tuna, Salih, gen...@soe.ucsc.edu

Hello, Salih.

You are viewing the results of two separate transcripts of the Smoc2 gene in the UCSC Genes track. If you inspect them visually in the Browser by searching for uc008amy.2, you will see that uc008amy.2 spans chr17:14,279,506-14,404,790 while uc008ana.2 spans only chr17:14,399,060-14,404,790. Exon 10 of uc008amy.2 (exon_10 in the name field) shares the same coordinates with exon 1 of uc008ana.2 (exon_1 in the name field).
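To make the name field in those two BED rows easier to compare, here is a minimal Python sketch that splits it into its parts. The layout assumed here (transcript ID, the word "exon", a zero-based block index, a padding value, chromosome, 1-based start, and f/r for strand) is inferred from the two rows quoted in this thread and is an assumption, not a documented format. The function name parse_exon_name is hypothetical.

```python
import re

# Assumed layout of the Table Browser BED name field (inferred from the rows
# quoted in this thread, not from documentation):
# <transcript>_exon_<0-based block index>_<padding>_<chrom>_<1-based start>_<f|r>
NAME_RE = re.compile(
    r"^(?P<transcript>.+)_exon_(?P<index>\d+)_(?P<padding>\d+)"
    r"_(?P<chrom>.+)_(?P<start>\d+)_(?P<strand>[fr])$"
)

def parse_exon_name(name):
    """Split a name such as 'uc008amy.2_exon_10_0_chr17_14399371_f' into fields."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized exon name: {name!r}")
    return {
        "transcript": m.group("transcript"),
        "index": int(m.group("index")),
        "padding": int(m.group("padding")),
        "chrom": m.group("chrom"),
        "start_1based": int(m.group("start")),
        "strand": "+" if m.group("strand") == "f" else "-",
    }

# The two rows from the question above.
rows = [
    "chr17 14399370 14399645 uc008amy.2_exon_10_0_chr17_14399371_f 0 +",
    "chr17 14399370 14399645 uc008ana.2_exon_1_0_chr17_14399371_f 0 +",
]
parsed = [parse_exon_name(row.split()[3]) for row in rows]
for info in parsed:
    print(info["transcript"], "block index", info["index"])
```

Note that the start in the name is one greater than the BED start column, since BED starts are zero-based. Both rows describe the same interval; only the transcript and the block index differ.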






---
Steve Heitner
UCSC Genome Bioinformatics Group

Tuna, Salih

Apr 10, 2014, 7:34:08 AM
to st...@soe.ucsc.edu, Tuna, Salih, gen...@soe.ucsc.edu
Hi Steve,
Does this mean that exon 0 is the first exon, exon 1 is the second exon, etc.?

Unrelated to the original question, but is there an easy way to get the last exon and the penultimate exon of each gene?

best,
Salih

Tuna, Salih

Apr 10, 2014, 12:06:25 PM
to st...@soe.ucsc.edu, Tuna, Salih, gen...@soe.ucsc.edu
I would also like to download the promoter regions. I could not see them in the list. Could they be under a different name? Please advise.

Best,
Salih

Matthew Speir

Apr 10, 2014, 7:10:40 PM
to Tuna, Salih, st...@soe.ucsc.edu, gen...@soe.ucsc.edu
Hi Salih,

Unfortunately, we don't have an easy way for you to get only the last two exons from a gene. You may be able to write a script to pull this information from your previous output file, though.
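One possible shape for such a script, as a rough Python sketch rather than a supported method: it groups the exon rows of a Table Browser BED output by transcript and keeps the last two blocks in transcript order. It assumes six-column BED rows whose name field begins with the transcript ID followed by "_exon_", as in the rows quoted earlier in this thread. It works per transcript, not per gene, and it orders blocks by genomic start and strand instead of trusting the index in the name. The function name last_two_exons is hypothetical.

```python
from collections import defaultdict

def last_two_exons(bed_lines):
    """Map each transcript to its penultimate and last exon blocks.

    Each block is (chrom, start, end, strand) with BED (0-based) coordinates.
    Transcripts with a single block get a one-element list.
    """
    exons = defaultdict(list)
    for line in bed_lines:
        # Skip blank lines and optional header lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, name, _score, strand = line.split()[:6]
        # Assumption: the transcript ID is everything before "_exon_".
        transcript = name.split("_exon_")[0]
        exons[transcript].append((chrom, int(start), int(end), strand))

    result = {}
    for transcript, blocks in exons.items():
        strand = blocks[0][3]
        # Transcript order: ascending start on "+", descending start on "-".
        ordered = sorted(blocks, key=lambda b: b[1], reverse=(strand == "-"))
        result[transcript] = ordered[-2:]
    return result

# Made-up rows for illustration only (not real annotations).
example = [
    "chr1 100 200 txA_exon_0_0_chr1_101_f 0 +",
    "chr1 300 400 txA_exon_1_0_chr1_301_f 0 +",
    "chr1 500 600 txA_exon_2_0_chr1_501_f 0 +",
    "chr2 100 200 txB_exon_0_0_chr2_101_r 0 -",
    "chr2 300 400 txB_exon_1_0_chr2_301_r 0 -",
    "chr2 500 600 txB_exon_2_0_chr2_501_r 0 -",
]
tail = last_two_exons(example)
```

For a minus-strand transcript the last exon is the one with the lowest genomic coordinates, which is why the sort direction depends on strand.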

You are mostly correct about the exon numbers in the Table Browser output. The exon coordinates in the knownGene table are more accurately described as gapless blocks. Most of the time these gapless blocks coincide exactly with the exon boundaries. Aligning the transcripts to the reference genome sometimes introduces indels that create little breaks between these gapless blocks. This means that a single true exon may get broken up into separate pieces in the table - falsely inflating the exon count. There may be cases in your output where these exon numbers do not match up with the true exon number.
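If the inflated block counts matter for your analysis, one rough workaround is to merge blocks that are separated by only a few bases before counting them. The Python sketch below illustrates that idea only; it is not something the Table Browser does. The max_gap threshold is a hypothetical value you would need to choose yourself, and the function name merge_close_blocks is also hypothetical.

```python
def merge_close_blocks(blocks, max_gap=5):
    """Merge (start, end) blocks whose gap is at most max_gap bases.

    max_gap is a hypothetical threshold: gaps this small are treated as
    alignment indels inside one exon rather than as real introns.
    """
    merged = []
    for start, end in sorted(blocks):
        if merged and start - merged[-1][1] <= max_gap:
            # Small gap: extend the previous block instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]

# Made-up coordinates: the first two blocks are 2 bases apart, the third is far away.
blocks = [(100, 200), (202, 300), (1000, 1100)]
approx_exons = merge_close_blocks(blocks)
```

With these made-up coordinates, the first two blocks collapse into one and the distant third block stays separate.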

You can get promoter regions from the Table Browser using steps nearly identical to those described by my colleague Steve. After you've followed the steps one through four he provided, you can get the coordinates for the upstream regions for the genes by selecting the 'Upstream by' option, and entering the size of the regions you're interested in. Then click the "get BED" button.

I hope this is helpful.

Matthew Speir
UCSC Genome Bioinformatics Group


Tuna, Salih

Apr 30, 2014, 11:30:08 AM
to gen...@soe.ucsc.edu, Tuna, Salih, st...@soe.ucsc.edu, Matthew Speir
Hi,
Regarding the promoter regions, how do I get the promoter relative to the TSS of a gene (for example, 3 kb from the TSS)?

Best,
Salih

Steve Heitner

Apr 30, 2014, 12:53:58 PM
to Tuna, Salih, gen...@soe.ucsc.edu, Matthew Speir

Hello, Salih.

Please refer to my colleague’s previous response:

“You can get promoter regions from the Table Browser using steps nearly identical to those described by my colleague Steve. After you've followed the steps one through four he provided, you can get the coordinates for the upstream regions for the genes by selecting the 'Upstream by' option, and entering the size of the regions you're interested in. Then click the "get BED" button.”

There are also downloadable upstream files located at http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/.  Note that these correspond to the RefSeq Genes track and not the UCSC Genes track, though this should not really matter.


Tuna, Salih

Apr 30, 2014, 2:21:39 PM
to <steve@soe.ucsc.edu>, gen...@soe.ucsc.edu, Matthew Speir
My input is a list of genes. If I simply choose n kb upstream, will it be measured from the TSS? That is what I am confused about.

Or does my input need to be the TSS for each gene of interest?

Best,
Salih

Steve Heitner

Apr 30, 2014, 7:26:27 PM
to Tuna, Salih, gen...@soe.ucsc.edu, Matthew Speir

Hello, Salih.

When you choose to display upstream sequence, it automatically begins at the TSS regardless of whether you choose your genes by specifying coordinates or by specifying gene names.  When you specify coordinates, the output includes any genes that intersect those coordinates at any point, so your coordinates do not need to actually include the TSS.  To prove this to yourself, try a query where you specify a single locus (e.g., chr12:56,699,994-56,700,239 which occurs in the middle of Pax9 in mm10) and then try another by specifying a single identifier (e.g., Pax9).  The upstream sequence in your output should be the same in both instances.

Note that if the coordinates you specify intersect multiple genes, your results will include all genes intersected by your coordinates.
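For anyone who wants to check the upstream coordinates by hand, here is a minimal Python sketch of the arithmetic, assuming zero-based, half-open BED-style transcript coordinates. On the plus strand the TSS is at the transcript start; on the minus strand it is at the transcript end, so "upstream" extends toward higher coordinates. The function name upstream_window and the example coordinates are hypothetical, and the sketch does not clip at the end of the chromosome.

```python
def upstream_window(tx_start, tx_end, strand, size):
    """Return (start, end) of the region `size` bases upstream of the TSS.

    Coordinates are 0-based and half-open, as in BED. The start is clamped
    at 0; the end is not clamped to the chromosome length.
    """
    if strand == "+":
        # TSS is tx_start; upstream lies at lower coordinates.
        return max(0, tx_start - size), tx_start
    if strand == "-":
        # TSS is tx_end; upstream lies at higher coordinates.
        return tx_end, tx_end + size
    raise ValueError(f"unexpected strand: {strand!r}")

# Made-up transcript spanning 10000-20000, with a 3 kb upstream window.
plus_window = upstream_window(10000, 20000, "+", 3000)
minus_window = upstream_window(10000, 20000, "-", 3000)
```

The window depends only on the transcript's own start, end, and strand, which is consistent with the point above that the query region does not need to include the TSS.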

