Transcriptional start site

ruvalcabatrejo

Oct 3, 2014, 1:17:49 PM

to gen...@soe.ucsc.edu

Hi,

I figured out how to find the transcriptional start site of genes and the CDS, but I am not sure if the output is correct. I followed the instructions from the link to find the transcriptional start site:

https://groups.google.com/a/soe.ucsc.edu/forum/#!msg/genome/HZqXPD6AJlo/cUYny6TLWRMJ

When I search for the TSS and CDS of gene IL1RN I get the following output:

The first pair shows that the TSS and CDS are the same position. The second set is about 9,000 bp away from each other. Are these numbers correct? What I am trying to do is find the transcription start sites for a list of genes, but I am not sure if I am using the UCSC browser correctly. Thank you.

Jonathan Casper

Oct 7, 2014, 8:53:53 PM

to ruvalcabatrejo, gen...@soe.ucsc.edu

Hello ruvalcabatrejo,

Thank you for your question about finding the transcriptional start site and coding sequence for genes. Unfortunately, your sample output is missing from the question, so I can't advise you about those results.

If you would like to obtain the TSS and CDS start positions for a list of genes, I suggest you use the UCSC Table Browser as follows. Here I will assume that you want to obtain gene coordinates from the UCSC Genes track of the human hg19 genome assembly.

1. Open the UCSC Table Browser at http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options:

Clade: Mammals
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Genes and Gene Predictions
Track: UCSC Genes
Table: knownGene
Region: genome
Output format: selected fields from primary and related tables

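Note: After you select the knownGene table, you can click the "describe table schema" button to get a description of each of the different fields. In particular, for transcripts on the + strand, txStart contains the TSS coordinate and cdsStart is the start position of the CDS. For transcripts on the - strand, the TSS is at txEnd and the CDS begins at cdsEnd, so add the strand field to your output if your list includes genes on both strands. The txStart and cdsStart coordinates are 0-based (see http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms for more information).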

3. For "identifiers (names/accessions)", click "paste list" and paste your list of genes into the text box that appears. Click "submit" when you are done.
4. Click "get output".
5. On the next page, select the following options.

From hg19.knownGene: name, txStart, cdsStart
From the linked hg19.kgXref table: geneSymbol

6. Click "get output".

You will be presented with data for each transcript of your matching genes. Here is the output I got for the IL1RN gene:

#hg19.knownGene.name    hg19.knownGene.txStart    hg19.knownGene.cdsStart    hg19.kgXref.geneSymbol
uc002tix.1    113856936    113856936    IL1RN
uc002tiy.3    113875469    113885303    IL1RN
uc002tiz.3    113875469    113875595    IL1RN
uc002tja.3    113875469    113875595    IL1RN
uc002tjb.3    113885137    113885201    IL1RN
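
If it helps to read these numbers, here is a minimal Python sketch that parses the output above, converts each 0-based txStart to a 1-based position, and reports the genomic distance from txStart to cdsStart. It assumes every transcript is on the + strand (the strand field was not selected in this output), and the function and key names are illustrative only, not part of any UCSC tool.

```python
# Table Browser output for IL1RN, copied from the reply above.
TABLE = """\
#hg19.knownGene.name hg19.knownGene.txStart hg19.knownGene.cdsStart hg19.kgXref.geneSymbol
uc002tix.1 113856936 113856936 IL1RN
uc002tiy.3 113875469 113885303 IL1RN
uc002tiz.3 113875469 113875595 IL1RN
uc002tja.3 113875469 113875595 IL1RN
uc002tjb.3 113885137 113885201 IL1RN
"""


def parse_rows(text):
    """Turn whitespace-separated Table Browser rows into dictionaries."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        name, tx_start, cds_start, symbol = line.split()
        rows.append({
            "name": name,
            "symbol": symbol,
            "tx_start": int(tx_start),
            "cds_start": int(cds_start),
        })
    return rows


def summarize(row):
    """Summarize one transcript. ASSUMPTION: the transcript is on the + strand."""
    return {
        "name": row["name"],
        # UCSC start coordinates are 0-based, so add 1 for a 1-based position.
        "tss_1based": row["tx_start"] + 1,
        # Genomic distance from TSS to CDS start; this span may include introns.
        "tss_to_cds_bp": row["cds_start"] - row["tx_start"],
    }


summaries = [summarize(r) for r in parse_rows(TABLE)]
for s in summaries:
    print(s["name"], s["tss_1based"], s["tss_to_cds_bp"])
```

A distance of 0 means txStart and cdsStart coincide for that transcript, while a distance of several thousand bases is possible when the annotated CDS begins in a later exon.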

If you are interested in working further with the UCSC Table Browser and related tools, I suggest you begin with the resources on our training page at http://genome.ucsc.edu/training.html. The OpenHelix video tutorials in particular offer a guided, example-driven introduction to our website.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group

ruvalcabatrejo

Oct 8, 2014, 12:30:09 PM

to Jonathan Casper, gen...@soe.ucsc.edu

Hi Jonathan,

Thank you so much for answering my question. I am curious why you used the Feb. 2009 (GRCh37/hg19) assembly and not Dec. 2013 (GRCh38/hg38). When I use the Dec. 2013 (GRCh38/hg38) assembly, the output is closer to the TSSs that I have already confirmed. Is the Feb. 2009 (GRCh37/hg19) assembly more accurate and trustworthy for TSSs? This is new to me, so I am not very familiar with the differences. I had assumed the assemblies were the same, except that the most recent one had more accurate and up-to-date information than previous assemblies.

This is the output I get with the Dec. 2013 (GRCh38/hg38) assembly for IL1RN:

#hg38.knownGene.name	hg38.knownGene.txStart	hg38.knownGene.cdsStart	hg38.kgXref.geneSymbol
uc002tix.1	113099359	113099359	IL1RN
uc002tiy.3	113117892	113127726	IL1RN
uc002tiz.3	113117892	113118018	IL1RN
uc002tja.3	113117892	113118018	IL1RN
uc002tjb.3	113127560	113127624	IL1RN

Thank you for your time and help.

-Laura

Jonathan Casper

Oct 9, 2014, 8:17:56 PM

to ruvalcabatrejo, gen...@soe.ucsc.edu

Hello Laura,

I used the hg19 assembly in my example because that is still the default human assembly for the UCSC Genome Browser. While GRCh38/hg38 is a more complete assembly than GRCh37/hg19, it is still quite new. Much of the annotation on the hg19 assembly (and there is a lot) has not yet been constructed for the hg38 assembly. That does not mean that hg19 is more accurate or trustworthy.
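
One practical point when comparing the two outputs in this thread: positions are reported in the coordinate system of whichever assembly you select, so hg19 and hg38 values for the same transcript are not expected to match. The short Python sketch below takes the txStart and cdsStart values posted above and collects the differences between assemblies. For these five IL1RN transcripts the difference is the same everywhere, which suggests the transcript models are unchanged and only the coordinate system differs. The dictionary and function names are illustrative only, and a constant offset should not be assumed for other regions of the genome.

```python
# (txStart, cdsStart) per transcript, copied from the two outputs in this thread.
HG19 = {
    "uc002tix.1": (113856936, 113856936),
    "uc002tiy.3": (113875469, 113885303),
    "uc002tiz.3": (113875469, 113875595),
    "uc002tja.3": (113875469, 113875595),
    "uc002tjb.3": (113885137, 113885201),
}
HG38 = {
    "uc002tix.1": (113099359, 113099359),
    "uc002tiy.3": (113117892, 113127726),
    "uc002tiz.3": (113117892, 113118018),
    "uc002tja.3": (113117892, 113118018),
    "uc002tjb.3": (113127560, 113127624),
}


def coordinate_offsets(first, second):
    """Return the set of (first - second) differences over all shared fields."""
    offsets = set()
    for name, (tx_first, cds_first) in first.items():
        tx_second, cds_second = second[name]
        offsets.add(tx_first - tx_second)
        offsets.add(cds_first - cds_second)
    return offsets


offsets = coordinate_offsets(HG19, HG38)
print(offsets)  # a single value means a uniform shift across these transcripts
```

If you need to compare a confirmed TSS against annotation, make sure both positions refer to the same assembly before measuring how close they are.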

If you are interested in learning more about genome assemblies, NCBI provides a short primer at http://www.ncbi.nlm.nih.gov/assembly/basics/. You may also be interested in the NCBI Insights blog, which gives further information about some of their projects. You can find posts about the hg38 genome assembly at http://ncbiinsights.ncbi.nlm.nih.gov/tag/grch38/.

--
Jonathan Casper
UCSC Genome Bioinformatics Group
