Where can I find the gene length information?

Yung-Chih Lai

unread,

Mar 20, 2017, 10:17:59 AM3/20/17

to UCSC Genome Browser Discussion List

Hi,

I use a command below to get a GTF annotation file for RNA-Seq quantification. However, for FPKM gene expression levels, I still need the corresponding gene length information. Could you show me how can I get the information from the UCSC Genome Browser? Many thanks.

genePredToGtf -source=UCSChg38refGene20170102 -utr -addComments hg38 refGene hg38refGeneUtrComment.gtf

Best,

Gary

Matthew Speir

unread,

Mar 21, 2017, 2:20:21 PM3/21/17

to Yung-Chih Lai, UCSC Genome Browser Discussion List

Hi Gary,

Thank you for your question about obtaining gene length information from the UCSC Genome Browser.

While we don't store gene length information directly, you can easily calculate it using our public MySQL server.

Obtaining this information from our public MySQL server can be done with a single command:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -Ne "select name, txEnd-txStart from refGene" hg38

This will give you a two column file where the first column is the transcript name and the second column is the gene length. More information about accessing our public MySQL server can be found in our help documentation here: https://genome.ucsc.edu/goldenPath/help/mysql.html.

Alternatively, you could download the files for these tracks from our download server, http://hgdownload.soe.ucsc.edu/downloads.html, and then subtract txStart from txEnd. It should be fairly straightforward to come up with a short awk command that would parse the file and then perform this calculation.

We have also released a new NCBI RefSeq track, which includes coordinates and other information provided directly by the RefSeq group. You can read more about it in the announcement here: https://genome.ucsc.edu/goldenPath/newsarch.html#030317.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

--

Yung-Chih Lai

unread,

Mar 22, 2017, 10:57:53 AM3/22/17

to Matthew Speir, UCSC Genome Browser Discussion List

Hi Matthew,

Many thanks for your information. May I know the lengths of txEnd - txStart include both exons and introns? I need the lengths only include exons, but not introns. Furthermore, if I would like to do the gene level RNA-Seq quantification, is it possible to get the length for each gene, but not each transcript?

Best,

Gary

Matthew Speir

unread,

Mar 30, 2017, 10:54:09 AM3/30/17

to Yung-Chih Lai, UCSC Genome Browser Discussion List

Hi Gary,

The steps I provided in my previous response will output the size of the gene that includes both exons and introns.

There are a few different methods to get the information you are interested in. Each method produces slightly different results, and you will need to decide which method best fits your needs.

Method 1:
Get the transcript accession, gene name and RNA size from the refSeqAli table and then write a script to get the longest transcript for a particular gene symbol. This method involves extracting the total RNA size, which may include the unaligned portions of this RNA in the genome (polyA tails, etc.) Here is a MySQL command to get this information:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -Ne "select gene.name2, ali.qName, ali.qSize from refGene gene, refSeqAli ali where gene.name=ali.qName" hg38

Method 2:
This method is similar to "Method 1" described above as we're going to be extracting the gene symbol, transcript name, and RNA size from the refSeqAli table. This difference is that we're going to be calculating the size based on the regions of the RNA that align to the genome (meaning that it excludes unaligned sequence such as the polyA tail). As with before, you will ned to write your own script to extract the largest transcript for each gene symbol. Here is a MySQL command to get this information:

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -Ne "select gene.name2, ali.qName, ali.qEnd-ali.qStart from refGene gene, refSeqAli ali where gene.name=ali.qName" hg38

Method 3:
Use APPRIS tags to try to obtain a single gene size for each gene symbol. Note that this method isn't perfect, and hat some gene symbols may have more that one associated size, so you will have to look at the results to find those cases. This method uses a link between the ncbiRefSeq table (which is the "RefSeq All" track in the new "NCBI RefSeq" track) and the wgEncodeGencodeRefSeqV24 table to filter items in the ncbiRefSeq table based on the tags in the wgEncodeGencodeTagV24 table:

mysql hg38 -h genome-mysql.soe.ucsc.edu -u genome -NBe ' \
select l.name, r.name, gt.tag, psl.qEnd - psl.qStart as length \
from   ncbiRefSeq r, ncbiRefSeqLink l, wgEncodeGencodeRefSeqV24 gr, wgEncodeGencodeTagV24 gt, \
         ncbiRefSeqPsl psl \
where r.name = l.id and r.name = gr.rnaAcc and gr.transcriptId = gt.transcriptId \
         and r.name = psl.qName \
         and tag like "appris%"' \
| sort > geneTxTag.tab

Method 4:
This method uses a link between the refGene table and the knownCanonical table to come up with a set of gene lengths for almost every gene symbol. The knownCanonical table in hg38 has already been filtered using APPRIS tags and, if these tags aren't available, by transcript size. Again, it should be noted that this method isn't perfect and there may be some gene symbols with more than one size. The MySQL command to get this information:

mysql hg38 -h genome-mysql.soe.ucsc.edu -u genome -NBe ' \
select l.name, kr.value, psl.qEnd - psl.qStart as length \
from refGene r, hgFixed.refLink l, knownToRefSeq kr, knownCanonical kc, refSeqAli psl \
where r.name = l.mrnaAcc and r.name = kr.value and kr.name = kc.transcript \
and r.name = psl.qName \
group by kr.value' \
| sort > geneTxCanonical.tab

As said before, each of these sets of steps produces slightly different results. I would highly recommend looking closely at the output of each of them to find the one that best fits your needs.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Reply all

Reply to author

Forward