Hi Gary,
The steps I provided in my previous response will output the size of
the gene that includes both exons and introns.
There are a few different methods to get the information you are
interested in. Each method produces slightly different results, and
you will need to decide which method best fits your needs.
Method 1:
Get the transcript accession, gene name and RNA size from the
refSeqAli table and then write a script to get the longest
transcript for a particular gene symbol. This method involves
extracting the total RNA size, which may include the unaligned
portions of this RNA in the genome (polyA tails, etc.) Here is a
MySQL command to get this information:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
-Ne "select gene.name2, ali.qName, ali.qSize from refGene gene,
refSeqAli ali where gene.name=ali.qName" hg38
Method 2:
This method is similar to "Method 1" described above as we're going
to be extracting the gene symbol, transcript name, and RNA size from
the refSeqAli table. This difference is that we're going to be
calculating the size based on the regions of the RNA that align to
the genome (meaning that it excludes unaligned sequence such as the
polyA tail). As with before, you will ned to write your own script
to extract the largest transcript for each gene symbol. Here is a
MySQL command to get this information:
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A
-Ne "select gene.name2, ali.qName, ali.qEnd-ali.qStart from
refGene gene, refSeqAli ali where gene.name=ali.qName" hg38
Method 3:
Use APPRIS tags to try to obtain a single gene size for each gene
symbol. Note that this method isn't perfect, and hat some gene
symbols may have more that one associated size, so you will have to
look at the results to find those cases. This method uses a link
between the ncbiRefSeq table (which is the "RefSeq All" track in the
new "NCBI RefSeq" track) and the wgEncodeGencodeRefSeqV24 table to
filter items in the ncbiRefSeq table based on the tags in the
wgEncodeGencodeTagV24 table:
mysql hg38 -h genome-mysql.soe.ucsc.edu -u genome -NBe '
\
select l.name, r.name, gt.tag, psl.qEnd - psl.qStart as length \
from ncbiRefSeq r, ncbiRefSeqLink l, wgEncodeGencodeRefSeqV24
gr, wgEncodeGencodeTagV24 gt, \
ncbiRefSeqPsl psl \
where r.name = l.id and r.name = gr.rnaAcc and gr.transcriptId
= gt.transcriptId \
and r.name = psl.qName \
and tag like "appris%"' \
| sort > geneTxTag.tab
Method 4:
This method uses a link between the refGene table and the
knownCanonical table to come up with a set of gene lengths for
almost every gene symbol. The knownCanonical table in hg38 has
already been filtered using APPRIS tags and, if these tags aren't
available, by transcript size. Again, it should be noted that this
method isn't perfect and there may be some gene symbols with more
than one size. The MySQL command to get this information:
mysql hg38 -h genome-mysql.soe.ucsc.edu -u genome -NBe '
\
select l.name, kr.value, psl.qEnd - psl.qStart as length \
from refGene r, hgFixed.refLink l, knownToRefSeq kr,
knownCanonical kc, refSeqAli psl \
where r.name = l.mrnaAcc and r.name = kr.value and kr.name =
kc.transcript \
and r.name = psl.qName \
group by kr.value' \
| sort > geneTxCanonical.tab
As said before, each of these sets of steps produces slightly
different results. I would highly recommend looking closely at the
output of each of them to find the one that best fits your needs.