Dear Klaus,
Thank you for using the UCSC Genome Browser and for your question about identifying all non-coding genes using UCSC Genes in hg18 and hg19.
There are several ways to get the non-coding genes from UCSC Genes, such as your suggestion to search for "non-coding" in the associated hg19.kgXref.description field for each entry. The most straightforward approach is to use the knownGene table's cdsStart and cdsEnd columns, which give the coordinates where the protein-coding region of the transcript starts and ends. Since not every transcript is predicted to code for a protein, some have the same value in both fields, in essence indicating that coding never starts. By filtering the table for rows where these two values are equal, you can pull out all the non-coding genes from knownGene.
To demonstrate this, take a look at the first entry of the knownGene table, which happens to be a non-coding gene:
mysql> select * from knownGene limit 1\G
name: uc001aaa.3
chrom: chr1
strand: +
txStart: 11873
txEnd: 14409
cdsStart: 11873
cdsEnd: 11873
exonCount: 3
exonStarts: 11873,12612,13220,
exonEnds: 12227,12721,14409,
proteinID:
alignID: uc001aaa.3
In the entry above, cdsStart equals cdsEnd. In fact, if you query the MySQL database with that condition (mysql> select * from knownGene where cdsStart=cdsEnd limit 1\G), you will get the same entry.
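If you would rather work from a local copy of the table than from MySQL, the same cdsStart=cdsEnd test can be sketched in a few lines of Python. This is only an illustration: it assumes a tab-separated dump of knownGene whose columns appear in the order listed in the entry above, and the second sample row (ucHYPOTH.1) is a made-up coding transcript included purely for contrast.

```python
import csv
import io

# Column order as listed in the knownGene entry shown above (assumed to match
# the order of a tab-separated dump of the table).
KNOWN_GENE_COLUMNS = [
    "name", "chrom", "strand", "txStart", "txEnd", "cdsStart", "cdsEnd",
    "exonCount", "exonStarts", "exonEnds", "proteinID", "alignID",
]


def non_coding_rows(handle):
    """Yield knownGene rows (as dicts) whose cdsStart equals cdsEnd."""
    reader = csv.DictReader(handle, fieldnames=KNOWN_GENE_COLUMNS, delimiter="\t")
    for row in reader:
        if int(row["cdsStart"]) == int(row["cdsEnd"]):
            yield row


# First row: the uc001aaa.3 entry from above.
# Second row: a HYPOTHETICAL coding transcript (invented ID and coordinates).
SAMPLE = (
    "uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t"
    "11873,12612,13220,\t12227,12721,14409,\t\tuc001aaa.3\n"
    "ucHYPOTH.1\tchr1\t+\t1000\t5000\t1200\t4800\t2\t"
    "1000,3000,\t2000,5000,\t\tucHYPOTH.1\n"
)

non_coding = list(non_coding_rows(io.StringIO(SAMPLE)))
print([row["name"] for row in non_coding])
```

To run it on a real dump, replace the io.StringIO(SAMPLE) handle with an open file handle for your copy of the table.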
You can use the Table Browser to select out all these non-coding entries using the filter option.
4. In the “hg19.knownGene” section, on the “Free-form query” line, enter the following: “cdsStart=cdsEnd”
6. Set "Output Format: selected fields from primary and related tables" and click the “get output” button
7. Select the fields you would like to be included in your output such as "name", "chrom", "txStart", and "txEnd" (including perhaps the hg19.kgXref.description and geneSymbol fields) and click the “get output” button.
You will get output like:
#hg19.knownGene.name hg19.knownGene.chrom hg19.knownGene.txStart hg19.knownGene.txEnd hg19.kgXref.geneSymbol hg19.kgXref.description
uc001aaa.3 chr1 11873 14409 DDX11L1 Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA.
uc010nxr.1 chr1 11873 14409 DDX11L1 Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA.
While your approach of searching the associated description field for "non-coding" will yield results, it will miss many other UCSC genes that are non-coding.
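To see how the two filters can differ, here is a small sketch that uses an in-memory SQLite database as a stand-in for the real MySQL tables. It assumes kgXref can be joined to knownGene through kgXref.kgID = knownGene.name. Only the uc001aaa.3 row comes from the output above; the two ucHYPOTH rows, their symbols, and their descriptions are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the hg19 tables, reduced to the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE knownGene (name TEXT, chrom TEXT, txStart INTEGER,
                            txEnd INTEGER, cdsStart INTEGER, cdsEnd INTEGER);
    CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT, description TEXT);
""")
conn.executemany("INSERT INTO knownGene VALUES (?, ?, ?, ?, ?, ?)", [
    ("uc001aaa.3", "chr1", 11873, 14409, 11873, 11873),  # from the output above
    ("ucHYPOTH.1", "chr1", 20000, 25000, 20000, 20000),  # hypothetical, non-coding
    ("ucHYPOTH.2", "chr1", 30000, 35000, 30500, 34500),  # hypothetical, coding
])
conn.executemany("INSERT INTO kgXref VALUES (?, ?, ?)", [
    ("uc001aaa.3", "DDX11L1",
     "Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 "
     "(DDX11L1), non-coding RNA."),
    ("ucHYPOTH.1", "HYPO1", "hypothetical transcript with an uninformative description"),
    ("ucHYPOTH.2", "HYPO2", "hypothetical protein-coding transcript"),
])

JOIN = "SELECT k.name FROM knownGene k JOIN kgXref x ON x.kgID = k.name "

# Filter 1: coding region has zero length.
by_cds = [r[0] for r in conn.execute(
    JOIN + "WHERE k.cdsStart = k.cdsEnd ORDER BY k.name")]

# Filter 2: description mentions "non-coding".
by_description = [r[0] for r in conn.execute(
    JOIN + "WHERE x.description LIKE '%non-coding%' ORDER BY k.name")]

# Entries the description search would miss.
missed = sorted(set(by_cds) - set(by_description))
print(by_cds, by_description, missed)
```

In this toy data the description search misses ucHYPOTH.1, which is the kind of entry the cdsStart=cdsEnd filter is meant to catch.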
Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
All the best,
Brian Lee
UCSC Genome Bioinformatics Group