identify all non-coding genes


Klaus Schmitz-Abe

Nov 22, 2013, 1:31:42 AM
to gen...@soe.ucsc.edu, kl...@broadinstitute.org
Dear Steve:

I am working with hg18 and hg19. I need to identify all non-coding genes using UCSC Genes.
If I use hg19, the problem is easy: I just grep "non-coding" in the description.

Can I do the same using hg18?
If I do that, I obtain very few compared with hg19.

Is there another way to identify non-coding UCSC genes?

Thanks, I look forward to hearing from you.

best
klaus

-----------------------------------------------------------------
Klaus Schmitz Abe, PhD
Research Instructor at:
Program in Genomics Children's Hospital Boston
Department of Pediatrics Harvard Medical School
Manton Center for Orphan Disease Research
3 Blackfan Circle CLS15072, Boston MA 02115
-----------------------------------------------------------------



Hello, Aritra.

The problem with using “NR_*” is that not every item in UCSC Genes containing “NR_” is a non-coding RNA and not every non-coding RNA contains “NR_” in the various ID fields.  The best thing to do would be to create a filter based on the hg19.kgXref.description field to list items with “non-coding” in the description field.  To do so, perform the following steps:

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables

2. Select the following options:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)
Group: Genes and Gene Prediction Tracks
Track: UCSC Genes
Table: knownGene
Region: Click “genome” for the entire genome, select “position” and enter the coordinates in the text box to specify a certain position, or click the “define regions” button to specify multiple regions
Output Format: selected fields from primary and related tables

3. On the “filter” line, click the “create” button

4. In the “hg19.kgXref based filters” section, on the “Free-form query” line, enter the following (with plain straight quotes): description like "%non-coding%"

5. Click the “submit” button

6. Click the “get output” button

7. Select the fields you would like to be included in your output (including hg19.kgXref.description) and click the “get output” button
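
If you already have an unfiltered tab-separated dump of the same fields, the description filter from step 4 can be approximated offline. The following is only a minimal Python sketch, not a UCSC tool: it assumes the output starts with a header line prefixed by "#", that a column name ending in "description" is present, and that a case-insensitive substring test is an acceptable stand-in for the LIKE pattern. The function name and the second sample row are hypothetical.

```python
def filter_by_description(tsv_text, needle="non-coding"):
    """Return rows (as dicts) whose description column contains `needle`."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip()]
    # Assumption: the first line is a header prefixed with '#'.
    header = lines[0].lstrip("#").split("\t")
    desc_col = next(c for c in header if c.endswith("description"))
    rows = [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]
    # Assumption: case-insensitive substring match stands in for LIKE '%...%'.
    return [r for r in rows if needle.lower() in r[desc_col].lower()]


sample = (
    "#hg19.knownGene.name\thg19.kgXref.geneSymbol\thg19.kgXref.description\n"
    "uc001aaa.3\tDDX11L1\tHomo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box "
    "helicase 11 like 1 (DDX11L1), non-coding RNA.\n"
    # Hypothetical row, made up for illustration only.
    "hypo_mrna.1\tGENEX\tHypothetical protein-coding example, mRNA.\n"
)

hits = filter_by_description(sample)
print([row["hg19.knownGene.name"] for row in hits])
```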


Please contact us again at gen...@soe.ucsc.edu if you have any further questions.

---
Steve Heitner
UCSC Genome Bioinformatics Group




Brian Lee

Nov 22, 2013, 3:27:00 PM
to Klaus Schmitz-Abe, gen...@soe.ucsc.edu, kl...@broadinstitute.org
Dear Klaus,

Thank you for using the UCSC Genome Browser and your question about identifying all non-coding genes using UCSC Genes in hg18 and hg19.

There are several ways to get the non-coding genes from UCSC Genes, such as your suggestion to search for "non-coding" in the associated hg19.kgXref.description field for each entry. The most straightforward approach is to use the knownGene table's cdsStart and cdsEnd columns, which give the coordinates where the coding region of the transcript starts and ends. Since not every RNA is predicted to code for a protein, some entries have the same value in both fields, in essence indicating that coding never starts. By filtering the entire table for entries where these two values are equal, you can pull out all the non-coding genes from knownGene.

To demonstrate this, take a look at the first entry of the knownGene table, which happens to be a non-coding gene:

mysql> select * from knownGene limit 1\G
      name: uc001aaa.3
     chrom: chr1
    strand: +
   txStart: 11873
     txEnd: 14409
  cdsStart: 11873
    cdsEnd: 11873
 exonCount: 3
exonStarts: 11873,12612,13220,
  exonEnds: 12227,12721,14409,
 proteinID: 
   alignID: uc001aaa.3

In the entry above, cdsStart equals cdsEnd. In fact, if you query the MySQL database with that condition (mysql> select * from knownGene where cdsStart=cdsEnd limit 1\G), you will get the same entry.
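
The same test is easy to apply to individual records outside the database. Below is a minimal Python sketch of the cdsStart=cdsEnd check, using the uc001aaa.3 values shown above. The KnownGene class and is_noncoding function are illustrative names rather than part of any UCSC library, and the second record is hypothetical.

```python
from typing import NamedTuple


class KnownGene(NamedTuple):
    """A subset of the knownGene columns shown in the entry above."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int


def is_noncoding(gene: KnownGene) -> bool:
    # An empty coding region (cdsStart equal to cdsEnd) marks a non-coding entry.
    return gene.cds_start == gene.cds_end


# Values copied from the uc001aaa.3 entry above.
ddx11l1 = KnownGene("uc001aaa.3", "chr1", "+", 11873, 14409, 11873, 11873)
# Hypothetical coding transcript, made up for illustration only.
hypo = KnownGene("hypo_mrna.1", "chr1", "+", 1000, 5000, 1200, 4800)

print(is_noncoding(ddx11l1))  # True
print(is_noncoding(hypo))     # False
```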

You can use the Table Browser to select out all these non-coding entries using the filter option.

1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables


2. Select the following options:
Clade: Mammal
Genome: Human
Assembly: Feb. 2009 (GRCh37/hg19)   [Or hg18 if you are using that database]
Group: Genes and Gene Prediction Tracks
Track: UCSC Genes
Table: knownGene
Region: Click “genome” for the entire genome, select “position” and enter the coordinates in the text box to specify a certain position, or click the “define regions” button to specify multiple regions

3. On the “filter” line, click the “create” button

4. In the “hg19.knownGene” section, on the “Free-form query” line, enter the following (without quotes): cdsStart=cdsEnd

5. Click the “submit” button

6. Set "Output Format: selected fields from primary and related tables" and click the “get output” button

7. Select the fields you would like to be included in your output such as "name", "chrom", "txStart", and "txEnd" (including perhaps the hg19.kgXref.description and geneSymbol fields) and click the “get output” button.

You will get output like: 
#hg19.knownGene.name hg19.knownGene.chrom hg19.knownGene.txStart hg19.knownGene.txEnd hg19.kgXref.geneSymbol hg19.kgXref.description
uc001aaa.3 chr1 11873 14409 DDX11L1 Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA.
uc010nxr.1 chr1 11873 14409 DDX11L1 Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 (DDX11L1), non-coding RNA.

While your approach of grepping for "non-coding" in the associated description field will yield results, it will miss many other UCSC genes that are non-coding.

Please see the previous reply about possibly using the RefSeq Genes track and filtering on “NR_*”, and other considerations to make when picking different gene prediction tracks: https://groups.google.com/a/soe.ucsc.edu/forum/?hl=en&fromgroups#!search/How$20to$20get$20cDNAs$20of$20ncRNAs$20from$20UCSC$20Genes/genome/c6nddNoSFBo/WAfzR9nREykJ
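
To see how much the keyword search misses for a given assembly, you could compare the two selections directly. The following Python sketch is one way to do that, assuming you have exported name, cdsStart, cdsEnd and description for each entry. Only the first record comes from the output above; the other two are hypothetical and exist purely to illustrate the comparison.

```python
records = [
    # (name, cdsStart, cdsEnd, description)
    ("uc001aaa.3", 11873, 11873,
     "Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 "
     "(DDX11L1), non-coding RNA."),
    # Hypothetical: empty coding region, but no keyword in the description.
    ("hypo_nc.1", 500, 500, "Hypothetical transcript without the keyword."),
    # Hypothetical: ordinary coding transcript.
    ("hypo_mrna.1", 1200, 4800, "Hypothetical protein-coding example, mRNA."),
]

# Entries selected by the description keyword.
by_description = {name for name, _, _, desc in records if "non-coding" in desc}
# Entries selected by an empty coding region.
by_cds = {name for name, start, end, _ in records if start == end}

# Non-coding by cdsStart=cdsEnd, yet missed by the keyword search.
missed = sorted(by_cds - by_description)
print(missed)
```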

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genome Bioinformatics Group



Klaus Schmitz-Abe

Nov 29, 2013, 10:19:10 AM
to Brian Lee, gen...@soe.ucsc.edu
Dear Brian:

Thank you very much for your answer, it was very helpful. However, there is a problem with your method.
I need to divide the UCSC genes into two lists, coding and non-coding genes.
If I use only your method, there are some genes for which a UTR is defined even though they are non-coding, for example LOC100129726.

What do you recommend I do to improve this?

Thanks, I appreciate your time.

happy thanksgiving

klaus

Brian Lee

Dec 2, 2013, 6:32:30 PM
to Klaus Schmitz-Abe, gen...@soe.ucsc.edu
Dear Klaus,

Thank you for your message. Using the cdsStart=cdsEnd query will effectively divide knownGene into two lists: the entries that UCSC Genes predicts as coding and those it predicts as non-coding.

The description for LOC100129726, http://genome.ucsc.edu/cgi-bin/hgGene?db=hg19&hgg_gene=LOC100129726, contains the wording "non-coding RNA", which could be causing the confusion. However, the gene details page is an amalgamation of information from multiple sources, including the gene description. If you scroll down to the bottom of that page, you will find the "Gene Model Information" section, which clarifies that UCSC Genes annotates this region of the genome as "category: coding".

You can therefore use the cdsStart=cdsEnd approach to separate knownGene into two lists. However, the result will not always match the results from other gene prediction tracks, which use different methods to predict which parts of the genome represent coding genes.
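
As a rough illustration of that split, here is a minimal Python sketch that partitions entries on cdsStart=cdsEnd and separately flags entries whose description says "non-coding" although the gene model is coding, which is the situation described for LOC100129726. The function name is illustrative, and every record except uc001aaa.3 is hypothetical. In particular, the coordinates of the flagged entry are placeholders and not the real LOC100129726 values.

```python
def split_known_genes(records):
    """Partition (name, cdsStart, cdsEnd, description) tuples.

    Returns (coding, noncoding, flagged), where `flagged` lists entries
    that are coding by cdsStart != cdsEnd but whose description text
    contains "non-coding".
    """
    coding, noncoding, flagged = [], [], []
    for name, cds_start, cds_end, description in records:
        if cds_start == cds_end:
            noncoding.append(name)
        else:
            coding.append(name)
            if "non-coding" in description:
                flagged.append(name)
    return coding, noncoding, flagged


records = [
    ("uc001aaa.3", 11873, 11873,
     "Homo sapiens DEAD/H (Asp-Glu-Ala-Asp/His) box helicase 11 like 1 "
     "(DDX11L1), non-coding RNA."),
    # Hypothetical entry with placeholder coordinates: a coding gene model
    # paired with a "non-coding RNA" description.
    ("hypo_loc.1", 2000, 2600, "Hypothetical LOC example, non-coding RNA."),
    # Hypothetical ordinary coding transcript.
    ("hypo_mrna.1", 1200, 4800, "Hypothetical protein-coding example, mRNA."),
]

coding, noncoding, flagged = split_known_genes(records)
print("coding:", coding)
print("non-coding:", noncoding)
print("check by hand:", flagged)
```

Flagged entries are the ones worth checking against the "Gene Model Information" section of the gene details page before deciding which list they belong in.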


All the best,

Brian Lee
UCSC Genome Bioinformatics Group