Good afternoon,
Can someone explain the difference between the outputs from the ccdsGene and refGene tables? The only difference that I can see is that the annotation information is listed by NM number in one and by CCDS number in the other. If that is the case, I will use the ccdsGene table and tie that to the NM number listed in the ccdsInfo table to make my NM reference.
Thanks,
Chris Heilala
Software Engineer | PreventionGenetics
chris....@preventiongenetics.com | 715.387.0484 x165

Hello Chris,
Thank you for your question about the difference between the refGene and ccdsGene tables. The refGene table is built by aligning RefSeq RNAs to the human genome assembly. The table contains the accession number of each sequence and the coordinates of its best alignment to the genome assembly. More information about how this is done is available on the track page at http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=refGene. The ccdsGene table is built from data provided by the Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/), which uses RefSeq data in tandem with other sources to try to identify a high-quality set of protein-coding regions. One of our engineers provides the following description of CCDS annotation:
CCDS are the consensus CDS annotations (UTR is excluded) between RefSeq and HAVANA+Ensembl. For human, HAVANA+Ensembl is now GENCODE, so these are the CDS regions where RefSeq and GENCODE agree. It is not a passive process: annotations that disagree are discussed by the annotators, and an attempt is made to bring them into agreement. UCSC has the tie-breaking vote.
More information about how the ccdsGene data are obtained is available at http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=ccdsGene.
While the underlying structure of these two tables is identical, the data within the tables are not. As a result of the additional constraints placed on the CCDS data, the ccdsGene table contains significantly fewer records than the refGene table (~29k vs. ~48k). Furthermore, although RefSeq data are part of the CCDS project, the genomic alignments used for those data are provided by RefSeq. These alignments can differ from the ones in the refGene table, which are generated by UCSC.
Which of these tables is most relevant to your research isn't a question we can really answer for you. If you are specifically looking for NM numbers, however, please note that the ccdsInfo table includes both RefSeq and non-RefSeq accession numbers.
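If it helps, here is a minimal Python sketch of pulling only the RefSeq mRNA accessions out of ccdsInfo for each CCDS ID. It assumes that rows have the form (ccds, srcDb, mrnaAcc, protAcc) and that RefSeq mRNA accessions can be recognized by their NM_ prefix. The rows and the function name ccds_to_refseq are made-up placeholders for illustration, not real records or part of any UCSC tool, so please check the column layout against the table schema before relying on it.

```python
from collections import defaultdict

# Hypothetical rows shaped like ccdsInfo: (ccds, srcDb, mrnaAcc, protAcc).
# All accessions below are placeholders, not real records.
CCDS_INFO_ROWS = [
    ("CCDS0001.1", "N", "NM_000001.1", "NP_000001.1"),
    ("CCDS0001.1", "H", "ENST00000000001.1", "ENSP00000000001.1"),
    ("CCDS0002.1", "N", "NM_000002.2", "NP_000002.2"),
    ("CCDS0002.1", "N", "NM_000003.1", "NP_000003.1"),
    ("CCDS0003.1", "H", "OTTHUMT00000000001.1", "OTTHUMP00000000001.1"),
]


def ccds_to_refseq(rows):
    """Map each CCDS ID to the RefSeq mRNA accessions (NM_...) linked to it.

    Rows whose mRNA accession does not start with "NM_" are skipped, so a
    CCDS ID with no RefSeq mRNA does not appear in the result.
    """
    mapping = defaultdict(list)
    for ccds_id, _src_db, mrna_acc, _prot_acc in rows:
        if mrna_acc.startswith("NM_"):
            mapping[ccds_id].append(mrna_acc)
    return dict(mapping)


if __name__ == "__main__":
    for ccds_id, accessions in sorted(ccds_to_refseq(CCDS_INFO_ROWS).items()):
        print(ccds_id, ",".join(accessions))
```

Note that a single CCDS ID can map to more than one NM accession, so the result is a list per CCDS ID rather than a single value.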
I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. Questions sent to that address will be archived in a publicly-accessible forum for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.
--
Jonathan Casper
UCSC Genome Bioinformatics Group