Discrepancies in exon counts between UCSC RefGene table and NCBI RefSeq entries

Emilie Ait Yahya Graison

May 4, 2016, 11:32:16 AM
to gen...@soe.ucsc.edu
Hello UCSC Genome Bioinformatics Group,

Using the Table Browser, I have just downloaded all fields from the RefGene table (hg19 assembly), with a list of RefSeq IDs as identifiers.

However, for some of my RefSeq transcripts of interest, the exonCount column of the output file disagrees with the exon count given by NCBI. These discrepancies cannot be explained by differences in the sub-versions of the NM accessions, since I queried the UCSC Table Browser and NCBI Nucleotide on the same date.

Here are a few examples:

- name: NM_014249, exonCount: 9
but
the entry in NCBI Nucleotide counts 8 exons for that transcript

- name: NM_018474, exonCount: 14
but
the entry in NCBI Nucleotide counts 13 exons for that transcript

- name: NM_022124, exonCount: 68
but
the entry in NCBI Nucleotide counts 70 exons for that transcript

Do you have any idea of what could explain such inconsistencies, please?

Many thanks in advance.

Regards,

Emilie
NGS Bioinformatics team
Molecular Biology Facility
CHRU - Pole de Biologie Pathologie Génétique
Lille, France



Matthew Speir

May 4, 2016, 1:10:43 PM
to Emilie Ait Yahya Graison, gen...@soe.ucsc.edu
Hello Emilie,

Thank you for your question about differences in exon counts between RefSeq and the UCSC Genome Browser.

These differences seem to stem from two things. The first is how we create the "RefSeq Genes" track in the UCSC Genome Browser, and the second is assembly differences.

Often these differences in exon counts are due to how the RefSeq Genes track in the UCSC Genome Browser is created. In short, the track is created by mapping the RefSeq mRNAs to the genome using BLAT. You can read more about how the track is produced on the track description page: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=refGene. This may produce differences between the annotations in RefSeq and what we display in the UCSC Genome Browser.
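To find which transcripts in a Table Browser download are affected, a short script can compare the exonCount column with counts you have collected yourself from NCBI Nucleotide. The following is only a minimal sketch: the function name `exon_count_mismatches` is hypothetical, the sample input shows just the two relevant columns rather than the full refGene output, and the counts are the ones quoted in this thread. It assumes tab-separated output whose first line is a header starting with "#".

```python
def exon_count_mismatches(table_text, expected_counts):
    """Return (name, ucsc_count, expected_count) for rows whose counts differ.

    table_text: tab-separated Table Browser output with a '#'-prefixed header.
    expected_counts: dict mapping RefSeq accession -> exon count from NCBI
                     (supplied by the user; not fetched here).
    """
    lines = table_text.strip().splitlines()
    # Locate columns by header name instead of by fixed position.
    header = lines[0].lstrip("#").split("\t")
    name_idx = header.index("name")
    count_idx = header.index("exonCount")

    mismatches = []
    for line in lines[1:]:
        fields = line.split("\t")
        name = fields[name_idx]
        ucsc_count = int(fields[count_idx])
        expected = expected_counts.get(name)
        # Skip transcripts for which no NCBI count was supplied.
        if expected is not None and ucsc_count != expected:
            mismatches.append((name, ucsc_count, expected))
    return mismatches


# Illustrative input: only two columns, values taken from this thread.
SAMPLE = "#name\texonCount\nNM_014249\t9\nNM_018474\t14\nNM_022124\t68\n"
NCBI_COUNTS = {"NM_014249": 8, "NM_018474": 13, "NM_022124": 70}

for name, ucsc, ncbi in exon_count_mismatches(SAMPLE, NCBI_COUNTS):
    print(f"{name}: refGene exonCount={ucsc}, NCBI={ncbi}, difference={ucsc - ncbi}")
```

Because a transcript can align to more than one location, the same accession may appear on several rows; this sketch reports each mismatching row separately.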

For two of the annotations you've highlighted, NM_014249 and NM_018474, these differences are likely due to looking at the annotations on hg19. If you look at them on hg38, you will see that both match the exon count from RefSeq:

The reason these annotations differ is related to the first cause I described. To create the RefSeq track, we take the mRNAs and remap them to the most recent version of the human genome as well as to all previous versions, while RefSeq creates these mRNAs based on the most recent assembly, hg38. Assemblies change over time, so mRNAs created for the most recent assembly may not map as well to previous versions. In the links to the two transcripts above, I've also included the "Hg19 Diff" track, which highlights the regions of the hg38 assembly that differ from the previous hg19 assembly.

We are currently working with NCBI to create a track that includes annotations directly from their database rather than relying on our current method of re-mapping their mRNAs. Unfortunately, I can't give you a time estimate of when this track may be available.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group