Inconsistency between hg38.kgXref and hg38.knownToRefSeq

Mark Brown

Apr 19, 2016, 11:18:07 AM
to gen...@soe.ucsc.edu
I am trying to find all transcripts for PDGFRA (NM_006206).

My question is: why is uc003haa.4 associated with NM_006206 according to the knownToRefSeq table?  Thanks!

The following SQL returns 11 rows, of which uc003haa.4 is the first:
select name as UCSC, value as RefSeq from hg38.knownToRefSeq r where value='NM_006206'

          UCSC     RefSeq
0   uc003haa.4  NM_006206
1   uc003han.5  NM_006206
2   uc003hal.4  NM_006206
3   uc062wqt.1  NM_006206
4   uc062wqu.1  NM_006206
5   uc062wqv.1  NM_006206
6   uc062wqw.1  NM_006206
7   uc062wqx.1  NM_006206
8   uc062wqy.1  NM_006206
9   uc062wra.1  NM_006206
10  uc062wrb.1  NM_006206

However, uc003haa.4 appears to be linked to ENST00000507166, which belongs to RP11-231C18.3-001.  See:

select * from hg38.knownToEnsembl where name='uc003haa.4'
         name              value
0  uc003haa.4  ENST00000507166.4

If I search kgXref, uc003haa.4 does not map to any RefSeq entry.
select * from hg38.kgXref where kgID='uc003haa.4'

      kgID      mRNA    spID spDisplayID geneSymbol refseq protAcc
uc003haa.4  AY229892  Q6UN15  FIP1_HUMAN     FIP1L1

It appears the association between uc003haa.4 and NM_006206 is a mistake.

Matthew Speir

Apr 20, 2016, 1:40:13 PM
to Mark Brown, gen...@soe.ucsc.edu
Hi Mark,

Thank you for your question about the knownToRefSeq table in the UCSC Genome Browser.

The knownToRefSeq table only associates the IDs of items in the GENCODE v22 track with the items they overlap in the RefSeq Genes track. In this case, uc003haa.4 is a long transcript that spans a large region containing multiple other transcripts, meaning that all of those transcripts will appear alongside it in the knownToRefSeq table. You can see this if you look at uc003haa.4 in the Genome Browser with both the GENCODE v22 and RefSeq Genes tracks:

http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=mspeir&hgS_otherUserSessionName=hg38_uc003haa.4
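
The overlap rule described above can be illustrated with a small sketch. It is not the procedure UCSC uses to build knownToRefSeq, which may apply further criteria; it only shows how a plain span-overlap test links a long transcript to a RefSeq entry. In-memory SQLite tables stand in for hg38.knownGene and hg38.refGene, the function name overlapping_pairs is an assumption, and the rows named hypo_long.1 and hypo_far.1 are hypothetical with made-up coordinates.

```python
import sqlite3

# In-memory stand-ins for hg38.knownGene and hg38.refGene (simplified columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE knownGene (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE refGene   (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);

-- Coordinates quoted elsewhere in this thread.
INSERT INTO refGene   VALUES ('NM_006206',  'chr4', 54229096, 54298245);
INSERT INTO knownGene VALUES ('uc003han.5', 'chr4', 54229096, 54298247);

-- HYPOTHETICAL rows: a long transcript spanning the region, and a distant one.
INSERT INTO knownGene VALUES ('hypo_long.1', 'chr4', 53000000, 54300000);
INSERT INTO knownGene VALUES ('hypo_far.1',  'chr4',    10000,    20000);
""")


def overlapping_pairs(conn):
    """Return (knownGene name, refGene name) for every pair whose spans overlap."""
    sql = """
        SELECT k.name, r.name
        FROM knownGene AS k
        JOIN refGene AS r
          ON k.chrom = r.chrom
         AND k.txStart < r.txEnd   -- half-open intervals overlap when each
         AND r.txStart < k.txEnd   -- one starts before the other ends
        ORDER BY k.name
    """
    return conn.execute(sql).fetchall()


pairs = overlapping_pairs(conn)
print(pairs)
```

Under this test the hypothetical long transcript is paired with NM_006206 purely because its span covers the same region, while the distant one is not.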

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Mark Brown

Apr 20, 2016, 5:20:22 PM
to Matthew Speir, gen...@soe.ucsc.edu
Hi, Matthew

Thanks! I can see why uc003haa.4 is linked to NM_006206 now.

However, I have follow-up questions on how to obtain the best gene structure for a RefSeq entry.

(1) It seems the UCSC Genome Browser gateway search looks up RefSeq accessions in the hg38.refGene table.  E.g., for NM_006206:

select name,chrom,strand,txStart,txEnd from hg38.refGene where name='NM_006206'
        name chrom strand   txStart     txEnd
0  NM_006206  chr4      +  54229096  54298245

According to kgXref, NM_006206 is linked to uc003han.5:
select * from hg38.kgXref where mRNA='NM_006206'
         kgID       mRNA    spID  spDisplayID geneSymbol     refseq  \
0  uc003han.5  NM_006206  P16234  PGFRA_HUMAN     PDGFRA  NM_006206

However,
select name,chrom,strand,txStart,txEnd from hg38.knownGene where name='uc003han.5'
         name chrom strand   txStart     txEnd
0  uc003han.5  chr4      +  54229096  54298247

Notice that the txEnd coordinates differ between refGene (54298245) and knownGene (54298247). Why is that?
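
The comparison above can be done in a single join. This is a minimal sketch, assuming in-memory SQLite tables that hold only the rows quoted in this message; the helper name coordinate_offsets is an assumption, and on the real database the same SELECT would be run against the hg38 tables.

```python
import sqlite3

# In-memory copies of the three rows quoted in this message.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refGene   (name TEXT, chrom TEXT, strand TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE kgXref    (kgID TEXT, refseq TEXT);

INSERT INTO refGene   VALUES ('NM_006206',  'chr4', '+', 54229096, 54298245);
INSERT INTO knownGene VALUES ('uc003han.5', 'chr4', '+', 54229096, 54298247);
INSERT INTO kgXref    VALUES ('uc003han.5', 'NM_006206');
""")


def coordinate_offsets(conn, refseq_id):
    """Return knownGene-minus-refGene offsets for transcripts linked via kgXref."""
    sql = """
        SELECT r.name, k.name,
               k.txStart - r.txStart AS startDiff,
               k.txEnd   - r.txEnd   AS endDiff
        FROM refGene AS r
        JOIN kgXref    AS x ON x.refseq = r.name
        JOIN knownGene AS k ON k.name = x.kgID AND k.chrom = r.chrom
        WHERE r.name = ?
    """
    return conn.execute(sql, (refseq_id,)).fetchall()


offsets = coordinate_offsets(conn, "NM_006206")
print(offsets)  # [('NM_006206', 'uc003han.5', 0, 2)]
```

The start coordinates agree and the ends differ by two bases.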

(2) Also, it seems RefSeq entries do not always have corresponding UCSC entries.
E.g., searching for NM_002929,
I found two entries using:
select name,chrom,strand,txStart,txEnd from hg38.refGene where name='NM_002929'
        name                 chrom strand    txStart      txEnd
0  NM_002929  chr13_KI270842v1_alt      +       4330       8775
1  NM_002929                 chr13      +  113667281  113735664

However, no entry in hg38.kgXref matches NM_002929.  Two entries in hg38.knownToRefSeq match:
select name as UCSC, value as RefSeq from hg38.knownToRefSeq r where value='NM_002929'
         UCSC     RefSeq
0  uc010tkf.3  NM_002929
1  uc058yoe.1  NM_002929

But neither of these two shares the same gene structure as the two rows returned by refGene:
select name,chrom,strand,txStart,txEnd from hg38.knownGene where name in ('uc010tkf.3','uc058yoe.1')
         name  chrom strand    txStart      txEnd
0  uc010tkf.3  chr13      +  113667154  113737735
1  uc058yoe.1  chr13      +  113726485  113735236

So my current understanding is: UCSC entries come from Ensembl.  For a given RefSeq accession, if I trust NCBI's RefSeq gene structure, I should use the refGene table to retrieve the data.  But if I trust the transcripts from Ensembl, I should first map the RefSeq accession to UCSC entries using kgXref and obtain gene structures from the matched UCSC entries.  If nothing is found in kgXref, I should use knownToRefSeq to find UCSC entries and obtain the gene structure that way.  Is my understanding correct?

Thanks!

Matthew Speir

Apr 28, 2016, 12:40:22 PM
to Mark Brown, gen...@soe.ucsc.edu
Hi Mark,

If you are starting with RefSeq Genes identifiers, such as NM_006206, then I would recommend getting the gene structures from the RefSeq Genes track or from RefSeq itself. I would not recommend mixing and matching gene structures between different sources in the UCSC Genome Browser.
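
As a rough sketch of working with refGene rows alone, the snippet below keeps only primary-assembly rows and turns exon lists into coordinate pairs. It assumes that alternate-haplotype sequences are the ones whose chrom name ends in _alt (as in the chr13_KI270842v1_alt row quoted earlier in the thread) and that exonStarts and exonEnds arrive as comma-separated strings with a trailing comma. The helper names is_primary_chrom and parse_exons are assumptions, and the exon strings are hypothetical.

```python
def is_primary_chrom(chrom):
    """True unless the name carries the '_alt' suffix seen in the thread."""
    return not chrom.endswith("_alt")


def parse_exons(exon_starts, exon_ends):
    """Turn comma-separated exonStarts/exonEnds strings into (start, end) pairs."""
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return list(zip(starts, ends))


# Rows as returned by the refGene query for NM_002929 quoted in this thread.
refgene_rows = [
    {"name": "NM_002929", "chrom": "chr13_KI270842v1_alt", "strand": "+",
     "txStart": 4330, "txEnd": 8775},
    {"name": "NM_002929", "chrom": "chr13", "strand": "+",
     "txStart": 113667281, "txEnd": 113735664},
]

# Keep only the row on the primary chromosome.
primary = [row for row in refgene_rows if is_primary_chrom(row["chrom"])]
print(primary)

# HYPOTHETICAL exon strings, only to show the trailing-comma format.
print(parse_exons("100,300,", "200,400,"))  # [(100, 200), (300, 400)]
```

Filtering by chrom name first avoids treating the two NM_002929 rows as conflicting structures for the same locus.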

Matthew Speir
UCSC Genome Bioinformatics Group

