Hi, Matthew
Thanks! I can see why uc003haa.4 is linked to NM_006206 now.
However, I have follow-up questions on how to obtain the best gene structure for a RefSeq entry.
(1) It seems the UCSC Genome Browser gateway search looks up a RefSeq accession in the hg38.refGene table. E.g., for NM_006206:
select name,chrom,strand,txStart,txEnd from hg38.refGene where name='NM_006206'
name chrom strand txStart txEnd
0 NM_006206 chr4 + 54229096 54298245
According to kgXref, NM_006206 is linked to uc003han.5:
select * from hg38.kgXref where mRNA='NM_006206'
kgID mRNA spID spDisplayID geneSymbol refseq \
0 uc003han.5 NM_006206 P16234 PGFRA_HUMAN PDGFRA NM_006206
However,
select name,chrom,strand,txStart,txEnd from hg38.knownGene where name='uc003han.5'
name chrom strand txStart txEnd
0 uc003han.5 chr4 + 54229096 54298247
Notice that the txEnd coordinates differ between refGene (54298245) and knownGene (54298247). Why do they differ?
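To put the two coordinate sets side by side, the three lookups above can be combined into a single join. This is only a minimal sketch: it uses Python's sqlite3 with an in-memory copy of just the rows quoted above, not the real UCSC MySQL tables, and it reproduces only the columns shown in this message. The table and column names follow the queries above; the variable names are my own.

```python
import sqlite3

# In-memory stand-in for the hg38 tables, holding only the rows quoted above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refGene   (name TEXT, chrom TEXT, strand TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE knownGene (name TEXT, chrom TEXT, strand TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE kgXref    (kgID TEXT, mRNA TEXT);
INSERT INTO refGene   VALUES ('NM_006206',  'chr4', '+', 54229096, 54298245);
INSERT INTO knownGene VALUES ('uc003han.5', 'chr4', '+', 54229096, 54298247);
INSERT INTO kgXref    VALUES ('uc003han.5', 'NM_006206');
""")

# Join RefSeq -> kgXref -> knownGene and return both sets of coordinates.
query = """
SELECT r.name, k.name, r.txStart, k.txStart, r.txEnd, k.txEnd
FROM refGene r
JOIN kgXref x    ON x.mRNA = r.name
JOIN knownGene k ON k.name = x.kgID AND k.chrom = r.chrom
WHERE r.name = ?
"""
rows = conn.execute(query, ("NM_006206",)).fetchall()

for refseq, ucsc, r_start, k_start, r_end, k_end in rows:
    print(refseq, ucsc,
          "txStart diff:", k_start - r_start,
          "txEnd diff:", k_end - r_end)
```

With the rows above, txStart agrees and txEnd differs by 2 bases.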
(2) Also, it seems RefSeq entries do not always have corresponding UCSC entries.
E.g., I searched for NM_002929.
I found two entries using
select name,chrom,strand,txStart,txEnd from hg38.refGene where name='NM_002929'
name chrom strand txStart txEnd
0 NM_002929 chr13_KI270842v1_alt + 4330 8775
1 NM_002929 chr13 + 113667281 113735664
However, no entry in hg38.kgXref matches NM_002929. Two entries in hg38.knownToRefSeq match:
select name as UCSC, value as RefSeq from hg38.knownToRefSeq r where value='NM_002929'
UCSC RefSeq
0 uc010tkf.3 NM_002929
1 uc058yoe.1 NM_002929
But neither of these two shares the same gene structure as either of the two rows returned by refGene:
select name,chrom,strand,txStart,txEnd from hg38.knownGene where name in ('uc010tkf.3','uc058yoe.1')
name chrom strand txStart txEnd
0 uc010tkf.3 chr13 + 113667154 113737735
1 uc058yoe.1 chr13 + 113726485 113735236
So my current understanding is this: UCSC entries come from Ensembl. For a given RefSeq accession, if I trust NCBI's RefSeq gene structure, I should use the refGene table to retrieve the data. But if I trust the transcripts from Ensembl, I should first map the RefSeq accession to UCSC entries using kgXref and obtain gene structures from the matched UCSC entries. If nothing is found in kgXref, I should fall back to knownToRefSeq to find UCSC entries and obtain the gene structure that way. Is my understanding correct?
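To make the lookup order I describe above concrete, here is a minimal sketch of it. It assumes my understanding is right, which is exactly what I am asking about. It runs against an in-memory sqlite3 copy of only the kgXref and knownToRefSeq rows quoted in this message, and ucsc_ids_for_refseq is a hypothetical helper name of my own, not anything provided by UCSC.

```python
import sqlite3

def ucsc_ids_for_refseq(conn, refseq):
    """Return (source_table, [UCSC IDs]); try kgXref first, then knownToRefSeq."""
    ids = [row[0] for row in conn.execute(
        "SELECT kgID FROM kgXref WHERE mRNA = ? ORDER BY kgID", (refseq,))]
    if ids:
        return "kgXref", ids
    # Fallback: kgXref had no match, so consult knownToRefSeq.
    ids = [row[0] for row in conn.execute(
        "SELECT name FROM knownToRefSeq WHERE value = ? ORDER BY name", (refseq,))]
    if ids:
        return "knownToRefSeq", ids
    return None, []

# In-memory stand-in holding only the rows quoted in this message.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kgXref        (kgID TEXT, mRNA TEXT);
CREATE TABLE knownToRefSeq (name TEXT, value TEXT);
INSERT INTO kgXref        VALUES ('uc003han.5', 'NM_006206');
INSERT INTO knownToRefSeq VALUES ('uc010tkf.3', 'NM_002929');
INSERT INTO knownToRefSeq VALUES ('uc058yoe.1', 'NM_002929');
""")

for accession in ("NM_006206", "NM_002929"):
    print(accession, ucsc_ids_for_refseq(conn, accession))
```

The returned UCSC IDs would then be looked up in knownGene to get the gene structure, as in the queries above.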
Thanks!