Hi,
I'm trying to understand why some transcripts are present in the UCSC alignments of RefSeq RNAs but are not present in the data imported directly from RefSeq. I've read through the announcement blog post and the one FAQ entry I could find, but neither answered my question.
An example is probably easiest. Take the gene PDGFRA. If I run the following two queries against the public MySQL server:
select * from hg38.refFlat where geneName = 'PDGFRA' and name like 'NM_%';
select * from hg38.ncbiRefSeq where name2 = 'PDGFRA' and name like 'NM_%';
the first one returns five transcripts (NM_001347827, NM_001347828, NM_001347829, NM_001347830, NM_006206) while the second one returns only a single transcript: NM_006206.4.
I thought perhaps that RefSeq only provided alignments for the single transcript. I couldn't find a reference at UCSC for exactly how the alignments from RefSeq are accessed, but on the RefSeq site I found the following GFF which I'm assuming contains data from the same source:
That GFF contains entries for all five transcripts, with alignments.
More broadly, it appears that ~12% of the NM transcripts present in hg38.refFlat are not present in hg38.ncbiRefSeq. The following query returns a count of 7,725, while the total number of NM transcripts in hg38.refFlat is 65,664:
select count(*)
from hg38.refFlat rf
left join hg38.ncbiRefSeq nm on rf.geneName = nm.name2 and rf.name = substr(nm.name, 1, instr(nm.name, '.')-1)
where rf.name like 'NM_%' and nm.name is null;
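In case it helps to check the join logic separately from the real tables, here is a minimal sketch that runs the same version-stripping anti-join on toy data using Python's built-in sqlite3 module. The table layouts are cut down to the two columns the join uses, the rows are just the PDGFRA accessions listed above, and the function name count_missing is only for illustration. The "nm.name is null" filter is my assumption about how the unmatched rows should be counted.

```python
import sqlite3

# Toy rows taken from the PDGFRA example: (gene symbol, accession).
REF_FLAT = [
    ("PDGFRA", "NM_001347827"),
    ("PDGFRA", "NM_001347828"),
    ("PDGFRA", "NM_001347829"),
    ("PDGFRA", "NM_001347830"),
    ("PDGFRA", "NM_006206"),
]
NCBI_REFSEQ = [("PDGFRA", "NM_006206.4")]


def count_missing(ref_flat_rows, ncbi_rows):
    """Count NM rows in a toy refFlat with no match in a toy ncbiRefSeq.

    ncbiRefSeq accessions carry a version suffix (".4"), so the suffix is
    stripped before comparing against the unversioned refFlat accession.
    """
    con = sqlite3.connect(":memory:")
    con.execute("create table refFlat (geneName text, name text)")
    con.execute("create table ncbiRefSeq (name2 text, name text)")
    con.executemany("insert into refFlat values (?, ?)", ref_flat_rows)
    con.executemany("insert into ncbiRefSeq values (?, ?)", ncbi_rows)
    # Left join keeps every refFlat row; unmatched rows have nm.name = NULL.
    (missing,) = con.execute(
        """
        select count(*)
        from refFlat rf
        left join ncbiRefSeq nm
          on rf.geneName = nm.name2
         and rf.name = substr(nm.name, 1, instr(nm.name, '.') - 1)
        where rf.name like 'NM_%'
          and nm.name is null
        """
    ).fetchone()
    con.close()
    return missing


if __name__ == "__main__":
    print(count_missing(REF_FLAT, NCBI_REFSEQ))
```

On the toy rows this counts the four PDGFRA accessions that appear only in refFlat, which matches what I see by eye in the two query results above.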
Any light that could be shed on these differences would be greatly appreciated! Thanks,
-t