I’m working on a graph database that includes genomic information (hg38) downloaded from UCSC with the Table Browser. I’m linking it to gene information from the HUGO database, and I’ve noticed some discrepancies in chromosome location between the two. For instance, the Table Browser places the following genes on chr22; the HUGO location is in the last column.
0 | chr22 | NR_132385 | long intergenic non-protein coding RNA 1297 | LINC01297 | 14q11.2 |
0 | chr22 | NR_073459 | BMS1, ribosome biogenesis factor pseudogene 18 | BMS1P18 | 14q11.2 |
0 | chr22 | NR_039973 | microRNA 5096 | MIR5096 | 4 |
0 | chr22 | NR_073460 | BMS1, ribosome biogenesis factor pseudogene 17 | BMS1P17 | 14q11.2 |
There are more discrepancies on other chromosomes. Forgive me if I’m asking a naïve question. I’m relatively new to bioinformatics.
Thanks!
Hi James,
Thank you for your question about RefSeq and HUGO discrepancies. The
genes you listed in your example are all RNAs or pseudogenes that map
to multiple locations in the genome. Note the BLAT results for the first
gene in your list:
BLAT Search Results

ACTIONS          QUERY      SCORE  START  END  QSIZE  IDENTITY  CHRO                  STRAND  START      END        SPAN
------------------------------------------------------------------------------------------------------------------------
browser details  NR_132385  570    1      572  572    100.0%    22                    +       15746674   15778289   31616
browser details  NR_132385  561    1      572  572    99.4%     14                    -       19344327   19375967   31641
browser details  NR_132385  561    1      572  572    99.4%     14                    +       19024061   19055696   31636
browser details  NR_132385  520    1      572  572    96.0%     2                     +       131644195  131676060  31866
browser details  NR_132385  514    1      572  572    95.7%     18                    -       14444408   14493506   49099
browser details  NR_132385  499    1      572  572    94.7%     15_KI270852v1_alt     -       96785      148106     51322
browser details  NR_132385  499    1      572  572    94.7%     15_KI270727v1_random  -       66034      117355     51322
browser details  NR_132385  499    1      572  572    94.7%     15                    -       21338762   21390139   51378
browser details  NR_132385  499    1      572  572    94.7%     15                    -       20769675   20821024   51350
browser details  NR_132385  443    1      572  572    94.4%     21                    +       13659844   13710474   50631
browser details  NR_132385  441    1      572  572    94.1%     21                    +       6794799    6845451    50653
You can see that there are multiple high-quality matches to the NR_132385 sequence, including the second and third results from the top (both on chr14, at 99.4% identity), which correspond to 14q11.2.
The reason there is a discrepancy between the HUGO and RefSeq Table
Browser results has to do with how we build the RefSeq track: we take
the sequence (and not coordinates) from NCBI, which we then align to the
genome with BLAT, and keep the best alignment, which in this case is
the one on chr22. You can find more information about this procedure on
the RefSeq track description page:
http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=refGene
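To illustrate the "keep the best alignment" step, here is a minimal Python sketch using the top four rows of the BLAT results above. It is not the actual track-building code: the `Hit` type, the function names, the choice to rank by score with identity as a tie-breaker, and the 99.0% threshold are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One BLAT result row (hypothetical structure for this example)."""
    query: str
    score: int
    identity: float  # percent identity
    chrom: str
    strand: str
    start: int
    end: int


# Top four rows copied from the BLAT results for NR_132385.
HITS = [
    Hit("NR_132385", 570, 100.0, "22", "+", 15746674, 15778289),
    Hit("NR_132385", 561, 99.4, "14", "-", 19344327, 19375967),
    Hit("NR_132385", 561, 99.4, "14", "+", 19024061, 19055696),
    Hit("NR_132385", 520, 96.0, "2", "+", 131644195, 131676060),
]


def best_hit(hits: List[Hit]) -> Hit:
    """Return the single highest-scoring hit (assumed ranking: score, then identity)."""
    return max(hits, key=lambda h: (h.score, h.identity))


def near_best(hits: List[Hit], min_identity: float = 99.0) -> List[Hit]:
    """Return every hit at or above an identity threshold (threshold is arbitrary)."""
    return [h for h in hits if h.identity >= min_identity]


if __name__ == "__main__":
    top = best_hit(HITS)
    print(f"best: chr{top.chrom}:{top.start}-{top.end}")
    for h in near_best(HITS):
        print(f"near-best: chr{h.chrom} ({h.identity}%)")
```

Keeping only the single best hit is what places this transcript on chr22, even though the two chr14 alignments are nearly as good.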
There is an additional wrinkle in that sometimes the assemblies used
to map transcripts change, in this case from hg19 to hg38. The
following previously answered mailing list questions elaborate further
on both of these topics:
https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/kJh3YJCiCDs/PEGTRqZdMwAJ
https://groups.google.com/a/soe.ucsc.edu/d/msg/genome/JPlIGxbsN6o/iVG5lSjSAAAJ
What likely happened is that HUGO used a different pipeline (or different assembly) to map the transcripts, and thus came up with the 14q11.2 location, while we came up with the chr22 location.
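If it helps with your graph database, one way to find every such case is to compare the chromosome from the Table Browser with the chromosome implied by the HUGO location string. The following Python sketch uses the four rows from your message. The tuple layout, the function names, and the assumption that a HUGO location starts with a chromosome name (1-22, X, or Y) are mine; locations that do not fit that pattern are skipped rather than guessed.

```python
import re
from typing import List, Optional, Tuple

# (Table Browser chrom, RefSeq accession, gene symbol, HUGO location)
Row = Tuple[str, str, str, str]

ROWS: List[Row] = [
    ("chr22", "NR_132385", "LINC01297", "14q11.2"),
    ("chr22", "NR_073459", "BMS1P18", "14q11.2"),
    ("chr22", "NR_039973", "MIR5096", "4"),
    ("chr22", "NR_073460", "BMS1P17", "14q11.2"),
]

# Leading chromosome name of a cytogenetic location such as "14q11.2" or "Xp11".
_CHROM_PREFIX = re.compile(r"(\d{1,2}|[XY])")


def hugo_chrom(location: str) -> Optional[str]:
    """Return e.g. 'chr14' for '14q11.2', or None if no chromosome is recognized."""
    match = _CHROM_PREFIX.match(location.strip())
    return f"chr{match.group(1)}" if match else None


def discrepancies(rows: List[Row]) -> List[Row]:
    """Return rows whose Table Browser chrom differs from the HUGO chromosome."""
    flagged = []
    for row in rows:
        expected = hugo_chrom(row[3])
        if expected is not None and expected != row[0]:
            flagged.append(row)
    return flagged


if __name__ == "__main__":
    for chrom, accession, symbol, location in discrepancies(ROWS):
        print(f"{symbol} ({accession}): UCSC {chrom}, HUGO {location}")
```

Flagged genes can then be checked with BLAT, as above, to see whether they are multi-mapping sequences.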
Thank you again for your inquiry and for using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
Christopher Lee
UCSC Genomics Institute