Default locus of a RefSeq protein accession ID

Mohamed Reda Keddar

May 19, 2020, 12:04:42 PM
to gen...@soe.ucsc.edu
Dear Genome Browser support,

I am trying to understand how the Genome Browser chooses the default locus (i.e. the best hit) for a RefSeq protein accession ID.

I first assumed it was based on the BLAT alignment score, i.e. that the Genome Browser shows the locus with the highest BLAT score for that protein. But then I came across an example where this does not seem to be the case: when I enter ‘NP_000652’ in the Genome Browser search bar, the locus I get is chr4:39,454,124-39,458,948. However, this does not correspond to the highest-scoring hit. There are 3 hits with the highest score of 576, on chrX, chr15 and chr15_KI270850v1_alt respectively, whereas the BLAT hit that seems to correspond to the Genome Browser locus has a score of 571. Instead, this hit (chr4:39,454,546-39,458,439) has the largest span of all hits. I have attached the sequence of NP_000652 for reference/reproducibility of the example.

Could you please explain how the default locus is chosen when a RefSeq protein accession is entered in the Genome Browser search, and whether the choice is based on any of the BLAT metrics such as score, span and/or identity? Also, how are ties in any of these metrics handled?

Many thanks for your help,

Reda

NP_000652.faa

Luis Nassar

May 22, 2020, 2:24:52 PM
to Mohamed Reda Keddar, gen...@soe.ucsc.edu

Hello Reda,

Thank you for your interest in the Genome Browser.

After you search for an accession, such as NP_000652, you will get results on two tracks:

  • RefSeq Genes
  • NCBI RefSeq genes, curated subset (NM_*, NR_*, NP_* or YP_*)

RefSeq Genes (also called UCSC RefSeq) is our in-house track, built by mapping RefSeq mRNA sequences to the genome with BLAT. The second track, NCBI RefSeq genes..., contains NCBI's own alignments, imported from their site. The scope of this answer is limited to the in-house track, even though in this case the coordinates are nearly identical between the two. For more information on this, see our genes FAQ (http://genome.ucsc.edu/FAQ/FAQgenes.html#ncbiRefseq).

You are correct that BLATing the protein sequence for NP_000652 yields better hits on chrX and chr15. However, what we align are the mRNA sequences, not the protein sequences. If you BLAT the sequence for NM_000661.5, you will see that the top hit is the annotated chr4 locus.

As you have said, the top hit for the protein sequence nearly always matches the actual annotation. In this rare case, the chrX and chr15 matches are actually annotated by GENCODE as pseudogenes: http://genome.ucsc.edu/s/Lou/ML25597. Some additional filtering steps are applied when choosing the alignment for the track, and ties are extremely unlikely. The top-scoring match when BLATing the mRNA sequence should nevertheless match the annotation in almost all cases.
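To make the score-versus-span distinction in this thread concrete, here is a minimal Python sketch that ranks a handful of hits by either metric and returns every hit tied for first place. The scores and the chr4 coordinates are the ones quoted in the question; the coordinates of the three 576-scoring hits are hypothetical placeholders. The `psl_score` helper assumes that the BLAT score follows the `pslScore` formula from the UCSC source tree (matches plus half the repeat matches, minus mismatches and insert counts, with base terms tripled for protein queries); treat that formula as an assumption rather than a statement of how the track is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One BLAT hit, reduced to the fields discussed in this thread."""
    chrom: str
    start: int  # target start, as quoted
    end: int    # target end, as quoted
    score: int  # BLAT score as shown in the results table

    @property
    def span(self) -> int:
        # Genomic extent covered by the hit, introns included.
        return self.end - self.start


def psl_score(match, mismatch=0, rep_match=0, q_num_insert=0,
              t_num_insert=0, is_protein=False):
    """Score from PSL columns (assumed formula, see the note above)."""
    size_mul = 3 if is_protein else 1
    return (size_mul * (match + (rep_match >> 1))
            - size_mul * mismatch - q_num_insert - t_num_insert)


def top_hits(hits, key):
    """Return every hit that shares the maximum value of `key`."""
    best = max(key(h) for h in hits)
    return [h for h in hits if key(h) == best]


# Scores and the chr4 coordinates come from the thread; the coordinates
# of the three 576-scoring hits are hypothetical placeholders.
hits = [
    Hit("chrX", 1_000, 1_576, 576),
    Hit("chr15", 2_000, 2_576, 576),
    Hit("chr15_KI270850v1_alt", 3_000, 3_576, 576),
    Hit("chr4", 39_454_546, 39_458_439, 571),
]

by_score = top_hits(hits, key=lambda h: h.score)
by_span = top_hits(hits, key=lambda h: h.span)

print([h.chrom for h in by_score])  # three hits tie on score
print([h.chrom for h in by_span])   # a single hit wins on span
```

Returning a list rather than a single hit keeps ties visible, so the caller has to decide how to break them instead of silently taking the first row.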

I hope this is helpful. Please include gen...@soe.ucsc.edu in any replies to ensure visibility by the team. All messages sent to that address are archived on our public forum. If your question includes sensitive information, you may send it instead to genom...@soe.ucsc.edu.

Lou Nassar
UCSC Genomics Institute



Mohamed Reda Keddar

May 26, 2020, 12:34:17 PM
to Luis Nassar, gen...@soe.ucsc.edu
Hi Luis,

Many thanks for your answer! It is indeed very helpful.

Is there a way to programmatically get the UCSC locus (annotation) for a given RefSeq protein accession ID? That would be very helpful in my case, as I am trying to resolve some cases where I have more than one top hit (ties).

Thanks for your help,

Reda

Jairo Navarro Gonzalez

May 29, 2020, 6:50:52 PM
to Mohamed Reda Keddar, Luis Nassar, gen...@soe.ucsc.edu

Hello Reda,

Unfortunately, we do not map proteins to the genome because the hits are not unique. However, you could use the public MySQL server to get the locus for a protein ID via its mRNA accession:

select chrom,txStart,txEnd from refGene r, hgFixed.refLink l where l.mrnaAcc=r.name and l.protAcc='NP_000652';
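For scripted use, the query above can be wrapped in a few lines of Python. This is a minimal sketch, not an official client: it assumes the `mysql` command-line client is installed, that the public server is reachable at `genome-mysql.soe.ucsc.edu` with the read-only `genome` user, and that the assembly database is `hg38` (the thread does not name the assembly). The helper names and the accession check are illustrative choices.

```python
import re
import subprocess

# Public server and read-only account (assumed; check the UCSC MySQL help page).
HOST = "genome-mysql.soe.ucsc.edu"
USER = "genome"

# Conservative check: two capital letters, underscore, digits (e.g. NP_000652).
_ACCESSION = re.compile(r"[A-Z]{2}_[0-9]+")


def build_query(prot_acc):
    """Return the query from the thread for one protein accession."""
    if not _ACCESSION.fullmatch(prot_acc):
        raise ValueError(f"unexpected accession format: {prot_acc!r}")
    return ("select chrom,txStart,txEnd from refGene r, hgFixed.refLink l "
            f"where l.mrnaAcc=r.name and l.protAcc='{prot_acc}';")


def mysql_command(prot_acc, db="hg38"):
    """Argument list for the mysql client; `db` is an assumed assembly."""
    return ["mysql", f"--user={USER}", f"--host={HOST}",
            "-A", "-N", "-B", "-e", build_query(prot_acc), db]


def parse_rows(text):
    """Parse tab-separated `chrom txStart txEnd` lines into tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, tx_start, tx_end = line.split("\t")
        rows.append((chrom, int(tx_start), int(tx_end)))
    return rows


def fetch_loci(prot_acc, db="hg38"):
    """Run the query; needs the mysql client and network access."""
    result = subprocess.run(mysql_command(prot_acc, db), check=True,
                            capture_output=True, text=True)
    return parse_rows(result.stdout)


def to_browser_position(chrom, tx_start, tx_end):
    """Convert 0-based, half-open table coordinates to a browser position."""
    return f"{chrom}:{tx_start + 1:,}-{tx_end:,}"
```

Calling `fetch_loci("NP_000652")` would return a list of `(chrom, txStart, txEnd)` tuples. The list may hold more than one row if the transcript is placed on several sequences, so any tie-breaking is left to the caller. UCSC tables store 0-based start coordinates, which is why `to_browser_position` adds 1 to the start before formatting.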

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu.
All messages sent to that address are archived on a publicly-accessible Google Groups forum.
If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Jairo Navarro
UCSC Genome Browser


