Discrepancies between UCSC Liftover and NCBI API remap.pl at updating chromosomal positions


Mendez Giraldez, Raul

May 28, 2015, 6:19:27 PM
to gen...@soe.ucsc.edu
Dear UCSC team,

I needed to update the chromosomal positions of 2,896,684 SNPs distributed all over the human genome, from NCBI36 (hg18) to GRCh37 (hg19). For that purpose I used UCSC Liftover through its web interface:

http://genome.ucsc.edu/cgi-bin/hgLiftOver

I then double-checked the results using the NCBI remap API tool (a Perl script):

ftp://ftp.ncbi.nlm.nih.gov/pub/remap/remap_api.pl

with the following command line:

perl remap_api.pl -mode asm-asm --from GCF_000001405.12 --dest GCF_000001405.13 --annotation BED_input_file --annot_out BED_output_file --in_format bed --allowdupes off

Although, as expected, the results from the two resources coincided in almost all cases, we found 98 mismatches between Liftover and the NCBI API, and 4 SNPs that were *NOT MAPPED* by the API. Among these mismatches, 80 are on chromosome 15. Please find attached the complete list in Excel format. The first column lists the chromosome ID, the second the NCBI36 (hg18) base-pair position, the third Liftover's mapping, the fourth NCBI's mapping, and columns 5-7 refer to NCBI data: mapped chromosome, recip, and asm_unit, respectively. Do you know what could cause these mismatches and non-mappings? Even though they are few, it would be good to know that the algorithms work correctly in all cases.
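
For readers who want to reproduce this kind of comparison, here is a minimal Python sketch of how the mismatches and non-mappings could be tallied. It is not the script used to build the attached spreadsheet. It assumes both tools' outputs are BED files whose fourth column holds a SNP identifier, and the function names are hypothetical.

```python
from collections import Counter


def parse_bed(lines):
    """Return {snp_name: (chrom, start)} from BED lines.

    Assumption: the SNP identifier is in the 4th BED column.
    """
    positions = {}
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and UCSC track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        positions[fields[3]] = (fields[0], int(fields[1]))
    return positions


def compare_mappings(liftover, remap):
    """Split SNPs into mismatches and SNPs missing from the remap output."""
    mismatches = {
        name: (liftover[name], remap[name])
        for name in liftover.keys() & remap.keys()
        if liftover[name] != remap[name]
    }
    unmapped_by_remap = sorted(liftover.keys() - remap.keys())
    return mismatches, unmapped_by_remap


def mismatches_per_chromosome(mismatches):
    """Count mismatches by the chromosome given in the liftOver output."""
    return Counter(lift_pos[0] for lift_pos, _ in mismatches.values())
```

Feeding `parse_bed` the lines of each output file and passing the two dictionaries to `compare_mappings` yields the mismatch table; `mismatches_per_chromosome` then shows whether the disagreements cluster on one chromosome, as they do here on chromosome 15.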

Thanks for your attention.

Best,
Raul

Raul Mendez Giraldez, PhD
Dept. Epidemiology
UNC School of Public Health
137 East Franklin Street,
306G CVS Plaza Building
Chapel Hill, NC 27599 - 8050
USA
FindMismatches_LiftOver_vs_NCBI_API.xlsx

Steve Heitner

May 29, 2015, 5:05:58 PM
to Mendez Giraldez, Raul, gen...@soe.ucsc.edu

Hello, Raul.

Thank you for sharing this comparison with us.  It’s good to see concrete examples that show overwhelming agreement between our tool and NCBI’s tool.

It appears that the disagreements on chromosome 15 are the result of a couple of high-identity segmental duplications (chr15:20,935,076-21,034,034 <--> chr15:21,941,706-22,040,711 and chr15:21,033,446-21,199,563 <--> chr15:22,044,595-22,210,800).  In these cases, NCBI’s mappings look better than ours in terms of similarity between hg18 and hg19.
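
To see which disagreeing SNPs fall inside these duplicated regions, something like the following Python sketch may help. The four position strings are copied from the paragraph above; the thread does not say which assembly they refer to, so the coordinates passed in must come from the same assembly. The sketch treats the ranges as 1-based and inclusive, which is an assumption, and the function names are hypothetical.

```python
# Position strings quoted from the message above.
SEGDUP_REGIONS = [
    "chr15:20,935,076-21,034,034",
    "chr15:21,941,706-22,040,711",
    "chr15:21,033,446-21,199,563",
    "chr15:22,044,595-22,210,800",
]


def parse_region(text):
    """Turn 'chr15:20,935,076-21,034,034' into ('chr15', 20935076, 21034034)."""
    chrom, span = text.split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return chrom, start, end


def in_regions(chrom, pos, regions=SEGDUP_REGIONS):
    """True if pos lies in any region (ranges assumed 1-based, inclusive)."""
    for region in regions:
        r_chrom, start, end = parse_region(region)
        if chrom == r_chrom and start <= pos <= end:
            return True
    return False
```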

One good way to double-check the results of both tools is to view at least a couple hundred base pairs on either side of the SNP in hg18, get the DNA for that region, and then blat that sequence on hg19 to view the results.  In most of the disagreements, the UCSC coordinate is the top blat hit, but the NCBI coordinate is a very close second.
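
As a small aid for that check, this Python sketch builds the position string for a window around a SNP, which can then be pasted into the hg18 browser before getting the DNA. It only computes coordinates; it does not fetch sequence or run blat. The 200 bp default flank and the function name are assumptions chosen to match "a couple hundred base pairs".

```python
def flank_region(chrom, pos, flank=200):
    """Return a 'chrom:start-end' string covering pos +/- flank.

    pos is a 1-based SNP position; the start is clamped to 1 so that
    SNPs near the beginning of a chromosome still give a valid range.
    The end is not clamped because chromosome lengths are not known here.
    """
    start = max(1, pos - flank)
    end = pos + flank
    return f"{chrom}:{start}-{end}"
```

For example, `flank_region("chr15", 21000000)` gives `chr15:20999800-21000200`.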

Most of the chromosome 4 and all of the chromosome 17 disagreements were the result of NCBI listing haplotype chromosomes.  For all of the haplotype instances that I checked, the haplotype chromosome was actually the top blat hit, but liftOver preferentially selects the main chromosomes over the haplotype chromosomes, which is why UCSC’s liftOver and NCBI’s remap tool disagreed.  (In general, there are far more annotations on the main chromosomes than on the alternate chromosomes; in the case of SNPs, this preference may not always produce the intended results.)
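
Disagreements of this kind can be flagged automatically. The Python sketch below assumes the remap output has been converted to UCSC-style hg19 names, where haplotype chromosomes end in `_hapN` (for example `chr17_ctg5_hap1`); if the spreadsheet uses NCBI accessions instead, the pattern would need to change. The function names are hypothetical.

```python
import re

# UCSC hg19 haplotype chromosome names end in "_hap" plus a number.
HAPLOTYPE_SUFFIX = re.compile(r"_hap\d+$")


def is_haplotype(chrom):
    """True for names such as 'chr6_cox_hap2' or 'chr17_ctg5_hap1'."""
    return bool(HAPLOTYPE_SUFFIX.search(chrom))


def main_chromosome(chrom):
    """Strip any suffix: 'chr17_ctg5_hap1' -> 'chr17'."""
    return chrom.split("_", 1)[0]


def explain_disagreement(liftover_chrom, remap_chrom):
    """Label a mismatch 'haplotype' when remap chose a haplotype
    chromosome of the same main chromosome that liftOver chose."""
    if (
        is_haplotype(remap_chrom)
        and not is_haplotype(liftover_chrom)
        and main_chromosome(remap_chrom) == liftover_chrom
    ):
        return "haplotype"
    return "other"
```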


Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
All messages sent to that address are archived on a publicly-accessible Google Groups forum.  If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group
