Dog genome: The standalone LiftOver utility and the online HTML form give different results.

Fahad Syed

Aug 21, 2014, 11:34:30 AM
to gen...@soe.ucsc.edu, dog-...@broadinstitute.org
Hi,

I am developing a tool to map SNP coordinates from CanFam2.0 to CanFam3.1 using the standalone LiftOver utility provided by UCSC at 'http://hgdownload.soe.ucsc.edu/admin/exe' and the chain file 'http://hgdownload.soe.ucsc.edu/goldenPath/canFam2/liftOver/canFam2ToCanFam3.over.chain.gz'. The standalone utility and the online HTML form at 'http://genome.ucsc.edu/cgi-bin/hgLiftOver' give different outputs for the same coordinates.

My workflow is listed below:

1. Merge all of the SNP data at http://www.broadinstitute.org/mammals/dog/snp2/ into a single SNP data file and convert it to BED format.

2. Transform the coordinates in the BED file using the standalone LiftOver utility.

3. Verify the result using the online HTML form at http://genome.ucsc.edu/cgi-bin/hgLiftOver. For some of the lifted records, the result is different. For example, consider the following coordinates in CanFam2.0:

    chr11:12559119-12559120
    chr11:12559366-12559367
    chr11:12559633-12559634
    chr11:12559687-12559688
    chr11:12559993-12559994
    chr3:3027265-3027266
    chr3:3027545-3027546
    chr3:3027683-3027684
    chr3:3027844-3027845
    chr9:10348783-10348784
    chr9:10357267-10357268
    chr9:10357648-10357649
    chrX:125198-125199
    chrX:126573-126574
    chrX:127209-127210

The standalone LiftOver utility produced the following output:

    chr11:11080593-11080594
    chr11:9553465-9553466
    chr11:11080077-11080078
    chr11:11080023-11080024
    chr11:9552838-9552839
    chr3:73720-73721
    chr3:73440-73441
    chr3:73302-73303
    chr3:73141-73142
    chr9:7452976-7452977
    chr9:7444491-7444492
    chr9:7444110-7444111
    chrX:346785-346786
    chrX:345410-345411
    chrX:344774-344775


The online HTML form, in contrast, produced the following output:

    chr11:11080594-11080595
    chr11:9553466-9553467
    chr11:11080078-11080079
    chr11:11080024-11080025
    chr11:9552839-9552840
    chr3:73721-73722
    chr3:73441-73442
    chr3:73303-73304
    chr3:73142-73143
    chr9:7452977-7452978
    chr9:7444492-7444493
    chr9:7444111-7444112
    chrX:346786-346787
    chrX:345411-345412
    chrX:344775-344776


As you can see, there is a shift of one base between the standalone utility output and the HTML form output. Could you please explain the difference in results?
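
One quick way to quantify the shift is to subtract the two outputs record by record. The sketch below is illustrative only: it hard-codes a few records copied from the two lists above, and the helper name `parse_position` is made up for this example rather than taken from any UCSC tool.

```python
# Compare two liftOver outputs record by record.
# parse_position is a hypothetical helper, not part of any UCSC tool.

def parse_position(text):
    """Split 'chr11:11080593-11080594' into ('chr11', 11080593, 11080594)."""
    chrom, span = text.strip().split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)

# A few records copied from the two outputs above, in the same order.
standalone = ["chr11:11080593-11080594", "chr3:73720-73721",
              "chr9:7452976-7452977", "chrX:346785-346786"]
web_form = ["chr11:11080594-11080595", "chr3:73721-73722",
            "chr9:7452977-7452978", "chrX:346786-346787"]

offsets = []
for a, b in zip(standalone, web_form):
    chrom_a, start_a, end_a = parse_position(a)
    chrom_b, start_b, end_b = parse_position(b)
    assert chrom_a == chrom_b, "records must be compared in matching order"
    # Difference (web form minus standalone) for start and end.
    offsets.append((start_b - start_a, end_b - end_a))

print(offsets)  # [(1, 1), (1, 1), (1, 1), (1, 1)]
```

For these records both the start and the end differ by exactly one, which is the pattern a coordinate-convention mismatch tends to produce.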

Note that I have also lifted my merged SNP data file using the file submission option of the online form at http://genome.ucsc.edu/cgi-bin/hgLiftOver and found no difference from the standalone utility's output.


Best regards,
Fahad Syed.
BC Platforms Ltd
web: www.bcplatforms.com

Matthew Speir

Aug 27, 2014, 4:13:34 PM
to Fahad Syed, gen...@soe.ucsc.edu, dog-...@broadinstitute.org
Hi Fahad,

Thank you for your question about the LiftOver utility. Could you share a few lines of the BED file you've been using as input for the standalone LiftOver? If you would rather not share them with the public list, you can send them to me directly.

The difference in output coordinates is most likely due to different input formats. The standalone LiftOver utility takes BED format as input, whereas the web LiftOver utility accepts both position format and BED format. Entering the same numbers in these two formats into the web LiftOver utility will give you different results, because the position format is one-based and the BED format is zero-based, and the Genome Browser interprets each one differently. You can read more about the differences between the two and how they are interpreted by the Genome Browser here: http://genome.ucsc.edu/FAQ/FAQtracks#tracks1.
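
To make the two conventions concrete, here is a minimal Python sketch. It assumes the usual reading of the formats (position format is 1-based and includes both endpoints, BED is 0-based and excludes the end); the function names are invented for illustration and are not part of liftOver or the Genome Browser.

```python
# Hypothetical helpers illustrating the two coordinate conventions.

def position_to_bed(chrom, start, end):
    """1-based, fully closed position -> 0-based, half-open BED interval."""
    return chrom, start - 1, end

def bed_to_position(chrom, start, end):
    """0-based, half-open BED interval -> 1-based, fully closed position."""
    return chrom, start + 1, end

def bed_length(start, end):
    """Number of bases covered by a 0-based, half-open interval."""
    return end - start

# The same pair of numbers from the thread, read under each convention.
as_position = position_to_bed("chr11", 12559119, 12559120)
as_bed = ("chr11", 12559119, 12559120)

print(as_position, bed_length(*as_position[1:]))  # ('chr11', 12559118, 12559120) 2
print(as_bed, bed_length(*as_bed[1:]))            # ('chr11', 12559119, 12559120) 1
print(bed_to_position(*as_bed))                   # ('chr11', 12559120, 12559120)
```

Read as a position, the pair covers two bases; read as BED, it covers the single base at 1-based coordinate 12559120. This only shows why identical numbers can describe different bases; it does not reproduce what liftOver itself does.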

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

Jeremy Johnson

Aug 28, 2014, 12:34:49 PM
to fahad...@bcplatforms.com, gen...@soe.ucsc.edu
Hi Fahad,

A few things:

1)  We're not responsible for maintaining UCSC's tools.  I suggest you bring this up with them (gen...@soe.ucsc.edu), but off-by-one problems are quite common when dealing with genomic data (computer science people like to start numbering at 0, biologists at 1).

2)  We've already done this liftover, and have the data hosted on the CanFam3.1 genome on UCSC via a Track Hub.  If you go to UCSC, load the CanFam3.1 assembly, then hit the "Track Hubs" button and load the "Broad Improved Canine Annotation v1" Track Hub, there is a track listed as "Survey SNPs."  It contains the SNPs from CanFam2, as well as some additional SNPs we generated when we designed the dog SNP array.

3)  You can check the results of the discrepant tools yourself, as you have the truth set of genotypes from CanFam2.  Whichever result matches those genotypes (i.e., returns the same bases from CanFam3.1) is the correct one.
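
The check in point 3 can be sketched roughly as follows. Everything here is a toy assumption: `toy_genome` is a made-up seven-base sequence standing in for CanFam3.1, the expected base is invented, and `tool_a` / `tool_b` are placeholder names for the two candidate results. In practice the sequence would be loaded from the assembly and the expected alleles from the CanFam2 SNP data.

```python
# Toy stand-in for a CanFam3.1 chromosome; not real dog sequence.
toy_genome = {"chrToy": "GATTACA"}

def bases_in_bed_interval(genome, chrom, start, end):
    """Return the bases covered by a 0-based, half-open interval."""
    return genome[chrom][start:end].upper()

# Made-up reference allele for the toy SNP.
expected_ref_base = "C"

# Two candidate lifted intervals that differ by one base (hypothetical).
candidates = {
    "tool_a": ("chrToy", 5, 6),
    "tool_b": ("chrToy", 6, 7),
}

verdicts = {}
for name, (chrom, start, end) in candidates.items():
    observed = bases_in_bed_interval(toy_genome, chrom, start, end)
    # Record the observed base and whether it matches the expected allele.
    verdicts[name] = (observed, observed == expected_ref_base)

print(verdicts)  # {'tool_a': ('C', True), 'tool_b': ('A', False)}
```

If a region maps to the opposite strand in the new assembly, the observed base would need to be complemented before comparing; this sketch does not handle that case.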

Hope this helps, and good luck!

~Jeremy


Jeremy Johnson
Project Coordinator
Vertebrate Biology Group
Broad Institute