liftover problem

1,891 views
Skip to first unread message

Shiqing Mao

unread,
Jun 20, 2017, 11:50:43 AM6/20/17
to gen...@soe.ucsc.edu
Hi 

I used liftover to convert one bedmethyl file from hg38 to hg19.
In the hg19 file I got a lot of duplicates (77k/48m).

By searching around, I found probably the same problem from a picard user:

I am a bit confused how this could happen.
Thanks for your help in advance.

Regards
Shiqing

Brian Lee

unread,
Jun 20, 2017, 4:22:21 PM6/20/17
to Shiqing Mao, gen...@soe.ucsc.edu
Dear Shiqing,

Thank you for using the UCSC Genome Browser and your question about duplicate locations when using liftover for a file from hg38 to hg19.

Can you share more information, such as the specific hg38 coordinates that are mapping to multiple hg19 locations?

Here is a session of the hg38 assembly where a track is turned on that shows differences between the hg19 and hg38 assemblies:

hg38 coordinates:

In some regions, you can see that there are red blocks indicating a new contig added to hg38 to update sequence or fill gaps present in hg19. This will mean some coordinates can not directly lift back from hg38 to hg19. It is possible that is what is happening to your regions.

Thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute

--

---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/7D92A643-F6ED-4C93-A62B-CEC2E6A23A77%40cruk.cam.ac.uk.
For more options, visit https://groups.google.com/a/soe.ucsc.edu/d/optout.

Shiqing Mao

unread,
Jun 20, 2017, 5:22:48 PM6/20/17
to Brian Lee, gen...@soe.ucsc.edu
Deal Brian

Thanks for your quick reply.

Here is one example:
Before liftover, there is only one entry in hg38. 
chr1 10524 10525

After liftover, there are two coordinates having the same position.
chr1 10524 10525
chr1 10524 10525

And there are 77 thousands out of 58 million CpG sites.

Best wishes
Shiqing

Brian Lee

unread,
Jun 20, 2017, 5:57:22 PM6/20/17
to Shiqing Mao, gen...@soe.ucsc.edu

Dear Shiqing

Thank you for using the UCSC Genome Browser and your added information about the coordinates you are using.

What tool are you using to perform your liftOver? I did a test of our liftOver tool and am only getting one coordinate. You can get precompiled versions of the liftOver tool to use on your computer. For example, for my Apple laptop I acquired this binary, and the associated hg38 to hg19 chain file:

wget http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/liftOver
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz

With an input bed file like this one called hg38In.bed:

chr1 10524 10525

I ran the command to allow output to a newHg19.bed and an "unMapped" file for any coordinates that did not map:

./liftOver hg38In.bed hg38ToHg19.over.chain.gz newHg19.bed unMapped

The result was that the newHg19.bed had only one coordinate:

chr1 10524 10525

You can see this on our website too, where you can paste these coordinates on the hg38 web version of the liftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver?db=hg38) and get only one set of coordinates back for hg19. Similarly you can see this with the visual conversion tool, that will take browsing regions and transfer them to hg19, this region only results in one spot in hg19: http://genome.ucsc.edu/cgi-bin/hgConvert?db=hg38&position=chr1%3A10525-10525

If you are using the picard liftOver tool, it is harder for us to be of any assistance since we did not develop that tool. You will want to contact that resource to ask for help addressing the issue.

I hope this was helpful, thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

All the best,

Brian Lee
UCSC Genomics Institute

Shiqing Mao

unread,
Jun 20, 2017, 6:16:23 PM6/20/17
to Brian Lee, gen...@soe.ucsc.edu
Deal Brian

I don’t think the other coordinate is just duplicated. It could be from other part.

Here is my procedure.

I download the bedmethyl file from ENCODE

Since liftover doesn’t work on this big file, I cut it down using the following code:
awk '{ print $1,$2,$3,$11 }’ 

I got liftover and the chainfile from UCSC website.
I also tried using liftover in galaxy server.
They gave the same output with duplicates after liftover.

I also tried the same procedure using a different encode bedmethyl file.
I always got duplicated entry for chr1 10524 10525.

best wishes
Shiqing

Brian Lee

unread,
Jun 22, 2017, 8:16:52 PM6/22/17
to Shiqing Mao, gen...@soe.ucsc.edu

Dear Shiqing,

Thank you for your message. You may be interested to learn you can use an option on the liftOver tool, -bedPlus=9, which will allow you to convert longer bed files with additional columns instead of having to use the awk step.

liftOver -bedPlus=9 ENCFF847OWL.bed hg38ToHg19.over.chain.gz ENCFF847OWL.hg19.bed unMapped

By repeating steps with this file it was discovered that there are indeed many duplications taking place as you describe, but in some ways this isn't surprising. The suspicion is that most of the new sequence in hg38 is going to liftOver to some place in hg19 that may appear in the liftOver chains more than once.

In chain and net terminology, the target is the reference genome sequence and the query is some other genome sequence. For example, if you are viewing Human-Mouse alignments in the Human genome browser, human is the target and mouse is the query. LiftOver chains are single coverage on the target, but can be overlapping on the query. About 2.5% of hg19 is mapped to more than once in the hg38->hg19 lifttOver chain

The best solution may be to filter out these duplicates, if you choose to do so.

By following your steps but using a different awk command to print out the originating coordinates in hg38, along with the score, awk '{print $1, $2, $3, $1"_"$2"_"$3"_score"$4}', I found that this position duplicate is caused by a different location:

chr1 10524 10525 chr1_10524_10525_score100
chr1 10524 10525 chr1_181054_181055_score100

And if you try to lift these coordinates, chr1:181054-181055, in hg38 to hg19, you do indeed get that position in hg19 (http://genome.ucsc.edu/cgi-bin/hgConvert?db=hg38&position=chr1%3A181054-181055) and if you use that earlier session with the hg38 assembly and the track about the assembly differences navigating to this position, http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=hg38.diffHg19&position=chr1%3A181054-181055 , you will see that this is from a contig new to GRCh38, so it makes some sense that it does not find an exact spot on hg19.

All the best,
Brian Lee

Shiqing Mao

unread,
Jun 23, 2017, 11:16:29 AM6/23/17
to Brian Lee, gen...@soe.ucsc.edu
Deal Brian

Many thanks for the explicit answer.
It really helps to clear up my confusion.

One more question about the awk line you used.
Could you please explain how you find the originating coordinates?
What is the score column? Is it from another option of liftover?

Best wishes
Shiqing

Brian Lee

unread,
Jun 23, 2017, 5:46:45 PM6/23/17
to Shiqing Mao, gen...@soe.ucsc.edu

Dear Shinqing,

Thank you for your message. My apologies, I incorrectly labeled the 11th column score, when it should be percent methylation based on bedMethyl.as definitions: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/lib/encode/bedMethyl.as

Where you used awk '{ print $1,$2,$3,$11 }’ and put the percent methylation in the 4th column, I changed the 4th column to repeat the position as well as the percentMeth column all joined by "_", but it ought to have had $11: awk '{print $1, $2, $3, $1"_"$2"_"$3"_percentMeth"$11}' as I was transforming an existing four column output from an earlier step replicating your process (this is the only reason there was $4 instead of $11). Using this awk step would result in an input bed like the following in the hg38 coordinates:

chr1 10524 10525 chr1_10524_10525_percentMeth100

And when it was lifted over the result would be columns 1,2,3 having new hg19 coordinates, but the "name" column, chr1_10524_10525_percentMeth100, being retained and thus annotating hg38 input coordinates. By using this approach, it was found that the hg38 chr1 181054 181055 coordinates (new to hg38) were mapping to chr1 10524 10525 in hg19.

Please note that there is also the tool CrossMap that can read almost any bed file and all fields, and tons of other files, and is compatible with liftOver. It might also be a tool you would like to investigate: http://crossmap.sourceforge.net/

Here are some examples of using both where inBed is hg38 coordinates, and how they result in the duplicate item in hg19 (because chr1 181054 181055 is new in hg38 and doesn't exist in hg19):

inBed
chr1 10524 10525 chr1_10524_10525_percentMeth100
chr1 181054 181055 chr1_181054_181055_percentMeth100

CrossMap.py bed hg38ToHg19.over.chain.gz inBed outHg19

outHg19 
chr1 10524 10525 chr1_10524_10525_percentMeth100
chr1 10524 10525 chr1_181054_181055_percentMeth100

liftOver inBed hg38ToHg19.over.chain.gz outHg19.liftOver unmapped

outHg19.liftOver
chr1 10524 10525 chr1_10524_10525_percentMeth100
chr1 10524 10525 chr1_181054_181055_percentMeth100

Reply all
Reply to author
Forward
0 new messages