--
---
You received this message because you are subscribed to the Google Groups "UCSC Genome Browser Public Support" group.
To unsubscribe from this group and stop receiving emails from it, send an email to genome+un...@soe.ucsc.edu.
To post to this group, send email to gen...@soe.ucsc.edu.
Visit this group at https://groups.google.com/a/soe.ucsc.edu/group/genome/.
To view this discussion on the web visit https://groups.google.com/a/soe.ucsc.edu/d/msgid/genome/7D92A643-F6ED-4C93-A62B-CEC2E6A23A77%40cruk.cam.ac.uk.
For more options, visit https://groups.google.com/a/soe.ucsc.edu/d/optout.
Dear Shiqing
Thank you for using the UCSC Genome Browser and your added information about the coordinates you are using.
What tool are you using to perform your liftOver? I did a test of our liftOver tool and am only getting one coordinate. You can get precompiled versions of the liftOver tool to use on your computer. For example, for my Apple laptop I acquired this binary, and the associated hg38 to hg19 chain file:
wget http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/liftOver
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz
With an input bed file like this one called hg38In.bed:
chr1 10524 10525
I ran the command to allow output to a newHg19.bed and an "unMapped" file for any coordinates that did not map:
./liftOver hg38In.bed hg38ToHg19.over.chain.gz newHg19.bed unMapped
The result was that the newHg19.bed had only one coordinate:
chr1 10524 10525
You can see this on our website too, where you can paste these coordinates on the hg38 web version of the liftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver?db=hg38) and get only one set of coordinates back for hg19. Similarly you can see this with the visual conversion tool, that will take browsing regions and transfer them to hg19, this region only results in one spot in hg19: http://genome.ucsc.edu/cgi-bin/hgConvert?db=hg38&position=chr1%3A10525-10525
If you are using the picard liftOver tool, it is harder for us to be of any assistance since we did not develop that tool. You will want to contact that resource to ask for help addressing the issue.
I hope this was helpful, thank you again for your inquiry and using the UCSC Genome Browser. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.
All the best,
Brian Lee
UCSC Genomics Institute
Dear Shiqing,
Thank you for your message. You may be interested to learn you can use an option on the liftOver tool, -bedPlus=9, which will allow you to convert longer bed files with additional columns instead of having to use the awk step.
liftOver -bedPlus=9 ENCFF847OWL.bed hg38ToHg19.over.chain.gz ENCFF847OWL.hg19.bed unMapped
By repeating steps with this file it was discovered that there are indeed many duplications taking place as you describe, but in some ways this isn't surprising. The suspicion is that most of the new sequence in hg38 is going to liftOver to some place in hg19 that may appear in the liftOver chains more than once.
In chain and net terminology, the target is the reference genome sequence and the query is some other genome sequence. For example, if you are viewing Human-Mouse alignments in the Human genome browser, human is the target and mouse is the query. LiftOver chains are single coverage on the target, but can be overlapping on the query. About 2.5% of hg19 is mapped to more than once in the hg38->hg19 lifttOver chain
The best solution may be to filter out these duplicates, if you choose to do so.
By following your steps but using a different awk command to print out the originating coordinates in hg38, along with the score, awk '{print $1, $2, $3, $1"_"$2"_"$3"_score"$4}', I found that this position duplicate is caused by a different location:
chr1 10524 10525 chr1_10524_10525_score100
chr1 10524 10525 chr1_181054_181055_score100
And if you try to lift these coordinates, chr1:181054-181055, in hg38 to hg19, you do indeed get that position in hg19 (http://genome.ucsc.edu/cgi-bin/hgConvert?db=hg38&position=chr1%3A181054-181055) and if you use that earlier session with the hg38 assembly and the track about the assembly differences navigating to this position, http://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=brianlee&hgS_otherUserSessionName=hg38.diffHg19&position=chr1%3A181054-181055 , you will see that this is from a contig new to GRCh38, so it makes some sense that it does not find an exact spot on hg19.
All the best,
Brian Lee
Dear Shinqing,
Thank you for your message. My apologies, I incorrectly labeled the 11th column score, when it should be percent methylation based on bedMethyl.as definitions: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/lib/encode/bedMethyl.as
Where you used awk '{ print $1,$2,$3,$11 }’ and put the percent methylation in the 4th column, I changed the 4th column to repeat the position as well as the percentMeth column all joined by "_", but it ought to have had $11: awk '{print $1, $2, $3, $1"_"$2"_"$3"_percentMeth"$11}' as I was transforming an existing four column output from an earlier step replicating your process (this is the only reason there was $4 instead of $11). Using this awk step would result in an input bed like the following in the hg38 coordinates:
And when it was lifted over the result would be columns 1,2,3 having new hg19 coordinates, but the "name" column, chr1_10524_10525_percentMeth100, being retained and thus annotating hg38 input coordinates. By using this approach, it was found that the hg38 chr1 181054 181055 coordinates (new to hg38) were mapping to chr1 10524 10525 in hg19.
Please note that there is also the tool CrossMap that can read almost any bed file and all fields, and tons of other files, and is compatible with liftOver. It might also be a tool you would like to investigate: http://crossmap.sourceforge.net/
Here are some examples of using both where inBed is hg38 coordinates, and how they result in the duplicate item in hg19 (because chr1 181054 181055 is new in hg38 and doesn't exist in hg19):
CrossMap.py bed hg38ToHg19.over.chain.gz inBed outHg19inBed
chr1 10524 10525 chr1_10524_10525_percentMeth100
chr1 181054 181055 chr1_181054_181055_percentMeth100
liftOver inBed hg38ToHg19.over.chain.gz outHg19.liftOver unmappedoutHg19
chr1 10524 10525 chr1_10524_10525_percentMeth100
chr1 10524 10525 chr1_181054_181055_percentMeth100
outHg19.liftOver
chr1 10524 10525 chr1_10524_10525_percentMeth100
chr1 10524 10525 chr1_181054_181055_percentMeth100