Fwd: hg38 to hg19


Inna Zukher

Oct 6, 2016, 11:35:06 AM
to gen...@soe.ucsc.edu
Dear UCSC Team,

I am trying to lift over some RNA-seq data from hg38 to hg19.

However, the output contains a number of overlapping regions. This is confusing, and it also prevents me from converting the .bedGraph to .bigWig so that I can upload the data to the UCSC browser and compare it with other data.

A typical example is below, from the sorted liftOver output .bedGraph file:

Line 2 says that the whole fragment from 10531 to 10564 has value 0, but line 3 then says that 10533 to 10534 is 3, line 4 that 10534 to 10535 is 4, and so on.

chr1 10485 10531 1
chr1 10531 10564 0
chr1 10533 10534 3
chr1 10534 10535 4
chr1 10535 10539 5
chr1 10539 10541 6
chr1 10541 10582 7
chr1 10564 10614 1
chr1 10614 13488 0
chr1 13085 13090 3
chr1 13090 13092 2


In the unsorted output file the ordering is a bit strange, but maybe it is supposed to be like that.
There is almost the whole of chr1, in order, then a seemingly random piece from the beginning of chr1, then a random part of chrY, then chr2, then chr1 again...

chr1 10485 10531 1
chr1 10531 10564 0
chr1 10564 10614 1
chr1 10614 13488 0
------[all the chr1, ordered]------
chr1 174426 174444 1  [last chr1 fragment]
chr1 10533 10534 3  [chr1 beginning, overlapping regions here]
chr1 10534 10535 4  [overlapping regions here]
chr1 10535 10539 5  [overlapping regions here]
chr1 10539 10541 6  [overlapping regions here]
chr1 10541 10582 7  [overlapping regions here]
chrY 59361617 59361649 1
chr2 114358172 114358221 0
chr2 114358113 114358121 0
chr2 114358112 114358113 1
chr2 114358064 114358112 2
chr2 114358062 114358064 1
chr2 114357993 114358062 0
chr2 114357974 114357993 1
chr2 114357946 114357974 2
chr2 114357943 114357946 3
chr1 13085 13090 3
chr1 13090 13092 2
chr1 13092 13116 3
chr1 13116 13118 4
chr15 102518034 102518051 3
chr15 102518027 102518034 2
chr2 114357759 114357849 0
chr2 114357733 114357759 1
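
An unordered listing like the one above can be put into chromosome and start order before any further processing (bedGraphToBigWig expects sorted input). A minimal Python sketch, assuming whitespace-separated bedGraph lines whose first four fields are chrom, start, end and value; the function name `sort_bedgraph` is made up for this illustration:

```python
def sort_bedgraph(lines):
    """Sort bedGraph lines by chromosome name, then start, then end."""
    records = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        # Keep only the four bedGraph fields; ignore any trailing notes.
        chrom, start, end, value = line.split()[:4]
        records.append((chrom, int(start), int(end), value))
    # Plain string comparison on the chromosome name, numeric on coordinates.
    records.sort(key=lambda r: (r[0], r[1], r[2]))
    return ["\t".join(str(field) for field in r) for r in records]


unsorted_lines = [
    "chr1 174426 174444 1",
    "chr1 10533 10534 3",
    "chrY 59361617 59361649 1",
    "chr2 114358172 114358221 0",
    "chr2 114358113 114358121 0",
    "chr1 13085 13090 3",
]
sorted_lines = sort_bedgraph(unsorted_lines)
for line in sorted_lines:
    print(line)
```

Sorting only fixes the order; any overlapping intervals are still present afterwards and have to be handled separately.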


The tools I used were:
liftOver for macOSX.x86_64 from here http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/ and
the hg38ToHg19.over.chain.gz file from here http://hgdownload.cse.ucsc.edu/goldenPath/hg38/liftOver/

Maybe you have some advice on how to avoid these overlapping elements in the output, or on how to delete or merge them?


Thank you,
Inna

Cath Tyner

Oct 6, 2016, 7:54:37 PM
to Inna Zukher, UCSC Genome Browser Public Help Forum
Hello Inna,

Thank you for using the UCSC Genome Browser and for inquiring about overlapping regions output from the LiftOver utility.

Please start by understanding our 0-based start, 1-based end coordinate system. Please see this FAQ for more information.

Here is a detailed response explaining the 0-based start coordinate system as it pertains to LiftOver (web-based and command-line LiftOver versions).


As an example, I uploaded a BED file for 5 of your regions (listed below) into hg19 as a custom track, where you can see in the browser that these regions don't actually overlap (you may have to zoom in closer to see that there is no overlap between regions).

Regions in custom track (follow the link above):

chr1 10533 10534
chr1 10534 10535
chr1 10535 10539
chr1 10539 10541
chr1 10541 10582
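
The same check can be done outside the browser. The following is a minimal Python sketch, assuming BED-style half-open intervals (0-based start, end not included); the helper name `overlaps` is made up for this illustration. It shows that the five regions above only touch at their boundaries, while each of them does overlap the `chr1 10531 10564 0` line from the original post:

```python
from itertools import combinations


def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


regions = [
    ("chr1", 10533, 10534),
    ("chr1", 10534, 10535),
    ("chr1", 10535, 10539),
    ("chr1", 10539, 10541),
    ("chr1", 10541, 10582),
]

# Neighbouring regions share a boundary coordinate but no base.
overlapping_pairs = [(a, b) for a, b in combinations(regions, 2) if overlaps(a, b)]
print(overlapping_pairs)  # []

# The zero-valued line from the sorted output in the original post.
zero_block = ("chr1", 10531, 10564)
hits = [r for r in regions if overlaps(zero_block, r)]
print(len(hits))  # 5
```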

Below are additional resources which may also be helpful to you:

Here is a similar previously answered question from the mailing list archives:


The most sensible method for dealing with overlapping regions will depend on what your data is and what you intend to do with it. Our engineers suggest the best plan is probably to zero out any overlapping values or set them to something easily recognizable like -1. The two original values may have little to do with each other and attempting to combine them could just result in nonsensical data.
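
One way to act on that suggestion is sketched below in Python. This is not a UCSC tool, only an illustration under stated assumptions: records are (chrom, start, end, value) tuples with half-open coordinates, any base covered by more than one record is given the flag value -1, and the function name `mask_overlaps` is made up here. The loop is quadratic per chromosome, so it suits small files or a single region rather than a whole genome.

```python
from collections import defaultdict


def mask_overlaps(records, flag=-1):
    """Return sorted, non-overlapping records; multiply covered bases get `flag`."""
    by_chrom = defaultdict(list)
    for chrom, start, end, value in records:
        by_chrom[chrom].append((start, end, value))

    out = []
    for chrom in sorted(by_chrom):
        intervals = by_chrom[chrom]
        # Every start or end coordinate is a point where coverage can change.
        points = sorted({p for s, e, _ in intervals for p in (s, e)})
        for left, right in zip(points, points[1:]):
            # Values of all records that fully cover this elementary segment.
            covering = [v for s, e, v in intervals if s <= left and right <= e]
            if not covering:
                continue  # gap between records: emit nothing
            value = covering[0] if len(covering) == 1 else flag
            last = out[-1] if out else None
            if last and last[0] == chrom and last[2] == left and last[3] == value:
                # Extend the previous record when it is adjacent with the same value.
                out[-1] = (chrom, last[1], right, value)
            else:
                out.append((chrom, left, right, value))
    return out


records = [
    ("chr1", 10485, 10531, 1),
    ("chr1", 10531, 10564, 0),
    ("chr1", 10533, 10534, 3),
    ("chr1", 10534, 10535, 4),
    ("chr1", 10535, 10539, 5),
    ("chr1", 10539, 10541, 6),
    ("chr1", 10541, 10582, 7),
    ("chr1", 10564, 10614, 1),
    ("chr1", 10614, 13488, 0),
]
masked = mask_overlaps(records)
for rec in masked:
    print(*rec, sep="\t")
```

On the example lines from the original post this leaves 10485-10531, 10531-10533, 10582-10614 and 10614-13488 with their own values and collapses 10533-10582 into a single flagged block. Passing `flag=0` zeroes the overlapping stretch instead.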

and from this archive:

Andy Pohl's bwtool lift program available here: https://github.com/andypohl/bwtool/wiki/lift The description shares how the program removes the overlapping data. Also please note that the provided example is lifting a bigWig in the reverse direction from hg19 to hg18.

I also found this page, with scripting solutions.


If the "-multiple" LiftOver option was used, you may want to remove it. You could also experiment with the minMatch option, which could possibly be increased to reduce noise.

Please respond to this list if you require further assistance. It may be helpful to include an example input file and the actual command, with any LiftOver options, so that we can re-create the issue.

Thank you again for your inquiry and for using the UCSC Genome Browser.
Please send new and follow-up questions to one of our UCSC Genome Browser mailing lists below:

  * Post to the Public Help Forum: Email gen...@soe.ucsc.edu or search the Public Archives
  * Post to the Mirror Help Forum: Email genome...@soe.ucsc.edu or search the Mirror Archives
  * Confidential/private help: Email genom...@soe.ucsc.edu

UCSC Genome Browser Announcements List (email alerts for new data & software):
  * Subscribe: Email genome-announce+subscribe@soe.ucsc.edu 
  * Unsubscribe: Email genome-announce+unsubscribe@soe.ucsc.edu

Join us on Social Media! Facebook, Twitter, WordPress Blog, YouTube

Enjoy,
Cath
. . .
Cath Tyner
UCSC Genome Browser, Software QA & User Support
UC Santa Cruz Genomics Institute



Inna Zukher

Mar 2, 2017, 11:07:29 AM
to Cath Tyner, UCSC Genome Browser Public Help Forum
Dear Cath,

Thank you for your response!

I need to run more tests, but I believe the problem was exactly the mix of 0-based and 1-based coordinate systems! With the -positions option I now get a non-overlapping file.
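
For reference, the two conventions differ only in the start coordinate: BED and bedGraph use a 0-based start with the end not included, while the browser "position" format `chrom:start-end` is 1-based and includes both ends. A minimal Python sketch of the conversion, with function names made up for this illustration:

```python
def bed_to_position(chrom, start, end):
    """0-based half-open BED interval -> 1-based fully closed position string."""
    return f"{chrom}:{start + 1}-{end}"


def position_to_bed(position):
    """1-based fully closed 'chrom:start-end' -> 0-based half-open (chrom, start, end)."""
    chrom, span = position.rsplit(":", 1)
    # Browser positions are sometimes written with thousands separators.
    start, end = span.replace(",", "").split("-")
    return chrom, int(start) - 1, int(end)


print(bed_to_position("chr1", 10485, 10531))  # chr1:10486-10531
print(position_to_bed("chr1:10532-10564"))    # ('chr1', 10531, 10564)
```

Both forms describe the same bases, so the length is `end - start` in BED and `end - start + 1` in position format.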

However, I actually get 5 files when I run the command
>liftOver -positions track1_chr1_hg38.bedGraph hg38ToHg19.over.chain track1_hg19_chr1.bedGraph unlifted1_chr1.bed

1) I get a strange track1_hg19_chr1.bedGraph file, looking like this:
chr1:10486-10531
chr1:10532-10564
chr1:10565-10614
chr1:10615-13488
chr1:13489-13538
chr1:13539-13637
chr1:13638-13687
chr1:13688-14360
chr1:14361-14362

2) unlifted1_chr1.bed with unlifted positions
3) liftOver_innas_mbp_17c31_6e80d0.bedmapped file, with my data lifted to hg19
4) liftOver_innas_mbp_17c31_6e80d0.bedunmapped - probably another unlifted file, but its size differs from unlifted1_chr1.bed
5) liftOver_innas_mbp_17c31_6e80a0.bed of the similar size as liftOver_innas_mbp_17c31_6e80d0.bedmapped, and which seems to have same content

I am a bit confused: which file should I use as my lifted file? It looks like file #3 or #5, but I am not sure what the difference between them is. Also, should I discard the other files? Is there any way to assign more specific names to files #3 and #5 (with the actual track names rather than my Mac's details)?


Thank you in advance,
Inna



Cath Tyner

Mar 7, 2017, 6:44:21 PM
to Inna Zukher, UCSC Genome Browser Public Help Forum
Hello again Inna,

Thank you for posting your follow-up questions. To address your first question, where your command generated 5 files: because you are using bedGraph input, you will need to specify your file type in your liftOver command:

From the liftOver usage statement:
-bedPlus=N - File is bed N+ format

liftOver -positions -bedPlus=4 track1_chr1_hg38.bedGraph hg38ToHg19.over.chain.gz lifted.bed unlifted.bed

Using the above command will result in just 2 files, lifted.bed and unlifted.bed.

Your second question, about the "overlapping regions" when running bedGraphToBigWig, was addressed earlier in this email thread. Please see my earlier response in this thread about overlapping regions.


You can always search our public support forum for similar questions. For example, you can search the support forum for:

"overlapping regions" liftOver

https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/%22overlapping$20regions%22$20liftOver

You might also find various methods in other forums such as BioStars. For example:
https://www.biostars.org/p/171443/

Galaxy may also have a tool for this, you can contact their help desk to inquire about options. I think they have a "fetch closest non-overlapping feature for every interval" tool. For example, see pg 21 here:

Note that this isn't necessarily a recommended solution, just an option to research. This tool may or may not make sense from a bioinformatics perspective.

And of course please feel free to respond to this forum if you have further questions!

Thank you again for your inquiry and for using the UCSC Genome Browser.

Enjoy,
Cath

