looking for duplications and liftOver

28 views
Skip to first unread message

Anaïs Gouin

unread,
Jun 26, 2015, 10:42:05 AM6/26/15
to gen...@soe.ucsc.edu
Good morning,

I am comparing two genomes using the UCSC pipeline : Latsz + axtChain + chainNet. I would like to detect the specific duplications in both genomes. Do you have a pipeline or an available script to do that?

I am trying to do it by myself but I meet with several problems. My strategy is (for one way = 1 genome as reference and the other as query):
-scanning the net, I retrieve all regions in query that are used twice (or at least twice if I want to detect duplications of more than 2 copies)
-creation of a bed with these regions
-liftOver to get the corresponding regions in the second genome using the -multiple flag to have all possible correspondances

However, I do not understrand why, but liftOver do not give me all possible regions in the second genome.
Here is an example :

1/ working well
bed :
SFRU_RICE_008595        5379    5468    reg_3 #scaffold/start/end/id

chains (i do not write the information about gap for clarity) containing the scaffold SFRU_RICE_008595:
chain 84620 SFRU_RICE_008595 14293 + 317 1485 scaffold_28689 2593 + 3 1182 1265
chain 327568 SFRU_RICE_008595 14293 + 9740 14236 scaffold_18603 4868 - 0 4326 507
chain 960741 SFRU_RICE_008595 14293 + 0 14236 scaffold_425 146919 - 105838 121010 141
chain 40715 SFRU_RICE_008595 14293 + 2978 3621 scaffold_425 146919 - 108358 108971 1631
chain 37350 SFRU_RICE_008595 14293 + 6207 6750 scaffold_425 146919 - 111664 112184 1674
chain 32066 SFRU_RICE_008595 14293 + 6780 7165 scaffold_425 146919 - 112410 112791 1796
chain 9207 SFRU_RICE_008595 14293 + 5192 5478 scaffold_7072 14142 + 3268 3659 30806

In bold are chains including the region written in bed file.

lift file :
scaffold_425    36065   36154   reg_3   1
scaffold_7072   3557    3649    reg_3   2

Two regions in second genome are given.

2/ not giving expected regions
bed :
SFRU_RICE_008595        317     1484    reg_1

chains : same as above
The first and third chain (italic) include the region written in bed.

lift file :
scaffold_28689  3       1181    reg_1   1

Only the corresponding region found in the first chain is given. Can you explain why there is no region from scaffold_425 given for this bed entry?

Thank you very much in advance for your help.

Anaïs

Matthew Speir

unread,
Jul 1, 2015, 2:03:00 PM7/1/15
to Anaïs Gouin, gen...@soe.ucsc.edu
Hi Anaïs,

Thank you for your questions about identifying duplications in your genomes. Unfortunately, we don't have any programs or pipelines that can assist you with your current method. One of our engineers noted that lastz doesn't attempt to report every duplication. After it has a number of these matches, it will stop searching for new ones. This limit is based on the M parameter, which cannot be larger than 255.

We have a couple of tracks here at UCSC, such as the "Segmental Dups" track, that attempt to identify duplications in a genome. You can look at the track description page for more information: http://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&g=genomicSuperDups, and contact the Eichler lab at the University of Washington to inquire about their methods used to generate the data. I would also recommend searching the literature for what techniques others have used to find genomic duplications.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


Anaïs Gouin

unread,
Jul 2, 2015, 12:11:02 PM7/2/15
to gen...@soe.ucsc.edu
Thank you very much for your answer. Given all the information in the nets, I was thinking that it was possible to do it from it. But you are right, i will search further in the literature to find an other way to detect the specific duplications between my two genomes.
However, I would still like to insist about the liftover point. It is clearly possible that I misunderstood what liftover does exactly. I thought that it will return all regions in the second genome that correspond to the region in first genome, given as input of liftover (thanks to the multiple flag).
But, according to my two examples (previous email), it looks like that it does not act like this. Do you have an explanation to this behaviour?

Thank you very much in advance,

Anaïs


Steve Heitner

unread,
Jul 6, 2015, 12:32:44 PM7/6/15
to Anaïs Gouin, gen...@soe.ucsc.edu

Hello, Anaïs.

Taken from the first line of our liftOver how-to wiki at http://genomewiki.ucsc.edu/index.php/LiftOver_Howto:

>
A liftOver file is a chain file, where for each region in the genome the alignments of the best/longest syntenic regions are used to translate features from one version of a genome to another.”

When you say you thought it would return all regions in the second genome the correspond to the region in the first genome, you are describing the behavior of blat (http://genome.ucsc.edu/cgi-bin/hgBlat).  There is also a command line version of blat, free for non-commercial use, available at http://hgdownload.cse.ucsc.edu/downloads.html#source_downloads.

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
All messages sent to that address are archived on a publicly-accessible Google Groups forum.  If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

--

Anaïs Gouin

unread,
Jul 7, 2015, 11:51:51 AM7/7/15
to gen...@soe.ucsc.edu
Thank you Steve for your answer. Then, what I did not understand is the option -multiple in liftOver program.
Could you give me more details?

Here is what is written for this option in the help page:
-multiple               Allow multiple output regions

I thought this option will give me all possible regions retrieved from the chains (a query could match several regions in the reference as I am not working on reciprocal best chains).

Thanks in advance,

Anaïs


Anaïs Gouin

unread,
Jul 7, 2015, 11:52:00 AM7/7/15
to gen...@soe.ucsc.edu
Actually, I realize that we are not talking about the same thing. I have already done the liftOver process following the instruction of http://genomewiki.ucsc.edu/index.php/LiftOver_Howto:
This allows me to get the final net file.

What i am using now is the liftOver tool that I found the UCSC environment.
liftOver - Move annotations from one assembly to another
usage:
   liftOver oldFile map.chain newFile unMapped
oldFile and newFile are in bed format by default, but can be in GFF and
maybe eventually others with the appropriate flags below.
The map.chain file has the old genome as the target and the new genome
as the query.


This is the tool I am using now. It looks like it is done first to move annotations, but I was thinking it should work for what I want to do.

Anaïs


Reply all
Reply to author
Forward
0 new messages