Optimization of liftOver for region with large duplications (exact and not exact)?

Charles Warden

Nov 19, 2020, 4:38:54 PM

to genome...@soe.ucsc.edu

Hi,

I am trying to transfer annotations from an earlier version of an assembly to an updated assembly (with a mix of large and small changes).

The general discussion can be seen here:

https://www.biostars.org/p/472543/

The specifics of how I am creating a .chain file for liftOver (and then CrossMap) can be seen here:

https://www.biostars.org/p/391080/#465890

Here is the code from that posting:

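# ID1 = source assembly, ID2 = destination assembly (each in its own directory)

# build a .2bit file and a chrom.sizes file for each assembly
cd "$ID1"
faToTwoBit "$ID1.fa" "$ID1.2bit"
twoBitInfo "$ID1.2bit" chrom.sizes
cd ..

cd "$ID2"
faToTwoBit "$ID2.fa" "$ID2.2bit"
twoBitInfo "$ID2.2bit" chrom.sizes
cd ..

# create .chain file: align ID2 to ID1 with BLAT, then chain the alignments
blat "$ID1/$ID1.2bit" "$ID2/$ID2.fa" "${ID1}to${ID2}.psl" -tileSize=12 -minScore=100 -minIdentity=98
axtChain -linearGap=medium -psl "${ID1}to${ID2}.psl" "$ID1/$ID1.2bit" "$ID2/$ID2.2bit" "${ID1}to${ID2}.chain"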

Unlike Exonerate (and the other methods described in the general discussion), I have not tested liftOver against the positive control (the exact starting sequence). However, I did previously test liftOver on a different set of small changes.

That earlier test, with only a few small differences (each under 10 bp), is where I described losing ~20% of the exon blocks (~50 blocks in total in the unmapped file).
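For reference, a loss rate like this can be computed directly from the liftOver output. The sketch below assumes BED input and an unmapped file in which each failed record is preceded by a "#" comment line giving the reason; the function name and file names are hypothetical.

```shell
# Report how many BED records failed to lift over.
# $1 = input BED file, $2 = unmapped file written by liftOver.
# Assumption: lines starting with '#' are reason comments, not records.
unmapped_fraction() {
    # Count non-empty, non-comment lines in each file.
    total=$(awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1")
    failed=$(awk '!/^#/ && NF { n++ } END { print n + 0 }' "$2")
    awk -v f="$failed" -v t="$total" 'BEGIN {
        if (t == 0) { print "no input records"; exit 1 }
        printf "%d of %d records unmapped (%.1f%%)\n", f, t, 100 * f / t
    }'
}
```

Usage would look like `unmapped_fraction exons.bed exons.unmapped.bed`, run after liftOver has written both files.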

While I am currently leaning towards using Exonerate, is there anything that might help liftOver map more of the exons (now that I have revised sequences that I am ready to annotate)?

Thank You,

Charles

Charles Warden

Bioinformatics Specialist

Integrative Genomics Core, City of Hope National Medical Center

Daniel Schmelter

Nov 24, 2020, 7:17:13 PM

to Charles Warden, genome...@soe.ucsc.edu

Hello Charles,

Thank you for using the Genome Browser and for your question about LiftOver optimization.

In general, running BLAT on whole sequences and chaining the output directly is not a reliable way to produce a high-quality chain file. Our chain file creation process is slightly different, and these differences may be important for cases like yours.

Our process generates alignment files with BLAT by first splitting each fasta file into 5kb regions, running BLAT, and doing a clean-up step. For complete genomes, we have partially automated this pipeline and recommend following the "doSameSpeciesLiftOver.pl" wiki guide to perform all the steps involved to make the chain file.

http://genomewiki.ucsc.edu/index.php/DoSameSpeciesLiftOver.pl
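As a rough illustration of the kind of steps described above (split, align, clean up, chain), here is a sketch using kent command-line utilities. It is not the code of doSameSpeciesLiftOver.pl: the choice to split only the new assembly, the net step, the intermediate file names, and the `build_chain` and `run` helpers are all assumptions made for illustration, and the BLAT thresholds are simply copied from the original post. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
# Sketch of a split-align-chain-net workflow; NOT the actual
# doSameSpeciesLiftOver.pl code. Assumes the kent utilities are on PATH.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = old assembly ID (liftOver source), $2 = new assembly ID (destination)
build_chain() {
    old="$1"; new="$2"; out="${old}to${new}"
    # Split the new assembly into 5 kb pieces; record their offsets in a lift file.
    run faSplit size "$new/$new.fa" 5000 "$new/$new.split" -oneFile -lift="$new/$new.lft"
    # Align the pieces to the old assembly.
    run blat "$old/$old.2bit" "$new/$new.split.fa" "$out.split.psl" -tileSize=12 -minScore=100 -minIdentity=98 -noHead
    # Convert query coordinates from pieces back to full-length sequences.
    run liftUp -pslQ "$out.psl" "$new/$new.lft" warn "$out.split.psl"
    # Chain, sort, then net so each old-assembly base keeps only its best chain.
    run axtChain -linearGap=medium -psl "$out.psl" "$old/$old.2bit" "$new/$new.2bit" "$out.raw.chain"
    run chainSort "$out.raw.chain" "$out.sorted.chain"
    run chainNet "$out.sorted.chain" "$old/chrom.sizes" "$new/chrom.sizes" "$out.net" /dev/null
    run netChainSubset "$out.net" "$out.sorted.chain" "$out.over.chain"
}
```

Calling `build_chain "$ID1" "$ID2"` prints the seven commands for review; setting `DRY_RUN=0` first would execute them. For complete genomes, the wiki guide linked above remains the recommended route.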

Depending on your sequence, mismatches may still produce results below the score thresholds. I see you set a few quality thresholds higher than their defaults. If you want more results, you could try relaxing these thresholds: specifically, reduce tileSize to 11, minScore to 30, and minIdentity to 90, and set maxGap to 3. This may be the easiest solution for starters. You can see descriptions of these options by running the "blat" program without any arguments to get the usage message.
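Applied to the BLAT command from the original post, the relaxed thresholds would look roughly like this. The helper name `relaxed_blat_cmd` is hypothetical; it only prints the command so it can be reviewed before running.

```shell
# Print the BLAT command from the original post with the relaxed thresholds.
# $1 = old assembly ID, $2 = new assembly ID (directory layout as in the post).
relaxed_blat_cmd() {
    echo "blat $1/$1.2bit $2/$2.fa ${1}to${2}.psl -tileSize=11 -minScore=30 -minIdentity=90 -maxGap=3"
}
```

For example, `relaxed_blat_cmd "$ID1" "$ID2" | sh` would run it.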

I hope this was helpful. If you have any more questions, please reply-all to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. All messages sent to that address are publicly archived. If your question includes sensitive data, please reply-all to genom...@soe.ucsc.edu.

All the best,

Daniel Schmelter
UCSC Genome Browser

Charles Warden

Nov 25, 2020, 11:56:21 AM

to Daniel Schmelter, genome...@soe.ucsc.edu

Hi Daniel,

Great – thank you very much!

I will take a look at that.

Sincerely,

Charles

Charles Warden

Nov 25, 2020, 3:47:46 PM

to Daniel Schmelter, genome...@soe.ucsc.edu

Hi Daniel,

As an update, I tried changing those parameters but I got worse results (everything is now in the unmapped file from CrossMap).

However, for this project, I think using a .pileup file to keep track of SNPs and indels between the versions of the sequence (from a BWA-MEM alignment) is working OK, and I am using Exonerate to compare the gene annotations to those predictions.

My guess is that the large blocks of identical or closely related sequence are causing a problem, perhaps by preventing 1:1 mappings. However, I will add a link to this suggestion in the Biostars discussion, in case it helps others in slightly different situations.
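One way to probe that guess would be to look for chains that cover the same stretch of the old assembly more than once. The sketch below is only a diagnostic idea, not something tested on this data; it assumes the standard chain header layout (chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id), and the function name is hypothetical.

```shell
# List chains whose old-assembly (target) span overlaps an earlier chain
# on the same sequence. $1 = .chain file.
overlapping_chains() {
    # Keep target name, target start, target end, and chain id from header lines.
    awk '$1 == "chain" { print $3, $6, $7, $13 }' "$1" |
    sort -k1,1 -k2,2n |
    awk '{
        # Same sequence and start before the furthest end seen so far: overlap.
        if ($1 == name && $2 < max_end) print "chain " $4 " overlaps on " $1 " at " $2
        if ($1 != name) { name = $1; max_end = 0 }
        if ($3 > max_end) max_end = $3
    }'
}
```

Running `overlapping_chains "${ID1}to${ID2}.chain"` prints one line per overlapping chain; no output means no target-side overlaps were found.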

Thank You,

Charles
