7 Yeast species liftover


John Adams

Aug 23, 2016, 2:05:06 PM
to gen...@soe.ucsc.edu
Hi,

I am trying to generate all pairwise liftover chain files between 7 yeast genomes available on the yeast genome database (http://www.yeastgenome.org/download-data/sequence).

Since only S. cerevisiae is on the UCSC Genome Browser, I am not sure which settings to use for the lastz run. The settings described here (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto) are for Ciona species.

The commands that I use are given below. Out of 2303 windows of 100 bp each, I can only lift over 144 windows.

Could somebody please point me to the parameters that need to be changed to increase the fraction of the genome that can be lifted over?

Thanks,
John

#######################################################################################################################################
faSplit sequence genome.fa 16 chr

i="chr00"
blastz S288C_reference_sequence_R58-1-1_20080305.fsa par/"$i".fa C=0 O=200 E=20 K=1000 L=1000 H=2000 M=20 > yeast_cer_par_"$i".lav
lavToPsl yeast_cer_par_"$i".lav out_"$i".psl
faToTwoBit par/"$i".fa "$i".2bit
axtChain -linearGap=loose -psl out_"$i".psl S288C_reference_sequence_R58-1-1_20080305.2bit "$i".2bit out."$i".chain
chainNet out."$i".chain S288C_reference_sequence_R58-1-1_20080305.sizes genome.sizes chr."$i".net /dev/null
netChainSubset chr."$i".net out."$i".chain lift."$i".chain

bedtools makewindows -g S288C_reference_sequence_R58-1-1_20080305.sizes -w 100|awk '$1=="I"{print $0,$0}'|sed 's/ /\t/g' > chrI.cer.bed
wc -l chrI.cer.bed
2303 chrI.cer.bed

liftOver -minMatch=0.1 -bedPlus=3 chrI.cer.bed lift."$i".chain conversions.bed unmapped
wc -l conversions.bed
144 conversions.bed
#######################################################################################################################################


Hiram Clawson

Aug 23, 2016, 2:33:08 PM
to John Adams, gen...@soe.ucsc.edu
Good Morning John:

I ran some lastz/chain/net runs on 33 yeast species some time ago (2011).
I simply used 'lastz' with no parameters at all, so defaults all the way through.
(FYI: blastz has long been out of date.)
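For readers who want to try the same idea, here is a minimal sketch of a single pairwise run with lastz left on its defaults, followed by the chaining steps from the original post. It assumes lastz and the kent tools (lavToPsl, faToTwoBit, axtChain) are on PATH; the function name align_pair and the output file names are placeholders, and -linearGap=loose is simply carried over from the original post, not taken from the sacCer3 make doc.

```shell
# Sketch: align one query genome to the target with lastz defaults, then chain.
# Assumes lastz, lavToPsl, faToTwoBit and axtChain are installed.
align_pair() {
    # $1 = target FASTA, $2 = query FASTA, $3 = output prefix (placeholder names)
    for tool in lastz lavToPsl faToTwoBit axtChain; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 127
        }
    done
    # No scoring options are given, so lastz uses its built-in defaults.
    # [multiple] lets lastz accept a target file holding more than one sequence.
    lastz "$1[multiple]" "$2" --format=lav > "$3.lav" || return 1
    lavToPsl "$3.lav" "$3.psl" || return 1
    faToTwoBit "$1" "$3.target.2bit" || return 1
    faToTwoBit "$2" "$3.query.2bit" || return 1
    axtChain -linearGap=loose -psl "$3.psl" "$3.target.2bit" "$3.query.2bit" "$3.chain"
}

# Example call with hypothetical file names:
#   align_pair sacCer3.fa sacPar.fa cer_par
```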
Depending upon the phylogenetic distance between the yeast strains, your
liftOver result could vary dramatically. The measurement I make is how
much of one genome is matched to the other. All alignments I made were
to the target sacCer3/Apr. 2011 (SacCer_Apr2011), the
Saccharomyces cerevisiae S288c assembly from the Saccharomyces Genome Database (GCA_000146055.2).

The similarity from each strain to sacCer3 was:

# AWRI1631: 11651200 bases of 12157105 (95.839%) in intersection
# AWRI796: 11999010 bases of 12157105 (98.700%) in intersection
# CBS7960: 11687063 bases of 12157105 (96.134%) in intersection
# CLIB215: 11432987 bases of 12157105 (94.044%) in intersection
# CLIB324: 10840222 bases of 12157105 (89.168%) in intersection
# CLIB382: 9370716 bases of 12157105 (77.080%) in intersection
# EC1118: 11981729 bases of 12157105 (98.557%) in intersection
# FL100: 11086015 bases of 12157105 (91.190%) in intersection
# FostersB: 11874757 bases of 12157105 (97.678%) in intersection
# FostersO: 11924459 bases of 12157105 (98.086%) in intersection
# JAY291: 12031398 bases of 12157105 (98.966%) in intersection
# LalvinQA23: 12000205 bases of 12157105 (98.709%) in intersection
# M22: 10511165 bases of 12157105 (86.461%) in intersection
# PW5: 11935295 bases of 12157105 (98.175%) in intersection
# RM111a: 11962069 bases of 12157105 (98.396%) in intersection
# Sigma1278b: 12005373 bases of 12157105 (98.752%) in intersection
# T7: 12081072 bases of 12157105 (99.375%) in intersection
# T73: 11606894 bases of 12157105 (95.474%) in intersection
# UC5: 11949157 bases of 12157105 (98.289%) in intersection
# VL3: 12015619 bases of 12157105 (98.836%) in intersection
# Vin13: 12005359 bases of 12157105 (98.752%) in intersection
# W303: 12042586 bases of 12157105 (99.058%) in intersection
# Y10: 10156801 bases of 12157105 (83.546%) in intersection
# YJM269: 11719109 bases of 12157105 (96.397%) in intersection
# YJM789: 11957276 bases of 12157105 (98.356%) in intersection
# YPS163: 10812681 bases of 12157105 (88.941%) in intersection
# sacBay: 10762610 bases of 12157105 (88.529%) in intersection
# sacCas: 7202451 bases of 12157105 (59.245%) in intersection
# sacKlu: 6404405 bases of 12157105 (52.680%) in intersection
# sacKud: 10830333 bases of 12157105 (89.086%) in intersection
# sacMik: 11254494 bases of 12157105 (92.575%) in intersection
# sacPar: 11712000 bases of 12157105 (96.339%) in intersection

You can see the procedure I performed in the document:

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/makeDb/doc/sacCer3.txt

Look for the section:
Multiple alignment (WORKING - 2011-09-01 - Hiram)

Clearly, I'm not WORKING on it any longer ...

--Hiram

John Adams

Aug 24, 2016, 10:47:30 AM
to Hiram Clawson, gen...@soe.ucsc.edu
Good Evening Hiram,

Thank you. That is super helpful. 

I did all your steps up to chaining (stopping at chainMergeSort).

However, I can only liftOver 77% of the genome. Should I do the netting and maffing and then somehow convert it back to a chain file to get the 96% that you see?

Thanks,
John
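On the question of getting from merged chains back to a chain file after netting: a minimal sketch of one common route, assuming the kent tools chainPreNet, chainNet and netChainSubset are on PATH. The function name and all file names are placeholders, and nothing in this thread establishes whether this step changes the 77% figure.

```shell
# Sketch: derive a liftOver chain from an already merged chain file by netting.
# Assumes chainPreNet, chainNet and netChainSubset (kent tools) are installed.
net_to_over_chain() {
    # $1 = merged chain (e.g. chainMergeSort output), $2 = target chrom sizes,
    # $3 = query chrom sizes, $4 = output liftOver chain (placeholder names)
    for tool in chainPreNet chainNet netChainSubset; do
        command -v "$tool" >/dev/null 2>&1 || {
            echo "missing tool: $tool" >&2
            return 127
        }
    done
    # Drop chains that have no chance of being used in the net.
    chainPreNet "$1" "$2" "$3" "$4.pre" || return 1
    # Build the target-side net; the query-side net is discarded.
    chainNet "$4.pre" "$2" "$3" "$4.net" /dev/null || return 1
    # Keep only the chain pieces that the net actually uses.
    netChainSubset "$4.net" "$1" "$4"
}

# Example call with hypothetical file names:
#   net_to_over_chain all.chain sacCer3.sizes sacPar.sizes cerToPar.over.chain
```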

Matthew Speir

Sep 1, 2016, 12:44:10 PM
to John Adams, Hiram Clawson, gen...@soe.ucsc.edu
Hi John,

The statistics listed in the make doc that Hiram mentioned for the yeast 33-way alignment are based on the coverage of the "chainLink" tables, which is a different measure from liftOver results.

Because of this, one of our engineers notes:

liftOver is a different type of result than chainLink coverage. liftOver can fail if parts of the query cannot be mapped. Just because chainLink coverage is 96% does not mean that liftOver will map 96% of all items. chainLink coverage is a single-base-to-single-base measurement.
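A toy illustration of that distinction, as a minimal sketch and not a description of what liftOver does internally: it marks the bases covered by aligned blocks, then reports per-base coverage next to the fraction of 100 bp windows whose covered fraction reaches a threshold. The block coordinates, the function name coverage_vs_items and the single-threshold rule are assumptions made for illustration; the real liftOver can also reject an item for other reasons, such as the item being split or duplicated in the new assembly.

```shell
# Sketch: per-base coverage versus per-item success on made-up data.
coverage_vs_items() {
    # $1 = sequence length, $2 = window size, $3 = minimum covered fraction
    # stdin: aligned blocks, one "start end" pair per line (0-based, half-open)
    awk -v len="$1" -v win="$2" -v min="$3" '
    {
        # Mark every base inside this aligned block.
        for (p = $1; p < $2; p++) covered[p] = 1
    }
    END {
        for (s = 0; s < len; s += win) {
            e = (s + win < len) ? s + win : len
            hit = 0
            for (p = s; p < e; p++) if (p in covered) hit++
            bases += hit                           # single-base measurement
            items++
            if (hit / (e - s) >= min) lifted++     # whole-item measurement
        }
        printf "base coverage: %d of %d (%.1f%%)\n", bases, len, 100 * bases / len
        printf "windows passing: %d of %d (%.1f%%)\n", lifted, items, 100 * lifted / items
    }'
}

# Hypothetical blocks: most windows lose 10 bases to a small gap.
blocks='0 290
300 390
400 490
500 590
600 690
700 790
800 890
900 1000'

printf '%s\n' "$blocks" | coverage_vs_items 1000 100 0.95
# base coverage: 930 of 1000 (93.0%)
# windows passing: 3 of 10 (30.0%)
```

With the threshold lowered to 0.1, all ten toy windows pass, so the gap between the two numbers depends heavily on the threshold and on how the unaligned bases are distributed.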

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group

John Adams

Sep 1, 2016, 4:30:08 PM
to Matthew Speir, Hiram Clawson, gen...@soe.ucsc.edu
Hi Matthew,

Thank you for that clarification. I figured as much.

In the meantime I have been trying out Progressive Cactus and halLiftover. It seems to do better for my purposes. However, I am hesitant to use it, as it's not used by UCSC for between-species liftover. Moreover, I don't have the computational power to run Progressive Cactus for anything larger than yeast.

Do you plan to provide .hal files similar to .chain files in the near future?

Thanks,

John

John Adams

Sep 2, 2016, 10:41:46 AM
to gen...@soe.ucsc.edu
In the end, halLiftover managed to transfer 94%.

Cheers,
John
