Lastz alignment of scaffold assemblies

Colin Clarke

unread,

Dec 10, 2014, 12:38:51 PM12/10/14

to gen...@soe.ucsc.edu

Hi All,

I’m attempting to carry out pairwise alignment of a scaffold assembly (~55,000 sequences) against multiple species genomes (some of these are also scaffold assemblies) using lastZ. I have attempted to recreate the UCSC pipeline from the information on the wiki: (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto) and. I have adapted the RunLastzChain_sh code to run on our cluster, however the number of individual runs required to align this genome to mm10 (split into 20Mb non-overlapping sequences), for example, is prohibitively large (~ 10, million).

The wiki page above contains the following information on how UCSC deal with scaffold genome assemblies: “However, if both genomes are spread across many more scaffolds you have to play around with the sequences; otherwise it will be millions of blastz-runs. At UCSC they first join the smaller scaffolds into one big chromsome and then run blastz on the "ChrUn"-virtual-chromosome. (this will create some false nets going from one scaffold to another”.

I wondering would any of the UCSC folks be able to elaborate on how they carry out whole genome alignments when a genome is spread over a large number of scaffolds? For example, dasNov3 v mm10?

Any advice at all would be greatly appreciated!

Thanks a million,

Colin

Email Disclaimer
"This e-mail and any files transmitted with it are confidential and are intended solely for use by the addressee. Any unauthorised dissemination, distribution or copying of this message and any attachments is strictly prohibited. If you have received this e-mail in error, please notify the sender and delete the message. Any views or opinions presented in this e-mail may solely be the views of the author and cannot be relied upon as being those of Dublin City University. E-mail communications such as this cannot be guaranteed to be virus-free, timely, secure or error-free and Dublin City University does not accept liability for any such matters or their consequences. Please consider the environment before printing this e-mail."
Séanadh Ríomhphoist
"Tá an ríomhphost seo agus aon chomhad a sheoltar leis faoi rún agus is lena úsáid ag an seolaí agus sin amháin é. Tá cosc iomlán ar scaipeadh, dháileadh nó chóipeáil neamhúdaraithe ar an teachtaireacht seo agus ar aon cheangaltán atá ag dul leis. Má tá an ríomhphost seo faighte agat trí dhearmad cuir sin in iúl le do thoil don seoltóir agus scrios an teachtaireacht. D’fhéadfadh sé gurb iad tuairimí an údair agus sin amháin atá in aon tuairimí no dearcthaí atá curtha i láthair sa ríomhphost seo agus níor chóir glacadh leo mar thuairimí nó dhearcthaí Ollscoil Chathair Bhaile Átha Cliath. Ní ghlactar leis go bhfuil cumarsáid ríomhphoist den sórt seo saor ó víreas, in am, slán, nó saor ó earráid agus ní ghlacann Ollscoil Chathair Bhaile Átha Cliath le dliteanas in aon chás den sórt sin ná as aon iarmhairt a d’eascródh astu. Cuimhnigh ar an timpeallacht le do thoil sula gcuireann tú an ríomhphost seo i gcló."

Hiram Clawson

unread,

Dec 10, 2014, 12:56:06 PM12/10/14

to Colin Clarke, gen...@soe.ucsc.edu

Good Morning Colin:

The trick to this is a convenient set of 'partitions' or groupings of
the contigs. For example, the hg19 dasNov3 alignment used the 'DEF'
file of:
# human vs armadillo

# TARGET: Human Hg19
SEQ1_DIR=/scratch/data/hg19/hg19.2bit
SEQ1_LEN=/scratch/data/hg19/chrom.sizes
SEQ1_CHUNK=10000000
SEQ1_LAP=10000

# QUERY: armadillo DasNov3
SEQ2_DIR=/hive/data/genomes/dasNov3/dasNov3.2bit
SEQ2_LEN=/hive/data/genomes/dasNov3/chrom.sizes
SEQ2_CHUNK=20000000
SEQ2_LAP=0
SEQ2_LIMIT=300

BASE=/hive/data/genomes/hg19/bed/lastzDasNov3.2013-07-07
TMPDIR=/scratch/tmp

Note the 'CHUNK' sizes and LIMIT of 300 means no more than 300 sequences
in a single 'CHUNK'

This leads to partitioning statements of:

/cluster/bin/scripts/partitionSequence.pl 10000000 10000 /scratch/data/hg19/hg19.2bit
/scratch/data/hg19/chrom.sizes -xdir xdir.sh -rawDir ../psl 30 -lstDir tParts > hg19.lst
set L1 = `wc -l < hg19.lst`
/cluster/bin/scripts/partitionSequence.pl 20000000 0 /hive/data/genomes/dasNov3/dasNov3.2bit
/hive/data/genomes/dasNov3/chrom.sizes 300 -lstDir qParts > dasNov3.lst

The resulting two .lst files have line counts of:
wc -l *.lst
323 dasNov3.lst
328 hg19.lst

Which results in 323 * 328 cluster jobs == 105,944 jobs.

I typically tune batches here to have about 100,000 jobs since that is most efficient
for our cluster. Newer versions of our pipeline here do this type of partitioning
and then package the chunks into individual .2bit files, for example, hg38 vs chlSab2,
the partitioning script is:

partitionSequence.pl 20000000 10000 /hive/data/genomes/hg38/hg38.contigs.2bit
/hive/data/genomes/hg38/hg38.contigs.chrom.sizes -xdir xdir.sh -rawDir ../psl 30 -lstDir tParts
> hg38.lst
export L1=`wc -l < hg38.lst`
partitionSequence.pl 20000000 0 /hive/data/genomes/chlSab2/chlSab2.2bit
/hive/data/genomes/chlSab2/chrom.sizes 100 -lstDir qParts > chlSab2.lst
export L2=`wc -l < chlSab2.lst`
export L=`echo $L1 $L2 | awk '{print $1*$2}'`
echo "cluster batch jobList size: $L = $L1 * $L2"
if [ -d tParts ]; then
echo 'constructing tParts/*.2bit files'
ls tParts/*.lst | sed -e 's#tParts/##; s#.lst##;' | while read tPart
do
sed -e 's#.*.2bit:##;' tParts/$tPart.lst \
| twoBitToFa -seqList=stdin /hive/data/genomes/hg38/hg38.contigs.2bit stdout \
| faToTwoBit stdin tParts/$tPart.2bit
done
fi
if [ -d qParts ]; then
echo 'constructing qParts/*.2bit files'
ls qParts/*.lst | sed -e 's#qParts/##; s#.lst##;' | while read qPart
do
sed -e 's#.*.2bit:##;' qParts/$qPart.lst \
| twoBitToFa -seqList=stdin /hive/data/genomes/chlSab2/chlSab2.2bit stdout \
| faToTwoBit stdin qParts/$qPart.2bit
done
fi

And the blastz-run-ucsc script has been updated to correctly pass the resulting
2bit files to lastz:

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob;f=src/hg/utils/automation/blastz-run-ucsc

This also requires newer versions of lastz which have been fixed to handle
this procedure correctly, versions lastz-distrib-1.03.52 or lastz-distrib-1.03.54
can correctly work with query and target sequences all in 2bit files.

--Hiram

Colin Clarke

unread,

Dec 10, 2014, 2:53:43 PM12/10/14

to Hiram Clawson, gen...@soe.ucsc.edu

Hi Hiram,

Thanks so much for such a comprehensive and helpful answer!

Best wishes,

Colin

Colin Clarke

unread,

Feb 20, 2015, 1:54:01 PM2/20/15

to Hiram Clawson, gen...@soe.ucsc.edu

Hi Hiram,

I hope all is well.

I have been able to partition the data as described in your earlier email. I can now split genome and run using lastz and modified the runLastz.sh script to run on our cluster.

https://github.com/ge11232002/CSC/blob/master/WholeGenomeAlignment/scripts/ucsc/RunLastzChain_sh.txt

My target genome is in ~55,000 scaffolds so I lump those sequences into 20,000,000 segments for lastz. I then run the program with the [multiple] modifier and output an axt alignment.

so - I've altered the liftUp line of code to convert the axt ot psl and then lift (I've also tried to pass the axt file directly to liftup )

axtToPsl alignent.lav target.chrom.sizes query.chrom.sizes stdout | liftUp -type=.axt stdout target.lift error stdin \ | liftUp -nohead -axtQ -type=.axt stdout query.lift error stdin \ | gzip -c > psl/alignment.psl.gz

and I get the following error from the liftUp for the query sequence

chr1 isn't in liftSpec file line 6 of stdin

I've checked and chr1 is definitely in the query.lift

I've repeated the process for a single scaffold and outputted to lav and can do the liftUp without error.

I'm pretty stumped and would appreciate any advice!

Thanks,

Colin

Jonathan Casper

unread,

Feb 25, 2015, 7:28:11 PM2/25/15

to Colin Clarke, Hiram Clawson, gen...@soe.ucsc.edu

Hello Colin,

I'm sorry to hear that you're still having script issues. We are confused about this problem as well; are you able to provide us with some example files that cause the problem? You can send them to me privately to avoid sharing them with the mailing list if you prefer.

If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.