chainSort vs. chainMergeSort

David Garfield

Aug 27, 2014, 11:50:55 AM

to gen...@soe.ucsc.edu

Good morning!

Living on the genomewiki are three excellent pages regarding whole genome alignments:

http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto

http://genomewiki.ucsc.edu/index.php/Minimal_Steps_For_LiftOver

http://genomewiki.ucsc.edu/index.php/LiftOver_Howto

Together, they form a nice blueprint for how to create your own set of MAFs and liftOver chain files.

However, there is a subtle difference between the first link and the second two links, and I am wondering whether it is significant.

On the first page (as on http://genomewiki.ucsc.edu/images/9/93/RunLastzChain_sh.txt) the chains are processed (and presumably sorted?) together by first converting each .psl file into a .chain file and then passing all of these chain files to chainMergeSort. The resulting all.chain file can then be further processed towards MAFs or liftOver chains.

On the second two pages (most relevantly the third), the results of chainMergeSort are passed to chainSplit. The resulting split chains are then concatenated with 'cat' and sorted with chainSort (despite the warning that chainSort is not suitable for large sets).
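To illustrate why that warning matters, here is a minimal Python sketch. It is not the code of the UCSC tools. It assumes each input is a list of chain header lines already sorted by descending score, and the helper names (`chain_score`, `merge_sorted_chain_headers`) and sample headers are made up for the example. It shows a streaming merge of pre-sorted inputs, which is the general idea behind sorting small files separately and then merging them, rather than loading one large concatenated file into memory to sort it.

```python
import heapq


def chain_score(header_line):
    """Return the score field of a chain header line.

    Header layout: chain score tName tSize tStrand tStart tEnd
                   qName qSize qStrand qStart qEnd id
    """
    return int(header_line.split()[1])


def merge_sorted_chain_headers(*sorted_inputs):
    """Lazily merge inputs that are each sorted by descending score.

    heapq.merge holds only one pending item per input at a time,
    so memory use does not grow with the total number of chains.
    """
    return heapq.merge(*sorted_inputs, key=chain_score, reverse=True)


# Hypothetical headers standing in for two small, already sorted files.
file_a = [
    "chain 900 chr1 1000 + 0 500 chrA 1200 + 10 510 1",
    "chain 300 chr1 1000 + 600 800 chrA 1200 + 700 900 2",
]
file_b = [
    "chain 700 chr2 2000 + 0 400 chrB 2200 + 0 400 1",
    "chain 100 chr2 2000 + 500 600 chrB 2200 + 500 600 2",
]

merged = list(merge_sorted_chain_headers(file_a, file_b))
for header in merged:
    print(header)
```

Note that the merged headers still carry the IDs from their original files, so the same ID can appear more than once after a plain merge of this kind.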

Looking at my output, the first approach seems to work just fine. Nonetheless, I'd like to check before making final use of my newly constructed alignment files. Perhaps the difference arises because the second two links are aimed at liftOver between versions of the same genome rather than at alignments between species?

Many thanks,

-- David

-------------------------------------------------------------------------------------

David Garfield, PhD

Furlong Group

European Molecular Biology Laboratory (EMBL)

Telephone +49 6221 387 8426

Fax +49 6221 387 166

E-mail david.g...@embl.de

Snail Meyerhofstraße 1

D-69012 Heidelberg

Germany

Luvina Guruvadoo

Sep 3, 2014, 12:53:58 PM

to David Garfield, gen...@soe.ucsc.edu

Hello David,

Thank you for your question. One of our engineers commented on the difference between these two alignment procedures:

The 'whole genome alignment' page describes a lastz alignment procedure. The 'liftOver' same-species pages describe a blat alignment procedure. Both produce chain files. The difference in chain sorting between the documents is due to the different sizes of the genomes involved: small genomes can use one type of splitting, while large genomes need a different program. The result is the same. chainSplit is used merely to renumber the chain IDs so that all the chains have a consistent set of IDs when they are combined. Each individual chain result has its own set of chain IDs, which would conflict if all were simply put together in one file with chainMergeSort.
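The ID conflict described above can be illustrated with a short Python sketch. This is only an illustration and does not reproduce how chainSplit, chainSort, or chainMergeSort work internally. The function names (`renumber_chain_ids`, `chain_ids`) and the sample chains are hypothetical. The one detail taken from the chain format is that the ID is the last (13th) field of each header line.

```python
def chain_ids(lines):
    """Collect the ID field (13th field) of every chain header line."""
    return [line.split()[12] for line in lines if line.startswith("chain ")]


def renumber_chain_ids(lines, start=1):
    """Return a copy of the lines with header IDs rewritten as start, start+1, ...

    Alignment block lines and blank separator lines pass through unchanged.
    """
    next_id = start
    out = []
    for line in lines:
        fields = line.split()
        if fields and fields[0] == "chain":
            fields[12] = str(next_id)  # overwrite the per-file ID
            next_id += 1
            out.append(" ".join(fields))
        else:
            out.append(line)
    return out


# Two hypothetical per-job results; each numbers its chains from 1.
result_a = ["chain 900 chr1 1000 + 0 500 chrA 1200 + 10 510 1", "500", ""]
result_b = ["chain 700 chr2 2000 + 0 400 chrB 2200 + 0 400 1", "400", ""]

combined = result_a + result_b
print(chain_ids(combined))                      # IDs collide after concatenation
print(chain_ids(renumber_chain_ids(combined)))  # IDs are unique after renumbering
```

Whichever tool performs the renumbering in a real pipeline, the check is the same: every chain in the final combined file should have a distinct ID.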

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group