chainSort vs. chainMergeSort


David Garfield

Aug 27, 2014, 11:50:55 AM
to gen...@soe.ucsc.edu
Good morning!

Living on the genomewiki are three excellent pages regarding whole genome alignments:


Together, they form a nice blueprint for how to create your own set of MAFs and liftOver chain files. 
However, there is a subtle difference between the first link and the second two links. I am wondering whether it is significant.

On the first page (as on http://genomewiki.ucsc.edu/images/9/93/RunLastzChain_sh.txt) the chains are processed (and presumably sorted?) together by first converting each .psl file into a .chain file and then passing all of these chain files to chainMergeSort. The resulting all.chain file can then be further processed towards MAFs or liftOver chains. 

In the second two pages (the third most relevantly), the results of chainMergeSort are passed to chainSplit. The resulting split chains are then merged by 'cat' and then sorted (despite the warning that chainSort is not suitable for large sets).
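For readers comparing the two routes, here is a minimal Python sketch of the underlying idea: merging inputs that are already sorted by score versus concatenating everything and sorting it in one pass. It is only an illustration of the concept, not the kent tools' implementation; the function names (`read_chains`, `merge_sorted_chains`, `cat_then_sort`) are hypothetical, and it assumes the standard chain header layout in which the second field is the score.

```python
import heapq


def read_chains(text):
    """Split chain-format text into (score, block_text) records."""
    records = []
    current = []
    for line in text.splitlines():
        if line.startswith("chain "):
            if current:
                records.append(current)
            current = [line]
        elif line.strip() and current:
            current.append(line)
    if current:
        records.append(current)
    # Field 1 of the header line is the chain score.
    return [(float(rec[0].split()[1]), "\n".join(rec)) for rec in records]


def merge_sorted_chains(texts):
    """Merge inputs that are each already sorted by descending score.

    Only one pending record per input is compared at a time, which is
    why a merge scales to many inputs.
    """
    streams = [read_chains(t) for t in texts]
    merged = heapq.merge(*streams, key=lambda rec: -rec[0])
    return [block for _, block in merged]


def cat_then_sort(texts):
    """Concatenate all inputs, then sort every record at once."""
    records = [rec for t in texts for rec in read_chains(t)]
    records.sort(key=lambda rec: -rec[0])
    return [block for _, block in records]
```

When each input is already in descending score order, both functions return the chains in the same order; the difference is only in how much has to be held and compared at once.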

Looking at my output, the first approach seems to work just fine. Nonetheless, I'd like to check before making final use of my newly constructed alignment files. Perhaps the difference is because these second two links are aimed at genome versions rather than alignments between species?

Many thanks,

-- David



-------------------------------------------------------------------------------------
David Garfield, PhD
Furlong Group
European Molecular Biology Laboratory (EMBL)

Telephone    +49 6221 387 8426
Fax                 +49 6221 387 166
Snail Meyerhofstraße 1
D-69012 Heidelberg
Germany





Luvina Guruvadoo

Sep 3, 2014, 12:53:58 PM
to David Garfield, gen...@soe.ucsc.edu
Hello David,

Thank you for your question. One of our engineers commented on the difference between these two alignment procedures:

The 'whole genome alignment' is a lastz alignment procedure. The 'lift over' same-species alignment is a blat alignment procedure. They both make chain files. The difference in chain sorting seen in the documents is due to the different sizes of the genomes involved. Small genomes can use one type of splitting; large genomes need a different program. The result is the same. The chainSplit step is merely used to renumber the chain IDs, to get a consistent set of IDs for all the chains when they are combined. The individual chain results each had their own set of chain IDs, which would conflict if all were simply put together in one file with chainMergeSort.
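To illustrate the ID conflict described above, here is a minimal Python sketch that combines several chain-format texts and gives every chain a fresh, unique ID. It is a conceptual example only, not what the kent utilities do internally; the function name `renumber_chain_ids` and the sample data are hypothetical, and it assumes the standard 13-field chain header line, in which the last field is the chain ID.

```python
def renumber_chain_ids(chain_texts, start_id=1):
    """Concatenate chain-format texts, assigning each chain a unique ID."""
    next_id = start_id
    out_lines = []
    for text in chain_texts:
        for line in text.splitlines():
            fields = line.split()
            if fields and fields[0] == "chain":
                # Header layout: chain score tName tSize tStrand tStart tEnd
                #                qName qSize qStrand qStart qEnd id
                if len(fields) != 13:
                    raise ValueError("unexpected chain header: " + line)
                fields[12] = str(next_id)
                next_id += 1
                line = " ".join(fields)
            out_lines.append(line)
    return "\n".join(out_lines) + "\n"


# Two separately produced results that both start numbering at 1.
part_a = "chain 1000 chr1 5000 + 0 100 chrA 4000 + 0 100 1\n100\n\n"
part_b = "chain 900 chr2 6000 + 10 60 chrB 3000 - 5 55 1\n50\n\n"

combined = renumber_chain_ids([part_a, part_b])
```

In `combined`, the first chain keeps ID 1 and the second becomes ID 2, so the two no longer collide.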

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group


