Whole genome alignment howto Question.

403 views
Skip to first unread message

XiaoJu Zhang

unread,
Jul 13, 2015, 6:53:14 PM7/13/15
to gen...@soe.ucsc.edu

Dear colleagues,

I am following the “Whole_genome_alignment_howto” wiki page (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto) to learn how to generate whole genome alignment. I hope you can offer some explanation helping me figure out some steps that I got confused and stuck.

At step 2, in the last command 
hgLoadChain ci2 chainCioSav2 all.pre.chain

Is this step necessary for creating alignment, or is it just a command line for visualizing the track?

In this step and a following step, ci2cioSav2 are needed as input. Where can I get these database? I found ci2’s database link, but I am not sure which file is the database this command refer to? http://hgdownload.soe.ucsc.edu/goldenPath/ci2/database

Plus, where can I find the track file chainCioSav2?

Thanks a lot.

Steve Heitner

unread,
Jul 15, 2015, 4:09:41 PM7/15/15
to XiaoJu Zhang, gen...@soe.ucsc.edu

Hello, XiaoJu.

The hgLoadChain command is used to visualize the alignment and it is necessary insofar as wanting to actually view the alignment once you have created it.

Regarding the ci2 and cioSav2 databases, tables and files, some of these exist only on our development server and some of them don’t even exist there.  The wiki page was constructed merely as an example of the procedures to follow and assumes that you have your own mirror installed that includes these databases.  I assume your interest is not specifically in the ci2 or cioSav2 assemblies, but in following the procedure.  Also note the “Example, step1: Alignments with Blastz” section which links to a script, http://genomewiki.ucsc.edu/images/9/93/RunLastzChain_sh.txt, which will run independent of database requirements.

Please contact us again at gen...@soe.ucsc.edu if you have any further questions. 
All messages sent to that address are archived on a publicly-accessible Google Groups forum.  If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

---
Steve Heitner
UCSC Genome Bioinformatics Group

--

XiaoJu Zhang

unread,
Jul 16, 2015, 12:09:23 PM7/16/15
to st...@soe.ucsc.edu, gen...@soe.ucsc.edu
Hi Steve,

Thank you for your explanation and suggestion, and the shell script is definitely a saver. However, I am still eager to know what in general database is, a SQL file? Since some of the kenUtils require tDb qDb. Where can user download them or how do user generate them? Could you please show me a link to an example?

Thank you very much.

Ju

Luvina Guruvadoo

unread,
Jul 17, 2015, 12:54:24 PM7/17/15
to XiaoJu Zhang, gen...@soe.ucsc.edu
Hello Ju,

As my colleague Steve mentioned, the wiki page assumes that you have your own mirror installed that includes these databases. Please see this wiki for more information on how to install your own mirror: http://genomewiki.ucsc.edu/index.php/Minimal_Browser_Installation.

If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

- - -
Luvina Guruvadoo
UCSC Genome Bioinformatics Group


--


XiaoJu Zhang

unread,
Jul 17, 2015, 7:18:34 PM7/17/15
to Luvina Guruvadoo, gen...@soe.ucsc.edu

Thank you Luvina and UCSC Genome group, 

Your hint helped me to connect the dots. Now I understand why we need install UCSC genome browser for aligning sequences. I actually did install the genome browser mirror with a script before. To share my learning experience to the others who might also be interested in creating alignment, I will keep corresponding this email.

To install ucsc genome browser, there is the script, which is also mentioned in UCSC’s official installation instruction website: 
download from Github https://github.com/maximilianh/browserInstall/blob/master/browserInstall.sh
It makes installation much easier, and it also offers command line to download and import database you are referring. The details instruction can be found in the repo's README.

Back to the alignment generation. 
In the Whole genome alignment HowTo step by step tutorial (http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto), after chaining steps, there are multiple more steps: Netting and Maffing, and Phastcons. However, in the scripting method (http://genomewiki.ucsc.edu/images/9/93/RunLastzChain_sh.txt) it seems end at the chaining step, resulted with a all.chain.gz 
My questions are:

  1. Chains and nets introduction page (http://genomewiki.ucsc.edu/index.php/Chains_Nets) explains these two terms. "Net is hierarchical collection of chains". If I want to go to multiz step to generate multiple alignment, do I have to do netting, or I can start from all.chain file? The LiftOver files download from UCSC are chain files but not netted, right?
  2. the Whole genome alignment HowTo tutorial actually only shows how to generate pairwise alignment, and then generate .maf format of pairwise alignment, is that right?
  3. With results from pairwise alignment (Liftover), do you have any suggestion for me to learn how to create multiple alignment?

Thank you in advance for your help.

Have a great weekend.

Ju 

XiaoJu Zhang

unread,
Jul 23, 2015, 6:19:52 PM7/23/15
to gen...@soe.ucsc.edu
Dear UCSC genome group,

I have a question about the multiple alignment HOWTO tutorial. In the step 5 Phastcons.
$ phyloFit -i maf/002.maf
it complains "ERROR: --tree required"
How do I define the tree? In the phastCons help, I saw example of 
--tree "(human, (mouse, rat))", 
however for two species alignment, how do I define the tree?

Please note that in order to get available database, instead of use ci2 and cs2, I decided to choose ce10 and cb3 as my test set. 

Thanks,

Ju

Matthew Speir

unread,
Jul 24, 2015, 12:13:28 PM7/24/15
to XiaoJu Zhang, gen...@soe.ucsc.edu
Hi Ju,

Thank you for your questions about the process of creating a Multiz multiple alignment.

Chains and nets introduction page (http://genomewiki.ucsc.edu/index.php/Chains_Nets) explains these two terms. "Net is hierarchical collection of chains". If I want to go to multiz step to generate multiple alignment, do I have to do netting, or I can start from all.chain file? The LiftOver files download from UCSC are chain files but not netted, right?
Yes, you will need to generate net files, and the type of net files will depend on the quality of the assemblies you are using. You can see some examples of the different types of net files in the "alignment type" column here:
http://genomewiki.ucsc.edu/index.php/Hg19_100way_Genome_size_statistics

the Whole genome alignment HowTo tutorial actually only shows how to generate pairwise alignment, and then generate .maf format of pairwise alignment, is that right?
Yes, if you want more than just two species in your alignment you will need to use something like Multiz to create your N-way multiple alignment.

With results from pairwise alignment (Liftover), do you have any suggestion for me to learn how to create multiple alignment?
Our "make docs" contain all of the steps we used when generating a N-way multiple alignment and the associated phastCons and phyloP data. You can the see the steps used to create the 20-way multiple alignment on hg38 here: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/makeDb/doc/hg38/multiz20way.txt;hb=HEAD.

You can use the steps described in the make doc as a guide for creating your own multiple alignments.
however for two species alignment, how do I define the tree?
As noted in my response to your last question, you can find this information in the make doc. You can use the "tree_doctor" program included in the PHAST package, http://compgen.cshl.edu/phast/, to extract the tree for a subset of species from a larger tree.

For example, we have a tree for a 10-way multiple alignment on ce9 here: http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=blob_plain;f=src/hg/utils/phyloTrees/ce9.10way.nh

This file has the contents:

((((((ce9:0.003,((cb3:0.005,caeRem3:0.003):0.004,caePb2:0.013):0.002):0.001,
  caeJap3:0.023):0.04,haeCon1:0.06):0.06,priPac1:0.12):0.06,
  (melHap1:0.14,melInc1:0.15):0.03):0.06,bruMal1:0.24);


Using 'tree_doctor' from the PHAST package:

$ tree_doctor --prune-all-but ce9,cb3 ce9.10way.nh

The resulting tree is:

(ce9:0.003000,cb3:0.011000);

You can then substitute ce10 for ce9.

We have other phyloTrees that include other species available here:
http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=tree;f=src/hg/utils/phyloTrees

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu. All messages sent to that address are archived on a publicly-accessible Google Groups forum. If your question includes sensitive data, you may send it instead to genom...@soe.ucsc.edu.

Matthew Speir
UCSC Genome Bioinformatics Group
--


XiaoJu Zhang

unread,
Jul 28, 2015, 8:04:06 PM7/28/15
to Matthew Speir, gen...@soe.ucsc.edu
Thank you Matthew and UCSC genome group. 

The explanation and the instruction are really clear. However, I still have some additional questions came along with my alignment generation tests.

I have managed to follow the steps in the online tutorial “Whole_genome_alignment_howto” wiki page and get the pairwise alignment .maf file created, and I also managed to modify the script (http://genomewiki.ucsc.edu/images/9/93/RunLastzChain_sh) to my local machine, and get the final ce10.cb3.all.chain.gz file. With this chain.gz file, I then follow the preNetting, Netting and Maffing methods from the wikipage, assuming that the decompressed "ce10.cb3.all.chain.gz" from the script method is equivalent to "all.chain". 

However, when I check the final .maf files of two approaches, some obvious discrepancies are observed. The major one is that the .maf file generated from the wikipage has both '-' and '+' strand in the alignment, but .maf file from the script method only has '+' strand, no '-' strand is found for the query sequences. 

I am wondering at what step I mess up? 

More general questions related to Chaining step in HowTo wikipage: I understand that split the FASTA file of target and query file is necessary to parallel the lastz computation. Does this split, on the other hand, affect the result of the alignment? In other word, does lastz favor more fragmented sequences for the alignment computing? 

In the script method, the "partitionSequence.pl" script seems do the smiliar job as "faSplit", and for target sequences, "partitionSequence.pl" uses different option than query sequences, could you please explain why?

Thank you very much.

Best regards.


Ju

XiaoJu Zhang

unread,
Jul 28, 2015, 8:04:11 PM7/28/15
to Matthew Speir, gen...@soe.ucsc.edu
Dear UCSC genome group,

I just did more digging and found the answer for my first question. lastz's options are differently used in two methods, for the script, the option B=0, and this searches only forward strand. 

Now I am curious why the script uses B=0 for all lastzNear lastzMedium and lastzFar. Why is '-' strand  not allowed?


Thanks,

Ju

Jonathan Casper

unread,
Aug 12, 2015, 1:55:23 PM8/12/15
to XiaoJu Zhang, Matthew Speir, gen...@soe.ucsc.edu

Hello Ju,

Our engineers reviewed the commands in runLastzChain.sh and it looks like the B=0 is a mistake. Thank you for bringing it to our attention! In a different script, blastz-run-ucsc, we use B=0 to handle each strand separately for lineage specific repeat snipping. We hardly ever do that anymore though, and the same reasoning does not apply to runLastzChain.sh because it does not include a separate run on the reverse strand. Again, thank you for pointing this out. The versions of this script provided on our wiki and in the kent tree have been updated to remove the B=0 setting.

One of our engineers also notes that the other supplied lastz parameters are all subject to decisions based on the pairs involved. There is no rote formula that determines lastz parameters; there are a lot of tuning possibilities. The inclusion of B=0 was definitely a mistake on our part, but you will probably also need to adjust the other parameters for your particular alignment problem.

I hope this is helpful. If you have any further questions, please reply to gen...@soe.ucsc.edu or genome...@soe.ucsc.edu. Questions sent to those addresses will be archived in publicly-accessible forums for the benefit of other users. If your question contains sensitive data, you may send it instead to genom...@soe.ucsc.edu.

--
Jonathan Casper
UCSC Genome Bioinformatics Group


--


Reply all
Reply to author
Forward
0 new messages