[Genome] How to generate pairwise alignment data


Shin Sasaki

Oct 4, 2005, 10:31:05 PM
Hi,

I downloaded human/chimp and mouse/rat pairwise alignment data
(axtNet/chr*.axt.gz, *.chain.gz, *.net.gz, etc.).
I think these files are generated from blastz alignment data
with some programs written by Jim, for example, axtChain, netClass, etc.

So please tell me the detailed procedure to generate the data
that I downloaded, from .lav format alignments that blastz outputs.

Thank you.

--
Shin Sasaki
University of Tokyo


Rachel Harte

Oct 5, 2005, 2:17:35 PM
Hi Shin,

Yes, it is correct that these files are generated from the blastz
alignments using those programs. Here are the steps:

1) lavToAxt is used to convert the blastz lav file to an axt file
lavToAxt in.lav tNibDir qNibDir out.axt

where the tNibDir is a directory containing the chrom sequence files in
nib format for the target and qNibDir contains those for the query.
Alternatively, a 2bit format file can be specified for the sequences
instead of these directories. If you are looking at chimp alignments on the
human browser, then chimp is the query and human is the target.

2) Next axtChain is run to create the chains.
axtChain in.axt tNibDir qNibDir out.chain
Sometimes the -minScore option is used to filter out low scoring
alignments and -scoreScheme is used if the scoring matrix used for blastz
is not the default.
tNibDir and qNibDir are the same as for step 1.
Chains are then sorted using chainMergeSort:

chainMergeSort file(s)

so a list of chain files is given as input to the program. If you see
a file such as mm6.rn3.all.chain.gz in the downloads then this *.all.chain
file was created in this way.

3) To create the nets from the chains,
i) the chainPreNet program is used:
chainPreNet in.chain target.sizes query.sizes out.chain

in.chain can be the all.chain produced by the previous step.
target.sizes and query.sizes are files with a list of the chromosomes and
their sizes separated by tabs
e.g.
chr1 195109612
chr2 181764313

etc.
This can be obtained by downloading the chromInfo table for the relevant
assembly through our Downloads site:
http://hgdownload.cse.ucsc.edu/downloads.html

Pick the relevant species and assembly and follow the link to Annotation
Database where you will find the dumps from our database tables.

ii) Then add synteny information to the net (the net itself is made from
the chainPreNet output by the chainNet program):
netSyntenic in.net out.net

iii) Add classification information using the database tables:

netClass in.net tDb qDb out.net

If lineage-specific repeats were used for the Blastz alignment then
the -tNewR and -qNewR options must be used.

-tNewR=dir - Dir of chrN.out.spec files, with RepeatMasker .out format
lines describing lineage specific repeats in target
-qNewR=dir - Dir of chrN.out.spec files for query

The out.net file is the gzipped net file that you see in our Downloads
area.

4) The axtNet files are created by converting the net to axt format.
This is done using the netToAxt program:

netToAxt in.net in.chain tNibDir qNibDir out.axt

see step 1 for definition of tNibDir and qNibDir.
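
Steps 1 to 4 can be strung together as in the minimal Python sketch below. It only builds the argument lists and runs nothing. The helper names, the intermediate file names and the 5000 threshold are hypothetical. It assumes the `-minScore=N` spelling of the option, that chainMergeSort writes the merged chains to standard output, and that chromInfo rows are tab-separated with the name and size in the first two columns. The steps above do not spell out the chainNet command, so its argument order here is an assumption to check against the tool's own usage message.

```python
def sizes_from_chrom_info(rows):
    """Turn chromInfo-style rows into 'name<TAB>size' lines.

    Assumes tab-separated rows with name and size in the first two
    columns; any further columns are ignored.
    """
    out = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) >= 2:
            out.append(f"{fields[0]}\t{int(fields[1])}\n")
    return "".join(out)


def build_pipeline(chroms, t_seq, q_seq, t_sizes, q_sizes, t_db, q_db,
                   min_score=None, t_new_r=None, q_new_r=None):
    """Return the commands for steps 1-4 as argument lists (nothing is run)."""
    cmds = []
    chains = []
    for c in chroms:
        # Step 1: lav -> axt
        cmds.append(["lavToAxt", f"{c}.lav", t_seq, q_seq, f"{c}.axt"])
        # Step 2: axt -> chain, optionally dropping low-scoring chains
        opts = [f"-minScore={min_score}"] if min_score is not None else []
        cmds.append(["axtChain", *opts, f"{c}.axt", t_seq, q_seq, f"{c}.chain"])
        chains.append(f"{c}.chain")
    # Assumed to write to stdout; redirect the output to all.chain when running.
    cmds.append(["chainMergeSort", *chains])

    # Step 3: chains -> net (chainNet argument order is an assumption)
    cmds.append(["chainPreNet", "all.chain", t_sizes, q_sizes, "pre.chain"])
    cmds.append(["chainNet", "pre.chain", t_sizes, q_sizes,
                 "target.net", "query.net"])
    cmds.append(["netSyntenic", "target.net", "syn.net"])
    new_r = []
    if t_new_r:
        new_r.append(f"-tNewR={t_new_r}")
    if q_new_r:
        new_r.append(f"-qNewR={q_new_r}")
    cmds.append(["netClass", *new_r, "syn.net", t_db, q_db, "out.net"])

    # Step 4: net -> axt
    cmds.append(["netToAxt", "out.net", "pre.chain", t_seq, q_seq, "out.axt"])
    return cmds


# The third column of this row is a made-up placeholder.
print(sizes_from_chrom_info(["chr1\t195109612\tchr1.nib\n"]), end="")
for cmd in build_pipeline(["chr1", "chr2"], "tNibDir", "qNibDir",
                          "target.sizes", "query.sizes", "tDb", "qDb",
                          min_score=5000):
    print(" ".join(cmd))
```

Each list can be handed to `subprocess.run()` once the placeholder paths are replaced with real ones.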

I hope that this helps. Please let me know if you need further details or
have any more questions.

Rachel

--
UCSC Genome Bioinformatics Group
http://genome.ucsc.edu




Shin Sasaki

Oct 7, 2005, 6:47:13 AM
Hi Rachel,

I am grateful for your quick and kind answer.
It is very helpful information for me.

I still have two questions.
First, is it right that netSyntenic and netClass programs
add synteny or classification information to net format data
and do not change the 'sequence' (or alignment position) information?

Second, I think 'raw' blastz output data has some false-positive
alignments because of unmasked repeats or paralogous regions.
The steps you described don't include explicit filtering steps,
for example, axtBest and axtRecipBest.
Are these filtering steps not needed?
Or would axtChain or chainPreNet remove such spurious alignments?

Thanks,

Shin Sasaki

Rachel Harte

Oct 11, 2005, 3:48:43 PM
Shin,
It is correct that netSyntenic and netClass only add synteny and
classification information and do not change the sequence alignments.

Regarding filtering, if there are a lot of low-scoring chains, e.g. those
with score < 5000, then we often filter these out using the -minScore option
for axtChain. Chains can also be filtered after they are made using the
chainFilter program. chainPreNet removes chains that do not have a
chance of being netted. Then chainNet makes the alignment nets from the
chains, using the highest-scoring chains in the top level. Gaps are
filled in with other chains at level 2, and then gaps in the
level 2 chains can be filled in with chains at level 3, etc. In the net,
chains are trimmed to fit into the sections that are not covered by a
higher-scoring chain. We also use netFilter with the -minGap option set to
12 before loading the net into the database. This restricts the nets to
those with a gap size >= 12 bp.
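
To make the score cutoff concrete, here is a minimal Python sketch that drops chains scoring below a threshold. It illustrates the idea only and is not how axtChain or chainFilter are implemented. It assumes the usual chain header layout, in which the score is the second field of each line beginning with `chain`. The function name and the toy chains are hypothetical.

```python
def filter_chains(lines, min_score=5000):
    """Yield the lines of every chain whose header score is >= min_score.

    Lines before the first chain header (such as comments) are dropped.
    """
    keep = False
    for line in lines:
        if line.startswith("chain "):
            # Header: chain score tName tSize tStrand tStart tEnd
            #         qName qSize qStrand qStart qEnd id
            keep = float(line.split()[1]) >= min_score
        if keep:
            yield line


# Two toy chains: the first scores 9000, the second 1200.
chain_text = """\
chain 9000 chr1 1000 + 0 100 chr2 900 + 10 110 1
40 5 5
55

chain 1200 chr1 1000 + 200 250 chr3 800 - 0 50 2
50

"""

kept = list(filter_chains(chain_text.splitlines(keepends=True)))
print("".join(kept), end="")
```

With the default cutoff of 5000, only the first toy chain is printed.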

Other sources of false positives in the alignments do include
regions that are not repeat masked, as you say in your e-mail. There is an
option for blastz that has been used for some of the genome comparisons
where a genome is not fully repeat masked. For instance, Tetraodon does
not have a species-specific repeat library for use with RepeatMasker. In
this case an option was used with blastz that masks a base if that base is
hit N times, where N is 50 by default.
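
The masking rule can be pictured with the toy Python sketch below. It is not blastz's implementation, which applies the rule while aligning. Here the hits are given up front as half-open intervals, "hit N times" is read as "covered by at least N intervals", and masking is shown by lowercasing. All three are assumptions made for illustration, and the function name is hypothetical.

```python
def mask_frequent_hits(sequence, hits, max_hits=50):
    """Lowercase every base covered by at least max_hits intervals.

    hits is an iterable of half-open (start, end) intervals on sequence.
    """
    # Difference array: +1 where an interval starts, -1 where it ends.
    delta = [0] * (len(sequence) + 1)
    for start, end in hits:
        delta[start] += 1
        delta[end] -= 1

    masked = []
    depth = 0  # number of intervals covering the current base
    for base, change in zip(sequence, delta):
        depth += change
        masked.append(base.lower() if depth >= max_hits else base)
    return "".join(masked)


seq = "ACGTACGTAC"
hits = [(2, 6), (4, 8), (4, 6)]
# Threshold lowered to 3 so the toy data triggers masking.
print(mask_frequent_hits(seq, hits, max_hits=3))  # prints ACGTacGTAC
```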

I hope that this answers your question.

Rachel