[Genome] Generate conservation info for interspecies genome comparisons

487 views
Skip to first unread message

Bremen Braun

unread,
Dec 3, 2009, 12:31:48 PM12/3/09
to gen...@soe.ucsc.edu
Hello,

I have sequences of similar chromosomes for different species that I want to
compare. I would like to be able to generate a conservation track such as
the one seen here:
http://tinyurl.com/yz6o6fu

I looked at an example of steps to be taken. Could you please verify/clarify
for me? Let's assume I want to compare 2 chromosomes.
1. Mask both chromosomes
2. Align both chromosomes with blastz
3. Chain using axtChain (QUESTION: I see this program takes two directories
as arguments. If I wish only to compare two chromosomes, would these
directories have only one file each?)
4. Get size of each chromosome for chain filtering using faSize
5. Sort and filter chains
a. use chainMergeSort using chain from step 3
b. prenet with chainPreNet using chain from 5.a., size of target
chromosome gotten from step 4, and size of query chromosome from step 4 as
arguments
6. Netting?
7. Convert to .maf and use phyloFit
8. Run phastCons to get .wig output which can be uploaded as a conservation
track

Since I am only comparing two chromosomes at a time, I think only a single
chain is generated which doesn't have to be netted. Am I missing anything?

Thanks,
Bremen

Angie Hinrichs

unread,
Dec 8, 2009, 12:57:29 AM12/8/09
to Bremen Braun, gen...@soe.ucsc.edu
Hi Bremen,

You have the basic flow correct. Some details added below:

> 2. Align both chromosomes with blastz

We now use lastz, an improved version of blastz
(http://www.bx.psu.edu/miller_lab/, search for lastz) but blastz is
sufficient. The output format of blastz, LAV, must be converted to
either PSL or AXT so that axtChain can read the alignments (we have
lavToAxt and lavToPsl programs; PSL is more compact). lastz can
produce axt directly with the --output=axt option.


> 3. Chain using axtChain (QUESTION: I see this program takes two directories
> as arguments. If I wish only to compare two chromosomes, would these
> directories have only one file each?)

Yes, that would work. However, you can also use the -faQ and -faT
options, and give fasta files instead of directories. To see a
description of all axtChain options, run "axtChain" with no arguments.
Here is an example usage of -faQ and -faT:

axtChain -faQ -faT in.axt oneChrom.fa otherChrom.fa chroms.chain

Another alternative is to convert your fasta files into the compact
format 2bit like this:

faToTwoBit oneChrom.fa oneChrom.2bit

-- then you can give axtChain [and lastz but not blastz] the .2bit file
instead of the directory, and don't have to pass -faQ / -faT.


> 4. Get size of each chromosome for chain filtering using faSize

Yes. Note that the sequence name and size must appear in a
tab-separated file, so use the -detailed flag like this:

faSize -detailed oneChrom.fa > oneChrom.sizes


> 5. Sort and filter chains
> a. use chainMergeSort using chain from step 3
> b. prenet with chainPreNet using chain from 5.a., size of target
> chromosome gotten from step 4, and size of query chromosome from
> step 4 as arguments
> 6. Netting?

Yes. We pipe the output of chainPreNet to chainNet (and pipe that to
netSyntenic) like this:

chainPreNet chroms.chain oneChrom.sizes otherChrom.sizes stdout \
| chainNet stdin -minSpace=1 oneChrom.sizes otherChrom.sizes stdout /dev/null \
| netSyntenic stdin chroms.net


> 7. Convert to .maf and use phyloFit

Yes. For historical reasons we use netToAxt | axtToMaf, we don't have
a netToMaf. You can see an example of how phyloFit was run for the
D. melanogaster Conservation track in our source tree
(kent/src/hg/makeDb/doc/dm2.txt, search for "PHASTCONS 15WAY" and then
search for phyloFit).

If you have only two species, there is not much phylogenetic
information for phyloFit, but I suppose it could still make a
substitution rate model. If you have more than two species, you will
also need multiz from http://www.bx.psu.edu/miller_lab/ .


> 8. Run phastCons to get .wig output which can be uploaded as a conservation
> track

Yes. Adam Siepel also has a new method of scoring conservation,
phyloP, which we offer in addition to phastCons scores on newer
Conservation tracks. I have no idea whether one would be more
appropriate than the other when working with only two species.


> Since I am only comparing two chromosomes at a time, I think only a single
> chain is generated which doesn't have to be netted. Am I missing anything?

Netting is still necessary, in order to get single-coverage alignments
on which conservation scores are computed. axtChain usually produces
many chains that cover the same position on the reference
genome/chromosome. The netting process selects the highest-scoring
chain as the top level, and then fills in gaps (unaligned areas) using
the next highest-scoring chain, and then fills in that chain's gaps
using the next highest-scoring chain and so on. It is kind of like
extracting a global alignment from a sea of chained local alignments.


Max Haeussler contributed a very useful genomewiki page about
reconstructing our alignment & conservation pipeline (perhaps you have
seen it already?):

http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto

Hope that helps, and please send more questions to gen...@soe.ucsc.edu
as you have them,

Angie

----- "Bremen Braun" <bremen...@gmail.com> wrote:
> _______________________________________________
> Genome maillist - Gen...@lists.soe.ucsc.edu
> https://lists.soe.ucsc.edu/mailman/listinfo/genome

Bremen Braun

unread,
Dec 8, 2009, 2:51:37 PM12/8/09
to Angie Hinrichs, gen...@soe.ucsc.edu
Angie,
Thanks for the reply. I now have some questions regarding tool behavior.

1. Following Max's guide, I started by masking both fasta files to be
compared with trfBig using Tandem Repeats Finder v4.04. However, upon
running I get messages from stdout such as the following:

Freeing Memory...
Resolving output...
Done.maybeSystem: system(cd .; trf ./chr.tf 2 7 7 80 10 50 2000 -m ) failed
(exit status 1): Unknown error: 0

and

Freeing Memory...
Resolving output...
Done.maybeSystem: system(cd .; trf ./chr2.tf 2 7 7 80 10 50 2000 -m ) failed
(exit status 1): Cannot allocate memory

I am running OS X v10.6.2 on a quad core with 6 GB of RAM and using the
newest trfBig as far as I know.

2. chainPreNet generates the following error:

Got 1 chroms in directory/chr1.sizes, 1 in directory/chr2.sizes
hashMustFindVal: 'chr2' not found
Finishing nets
writing stdout
writing /dev/null
Couldn't open /proc/self/stat , No such file or directory
hashMustFindVal: 'chr2' not found
Got 1 chroms in directory/chr1.sizes, 1 in directory/chr2.sizes
Finishing nets
writing stdout
writing /dev/null
Couldn't open /proc/self/stat , No such file or directory

when running line:

chainMergeSort *.chain > all.chain
chainPreNet all.chain chr1.sizes chr2.sizes stdout \
| chainNet stdin -minSpace=1 chr1.sizes chr2.sizes stdout /dev/null \
| netSyntenic stdin out.net

All the tools I am using were compiled from the newest version of jksrc. Any
ideas what the problems are?

Bremen

Angie Hinrichs

unread,
Dec 9, 2009, 2:39:25 PM12/9/09
to Bremen Braun, gen...@soe.ucsc.edu
Hi Bremen,

Regarding the error messages from bigTrf/trf -- what do you see if you run the trf command (copied from the maybeSystem(...)) in your shell? I'm hoping that there will be a more informative error message from trf itself. If trf is failing to run on your sequence, then the trf command line and sequence would be a good test case for the TRF developers.

To help determine why we're not finding the expected sequence name in the hash of chrom names to sizes, can you send the following?:
* contents of your chr1.sizes and chr2.sizes files
* header lines of the two fasta files passed to lastz/blastz
* the first line that begins with "chain " in all.chain

Angie

----- "Bremen Braun" <bremen...@gmail.com> wrote:
> From: "Bremen Braun" <bremen...@gmail.com>
> To: "Angie Hinrichs" <an...@soe.ucsc.edu>
> Cc: gen...@soe.ucsc.edu
> Sent: Tuesday, December 8, 2009 11:51:37 AM GMT -08:00 US/Canada Pacific
> Subject: Re: [Genome] Generate conservation info for interspecies genome comparisons
>
> Angie,
> Thanks for the reply. I now have some questions regarding tool behavior.
>
> 1. Following Max's guide, I started by masking both fasta files to be compared with trfBig using Tandem Repeats Finder v4.04. However, upon running I get messages from stdout such as the following:
>
> Freeing Memory...
> Resolving output...
> Done.maybeSystem: system(cd .; trf ./ chr.tf 2 7 7 80 10 50 2000 -m ) failed (exit status 1): Unknown error: 0
>
> and
>
> Freeing Memory...
> Resolving output...
> Done.maybeSystem: system(cd .; trf ./ chr2.tf 2 7 7 80 10 50 2000 -m ) failed (exit status 1): Cannot allocate memory
> ( http://www.bx.psu.edu/miller_lab/ , search for lastz) but blastz is

Bremen Braun

unread,
Dec 15, 2009, 3:33:34 PM12/15/09
to Angie Hinrichs, gen...@soe.ucsc.edu
Angie,

1) Unfortunately, running trf prints no error message, but issuing "echo $?"
reveals it is indeed exiting with error code 1 which is interpreted by
trfBig with strerror. My guess is this is a problem running the precompiled
binary with Snow Leopard so I tried it with 32 bit Linux. This time trfBig
returns "Done.maybeSystem: ... failed (exit status 1): Success", so I guess
that means it's working properly.

2) As for using -faQ and -faT flags for axtChain (axtChain -faQ -faT -psl
-linearGap=loose $i ${args[0]} ${args[1]} $(basename $i .psl).chain), I get
a segfault so I opted to use directories for now.

3) The "hashMustFindVal" from earlier was because I had switched some
arguments.

Thanks for your help and hopefully I have it working now

Bremen

On Wed, Dec 9, 2009 at 1:39 PM, Angie Hinrichs <an...@soe.ucsc.edu> wrote:

> Hi Bremen,
>
> > Done.maybeSystem: system(cd .; trf ./chr.tf 2 7 7 80 10 50 2000 -m )
> failed (exit status 1): Unknown error: 0
> >
> > and
> >
> > Freeing Memory...
> > Resolving output...
> > Done.maybeSystem: system(cd .; trf ./chr2.tf 2 7 7 80 10 50 2000 -m )
>> > (http://www.bx.psu.edu/miller_lab/, search for lastz) but blastz is
Reply all
Reply to author
Forward
0 new messages