Question on simulated MAF details

39 views
Skip to first unread message

aarond...@ucdavis.edu

unread,
Jan 5, 2012, 12:37:06 AM1/5/12
to align...@googlegroups.com

Hi again,
As part of the work leading up to progressiveMauve:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011147
a variety of alignment accuracy metrics were implemented, some of which we
didn't describe in that paper to keep the presentation focused.
It would be fantastic if I could contribute these to the alignathon effort.
To do this I've been looking through the true alignment MAF files to
understand exactly what information is captured and some questions have
arisen. I've pasted one of the MAF blocks from the correct primate
alignment below (hopefully not too badly garbled by email). In it, there
are two entries for Orangutan, encoding what appears to be a 4 base pair
CACA duplication.

What I'm wondering is whether there is any easy-to-parse information
available on which copy of a duplicate region is the ancestral one?
For this particular example of short exact tandem duplicates it's not so
interesting, but in situations of long-range segmental duplication it's a
bit more important to know. Some of the scoring metrics that I would like
to contribute can resolve whether the aligner has preferentially chosen to
align the ancestral copy, but only if that information is preserved in the
simulation and available in the alignment.

a score=0.000000 tree="
(((simChimp.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2
+chr20.1+chr21.1.1.1.1:0.1,simHuman.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2
+chr22+chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.1:0.1)sHuman-
sChimp.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2+chr2
0.1+chr21.1.1.1.1:0.1,simGorilla.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+ch
r22+chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.1:0.1)sG-sH-
sC.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2+chr20.1+
chr21.1.1.1.1:0.1,simOrang.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2:0.1,simO
rang.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2:0.1)ancestor.chr20.1+chr21.1.2
+chr20.1+chr21.1.1.1.2;"
s
simChimp.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2+ch
r20.1+chr21.1.1.1.1 14318 122 + 53121445 GCATACTCCATTATATAT----
ACAATTTGTTATTATATATCAAGACAATTAGTTATTTTAGCACCATAACCTACACACATATAGTGTATATGGATC
AAAACCAGTGGGGTGAGAACCTGAATCAA
s
simHuman.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2+ch
r20.1+chr21.1.1.1.1 13921 126 + 53106993
GCATACTCCATTATATATCAAGACAATTTGTTATTATATATCAAGACAATTAGTTATTTTAGCACCATAACCTAC
ACACATATAGTGTATATGGATCAATACCAGTGGGGTGAGAACCTGAATCAA
s sHuman-
sChimp.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2+chr2
0.1+chr21.1.1.1.1 13983 126 + 53053033
GCATACTCCATTATATATCAAGACAATTTGTTATTATATATCAAGACAATTAGTTATTTTAGCACCATAACCTAC
ACACATATAGTGTATATGGATCAATACCAGTGGGGTGAGAACCTGAATCAA
s
simGorilla.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2+
chr20.1+chr21.1.1.1.1 13915 126 + 53120926
CCATACTCCATTATATATCAAGACAATTTGTTATTATATATCAAGACAATTAGTTATTTTAGCACCATAACCTAC
ACACATATAGTGTATATGGATCAATACCAGTGGGGTGAGAACCTGAGTCAA
s sG-sH-
sC.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2+chr22+chr20.1+chr21.1.2+chr20.1+
chr21.1.1.1.1 14062 126 + 53031706
GCATACTCCATTATATATCAAGACAATTTGTTATTATATATCAAGACAATTAGTTATTTTAGCACCATAACCTAC
ACACATATAGTGTATATGGATCAATACCAGTGGGGTGAGAACCTGAATCAA
s simOrang.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2
13829 80 + 15493520
ACATACTCCATTATATATCAAGACAATTTGTTATTATATATCAAGACAATTTGTTATTTTAGCACCATAACCTAC
GCACA----------------------------------------------
s simOrang.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2
13909 50 + 15493520 ------------------------------------------------------
----------------------CACATATAGTGTATATGGATCAATACCAGTGGGGTGAGAACCTGAATCGA
s ancestor.chr20.1+chr21.1.2+chr20.1+chr21.1.1.1.2
13997 126 + 15428290
GCATACTCCATTATATATCAAGACAATTTGTTATTATATATCAAGACAATTTGTTATTTTAGCACCATAACCTAC
ACACATATAGTGTATATGGATCAATACCAGTGGGGTGAGAACCTGAATCAA

Dent Earl

unread,
Jan 5, 2012, 4:43:57 PM1/5/12
to align...@googlegroups.com
Hi Aaron,

It would be great if you added your evaluations!

The information as to which copy of a duplication is the ancestral copy is implicitly stored in the simulation but not explicitly. And I think in most circumstances it would be a challenge to ascertain this. The simulation proceeds in a step-wise fashion starting at the root and moving a certain step-length down the phylogenetic tree. Every single step results in a new genome with new sequences and an alignment to the previous genome. I believe one could start at a particular genome-to-parent alignment and then go back though the previous genomes to try to locate the origin of the duplication but I don't think it would be easy. I'll double check this with Robert Edgar who is the author of EVOLVER, perhaps I'm overlooking an obvious solution.

d

PS your email alerted me to the fact that the chromosome labels on the posted mafs were old and incorrect (fixed), thanks!

aarond...@ucdavis.edu

unread,
Jan 10, 2012, 6:52:38 AM1/10/12
to Dent Earl, align...@googlegroups.com

>
> The information as to which copy of a duplication is the ancestral copy
is
> implicitly stored in the simulation but not explicitly. And I think in
most
> circumstances it would be a challenge to ascertain this. The simulation
> proceeds in a step-wise fashion starting at the root and moving a certain
> step-length down the phylogenetic tree. Every single step results in a
new
> genome with new sequences and an alignment to the previous genome. I
> believe one could start at a particular genome-to-parent alignment and
> then go back though the previous genomes to try to locate the origin of
> the duplication but I don't think it would be easy. I'll double check
this
> with Robert Edgar who is the author of EVOLVER, perhaps I'm overlooking
an
> obvious solution.
>


Did Edgar ever get back to you on this topic? I looked through the evolver
software to determine whether this might be possible. Unfortunately it's a
bit of a closed source black box in what is otherwise a nice open project
you've got going here. I did notice that Edgar's transalign program, which
I think your code may use to build MAFs, has an option called -noparalogy.
I wonder if this were added around lines 1850 and 1880 of your
libSimControl.py if it would produce MAFs without the paralogs?? Given how
little documentation there seems to be for transalign it's difficult to
know exactly what effect this would have. Would you be willing to try it?

Dent Earl

unread,
Jan 10, 2012, 5:24:06 PM1/10/12
to aarond...@ucdavis.edu, align...@googlegroups.com
Hey Aaron,

I heard back from Robert and he confirmed my suspicion that there is no explicit tracking of this information. Here's what he said:

"""
Tandem dupes arbitrarily choose the first segment in
chromosome order as ancestral at the time of the event; this assignment then
propagates using transalign.
"""

So that covers tandem dupes, and the next bit here should cover duplicated regions. You're right that we use transalign to produce pairwise alignments (in maf format) between genomes in the simulation (nice sleuthing btw, this is indeed a key area). These are then joined using mafJoin (part of mafTools) to form the total alignments for the simulations (the mafs found in truths/*). The transalign feature -noparalogs will cause any alignments that have the "paralgous bit" (see quote below) set to be omitted from the output. I have an email from George Asimenos (part of the Evolver dev team) explaining the paralogous bit:

"""
* When evolver replicates a region A into A2 (the original copy being named A1 in the evolved genome), it obviously produces two alignments; A -> A1 and A -> A2. I believe that the alignment A -> A2 has a special bit set called "Paralog", so that one can distinguish the 'copied' DNA from the original.

* When, in a subsequent evolver step, (A1,A2) are evolved into (B1,B2) without any further duplications, the alignments output are A1 -> B1 and A2 -> B2 and do not have the paralog bit set.

* Transalign does not anything special for paralogs; for this example, it will regularly produce A -> B1 and A -> B2. Moreover, it will set the paralog bit in the output for any alignment produced by an alignment with the paralog bit set in the input. Therefore, if regions A1 and A2 had been replicated to (A1_orig, A1_copy) and (A2_orig, A2_copy), the alignment A1 -> A1_orig will not have the paralog bit set (as neither one of the A -> A1 and A1 -> A1_orig have that bit) but all the other ones (i.e. A -> A1_copy, A -> A2_orig, A -> A2_copy) will have that bit set.

* Transalign has an option (-noparalogy) to skip paralogous alignments.
"""

So using -noparalogy causes us to lose information from the alignments, which is why we're not using it here; it will not report alignments that have the paralogous bit set. If you wanted to, though, I think you're right about where in the code to make the modification.

I think this means we're still stuck and with the non-solution of walking through many alignments trying to pin-point duplication origins.

d

aarond...@ucdavis.edu

unread,
Jan 11, 2012, 2:25:30 PM1/11/12
to Dent Earl, align...@googlegroups.com, align...@googlegroups.com

Hi Dent, thanks for the response, but I'm not sure I understand the
implications...

>
> """
> Tandem dupes arbitrarily choose the first segment in
> chromosome order as ancestral at the time of the event; this assignment
> then
> propagates using transalign.
> """

So does this mean that evolver is not setting the 'paralogy bit' for sites
resulting from tandem duplication?

>
> * Transalign has an option (-noparalogy) to skip paralogous alignments.
> """
>
> So using -noparalogy causes us to lose information from the alignments,
> which is why we're not using it here; it will not report alignments that
> have the paralogous bit set. If you wanted to, though, I think you're
> right about where in the code to make the modification.

The motivation for keeping the paralog alignments is clear, without them
there would be no way to know whether aligners were able to capture these
regions.

But I'm concerned that there is a whole class of genome aligners that were
explicitly designed to avoid aligning paralogs in their output. Unless
there is some way to know whether a portion of the true alignment contains
a paralog, when measuring each aligner's sensitivity/recall it will be
difficult or impossible to know whether those aligners have low
sensitivity/recall simply because they are not including paralogs --
something which they were never designed to include.

The concern is that this creates a bias against a whole class of aligners.
I happen to be the author of one of these, progressiveMauve, but there are
many others including Mugsy, M-GCAT, at least some versions of TBA, and
probably others I have missed. Although the Genome10k project may not care
much about aligners that avoid paralogs, if you're going to benchmark
people's software it would be nice to do so in an unbiased manner. At the
moment I cannot see how that is possible in the Alignathon, but I'm trying
to help find a way.

Apart from the potential bias issue, knowing the evolutionary relationships
of aligned nucleotides enables various other interesting quantities to be
measured.

So along those lines, what happens when the simulation is run with -
noparalogy? If a parallel set of true alignments can be generated with the
only difference being that paralogous sites are not included then perhaps
scoring modules can optionally use those alignments? Is there any way you
could please try generating these yourself? I saw the recipes page and
there weren't really any instructions on what to do with the .xml files,
and I've also looked at evolverSimControl and there are enough dependencies
to install that it would take me a long time to get it up & running, and
would be unfortunate to put in all that work just to learn there's some
fundamental issue with transalign's -noparalogy flag that means the
approach won't work out...

Any chance you can help? Am I misunderstanding something about -noparalogy
that implies it won't help achieve the desired result?


Dent Earl

unread,
Jan 11, 2012, 4:52:34 PM1/11/12
to aarond...@ucdavis.edu, align...@googlegroups.com
Evolver does set the paralogy bit for tandem dupes, but it's only set for the alignment between the ancestor and the second segment. For example, in a single step of evolution between two genomes (termed parent and child) a region in parent, A, is tandem-ly duplicated to form A.1 and A.2 in child. A.1 comes before A.2 along the chromosome. Since A.1 comes first, evolver arbitrarily picks A.1 as being ancestral and A.2 as being derived. So the alignment A->A.1 does not have the paralogy bit set but the alignment A->A.2 does have the paralogy bit set.

Transalign, the program, is just a utility and doesn't have a roll in the stochastic parts of the simulation. It's job within the simulation is to maintain an alignment between each cycle (aka step) and the cycle at the previous phylogenetic branch point.

It would be possible to recompute the true mafs from a starting point of -noparalogy pairwise alignments, but since the decision on what constitutes the ancestral copy is arbitrary from the point of view of the simulation it may end up making things worse for programs. As it stands now if an aligner aligns a tandem dupe to an orthologous position in another species then ducky for it, and if it aligns only one copy of the dupe to the orthologog then that's still correct. But if we use -noparalogy and an aligner picks the (arbitrarily) "wrong" copy to align to the ortholog, then it will be marked as incorrect.

That caveat stated, it isn't terribly hard for me to generate the true mafs sans paralogous blocks. I'll get this started and should have it done in a week or so. Once we have alignments with and without paralogous blocks we can compare and contrast. It would be interesting to see if aligners that try to avoid paralogous areas perform differently when assessed with the different true mafs.

d

PS I added some more copy to the recipes page to try to make the contents more clear. The xml files are kicked out when the simulation starts and contain general parameters.

aarond...@ucdavis.edu

unread,
Jan 11, 2012, 11:42:36 PM1/11/12
to Dent Earl, align...@googlegroups.com, align...@googlegroups.com

I see the concern regarding tandem duplications. It should be possible to
write a scoring algorithm that reads both correct alignments -- with and
without paralogy, and so be able to give credit for aligning any copy of a
duplicate while also not penalizing for lack of paralog alignment. Would
that create a happy medium?


Dent Earl

unread,
Jan 12, 2012, 12:49:10 PM1/12/12
to aarond...@ucdavis.edu, align...@googlegroups.com
That sounds alright, though I'm not exactly sure what that scoring algorithm would look like... Ideas are welcome. Creating a version of the true alignments that remove the paralogous bit blocks should facilitate this.

d

Reply all
Reply to author
Forward
0 new messages