Influence of sites with limited signal

JK

Mar 14, 2013, 11:32:48 AM

to ra...@googlegroups.com

I was wondering recently what is the exact influence of sites that carry a limited amount of phylogenetic signal on the inference in RAxML? By 'sites with limited signal' I mean the following:

1. Sites with gaps
2. Sites with identical residues in all seqs
3. Sites with identical residues in all seqs but one (singletons)

I remember that case 1 was discussed earlier at some point, but what about cases 2 and 3? Are they simply omitted from the analysis because they are not informative? To test it out on my side I used a single source alignment, but with the abovementioned sites removed. Interestingly, I always got identical amino acid rate exchangeabilities and frequencies and nearly identical topologies and very close average bootstraps, with a roughly 30-50% speedup when I removed all three kinds of uninformative sites.
So, it is possible to gain quite a good speedup by removing them, at virtually no cost in support or topology accuracy.
Naturally, this requires some more extensive testing to determine the exact amount of potential speedup, the effect on topology (using e.g. RF distances) and branch supports, but I just wanted to discuss this early on, to know if going this way in my work makes any sense at all? Is the information stored in these sites useful for RAxML in some way?
I am particularly worried about the effect of singletons, because they are not informative sites, but could probably globally inflate bootstrap supports (is 1 OTU vs all-the-rest a valid split or not?) and thus make poor branches look better than they actually are.
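
To make the three categories above concrete, here is a minimal sketch of how the trimming could be done before handing the alignment to RAxML. It is not the script used for the runs described in this post; the function names (`classify_column`, `trim_alignment`) and the choice of `-` and `?` as gap characters are assumptions made for illustration.

```python
from collections import Counter

# Assumption: these characters mark gaps or missing data in the alignment.
GAP_CHARS = set("-?")


def classify_column(column):
    """Label one alignment column as 'gap', 'constant', 'singleton' or 'other'."""
    if any(c in GAP_CHARS for c in column):
        return "gap"
    counts = Counter(column)
    if len(counts) == 1:
        return "constant"
    # Exactly two states, one of which occurs in a single sequence only.
    if len(counts) == 2 and min(counts.values()) == 1:
        return "singleton"
    return "other"


def trim_alignment(seqs, drop=("gap", "constant", "singleton")):
    """Remove every column whose label is listed in `drop`.

    seqs: dict mapping sequence name -> aligned sequence (all equal length).
    Returns the trimmed dict and the list of kept column indices.
    """
    names = list(seqs)
    length = len(seqs[names[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("all sequences must have the same length")
    keep = [
        i for i in range(length)
        if classify_column([seqs[n][i].upper() for n in names]) not in drop
    ]
    trimmed = {n: "".join(seqs[n][i] for i in keep) for n in names}
    return trimmed, keep
```

Passing a shorter `drop` tuple, for example `drop=("constant",)`, gives the variants with only one kind of site removed, and the returned index list records how many columns each variant lost.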

-Jacek

JK

Mar 14, 2013, 12:10:41 PM

to ra...@googlegroups.com

EDIT: Of course, the exchangeability rates were supposed to be the same since the same model was used each time.

JK

Mar 14, 2013, 12:49:56 PM

to ra...@googlegroups.com

EDIT2: Of course, amino acid freqs were supposed to be the same since there was no +F option and they were taken from the model (?). Running with +F changed little of the outcome in terms of topology, supports and running time, even though the equilibrium freqs were clearly different for different alignments. Sorry for the mess in the original post.

-J

Alexandros Stamatakis

Mar 15, 2013, 1:07:02 PM

to ra...@googlegroups.com

Hi Jacek,

> I was wondering recently what is the exact influence of sites that
> carry a limited amount of phylogenetic signal on the inference in
> RAxML? By 'sites with limited signal' I mean the following:
>
> 1. Sites with gaps
> 2. Sites with identical residues in all seqs
> 3. Sites with identical residues in all seqs but one (singletons)
>
> I remember that case 1 was discussed earlier at some point, but what
> about cases 2 and 3? Are they simply omitted from the analysis because
> they are not informative?

Informativeness is a concept from parsimony! They do contribute signal,
but in general very little and are hence not removed from the alignment.
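
A small worked case may help show why constant sites still carry some signal under likelihood. It assumes the Jukes-Cantor (JC69) nucleotide model and only two sequences, which is a simplification chosen for illustration and not the protein models used in this thread:

```latex
% JC69, two sequences separated by d expected substitutions per site
P(\text{same}) = \tfrac{1}{4} + \tfrac{3}{4}\, e^{-4d/3}, \qquad
P(\text{different}) = \tfrac{3}{4}\left(1 - e^{-4d/3}\right)

% ML distance from the observed proportion p of differing sites
\hat{d} = -\tfrac{3}{4} \ln\!\left(1 - \tfrac{4}{3}\, p\right)
```

Deleting constant columns raises p and therefore the estimate of d, so in this toy case such sites inform the branch length even though they cannot favor one topology over another.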

> To test it out on my side I used a single source alignment, but with
> the abovementioned sites removed. Interestingly, I always got
> identical amino acid rate exchangeabilities and frequencies and nearly
> identical topologies and very close average bootstraps, with a roughly
> 30-50% speedup when I removed all three kinds of uninformative sites.

Makes sense and is kind of expected, but the question is if we can
develop objective criteria for removing sites rather than ad hoc
criteria.

> So, it is possible to gain quite a good speedup by removing them, at
> virtually no cost in support or topology accuracy.
> Naturally, this requires some more extensive testing to determine the
> exact amount of potential speedup, the effect on topology (using e.g.
> RF distances) and branch supports, but I just wanted to discuss this
> early on, to know if going this way in my work makes any sense at all?
> Is the information stored in these sites useful for RAxML in some way?

It's worth exploring, but my feeling is that it will be difficult to
come up with good criteria to do this.

Alexis

> I am particularly worried about the effect of singletons, because they
> are not informative sites, but could probably globally inflate
> bootstrap supports (is 1 OTU vs all-the-rest a valid split or not?)
> and thus make poor branches look better than they actually are.
>
> -Jacek

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

Alexandros Stamatakis

Mar 15, 2013, 1:07:45 PM

to ra...@googlegroups.com

On Thu, 2013-03-14 at 09:49 -0700, JK wrote:
> EDIT2: Of course, amino acid freqs were supposed to be the same since there
> was no +F option and they were taken from the model (?).

Yes, that's how it works.

> Running with +F
> changed little of the outcome in terms of topology, supports and running
> time, even though the equilibrium freqs were clearly different for
> different alignments. Sorry for the mess in the original post.

makes sense.

alexis

JK

Mar 17, 2013, 10:27:15 AM

to ra...@googlegroups.com, Alexandros...@gmail.com

Well, I agree that it might not be easy, but I think it's worth discussing nonetheless, even if that's where it would end. I can think of two ways this could go:

One way, obviously, would be to compare it against simulated data, where we would know the true tree, and measure the deviation from it obtained with variously trimmed versions of the alignment using RF distances, the number of shared splits and their individual bootstraps, or the number of NNI/SPR/TBR operations necessary to get from the ML tree to the true tree.
Alternatively, if we don't have the true tree, we could compare the trees obtained with the trimmed variants against those from an alignment that has the same number of sites removed but in a totally random pattern (and repeat that many times). I realize that this would be more a measure of precision than of accuracy, but if we don't know the true tree, what else can we do?
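
The random-removal control could be generated along these lines. This is only a sketch under assumptions: the alignment is held as a dict of equal-length strings, and the names `random_trim` and `random_controls` are hypothetical helpers, not part of RAxML.

```python
import random


def random_trim(seqs, n_remove, rng):
    """Delete n_remove columns chosen uniformly at random from the alignment.

    seqs: dict mapping sequence name -> aligned sequence (all equal length).
    rng: a random.Random instance, so that replicates are reproducible.
    """
    names = list(seqs)
    length = len(seqs[names[0]])
    if not 0 <= n_remove <= length:
        raise ValueError("n_remove must be between 0 and the alignment length")
    removed = set(rng.sample(range(length), n_remove))
    keep = [i for i in range(length) if i not in removed]
    return {n: "".join(seqs[n][i] for i in keep) for n in names}


def random_controls(seqs, n_remove, replicates=100, seed=0):
    """Build `replicates` control alignments, each missing n_remove random columns."""
    rng = random.Random(seed)
    return [random_trim(seqs, n_remove, rng) for _ in range(replicates)]
```

Setting `n_remove` to the number of columns dropped by the targeted trimming keeps the alignment lengths equal, so any difference between the resulting trees reflects which sites were removed rather than how many.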

Alexandros Stamatakis

Mar 18, 2013, 11:26:15 PM

to JK, ra...@googlegroups.com

Hi Jacek,

> Well, I agree that it might not be easy, but I think it's worth
> discussing nonetheless, even if that's where it would end. I can think
> of two ways this could go:
>
> One way, obviously, would be to compare it against simulated data,
> where we would know the true tree, and measure the deviation from it
> obtained with variously trimmed versions of the alignment using RF
> distances, the number of shared splits and their individual
> bootstraps, or the number of NNI/SPR/TBR operations necessary to get
> from the ML tree to the true tree.

Yes, but simulated data is generally weird and often does not behave
like real data, at least in my experience.

> Alternatively, if we don't have the true tree, we could compare the
> trees obtained with the trimmed variants against those from an
> alignment that has the same number of sites removed but in a totally
> random pattern (and repeat that many times). I realize that this
> would be more a measure of precision than of accuracy, but if we
> don't know the true tree, what else can we do?

Sounds good to me.
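
Either comparison comes down to counting the splits on which two trees disagree. A minimal sketch of the RF distance follows, assuming unrooted trees over the same taxa written as nested tuples of leaf names; this representation and the function names are illustrative assumptions, and a real analysis would parse Newick files with an established library.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Collect the non-trivial bipartitions (splits) of a tree.

    Each split is stored as the side that does NOT contain a fixed
    reference leaf, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    ref = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if ref in below else below
        # Skip trivial splits (one leaf against all the rest) and the root.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in only one of the trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must contain the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))
```

Note that this definition ignores trivial splits, so a singleton site supporting "1 OTU vs all-the-rest" cannot change the RF distance; its effect, if any, would show up in branch lengths or supports instead.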

Alexis
