Treatment of missing data


Takuya Konuma

Nov 6, 2019, 5:06:52 AM
to raxml
Hi all!

I wonder how missing data are treated in RAxML or RAxML-NG.

Because our NGS data have a lot of missing data, I want to know how they affect tree inference.

Please teach me.

Alexandros Stamatakis

Nov 6, 2019, 5:17:38 AM
to ra...@googlegroups.com
Missing data are treated as undetermined characters, i.e., they don't
contribute anything to the likelihood. Thus, it's not the way they are
treated that affects the inference, but the mere absence of data/signal
that can affect the inference.
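
A minimal sketch of why an undetermined character adds nothing to the likelihood. This is not RAxML's code: it assumes a JC69 model, a star tree with three tips, and made-up branch lengths. A tip with a missing character gets a conditional likelihood of 1 for every state, so its branch contributes a factor of 1 and the site likelihood equals that of the tree without this tip.

```python
import math

STATES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of state i changing to state j along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def tip_vector(char):
    """Conditional likelihoods at a tip: one-hot for A/C/G/T, all ones otherwise."""
    if char in STATES:
        return [1.0 if s == char else 0.0 for s in STATES]
    return [1.0] * 4  # undetermined: every state is equally compatible

def site_likelihood(chars, branch_lengths):
    """Likelihood of one alignment column on a star tree (all tips attached to one root)."""
    total = 0.0
    for x in STATES:                       # sum over the unknown root state
        term = 0.25                        # JC69 equilibrium frequency
        for char, t in zip(chars, branch_lengths):
            tip = tip_vector(char)
            term *= sum(jc69_prob(x, y, t) * tip[k] for k, y in enumerate(STATES))
        total += term
    return total

# Hypothetical column: third taxon is missing ("N").
with_gap = site_likelihood(["A", "G", "N"], [0.1, 0.2, 0.3])
pruned = site_likelihood(["A", "G"], [0.1, 0.2])
print(with_gap, pruned)  # the two values agree up to rounding
```

The two likelihoods agree because each row of the transition matrix sums to 1, which is the sense in which the missing entry carries no signal.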

Alexis
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/raxml/06778b97-9359-4c6c-a4c4-ac6349886fc0%40googlegroups.com
> <https://groups.google.com/d/msgid/raxml/06778b97-9359-4c6c-a4c4-ac6349886fc0%40googlegroups.com?utm_medium=email&utm_source=footer>.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Takuya Konuma

Nov 6, 2019, 5:48:51 AM
to raxml
Thank you very much.

Omar Nurhusien

Nov 6, 2019, 7:16:31 AM
to ra...@googlegroups.com

I have a related question: why do we need to provide a tree before the analysis when we know we have no better tree than the distance tree that is already the default? In other words, how different will our final character-based tree be? Are there certain conditions under which providing a tree is more helpful in terms of efficiency and search speed?

 


Omar Nurhusien

Nov 6, 2019, 7:23:04 AM
to ra...@googlegroups.com

I think missing data is simply treated as no data!

 


Grimm

Nov 6, 2019, 10:08:21 AM
to raxml
To illustrate Alexis's answer a bit:

Your NGS data will typically inform certain splits, but you will also have a lot of noise. For well-covered data, the noise is no problem, because NGS usually gives you a lot of patterns that support the splits. Note that there will be patterns that are additive (they reinforce each other for a topological decision), and there will be patterns that better capture deep and flat signal (depending on the taxonomic breadth of your data).

If your NGS data is dense enough to infer a backbone tree, then accessions with a lot of missing data have a good chance to be placed correctly and not hamper the tree-building process. But if the NGS data provide somewhat diffuse signal and there is little overlap in the accessions (some miss patterns 1-900, others 101-1000, etc.), then two things may happen.

  • The accession(s) will go rogue, because its (their) noisy data doesn't allow the algorithm to find an optimal position. When looking at the tree (and this applies to NGS and classic trees), this surfaces in unusually long terminal branches that are connected to the rest of the tree by very short root branches with poor bootstrap support.
  • The well-sampled patterns, which may be very few, will force a (potentially) biased tree on the otherwise cloudy data.

Since NGS data are usually vast, missing-data biases are often overlooked or not tested for. We just looked at a tree that was partly biased because of filtering effects via the reference. And I know a couple of classic oligogene analyses where missing data had a detrimental effect, including artificial branches with high support (e.g. when people combine gene regions of different divergence without assessing the information content of the combined gene regions).

To quickly check for possible missing-data effects, you can, e.g., prune the data down to a matrix with low gappiness and compare the tree you get with the unfiltered tree. If the taxa in both trees turn up in the same or similar places, there's little to worry about. If they jump, and you get highly supported conflict and strongly different branch-length patterns, the missing data are a problem.
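
The pruning step could be sketched as follows. This is only an illustration: the alignment is a made-up dictionary of DNA sequences, the function names and both thresholds are assumptions, and `-`, `?` and `N` are taken to mean missing data.

```python
MISSING = set("-?Nn")  # assumed missing-data symbols for DNA

def missing_fraction(seq):
    """Fraction of positions in one sequence that are missing."""
    return sum(c in MISSING for c in seq) / len(seq)

def prune_gappy(alignment, max_taxon_missing=0.5, max_column_missing=0.5):
    """Drop gappy taxa first, then drop gappy columns among the remaining taxa."""
    kept = {name: seq for name, seq in alignment.items()
            if missing_fraction(seq) <= max_taxon_missing}
    if not kept:
        return {}
    n_taxa = len(kept)
    length = len(next(iter(kept.values())))
    keep_cols = [i for i in range(length)
                 if sum(seq[i] in MISSING for seq in kept.values()) / n_taxa
                 <= max_column_missing]
    return {name: "".join(seq[i] for i in keep_cols) for name, seq in kept.items()}

# Hypothetical toy alignment; taxC is mostly missing.
alignment = {
    "taxA": "ACGTACGT",
    "taxB": "ACGTAC--",
    "taxC": "NNNNNNGT",
    "taxD": "ACG-ACGT",
}
pruned = prune_gappy(alignment, max_taxon_missing=0.5, max_column_missing=0.25)
for name, seq in pruned.items():
    print(name, seq)
```

The pruned matrix and the full matrix would then each go through the same tree search, and the positions of the shared taxa in the two trees can be compared.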

Good inference, Guido

Alexandros Stamatakis

Nov 7, 2019, 5:05:20 AM
to ra...@googlegroups.com
exactly,

alexis

On 06.11.19 13:23, Omar Nurhusien wrote:
> I think missing data is simply treated as no data!

Alexandros Stamatakis

Nov 7, 2019, 5:06:17 AM
to ra...@googlegroups.com
I am sorry, but I don't understand the question at all.

Alexis

On 06.11.19 13:16, Omar Nurhusien wrote:
> I have a related question: why do we need to provide a tree before the
> analysis when we know we have no better tree than the distance tree that
> is already the default? In other words, how different will our final
> character-based tree be? Are there certain conditions under which
> providing a tree is more helpful in terms of efficiency and search speed?

omaridris5315

Nov 7, 2019, 6:12:06 AM
to ra...@googlegroups.com
I am referring to the user tree that is used as input when running RAxML in a tree search. Sorry for not being clear. Where do I get it, and why do I need it in the first place?
Thanks.




Alexandros Stamatakis

Nov 7, 2019, 6:14:09 AM
to ra...@googlegroups.com
You don't necessarily need a user tree. If you don't specify one, RAxML
will infer one. A starting tree is needed because the ML search has to
start somewhere. Also, in many cases a simple NJ tree will not be
identical to the ML tree.
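
As a rough illustration of the two options (user tree versus starting trees generated by the program), here is a sketch that only builds a RAxML-NG command line and does not run anything. The helper name, the file names and the GTR+G model are assumptions; the options are RAxML-NG's as I understand them, so please check them against `raxml-ng --help` for your version.

```python
def raxml_ng_search_args(msa, model="GTR+G", user_tree=None, n_pars=10, n_rand=10):
    """Build an argument list for a RAxML-NG tree search (nothing is executed here)."""
    if user_tree is not None:
        start = user_tree                               # search starts from this Newick file
    else:
        start = f"pars{{{n_pars}}},rand{{{n_rand}}}"    # program-generated starting trees
    return ["raxml-ng", "--search", "--msa", msa, "--model", model, "--tree", start]

# Hypothetical file names.
print(" ".join(raxml_ng_search_args("alignment.fasta")))
print(" ".join(raxml_ng_search_args("alignment.fasta", user_tree="start.nwk")))
```

Running the search from several different starting trees and comparing the resulting likelihoods is one way to see how much the starting point matters for a given data set.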

Alexis

Takuya Konuma

Nov 20, 2019, 10:17:53 PM
to raxml
Thank you for your kind explanation!

So if I try to infer a tree using many species, will it be a (potentially) biased tree?
SNP calling between distant species yields data that have "somewhat diffuse signal and little overlap in the accessions", although the NGS data themselves are robust.

I wonder whether the tree inferred from such data is correct, in other words, whether it reflects the phylogenetic relationships of the species.

Alexandros Stamatakis

Nov 21, 2019, 2:50:18 AM
to ra...@googlegroups.com


On 21.11.19 04:17, Takuya Konuma wrote:
> Thank you for your kind explanation!
>
> So if I try to infer tree using many species, will this be (potentially)
> biased tree?
> SNP calling between distant species yield data which have "somewhat
> diffuse signal and  little overlap in the accessions" although the NGS
> data is robust.

The more noise you have in the data, the harder it will be to resolve
the tree.

> I wonder the tree inferred using such data is correct, in other word it
> reflects the species phylogenetic relationship.

That's something we never know anyway, i.e., if the tree is right or
wrong. You could only try to simulate data and inject distinct noise
levels to get an intuition about how much this affects tree inference.
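
A small sketch of the "inject distinct noise levels" idea, under stated assumptions: the starting alignment is taken to be already simulated on a known tree with some other tool (here it is just a made-up toy dictionary), noise means random substitution errors, and missing data are written as `N`. The function name and the levels are hypothetical.

```python
import random

NUCLEOTIDES = "ACGT"

def perturb(alignment, noise=0.0, missing=0.0, seed=42):
    """Return a copy of the alignment with random errors and missing characters injected."""
    rng = random.Random(seed)              # fixed seed keeps the experiment repeatable
    out = {}
    for name, seq in alignment.items():
        chars = []
        for c in seq:
            r = rng.random()
            if r < missing:
                chars.append("N")          # character becomes missing
            elif r < missing + noise and c in NUCLEOTIDES:
                # character is replaced by a different, randomly chosen nucleotide
                chars.append(rng.choice([n for n in NUCLEOTIDES if n != c]))
            else:
                chars.append(c)            # character is left untouched
        out[name] = "".join(chars)
    return out

# Hypothetical toy data standing in for a simulated alignment.
simulated = {"t1": "ACGTACGTAC", "t2": "ACGTTCGTAC", "t3": "ACCTACGAAC"}
for level in (0.0, 0.1, 0.3, 0.5):
    print(level, perturb(simulated, noise=level, missing=level))
```

Each perturbed alignment would then be analysed in the same way, and the inferred trees compared with the known tree used for the simulation.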

Alexis


Grimm

Nov 21, 2019, 8:20:30 AM
to raxml
An add-on to Alexis's reply, a bit philosophical.

> > I wonder the tree inferred using such data is correct, in other word it
> > reflects the species phylogenetic relationship.
>
> That's something we never know anyway, i.e., if the tree is right or
> wrong. You could only try to simulate data and inject distinct noise
> levels to get an intuition about how much this affects tree inference.

In practice, no data set or tree can fully reflect species phylogenetic relationships, at least in non-viruses (and viruses recombine, which also violates the assumption of a tree) or in organisms that do not reproduce exclusively asexually. The reason for this is very simple: speciation is not a strictly dichotomous process. If reproduction is sexual, it is inherently recombinant. So any inferred tree is partially wrong; it must be. However, it may sensibly approach the coalescent: the sorting of mutations/genes/genomes through general evolution, putting aside occasional and inevitable reticulation. Because what happens once inbreeding outcompetes outbreeding is what we call "lineage sorting": inherited lineage-specific patterns emerge that outcompete random or convergent patterns.

The critical question for your data is hence not whether the tree is right or wrong (as Alexis says, we will never know that), but whether it
  1. depicts the differentiation in the data
  2. makes sense
For 2., you "just" need to know your organism and other data on it.

But 1. can be tested by exploring the signal in the data. Adding to my earlier comment (hunting and pruning rogues), if you suspect there is something wrong with the tree, these are the follow-up questions:
  • How do my data support the tree? RAxML implements classic BS but also the TC/IC measures.
  • Do the bootstrap replicates show alternatives? I.e., if your tree has branches with < 100 BS support, is it because of signal indecisiveness (no alternatives with support) or conflicting alternatives (split support)?
  • Are these alternatives significantly worse than my inferred tree? E.g., test them using RAxML's quick built-in SH test or the more elaborate and precise AU test (implemented in CONSEL).
If you are familiar with R (which I'm not), you can even extract the patterns from your data supporting alternative trees.

Cheers, Guido


Grimm

Nov 21, 2019, 9:02:38 AM
to raxml
A postscript

A very simple test to see if you have a signal issue in your data is to check whether the branches in the tree that have BS < 50 (or much lower) have better-supported conflicting alternatives in the BS pseudoreplicate sample. If your data are consistent, the branch in the optimised ML tree should always be the best-supported one in the BS sample, irrespective of support. I.e., even if a branch in the ML tree has a low BS of, e.g., 30, one only needs to worry if the BS replicates show a conflicting bipartition with BS > 30. When all alternatives are random with BS ~ 0, it just means your data provide a consistent, though not a strong, signal for this relationship.
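
The check described above could be sketched like this. It is an illustration only: trees are given directly as sets of non-trivial bipartitions (each written as the taxa on one side) instead of being parsed from Newick files, and the taxa, replicate counts and function names are all made up.

```python
from collections import Counter

def normalize(split, taxa, ref):
    """Represent a bipartition by the side that does not contain the reference taxon."""
    side = frozenset(split)
    return frozenset(taxa) - side if ref in side else side

def split_support(replicates, taxa):
    """Percentage of bootstrap replicate trees that contain each bipartition."""
    ref = min(taxa)
    counts = Counter()
    for tree in replicates:
        for s in {normalize(s, taxa, ref) for s in tree}:
            counts[s] += 1
    return {s: 100.0 * c / len(replicates) for s, c in counts.items()}

def conflicts(a, b):
    """Two normalized bipartitions conflict if they overlap without one nesting in the other."""
    return bool(a & b) and bool(a - b) and bool(b - a)

def better_supported_conflicts(ml_splits, replicates, taxa):
    """For each ML branch, list conflicting bipartitions with higher bootstrap support."""
    ref = min(taxa)
    support = split_support(replicates, taxa)
    report = {}
    for s in ml_splits:
        s = normalize(s, taxa, ref)
        own = support.get(s, 0.0)
        rivals = {alt: bs for alt, bs in support.items()
                  if conflicts(s, alt) and bs > own}
        if rivals:
            report[s] = (own, rivals)
    return report

# Hypothetical example: five taxa, ML tree groups A+B and D+E.
taxa = {"A", "B", "C", "D", "E"}
ml_tree = [{"A", "B"}, {"D", "E"}]
replicates = ([[{"A", "B"}, {"D", "E"}]] * 3      # 3 replicates agree with the ML tree
              + [[{"A", "C"}, {"D", "E"}]] * 6    # 6 replicates prefer A+C
              + [[{"B", "C"}, {"D", "E"}]] * 1)   # 1 replicate prefers B+C
report = better_supported_conflicts(ml_tree, replicates, taxa)
for split, (own, rivals) in report.items():
    print(sorted(split), own, {tuple(sorted(r)): bs for r, bs in rivals.items()})
```

In this toy case only the A+B branch is reported, because the conflicting A+C grouping occurs in more replicates than A+B itself. The B+C alternative also conflicts but has lower support, so it is not a cause for concern by the criterion above.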
