Simulations to examine the effects of a) gene-tree topologies, b) gene-tree models, c) missing data

jigyasa arora

Jul 27, 2021, 11:02:54 PM

to GeneRax

Hey all!

I would like to ask for some advice and suggestions! I am examining whether the transfer events in my gene tree are affected by the following parameters:

a) tree topology (1000 simulations of randomized host-tree topologies),

b) gene-tree models (HKY (kappa=2) vs GTR),

c) missing data (50% data present, 75% data present)

I find that the transfer rates of my gene tree differ significantly for (a) and (c).

I used the Kruskal-Wallis test to compare the parameters/methods, followed by the Dunn test for pairwise comparisons.

> kruskal.test(transfer_rates ~ method, data = alldata)  # global test

> library(PMCMRplus)

> posthoc.kruskal.dunn.test(x = alldata$transfer_rates, g = as.factor(alldata$method), dist = "Tukey")  # pairwise test

           GTR                  GTR_50perc           GTR_75perc           HKY
GTR_50perc 0.000014359699       -                    -                    -
GTR_75perc 1.00000              0.00033              -                    -
HKY        0.29878              0.000000000028       0.10013              -
HKY_50perc 0.00033              1.00000              0.00332              0.000000019176
HKY_75perc 0.10013              0.26188              0.29878              0.00016
simulation < 0.0000000000000002 < 0.0000000000000002 < 0.0000000000000002 < 0.0000000000000002

           HKY_50perc           HKY_75perc
GTR_50perc -                    -
GTR_75perc -                    -
HKY        -                    -
HKY_50perc -                    -
HKY_75perc 0.29878              -
simulation < 0.0000000000000002 < 0.0000000000000002
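For readers who want to see what the two steps above compute, here is a minimal sketch in plain Python of the Kruskal-Wallis H statistic followed by Dunn's pairwise z-tests. It is not the PMCMRplus implementation: the function names are hypothetical, and the Holm adjustment of the pairwise p-values is an assumption, since the thread does not say which adjustment was used.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """groups: dict mapping a method name to its list of transfer rates.

    Returns (H, adjusted), where H is the tie-corrected Kruskal-Wallis
    statistic and adjusted maps each pair of names to a Holm-adjusted
    two-sided p-value from Dunn's test. Assumes the values are not all equal.
    """
    names = list(groups)
    pooled = [v for name in names for v in groups[name]]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Tie term: sum of (t^3 - t) over every block of t tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_sum = sum(t ** 3 - t for t in counts.values())

    # Mean rank and size of each group (pooled keeps the group order).
    mean_rank, size, pos = {}, {}, 0
    for name in names:
        size[name] = len(groups[name])
        mean_rank[name] = sum(ranks[pos:pos + size[name]]) / size[name]
        pos += size[name]

    h = 12 / (n * (n + 1)) * sum(size[g] * mean_rank[g] ** 2 for g in names)
    h = (h - 3 * (n + 1)) / (1 - tie_sum / (n ** 3 - n))

    # Dunn's z statistic per pair, converted to a two-sided normal p-value.
    raw = {}
    for a, b in combinations(names, 2):
        var = (n * (n + 1) / 12 - tie_sum / (12 * (n - 1))) * (1 / size[a] + 1 / size[b])
        z = (mean_rank[a] - mean_rank[b]) / math.sqrt(var)
        raw[(a, b)] = math.erfc(abs(z) / math.sqrt(2))

    # Holm step-down adjustment (an assumption, see the note above).
    adjusted, running = {}, 0.0
    ordered = sorted(raw, key=raw.get)
    for idx, pair in enumerate(ordered):
        running = max(running, min(1.0, (len(ordered) - idx) * raw[pair]))
        adjusted[pair] = running
    return h, adjusted
```

Calling `kruskal_dunn({"GTR": [...], "HKY": [...], "simulation": [...]})` with the per-run transfer rates would give one global statistic and one adjusted p-value per pair of methods, which is the same shape as the table above.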

I wanted to ask: 1) Are the steps used here for the statistical analysis (Kruskal-Wallis and Dunn test) correct?

2) Is it common for ML-based HGT estimation methods to be biased by missing data?

Thanks again!

Benoit Morel

Jul 28, 2021, 5:07:58 PM

to GeneRax

Dear Jigyasa,

I am afraid that I am not qualified to answer 1).

Regarding 2) (and your general question), here is my feeling (but it's just a guess):

- the species tree topology should not affect the DTL rates that much (but I don't know if the host tree behaves like the species tree, and it depends on how you simulate the gene trees).

- the substitution model could affect the inferred rates a bit, in particular if one model finds "better" trees. We tried playing with model misspecification in the GeneRax paper, but it didn't affect the gene tree topology accuracy much (so I don't think it should introduce an important bias).

- I think that missing data is likely to introduce a bias, but I can't say how for sure. The nature of this bias would depend on the distribution of missing data.

I stress that this is just a guess, and it would need more serious studies to be confirmed.

Does it help?

Benoit

jigyasa arora

Jul 28, 2021, 10:25:59 PM

to GeneRax

Hey Benoit

Thank you again for replying!

Regarding your reply to (2)

> - the species tree topology should not affect the DTL rates that much (but I don't know if the host tree behaves like the species tree, and it depends on how you simulate the gene trees).

We see co-evolution between the host and the symbiont (using other methods), which is why we treat them as equivalent to species and gene trees. I randomized the host tree tips so that they no longer show co-evolution with the symbiont tree. The aim here is to test whether the transfer rates observed in the empirical data differ significantly from the random transfers expected if there were no co-evolution.
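One common way to frame this comparison is an empirical (Monte Carlo) p-value: count how often the randomized runs give a transfer rate at least as extreme as the empirical one. The sketch below assumes one transfer rate per randomized run; the function name and the `alternative` argument are hypothetical and not part of GeneRax.

```python
def empirical_p_value(observed, null_values, alternative="two-sided"):
    """Monte Carlo p-value of `observed` against a list of null values.

    The +1 in numerator and denominator counts the observed value as one
    draw from the null, so the result is never exactly zero.
    """
    n = len(null_values)
    at_least = sum(1 for v in null_values if v >= observed)
    at_most = sum(1 for v in null_values if v <= observed)
    p_upper = (at_least + 1) / (n + 1)  # observed unusually high
    p_lower = (at_most + 1) / (n + 1)   # observed unusually low
    if alternative == "greater":
        return p_upper
    if alternative == "less":
        return p_lower
    # Two-sided: double the smaller tail, capped at 1.
    return min(1.0, 2 * min(p_upper, p_lower))
```

With 1000 randomizations the smallest attainable one-sided p-value is 1/1001, so the number of simulations sets the resolution of the test.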

What do you mean when you say that the species tree topology would not affect the DTL rates on the gene tree? Since the method is species-tree aware, wouldn't uncertainty in the species tree affect the gene tree topology?

> - I think that missing data is likely to introduce a bias, but I can't say how for sure. The nature of this bias would depend on the distribution of missing data.

I completely agree. The distribution of the missing data is very important. The results I showed were based on randomly removing tips from the symbiont tree. But as I do not have a "ground truth" for the actual representation in the sequencing data (i.e. how much data is already missing from my empirical analysis), any random removal would be wrong.
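The uniform random removal described above can be sketched as follows. This is only an illustration: the function name and tip labels are hypothetical, and it works on a list of tip labels, leaving the actual pruning of the tree to whatever tree library is in use.

```python
import random


def keep_random_tips(tips, fraction_present, seed=None):
    """Return the tips kept when each tip is equally likely to go missing.

    fraction_present is the share of tips to keep, e.g. 0.5 or 0.75.
    The original order of the tips is preserved.
    """
    rng = random.Random(seed)
    n_keep = round(len(tips) * fraction_present)
    kept_indices = set(rng.sample(range(len(tips)), n_keep))
    return [tip for i, tip in enumerate(tips) if i in kept_indices]
```

For example, `keep_random_tips(tips, 0.75, seed=1)` keeps three quarters of the tips; fixing `seed` makes a given replicate reproducible.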

I was thinking about changing the distribution of missing data per sample based on the empirical data, or simulating perfect co-evolution and then incrementally removing tips per sample. Let's see.

Thanks again for all the advice and suggestions!

jigyasa arora

Jul 28, 2021, 10:38:54 PM

to GeneRax

Sorry, I got confused about the randomization part. I randomized the host tree tips and used the resulting trees as the symbiont tree. I did not change the topology of the host tree.
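One possible reading of this randomization is a tip-label permutation: the topology and branch lengths stay the same and only the leaf names are shuffled. The sketch below does that directly on a Newick string. It is an assumption about the procedure, not a description of the script actually used, and it only handles unquoted labels with no whitespace in the Newick string.

```python
import random
import re

# A leaf label directly follows "(" or ","; internal node labels follow ")"
# and are therefore left untouched.
LEAF = re.compile(r"(?<=[(,])[^(),:;]+")


def shuffle_newick_tips(newick, seed=None):
    """Return the same tree with its leaf labels randomly permuted."""
    rng = random.Random(seed)
    shuffled = LEAF.findall(newick)
    rng.shuffle(shuffled)
    replacements = iter(shuffled)
    # Substitute the labels in order of appearance with the shuffled ones.
    return LEAF.sub(lambda match: next(replacements), newick)
```

Repeating the call with different seeds gives a set of label-randomized trees, one per null replicate.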

Benoit Morel

Jul 29, 2021, 7:52:56 AM

to GeneRax

Hi Jigyasa,

> What do you mean when you say that the species tree topology would not affect the DTL rates on the gene tree? Since the method is species-tree aware, wouldn't uncertainty in the species tree affect the gene tree topology?

Using a species tree to correct the gene tree strongly affects the DTL rates, yes. But if you simulate n different species trees, and m gene trees for each species tree, I would expect that for a large m, the average DTL rates would be the same for each species tree. It might be different for very extreme topologies (for instance a "caterpillar tree", which is the most unbalanced tree one can get), but those are unlikely to be obtained from random simulations.
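To check whether a simulated species tree is close to the caterpillar extreme, one option is a balance statistic such as the Colless index, which sums the difference in leaf counts between the two children of every internal node. The sketch below is illustrative only: trees are nested 2-tuples with string leaves, and all function names are hypothetical.

```python
def caterpillar(n):
    """Build the fully unbalanced (caterpillar) tree on n leaves."""
    tree = "t1"
    for i in range(2, n + 1):
        tree = (tree, f"t{i}")  # each new leaf attaches above the old tree
    return tree


def balanced(labels):
    """Build a tree by recursively halving the list of leaf labels."""
    if len(labels) == 1:
        return labels[0]
    mid = len(labels) // 2
    return (balanced(labels[:mid]), balanced(labels[mid:]))


def colless(tree):
    """Return (number of leaves, Colless index) of a binary tree."""
    if isinstance(tree, str):
        return 1, 0
    (n_left, c_left), (n_right, c_right) = colless(tree[0]), colless(tree[1])
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)
```

For n leaves the caterpillar reaches the maximum value (n-1)(n-2)/2, while a perfectly balanced tree scores 0, so the index of a simulated tree can be compared against those two bounds.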

> But as I do not have a "ground truth" for the actual representation in the sequencing data (i.e. how much data is already missing from my empirical analysis),

This is unfortunately a very hard problem that we also face. We don't know what a realistic missing data distribution is, and it's almost impossible to estimate from "real" data... In general, I am very interested in any observations (from simulations or real data) about missing data, so feel free to discuss your thoughts/results with us ;-)

An interesting experiment would be to assign to each species a different removal probability.
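A minimal sketch of that experiment, assuming a mapping from gene-tree tips to species is available: each species gets its own removal probability, and every tip is dropped independently with the probability of its species. Drawing the probabilities from a Beta distribution is purely an illustrative choice, and the function names and default parameters are hypothetical.

```python
import random


def draw_removal_probs(species, alpha=2.0, beta=2.0, seed=None):
    """Assign each species a removal probability drawn from Beta(alpha, beta)."""
    rng = random.Random(seed)
    return {sp: rng.betavariate(alpha, beta) for sp in species}


def remove_tips_per_species(tip_to_species, removal_prob, seed=None):
    """Drop each tip independently with the removal probability of its species.

    tip_to_species maps a tip label to its species; removal_prob maps a
    species to a probability in [0, 1]. Returns the surviving mapping.
    """
    rng = random.Random(seed)
    return {
        tip: sp
        for tip, sp in tip_to_species.items()
        if rng.random() >= removal_prob[sp]  # kept with probability 1 - p
    }
```

Skewing the Beta parameters (for example a small `alpha` and a large `beta`) concentrates the missing data in a few species, which is one way to vary the distribution of missing data rather than only its amount.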

Also, be aware that when there is a very high quantity of missing data, reconciliation methods might be tempted to replace a sequence of losses with a single HGT. I would not expect this to happen too much with 25% or 50% missing data, but it's hard to quantify without trying...