comparing multiple two-lambda models


F U

Apr 4, 2025, 9:44:30 AM
to hahnlab-cafe
Hello,
Thank you for developing this method!
I have been testing different configurations of multi-lambda models for my dataset. It's quite small (only 6 taxa).
Just running on the real data, I find that many two-lambda models, and even a 3-lambda and a 4-lambda model, converge.

I have run 1000 simulations under the global-lambda model and tested this against one of my two-lambda models. The LRT for my real data (1556) is wildly outside the distribution of LRTs from the simulated data (which range between -45150 and -43974), so I would reject the global model. I hope this is the correct way to do the simulation-based model selection?
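
In case it is useful, this is roughly how I compute the statistic; the result-directory names here are just placeholders, not my exact paths:

# LRT = 2 * ( (-lnL of global model) - (-lnL of 2-lambda model) ); positive values favour the 2-lambda model
NLNL_GLOBAL=$(grep "Final Likelihood" results_global/Base_results.txt | awk '{print $NF}')
NLNL_2LAMBDA=$(grep "Final Likelihood" results_2lambda/Base_results.txt | awk '{print $NF}')
awk -v g="$NLNL_GLOBAL" -v m="$NLNL_2LAMBDA" 'BEGIN{printf "real-data LRT = %.1f\n", 2*(g-m)}'

The same statistic computed on each simulated dataset gives the distribution I compare the 1556 against.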

Assuming that this is correct, I am now a bit confused about how to compare different two-lambda models against each other.
I could test each two-lambda model against the global model first. But if I then end up with the LRTs suggesting that several of my two-lambda models fit my data better than the global-lambda model, how do I proceed? Should I simulate again using one of my two-lambda models and then compare LRTs between the two-lambda models?

Or would you suggest that I can skip these (many) comparisons for now and proceed to testing the 2-lambda vs. 3-lambda models, and then (possibly) the 3-lambda vs. 4-lambda models?
The 3-lambda and 4-lambda models further subdivide the two-lambda model I already tested with simulations.
 
 I am grateful for any guidance you have to share,
 F

Hahn, Matthew

Apr 4, 2025, 9:53:04 AM
to F U, hahnlab-cafe
Hello,

Thank you for your message. I think there is something strange about the likelihood values you are reporting for your dataset vs. the simulated data. Maybe it’s just that I don’t understand the contrasts you are running. For the simulations, you should have simulated a dataset of similar size to your real dataset (possibly from the same root distribution) using a 1-rate model, and then fit both 1- and 2-rate models to those simulated data. Regardless of the details, the 2-rate model should always fit slightly better, and so have the same sign as the contrast on your real data. The fact that this isn’t true suggests something is amiss.

Otherwise, you can test models of any complexity you wish in any order. We have no real recommendations for those steps.




Matthew Hahn


Hahn, Matthew

Apr 4, 2025, 11:58:34 AM
to Flora Uesseler, hahnlab-cafe
Hi again,

I think maybe I’m confused as to why the likelihoods are different by so many orders of magnitude in the real and simulated data. If you are evaluating the two models across 10k gene families, the likelihoods should not be 16 and 18—they should be very large (negative log-likelihood) numbers. 

Also, your commands for getting the likelihoods of the simulated data are confusing to me. The main thing is that you must ask CAFE to estimate lambda and the likelihoods—you cannot give lambda. I also wasn’t clear on why the commands labeled “# 2-lambda” were for a 1-lambda model, nor how the lambda values given were passed to CAFE.
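
Concretely, and just as a sketch reusing your own variable names, I would expect each simulated dataset to be fit along these lines:

# fit the 1-lambda model to the simulated counts, letting CAFE estimate lambda (no -l)
$CAFE5 -i sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/simulation.txt -t $TREE -e${ERROR} -o sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/results_global
# fit the 2-lambda model to the same simulated counts, again with the lambdas estimated
$CAFE5 -i sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/simulation.txt -t $TREE -e${ERROR} -y $LAMBDA_TREE -o sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/results_2lambda

and then the difference in the reported -lnL values taken exactly as for the real data.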


cheers,
Matt

On Apr 4, 2025, at 11:20 AM, Flora Uesseler <fues...@gmail.com> wrote:

Hello Dr. Hahn,
Thank you for your reply! I can give more information; maybe then the problem becomes easier to pinpoint. My dataset (clustered with OrthoMCL following the steps in your CAFE tutorial) has around 11 thousand clusters, so for a similar size I produce 10k gene families in each simulation.
Here's the code where I 1) simulate, 2) run the global model on the simulated data, and 3) run the 2-lambda model on the simulated data. Actually, I mistyped in my previous message and so far only have 100 simulated datasets. The other 900+ are still running.

LAMBDA=0.00076025704930212 #single
#for some reason running simulation with error model always fails, so simulate without:
$CAFE5 -s10000 -i ../all_by_all/filtered_cafe_input.txt -t $TREE -l $LAMBDA -o sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}
$CAFE5 -i sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/simulation.txt -t $TREE -l $LAMBDA -e${ERROR} -o sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/results
LAMBDA=0.0030770389046543,0.0005562107568264 # 2-lambda
$CAFE5 -i sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/simulation.txt -t $TREE -l $LAMBDA -e${ERROR} -y $LAMBDA_TREE -o sim10000_single_lambda_${SLURM_ARRAY_TASK_ID}/results_multi_lambda_Galapagos

The lambda values are from running the global and 2-lambda models on my real data (and checking for convergence in 3 separate runs) like this:
$CAFE5 -i ../all_by_all/filtered_cafe_input.txt -t $TREE --cores $SLURM_CPUS_PER_TASK -e${ERROR} -o results_with_errormodel
$CAFE5 -i ../all_by_all/filtered_cafe_input.txt -t $TREE --cores $SLURM_CPUS_PER_TASK -y $LAMBDA_TREE -e${ERROR} -o results_with_error

From my 100 simulations I took the "Model Base Final Likelihood (-lnL):" values from Base_results.txt (100 values each, from running under the global-lambda model and under the 2-lambda model). I attach a plot of the distribution to this mail.
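
To build the distribution in the plot, I collect the per-simulation statistics roughly like this (paths as in my script above; null_LRT.txt is just a scratch file):

# one LRT value per simulation: 2 * ( -lnL of the global fit minus -lnL of the 2-lambda fit )
for i in $(seq 1 100); do
  G=$(grep "Final Likelihood" sim10000_single_lambda_${i}/results/Base_results.txt | awk '{print $NF}')
  M=$(grep "Final Likelihood" sim10000_single_lambda_${i}/results_multi_lambda_Galapagos/Base_results.txt | awk '{print $NF}')
  awk -v g="$G" -v m="$M" 'BEGIN{print 2*(g-m)}'
done > null_LRT.txt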

Regardless of the details, the 2-rate model should always fit slightly better, and so have the same sign as the contrast on your real data. The fact that this isn’t true suggests something is amiss.
 
This is true for my real data. Here I get a -lnL of 81508 for the global-rate model and 80730 for the 2-rate model, so the likelihood shows a better fit for the 2-rate model. But if I check my simulations, the opposite is true: the -lnL values are around 16.000 for the global-rate model and around 18.000 for the 2-rate model.

Unfortunately I was unable to understand the structure and purpose of the root distribution file. I know there are examples on the GitHub page, but as I did not understand them, I could not recreate the file with my own data to use for the simulations. Maybe not using it is what is causing my simulations to behave so strangely? If you think so too, I would be very grateful for an explanation of the root distribution, so I can try it.

Thank you a lot and have a nice weekend,
F
LRT_distribution_simulations.png

Flora Uesseler

Apr 7, 2025, 4:09:42 AM
to Hahn, Matthew, hahnlab-cafe
Hello again,

Thank you for explaining; I had misunderstood the workflow previously!
I have now taken the 1000 simulated datasets and run the 1-lambda and 2-lambda models on them without specifying lambda, to get the likelihoods. I get similar -lnL values of around 16 000 for both models. I attach my likelihood-ratio plot again (for visualization I had to remove 9 outliers, as their more extreme values shifted the plot too much; I am not sure whether I should be worried about those "long tails" caused by a few data points?).
In any case, when comparing these models on my real data, I get a likelihood ratio of 1556, which falls outside this distribution. Therefore, I can conclude that the 1-lambda model is less suited to my data than the 2-lambda model, correct?
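
To put a number on "falls outside", I count how many simulated LRTs are at least as large as the real one (null_LRT.txt is the file with the simulated LRT values, collected as in my earlier loop):

# empirical p-value, with a pseudocount so it is never exactly zero
awk -v real=1556 '{n++; if ($1 >= real) k++} END{printf "empirical p ~ %.4f (%d of %d simulated LRTs >= real)\n", (k+1)/(n+1), k, n}' null_LRT.txt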

Thank you for your guidance!
F
LRT_distribution_simulations.png

Hahn, Matthew

Apr 7, 2025, 10:07:37 AM
to Flora Uesseler, hahnlab-cafe
Yes, exactly right (though I wouldn’t exclude the outliers). Your data definitely support the 2-lambda model!


Matt



