Comparing two lambda models and error in simulation in cafe5

Wenqiang Shi

unread,

Jun 8, 2023, 3:51:04 AM6/8/23

to hahnlab-cafe

Dear Cafe5 authors,

Greetings!

I want to use cafe5 to test whether the global model is better than a local model with three lambda in different parts of the tree. It's quite similar to this thread https://groups.google.com/g/hahnlabcafe/c/yqwm5G1klLs, but I don't know how to do this in cafe5. I didn't find a complete tutorial for this purpose, or I missed something.

1. Cafe5 introduces gamma rate categories (-k) and poisson distribution (-p). Can I set these parameters in the model comparison? In my case, the best k of global model is 5 according to lnL, while the best k of the local model is 4.

2. Instead of simulation, can I use the lnL in the output of cafe5 directly in the likelihood ratio test? In an old paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2147951/), they used -lnL values in the output of global and local models to do a simple likelihood ratio test. Assume -2(lnL(global) - lnL(local)) follows χ2-distributed with degrees of freedom equal to the number of excess parameters. If I understand right, here the number of excess parameters should be 2 (two additional lambda in the local model) plus -1 (smaller k in the local model).

3. I also tried to do a simulation in cafe5 using the following cmd. It gives me an error and works fine withour -s100. The data are in the attachment.
cafe5 -i filtered.cafe.input.tsv -t r8s_ultrametric_mychange.txt -y separate_lambda.txt -l 0.0015 -s100 -o simulate_pk2.

================Error==========================
Filtering families not present at the root from: 18468 to 7802

No root family size distribution specified, using uniform distribution
Simulating with 1 model(s)
Simulating 100 families for model Base
cafe5: src/matrix_cache.cpp:62: int matrix::select_random_y(int, int) const: Assertion `max < _size' failed.
Aborted (core dumped)

====================================================

Thanks for your help!

Best wishes,

Wenqiang

separate_lambda.txt

r8s_ultrametric_mychange.txt

filtered.cafe.input.tsv

Hahn, Matthew

unread,

Jun 8, 2023, 4:52:59 PM6/8/23

to Wenqiang Shi, hahnlab-cafe

Hi Wenqiang,

First, in point 3 I think you have hit a bug that has also hit others—please download CAFE 5.1 to fix the simulation. To answer your other questions:

1. One cannot compare the likelihoods of gamma models with different k—these are not standard statistical parameters. You should feel free to pick k=4 and go from there, but these are not nested models.

2. We have not found that a likelihood ratio test between netted models with different number of lambda parameters is chi-squared distributed. This doesn’t mean it is not, but you would have to test to see whether this is the case. And again, k is not a model parameter.

cheers,

Matt

--
You received this message because you are subscribed to the Google Groups "hahnlab-cafe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hahnlabcafe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/debee03f-27ba-4c97-9154-f4a9cbb5e757n%40googlegroups.com.
<separate_lambda.txt><r8s_ultrametric_mychange.txt><filtered.cafe.input.tsv>

Wenqiang Shi

unread,

Jun 8, 2023, 9:27:28 PM6/8/23

to hahnlab-cafe

Hi Matt,

Thanks for the quick response. It helps a lot! The bug is resolved in cafe5.1.

Correct me if I'm wrong. So far, we can't compare the global and local models with same k>1 based on simulation because cafe5 doesn't have lhtest function. As cafe4 doesn't support the gamma models with k>1, so we can't use lhtest in cafe4 to test global and local models with k>1.

Hahn, Matthew

unread,

Jun 9, 2023, 2:23:19 PM6/9/23

to Wenqiang Shi, hahnlab-cafe

Hi Wenqiang,

You can compare with k>1 with simulation, even though lhtest is deprecated: you can just use the “simulation” function to do what lhtest used to do (it just ran a bunch of simulations and then fit models for you). It shouldn’t be too hard, which is why we got rid of lhtest!

cheers,

Matt

To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/fe610127-ea86-4e7e-bc3f-d82687f6d6c9n%40googlegroups.com.

Wenqiang Shi

unread,

Jun 18, 2023, 3:56:52 AM6/18/23

to hahnlab-cafe

Thank you for your detailed guidance. It helps a lot!

Cheer

Wenqiang

Wenqiang Shi

unread,

Jun 19, 2023, 10:34:36 PM6/19/23

to hahnlab-cafe

Hi Matt,

Sorry to bother you again. Currently, I want to compare a model with 2 lambda(M2) and a model with three lambda(M3). I found the choice of K could impact the final selected model.

The attachment is the summary of the results under different K and lambda. When K=1 or 2, the likelihood value of M3 model is better, and the estimation of the third lambda (for a sub branch of in M2 model) is higher. This is what we expected. However, when K=3, M2 model is better based on lnL. When K=4, M3 model is better again but the estimation of the third lambda (for a sub branch of in 2 lambda) is similar to the second lambda.

In previous threads, you mentioned we can start from K=4 but we can't compare models with different K. If I try simulation with K=4,5,6, each K might prefer different model. How can we decide which K I should use?

Another issue is that when doing the simulation under K=4 for M2 model, Cafe5 threw an error. The following is the command and the error and I'm using cafe5.1.

=====================================================================

Command line: /home/wqshi/packages/CAFE5/bin/cafe5 -i filtered.cafe.input.tsv -t r8s_ultrametric_mychange.txt -y separate_lambda.txt -eresults/Base_error_model.txt -s200 -l 0.002 -a 0.005 -k4 -p -o simulate_new

Filtering families not present at the root from: 18468 to 7802

Estimating Poisson root distribution from gene families

Optimizer strategy: Nelder-Mead with similarity cutoff
Iterations: 300
Expansion: 2
Reflection: 1

Starting Search for Initial Parameter Values

Completed 28 iterations
Time: 0H 0M 0S
Best match is: 0.39461852903004
Final -lnL: 103718.29463723

Empirical Prior Estimation Result : (28 iterations)
Poisson lambda: 0.39461852903004 & Score: 103718.29463723
Simulating with 1 model(s)
Simulating 200 families for model Gamma
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
WARNING: Saturated branch using lambda 0.008 on branch length 146.267
WARNING: Saturated branch using lambda 0.008 on branch length 159.512
WARNING: Saturated branch using lambda 0.008 on branch length 176.597
WARNING: Saturated branch using lambda 0.008 on branch length 189.732
Trying to simulate leaf family size that was not included in error model

==================================================================================

Best regards,

Wenqiang

cafe-summary.xlsx

Base_error_model.txt

r8s_ultrametric_mychange.txt

separate_lambda.txt

filtered.cafe.input.tsv

Hahn, Matthew

unread,

Jun 20, 2023, 3:17:21 AM6/20/23

to Wenqiang Shi, hahnlab-cafe

Hi Wenqiang,

Unfortunately, as far as I know, no one has yet solved the problem of choosing the optimal K. This is also true for gamma-distributed rate variation among nucleotide or amino acid sites. So I don’t have a clear answer for you.

As for the error, it seems that you are simulating with too large a lambda value. CAFE will not allow lambda*t (the length of a branch) to be greater than 1.

cheers,

Matt

To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/95de80a4-3396-49a1-8e4d-84a794c669d1n%40googlegroups.com.
<cafe-summary.xlsx><Base_error_model.txt><r8s_ultrametric_mychange.txt><separate_lambda.txt><filtered.cafe.input.tsv>

Wenqiang Shi

unread,

Jun 20, 2023, 11:04:37 AM6/20/23

to hahnlab-cafe

Hi Matt,

Thanks for the quick response. Hope this is the last question.

As it's hard to determine the best K, and I will focus on K=1.

Just want to confirm that whether I can use simulation to compare a model of two lambdas with a model with three lambdas? In the group discussion, one can only compare a global model with local models.

When I do the simulation of two lambda model, there's a wired warning "Trying to simulate leaf family size that was not included in error model". Is this a warning I can ignore?

===============================================================================================

Command line: /home/wqshi/packages/CAFE5/bin/cafe5 -i filtered.cafe.input.tsv -t r8s_ultrametric_mychange.txt -y separate_lambda.txt -s7000 -l 0.001 -p -o simulate_new -eBase_error_model.txt

Filtering families not present at the root from: 18468 to 7802

Estimating Poisson root distribution from gene families

Optimizer strategy: Nelder-Mead with similarity cutoff
Iterations: 300
Expansion: 2
Reflection: 1

Starting Search for Initial Parameter Values

Completed 25 iterations
Time: 0H 0M 0S
Best match is: 0.39461758289981
Final -lnL: 103718.29463793

Empirical Prior Estimation Result : (25 iterations)
Poisson lambda: 0.39461758289981 & Score: 103718.29463793
Simulating with 1 model(s)
Simulating 7000 families for model Base

Trying to simulate leaf family size that was not included in error model

===========================================================

Best regards,

Wenqiang

Ben Fulton

unread,

Jun 30, 2023, 3:30:00 PM6/30/23

to hahnlab-cafe

Hi,

I think the maxcnt value in your error model is a little too low. Since you have some families in your data with 89 copies of a gene, maybe bumping the max count up to 200 or so might do it. (You can make it as large as you like, but CAFE will be more slow the larger you make it.)

--

Ben Fulton

Indiana University

Wenqiang Shi

unread,

Jul 10, 2023, 3:23:25 AM7/10/23

to Ben Fulton, hahnlab-cafe

Thanks for the response. I will have a try.

Best regards,

Wenqiang

To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/fa8878ff-dd51-4604-865f-fe7fddbe0bb7n%40googlegroups.com.

Reply all

Reply to author

Forward