caferror fails to converge or initialise

177 views
Skip to first unread message

Paschalis Natsidis

unread,
Jun 8, 2020, 5:10:06 AM6/8/20
to hahnlab-cafe
Hello,

I am using CAFE to try to infer gene family evolution among metazoa. 

However, when I use the caferror.py module, I get a warning that it fails to converge or initialise.
Please check the attached file.

Also, I am having a hard time to understand what 'inf' score means, and how can the different lambdas be interpreted.

Many thanks

Paschalis
Screenshot 2020-05-29 at 19.09.21.png

Dan Vanderpool

unread,
Jun 8, 2020, 12:09:38 PM6/8/20
to Paschalis Natsidis, hahnlab-cafe
Hello Paschalis,

you need to remove the most variable families from your data set as for these there is an infinitely small probability of the observed change under the lambda you are using.  

Typically a reasonable filter is to open the file in excel and add a column off to the right that is the difference between the min and max for each family.  Sort the entire sheet such that the largest differences are at the top.  Remove the top ~10 most variable families and save this as a new file (remember to check that the line breaks are unix and that you have deleted the new column you added). Try running CAFE with the reduced dataset (just see if it will initialize before running a full error model search). If it works great, if not try removing more families.

dv

--
You received this message because you are subscribed to the Google Groups "hahnlab-cafe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hahnlabcafe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/54cb6eb1-6b2b-4da6-bb12-12cc072aec90o%40googlegroups.com.
<Screenshot 2020-05-29 at 19.09.21.png>

__________________________
Dan Vanderpool
Postdoctoral Scholar
Department of Biology
Indiana University
Hahn Lab, Jordan Hall 249B
1001 East Third Street
Bloomington, IN 47405
Email: ddvand...@gmail.com

Philipp Schiffer

unread,
Jul 9, 2020, 4:57:42 AM7/9/20
to hahnlab-cafe
Hi Dan,

I am working with Paschalis on this and have implemented the suggestions you made and ran the error script again, but to no avail. So basically what I did is removing all our gene families (orthogroups), which had a difference larger than 50 between min and max. At the same time I also kicked out all the families with less than 69 entries (we have 68 species; obviously this kicks also groups, which have a 0:0:3:6:0:0.....20 pattern, but well...).
So my last run I killed at this point:

CAFE run number 71805
Pre run (no error models)
Rewriting CAFE shell script...
Running CAFE [silently] with error models listed above...
CAFE run complete! Retrieving Score..........
Score with above error models:     inf
Lambda with above error models:    0.00501509605405
++WARNING: CAFE failed to converge or initialize. This run will be re-done. Check badRuns.txt for more info.

I had used this command:
python2 caferror.py -i script.cafe -d output3 -v 0 -f 1

This is in version CAFE v4.2.1 run on Ubuntu 20.04.

I am attaching my script and the output folder archived. Your help with this would be greatly appreciated.

Best regards

Philipp 

On Monday, 8 June 2020 18:09:38 UTC+2, Dan Vanderpool wrote:
Hello Paschalis,

you need to remove the most variable families from your data set as for these there is an infinitely small probability of the observed change under the lambda you are using.  

Typically a reasonable filter is to open the file in excel and add a column off to the right that is the difference between the min and max for each family.  Sort the entire sheet such that the largest differences are at the top.  Remove the top ~10 most variable families and save this as a new file (remember to check that the line breaks are unix and that you have deleted the new column you added). Try running CAFE with the reduced dataset (just see if it will initialize before running a full error model search). If it works great, if not try removing more families.

dv

On Jun 8, 2020, at 5:10 AM, Paschalis Natsidis <natsidis...@gmail.com> wrote:

Hello,

I am using CAFE to try to infer gene family evolution among metazoa. 

However, when I use the caferror.py module, I get a warning that it fails to converge or initialise.
Please check the attached file.

Also, I am having a hard time to understand what 'inf' score means, and how can the different lambdas be interpreted.

Many thanks

Paschalis

--
You received this message because you are subscribed to the Google Groups "hahnlab-cafe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hahnl...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/54cb6eb1-6b2b-4da6-bb12-12cc072aec90o%40googlegroups.com.
<Screenshot 2020-05-29 at 19.09.21.png>

__________________________
Dan Vanderpool
Postdoctoral Scholar
Department of Biology
Indiana University
Hahn Lab, Jordan Hall 249B
1001 East Third Street
Bloomington, IN 47405
output3.tgz
script.cafe

Gregg Thomas

unread,
Jul 9, 2020, 12:47:45 PM7/9/20
to Philipp Schiffer, hahnlab-cafe
Hi all,
You could also try adding the -filter option to the load command. This makes sure your families are all present at the most recent common ancestor.

Also, could you try running your input script directly with CAFE?

source script.cafe or cafe script.cafe

The resulting CAFE log file might report more information. Let us know what that says and we can try and figure this out!
-Gregg

To unsubscribe from this group and stop receiving emails from it, send an email to hahnlabcafe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/bafa299c-bf94-4b78-9e33-75e664e13574o%40googlegroups.com.

Dan Vanderpool

unread,
Jul 9, 2020, 2:10:29 PM7/9/20
to Gregg Thomas, Philipp Schiffer, hahnlab-cafe
Hello Philip,

Also, try putting your tree command first

#!/home/philipp/CAFE/release/cafe
#version
#date

tree (((((((((((((((Nematoplanacoelogynoporoides:1.0,Microstomumlineare:1.0):1.0,Stenostomumsthenum:2.0):1.0,Symbionpandora:3.0):3.0,((Malacobdellagrossa:1.0,Cerebratulussp:1.0):1.0,Cephalothrixlinearis:2.0):4.0):3.0,((Helobdellasp:1.0,Capitellateleta:1.0):1.0,Alvinellapompejana:2.0):7.0):5.0,(((Lymnaeastagnalis:1.0,Crassostreagigas:1.0):1.0,Leptochitonrugatus:2.0):2.0,(Phoronisvancouverensis:1.0,Lingulaanatina:1.0):3.0):10.0):4.0,((Spadellasp:1.0,Eukrohniasp:1.0):2.0,(Brachionusplicatilis:1.0,Adinetaricciae:1.0):2.0):15.0):10.0,((((((Romanomermisculicivorax:1.0,Plectussambesii:1.0):1.0,Enoplusbrevis:2.0):2.0,(Ramazzottiusvarieornatus:1.0,Hypsibiusdujardini:1.0):3.0):1.0,Paragordiusvarius:5.0):3.0,((Triboliumcastaneum:1.0,Daphniapulex:1.0):1.0,Strigamiamaritima:2.0):6.0):1.0,Priapuluscaudatus:9.0):19.0):24.0,((((((((Symsagittiferaroscoffensis:1.0,Praesagittiferanaikaiensis:1.0):2.0,(Pseudaphanostomavariabilis:1.0,Isodiametrapulchra:1.0):2.0):1.0,Hofsteniasp:4.0):1.0,Paratomellarubra:5.0):2.0,(Nemertodermawestbladi:1.0,Mearastichopi:1.0):6.0):2.0,(Xenoturbellaprofunda:1.0,Xenoturbellabocki:1.0):8.0):8.0,((((Strongylocentrotuspurpuratus:1.0,Parastichopussp:1.0):2.0,(Patiriaminiata:1.0,Amphiurafiliformis:1.0):2.0):1.0,Florometraserratisima:4.0):3.0,((Saccoglossuskowalevskii:1.0,Ptychoderaflava:1.0):1.0,Rhabdopleurasp:2.0):5.0):10.0):6.0,((((Homosapiens:1.0,Callorhinchusmilii:1.0):1.0,Petromyzonmarinus:2.0):2.0,(Salpathompsoni:1.0,Cionaintestinalis:1.0):3.0):1.0,Branchiostomafloridae:5.0):18.0):29.0):4.0,(((Liriopetetraphylla:1.0,Hydravulgaris:1.0):1.0,Alatinaalata:2.0):1.0,Nematostellavectensis:3.0):53.0):1.0,Trichoplaxadhaerens:57.0):3.0,((Lampeapancerina:1.0,Coeloplanameteoris:1.0):1.0,Mnemiopsisleidyi:2.0):58.0):3.0,((Syconcoactum:1.0,Plakinajani:1.0):1.0,Amphimedonqueenslandica:2.0):61.0):1.0,Capsasporaowczarzaki:64.0):3.0,((Salpingoecarosetta:1.0,Monosigabrevicollis:1.0):1.0,Acanthoecasp:2.0):65.0);

load -i 68matrix_68_50_reduced.tsv -t 5 -l cafelog.txt -p 0.05
lambda -s 
report cafereport


You also don’t need a lambda tree if you are only using one rate.




Dan Vanderpool
Postdoctoral Scholar
Department of Biology
Indiana University
Hahn Lab, Jordan Hall 249B
1001 East Third Street
Bloomington, IN 47405

Philipp Schiffer

unread,
Jul 9, 2020, 3:25:54 PM7/9/20
to hahnlab-cafe
Hi both!

Thanks for the feedback. I am trying both now and hope to be able to report tomorrow.
Meanwhile as a short note. The use of 10 threads is actually hard coded into the caferror.py script:
loadLine = "load -i " + families + " -t 10 -l " + logFile

Cheers

Phil

On Thursday, 9 July 2020 20:10:29 UTC+2, Dan Vanderpool wrote:
Hello Philip,

Also, try putting your tree command first

#!/home/philipp/CAFE/release/cafe
#version
#date

tree (((((((((((((((Nematoplanacoelogynoporoides:1.0,Microstomumlineare:1.0):1.0,Stenostomumsthenum:2.0):1.0,Symbionpandora:3.0):3.0,((Malacobdellagrossa:1.0,Cerebratulussp:1.0):1.0,Cephalothrixlinearis:2.0):4.0):3.0,((Helobdellasp:1.0,Capitellateleta:1.0):1.0,Alvinellapompejana:2.0):7.0):5.0,(((Lymnaeastagnalis:1.0,Crassostreagigas:1.0):1.0,Leptochitonrugatus:2.0):2.0,(Phoronisvancouverensis:1.0,Lingulaanatina:1.0):3.0):10.0):4.0,((Spadellasp:1.0,Eukrohniasp:1.0):2.0,(Brachionusplicatilis:1.0,Adinetaricciae:1.0):2.0):15.0):10.0,((((((Romanomermisculicivorax:1.0,Plectussambesii:1.0):1.0,Enoplusbrevis:2.0):2.0,(Ramazzottiusvarieornatus:1.0,Hypsibiusdujardini:1.0):3.0):1.0,Paragordiusvarius:5.0):3.0,((Triboliumcastaneum:1.0,Daphniapulex:1.0):1.0,Strigamiamaritima:2.0):6.0):1.0,Priapuluscaudatus:9.0):19.0):24.0,((((((((Symsagittiferaroscoffensis:1.0,Praesagittiferanaikaiensis:1.0):2.0,(Pseudaphanostomavariabilis:1.0,Isodiametrapulchra:1.0):2.0):1.0,Hofsteniasp:4.0):1.0,Paratomellarubra:5.0):2.0,(Nemertodermawestbladi:1.0,Mearastichopi:1.0):6.0):2.0,(Xenoturbellaprofunda:1.0,Xenoturbellabocki:1.0):8.0):8.0,((((Strongylocentrotuspurpuratus:1.0,Parastichopussp:1.0):2.0,(Patiriaminiata:1.0,Amphiurafiliformis:1.0):2.0):1.0,Florometraserratisima:4.0):3.0,((Saccoglossuskowalevskii:1.0,Ptychoderaflava:1.0):1.0,Rhabdopleurasp:2.0):5.0):10.0):6.0,((((Homosapiens:1.0,Callorhinchusmilii:1.0):1.0,Petromyzonmarinus:2.0):2.0,(Salpathompsoni:1.0,Cionaintestinalis:1.0):3.0):1.0,Branchiostomafloridae:5.0):18.0):29.0):4.0,(((Liriopetetraphylla:1.0,Hydravulgaris:1.0):1.0,Alatinaalata:2.0):1.0,Nematostellavectensis:3.0):53.0):1.0,Trichoplaxadhaerens:57.0):3.0,((Lampeapancerina:1.0,Coeloplanameteoris:1.0):1.0,Mnemiopsisleidyi:2.0):58.0):3.0,((Syconcoactum:1.0,Plakinajani:1.0):1.0,Amphimedonqueenslandica:2.0):61.0):1.0,Capsasporaowczarzaki:64.0):3.0,((Salpingoecarosetta:1.0,Monosigabrevicollis:1.0):1.0,Acanthoecasp:2.0):65.0);

load -i 68matrix_68_50_reduced.tsv -t 5 -l cafelog.txt -p 0.05
lambda -s 
report cafereport


You also don’t need a lambda tree if you are only using one rate.


On Jul 9, 2020, at 12:47 PM, Gregg Thomas <greg...@gmail.com> wrote:

Hi all,
You could also try adding the -filter option to the load command. This makes sure your families are all present at the most recent common ancestor.

Also, could you try running your input script directly with CAFE?

source script.cafe or cafe script.cafe

The resulting CAFE log file might report more information. Let us know what that says and we can try and figure this out!
-Gregg


--
You received this message because you are subscribed to the Google Groups "hahnlab-cafe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hahnl...@googlegroups.com.

__________________________
Dan Vanderpool
Postdoctoral Scholar
Department of Biology
Indiana University
Hahn Lab, Jordan Hall 249B
1001 East Third Street
Bloomington, IN 47405

Gregg Thomas

unread,
Jul 9, 2020, 5:55:39 PM7/9/20
to Philipp Schiffer, hahnlab-cafe
Heh, wonder why I did that. Thanks for pointing it out!
-Gregg

To unsubscribe from this group and stop receiving emails from it, send an email to hahnlabcafe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/a0339651-6e3c-4d37-8d9d-1d2c05da1ea2o%40googlegroups.com.

Philipp Schiffer

unread,
Jul 10, 2020, 6:26:44 AM7/10/20
to hahnlab-cafe
Hi both!

I have now run the caferror script with the modified input file suggest by Dan. This did not converge and I killed it after about 22600 iterations. The results are attached in the archive called output_treefirst.
Following Gregg's suggestion I also ran the script directly with cafe (run_with_cafe.tgz). This finished in minutes. Actually, I just tried it on the dataset, which includes the most variable families, as well and that finished in about half an hour. However, Paschalis and I are puzzled still to what extend this gives us what we want, the error models to then run the cafe analysis on our data set. We are also still struggling with Paschalis initial question of "having a hard time to understand what 'inf' score means, and how can the different lambdas be interpreted." Would be great to get your feedback on this.

Just saw that the first output was too big for this group. I will try in another post. Otherwise, would be great if I could just email it?

Cheers

Philipp


On Thursday, 9 July 2020 23:55:39 UTC+2, greggwct wrote:
Heh, wonder why I did that. Thanks for pointing it out!
-Gregg

run_with_cafe.tgz

Philipp Schiffer

unread,
Jul 10, 2020, 6:29:47 AM7/10/20
to hahnlab-cafe

Gregg Thomas

unread,
Jul 11, 2020, 12:31:22 PM7/11/20
to Philipp Schiffer, hahnlab-cafe
Hey again,
Sorry for the confusion. We're trying to figure out why the CAFE runs with caferror are coming back with scores of 'inf'. A score of inf means that CAFE was unable to accurately converge on a rate estimate. This usually happens because of an issue with the input data having gene families that contain large amounts of variance, which is why Dan first suggested to remove families with a large range. I suggested running your script through CAFE alone to see if there was any more information in the CAFE log file that could help us determine why CAFE isn't converging. Looking at this output I can see that CAFE is still returning a score of inf, so now we at least know the problem isn't with caferror.

If it's possible, do you think you could send your input data matrix so we can try and replicate this issue?
Thanks!
-Gregg

To unsubscribe from this group and stop receiving emails from it, send an email to hahnlabcafe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/c3a69424-76aa-45a7-ac5f-8bc8fefd6a4ao%40googlegroups.com.

Philipp Schiffer

unread,
Jul 12, 2020, 6:07:48 AM7/12/20
to hahnlab-cafe
Hi Gregg,

here are our matrices. Or rather, here is our original matrix with 68 diverse species derived from an OrthoFinder run with >150 species. We have reduced it to 68, as we do have an inferred phylogeny for tese 68 (not just a NCBI cladogram). From this 68 species I then made a matrix where I removed all groups with less than 69 orthologues, and another one where I also removed groups with a difference of >50 in the min/max (as already explained above). 

Cheers

Philipp


On Saturday, 11 July 2020 18:31:22 UTC+2, greggwct wrote:
Hey again,
Sorry for the confusion. We're trying to figure out why the CAFE runs with caferror are coming back with scores of 'inf'. A score of inf means that CAFE was unable to accurately converge on a rate estimate. This usually happens because of an issue with the input data having gene families that contain large amounts of variance, which is why Dan first suggested to remove families with a large range. I suggested running your script through CAFE alone to see if there was any more information in the CAFE log file that could help us determine why CAFE isn't converging. Looking at this output I can see that CAFE is still returning a score of inf, so now we at least know the problem isn't with caferror.

If it's possible, do you think you could send your input data matrix so we can try and replicate this issue?
Thanks!
-Gregg

matrices.tgz

Dan Vanderpool

unread,
Jul 13, 2020, 11:47:12 AM7/13/20
to Philipp Schiffer, hahnlab-cafe
Hello Philipp,

Looking at your tree, I think you will have a very hard time estimating a meaningful lambda given all of the tips within a clade are closely related with clades separated by long branches between clades.  You may want to try analyzing some of the clades separately, or using multiple lambdas.  Also, CAFE5 incorporates among family rate variation which may help as well but convergence is harder to achieve while using both multiple lambdas and the discrete gamma correction.  

Dan

To unsubscribe from this group and stop receiving emails from it, send an email to hahnlabcafe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hahnlabcafe/2aa07bb5-0dbe-44f3-b4b4-3e3519ea4a88o%40googlegroups.com.
<matrices.tgz>

__________________________
Dan Vanderpool
Postdoctoral Scholar
Department of Biology
Indiana University
Hahn Lab, Jordan Hall 249B
1001 East Third Street
Bloomington, IN 47405

Gregg Thomas

unread,
Jul 13, 2020, 12:23:20 PM7/13/20
to Dan Vanderpool, Philipp Schiffer, hahnlab-cafe
I think Dan is correct that the disparity between branch lengths in your tree is causing some problems (though it's not something I've run into before). I was unable to filter your data in such a way to get it to converge with the full tree, however limiting to a subset of species (Romanomermisculicivorax, Plectussambesii, Enoplusbrevis, Ramazzottiusvarieornatus, Hypsibiusdujardini, Paragordiusvarius, Triboliumcastaneum, Daphniapulex, Strigamiamaritima, Priapuluscaudatus) CAFE was able to finish (though still with some families failing to converge).

I'm wondering how the branch lengths on your tree were estimated? CAFE works best with branches in terms of millions of years. However, for some of these species (e.g. Daphnia and Human) I know the divergence time on the order of hundreds of millions of years. This sampling may be at too deep a timescale for CAFE to accurately estimate gain/loss rates.

-Gregg

Dan Vanderpool

unread,
Jul 13, 2020, 12:39:59 PM7/13/20
to Gregg Thomas, Philipp Schiffer, hahnlab-cafe
Yeah, I guess maybe if these units are in hundreds of millions of years.  I didn’t see Human in there until Gregg pointed it out.  In this case any estimation of gene family turnover will likely be nonsense.  Too much time has passed to reliably count changes and discern family membership/orthologs (whole genome duplications likely exist in many/all(?) of these groups over this time span as well).  I would definitely do as Gregg did and break the tree up into smaller, more closely related clades.  

Dan

Philipp Schiffer

unread,
Jul 13, 2020, 3:40:04 PM7/13/20
to hahnlab-cafe
Hi both!

Thanks for looking into this. Paschalis and I had actually thought the reason might be the deep nodes.
To shortly comment on the tree. This is a phylobayes phylogeny with a carefully chosen and filtered phylogenomic gene set. 

Now follows the problem. Our focal species in all of this is Xenoturbella bocki. Looking at the tree you'll see it sits along with the acoels, hemichordates and echinoderms, and all of them separated by long branches in that particular clade. More tricky even, we actually would be most interested to see what is different in Xenoturbella, but also this whole clade in comparison to other deuterostomes and protostomes, and the non-bilaterians. Surely, we can select early branching chordates, like Amphioxus and Ciona and kick out humans and friends, but in particular on the protostome side "good genomes" are from deeply nested species (C. elegans, Drosophila, Tribolium). Non-bilaterians will be far as well. Seems we are in bit of a pickle.
Is there something you would suggest me trying, potentially really pruning the tree where possible and using CAFE5?

Cheers

Philipp

On Monday, 13 July 2020 18:39:59 UTC+2, Dan Vanderpool wrote:
Yeah, I guess maybe if these units are in hundreds of millions of years.  I didn’t see Human in there until Gregg pointed it out.  In this case any estimation of gene family turnover will likely be nonsense.  Too much time has passed to reliably count changes and discern family membership/orthologs (whole genome duplications likely exist in many/all(?) of these groups over this time span as well).  I would definitely do as Gregg did and break the tree up into smaller, more closely related clades.  

Dan



On Jul 13, 2020, at 12:22 PM, Gregg Thomas <greg...@gmail.com> wrote:

I think Dan is correct that the disparity between branch lengths in your tree is causing some problems (though it's not something I've run into before). I was unable to filter your data in such a way to get it to converge with the full tree, however limiting to a subset of species (Romanomermisculicivorax, Plectussambesii, Enoplusbrevis, Ramazzottiusvarieornatus, Hypsibiusdujardini, Paragordiusvarius, Triboliumcastaneum, Daphniapulex, Strigamiamaritima, Priapuluscaudatus) CAFE was able to finish (though still with some families failing to converge).

I'm wondering how the branch lengths on your tree were estimated? CAFE works best with branches in terms of millions of years. However, for some of these species (e.g. Daphnia and Human) I know the divergence time on the order of hundreds of millions of years. This sampling may be at too deep a timescale for CAFE to accurately estimate gain/loss rates.

-Gregg

On Mon, Jul 13, 2020 at 9:47 AM Dan Vanderpool <ddvand...@gmail.com> wrote:
Hello Philipp,

Looking at your tree, I think you will have a very hard time estimating a meaningful lambda given all of the tips within a clade are closely related with clades separated by long branches between clades.  You may want to try analyzing some of the clades separately, or using multiple lambdas.  Also, CAFE5 incorporates among family rate variation which may help as well but convergence is harder to achieve while using both multiple lambdas and the discrete gamma correction.  

Dan


__________________________
Dan Vanderpool
Postdoctoral Scholar
Department of Biology
Indiana University
Hahn Lab, Jordan Hall 249B
1001 East Third Street
Bloomington, IN 47405

--
You received this message because you are subscribed to the Google Groups "hahnlab-cafe" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hahnl...@googlegroups.com.

__________________________
Dan Vanderpool
Postdoctoral Scholar
Department of Biology
Indiana University
Hahn Lab, Jordan Hall 249B
1001 East Third Street
Bloomington, IN 47405

kavita kumari

unread,
Sep 4, 2025, 10:41:42 AM (9 days ago) Sep 4
to hahnlab-cafe
hey could you solve the problem? If yes can u share what possible steps you have taken.
thankyou.

Reply all
Reply to author
Forward
0 new messages