Some questions about divergence time estimation with a large SNP dataset

董鹏程

Mar 23, 2022, 1:17:21 PM

to beast-users

Hello everyone,

I'm a graduate student, and I want to estimate divergence times from a whole-genome SNP dataset (about 220,000 SNPs per individual, 100 individuals in total). I ran into some questions while setting up this analysis:

1) The analysis seems to be extremely computationally expensive. Even with a small dataset of 10 individuals taken from the full one, SNAPP on CIPRES got through only 59,000 MCMC iterations in 240 CPU hours (I checked the log file). How can I reduce the computing cost to an acceptable level while retaining as much phylogenetic information as possible? Should I reduce the number of individuals or the number of SNPs? I tried randomly resampling a smaller dataset (22,000 SNPs) with a Python script (first generate 22,000 random, non-repeating site indices, then use the sorted index list to subsample all sequences), but when I ran a phylogenetic analysis on it with IQ-TREE, the topology was completely different from the tree based on the original dataset. I suspect my resampling is not valid, or that I did it the wrong way.

2) I have read various papers and tutorials. Some suggest that SNAPP is currently the only solution for divergence time estimation based on SNPs, yet many studies still treat SNPs as an ordinary gene sequence alignment and run the analysis in BEAST in the usual way. Is that methodologically correct? If I did the same, would it reduce my computing cost?

3) I read the SNAPP tutorial on GitHub (tutorials/README.md at main · ForBioPhylogenomics/tutorials · GitHub) and noticed the word "Saga" in it. It seems to be a phylogenetic analysis platform like CIPRES, but I can't find it with Google. Does anyone know the website, or any similar platforms? I'm in a small research group, so I have to find computing resources myself.

Any help would be greatly appreciated:))
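
The random site subsampling described in question 1 can be sketched as follows. This is only a minimal illustration, not the poster's actual script: the function name `subsample_sites`, the dict-of-strings alignment layout, and the `seed` parameter are all assumptions made for the example. The key point it shows is that one sorted index list is drawn once and applied to every sequence, so the columns stay aligned across individuals.

```python
import random


def subsample_sites(alignment, n_sites, seed=0):
    """Keep the same random subset of columns in every sequence.

    alignment: dict mapping sample name -> sequence string (hypothetical layout).
    n_sites:   number of columns to keep, drawn without replacement.
    seed:      fixed seed so the subsample can be reproduced.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    total = lengths.pop()

    rng = random.Random(seed)
    # random.sample draws distinct indices; sorting preserves site order.
    keep = sorted(rng.sample(range(total), n_sites))

    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"ind1": "ACGTACGT", "ind2": "ACGTTCGA", "ind3": "ACCTACGA"}
    print(subsample_sites(toy, 4, seed=1))
```

A random subset of sites can lose columns that were variable among the retained individuals, so it may be worth checking how many variable sites remain before comparing trees.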

michaelm

Mar 24, 2022, 12:02:53 PM

to beast-users

Hi Arc,

2) If you use BEAST and not SNAPP for a SNP dataset, this may lead to overestimation of terminal branch lengths and therefore the age estimation will be unreliable.

3) The GitHub repository is for a Phylogenomics course taught by the Scandinavian ForBio school. Saga is a computer cluster that was available to students of the course. You will need to find another cluster to run your analysis on. Alternatively, you could reduce the dataset until it is small enough to run on your desktop computer.

Best wishes,

Michael

Arc

Mar 26, 2022, 1:09:03 PM

to beast-users

Thank you for your suggestion, Michael :D. I decided to run the analysis on my laptop with a smaller dataset: I reduced it to 7 sequences with 120,000 SNPs per individual (SNPs that became invariant sites were removed), and the analysis needed only 27 hours to finish (for 10,000,000 MCMC iterations).
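
Removing the sites that become invariant after dropping individuals, as mentioned above, could look roughly like this. It is a sketch under assumptions, not the script actually used in this thread: the function name `drop_invariant_sites`, the dict-of-strings alignment layout, and the set of characters treated as missing data (`?`, `-`, `N`) are all hypothetical choices for the example.

```python
def drop_invariant_sites(alignment, missing="?-N"):
    """Remove columns in which all non-missing characters are identical.

    alignment: dict mapping sample name -> sequence string (hypothetical layout).
    missing:   characters treated as missing data (an assumption; adjust as needed).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if len({len(seq) for seq in seqs}) != 1:
        raise ValueError("all sequences must have the same length")

    keep = []
    for i, column in enumerate(zip(*seqs)):
        observed = {ch for ch in column if ch not in missing}
        # A column is variable only if at least two distinct states are seen.
        if len(observed) > 1:
            keep.append(i)

    return {name: "".join(alignment[name][i] for i in keep) for name in names}


if __name__ == "__main__":
    toy = {"ind1": "0120", "ind2": "0110", "ind3": "012?"}
    print(drop_invariant_sites(toy))
```

Columns that contain only one observed state plus missing data are also dropped here, which is a design choice of this sketch.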

Remco Bouckaert

Apr 5, 2022, 4:05:48 PM

to beast...@googlegroups.com

You may want to check out the SNAPPER package for BEAST 2 (https://github.com/rbouckaert/snapper). It is an extremely good approximation to SNAPP, and its computational time is not sensitive to the number of individuals per species, so that may solve your computational problems.

Arc

Apr 11, 2022, 2:43:00 PM

to beast-users

Sorry for the late reply, and thank you for your advice. I ran into a problem when trying to use SNAPPER for my analysis: when I use the converted SNP dataset (data type integerdata with symbols 012), BEAST cannot start the analysis, and the error message is "Index 2 out of bounds for length 2". I googled it and found another instance in "[Solved] Error:index 2 out of bounds for length 2 at test.main - CodeProject", where the answer is "it seems to be only 2 values in an array object (corresponding to index 0 & 1), thus while accessing index 2, it throws an error.". But the tutorial "Species delimitation with SNAPPER (2021 version)" on the BEAST2 website does use a converted SNP integer dataset with symbols 012. I'm confused and I don't know what went wrong.