Dear Matt,
Thank you for your interest in SNaQ!
We are aware that the function "readTrees2CF" is not very efficient, and that its running time increases dramatically with the number of taxa. The reason is that the function computes the CF for every subset of 4 taxa, and the number of such subsets grows as "n choose 4" (roughly n^4/24 for n taxa).
Depending on the number of taxa that you have, this function could certainly take a while.
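To give a rough sense of that growth, here is a small sketch in plain Julia (no packages needed). The helper name "num_quartets" is made up for this illustration and is not part of PhyloNetworks; it only counts the 4-taxon subsets, it does not time "readTrees2CF".

```julia
# Number of distinct 4-taxon subsets among n taxa: "n choose 4".
# `num_quartets` is a hypothetical helper name, not a PhyloNetworks function.
num_quartets(n) = binomial(n, 4)

for n in (10, 25, 50, 100)
    println(n, " taxa -> ", num_quartets(n), " four-taxon subsets")
end
# 10 taxa -> 210 four-taxon subsets
# 25 taxa -> 12650 four-taxon subsets
# 50 taxa -> 230300 four-taxon subsets
# 100 taxa -> 3921225 four-taxon subsets
```

Going from 50 to 100 taxa multiplies the number of subsets by about 17, which is why the full computation gets slow so quickly.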
We have plans to make this function more efficient, but unfortunately, we have not had the time to do so.
There are two options that you can try at the moment (besides the unappealing option of using "readTrees2CF" on your full dataset and waiting forever):
1) You can use the options "whichQ" and "numQ" to use a random sample of "numQ" 4-taxon subsets (instead of using all the subsets).
julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",writeTab=false,writeSummary=false, whichQ="rand", numQ=xxx)
The right value of "numQ" will depend on the number of taxa that you have, and I really do not know the best value to choose. The downside of this alternative is that we want every branch in the tree to be represented by at least one 4-taxon subset. When the subsets are chosen at random, a small value of "numQ" can leave some branches not represented by any 4-taxon subset, and this will cause problems for SNaQ. So you don't want to use a value of "numQ" that is too small. The resulting network will depend on the random sample of 4-taxon subsets, but you can repeat the analysis with different samples to see whether the resulting network is affected.
2) You can choose the 4-taxon subsets yourself with the option "quartetfile". In the documentation, you can read:
- quartetfile: name of text file with list of 4-taxon subsets to be analyzed. If none is specified, the function will list all possible 4-taxon subsets.
So, you need to write a text file with a list of 4-taxon subsets (one per line, taxon names separated by commas: A,B,C,D). The function "readTrees2CF" will then compute the CF only for the subsets in that list, instead of for all the 4-taxon subsets, which can be a lot. How to choose this list of 4-taxon subsets has to be decided case by case, but you would want to have at least one 4-taxon subset for each branch in the tree (I can try to explain in more detail if this is not very clear).
julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",writeTab=false,writeSummary=false, quartetfile="quartets.txt")
Your list of 4-taxon subsets would be in a file called "quartets.txt".
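If it helps, here is a minimal sketch of how such a file could be written from Julia. The taxon names and the hand-picked subsets are placeholders, and "write_quartets" is a made-up helper, not a PhyloNetworks function. It only follows the format described above (one subset per line, names separated by commas); I am assuming no header line is needed, so please check that against the documentation.

```julia
# Hypothetical, hand-picked 4-taxon subsets; replace with names from your own trees.
quartets = [
    ["A", "B", "C", "D"],
    ["A", "B", "C", "E"],
    ["A", "B", "D", "F"],
]

# `write_quartets` is an illustrative helper, not part of PhyloNetworks.
# It writes one subset per line, with taxon names separated by commas.
function write_quartets(filename, quartets)
    # Validate first, so that no partial file is written if a subset is malformed.
    for q in quartets
        length(q) == 4 && length(unique(q)) == 4 ||
            error("each subset needs exactly 4 distinct taxa, got: $q")
    end
    open(filename, "w") do io
        for q in quartets
            println(io, join(q, ","))
        end
    end
end

write_quartets("quartets.txt", quartets)
```

This only takes care of the file format. Making sure that every branch is covered by at least one subset is still up to you.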
One last comment: given that you have within-species sampling, you might want to check out the documentation for the case of multiple alleles. This, however, will only be relevant after you have the table of CF from "readTrees2CF". You can save this table by removing the "writeTab=false" option:
julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",writeSummary=false, quartetfile="quartets.txt")
The table will be saved with the default name "tableCF.txt", but you can specify its name with the "CFfile" option:
julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",CFfile = "nameTable.txt", writeSummary=false, quartetfile="quartets.txt")
Sorry I don't have a better solution at the moment! We hope to improve the efficiency of "readTrees2CF" and to have tools for multiple alleles when the input data is a list of trees (instead of a table of CF) soon!
Claudia
On Tuesday, December 6, 2016 at 10:34:52 AM UTC-6, mparks wrote: