SNaQ with large dataset

538 views
Skip to first unread message

mparks

unread,
Dec 6, 2016, 11:34:52 AM12/6/16
to PhyloNetworks users
Hello Drs. Solis-Lemus and Ane,
I am working with a large dataset consisting of a little under a hundred taxa and around 3600 loci. We are interested in running our data through the PhyloNetworks/SNaQ pipeline, but I am currently getting stuck on essentially the first command:

raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",writeTab=false,writeSummary=false)


The trees I am running through this command were generated from bootstrapped RAxML analysis. I have tested running times for this command (on my laptop - 3.1GHz processor, 16GB RAM) using 2, 4, 6 and 8 trees from our dataset, and I am seeing an exponential increase in runtime, such that for a two-tree dataset my runtime is 15-30 seconds, while for an eight-tree set my runtime is over an hour - clearly this would not work to go much beyond one or a couple dozen loci. Is there a better way to approach this problem that you know of, or am I pushing the limits of dataset size for SNaQ? I do see that the datasets in your paper were scaled up to 3000 loci, albeit with fewer taxa than we are dealing with.

A couple of notes:
1) these trees have had any nodes with <33% bootstrap support collapsed
2) the dataset is broad phylogenetically, but does have a fair amount of within-genus and even within-species sampling.


Thanks in advance for any advice you can offer.


Best,

Matt

Claudia R. Solís-Lemus

unread,
Dec 6, 2016, 9:02:28 PM12/6/16
to PhyloNetworks users
Dear Matt,

Thank you for your interest in SNaQ!

We are aware that the function "readTrees2CF" is not very efficient, and time increases dramatically with the number of taxa. The reason is that the function will compute the CF for all subsets of 4 taxa, and the number of subsets increases with the number of taxa.
Depending on the number of taxa that you have, this function could certainly take a while. 
We have plans to make this function more efficient, but unfortunately, we have not had the time to do so.

There are two options that you can try at the moment (besides the unappealing option of using "readTrees2CF" on your full dataset and wait forever):

1) You can use the option "whichQ" and "numQ", to use a random sample of size "numQ" of 4-taxon subsets (instead of using all the subsets).

julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",writeTab=false,writeSummary=false, whichQ="rand", numQ=xxx)

The number "numQ" will depend on the number of taxa that you have, and really do not know the best value to choose. The downside of this alternative is that we want all branches in the tree to be represented by at least one 4-taxon subset. When the subsets are chosen at random, depending on the value of "numQ", there might be branches that are not represented by any 4-taxon subset, and this will cause problems for SNaQ. So, you don't want to use a value of "numQ" too small. The resulting network will depend on the random sample of 4-taxon subsets, but you can then run for different samples to see if the resulting network is affected.

2) You can choose yourself the 4-taxon subsets to use with the option "quartetfile". In the documentation, you can read:
  • quartetfile: name of text file with list of 4-taxon subsets to be analyzed. If none is specified, the function will list all possible 4-taxon subsets.
So, you need to write a text file with a list of 4-taxon subsets (one per line, taxa names separated by commas: A,B,C,D). The function "readTrees2CF" will then only compute the CF for the subsets in that list, instead of all the 4-taxon subsets which can be a lot. It is a case-by-case answer on how to choose this list of 4-taxon subsets, but you would want to have at least one 4-taxon subset for each branch in the tree (I can try to explain in more details if this is not very clear).

julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",writeTab=false,writeSummary=false, quartetfile="quartets.txt")
Your list of 4-taxon subsets would be in a file called "quartets.txt".

One last comment, given that you have within-species sampling, you might want to check out the documentation for the case of multiple alleles.
This, however, will be relevant after you have the table of CF from "readTrees2CF". You can save this table by removing the "writeTab=false" option:

julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",writeSummary=false, quartetfile="quartets.txt")

The table will be saved with the default name "tableCF.txt", but you can specify its name with the "CFfile" option:

julia> raxmlCF = readTrees2CF("all_trees.RAxML_collapsed.tre",CFfile = "nameTable.txt", writeSummary=false, quartetfile="quartets.txt")


Sorry I don't have a better solution at the moment! We hope to improve the efficiency of "readTrees2CF" and to have tools for multiple alleles when the input data is a list of trees (instead of a table of CF) soon!
Claudia


On Tuesday, December 6, 2016 at 10:34:52 AM UTC-6, mparks wrote:S

Cécile Ané

unread,
Sep 22, 2019, 11:16:07 PM9/22/19
to PhyloNetworks users
This problem should be fixed now: there's a new function "observedquartetCF" that is much faster (it will be in the next registered version, and available now on the development version). Computing time increases linearly with the number of trees. A trial with ~110 taxa across 1300 trees took about 20 minutes. It's still slow: but there's over 5 million four-taxon sets with 110 taxa. At least it's feasible now. SNaQ will still choke with that many taxa, unless there are way fewer species, and 100 is the number of individuals.
Cecile.


On Tuesday, December 6, 2016 at 10:34:52 AM UTC-6, mparks wrote:

IOANA ANGHEL

unread,
Dec 6, 2022, 6:42:13 PM12/6/22
to PhyloNetworks users
Hello Drs. Ané and Solís-Lemus, 

I am running into a similar issue as Matt, where the readTrees2CF() function is estimated to take 15 hours. I was interested in the new (at the time) function you had mentioned "observedquartetCF". I tried to use it the same way as readTrees2CF():

julia> raxmlCF = observedquartetCF("trees.tre", writeTab=false, writeSummary=false)

ERROR: UndefVarError: observedquartetCF not defined

Stacktrace:

 [1] top-level scope

   @ REPL[5]:1


I also couldn't find the function in varinfo(PhyloNetworks). Am I using the function correctly?

My data are 63 samples across a genus, with 2-3 samples per species and 348 gene trees. I am using PhyloNetworks version v0.15.3.

Thank you so much for your help!
Ioana

Cécile Ané

unread,
Dec 6, 2022, 9:13:24 PM12/6/22
to PhyloNetworks users
Ah, indeed, the function was renamed before the next package version was released. 
The function name is (long but more explicit): countquartetsintrees. You can use it in 2 steps like this to store the concordance factors on file before using them with SNaQ:

# read in trees, calculate quartet CFs. BUT: add a taxon map with several samples / species:
q,t = countquartetsintrees("trees.tre");
df = writeTableCF(q,t)  # data frame with observed CFs: gene frequencies
# next: save the quartet concordance factors to a file, to re-use them later without having to redo the steps above:
using CSV
CSV.write("tableCF.csv", df);
# convert to "DataCF" object for use in snaq!:
d_sp = readTableCF("tableCF.csv");

The package documentation is the best resource for updated explanations, like:
here for the manual page about reading in gene trees,
- or here for the function itself.
- also here if you have multiple samples per species: the function can directly get the quartet CFs at the species level. There are way fewer than quartet CFs at the sample level, so that's good :)

All the best,
Cécile.

IOANA ANGHEL

unread,
Dec 15, 2022, 4:44:05 PM12/15/22
to PhyloNetworks users
Thank you so much Cécile, that was very useful! 

I was able to follow the manual instructions for the "countquartetsintrees" function, which ran very quickly, as well as providing a taxon map to generate CF at the species level only. So now I have 24 species and 4 outgroups, which is still higher than the recommended <25 tips to run snaq!  in a reasonable amount of time. 

I am wondering if there might be an option to subset the taxa in PhyloNetworks, rather than having to generate the gene trees again with some subsets of the taxa. I noticed the Pouchon et al 2018 paper on Espeletia subset their 41 taxa to multiple datasets of <25 to run it more efficiently, perhaps they might be a better source for my question.

I really appreciate your work in creating this tool and for your support in using it!

All the best, 
Ioana

Cécile Ané

unread,
Dec 15, 2022, 5:02:53 PM12/15/22
to PhyloNetworks users
absolutely: there's function "deleteleaf!" to prune a taxon from a phylogeny (gene tree or species network). But here, the most efficient way to subset the taxa might be to filter out any unwanted row in your table of concordance factors. You could make a list of taxa you want to prune (say "list_to_be_pruned"), then create a function that takes a row of data and returns "true" if all 4 taxa are to be kept, false if any of the 4 taxa is to be pruned. Then use "filter!" to reduce your data frame with all concordance factor to just the rows you want to keep. Then you can use that as input to SNaQ. You will also need to prune the unwanted taxa from your starting topology.

For the manipulation on the data frame, you can follow an example that does a similar thing at the very end of this page, and it would look something like this for your:
# define "list_to_be_pruned"
function hastaxon2prune(row) # replace :t1 :t2 etc. by appropriate column names in your data
  row[:t1] in list_to_be_pruned || row[:t2] in list_to_be_pruned ||
  row[:t3] in list_to_be_pruned || row[:t4] in list_to_be_pruned
end
df_sp_reduced = filter(!hastaxon2prune, df_sp) # use df_sp_reduced downstream

For pruning multiple taxa from your starting topology, you can use a loop like this:
for taxon in list_to_be_pruned
  deleteleaf!(net, taxon) # adjust "net" to the name of the starting phylogeny
end

I hope it's clear!
Cécile.
Reply all
Reply to author
Forward
0 new messages