Multiple alleles for bootsnaq possible?


*

Sep 6, 2024, 1:42:18 AM
to PhyloNetworks users
Hi all,
Can someone help me with my code?
I have multiple individuals per species, and I'm trying to use the "multiple alleles" approach to construct a phylogenetic network.

julia> tm = CSV.read("mappingfile.csv", DataFrame)
julia> taxonmap = Dict(row[:individual] => row[:species] for row in eachrow(tm))
julia> genetrees = readMultiTopology("raxml/best_trees.tre")
julia> sort(tipLabels(genetrees[1]))
julia> sort(tipLabels(genetrees[2]))
julia> sort(tipLabels(genetrees[3]))
julia> sort(tipLabels(genetrees[4]))
julia> sort(tipLabels(genetrees[5]))
julia> sort(tipLabels(genetrees[6]))
julia> df_sp = writeTableCF(countquartetsintrees(genetrees, taxonmap, showprogressbar=false)...)
15×9 DataFrame
 Row │ qind   t1          t2         t3         t4         CF12_34   CF13_24   CF14_23   ngenes
     │ Int64  String15    String15   String15   String15   Float64   Float64   Float64   Float64
.....
.....
julia> d_sp = readTableCF("tableCF_species.csv");
between 5.99999999999886 and 6.000000000000881 gene trees per 4-taxon set
-------------------------------------------------------------------------------
julia> summarizeDataCF(d_sp)
data consists of 15 4-taxon subsets
Taxa: ["Central_EU", "East_EU", "RO_Buchar", "SK_Kezmar", "SK_ZlateM", "Taiwan"]
Number of Taxa: 6
Maximum number of 4-taxon subsets: 15. Thus, 100.0 percent of 4-taxon subsets sampled
-------------------------------------------------------------------------------
julia> start_net2 = readTopology("start_net2.tre")
HybridNetwork, Rooted Network
19 edges
18 nodes: 6 tips, 2 hybrid nodes, 10 internal tree nodes.
-------------------------------------------------------------------------------
julia> net3 = bootsnaq(start_net2, df_sp, hmax=3, nrep=5, runs=10, filename="snaq/net3")
ERROR: BoundsError: attempt to access 9-element Vector{Symbol} at index [[6, 7, 9, 10, 12, 13]]

OR

julia> net3 = bootsnaq(start_net2, d_sp, hmax=3, nrep=5, runs=10, filename="snaq/net3")
ERROR: MethodError: no method matching bootsnaq(::HybridNetwork, ::DataCF; hmax::Int64, nrep::Int64, runs::Int64, filename::String)
The function `bootsnaq` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  bootsnaq(::HybridNetwork, ::Union{DataFrame, Vector{Vector{HybridNetwork}}}; hmax, liktolAbs, Nfail, ftolRel, ftolAbs, xtolRel, xtolAbs, verbose, closeN, Nmov0, runs, outgroup, filename, seed, probST, nrep, prcnet, otherNet, quartetfile)

OR

julia> net3 = bootsnaq(start_net2, d_sp, hmax=3, nrep=5, runs=5, filename="snaq/net3")
ERROR: BoundsError: attempt to access 8-element Vector{Symbol} at index [[6, 7, 9, 10, 12, 13]]

(I don't understand these errors.)

I still have an error concerning the "d_sp" object:

> bootnet2 = bootsnaq(snaqnet2, d_sp, hmax=2, nrep=10, filename="snaq/bootnet2")
ERROR: MethodError: no method matching bootsnaq(::HybridNetwork, ::DataCF; hmax::Int64, nrep::Int64, filename::String)
The function `bootsnaq` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  bootsnaq(::HybridNetwork, ::Union{DataFrame, Vector{Vector{HybridNetwork}}}; hmax, liktolAbs, Nfail, ftolRel, ftolAbs, xtolRel, xtolAbs, verbose, closeN, Nmov0, runs, outgroup, filename, seed, probST, nrep, prcnet, otherNet, quartetfile)

Thanks.

Cécile Ané

Sep 6, 2024, 10:41:09 AM
to PhyloNetworks users
Short answer: bootsnaq with multiple alleles has not been implemented in a "one-does-it-all" function.

More explanations:

- If we have 1 gene tree per gene, then we don't have information to separate real discordance between (true) gene trees versus discordance due to uncertainty in estimated gene trees (gene trees that differ because each one was estimated with some error). In that case, we cannot do any bootstrapping.
From what I understand of your code, your input has 1 tree per gene. After reading the gene trees, your data frame df_sp has a point estimate for each quartet concordance factor, in columns named "CF12_34" etc. But it does not have credibility intervals around these point estimates.

- The bootsnaq function is looking for other columns (by default, columns 6,7; 9,10 and 12,13) containing the lower & upper bounds of the credibility intervals for the quartet concordance factors CF12_34, CF13_24 and CF14_23. The error appears because your data does not have these columns.

- If the input is gene trees instead of quartet concordance factors estimated with credibility intervals, then we need a bootstrap sample of trees for each gene (for example: 100 trees / gene, not 1 tree / gene). The variation within a gene tells us about gene tree uncertainty. The (extra) variation between genes tells us about discordance between genes. So a bootstrap analysis could be run by sampling 1 tree from each gene's bootstrap file, then estimating a network from this sample (1 tree per gene only), and repeating this over and over.
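
As an illustration of the resampling step described in the last point, here is a minimal sketch in plain Julia (standard library only). The helper name `sample_one_per_gene` and the toy Newick strings are made up for the example; real input would be the bootstrap trees of each gene.

```julia
using Random

# Hypothetical helper (not part of PhyloNetworks): draw exactly one tree
# per gene, uniformly at random from that gene's own bootstrap set.
function sample_one_per_gene(boottrees::Vector{Vector{String}}; rng=Random.default_rng())
    return [rand(rng, trees) for trees in boottrees]
end

# Toy data: 3 genes, each with its own set of bootstrap trees (Newick strings).
boottrees = [
    ["((A,B),(C,D));", "((A,C),(B,D));"],                    # gene 1
    ["((A,B),(C,D));", "((A,B),(C,D));", "((A,D),(B,C));"],  # gene 2
    ["((A,C),(B,D));"],                                      # gene 3
]

# Each replicate holds one tree per gene; a network would then be
# estimated from each replicate separately.
rng = MersenneTwister(1)
replicates = [sample_one_per_gene(boottrees; rng=rng) for _ in 1:5]
```

Sampling within each gene keeps the number of genes fixed across replicates, so differences between replicates reflect gene tree uncertainty only.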

The function "bootsnaq" can do this automatically if 1 individual = 1 tip in the network. But one would have to code this more manually when there are multiple alleles per tip. (Grant proposals to do this don't get funded: no new biological finding, no new theory...)
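
A rough sketch of what such manual coding might look like, reusing the functions already called in the question (`countquartetsintrees`, `writeTableCF`, `readTableCF`) plus `snaq!`. This is untested and rests on several assumptions: the function name `manual_bootstrap` and its keyword `rootname` are invented here; `boottrees[g]` is assumed to hold the bootstrap trees of gene g as a vector of HybridNetwork objects (for instance as returned by `readBootstrapTrees`); `taxonmap` is the individual-to-species dictionary from the question; and the tip labels of the starting network are assumed to be species names.

```julia
using PhyloNetworks, Random

# Hypothetical helper, not a PhyloNetworks function.
# ASSUMPTIONS: boottrees[g] = bootstrap trees of gene g;
# taxonmap maps individual names to species names.
function manual_bootstrap(startnet, boottrees, taxonmap;
                          nrep=10, hmax=1, runs=10,
                          rootname="snaq/boot", rng=Random.default_rng())
    bootnets = HybridNetwork[]
    for rep in 1:nrep
        # 1. sample one bootstrap tree per gene
        sample = [rand(rng, trees) for trees in boottrees]
        # 2. species-level quartet CFs, averaging over alleles via taxonmap
        q, t = countquartetsintrees(sample, taxonmap; showprogressbar=false)
        d = readTableCF(writeTableCF(q, t))
        # 3. estimate one network from this replicate's CFs
        net = snaq!(startnet, d; hmax=hmax, runs=runs,
                    filename="$(rootname)_$(rep)")
        push!(bootnets, net)
    end
    return bootnets
end
```

The returned vector of networks could then be summarized like any other set of bootstrap networks. Each replicate runs a full snaq! search, so small values of `nrep` and `runs` are advisable for a first test.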

Cécile.