Conditions and results on PhyloNetworks

412 views
Skip to first unread message

yaf...@gmail.com

unread,
Nov 14, 2015, 3:21:09 AM11/14/15
to PhyloNetworks users
Hi all,

I have some confuses on conditions and results on PhyloNetworks. Hope getting your helps^^
1. If the input is the trees set, which produced by raxml, does the result of phylonetworks have bias on the trees set? Because the best tree of raxml sometimes could changed depending on the seeds and start tree, right? In other words, we could get better results from input of CFs table.

2. Actually, Phylonetworks also need a start tree, how should I choose a start tree, is it better to be a primary concordance tree?

3. The other software 'phylonet' require the input trees which should be root trees. However, Phylonetworks could accept root and unroot tree, how different is between root tree and unroot tree on the result when we use Phylonetworks?

4. The manual mentioned that phylonetworks could not detect the overlap hybirdization, so, if could the signal be enhanced or decreased when overlap hybirdization happened?

5. For my case, I used the different inputs, (one is trees set, the other is CF table). And I got the different results from phylonetworks. What made me so confused is the result from CF table, the topology are different from the primary concordance tree, the topology of which should be the real topology of species tree in my mind. So i do not know how to understand the result from table CF. (Plus, the results are attached, please check it.)

Thanks for your reading~

Best, 
Yafei  

cftree_phylonetworks.pdf
cftable_phylonetworks.pdf
Screen Shot 2015-11-02 at 20.39.37.png

Claudia R. Solís-Lemus

unread,
Nov 14, 2015, 10:52:39 AM11/14/15
to PhyloNetworks users
Hi Yafei,

1. I have not used raxml much to know about the bias you mention, let me double check before answering this.

2. it is ok to use the primary concordance tree. You could also get a starting tree from QuartetMaxCut or Astral that will use the same input as PhyloNetworks (a CF table or a list of gene trees). Another idea is to run snaq on hmax=0 on the starting tree to see which is chosen as the best tree (hmax=0 represents a tree) and use that tree as starting point for the network.

3. SNaQ in PhyloNetworks does not need rooted gene trees. If you provide rooted gene trees, it will ignore the root because it does not need it. But, the estimated network will also be unrooted, so you should be careful when studying the estimated network. If you see the plots you sent my in Dendroscope, they are both unrooted.

4. If the "true" network that we want to detect has overlapping hybridizations, SNaQ will only be able to detect one. It is important to note that as the network becomes more complex (more hybridizations, and even overlapping), you will need a lot more data to detect them. 

5. So, the first thing is that both estimated networks should be viewed as unrooted (even when Dendroscope tries to plot with some direction). You mention that the starting tree is what you would expect as the main underlying tree, which has clades (Asub,Agem), (Aech,Adif) and (Ayon,Aten).

It is hard to know the underlying main tree in the figures from Dendroscope because we do not have the probability for each hybrid edge. That is, the edges in blue are the hybrid edges and each hybrid edge should have a "weight". The hybrid edge with the higher weight is regarded as part of the true main tree.

Maybe you already did this, but for me to compare underlying main trees, I would need to know the weights of the hybrid edges. You can get the information from the .out files if you can read extended parenthetical format.
Hopefully you saved both .out files (one from input CF table and the other from input trees).

The other information that you can get from the .out files is the value for the pseudolikelihood for each estimation. Since SNaQ is trying to find the network that maximizes the pseudolikelihood, you can use the pseudolikelihood values to compare the two estimated networks. For optimization purposes, we actually minimize -log(pseudolikelihood), so the network with lower P-loglik value in their .out file would be preferred.

It is difficult to estimate networks because the space of networks is very big. Runs can inevitably be stuck in local maxima, so starting different runs is a good idea. Actually, if you have different starting trees, it is a good idea to use them as starting point and compare estimated networks at the end according to the pseudolikelihood value.

One last thing to double check: if you got the CF table from bucky.pl, you should remove the columns with the confidence intervals for each CF because at the moment, the function readTableCF will take the first 7 columns. I want to modify this so that you use the table directly from bucky.pl, but for the moment, you need to be careful that the columns in your CF table are in this order:
taxon1 taxon2 taxon3 taxon4 CF12|34 CF13|24 CF14|23
Maybe this is the reason why you got different result.

Claudia R. Solís-Lemus

unread,
Nov 14, 2015, 12:08:17 PM11/14/15
to PhyloNetworks users
Hello,
I've changed the readTableCF function so that it will read the table from bucky.pl directly.
I should have done this sooner to prevent errors.
Let me know if this changes the results you got from CF table.
Thanks!
Claudia


On Saturday, November 14, 2015 at 2:21:09 AM UTC-6, yaf...@gmail.com wrote:

Cécile Ané

unread,
Nov 14, 2015, 4:14:21 PM11/14/15
to PhyloNetworks users
Hi Yafei,
Just a few further thoughts:

1. I don't think RAxML trees have a bias, but they surely have some estimation error. The advantage to using CFs estimated from BUCKy, instead, is that the gene tree estimation error is accounted for by BUCKy. So the quartet CFs output by BUCKy are influenced a lot more by genes whose trees are more certain, and are almost not influenced by genes whose tree are completely uncertain. This is done quartet by quartet separately, and some genes might have good resolution (therefore influence) on some quartets, and not on others. Anyway, the main difference here between using gene trees straight from RAxML, versus using quartet CFs estimated by BUCKy is that the former assumes no gene tree error, while the latter accounts for (and integrates) gene tree error. I prefer the second option, but I recognize it's more complicated.

2. Like Claudia said, the primary concordance tree or the population tree output by BUCKy (on the full taxon set) should be a good starting point. Another option is a tree obtained by QuartetMaxCut, which you can get from the table of quartet CFs that you already have. The TICR pipeline has this too (https://github.com/nstenz/TICR#qmc). A third option would be the tree obtained by ASTRAL, which is also based on quartets. But ASTRAL cannot take the table of quartet CFs as input. You will need to give ASTRAL the full set of RAxML trees, and ASTRAL will assume that these trees are perfectly reconstructed.

3. PhyloNet will always think that your input trees are rooted. This is because any tree has to be arbitrarily rooted in some way, to be written to a text file in parenthetical format. It is the user's responsibility to know if the rooting used for the parenthetical description is meaningful or not. PhyloNet will always think that the root used to write the trees is meaningful. So technically, PhyloNet would accept trees that you think are unrooted. But if the root is meaningless, then the output of PhyloNet would also be meaningless. In practice, some software will arbitrarily use the last taxon to root the trees and write them in parenthetical format. This might be a good choice, in case your taxon sample has a single outgroup, that this outgroup is  last alphabetically, and if it is present in all gene trees: no missing data. If it's a good choice, then even if you don't know it, the output of PhyloNet would be meaningful. But in general you should be careful to check the root of your gene trees. If you have 2 outgroup taxa instead of 1, for instance, the default rooting of gene trees to write them in parenthetical format would typically be bad. Or if a gene is missing all outgroups, then the default rooting is likely to be meaningless as well.

Like Claudia said, SNaQ in PhyloNetworks ignores any rooting of the gene trees, so you should get the same results regardless of how the gene trees are rooted in the text files.

yaf...@gmail.com

unread,
Nov 14, 2015, 8:44:55 PM11/14/15
to PhyloNetworks users
Hi Claudia,

I tried the Phylonetworks today morning. It does not work now=-=

julia> Pkg.update()

INFO: Updating METADATA...

INFO: Updating PhyloNetworks...

INFO: Computing changes...

INFO: No packages to install, update or remove


julia> using PhyloNetworks

INFO: Precompiling module PhyloNetworks...

ERROR: LoadError: LoadError: syntax: incomplete: "function" at /Users/mao/.julia/v0.4/PhyloNetworks/src/readData.jl:49 requires end

 in include at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib

 in include_from_node1 at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib

 in include at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib

 in include_from_node1 at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib

 [inlined code] from none:2

 in anonymous at no file:0

 in process_options at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib

 in _start at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib

while loading /Users/mao/.julia/v0.4/PhyloNetworks/src/readData.jl, in expression starting on line 42

while loading /Users/mao/.julia/v0.4/PhyloNetworks/src/PhyloNetworks.jl, in expression starting on line 54

ERROR: Failed to precompile PhyloNetworks to /Users/mao/.julia/lib/v0.4/PhyloNetworks.ji

 in error at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib

 in compilecache at loading.jl:384

 in require at /Volumes/Julia/Julia-0.4.1.app/Contents/Resources/julia/lib/julia/sys.dylib


julia> whos(PhyloNetworks)

ERROR: UndefVarError: PhyloNetworks not defined


julia> d=readTrees2CF("/Users/mao/all_ortholog_raxml_bran.txt");

ERROR: UndefVarError: readTrees2CF not defined


julia> d=readTableCF("/Users/mao/mygenes-bucky/my-gene-mb.CFs.csv")

ERROR: UndefVarError: readTableCF not defined


Best, 
Yafei 

在 2015年11月15日星期日 UTC+9上午2:08:17,Claudia R. Solís-Lemus写道:

Claudia R. Solís-Lemus

unread,
Nov 14, 2015, 10:43:32 PM11/14/15
to PhyloNetworks users
Sorry Yafei!
I made a small error, it should work now. Sorry for the delay in response!
Claudia

On Saturday, November 14, 2015 at 2:21:09 AM UTC-6, yaf...@gmail.com wrote:

yaf...@gmail.com

unread,
Nov 15, 2015, 3:37:29 AM11/15/15
to PhyloNetworks users
Hi Claudia, 

Now, It works for read CVS file. However, If I use txt file as input with space sparator, it does not work. Anyway, It works for read CVS file now. 
Thanks a lot.

Best,
Yafei 
在 2015年11月15日星期日 UTC+9下午12:43:32,Claudia R. Solís-Lemus写道:

Claudia R. Solís-Lemus

unread,
Nov 15, 2015, 7:45:14 AM11/15/15
to PhyloNetworks users
Oh yes, if your separator is different than a comma, you need to specify it:
d=readTableCF("table.txt", sep=" ")

But this has always been like this, so I wonder if you were able to read a txt separated by spaces before.


On Saturday, November 14, 2015 at 2:21:09 AM UTC-6, yaf...@gmail.com wrote:

yaf...@gmail.com

unread,
Nov 15, 2015, 6:20:18 PM11/15/15
to PhyloNetworks users
Hi Claudia, 

Thanks for your reminder. Now, I remember that you have already mentioned in Manual. Sorry for that. 

Best, 
Yafei 

在 2015年11月15日星期日 UTC+9下午9:45:14,Claudia R. Solís-Lemus写道:

yaf...@gmail.com

unread,
Nov 15, 2015, 8:52:17 PM11/15/15
to PhyloNetworks users
Hi Cecile,

Thanks for your reply, I agree with your thoughts.

What I talked about Raxml bias is about gene trees estimation error, So do you agree that using other gene tree reconstruction software would be better than results from Raxml? Some papers showed that Raxml is not the best tool on estimating gene trees. 

Moreover, I tried the PhyloNet under rooted tree and unrooted tree with same gene trees set, the result with rooted tree is different from unrooted tree. Comparing with the SNaQ results, I prefer to believe the result of rooted tree. Interesting, I also try the different inputs in SNaQ (one is gene trees set, the other is CF table), they have same hybridization edge but the node of edge is on different position (different positions in the branch). Moreover, Using the max reticulate number 1 in Phylonet with rooted gene trees set, the Phylonet gives the different hybridizaiton edge from SNaQ, otherwise, Using the max reticulate number 2, Phylonet gives two hybridation edges, one of them is same as SNaQ. So, there are probable reasons to get those results for my understanding: 1. gene trees estimation error; 2. SNaQ coud not detect overlapping hybridizaiton, but I am not sure how much confidence on Phylonet to detect overlapping hybridizaiton. 3. I prefer to SNaQ result for my case by far, because I do not have outgroup here for getting a "good" rooted trees for Phylonet and I just rooted the tree with one taxon which is basal species in previous phylogeny studies in my case. Luckily, I got the one outgroup specie genome last week, I would like to see how is going on for my case. 

Finally, Thanks for sharing your useful thoughts and nice softwares again.
Thanks a lot for your group again^^

Best, 
Yafei  

在 2015年11月15日星期日 UTC+9上午6:14:21,Cécile Ané写道:

Cécile Ané

unread,
Nov 16, 2015, 12:12:19 PM11/16/15
to PhyloNetworks users
Hi Yafei,


What I talked about Raxml bias is about gene trees estimation error, So do you agree that using other gene tree reconstruction software would be better than results from Raxml? Some papers showed that Raxml is not the best tool on estimating gene trees. 

All methods have some degree of estimation error anyway. In terms of methods that maximize the likelihood, the leading implementations are RAxML, PhyML and GARLI. Perhaps more important than the software, is the evolutionary model that you use (e.g HKY versus GTR, partitions within a gene, rate heterogeneity between sites with a Gamma distribution etc.). You can take a look at this paper, where the authors look at the tree reconstructed by various software packages, on a difficult data set (>4,000 taxa, much missing data in this supermatrix): Wright et al (2015): http://dx.doi.org/10.1002/jez.b.22642

Like I mentioned earlier, BUCKy uses the full gene tree sample from MrBayes, from each gene, and integrates over gene tree uncertainty to estimate quartet concordance factors, giving more weight to genes that have more information on the quartet, and less weight to genes that have less information on the quartet. So if this approach can be used (computationally), I think it deals better with gene tree error than using either RAxML trees, or GARLI trees.

 
Moreover, I tried the PhyloNet under rooted tree and unrooted tree with same gene trees set, the result with rooted tree is different from unrooted tree.

Because PhyloNet will actually think that your unrooted trees are rooted, wherever they were rooted to be written to a text file, you should discard the PhyloNet analysis on your unrooted trees, and instead prefer the PhyloNet analyses on your rooted gene trees.

Cheers,
Cecile.

yaf...@gmail.com

unread,
Jan 19, 2016, 12:54:49 AM1/19/16
to PhyloNetworks users
Hi Cecile,

Happy new year ^^
Thanks for your detail reply.
Recently, I am thinking about the 'gene tree', could I use haplotype loci or genotype loci to instead of single gene?
My thought are below:
1. All of genes in one haplotype loci should be linked together so that they usually should have same evolution pattern (same topology of each of genes).
2. Usually, a haplotype loci have more sequence information than one single gene, so we could get more precise 'gene tree' for that loci.
3. For some low coverage genomes, it is hard to do precise genome assembling, but if we have a closed reference genome, we could easily do haplotyping or genotyping. 

Thanks again^^

Best, 
Yafei 

在 2015年11月17日星期二 UTC+9上午2:12:19,Cécile Ané写道:

Cécile Ané

unread,
Jan 20, 2016, 2:58:39 PM1/20/16
to PhyloNetworks users
Hi Yafei,
I don't know how you define your haplotypes, but it sounds like the loci you define as "haplotypes" are longer than the loci defined as "genes". 

For the purpose of reconstructing a species tree or a species network, it doesn't matter that loci code for proteins or not. We typically call them "genes" for convenience, but they don't need to be genes. They just need to be loci within which no recombination occurred, or within which recombination did not change the underlying tree. (Many recombination events do not actually change the tree topology for the sampled individuals.)

So you should define your loci with that in mind: choose them as long as possible to get informative alignment on "the" tree for each locus, but short enough that each locus has one single underlying tree. That's the goal of running MDL as the first step of the TICR pipeline (https://github.com/nstenz/TICR). It's also the goal of the binning approach by Siavash Mirarab et al. (https://github.com/smirarab/binning).

So, if you can back up your thought #1 below (perhaps using MDL), then using the longer haplotypes as loci would be valid and advantageous, yes. Alternatively, you could use binning to group your genes into longer loci.
Cecile.

yaf...@gmail.com

unread,
Jan 21, 2016, 8:15:54 AM1/21/16
to PhyloNetworks users
Hi Cecile,

Thanks for your kind reply^^

In my mind, the genes within one haplotype or genotype should be recombination free, they usually inherit together.

Also, I think, Phylonet-HMM, they use genotype (or haplotype, depending on how you call it out by different softwares) to detect the introgression region. 

As your mention of binning method, one more far away thinking, I am thinking about how to use Rad-tag data to fit to TICR with binning method (I think it is so hard). Because Rad-tag data usually has 50bp as one alignment. In generally, researchers use the snp calling data (I prefer to call it as variation sites) to make species tree, in my experience, it is so hard to get a precise specie tree, not to mention to make phylonetwork tree.  (Rad-tag is for population genetics in beginning, but now we try to use to solve one genus phylogeny). Do you have any idea about it? It is a just thinking not so focus on what we were talking about above (Sorry for that). 

Thanks again ^^

Best, 
Yafei

在 2016年1月21日星期四 UTC+9上午4:58:39,Cécile Ané写道:
Reply all
Reply to author
Forward
0 new messages