Bug in TreeTemplateTools::midRoot?

47 views
Skip to first unread message

Bastien Boussau

unread,
Mar 4, 2014, 9:09:43 AM3/4/14
to biopp-de...@googlegroups.com
Hi all,

I have the impression that there is a bug in TreeTemplateTools::midRoot that replaces TreeTemplateTools::midpointRooting. Demonstration:

std::cout << "HEHEHEHEHEEHEH "<<TreeTemplateTools::treeToParenthesis(*tree_, true);
    //        TreeTools::midpointRooting(*tree_); //DEPRECATED in latest Bio++ versions
      TreeTemplateTools::midRoot(*tree_, "sum of squares");
    std::cout << TreeTemplateTools::treeToParenthesis(*tree_, true);


HEHEHEHEHEEHEH (Loxodonta_africana_0:1,(Canis_familiaris_1:1,((Gorilla_gorilla_2:1,Mus_musculus_3:1)4:1,Homo_sapiens_5:1)6:1)7:1);
((Gorilla_gorilla_2:1,Mus_musculus_3:1)4:1,Homo_sapiens_5:1,(Canis_familiaris_1:1,Loxodonta_africana_0:2)7:1);

If I'm not mistaken, the input tree was rooted, and the output tree is not... The midpointRooting function used to work, if that's any help.

Thanks!

Best,

Bastien.

Nicolas Rochette

unread,
Mar 6, 2014, 5:06:07 AM3/6/14
to biopp-de...@googlegroups.com
Hi Bastien,

Actually, according to the sum of squares criterion, the root that is returned (on one of the inner nodes) is the unique best position. Currently this is the intended behavior and not actually a "bug".

If we don't want trees rooted on nodes, then options are :
— introduce branches of length 0
— discard the sum of squares as a rooting criterion

Best,

Nicolas
--
You received this message because you are subscribed to the Google Groups "Bio++ Development Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biopp-devel-fo...@googlegroups.com.
To post to this group, send email to biopp-de...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/biopp-devel-forum/b2042eb0-60e7-470e-b0be-70b7ce222801%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Bastien Boussau

unread,
Mar 6, 2014, 5:16:33 AM3/6/14
to biopp-de...@googlegroups.com
Hi Nicolas,

Thanks for your answer. I think it is quite unintuitive that midroot can return a non-rooted tree (in the sense "non-rooted" is usually understood in phylogenetics)... And it changes the behavior compared to what was implemented before. Maybe you discussed that before with other people and you all agreed this was the right thing to do, in which case I'm sorry I'm annoying you with this. If this has never been discussed before, maybe other people should give their opinions. If everybody but me is fine with having a midroot function that can return an unrooted tree, then it's OK, but then I think it should be clearly said in the comments to the function.

Anyway, I was using midpointRooting just to get a rooted tree from which to start some exploration algorithms. If I use "variance" instead of "sum of squares" as a criterion, am I guaranteed to get a properly rooted phylogenetic tree?

Bastien.

Nicolas Rochette

unread,
Mar 6, 2014, 6:06:44 AM3/6/14
to biopp-de...@googlegroups.com
Hi Bastien,

Actually eventually your problem lies in the Newick format itself, which itself does not specify whether a tree is rooted or strictly bifurcating. "Non-rooted means rooted on a branch" is just an ad-hoc but fragile solution to try to circumvent this ambiguity. You are only considering bifurcating trees because it is the ones you are usually using.

Of course the behavior has changed, since the algorithm itself has changed. Also, I have never requested to deprecate the old function — please do not assume the contrary when you have no idea.

The simplest solution to your problem is, as I said, to introduce a zero-length branch.

Nicolas

Mathieu Groussin

unread,
Mar 6, 2014, 7:18:28 AM3/6/14
to biopp-de...@googlegroups.com
Hi all,
what do you think about adding a bool argument to the function, to specify if the user wants to impose to the function to return a rooted binary tree, even if the best solution is an inner node (and so, creating a branch of length 0) ?

Mathieu

Le 3/6/14 12:06 PM, Nicolas Rochette a écrit :

Bastien Boussau

unread,
Mar 6, 2014, 7:30:20 AM3/6/14
to biopp-de...@googlegroups.com
Thanks Nicolas and Mathieu.

The bool may be a good idea, but then someone would need to implement that, and I'm not sure anyone has the time right now... In addition, a branch of length 0 or very small may pose problems for potential subsequent numerical optimizations.

Perhaps one solution that would not take long would be to put back the midpointRooting function, and explain in the comments the difference to the functions.

Finally, Nicolas, just to be sure:
You said in your first response that "discard the sum of squares as a rooting criterion" was a solution: again, does that mean that if I use "variance" instead of "sum of squares", it will work or does it mean that I should just not use your function at all?

Bastien.

Julien Yann Dutheil

unread,
Mar 6, 2014, 9:51:24 AM3/6/14
to Bio++ Development Forum
Hi All,

I just have to say that one reason for deprecating the old midpointRooting version was that Nicolas' version was much faster and efficient. I actually did not know there were differences in the results (and to my experience so far I did not see any). I have to apologize then if this broke compatibility with previous implementations. I will check what this original implementation exactly did, and maybe we indeed can have three functions or three criteria. I do that asap. In the mean time Bastien, you can "undeprecate" the old function and use it, if that solves your issue.

cheers,

Julien.



For more options, visit https://groups.google.com/groups/opt_out.



--
Julien Y. Dutheil, Ph-D
0 (+49) 6421 178 986

§ Max Planck Institute for Terrestrial Microbiology
Department of Organismic Interactions
Marburg -- GERMANY

§ Intitute of Evolutionary Sciences - Montpellier
University of Montpellier 2 -- FRANCE

Bastien Boussau

unread,
Mar 6, 2014, 9:54:21 AM3/6/14
to biopp-de...@googlegroups.com
Hi Julien, 

Thank you very much!
I will undeprecate the function. That arrangement will work for me, although I understand it’s not the most elegant one.

Best, 

Bastien.



Nicolas Rochette

unread,
Mar 6, 2014, 10:18:37 AM3/6/14
to biopp-de...@googlegroups.com
Hi Bastien, Mathieu,

I think the variance criterion will not yield trees rooted on a node but I have no mathematical proof for it. Actually I am not sure there are midroot criteria that really guarantee that.

If a too short branch may cause numerical problems I guess we could have a 'force_branch_root' option which would move the root on the middle of the shortest of the three adjacent branches (so the branch lengths would remain similar to those in the original tree). The implementation should be very simple (maybe 5 lines?).

Best,

Nicolas


Bastien Boussau wrote, on 03/06/2014 01:30 PM:

Bastien Boussau

unread,
Mar 6, 2014, 10:28:30 AM3/6/14
to biopp-de...@googlegroups.com
Hi Nicolas,

I think the 'force_branch_root' option is a good idea. I don't have the time to write the 5 lines you suggest, notably because I would first need to understand your code. If you feel that you can afford to do that, then that's great, but otherwise de-deprecating midpointRooting will work for me, so please do not feel like you have to do anything urgently.

Best,

Bastien.

Nicolas Rochette

unread,
Mar 6, 2014, 10:42:22 AM3/6/14
to biopp-de...@googlegroups.com
My idea was just to add 5 independent lines at the end of the function to move the root in case it was on a node. Something like "if(force_branch_root && root->getNumberOfSons > 2)" ...

I could write these lines but my version of the libs is not up-to-date so I would prefer to paste them here and let someone push them for me.

Nicolas


Bastien Boussau wrote, on 03/06/2014 04:28 PM:

Bastien Boussau

unread,
Mar 6, 2014, 10:57:57 AM3/6/14
to biopp-de...@googlegroups.com
OK, I can do the pushing! In this case, I won't dedeprecate midpointRooting.

Bastien.

Nicolas Rochette

unread,
Mar 6, 2014, 11:19:58 AM3/6/14
to biopp-de...@googlegroups.com
I think this should do it. More 15 that 5 lines but it's very
straightforward. It's a pity we don't have a method to introduce a new
node at a particular point in a branch. (But maybe we have one?) Let me
know if you think there is a mistake. In any case it should be a good
template.

Could you please modify TreeTemplateTools.h and add the option to the
doc there ?

Nicolas

//

@ TreeTemplateTools.cpp :1011
void TreeTemplateTools::midRoot (TreeTemplate<Node>& tree, const string&
criterion, const bool force_branch_root)

@ TreeTemplateTools.cpp :1062
if(force_branch_root)
{
Node* orig_root = tree.getRootNode();
vector<Node*> root_sons = orig_root->getSons();
if(root_sons.size() >2)
{
Node* nearest = root_sons.at(0);
for(vector<Node*>::iterator n = root_sons.begin(); n !=
root_sons.end(); ++n)
if((**n).getDistanceToFather() < nearest->getDistanceToFather())
nearest = *n;
const double d = nearest.getDistanceToFather();
Node* new_root = new Node();
new_root->setId( TreeTools::getMPNUId(tree, tree.getRootId()) );
orig_root.removeSon(nearest);
new_root.addSon(nearest);
orig_root.setDistanceToFather(d/2.);
nearest.setDistanceToFather(d/2.);
const vector<string> branch_properties = nearest->getBranchPropertyNames();
for(vector<string>::const_iterator p = branch_properties.begin();
p!=branch_properties.end(); ++p)
new_root->setBranchProperty(*p, *nearest->getBranchProperty(*p));
}
}

Nicolas Rochette

unread,
Mar 6, 2014, 11:29:43 AM3/6/14
to biopp-de...@googlegroups.com
Bugs there are :/ Posted too quickly

if(force_branch_root)
{
Node* orig_root = tree.getRootNode();
vector<Node*> root_sons = orig_root->getSons();
if(root_sons.size() >2)
{
Node* nearest = root_sons.at(0);
for(vector<Node*>::iterator n = root_sons.begin(); n !=
root_sons.end(); ++n)
if((**n).getDistanceToFather() < nearest->getDistanceToFather())
nearest = *n;
const double d = nearest->getDistanceToFather();
Node* new_root = new Node();
new_root->setId( TreeTools::getMPNUId(tree, tree.getRootId()) );
orig_root->removeSon(nearest);
orig_root->addSon(new_root);
new_root->addSon(nearest);
new_root->setDistanceToFather(d/2.);
nearest->setDistanceToFather(d/2.);
const vector<string> branch_properties =
nearest->getBranchPropertyNames();
for(vector<string>::const_iterator p = branch_properties.begin();
p!=branch_properties.end(); ++p)
new_root->setBranchProperty(*p, *nearest->getBranchProperty(*p));
}
tree.rootAt(new_root);
}

Bastien Boussau

unread,
Mar 6, 2014, 11:32:17 AM3/6/14
to biopp-de...@googlegroups.com
Hi Nicolas,

Thanks! So this version should be the good one? I'll put it in the libraries then, and will also update the documentation.
Thanks!

Bastien.

Nicolas Rochette

unread,
Mar 6, 2014, 11:43:43 AM3/6/14
to biopp-de...@googlegroups.com
Yes, it should be good. I hope you'll be happy with it !
Best,

Nicolas


Bastien Boussau wrote, on 03/06/2014 05:32 PM:

Julien Yann Dutheil

unread,
Mar 6, 2014, 2:43:03 PM3/6/14
to Bio++ Development Forum
So, what the old midpointRooting method does is that it first compute the distance matrix from the tree (that's why it was so inefficient: it does it in a generic manner, using the tree interface :s ), and then take the maximum distance, cut it in two, and root on the branch where this "midpoint" falls ! Very crude indeed, i think it is worth deprecating this. I think in the old case, if this fell at a node, then a branch of length 0 was created. That actually seems a reasonable default option to me, even if the branching order is in that case arbitrary. Having a option as input of the method would of course be a plus. In addition, looking at the code again, we should avoid raw strings like "sum of square" as an option. A much better code would be:

short TreeTemplateTools::MIDPOINT_ROOTING_VARIANCE = 1;
short TreeTemplateTools::MIDPOINT_ROOTING_SUM_OF_SQUARES = 2;

and have an option 'short criterion' in the midpoint function. Bastien, if you add the fix Nicolas proposed, I can made those cosmetic changes.

Cheers,

J.



--
You received this message because you are subscribed to the Google Groups "Bio++ Development Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biopp-devel-fo...@googlegroups.com.
To post to this group, send email to biopp-de...@googlegroups.com.

For more options, visit https://groups.google.com/groups/opt_out.



--

Bastien Boussau

unread,
Mar 6, 2014, 7:34:11 PM3/6/14
to biopp-de...@googlegroups.com
The few tests I made of the function are all positive, the new option seems to work! I did a little correction to the code though, by moving “tree.rootAt(new_root);” just before the preceding “}” (otherwise it did not compile).

I committed the change, so Julien if you want to add your cosmetic changes, feel free to do so.

Thank you all, 

Best,

Bastien. 

Nicolas Rochette

unread,
Mar 7, 2014, 6:48:16 AM3/7/14
to biopp-de...@googlegroups.com
Hi Julien,

Just to comment on this. I am always struggling with these static variables and in this particular case I would find the raw strings much more intuitive and practical to use and maintain. As far as I understand, the aim of these variables is only to reduce the memory size of the passed arguments. I am not sure that for high level methods such as midroot the gain is worth making the user interface abstruse.

Anyway, if it is good practice or needed to maintain the code homogeneity of the library then please make the change. I don't intend to insist at all.

Nicolas


Julien Yann Dutheil wrote, on 03/06/2014 08:43 PM:

Julien Yann Dutheil

unread,
Mar 7, 2014, 9:29:51 AM3/7/14
to Bio++ Development Forum
Hi Nicolas,

For me this is not at all a matter of memory saving. See the following: a library user types "sum of square" instead of "sum of squares". Everything compiles fine, and only during program execution he gets an exception. If the developer writes TreeTemplateTools::SUM_OF_SQUARE instead of SUM_OF_SQUARES, he gets a compilation error, so the issue is detected much earlier.
In the current implementation, you also need to update the option strings in two places: the function itself and the documentation, whereas with static variable you do it only once (providing you name your variables consistently). Finally, I usually find it more easy to get the possible choice from variables (code completion will tell me all options starting with MIDPOINT_ROOTING_XXX), than looking at the documentation.

That is all based on my personal experience though, I have to admit I'm rather ignorant of what is considered the golden practice buy code gurus. But I tend to take example of practice from libraries that I find well implemented, such as java swing or Qt. And they do use static variables rather than string literals. And as you say, consistency throughout the library is definitely a plus (not to say a requirement).

Cheers,

Julien.




--
You received this message because you are subscribed to the Google Groups "Bio++ Development Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to biopp-devel-fo...@googlegroups.com.
To post to this group, send email to biopp-de...@googlegroups.com.

Nicolas Rochette

unread,
Mar 7, 2014, 10:56:16 AM3/7/14
to biopp-de...@googlegroups.com
I totally agree with the bug detection potential and I also think this should be given priority.

But I do find the current variables system quite abstruse and I have had problems finding the right static variables, so we should be clear in the doc.

Best,

Nicolas


Julien Yann Dutheil wrote, on 03/07/2014 03:29 PM:
Reply all
Reply to author
Forward
0 new messages