testNH vs gabranch

80 views
Skip to first unread message

Tristan Lefebure

unread,
Jul 12, 2012, 3:36:14 AM7/12/12
to testnh-h...@googlegroups.com
Dear all,

I have been using testNH (YN98, free clustering, BIC) and gabranch [1] from Hyphy to explore variation of the dn/ds on a gene family tree where non-fonctionalisation occurred independently. Briefly gabranch uses a computationally intensive genetic algorithm to find the best dn/ds branch model. The model of evolution used by gabranch is a MuseGaut94 combined with the best fit substitution model, and the best branch model is selected using cAIC.

Though the approach is quite different, I was expecting the two methods to get similar results. This is not the case, mainly regarding the number of dn/ds partitions:
    - testNH gives 7 partitions
    - gabanch gives 2 partitions

To go a little bit further I was wondering if it would be possible to get testNH to use:
   1. the same model of evolution as gabranch: MG94 but with rAT, rAC, .... set manually. I only see one parameter in the bpp MG94 model, is this the dn/ds? I guess this means that all nucleotide substitutions are equally likely. In Hyphy language the best model was a MG94x010232 (i.e. a 4 parameters substitution model)
   2. use cAIC to select the best partitions

Any help/comments much appreciated, and thanks for the nice soft!

(I should also say that hyphy does the job in 24h using 30 CPUs, while it takes 2h on a single CPU for the whole testNH pipeline!)

--
Tristan

[1] http://mbe.oxfordjournals.org/content/22/3/478.full?keytype=ref&ijkey=58b5TG2Kv933M

Julien Yann Dutheil

unread,
Jul 12, 2012, 4:29:46 AM7/12/12
to testnh-h...@googlegroups.com
Hi Tristan,

I indeed dicovered the gabranch approach after we had our paper out,
it is definitely a missing reference :s
Here are some comments:

On Thu, Jul 12, 2012 at 9:36 AM, Tristan Lefebure
<tristan....@gmail.com> wrote:
> Though the approach is quite different, I was expecting the two methods to
> get similar results. This is not the case, mainly regarding the number of
> dn/ds partitions:
> - testNH gives 7 partitions
> - gabanch gives 2 partitions

Despite the susbstituion model issue which indeed is not the same, I
can see several possible explanations:
Is the 7 partitions model a specialization of the 2 partitions models
of gabranch?
* if yes: is the cAIC of the 7 partition model lower than the 2 partition model?
** if yes: then gabranch did not test the 7 partition model, and
therefore does not explore the model space so well :s
** if no: is the BIC of the 7 model lower than the 2 partition model?
*** if yes: the difference result in the model choice criterion, cAIC
and BIC selecting distinct models
*** if no: we have an awful bug in partnh!
* if no: then the mapping failed in finding the most pertinent top
level partition, which would not be so nice and would deserve more
investigation...

>
> To go a little bit further I was wondering if it would be possible to get
> testNH to use:
> 1. the same model of evolution as gabranch: MG94 but with rAT, rAC, ....
> set manually. I only see one parameter in the bpp MG94 model, is this the
> dn/ds? I guess this means that all nucleotide substitutions are equally
> likely. In Hyphy language the best model was a MG94x010232 (i.e. a 4
> parameters substitution model)

For this aspect, I think Laurent is the one who can answer best...
MG94 is available in bpp, so it should be possible to use it?

> 2. use cAIC to select the best partitions

This is on my todo list for some time. It is not complicated, I will
try to add it asap.

>
> Any help/comments much appreciated, and thanks for the nice soft!

Many thanks for your interest!

All the best,

J.

>
> (I should also say that hyphy does the job in 24h using 30 CPUs, while it
> takes 2h on a single CPU for the whole testNH pipeline!)
>
> --
> Tristan
>
> [1]
> http://mbe.oxfordjournals.org/content/22/3/478.full?keytype=ref&ijkey=58b5TG2Kv933M
>



--
Julien Y. Dutheil, Ph-D
0 (+49) 6421 178 530

§ Max Planck Institute for Terrestrial Microbiology
Department of Organismic Interactions
Marburg -- GERMANY

§ Intitute of Evolutionary Sciences - Montpellier
University of Montpellier 2 -- FRANCE

Laurent Guéguen

unread,
Jul 12, 2012, 7:17:35 AM7/12/12
to testnh-help-forum
Hi,

> > To go a little bit further I was wondering if it would be possible to get
> > testNH to use:
> >    1. the same model of evolution as gabranch: MG94 but with rAT, rAC, ....
> > set manually. I only see one parameter in the bpp MG94 model, is this the
> > dn/ds? I guess this means that all nucleotide substitutions are equally
> > likely. In Hyphy language the best model was a MG94x010232 (i.e. a 4
> > parameters substitution model)
>
> For this aspect, I think Laurent is the one who can answer best...
> MG94 is available in bpp, so it should be possible to use it?
>

in the MG94 paper, the parameters are rho=alpha/beta the ratio between
non-synonymous
and synonymous substitutions, and the equilibirum frequencies of the
nucleotides in each
phase of the codons. This is what I implemented Bio++, so I do not
know rAT, rAC, ...
parameters in this model. Perhaps there is a reference to more
elaborate model?
I had a quick look in hyphy website, but could not find any reference
to it.
Could you give me a link please?

Laurent

Laurent Guéguen

unread,
Jul 12, 2012, 7:39:31 AM7/12/12
to testnh-help-forum
Ok, I found the MBE reference you gave in your post.
So the model denoted there MG94xHKY85 is available like this:

model=CodonDistPhasFreq(model=K80(), frequencies=F3X4)

with many parameters in it...

CodonDistPhasFreq is codon model based on a nucleotide model (here
K80),
using distance between amino acids (here simply omega for non-
synonymous/synonymous ratio),
and where the substitution rates are multiplied by equilibrium phase-
specific (or not)
nucleotide frequencies.

If I may dare two remarks:

On my on experiments, this model does not perform significantly
better than YN98-F3X4
(with the same number of parameters, the only difference is how
equilibrium frequencies
are used in the generator).

In the paper, the formula used for equilibrium frequencies that remove
stop codons pi_xyz=.... seems false to me, because the removal of a
triplet
because it is a stop does not have an identical effect on the other
codons.
For example, removing TAG will slower the departure rate from codons
that are
1 substitution away from it (TAC, TGG, etc), but none on others.

A+,
L

Tristan Lefebure

unread,
Jul 12, 2012, 12:17:23 PM7/12/12
to testnh-h...@googlegroups.com
On Thursday, July 12, 2012 1:39:31 PM UTC+2, Laurent Guéguen wrote:
Ok, I found the MBE reference you gave in your post.
So the model denoted there MG94xHKY85 is available like this:

model=CodonDistPhasFreq(model=K80(), frequencies=F3X4)

with many parameters in it...

CodonDistPhasFreq is codon model based on a nucleotide model (here
K80),
using distance between amino acids (here simply omega for non-
synonymous/synonymous ratio),
and where the substitution rates are multiplied by equilibrium phase-
specific (or not)
nucleotide frequencies.


Hi Laurent,

I will try to run this model in testNH... and will be back to you (could actually knock at your door, we are in the same department!)

 
If I may dare two remarks:

On my on experiments,  this model does not perform significantly
better than YN98-F3X4
(with the same number of parameters, the only difference is how
equilibrium frequencies
are used in the generator).

In the paper, the formula used for equilibrium frequencies that remove
stop codons pi_xyz=.... seems false to me, because the removal of a
triplet
because it is a stop does not have an identical effect on the other
codons.
For example, removing TAG will slower the departure rate from codons
that are
1 substitution away from it (TAC, TGG, etc), but none on others.

A+,
L




Some more references on hyphy codon models for example here:
http://dx.doi.org/10.1093/molbev/msi105

In the hyphy way, it is common to first select the best fitting REV model, and then combine it with the MG94. I don't have any personal preferences but there is several papers favoring the MG94 models against the YN98. This one comes to my mind:
http://dx.doi.org/10.1534/genetics.108.092254

Thanks!

--
Tristan

Tristan Lefebure

unread,
Jul 12, 2012, 12:42:05 PM7/12/12
to testnh-h...@googlegroups.com


On Thursday, July 12, 2012 10:29:46 AM UTC+2, Julien Yann Dutheil wrote:
Hi Tristan,

I indeed dicovered the gabranch approach after we had our paper out,
it is definitely a missing reference :s
Here are some comments:

On Thu, Jul 12, 2012 at 9:36 AM, Tristan Lefebure
wrote:
> Though the approach is quite different, I was expecting the two methods to
> get similar results. This is not the case, mainly regarding the number of
> dn/ds partitions:
>     - testNH gives 7 partitions
>     - gabanch gives 2 partitions

Despite the susbstituion model issue which indeed is not the same, I
can see several possible explanations:
Is the 7 partitions model a specialization of the 2 partitions models
of gabranch?
* if yes: is the cAIC of the 7 partition model lower than the 2 partition model?
** if yes: then gabranch did not test the 7 partition model, and
therefore does not explore the model space so well :s
** if no: is the BIC of the 7 model lower than the 2 partition model?
*** if yes: the difference result in the model choice criterion, cAIC
and BIC selecting distinct models
*** if no: we have an awful bug in partnh!
* if no: then the mapping failed in finding the most pertinent top
level partition, which would not be so nice and would deserve more
investigation...


I am pretty sure that gabranch didn't test so many partitions. I actually don't think any population of the genetic algo add more than 3 partitions.

May be the best is to run gabranch on the test data set that comes with testNH using the datamonkey server. It will run on small dataset on their cluster.

http://www.datamonkey.org/dataupload.php

The MantellidDataset.phy crashes with most of my scripts, you mentioned that a fasta file was available somewhere, right?
 
Cheers,
--Tristan

Julien Yann Dutheil

unread,
Jul 12, 2012, 5:02:46 PM7/12/12
to testnh-h...@googlegroups.com
On Thu, Jul 12, 2012 at 6:17 PM, Tristan Lefebure
<tristan....@gmail.com> wrote:

> I will try to run this model in testNH... and will be back to you (could
> actually knock at your door, we are in the same department!)

You guys... future generations will laugh at us!

Julien Yann Dutheil

unread,
Jul 12, 2012, 5:05:18 PM7/12/12
to testnh-h...@googlegroups.com
On Thu, Jul 12, 2012 at 6:42 PM, Tristan Lefebure
<tristan....@gmail.com> wrote:

> May be the best is to run gabranch on the test data set that comes with
> testNH using the datamonkey server. It will run on small dataset on their
> cluster.
>
> http://www.datamonkey.org/dataupload.php
>
> The MantellidDataset.phy crashes with most of my scripts, you mentioned that
> a fasta file was available somewhere, right?

It crashes because of the phylip format? Then bppSeqMan from bppsuite
should allow you to convert it to fasta...

Best,

Julien.

>
> Cheers,
> --Tristan

Laurent Guéguen

unread,
Jul 13, 2012, 4:03:40 AM7/13/12
to testnh-help-forum
Yes indeed,

so sure it may be more convenient to talk about it in real life.

Otherwise, about the comparison between programs, I have not been
into gabranch, but isn't it more designed in a population genetic
background?
So perhas the hypotheses and models are quite different than in
testnh?

Cheers,
L


On 12 juil, 23:02, Julien Yann Dutheil <jy.duth...@gmail.com> wrote:
> On Thu, Jul 12, 2012 at 6:17 PM, Tristan Lefebure
>

Tristan Lefebure

unread,
Jul 16, 2012, 5:59:43 AM7/16/12
to testnh-h...@googlegroups.com
Hi there,

I submitted the testnh-1.0.0/examples/MantellidFrogs/ data set to datamonkey.

The job id is: upload.634429168213138.1

The best AIC substitution model is (012343)  (i.e. REV model with rCG = rGT):
http://www.datamonkey.org/spool/upload.634429168213138.1_model.php

Gabranch is running here:
http://hyphy.ucsd.edu/cgi-bin/Datamonkey2007/mpiJobStatus.pl?fileName=upload.634429168213138.1&task=gabranch

--
Tristan

Tristan Lefebure

unread,
Jul 16, 2012, 6:08:28 AM7/16/12
to testnh-h...@googlegroups.com

On Friday, July 13, 2012 10:03:40 AM UTC+2, Laurent Guéguen wrote:
Otherwise, about the comparison between programs, I have not been
into gabranch, but isn't it more designed in a population genetic
background?
So perhas the hypotheses and models are quite different than in
testnh?

They use a genetic algorithm but apply it to find variation of the dn/ds along a tree using branch codon models. To me the objective and general framework, not the methods (mapping+clustering vs GA), are very similar (though testnh can do a lot more). Kind of using two different heuristics to solve the same problem...
--
Tristan

Tristan Lefebure

unread,
Jul 17, 2012, 12:08:19 PM7/17/12
to testnh-h...@googlegroups.com
The gabranch run is over. It found 5 partitions. Following your MBE paper there was 2 and 5 partitions for the free and join models, respectively. The dn/ds are:
- partnh free: 0.044, 0.080
- partnh join: 0.041, 0.044, 0.045, 0.05, 0.080
- gabranch: 0.049, 0.060, 0.083, 0.104, 0.129

So some differences...

--
Tristan

Julien Yann Dutheil

unread,
Jul 17, 2012, 1:09:38 PM7/17/12
to testnh-h...@googlegroups.com
Hi Tristan,

Maybe I missed an episode, but if I understand correctly, the substitution model is not the same (YN98 vs MG94), as well as the model selection criterion (BIC vs AICc) ? And then come the ultimate question, if the outputs are different, who is doing better? ;)

J.

PS: adding cAIC to partnh is still on my todo list... :s

--
You received this message because you are subscribed to the Google Groups "testnh-help-forum" group.
To unsubscribe from this group, send email to testnh-help-fo...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Julien Yann Dutheil

unread,
Jul 17, 2012, 1:52:21 PM7/17/12
to testnh-h...@googlegroups.com
Ok, have added AICc as a possible selection criterion. Would be interesting to use the selected partitions by gabranch to build the corresponding model in bppml, and then check if it has a better AICc than the one selected by partnh (with the new AICc option criterion). At least that would give us some information about the ability of partnh to find the best possible partition...

Best,

J.

PS: eventually, if gabranch outputs its partitions in a readable format, I can try to run the corresponding NH model with bppml.

Tristan Lefebure

unread,
Jul 19, 2012, 5:02:26 AM7/19/12
to testnh-h...@googlegroups.com

PS: eventually, if gabranch outputs its partitions in a readable format, I can try to run the corresponding NH model with bppml.


Hello,

There will be some formatting modification to do, but this file is probably the best start:

http://www.datamonkey.org/spool/upload.634429168213138.1.branch_classes

For a given tree, with node names annotated as bootstrap scores, it tells you which nodes falls into which partition.

--
Tristan
 

Julien Yann Dutheil

unread,
Jul 19, 2012, 4:55:11 PM7/19/12
to testnh-h...@googlegroups.com
Hi Tristan,

Looks like the tree you are using is not the same as the one we do? Could it be that gaBranch also optimizes topologies? That would explain the difference in execution times!

Cheers,

Julien.

--
You received this message because you are subscribed to the Google Groups "testnh-help-forum" group.
To unsubscribe from this group, send email to testnh-help-fo...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Tristan Lefebure

unread,
Jul 20, 2012, 7:56:11 AM7/20/12
to testnh-h...@googlegroups.com
Hi Julien,
As I recall, wehn you use their websever, they actually make a NJ tree. When you run gabranch locally, you provide the tree. And no, gabranch does not do any topology optimisation...

I am trying to compare the partitioning on my dataset, how do you guys import the NHX trees in R?


Best,
--
Tristan


Julien Yann Dutheil

unread,
Jul 20, 2012, 2:57:09 PM7/20/12
to testnh-h...@googlegroups.com
On Fri, Jul 20, 2012 at 1:56 PM, Tristan Lefebure <tristan....@gmail.com> wrote:
Hi Julien,
As I recall, wehn you use their websever, they actually make a NJ tree. When you run gabranch locally, you provide the tree. And no, gabranch does not do any topology optimisation...

So what did you use for testnh? The tree provided on the example directory or an NJ tree? 

I am trying to compare the partitioning on my dataset, how do you guys import the NHX trees in R?

We do not do that as far as I know... The NHX tree was used to make sure node ids are consistent between mapnh and partnh. NHX is deprecated for PhyloXML, so it is poorly supported by other software. It is a bit of a pitty as for many usage it is highly convenient, whilst being much lighter than PhyloML... 

Best,

J.


Best,

--
Tristan


--
You received this message because you are subscribed to the Google Groups "testnh-help-forum" group.
To unsubscribe from this group, send email to testnh-help-fo...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Tristan Lefebure

unread,
Jul 23, 2012, 10:28:21 AM7/23/12
to testnh-h...@googlegroups.com

On Fri, Jul 20, 2012 at 1:56 PM, Tristan Lefebure wrote:
Hi Julien,
As I recall, wehn you use their websever, they actually make a NJ tree. When you run gabranch locally, you provide the tree. And no, gabranch does not do any topology optimisation...

So what did you use for testnh? The tree provided on the example directory or an NJ tree? 

Regarding the frog dataset, I did not run testNH, just used your MBE paper results. When I get a second (probably in september), I'll run this dataset on a cluster using the distributed example tree.

Would you recommend another comparison? The traditional lyzosyme one?

--
Tristan

Tristan Lefebure

unread,
Aug 31, 2012, 5:17:39 AM8/31/12
to testnh-h...@googlegroups.com
Hello,

Sorry, slowly coming back to this post:

Unless I made a mistake, It looks like that the cAIC criteria does favor a much simpler model on my dataset (3 vs 10 partitions):


Threshold NbPartitions LogL DF AIC BIC AIC-c
0.168804 10 -13401.3 213 27228.5 27987.8 29168.1595744681
0 3 -13446.5 206 27305.1 28039.4 28884.4333333333

I used the following cAIC formula:

cAIC = AIC + 2*DF*(DF-1)/(nbrCodon - DF -1)

with nbrCodon = 261 codons

This is quite a dramatic change regarding the number of partitions (though not so different biologically), and the results are much closer to the GAbranch one. What do you think? Is there an objective way to prefer cAIC over BIC?

You also mentioned that you implemented cAIC in testNH, should I use the git version to have access to it?
(which I guess I'll get with a 'git clone http://biopp.univ-montp2.fr/git/testnh')

Thanks!!!
--
Tristan

Julien Yann Dutheil

unread,
Aug 31, 2012, 5:27:05 AM8/31/12
to testnh-h...@googlegroups.com
Hi Tristan,

That is very good to know! No, no a priori criterion to favor one over the other. My experience is that BIC works better than AIC, but I only recently came to hear about AICc.

AICc is on the git version, yes.

Cheers,

J.

--
 
 
 

Tristan Lefebure

unread,
Sep 3, 2012, 4:22:12 AM9/3/12
to testnh-h...@googlegroups.com
Hi Julien,

I have re-run partNH using the latest git version, and there is few things that I am not getting:

1. The likelihood are much different, example:

NbPartitions LogL 1.0 LogL git
1 -13678.4 -13629.9
3 -13446.5 -13385
4 -13438.5 -13365.8
5 -13418.8 -13339.8
6 -13414.1 -13331.2
7 -13412.1 -13325.2
8 -13408.7 -13318.4
9 -13407.9 -13313.4
10 -13401.3 -13303.3
11 -13400.2 -13298.8
12 -13399.5 -13294.7
13 -13399.3 -13292.1
14 -13398.3 -13275.5
15 -13397.7 -13270.7
16 -13396.9 -13265.2
17 -13396.5 -13257.5
18 -13396.1 -13253.3
19 -13396
20 -13394.5

Looks like the latest testNH always finds lower likelihood (I've tested this on two data-sets), which in return have a deep impact on the best model (whichever criteria you use): the best model have much less partitions.


2. The number of degree of freedom (DF) is also different:

NbPartitions DF 1.0 DF git
1 204 204
3 206 224
4 207 234
5 208 244
6 209 254
7 210 264
8 211 274
9 212 284
10 213 294
11 214 304
12 215 314
13 216 324
14 217 334
15 218 344
16 219 354
17 220 364
18 221 374
19 222
20 223

A bug?

3. The latest PartNH does not return any BIC value for the homogeneous model (partition = 1). Example:

Threshold NbPartitions LogL DF AIC
AICc BIC
0 1 -13629.9 204 27667.8 28395
0 3 -13385 224 27218 28711.5 28016.4

4. The reported AICc values are not calculated using:


cAIC = AIC + 2*DF*(DF-1)/(nbrCodon - DF -1)

example:

Threshold NbPartitions LogL DF AIC
AICc BIC hand made AICc
0 1 -13629.9 204 27667.8 28395
29146.8
0 3 -13385 224 27218 28711.5 28016.4 29993.1111111111


Thanks for your help....

--
Tristan

Julien Yann Dutheil

unread,
Sep 3, 2012, 4:28:45 AM9/3/12
to testnh-h...@googlegroups.com
Hi Tristan,

In the latest git verison of the libs, the syntax have changed, and the examples in testnh have not been updated yet. So it can be that it is not the same model which is ran... I investigate this.

cheers,

J.

Tristan Lefebure

unread,
Sep 7, 2012, 5:03:33 AM9/7/12
to testnh-h...@googlegroups.com
Hi Julien,
Is there anything I can do to help?
In the mean time, should I stay away from the git version and use the stable testNH?
Thanks!
--
Tristan

Julien Yann Dutheil

unread,
Sep 7, 2012, 4:04:30 PM9/7/12
to testnh-h...@googlegroups.com
Hi Tristan,

I am sorry to take so long to reply... I can confirm the pb is due to a syntax update. You can definitely use the git version, but you have to update the option file a bit as some of the parameter names have changed. I am in the process of updating the examples on git, but did not have time to finish this properly and to commit the changes. I am in ECCB next week, maybe I find a bit of time to do that there, but I cannot promise :s I will let you know asap...

Julien.

Tristan Lefebure

unread,
Sep 13, 2012, 10:00:15 AM9/13/12
to testnh-h...@googlegroups.com
Hi Julien,
Sorry I am too impatient: Is the syntax update on the definition of the rate distribution?

I looked into the lastest code and found that you changed:
  rate_distribution = Uniform
into:
  rate_distribution = Constant



Thanks very much!!

--
Tristan

Julien Yann Dutheil

unread,
Sep 13, 2012, 10:59:08 AM9/13/12
to testnh-h...@googlegroups.com
That is actually an even more recent one... the main pb is the parameter names. I did not have time to continue the update because was out of the lab and just came back.

Cheers,

J.

Florent Lassalle

unread,
Aug 28, 2014, 9:49:30 AM8/28/14
to testnh-h...@googlegroups.com, florent lassalle
Hello everybody,

nice to see that when I search topics about gabranch on google I find a thread by people from my former lab... 

So I was curious to know if 2 years after, you have reached a more precise idea of the pros and cons of using gabranch vs. testnh in this task of partitioning the tree into several classes with a specific omega for each.

I was considering using gabranch but I'm usually reluctant to use Hyphy as the HBL syntax is so tedious and the pipeline hard to use in practice even for already coded scripts, the GA is generally long to converge and there is the big problem that the GA results depend on the size of the population of models that were tried (which in turn depend on the number of CPUs you allocated to the GA...), so testnh sounds at start a good alternative :).

Can you please feed me some advice on that please ?

Cheers,

Florent

- - -

Florent Lassalle
Post-Doctoral Research Associate
University College London
UCL Genetics Institute - Dept. of Genetics, Ecology & Environment
Gower Street
London WC1E 6BT
Tel: +44 203 108 4229 (int. 54229)
florent....@ucl.ac.uk

Florent Lassalle

unread,
Aug 28, 2014, 1:04:26 PM8/28/14
to testnh-h...@googlegroups.com, florent....@ucl.ac.uk
I must add that GAbrach is actually no more supported in HyPhy, because they say that their BranchSiteREL algorithm does the intended work instead (cf. SKP : http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1305326766).
the new algorithm is described here http://mbe.oxfordjournals.org/content/28/11/3033
I don't know what this BranchSiteREL is worth either, so if you have an opinion about it please share ;)

Florent
Reply all
Reply to author
Forward
0 new messages