LG4M and LG4X models in RAxML

Alexandros Stamatakis

Mar 11, 2013, 7:14:59 PM

to ra...@googlegroups.com, Jacek Kominek

Dear All,

I just pushed a version of standard RAxML that implements the LG4M model
(see below) to github.

The model name is LG4, i.e. you can invoke it like this:

./raxmlHPC-AVX -s prot.phy -m PROTGAMMALG4 -p 12345 -n L23
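
For a side-by-side test against plain LG, the same invocation can be repeated with PROTGAMMALG. Below is a minimal Python sketch that only builds the two command lines and does not run anything. The helper name and the run names are made up for illustration; the binary, alignment, seed and flags are taken from the command above.

```python
import shlex


def raxml_command(model, run_name, binary="./raxmlHPC-AVX",
                  alignment="prot.phy", seed=12345):
    """Build a RAxML command line using only the flags shown in the post."""
    return [binary, "-s", alignment, "-m", model,
            "-p", str(seed), "-n", run_name]


# Hypothetical run names; PROTGAMMALG4 is the model name given in the post,
# PROTGAMMALG is the standard single-matrix LG model.
commands = {
    "LG4": raxml_command("PROTGAMMALG4", "lg4_test"),
    "LG": raxml_command("PROTGAMMALG", "lg_test"),
}

for label, cmd in commands.items():
    # Print a copy-pasteable shell line for each run.
    print(label, "->", shlex.join(cmd))
```

Running both on the same alignment with the same seed should make the final likelihoods and run times directly comparable.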

Note that the actual implementation is still pretty slow and
experimental, since I need to decide if it's worth optimizing this.

I'd appreciate some feedback on this, before I continue.

Alexis

>
> On Wed, 2012-09-26 at 13:00 +0200, Jacek Kominek wrote:
> > Dear Alexis,
> > I wanted to ask your opinion about the new models being introduced by
> > Olivier in his recent paper:
> >
> > Le SQ, Dang CC, Gascuel O. Modeling Protein Evolution with Several Amino
> > Acid Replacement Matrices Depending on Site Rates. Mol Biol Evol. 2012
> > Jan 10;29(10):2921–36.
> >
> > The LG4M and LG4X seem to have some merit, do you have any plans to add
> > them (or any mixture-based models, actually) to RAxML?
> >
> > Best,
> > -Jacek

>

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University
of Arizona at Tucson

www.exelixis-lab.org

JK

Mar 13, 2013, 9:35:52 AM

to ra...@googlegroups.com, Jacek Kominek, Alexandros...@gmail.com

Thanks for the update Alexis,

I did some initial testing with my own data and LG4 seems to be working well, with significantly better likelihoods (by 10-150 log-likelihood units) at the cost of 2-4x longer run times. One thing I noticed is that the reported "Base frequencies" tables differ between LG4 and LG, with some of the differences being quite big (up to 3-5x). I assume that in LG4 each gamma rate category has its own set of equilibrium frequencies, but which one is actually reported, the first one?

More strangely, though, I noticed that sometimes I would get a worse likelihood with LG4 than with LG. This is strange because I thought LG4 and LG were nested models, in a similar way to, say, HKY and GTR, so you would always get a better (or at least equal) likelihood with the more complex model, even if the gain wasn't significant. I attached the offending datafile so that you can take a look yourself. It may be a very specific case (a highly gapped but otherwise highly identical alignment), as I could not reproduce this 'effect' with other data yet, but I just thought I'd share it. Given enough eyeballs... etc.

-Jacek

PS: The attachment is a fasta file, but google groups didn't like it, so I had to change the extension.

test1_aa.txt

Alexandros Stamatakis

Mar 13, 2013, 12:50:29 PM

to JK, ra...@googlegroups.com, Jacek Kominek, Olivier GASCUEL

Hi Jacek,

> I did some initial testing with my own data and LG4 seems to be
> working well, with significantly better likelihoods (in the 10-150
> range) for 2-4x longer calculation times.

Okay.

> One thing I've noticed was that the reported "Base frequencies" tables
> are different when you use LG4 and LG, with some of the differences
> being quite big (up to 3-5x). I assume that in LG4 each gamma rate has
> its own equilibrium frequency matrix, but which one is actually
> reported, the first one?

The base freqs are not printed out correctly yet. I first wanted
somebody to test this before I finalize and optimize the LG4
implementation.

> More strangely though, I noticed that sometimes I would get a worse
> likelihood with LG4 than with LG.

I don't think that this is a surprise. I have added Olivier Gascuel to
the CC. Olivier could you let us know what you think?

> It is strange because I thought LG4 and LG are nested models, in a
> similar way as, say, GTR and HKY, so you would always get a better
> likelihood with a more complex model, even if the gain wouldn't be
> significant.

It may, however, be that a single LG matrix just fits the data better
across all, or the majority of, the discrete GAMMA rate categories than
the 4 matrices of LG4.

If one really wants to get better likelihoods, one could try an ML
assignment of the 4 LG models to the 4 rates, i.e., test all possible
assignments, which should be 4^4 = 256, and then just keep the best one.
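
The exhaustive search described above could be sketched in Python roughly as follows. This is only an illustration, not what RAxML does: the matrix labels are placeholders, and `score` stands for a hypothetical callback that would return the log-likelihood of a given assignment (here replaced by a toy function).

```python
from itertools import product

MATRICES = ("LG4_1", "LG4_2", "LG4_3", "LG4_4")  # placeholder labels
N_RATES = 4                                      # discrete GAMMA categories


def best_assignment(score):
    """Try every assignment of a matrix to each rate category and keep
    the one with the highest score (log-likelihood)."""
    # product(..., repeat=4) allows the same matrix for several rates,
    # which is what gives 4^4 = 256 candidates.
    candidates = list(product(MATRICES, repeat=N_RATES))
    best = max(candidates, key=score)
    return best, score(best)


def toy_score(assignment):
    """Toy stand-in for a real likelihood evaluation: rewards using
    matrix i for rate category i."""
    matches = sum(1 for i, m in enumerate(assignment) if m == MATRICES[i])
    return -1000.0 + 10.0 * matches


print(len(list(product(MATRICES, repeat=N_RATES))))  # number of candidates
print(best_assignment(toy_score))
```

If each matrix had to be used exactly once, the search space would shrink to 4! = 24 permutations (`itertools.permutations`), but the count of 256 above implies that repeats are allowed.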

> I attached the offending datafile so that you may take a look
> yourself. It may be a very specific case (highly gapped but otherwise
> highly identical alignment) as I could not reproduce this 'effect'
> with other data yet, but I just thought I'd share it. Given enough
> eyeballs...etc.

Thanks.
>
Alexis

Alexis

Mar 13, 2013, 10:12:07 PM

to ra...@googlegroups.com, Jacek Kominek, Alexandros...@gmail.com

Hi Jacek,

I just pushed a (hopefully) fully functional and fully optimized version of the LG4 model to github.

Alexis

JK

Mar 14, 2013, 5:52:32 AM

to ra...@googlegroups.com, Jacek Kominek, Alexandros...@gmail.com

Ok, thanks again for the effort. Testing, testing. (So far, so good: +F works, pthreads works, bootstrapping works.)

On a side note, I found another case where using LG4 actually results in worse likelihoods than vanilla LG. The difference is only ~10 lnL units, like in the previous case, but since both models have the same number of parameters (if I understand it correctly), it is as significant as it looks. The datafile was a completely different one this time, so it seems to be a more general effect, and it shows why model testing should be a necessary step in any ML-based phylogenetic analysis.
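
To spell out the "same number of parameters" point: under a criterion such as AIC, the parameter penalty cancels when both models have the same parameter count, so the comparison reduces to the raw likelihood difference. A small Python sketch with made-up numbers (the lnL values and the parameter count `k` are purely illustrative assumptions):

```python
def aic(lnl, n_params):
    """Akaike information criterion: lower is better."""
    return 2.0 * n_params - 2.0 * lnl


# Hypothetical log-likelihoods, 10 lnL units apart as in the case described.
lnl_lg = -12345.0
lnl_lg4 = -12355.0

# Assumed identical for both models; the exact value does not matter
# because it cancels in the difference.
k = 1

delta_aic = aic(lnl_lg4, k) - aic(lnl_lg, k)
print(delta_aic)  # 20.0: positive, so plain LG is preferred here
```

With equal `k`, `delta_aic` is always exactly -2 times the lnL difference, whatever value `k` takes.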

While we're at it - a random but naturally flowing thought: is it possible to use PROTCAT instead of PROTGAMMA and effectively turn the model into LG4X?

-Jacek

Alexandros Stamatakis

Mar 14, 2013, 12:42:45 PM

to JK, ra...@googlegroups.com, Jacek Kominek

Hi Jacek,

> Ok, thanks again for the effort. Testing, testing. (So far, so good,
> +F works, pthreads works, bootstrapping works)

:-)

>
> On a side note, I found another case where using LG4 would result in
> actually worse likelihoods than the vanilla LG. The difference is only
> ~10 lnl units, like in the previous case, but since both models have
> the same number of parameters (if I understand it correctly), it is as
> significant as it looks. The datafile was a completely different one
> this time, so it seems to be a more general effect and shows why model
> testing should be a necessary step in any ML-based phylogenetic
> analysis.

Sure thing, as I said, I'd expect LG4 to perform worse in some cases.

> While we're at it - a random but naturally flowing thought: is it
> possible to use PROTCAT instead of PROTGAMMA and effectively turn the
> model into LG4X?

Kind of. I'd assume an unpartitioned dataset and do an ML assignment of
the 4 matrices to individual sites. Implementing this will be a pain,
though.
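
Conceptually, the per-site assignment could look like the Python sketch below. It assumes a hypothetical table `site_lnl[s][m]` holding the log-likelihood of site `s` under matrix `m` on a fixed tree; the numbers are invented, and a real implementation would also have to re-optimize branch lengths and rates, which this ignores.

```python
def assign_matrices(site_lnl):
    """For each site, pick the matrix index with the highest per-site
    log-likelihood and return (assignment, total lnL)."""
    assignment = [max(range(len(row)), key=row.__getitem__)
                  for row in site_lnl]
    total = sum(row[m] for row, m in zip(site_lnl, assignment))
    return assignment, total


# Invented per-site log-likelihoods: 3 sites x 4 matrices.
site_lnl = [
    [-3.2, -2.9, -4.1, -3.5],
    [-1.0, -1.4, -0.8, -1.1],
    [-5.0, -5.0, -4.7, -4.9],
]

assignment, total = assign_matrices(site_lnl)
print(assignment, total)
```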

Alexis

Karen

Nov 29, 2013, 7:26:58 AM

to ra...@googlegroups.com, Jacek Kominek, Alexandros...@gmail.com

Dear all,

Is it also possible to invoke the LG4 model in a partitioned dataset for just 1 of, e.g., 10 partitions?
Rob Lanfear also implemented LG4M in PartitionFinder and mentioned that this might not work in RAxML or ExaML...?

cheers Karen

Alexandros Stamatakis

Nov 29, 2013, 7:41:18 AM

to Karen, ra...@googlegroups.com

Hi Karen,

> Is it also possible to invoke the LG4 model in a partitioned dataset for
> just 1 of, e.g., 10 partitions?
> Rob Lanfear also implemented LG4M in PartitionFinder and mentioned
> that this might not work in RAxML or ExaML...?

In RAxML it works, you'd just specify it in the model file as any other
model, e.g., LG4M or LG4X.
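
As an illustration, a model (partition) file mixing LG4M with other protein models could be generated like this in Python. The partition names and site ranges are invented; the line layout follows the usual RAxML convention of "MODEL, name = start-end".

```python
def partition_lines(partitions):
    """Format (model, name, start, end) tuples as RAxML partition-file lines."""
    return [f"{model}, {name} = {start}-{end}"
            for model, name, start, end in partitions]


# Hypothetical partitions: only gene2 uses LG4M.
partitions = [
    ("LG", "gene1", 1, 300),
    ("LG4M", "gene2", 301, 550),
    ("WAG", "gene3", 551, 800),
]

print("\n".join(partition_lines(partitions)))
```

The resulting text would be saved to a file and passed to RAxML with the -q option.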

In ExaML only LG4M is implemented at present.

Alexis

Karen

Nov 29, 2013, 7:54:36 AM

to ra...@googlegroups.com, Karen

Hi Alexis,

cool, thanks a lot for the fast reply!! and all have a nice weekend :)

Karen
