RY coding in Raxml?

707 views
Skip to first unread message

Ruhfel

unread,
May 24, 2012, 1:00:24 PM5/24/12
to raxml
Dear all,

I am wondering how to implement RY coding in raxml.

This normally means that all purines (A or G) are coded as Rs and all
pyrimidines (C or T) are coded as Ys.

But raxml doesn't handle Rs and Ys, they would be treated as missing
data.

So here is the question.... Could I just go through the original
alignment and code as R or Y with a perl script, and then on a second
pass change all Rs to As and all Ys to Cs so that there would be
characters that raxml would accept. Or something like that?

Second question, would this new two character matrix (only As and Cs)
be best analyzed in raxml by the standard GTRGAMMA model?

Best,
-Brad

Ruhfel

unread,
May 24, 2012, 2:06:54 PM5/24/12
to raxml
oops, OK raxml can handle Rs and Ys. But if I recode the data this
way, what model should I use?

-Brad

Alexandros Stamatakis

unread,
May 27, 2012, 4:53:20 PM5/27/12
to ra...@googlegroups.com
Hi Brad,

That's right RAxML can handle Rs and Ys, the question about the model is a good one,
but since there is only GTR that's the only you can use. I guess that, if tha dataset is large enough
GTR should do. If the underlying model is less complex, you'd expect the GTR parameters to be estimated
accordingly, i.e., when the model is HKY85, the GTR rate estimates should just be HKY85-like.

In other words, I wouldn't worry too much about the model.

Nonetheless, you can also re-encode Rs and Ys as 0 and 1 and then use a simpler binary model.

Alexis
--
Dr. Alexandros Stamatakis
Research Group Leader HITS, Heidelberg
Adjunct Professor, Dept. of Ecology and Evolutionary Biology, University of Arizona at Tucson
www.exelixis-lab.org

Ruhfel

unread,
May 27, 2012, 5:24:46 PM5/27/12
to raxml
Yes, thanks Alexis.

I am wondering what the difference is between the two approaches:

Rs and Ys with GTRGAMA vs 0s and 1s with BIN GAMMA.

Are they essentially the same thing? Or is one more appropriate? This
was my intended question in the original post, I just didn't state it
as clearly as I hoped.

Best,
-Brad

On May 27, 4:53 pm, Alexandros Stamatakis

Alexandros Stamatakis

unread,
May 27, 2012, 5:25:48 PM5/27/12
to ra...@googlegroups.com
There should be a difference, with RY encoding you still have more parameters in the model.

Alexis

Ravi Vydianathan

unread,
Sep 16, 2013, 10:52:37 PM9/16/13
to ra...@googlegroups.com, Alexandros...@gmail.com
Dear Alexis,

I just came across this group and am not sure if this is the right place to post my query (apologies in advance for that). I am working on a set of sequences which always cluster together in a phylogenetic tree and the explanation for this is being stated as "that GC-bias is causing the sequences to cluster together". I wanted to know your opinion on this in general and then comes the second part of my query. RY-coding is supposedly able to take care of compositional bias in sequences and "enhance" historical signal (read this somewhere). Do you think this would be able to take care of GC-bias as well? If yes, then the effect on the tree should be negated, correct? If I still see the sequences clustering together after RY-encoding, that means something else is responsible for that - e.g. a different duplication history maybe. I hope to hear your views on this.

Thanks and regards,

Ravi

Grimm

unread,
Sep 17, 2013, 8:30:23 AM9/17/13
to ra...@googlegroups.com
Hi Brad,


"Rs and Ys with GTRGAMA vs 0s and 1s with BIN GAMMA."

If you really want a RY coding, you should use BIN-GAMMA.

As Alexis said, RY in GTRGAMMA may give different results. You will essentially make your estimate based on a 4x4 nucleotide transition (transformation) matrix. The substitution probability you are interested in will then the probability to change from (1,0,1,0)  to (0,1,0,1), based on a 4x4 matrix that includes the six substitution probabilities for all possible substitutions (transversions and transitions alike)

RY coding however was meant to ignore the effective nucleotides, and to recode a matrix so that only transversions will be recognised as mutations, and transitions as neutral.
In a ML framework you hence only want to optimise a single substitution probability, for the transition of a purine (A or G = 0) into a pyrimidine (C or T = 1). Also you assume that the probability of all transitions are 1, and that it makes no difference whether A or G is changed e.g. into a C (and vice versa). If you would use GTRGAMMA you would allow the programme to optimise different substitution probabilities for A<->C vs. G<->C; and also to opt for values between 0 and 1 for the transitions.

So, BIN-GAMMA would be the only appropriate setting (if one feels RY recoding is appropriate at all)

Also you may get problems in estimating any model with a matrix that is just composed of R and Y.
RAxML used to optimise the substitution matrix parameters based only one non-ambiguous base calls (A, C, G, T) and ignored in this step the ambiguity codes; may be this is still the case, Alexis?

Jacob Berv

unread,
Nov 22, 2017, 2:39:46 PM11/22/17
to raxml
Hi everyone-
Responding to an old thread but hopefully this will get bumped up and noticed. I've been running some experiments with RY coding under a GTR model in raxml and I've noticed a particular behavior that I am having a hard time understanding. When I compare branch lengths from an RY coded dataset to the same (nucleotide coded) dataset, the branch lengths from the RY coded dataset are systematically longer -- when I would have guessed that they should be systematically shorter. RY coding reduces the number of parsimony informative sites in my dataset by about 1/2, so I would expect that with ~1/2 of the number of 'substitutions' -- branch lengths should be shorter-- but they are in fact longer. What am I missing here? Do I need to re-scale the branch lengths from one of theses analyses in some way to make them comparable? Would this expectation hold under the BIN-GAMMA model? (I haven't run this yet -- but it is still curious to me that this affect appears under a GTR model). For terminal branches, RY coding seems to increase their average length by a factor of ~1.3. For root to tip branch lengths, the effect is much smaller and more stochastic (factor of ~1.025, on average). Any ideas? Why aren't branch lengths cut in 1/2 (on average), under this scheme?

Best,
Jacob Berv
Message has been deleted

Grimm

unread,
Nov 23, 2017, 7:28:18 AM11/23/17
to raxml
Hi Jacob

(second try, found an important typo, so I deleted the original answer)


Am Mittwoch, 22. November 2017 20:39:46 UTC+1 schrieb Jacob Berv:
Hi everyone-
Responding to an old thread but hopefully this will get bumped up and noticed. I've been running some experiments with RY coding under a GTR model in raxml and I've noticed a particular behavior that I am having a hard time understanding.

This means you fed a RY recoded matrix into RAxML, not an accordingly recoded binary matrix? Than this may explain it, the standard GTR model treats a R as A/G – terminal likelyhood being a (1,0,1,0) vector – and Y as C/T: (0,1,0,1). GTR optimises a 4x4 substitution model and hence evaluate the change from R to Y as quite dramatic, having only R and Y to start with (depending on the structure of your data)
Whereas based on the nucleotide matrix, the optimisation may find higher general probability (i.e. resulting in lower branch lengths) for A or G changing into C or T.


 
When I compare branch lengths from an RY coded dataset to the same (nucleotide coded) dataset, the branch lengths from the RY coded dataset are systematically longer -- when I would have guessed that they should be systematically shorter. RY coding reduces the number of parsimony informative sites in my dataset by about 1/2, so I would expect that with ~1/2 of the number of 'substitutions' -- branch lengths should be shorter-- but they are in fact longer.

Note that the branch lengths are not mirroring changes like in a parsimony tree, but (negative ln of) probabilities for a change. So the number of parsimony-informative sites (distinct alignment patterns under ML) cannot be translated into more or less change (longer/shorter) branch-lengths. It just depends whether the mutation as the variable site are likely to occur or not.

The increasing branch-lengths just mean that by changing from nucleotide-based to RY, substitution is less likely too happen in general. Which can mean that your A/G and C/T mutation patterns are quite well sorted.

From a principal point of view, if you want a RY-based tree (following the basic idea behing RY-coding), you should re-score your matrix into a binary (0,1) matrix and run with BINGAMMA.

Cheers, Guido

Jacob Berv

unread,
Nov 24, 2017, 10:35:40 PM11/24/17
to ra...@googlegroups.com
Dear Guido,
Thanks so much for this detailed response. That makes alot of sense. It’s pretty amazing that despite losing 1/2 of the information, raxml is able to essentially reconstruct the same branch lengths with RY coding as it is with nucleotide coding — for root to tip or with terminal branch lengths, R^2 ~0.98. Amazing!

I ran the RY coded dataset as a 01 matrix with the BINGAMMA model, and the root to tip and terminal branch lengths seem to fit more with my prior expectation — under BINGAMMA the branch lengths are substantially reduced. However, while it seems reasonable for me compare branch lengths generated under the same models (ie, nucleotide coding under GTR, and RY coding under GTR), it seems less reasonable to make the comparison of GTR branch lengths to BIN branch lengths — am I wrong? I suppose the units of the branch lengths are the same - so maybe it’s actually ok to make this comparison? The main reason I’m doing this is to see if RY coding reduces branch length variance (to investigate base compositional evolution).

Best,
Jake


--
You received this message because you are subscribed to a topic in the Google Groups "raxml" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/raxml/BfoP_GXRrrY/unsubscribe.
To unsubscribe from this group and all its topics, send an email to raxml+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Grimm

unread,
Nov 25, 2017, 6:13:13 AM11/25/17
to raxml
Hi Jake,


I ran the RY coded dataset as a 01 matrix with the BINGAMMA model, and the root to tip and terminal branch lengths seem to fit more with my prior expectation — under BINGAMMA the branch lengths are substantially reduced. However, while it seems reasonable for me compare branch lengths generated under the same models (ie, nucleotide coding under GTR, and RY coding under GTR), it seems less reasonable to make the comparison of GTR branch lengths to BIN branch lengths — am I wrong?

I think both comparisons would be problematic.

1) Even when using the same model (GTR on nucleotides vs. GTR on RY-recoded), you compare the results inferred from two fundamentally different matrices. Which, however, should be – from a biological point of view – identical in their signal. But the result is not. By transforming the matrix, you changed profoundly the parameters. The differences relate to the data and (possibly) issues in optimising a sensible model, so they are  – to some degree – artificial. So, it makes little sense (anaylsis/biology-wise, no idea about the mathematics) to compare the absolute branch-lengths directly.

You can judge the primary effect of the recording by comparing the substitution models RAxML optimised (can be found in the RAxML info file).

2) The binary and RY-recoded matrices are fully equivalent regarding their signal. Thus, the difference here is purely model-based, and the branch-lengths can be compared 1:1. Again, using the substitution models in the RAxML info file, you can see how severely the GTR-optimised model deviates from the (theoretically more fitting in case of a RY matrix) binary model.
 
I suppose the units of the branch lengths are the same - so maybe it’s actually ok to make this comparison? The main reason I’m doing this is to see if RY coding reduces branch length variance (to investigate base compositional evolution).

I guess this entirely depends on how you plan to calculate the branch length variance. But that is too stastical of a question for me to answer. (I'm just interested in topologies and their biases; with respect to RY coding, that would mean: Did the topology change and how much? So I'd just use the RY/binary-topology and re-calculate the branch-lengths and model parameters using the original matrix and the GTR model; option -f e in RAxML, and then compare the likelihood of both trees, eventual AU or SH test; plus I'd correlate the split support for the original vs. RY-binary analysis, option -f m).

Jacob Berv

unread,
Nov 25, 2017, 12:34:37 PM11/25/17
to ra...@googlegroups.com
When I said branch length variance- I actually meant branch length variance-- i.e.- I am interested in whether or not RY coding decreases the variance among terminal (or root to tip) branch lengths. My hypothesis is that it should if there are lineages which have experienced base compositional evolution. Do you think that would be a valid test? It seems like variance is actually increased for RY under GTR, and substantially decreased for RY under BIN.

Jake

Alexandros Stamatakis

unread,
Nov 26, 2017, 9:02:16 AM11/26/17
to ra...@googlegroups.com
Dear Jake,

> Thanks so much for this detailed response. That makes alot of sense.
> It’s pretty amazing that despite losing 1/2 of the information, raxml is
> able to essentially reconstruct the same branch lengths with RY coding
> as it is with nucleotide coding — for root to tip or with terminal
> branch lengths, R^2 ~0.98. Amazing!

:-)

I guess what you are observing for RY coding (longer branches) is due to
the fact that the signal becomes more fuzzy, as internally at the tips
the likelihoods for the respective nucleotides representing R or Y are
both set to 1.0. Like Guido I would also have suggested to run a binary
model on the RY-coded matrix.

> I ran the RY coded dataset as a 01 matrix with the BINGAMMA model, and
> the root to tip and terminal branch lengths seem to fit more with my
> prior expectation — under BINGAMMA the branch lengths are substantially
> reduced. However, while it seems reasonable for me compare branch
> lengths generated under the same models (ie, nucleotide coding under
> GTR, and RY coding under GTR), it seems less reasonable to make the
> comparison of GTR branch lengths to BIN branch lengths — am I wrong?

I don't think it's a good idea to do that as the input datasets are just
different, but in fact, the branch lengths mean the same (i.e., mean
number of subst per site). What you can do however, is to compare the
variance and other statistics about branch lengths between the three
input datasets (DNA, RY, binary).

Also note that, another way of assessing variance is to generate 100 BS
replicates for each dataset type (DNA, RY, binary) and then estimate the
branch lengths for those 100 replicates on the respective, fixed ML
tree. This will, for each input dataset type give you a variance of
br-lens under bootstrapping and so might be more informative.

Hope this helps,

Alexis


I
> suppose the units of the branch lengths are the same - so maybe it’s
> actually ok to make this comparison? The main reason I’m doing this is
> to see if RY coding reduces branch length variance (to investigate base
> compositional evolution).
>
> Best,
> Jake
>
>
>> On Nov 23, 2017, at 7:28 AM, Grimm <grim...@gmail.com
>> <mailto:raxml+un...@googlegroups.com>.
>> For more options, visit https://groups.google.com/d/optout.
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

Jacob Berv

unread,
Nov 26, 2017, 10:17:21 PM11/26/17
to ra...@googlegroups.com
Thanks for your responses Alexis,

Not sure if you saw my post subsequent to the one you responded to — though perhaps you did. As a follow up, do you think that by comparing variance in branch lengths estimated with [RY under BIN] to [nuc under GTR], one could estimate the proportion of the substitutional process that can be attributed to base compositional evolution on the AT-GC axis of molecular evolution?

For example, if variance is reduced under RY coding (under BIN), by a factor, would that factor tell us something about the molecular clock?

If I could compare ratios of branch lengths for terminal lineages (which I guess I shouldn’t do?), would that indicate the ‘proportion’ of the individual lineage's branch length that can be explained by compositional evolution along that lineage?

Jake


To unsubscribe from this group and all its topics, send an email to raxml+un...@googlegroups.com.

Alexandros Stamatakis

unread,
Nov 27, 2017, 7:19:04 AM11/27/17
to ra...@googlegroups.com
Dear Jake,

> Not sure if you saw my post subsequent to the one you responded to —
> though perhaps you did.

Yes I did.

> As a follow up, do you think that by comparing
> variance in branch lengths estimated with [RY under BIN] to [nuc under
> GTR], one could estimate the proportion of the substitutional process
> that can be attributed to base compositional evolution on the AT-GC axis
> of molecular evolution?

Not sure, I really don't have a clear opinion here.

> For example, if variance is reduced under RY coding (under BIN), by a
> factor, would that factor tell us something about the molecular clock?

Depends how you calculate that variance, here maybe you are refering to
variance among branch lengths on a single tree I presume?

> If I could compare ratios of branch lengths for terminal lineages (which
> I guess I shouldn’t do?), would that indicate the ‘proportion’ of the
> individual lineage's branch length that can be explained by
> compositional evolution along that lineage?

I don't think so, maybe you should be looking for statistical tests for
this kind of variation. You may also, try to identify specific subtrees,
prune them from the overall tree and do a ML estimate of the base
frequencies on these subtrees to see how much the estimates of base
frequencies differ.

Alexis

>
> Jake
>
>
>> On Nov 26, 2017, at 9:02 AM, Alexandros Stamatakis
>> <alexandros...@gmail.com
>>>> visithttps://groups.google.com/d/topic/raxml/BfoP_GXRrrY/unsubscribe.
>>>> To unsubscribe from this group and all its topics, send an email
>>>> toraxml+u...@googlegroups.com
>>>> <mailto:raxml+un...@googlegroups.com><mailto:raxml+un...@googlegroups.com>.
>>>> For more options, visithttps://groups.google.com/d/optout.
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "raxml" group.
>>> To unsubscribe from this group and stop receiving emails from it,
>>> send an email toraxml+u...@googlegroups.com
>>> <mailto:raxml+un...@googlegroups.com><mailto:raxml+un...@googlegroups.com>.
>>> For more options, visithttps://groups.google.com/d/optout.
>>
>> --
>> Alexandros (Alexis) Stamatakis
>>
>> Research Group Leader, Heidelberg Institute for Theoretical Studies
>> Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology
>>
>> www.exelixis-lab.org <http://www.exelixis-lab.org/>
Reply all
Reply to author
Forward
0 new messages