Using RAxML only with inputs as indels

EB

unread,

Mar 22, 2018, 5:50:26 PM3/22/18

to raxml

Hello there

This topic has been discussed quite a lot elsewhere within this Google Group, e.g.

https://groups.google.com/forum/#!topic/raxml/0EvcoldRAKI

However, the use-case I have in mind for the following experiment may require some discussion:

I have single-read WG reads whereby there are a high density of short to mid-range InDels. Based on the mechanism of how these InDels came to be, we expect that one should be able to trace the lineage of these cells solely based on the identity/location of these InDels.

Would one be able to use RAxML for this task? My first though was to encode all nucleobases as 0, all insertions as 1, and all deletions as -1. However, it's somewhat possible that SNVs could encode information as well----there may be a better way.

Alexandros Stamatakis

unread,

Mar 23, 2018, 12:56:44 AM3/23/18

to ra...@googlegroups.com

> This topic has been discussed quite a lot elsewhere within this Google
> Group, e.g.
>
> https://groups.google.com/forum/#!topic/raxml/0EvcoldRAKI
>
> However, the use-case I have in mind for the following experiment may
> require some discussion:
>
> I have single-read WG

What does WG stand for?

> reads whereby there are a high density of short to
> mid-range InDels. Based on the mechanism of how these InDels came to be,
> we expect that one should be able to trace the lineage of these cells
> solely based on the identity/location of these InDels.
>
> Would one be able to use RAxML for this task? My first though was to
> encode all nucleobases as 0, all insertions as 1, and all deletions as
> -1.

You could model these as three distinct states in a multi-state model.
But how do you know which are inesrtions and which are deletions?

> However, it's somewhat possible that SNVs could encode information
> as well----there may be a better way.

Without gaps you mean? You could model the nucleotides as additional
states, maybe using 7 states?

Alexis

>
>
> --
> You received this message because you are subscribed to the Google
> Groups "raxml" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to raxml+un...@googlegroups.com
> <mailto:raxml+un...@googlegroups.com>.
> For more options, visit https://groups.google.com/d/optout.

--
Alexandros (Alexis) Stamatakis

Research Group Leader, Heidelberg Institute for Theoretical Studies
Full Professor, Dept. of Informatics, Karlsruhe Institute of Technology

www.exelixis-lab.org

EB

unread,

Mar 23, 2018, 1:47:06 PM3/23/18

to raxml

Hi Alexis

Thanks for the response.

> What does WG stand for?

Sorry, I meant that this data is WGS data. Single-cell whole-genome.

> You could model these as three distinct states in a multi-state model.

But how do you know which are inesrtions and which are deletions?

After alignment, I would know this information fairly certainly, right?

> Without gaps you mean? You could model the nucleotides as additional
states, maybe using 7 states?

With this model, we would use 0=no variation, 1=insertion, -1=deletion, 2=A, 3=C, 4=G, 5=T (seven states), right?

So, AGGT111CGAAAGT-1-1-1 (no SNPs) would become 00001110000000-1-1-1?

This might not be the best approach of course, if we just take the alignment with this model, we would be discarding the information of the rest of the alignment (i.e. the actual sequence which is aligned). This contains useful information.

Grimm

unread,

Mar 24, 2018, 8:44:33 AM3/24/18

to raxml

Hi,

> You could model these as three distinct states in a multi-state model.

But how do you know which are inesrtions and which are deletions?

After alignment, I would know this information fairly certainly, right?

Not really, we usually make the call whether its an insertion or deletion by comparison about all taxa. If most have it, and some miss it, it's a "deletion", if only few have the extra piece, it's and insertion.
I just checked this in great detail in classic Sanger sequencing plastid data, and I can decide between deletions and duplications or more complex insertions, but you have to be very careful with the majority rule argument. Effectively you first have to make a tree without the indel data, then map the indel data on the tree, to make the call: likely loss (deletion) or gain (insertion).

From a genetic-phylogenetic point of view, I see no point in seperating, and would stick to binary (or ternary):
absent - 0, present - 1
absent - 0, intragenomic variation (in case you caught that) - 1, present - 2

> Without gaps you mean? You could model the nucleotides as additional
states, maybe using 7 states?

With this model, we would use 0=no variation, 1=insertion, -1=deletion, 2=A, 3=C, 4=G, 5=T (seven states), right?

In principle yes, but you can't use "-1", Multistate has to be 0,1,2,... there's an R-script available that can transform a nucleotide matrix with indels into a multistate matrix. My first author made it for our 2ISP paper (http://dx.doi.org/10.1093/sysbio/syt052), should be in the dryad submission: http://dx.doi.org/10.5061/dryad.21jj2

So, AGGT111CGAAAGT-1-1-1 (no SNPs) would become 00001110000000-1-1-1?

Only if you want to ignore the SNPs. But then you can analyse your data as binary (absent/present, see above).
For multistate, it should become 02236661200023444
0=A
1=C
2=G
3=T
4=Deletion
5=No modification
6=Insertion

This might not be the best approach of course, if we just take the alignment with this model, we would be discarding the information of the rest of the alignment (i.e. the actual sequence which is aligned). This contains useful information.

I'd be careful about signal issues. From the plastids, I realised that indel patterns are often convergent, and may even be saturated to a certain degree: they may be finely sorted within a subtree, but you have the same pattern (and other sorting) in other subtrees.

I suggest you run an analysis without the indels, only the SNPs, to get a sort of backbone tree, then run a binary analysis ignoring the SNPs, only using absent/present for the indels, and then try out the multistate optimisation, and compare the outcome: test the inferred trees against each other using AU-test, and correlated the branch support directly.

PS here you are quite unlimited, you can go up to 35 states, 10 arabic numbers + 26 letters, so you technically could do an optimisation that conserves SNPs with 2ISP and indel on/off situations
Eg. (0 = gap, 1 = gap or data/indel, 2 = indel only)
group1 A2G
group2 R1G
group3 G0R
etc. and the ML tree will consider this all (but see also the results we had for ML-A, standard 4x4 model and ML-I, 2ISP-aware for data sets with high level of intragenomic variation. ML does differ between A, R, and G also when using the standard 4x4 model. The problem with hyperstate models is that when some states (e.g. ambiguous base calls) are much rarer than others (e.g. an occasional R when all others decide on A or G), they will be regarded unlikely, and may be overweighted (for the example, RAxML will in multistate mode optimise a substitution model where the probability to change from A or G into R << than to change between A and G, which makes little genetic sense, and you would get a more sensible result not coding R as an extra state). Only when a good deal information is encoded in polymorphic sites, ML-I outperforms ML-A .

Cheers, and good inferencing,
Guido

Grimm

unread,

Mar 24, 2018, 8:57:48 AM3/24/18

to raxml

Sorry, 10 + 26 gives 36 (bad in maths, reason I became a biologist...)

EB

unread,

Mar 24, 2018, 7:01:18 PM3/24/18

to raxml

Hi Guido Grimm

Not really, we usually make the call whether its an insertion or deletion by comparison about all taxa. If most have it, and some miss it, it's a "deletion", if only few have the extra piece, it's and insertion.

I don't understand. If I align to a reference, I know this information and can label these as insertions/deletions/not manipulated.

I just checked this in great detail in classic Sanger sequencing plastid data, and I can decide between deletions and duplications or more complex insertions, but you have to be very careful with the majority rule argument. Effectively you first have to make a tree without the indel data, then map the indel data on the tree, to make the call: likely loss (deletion) or gain (insertion).

From a genetic-phylogenetic point of view, I see no point in seperating, and would stick to binary (or ternary):
absent - 0, present - 1
absent - 0, intragenomic variation (in case you caught that) - 1, present - 2

> Without gaps you mean? You could model the nucleotides as additional
states, maybe using 7 states?

With this model, we would use 0=no variation, 1=insertion, -1=deletion, 2=A, 3=C, 4=G, 5=T (seven states), right?

In principle yes, but you can't use "-1", Multistate has to be 0,1,2,... there's an R-script available that can transform a nucleotide matrix with indels into a multistate matrix. My first author made it for our 2ISP paper (http://dx.doi.org/10.1093/sysbio/syt052), should be in the dryad submission: http://dx.doi.org/10.5061/dryad.21jj2

So, AGGT111CGAAAGT-1-1-1 (no SNPs) would become 00001110000000-1-1-1?

Only if you want to ignore the SNPs. But then you can analyse your data as binary (absent/present, see above).
For multistate, it should become 02236661200023444
0=A
1=C
2=G
3=T
4=Deletion
5=No modification
6=Insertion

I guess my question for you and Alexis and others is, does such a model make sense in terms of RAxML? Also, does it matter if I use a different labeling system, e.g.

1=No modification
2=inserted base

3 = deleted base

4 = C

5 = G

6 = T

7 = G

Grimm

unread,

Mar 25, 2018, 10:35:39 AM3/25/18

to raxml

Am Sonntag, 25. März 2018 00:01:18 UTC+1 schrieb EB:

Hi Guido Grimm

Not really, we usually make the call whether its an insertion or deletion by comparison about all taxa. If most have it, and some miss it, it's a "deletion", if only few have the extra piece, it's and insertion.

I don't understand. If I align to a reference, I know this information and can label these as insertions/deletions/not manipulated.

Several levels.
Level 1 (essentially Alexi's original question): How do you assess that your reference shows the ancestral situation?
Level 2: When you treat insertions and deletions as different processes, and optimise via RAxML multistate option, RAxML will optimise a rate to insert, and another one to delete. As a geneticist, I'd argue insertions and deletions cannot be taken apart. I further would argue that each part of the genome with length-polymorphism has a different probability for indels. In all sequence data, I looked at, I could observe regions attracting indels (insertions and/or deletions), and others that did not. That is trivial for regions with structural constraints (e.g. ribosomal DNAs, tRNA-genes, protein-coding in general), but can also be observed in non-coding, non-transcribed regions. And since its and/or, there is little reason to seperate inserts and deletions beyond binary abscence/presence.
Level 3: You have the same issue with treating inserts and deletions as two different states as you have with treating ambiguous base calls as additional states. If one of them is rare, it will have a greater effect than the other. Which brings us back to level 2 (and 1): are your insertions (with respect to the reference) the consequence of a different process than your deletions?

But easy to test: you code them just binary (absent/precense), run the analysis, and compare it with the analysis using ternary coding: deletion – no modification – insertion.

E.g. lets say we code the entire data for indels as binary: 0 = gap, 1 no-gap, or ternary with respect to reference: 0 = deletion, 1 = no modif., 2 = insertion
reference: GGGG -             - AGAT - ATCT - CCCC   -> binary: 1 - 0 - 1 - 1 - 1 -> ternary: 1 - 1 - 1 - 1 - 1 (or just only the variation: 0 - 1 / 1 - 1)
type1:       GGGG -             - AGAT -           - CCCC    -> binary: 1 - 0 - 1 - 0 - 1 -> ternary: 1 - 1 - 1 - 0 - 1 (0 - 0 / 1 - 0)
type2:       GGGG - GGGG - AGAT - ATCT - CCCC    -> binary: 1 - 1 - 1 - 1 - 1 -> ternary: 1 - 2 - 1 - 1 - 1 (1 - 1/ 2 - 1)
In both cases you have two distinct alignment patterns, at the same positions (2 and 4), and both show the same pattern. Only difference being that for binary your model will have one parameter less (speedier), and the topological effect (and also data-bias) may increase for ternary: RAxML notices for binary that there a two site that changed absence-presence, a certain probability (can't calculate that). For ternary RAxML notices that there's an equal chance to have a deletion OR insertion, so the probability for either one may be halved. So you get more decisiveness from ternary than binary with the risk of over-weighting (Level 3-issue above). So ternary is a double-edged sword.

Another example (inspired from a dataset, non-coding, non-transcribed, I recently worked)

reference: GGGG - AGAT-           - ATCT - CCCC [a hairpin-like sequence]
type 1:      GGGG - AGAT- AGAT - ATCT - CCCC [simple 5' duplication]
type 2:      GGGG - AGAT- AGAT -           - CCCC [a length-compensating deletion]
Using the ternary coding (only seen variation), we would code this as
ref:            1 - 1
type 1:      2 - 1
type 2:      2 - 0
But only if we realise the mutation sequence: The AGAT is duplicated, and the ATCT is lost. Binary it would be
ref: 0 - 1
type 1:     1 - 1
type 2:     1 - 0
Again, the patterns are equivalent per se, just their weighting differs.
But what is when your reference has already a modification? For ternary this would need to be taking into account
[actual ancestor:   GGGG - AGAT-                       - CCCC -> ternary: (1 - 1 -) 1 - 1 (-1) -> binary: 0 - 0 * this is the ancestral sequence, of the all-ancestor, which you don't have in your sample]
reference:             GGGG - AGAT-           - ATCT - CCCC ->                           1 - 2        ->              0 - 1
type 1:                  GGGG - AGAT- AGAT - ATCT - CCCC ->                           2 - 2        ->              1 - 1
type 2:                  GGGG - AGAT- AGAT -           - CCCC ->                           2 - 1        ->              1 - 0
The pattern is still the same, but now you have no deletion, but two insertion events; this changes the matrix you code, and would effect the model you optimise for how likely it is to have an insertion (quite likely) and deletion (unlikely). Quite a difference to above, where we had equal probability for insertion and deletion.
But the binary coding will be the same (your level 1 issue).

Cheers, Guido

PS My data doesn't stop there, I have additional types, the original duplications appears inverted in some sequences of the same clade forming a pseudo-loop in the pseudo-hairpin, and deletions/additional duplications are found in both the 5' and 3' parts of the pseudo-hairpin (pseudo, cause it's a non-coding plastid spacer). By reconstructing the patterns within terminal clades, and compare them beyond clades, I have a good idea how it originally looked like, but usually indels are really simply absence/presence and rarely genetic synapomorphies (in a classic, strict Hennigian sense), and if you reference is not the actual ancestral sequence or from a known low-evolving taxon (a genetic living fossil), you have no way to decide wether it does include insertions (compared to the best reference: the LCA – last common ancestor), which you will code as deletions (or deletions you score as insertions). But as I said, you can just code both options, and see what comes up. In the best case, it doesn't matter at all. And if there a difference, you just compare the models RAxML optimised (in the RAxML_info files). If the probability for insertion vs deletion is very different, then you may have a insertion/deletion sample bias, which you can cross-check for by estimating the proportion of ref-based insertions vs. deletions. If they are about the same amount, but have different substitution probabilities, than one may really be more probable than the other.

Grimm

unread,

Mar 25, 2018, 10:42:31 AM3/25/18

to raxml

Regarding you end-question.
The sequence doesn't matter at all for the inference. I'd start with A = 0, C = 1, etc. so I'd see directly the classic 4 x 4 part of the optimised model, with the specialities coming at the end.
But you can of course also go with deletion = 0, no modif. = 1, insertion = 2, etc.
You just have to remember how you coded it, for being able to interpret the optimised multi-state models. You'll get in the RAxML info file the probabilities for 0<>1, 0<>2, etc.

Only important thing is: it must start with 0, and be consecutively numbered, otherwise RAxML will not accept the input. Each state you use must be found at least once in your read-in alignment.

Alexandros Stamatakis

unread,

Mar 26, 2018, 1:47:01 AM3/26/18

to ra...@googlegroups.com

Okay, so we are currently developing and testing dedicated models for
single-cell whole-genome data in RAxML-NG. Maybe we should take this
conversation off-line from the group and see if we can provide you
advance access to this (I'd need to check with my collaborators, but I
think that would be okay). So please send me an email to my personal
email address and we can discuss further.

Alexis

Reply all

Reply to author

Forward