Hi Sonia,
Adding to Alexi's reply: depending on the complexity of your indels, you can also use the binary option to code step-matrices, or use multistate coding instead,
e.g.
tip1 AAAGGGAAA
tip2 AAA---AAA
tip3 AAAAAAAAA
one could use a ternary coding
tip1 AAAGGGAAA = 0
tip2 AAA---AAA = 1
tip3 AAAAAAAAA = 2
(In this particular case one might even consider treating this character as ordered, in view of the mutation steps needed to generate such a sequence pattern.)
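A minimal sketch of how such a multistate code could be generated, assuming the indel region is known by its column range. The function name `code_indel_region` and the rule "number the states in order of first appearance" are my own choices for illustration, not something a particular program prescribes.

```python
def code_indel_region(alignment, start, end):
    """Give each distinct pattern in columns [start, end) its own state.

    States are numbered in order of first appearance (an arbitrary,
    hypothetical convention chosen for this sketch).
    """
    states = {}  # pattern -> state number
    coded = {}   # tip name -> state number
    for name, seq in alignment.items():
        pattern = seq[start:end]
        if pattern not in states:
            states[pattern] = len(states)
        coded[name] = states[pattern]
    return coded, states


alignment = {
    "tip1": "AAAGGGAAA",
    "tip2": "AAA---AAA",
    "tip3": "AAAAAAAAA",
}

# Columns 4-6 of the alignment (0-based slice 3:6) hold the indel.
coded, states = code_indel_region(alignment, 3, 6)
print(coded)   # {'tip1': 0, 'tip2': 1, 'tip3': 2}
print(states)  # {'GGG': 0, '---': 1, 'AAA': 2}
```

Whether the resulting character is then analysed as ordered or unordered is a separate decision made in the analysis program.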
Two more things:
First, a warning: gap coding can be a double-edged sword, as it may invite false positives. For instance, duplications in non-coding chloroplast gene regions can be highly convergent. So, if the gap pattern doesn't match the point mutations, relying on the binary partition puts you on thin ice. If the convergent gap patterns outnumber the (phylogenetically sorted) point mutations, the binary gap-coding partition will outcompete the tree supported by the nucleotide substitutions.
The good news: you don't need to code gaps as binaries at all if the matrix also has good signal in the variable sites. Since gaps are treated like N's under ML, they are considered when optimising the tree. ML with the standard substitution model is semi-aware of gaps: the tip probability vector of a gap is the same as for N/missing data, p = (1,1,1,1). I.e., if you have this four-taxon problem
tip1 AAAAAAAAA
tip2 AAAAAAAAA
tip3 AAA---AAA
tip4 AAA---AAA
Parsimony will give you a star tree, but ML will prefer the tip1 + tip2 | tip3 + tip4 split over the alternatives because it sees an alignment pattern involving a substitution from A (1,0,0,0) to N (1,1,1,1). I.e., if the gaps show splits congruent with the variable sites, they will already stabilise your topology even when you just leave them as they are in the alignment.
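Written out, the tip vectors mentioned above look like this. The sketch only covers A, C, G, T, gaps and N/?; other IUPAC ambiguity codes are left out on purpose, and the name `tip_vector` is just for illustration.

```python
STATES = "ACGT"
MISSING = {"-", "N", "?"}


def tip_vector(char):
    """Tip probability vector in A, C, G, T order.

    Gaps and N/? are completely uninformative about the state, so every
    entry is 1; a resolved base has a single 1 at its own position.
    """
    char = char.upper()
    if char in MISSING:
        return (1.0, 1.0, 1.0, 1.0)
    if len(char) != 1 or char not in STATES:
        raise ValueError(f"unsupported character: {char!r}")
    return tuple(1.0 if char == s else 0.0 for s in STATES)


# Fourth column of the four-taxon example: A, A, gap, gap.
for name, char in zip(("tip1", "tip2", "tip3", "tip4"), "AA--"):
    print(name, char, tip_vector(char))
# tip1 A (1.0, 0.0, 0.0, 0.0)
# tip2 A (1.0, 0.0, 0.0, 0.0)
# tip3 - (1.0, 1.0, 1.0, 1.0)
# tip4 - (1.0, 1.0, 1.0, 1.0)
```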
This also means: if you code your gaps as, e.g., a binary partition, you need to exclude them from the nucleotide partition. Otherwise you duplicate their signal, in a way.
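A small sketch of that separation, assuming a single indel with known column boundaries. The function `split_partitions` is hypothetical, and treating only a fully gapped region as "1" is a simplification I chose for the example.

```python
def split_partitions(alignment, start, end):
    """Code the indel in columns [start, end) as 0/1 and remove those
    columns from the nucleotide data, so the gap signal is counted once.
    """
    nucleotides = {}
    gap_codes = {}
    for name, seq in alignment.items():
        region = seq[start:end]
        # Simplification: only a completely gapped region is scored as 1.
        gap_codes[name] = "1" if set(region) == {"-"} else "0"
        nucleotides[name] = seq[:start] + seq[end:]
    return nucleotides, gap_codes


alignment = {
    "tip1": "AAAAAAAAA",
    "tip2": "AAAAAAAAA",
    "tip3": "AAA---AAA",
    "tip4": "AAA---AAA",
}

nucleotides, gap_codes = split_partitions(alignment, 3, 6)
print(nucleotides["tip3"])  # AAAAAA
print(gap_codes)  # {'tip1': '0', 'tip2': '0', 'tip3': '1', 'tip4': '1'}
```

The two dictionaries correspond to the "only remaining nucleotides" and "only gap codes" data sets; concatenating them per tip gives the combined matrix.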
To be on the safe side, always run the standard analysis (no coding), the combined analysis (nucleotides + binary codes) and the separate analyses (only the remaining nucleotides, only the gap codes), and compare the trees. They should converge on the same tree and only resolve some aspects better or worse.
The exception to the usually-doesn't-pay-to-gapcode rule is when your sequences are differentiated mostly by gap patterns and by very few point mutations, because then it may be hard for the algorithm to optimise a substitution model at all and to infer a meaningful tree. In such a case one just binarises all alignment patterns,
e.g.
tip1 GAAAAAAGG = 101
tip2 AAAAAAAGG = 001
tip3 AAA---AAA = 010
tip4 GAA---AAA = 110
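A rough sketch of such a binarisation, under assumptions of my own: each distinct two-state column pattern becomes one binary character, constant columns and columns with more than two states are skipped, and the most frequent character in the whole alignment (here A) is coded as 0. The function name `binarise` is hypothetical.

```python
from collections import Counter


def binarise(alignment):
    """Turn every distinct two-state column pattern into one 0/1 character."""
    names = list(alignment)
    seqs = [alignment[n] for n in names]
    # Assumption: the most frequent character overall is the 0 state.
    reference = Counter("".join(seqs)).most_common(1)[0][0]
    patterns = []
    for column in zip(*seqs):
        if len(set(column)) != 2:
            continue  # constant or multistate column: not handled here
        # Fall back to the first tip's state if the reference is absent.
        zero = reference if reference in column else column[0]
        pattern = "".join("0" if c == zero else "1" for c in column)
        if pattern not in patterns:
            patterns.append(pattern)  # identical columns are coded once
    return {n: "".join(p[i] for p in patterns) for i, n in enumerate(names)}


alignment = {
    "tip1": "GAAAAAAGG",
    "tip2": "AAAAAAAGG",
    "tip3": "AAA---AAA",
    "tip4": "GAA---AAA",
}

codes = binarise(alignment)
print(codes)
# {'tip1': '101', 'tip2': '001', 'tip3': '010', 'tip4': '110'}
```

The three characters are, in order, the G/A difference in the first column, the gap in columns 4-6, and the GG/AA difference in the last two columns.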
Happy coding,
Guido