Hi,
I'm looking for advice or hypotheses. During my research I ran into some results that confuse me. I have narrowed the problem down to this: TNT appears to treat internal gaps ('-') differently from leading gaps at the beginning of a sequence.
To test this, I artificially added gaps ('-') to the classic primates cytb dataset. An original sequence looks like this (I have truncated the full sequences here with ...):
Saimiri_sciureus ATGACCTCACCCCGCAAAACACACCCTTTAACAAAAATCATTAATAACTCCTTTATCGACCTTCCTACACCATCCAACATCTCT...
Cebus_albifrons ATGACCTCTCCCCGCAAAACACACCCATTAATAAAAA...
In many of the sequences I replaced some of the leading characters with gaps, like this (Dataset B):
Saimiri_sciureus --------------------------------------------------CTTTATCGACCTTCCTACACCATCCAACATCTCT...
Cebus_albifrons ----------CCCGCAAAACACACCCATTAATAAAAA...
I then compared the resulting trees with ones from a version where I added a dummy 'A' in front of every sequence, like this (Dataset A):
Saimiri_sciureus A--------------------------------------------------CTTTATCGACCTTCCTACACCATCCAACATCTCT...
Cebus_albifrons A----------CCCGCAAAACACACCCATTAATAAAAA...
So the only difference between Dataset A and Dataset B is a leading 'A' in each sequence before the gaps.
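To make the construction concrete, here is a minimal Python sketch of how the two datasets relate. The function names (`mask_leading`, `add_dummy_prefix`) and the `mask_lengths` table are hypothetical and only for illustration; the sequences are just the first characters of the ones shown above, not the full alignment.

```python
def mask_leading(seq: str, n: int, gap: str = "-") -> str:
    """Replace the first n characters of seq with gap symbols (Dataset B style)."""
    n = max(0, min(n, len(seq)))  # clamp so we never mask past the sequence end
    return gap * n + seq[n:]


def add_dummy_prefix(seq: str, dummy: str = "A") -> str:
    """Prepend one dummy character so the gaps are no longer leading (Dataset A style)."""
    return dummy + seq


# Truncated prefixes of the sequences quoted in the post (illustration only).
original = {
    "Saimiri_sciureus": "ATGACCTCACCCCGCAAAACACACCC",
    "Cebus_albifrons": "ATGACCTCTCCCCGCAAAACACACCC",
}

# Hypothetical number of leading characters to mask per taxon.
mask_lengths = {"Saimiri_sciureus": 20, "Cebus_albifrons": 10}

dataset_b = {name: mask_leading(seq, mask_lengths[name]) for name, seq in original.items()}
dataset_a = {name: add_dummy_prefix(seq) for name, seq in dataset_b.items()}

for name in original:
    print(name, dataset_b[name])
    print(name, dataset_a[name])
```

Every sequence in Dataset A is exactly one column longer than its counterpart in Dataset B, and that extra first column is identical ('A') across all taxa, so it carries no parsimony information by itself.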
The trees from the analysis of Dataset A are substantially worse (normalized Robinson-Foulds distance, nrf, of 0.65 across 5 different random starting seeds) than the trees from Dataset B (nrf around 0.2).
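For anyone wanting to reproduce the comparison, below is a rough Python sketch of one common way to compute a normalized Robinson-Foulds distance. It assumes each tree has already been reduced to a set of non-trivial splits, with each split written as the side that does not contain a fixed reference taxon. The normalization used here (symmetric difference divided by the total number of splits in both trees) is an assumption and may differ from other tools; the function name `normalized_rf` and the toy splits are hypothetical.

```python
from typing import FrozenSet, Set

Split = FrozenSet[str]


def normalized_rf(splits_a: Set[Split], splits_b: Set[Split]) -> float:
    """Normalized RF: splits found in only one tree, divided by all splits in both.

    Assumes both split sets use the same canonical orientation
    (e.g. always the side that excludes one fixed reference taxon).
    """
    total = len(splits_a) + len(splits_b)
    if total == 0:
        return 0.0  # two unresolved (star) trees are treated as identical
    return len(splits_a ^ splits_b) / total


# Toy example on five taxa A..E, with E as the reference taxon.
tree_1 = {frozenset({"A", "B"}), frozenset({"A", "B", "C"})}
tree_2 = {frozenset({"A", "B"}), frozenset({"A", "B", "D"})}

print(normalized_rf(tree_1, tree_2))
```

With this definition, 0 means the two trees share every split and 1 means they share none.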
Is there a biological or algorithmic reason I am missing that would explain this? I can provide more details if needed. I am using the 'nogaps' option, which I understand to mean that '-' characters are treated as missing data.
Any help or ideas would be appreciated!