Gaps at the beginning of a nucleotide sequence are treated differently?

Rob Hun

Jan 23, 2024, 2:29:00 PM
to TNT-Tree Analysis using New Technology
Hi,

I'm looking for advice or hypotheses. During my research I got some results that confuse me. I have narrowed the cause down to what appears to be TNT treating internal gaps ('-') differently from leading gaps at the beginning of the sequence.

To test this, I artificially added gaps ('-') to the classic primates cytb dataset. An original sequence looks like this (I have shortened the full sequences here with ...):

Saimiri_sciureus               ATGACCTCACCCCGCAAAACACACCCTTTAACAAAAATCATTAATAACTCCTTTATCGACCTTCCTACACCATCCAACATCTCT...
Cebus_albifrons                ATGACCTCTCCCCGCAAAACACACCCATTAATAAAAA...

I modified many of the sequences to replace some of the leading characters with gaps like this (Dataset B):

Saimiri_sciureus --------------------------------------------------CTTTATCGACCTTCCTACACCATCCAACATCTCT...
Cebus_albifrons ----------CCCGCAAAACACACCCATTAATAAAAA...

and compared the resulting trees with ones where I added a dummy 'A' in front of all of the sequences like this (Dataset A):

Saimiri_sciureus A--------------------------------------------------CTTTATCGACCTTCCTACACCATCCAACATCTCT...
Cebus_albifrons A----------CCCGCAAAACACACCCATTAATAAAAA...

So the only difference between Dataset A and Dataset B is a leading 'A' in each sequence before the gaps.
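
For anyone who wants to reproduce the setup, the two datasets can be built roughly as follows. This is only a minimal Python sketch: the helper names `mask_leading` and `add_dummy_prefix` are hypothetical, and the short string is just the first few characters of the Cebus_albifrons sequence shown above, not the full alignment.

```python
def mask_leading(seq: str, n: int, gap: str = "-") -> str:
    """Replace the first n characters of seq with gap symbols (Dataset B style)."""
    n = min(n, len(seq))  # never mask more characters than the sequence has
    return gap * n + seq[n:]


def add_dummy_prefix(seq: str, base: str = "A") -> str:
    """Prepend a single dummy base so the gaps are no longer leading (Dataset A style)."""
    return base + seq


# Illustrative fragment only: the start of the Cebus_albifrons sequence from the post.
original = "ATGACCTCTCCCCGCAAAACACACCC"

dataset_b = mask_leading(original, 10)
dataset_a = add_dummy_prefix(dataset_b)

print(dataset_b)  # ----------CCCGCAAAACACACCC
print(dataset_a)  # A----------CCCGCAAAACACACCC
```

The same two helpers would be applied to every sequence in the matrix, with a different number of masked characters per taxon.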

The trees from the analysis of Dataset A are much worse (nrf of 0.65 across 5 different random starting seeds) than the trees from Dataset B (nrf around 0.2).

Is there a biological or algorithmic reason for this that I am missing? I can provide more details. I am using the 'nogaps' option, which I understand to mean that '-' is treated as missing data.

Any help or ideas would be appreciated!

Rob Hun

Mar 26, 2024, 11:51:25 AM
to TNT-Tree Analysis using New Technology
This was solved by a very thorough and quick response from Pablo (via TNT's bug reporting). In case anyone encounters this in the future, this is what they wrote:

"I see in your test datasets (A and B) that you are using the "gaps" and "trimhead" options. This means that TNT will consider gaps as a fifth state, except for those that are leading gaps, which will be missing. This would make sense when some sequences are much shorter than others by virtue of having sequenced only the middle part.

Therefore, the behaviour you report is fully expected: in dataset A the gaps are no longer leading; they are all preceded by an A. "
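
The effect described in the reply can be illustrated with a small Python sketch. This is not TNT's actual implementation, only an assumed model of the rule as quoted: the leading run of '-' becomes missing data ('?'), while every other '-' stays a gap, i.e. a fifth state. The function name `recode_leading_gaps` and the toy sequences are hypothetical.

```python
def recode_leading_gaps(seq: str, missing: str = "?") -> str:
    """Turn the run of '-' at the start of seq into missing data; keep other gaps."""
    stripped = seq.lstrip("-")             # drop only the leading run of gaps
    n_leading = len(seq) - len(stripped)   # how many gaps were leading
    return missing * n_leading + stripped


dataset_b = "----CCCG-CAA"     # gaps start the sequence
dataset_a = "A" + dataset_b    # same gaps, but preceded by a dummy 'A'

print(recode_leading_gaps(dataset_b))  # ????CCCG-CAA
print(recode_leading_gaps(dataset_a))  # A----CCCG-CAA
```

Under this model, the masked region in Dataset B contributes no information, whereas in Dataset A the same region is scored as a long shared stretch of the gap state, which would be expected to pull the affected taxa together.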