Highly Length Variable Sequences

Ulf

May 8, 2008, 2:36:24 PM

to poy4

Hi All,

I am conducting a combined POY (version 4.0) analysis of three
different datasets (only 30 sequences). One of the datasets (i.e.,
partial LSU rRNA sequences) consists of sequences that are
characterized by a high degree of length variability (i.e., the
difference between the shortest and the longest sequences is about 400
nucleotides). I have run this data many times in POY 4.0 and the
heuristic searches (in combination with iterative pass optimizations)
are unable to find a "good" solution which appears to be due to the
length-variable LSU sequences. Heuristic PAUP searches based on
multiple alignment matrices of all three datasets (aligned
separately and then combined into a single input matrix) gave rise to
much shorter trees (i.e., 7376 steps) than those obtained from the
POY 4.0 runs (i.e., the lowest-cost tree was 7758). PAUP analyses were
conducted with gaps as the 5th character state whereas all the
transformation costs in all the POY runs were set to 1.

Is there a way to deal with sequences of highly varying length in POY?
I am analyzing them as prealigned sequences now and that seems to
produce much shorter trees (i.e., somewhat shorter lengths than those
generated by the PAUP runs mentioned above). However, I prefer to have
POY give me the "optimal" alignment. Any suggestions on how to solve
this problem? How about dividing up the length-variable data into
subsections based on info (conserved regions) obtained from a multiple
alignment matrix?
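A minimal sketch of how candidate break points might be located from a multiple alignment: it scans the columns and reports runs where every sequence shows the same non-gap base. The function name, the `min_run` threshold, and the toy sequences are assumptions made for illustration; they are not part of POY or of any alignment program.

```python
def conserved_runs(aligned, min_run=3):
    """Return (start, end) column ranges, end exclusive, in which every
    aligned sequence carries the same non-gap character.

    `aligned` is a list of equal-length strings using '-' for gaps.
    `min_run` is a hypothetical threshold for how long a run must be.
    """
    seqs = list(aligned)
    if not seqs:
        return []
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("aligned sequences must have equal length")

    runs = []
    start = None
    # Iterate one column past the end so that a run touching the last
    # column is closed like any other.
    for col in range(width + 1):
        conserved = (
            col < width
            and seqs[0][col] != "-"
            and len({s[col] for s in seqs}) == 1
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_run:
                runs.append((start, col))
            start = None
    return runs


# Toy alignment (invented data): two conserved blocks flank a
# length-variable region.
toy = ["ACGTTA--GGCA", "ACGTTACCGGCA", "ACGTA---GGCA"]
print(conserved_runs(toy))  # [(0, 4), (8, 12)]
```

Columns inside such runs are natural places to cut, since the homology of the flanking positions is not in doubt there.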

With best regards,

Ulf Sorhannus
Department of Biology
Edinboro University of Pennsylvania
Edinboro, Pa 16444
USA

Andres Varon

May 8, 2008, 3:43:22 PM

to poy4

Hello Ulf,

I'm surprised to hear this, especially since you even used iterative pass.
I see that you have been very careful to verify that the cost is
comparable between the static homology analysis and your dynamic
homology analyses, but in every case where someone has told me this in
the past, it turned out that the coding was different between the two
executions. That doesn't mean that you could not be the first to find
a real case where our default heuristics are not doing well!

So, before trying to find out what to do to the data to improve the
results, I would like to verify that everything is indeed the same.
Could you show us the script?

best,

Andres

Ulf

May 8, 2008, 4:39:27 PM

to poy4

Andres,

Here is the script that I have used for the heuristic searches:

read(prealigned:("*.aln", tcm:(1,1)),"ssuatt.fas")
set(seed:133717, log:"all_data_search.log", root:"Corethronhystrix")
report(timer:"search start")
transform (tcm:(1,1), gap_opening:1)
build(1000)
swap(trees:8, threshold:8.0)
select()
perturb(transform(static_approx), iterations:40,ratchet:(0.3,3))
select()
fuse(iterations:350, swap(trees:6, threshold:4))
select()
report("TREESout", trees:(total) ,"constree", graphconsensus,
"diagnosis", diagnosis)
report("implalnALL.out", implied_alignments)
report(timer:"search end")
set(nolog)
exit()

Here is the script for the iterative pass optimization:

read(prealigned:("*.aln", tcm:(1,1)),"ssuatt.fas")
read("treein")
transform(tcm:(1,1),gap_opening:1)
set(iterative)
swap(around)
select()
report("all_treesIT2",trees:(total),"constreeIT2",graphconsensus,
"diagnosisIT2",diagnosis)
report("implalnIT2", implied_alignments)
exit()

If you want I can also send you the data. One thing I forgot to
mention in my previous message was that there are a number of missing
sequences (11 sequences out of 30 sequences) in the LSU data set.

Thank you for looking into this problem!

ULF

Andres Varon

May 8, 2008, 5:23:53 PM

to po...@googlegroups.com

Hello Ulf,

This is a case of a different character coding between the two
analyses. See below.

On May 8, 2008, at 4:39 PM, Ulf wrote:
>
> Andres,
>
>
> Here is the script that I have used for the heuristic searches:
>
> read(prealigned:("*.aln", tcm:(1,1)),"ssuatt.fas")
> set(seed:133717, log:"all_data_search.log", root:"Corethronhystrix")
> report(timer:"search start")

Here is the difference:

>
> transform (tcm:(1,1), gap_opening:1)

In POY 4 the gap_opening cost is in addition to the individual indel
cost. So effectively:

A-A
AAA

has cost 2 under your transform settings: 1 from the individual indel of
the tcm, plus 1 from the gap_opening parameter. Just use
transform(tcm:(1,1)) and things will look better.
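The arithmetic can be checked with a small sketch. It assumes only the cost model described above (each indel position pays the indel cost, and each new run of gaps additionally pays the opening cost); it is not POY's implementation, and the function and parameter names are made up for the illustration.

```python
def aligned_pair_cost(top, bottom, subst=1, indel=1, gap_opening=0):
    """Cost of a fixed pairwise alignment under a simple additive model.

    Every gapped column costs `indel`; the first column of each run of
    gaps on one side costs `gap_opening` on top of that. Mismatched bases
    cost `subst`. Columns with a gap in both rows are ignored.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")

    cost = 0
    prev_gap_side = None  # "top", "bottom", or None if the last column had no gap
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            side = "top" if a == "-" else "bottom"
            cost += indel
            if side != prev_gap_side:
                cost += gap_opening  # a new gap run starts here
            prev_gap_side = side
        else:
            prev_gap_side = None
            if a != b:
                cost += subst
    return cost


# The alignment from the message, under both parameter choices.
print(aligned_pair_cost("A-A", "AAA", indel=1, gap_opening=1))  # 2
print(aligned_pair_cost("A-A", "AAA", indel=1, gap_opening=0))  # 1
```

With gaps treated as a fifth state at unit cost, as in the PAUP runs, a single indel position costs 1, so only the second setting is comparable.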

Anyway, if some of the differences in length are due to sampling
errors, you should probably break the sequences using a multiple
sequence alignment from something like MUSCLE. That way, you can leave
the portions that are really missing, as missing data. An example of a
broken dataset is the following:

>taxon1
AAA#CCC#GGG

>taxon2
AAA#CCCCC#

>taxon3
##GGGGGG

In this case, taxon1 has all 3 fragments present, taxon2 is missing the
last one, and taxon3 is missing the first two.
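One way such a broken file might be produced from an existing multiple alignment is sketched here. The function name, the chosen break columns, and the toy alignment are assumptions for illustration only; the sketch cuts every aligned sequence at the same columns, strips the gap characters from each piece, and joins the pieces with `#`.

```python
def break_at_columns(aligned, breakpoints):
    """Cut each aligned sequence at the given column indices, remove the
    gap characters from every piece, and join the pieces with '#'.

    `aligned` maps taxon name -> aligned sequence (equal lengths, '-' gaps).
    A piece that held only gaps becomes empty, i.e. a missing fragment.
    """
    widths = {len(seq) for seq in aligned.values()}
    if len(widths) > 1:
        raise ValueError("aligned sequences must have equal length")

    broken = {}
    for name, seq in aligned.items():
        bounds = [0, *sorted(breakpoints), len(seq)]
        pieces = [seq[s:e].replace("-", "") for s, e in zip(bounds, bounds[1:])]
        broken[name] = "#".join(pieces)
    return broken


# Toy alignment (invented) that reproduces the example in the message
# when cut after columns 3 and 8.
toy = {
    "taxon1": "AAACCC--GGG---",
    "taxon2": "AAACCCCC------",
    "taxon3": "--------GGGGGG",
}
for name, seq in break_at_columns(toy, [3, 8]).items():
    print(f">{name}\n{seq}")
```

Note that this treats any all-gap piece as missing data. Whether a given piece is truly unsampled or reflects a real deletion is a judgment the sketch cannot make.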

best,

Andres

Ulf

May 8, 2008, 6:14:55 PM

to poy4

THANKS!

ULF
