Sate errors using more than one gene, formatting data and bootstrap using raxml


Patricia Sanchez-Baracaldo

Apr 17, 2012, 10:15:35 AM
to SATe User
Dear Sate User-group,

I am trying to use Sate to analyse a large data set. I am starting by
just checking that the program runs and it reads my files.

1. Program runs. I have a Mac OS 10.6.8
2. Program reads single and multilocus

Just to check formatting – I am working with a small group of genes –
5 genes and 25 taxa. I have run 5 separate single gene files and they
seem to behave OK.

The problem starts when I want to run multilocus runs.

I get several problems:

When analysing two genes at a time:

1. It seems to duplicate some taxa - they seem to have unusually long
branches. I have long branches in my data set - but not the ones that
are being duplicated by Sate. Starting with 25 taxa - showing 26
taxa in the tree file (FigTree).
2. Formatting data - they are in fasta format - all sequences/taxa
are in the same order and I think the names are the same in both
files - maybe I have missed a space... but I can't see it.

When analysing 3 genes at a time:

1. A similar problem is happening here - it is duplicating my taxa.
From 25 in the matrix/file, the tree shows 27.
2. When running the analyses it said there was an error - yet there
was a tree?
3. The file that says error has no information. The error only
appeared in Sate, but it continued and there was a tree at the end.
Please find the message at the end of this message - Appendix 1.

Common error:

1. Sate – when opening tree files in figtree – it says that there is
no tree. However, there is a tree. So it gives a misleading
message.
2. If sate doesn’t read files – it says job is finished but there is
no specific message.

More questions - formatting data:

1) Is there a list somewhere pointing at the best way to format a
large data set? I can see this becoming a problem when dealing with
very large multigene analyses (my data set is 139 genes with over 220
taxa). Do I need single files, or interleaved?

2) Not all genes are present across all taxa. Some are missing in the
outgroups. I tried uploading some files with missing data (by this I
mean some taxa had no genes - this is for the simple data set I talked
about above) and Sate didn't recognise the files. I put it down to
this, since it later on read files where all taxa had protein
information.

3) In order to calculate bootstrap values - what is the best way to go
about it? RAxML? I have very little experience with this program,
but it seems that RAxML needs a PHYLIP format. Does RAxML need the
raw data (e.g. 139 genes) to calculate bootstrap values? If this is
the case - does this mean I need to format all data into PHYLIP
format?

I will be attending the workshop in May. I would like to have my data
formatted to be able to work on it during the workshop. That is why I
am testing that things work.

Any help will be greatly appreciated.

Thank you.

Patricia

Appendix 1. This is what it said:

SATe INFO: Reading input sequences from '/Users/patriciascoones/
Documents/Patricia/PROYECTS/NITROGENASE /Sate - nitrogenase
phylogenetic analyses/SateMultiLocus/nifDMuscle17April.fasta'...
SATe INFO: Reading input sequences from '/Users/patriciascoones/
Documents/Patricia/PROYECTS/NITROGENASE /Sate - nitrogenase
phylogenetic analyses/SateMultiLocus/nifEMuscle17April.fasta'...
SATe INFO: Reading input sequences from '/Users/patriciascoones/
Documents/Patricia/PROYECTS/NITROGENASE /Sate - nitrogenase
phylogenetic analyses/SateMultiLocus/nifHMuscle17April.fasta'...
SATe INFO: Directory for temporary files created at /Users/
patriciascoones/.sate/satejob/temp9fgzTO
SATe INFO: Performing initial tree search to get starting tree...
SATe INFO: Input sequences assumed to be aligned (based on sequence
lengths).
SATe INFO: Performing initial tree search to get starting tree...
SATe INFO: Starting SATe algorithm on initial tree...
SATe INFO: Step 0. Realigning with decomposition strategy set to
longest
SATe INFO: Step 0. Alignment obtained. Tree inference beginning...
SATe INFO: realignment accepted.
SATe INFO: current score: -24469.398797, best score: -24469.398797
SATe INFO: Step 1. Realigning with decomposition strategy set to
longest
SATe is exiting because of an error:
max() arg is an empty sequence
Job satejob is finished.
SATe INFO: Reading input sequences from '/Users/patriciascoones/
Documents/Patricia/PROYECTS/NITROGENASE /Sate - nitrogenase
phylogenetic analyses/SateMultiLocus/nifDMuscle17April.fasta'...
SATe INFO: Reading input sequences from '/Users/patriciascoones/
Documents/Patricia/PROYECTS/NITROGENASE /Sate - nitrogenase
phylogenetic analyses/SateMultiLocus/nifEMuscle17April.fasta'...
SATe INFO: Reading input sequences from '/Users/patriciascoones/
Documents/Patricia/PROYECTS/NITROGENASE /Sate - nitrogenase
phylogenetic analyses/SateMultiLocus/nifHMuscle17April.fasta'...
SATe INFO: Directory for temporary files created at /Users/
patriciascoones/.sate/satejob/tempImBg_r
SATe INFO: Performing initial tree search to get starting tree...
SATe INFO: Input sequences assumed to be aligned (based on sequence
lengths).
SATe INFO: Performing initial tree search to get starting tree...
SATe INFO: Starting SATe algorithm on initial tree...
SATe INFO: Step 0. Realigning with decomposition strategy set to
longest
SATe INFO: Step 0. Alignment obtained. Tree inference beginning...
SATe INFO: realignment accepted.
SATe INFO: current score: -23653.22246, best score: -23653.22246
SATe INFO: Step 1. Realigning with decomposition strategy set to
longest
SATe INFO: Step 1. Alignment obtained. Tree inference beginning...
SATe INFO: realignment accepted.
SATe INFO: current score: -23795.828073, best score: -23653.22246
SATe INFO: Writing final alignment to /Users/patriciascoones/Documents/
Patricia/PROYECTS/NITROGENASE /TEST-PHYLOGENY/MultiloucsE&D&HApril17/
satejob1.marker001.nifDMuscle17April.aln
SATe INFO: Writing final alignment to /Users/patriciascoones/Documents/
Patricia/PROYECTS/NITROGENASE /TEST-PHYLOGENY/MultiloucsE&D&HApril17/
satejob1.marker002.nifEMuscle17April.aln
SATe INFO: Writing final alignment to /Users/patriciascoones/Documents/
Patricia/PROYECTS/NITROGENASE /TEST-PHYLOGENY/MultiloucsE&D&HApril17/
satejob1.marker003.nifHMuscle17April.aln
SATe INFO: Writing final tree to /Users/patriciascoones/Documents/
Patricia/PROYECTS/NITROGENASE /TEST-PHYLOGENY/MultiloucsE&D&HApril17/
satejob1.tre
SATe INFO: Writing final likelihood score to /Users/patriciascoones/
Documents/Patricia/PROYECTS/NITROGENASE /TEST-PHYLOGENY/
MultiloucsE&D&HApril17/satejob1.score
SATe INFO: Total time spent: 809.807792902s
SATe INFO: Note that temporary files from the run have not been
deleted, they can be found in:
'/Users/patriciascoones/.sate/satejob/tempImBg_r'

SATe INFO:
If you cannot see this directory in the Finder application, you may
want to use
the 'open' command executed from a Terminal. You can do this by
launching the
/Applications/Utilities/Terminal program and then typing

open '/Users/patriciascoones/.sate/satejob/tempImBg_r'

followed by a return at the prompt. If the argument to the open
command is a
directory, then it should open a Finder window in the directory (even
if that
directory is hidden by default).

Job satejob is finished.

Tandy Warnow

Apr 17, 2012, 11:01:48 AM
to sate...@googlegroups.com
Thanks, Patricia -- We'll figure out the problem and get back to you as soon as we can.

Tandy



Mark Holder

Apr 17, 2012, 12:04:45 PM
to sate...@googlegroups.com
Hi,
I'm sorry to hear about the problems. We'll certainly look into this and fix it as soon as possible.

If you would like to email me the data files then you can email them to mtho...@ku.edu or mtho...@gmail.com (That would keep them off the public group email list). We'll certainly try to diagnose what could be causing this even if you don't send the data files, but it might be easier for us to run your data.

There were some bugs associated with handling of taxa names in a previous version of SATé, so please make sure that you are running the most recent version of the software 2.1.0 (available from http://phylo.bio.ku.edu/software/sate/sate.html )

all the best,
Mark

Jamie Oaks

Apr 17, 2012, 2:12:08 PM
to sate...@googlegroups.com
Hi Patricia,

If you do decide to send the data files to Mark, please also send the trees returned by sate with the duplicated taxa, and those that are not opening with figtree. We'll try to resolve the issues ASAP.

Many Thanks,

Jamie


Patricia Sanchez-Baracaldo

Apr 18, 2012, 5:06:37 AM
to SATe User
Hi Mark,

Thank you. I am sending you the data files and trees I got
yesterday.

best wishes,
P


Patricia Sanchez-Baracaldo

Apr 18, 2012, 8:44:54 AM
to SATe User
Hi all,

I have sorted the problem out. The two duplicated taxa happened to
have tiny differences in their names - so Sate wouldn't reject them -
but duplicated them and gave them a long branch. I went back to my
original files (I use Se-al to convert to fasta) and the names were
OK, but when converting to fasta Se-al messes the names up a bit. I
had to do a lot of cross-checking yesterday across 3 gene files. Any
suggestions on programs apart from Se-al to handle data?

I haven't sent you the files since I figured the problem out.

Going back to my original message, I have questions about formatting
and bootstrapping in RAxML. I will list them again at the end of this
message. Thank you.

Remaining questions from yesterday:

More questions - formatting data:

1) What is the best way to format large data sets? I can see this
becoming a problem when dealing with very large multigene analyses (my
data set is 139 genes with over 220 taxa). Do I need single gene files
or interleaved? - Single files are what is shown in the examples. How
do I cross-reference all the names across all the files?

2) When genes are NOT present across all taxa what can be done? I
tested this and Sate doesn’t accept any missing gene data. It gives
me an error message.

3) In order to calculate bootstrap values - what is the best way to go
about it? RAxML? I have very little experience with this program,
but it seems that RAxML needs a PHYLIP format. Does RAxML need ALL the
raw data (e.g. 139 genes) to calculate bootstrap values? If this is
the case - does this mean I need to format all data into PHYLIP
format just to get bootstrap values? I am using raxmlGUI.

Looking forward to hearing from you guys.

Best wishes,
Patricia

Jamie Oaks

Apr 18, 2012, 12:53:22 PM
to sate...@googlegroups.com
Thanks for the update, Patricia, I'm glad you were able to work out the problem.

Sorry we missed some of your questions in the original post. Here is my best attempt at answers:

> 1) What is the best way to format large data sets? I can see becoming
> a problem when dealing a very large multigene analyses (My data set is
> 139 genes with over 220 taxa). Do I need single gene files or
> interleve? - Single files is what is shown as examples. How to cross
> reference all the names across all the files?

In general, there are a lot of options out there for formatting sequence data (you can confirm this with a google search for something like "convert between sequence formats"). But, I suspect the number of good options is much smaller. Erick Matsen has an impressively comprehensive sequence management command line program, seqmagick (http://fhcrc.github.com/seqmagick/). I've never used it. I always mean to, but then think of it after I've already solved my current formatting problem with an on-the-fly bash or python script. seqmagick is a python program that uses the BioPython library, so I'm not sure how well it will scale up for really large datasets.

Specific to sate, you want each locus in a separate fasta file (they can be aligned or unaligned), all within the same directory. Making sure the taxon names are consistent across the files is important. So, in whatever data management scheme you use, it is probably best to have the IDs originate from one source. One way to do this is to have a master concatenated data file from which the locus-specific fasta files are created.
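A minimal sketch of the kind of name check this implies, in plain Python with no third-party libraries. The function names, the example file labels, and the 0.85 similarity cutoff are all assumptions made for illustration; none of this is part of SATe. Names are deliberately not stripped of whitespace, so a stray trailing space shows up as a difference.

```python
import difflib
from itertools import combinations


def read_fasta_names(text):
    """Return sequence names (everything after '>') exactly as written."""
    return [line[1:] for line in text.splitlines() if line.startswith(">")]


def missing_by_file(names_by_file):
    """For each file, list the names seen elsewhere but absent from it."""
    union = set(n for names in names_by_file.values() for n in names)
    return {label: sorted(union - set(names))
            for label, names in names_by_file.items()}


def suspicious_name_pairs(names_by_file, cutoff=0.85):
    """Return pairs of distinct names that look alike (possible typos).

    The cutoff is an arbitrary illustrative threshold on difflib's
    similarity ratio; genuinely different but similar names will also
    be reported and need checking by eye.
    """
    all_names = sorted(set(n for names in names_by_file.values() for n in names))
    pairs = []
    for a, b in combinations(all_names, 2):
        if difflib.SequenceMatcher(None, a, b).ratio() >= cutoff:
            pairs.append((a, b))
    return pairs
```

Printing each reported name with `repr()` makes invisible differences such as spaces easier to spot.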

> 2) When genes are NOT present across all taxa what can be done? I
> tested this and Sate doesn’t accept any missing gene data. It gives
> me an error message.

Sate should be able to handle multi-locus data when the taxa for each locus are not completely overlapping. Make sure you are using the most recent version, and please send us any error messages that arise.

> 3) In order to calculate bootstrap values – what is the best way to go
> about it. Raxml? I have very little experience with this program,
> but it seems that Raxml needs a PHYLIP format. Does Raxml need ALL the
> raw data (e.g.139 genes) to calculate bootstrap values? If this is
> the case – does this mean I need to format all data into a PHYLIP
> format just to get bootstraps values? I am using RaxmlGUI.

Yes, sate will not perform bootstrapping, so you will need to use other software. RAxML is a good option, but yes, it does require a concatenated phylip file. Sate creates such a file, and a character-subsets file, in the temporary file directory (which you can have sate preserve). I will talk with the other developers about an option to have these files placed in the output directory for multi-locus analyses.
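For readers who want to build such a file by hand, here is a rough sketch of concatenating per-locus alignments into a relaxed PHYLIP matrix plus a list of character subsets. It is not the code SATe uses. The function names are hypothetical, the input is assumed to be already parsed into dictionaries, taxon names are assumed to contain no whitespace, and taxa absent from a locus are filled with '?' as an assumed missing-data symbol. The subset lines follow the "MODEL, name = start-end" style of RAxML partition files.

```python
def concatenate_loci(loci, missing="?"):
    """Concatenate aligned loci.

    loci: list of (locus_name, {taxon: aligned_sequence}) in the desired order.
    Returns (matrix, boundaries), where matrix maps taxon -> concatenated
    sequence and boundaries lists (locus_name, start, end) using 1-based,
    inclusive column numbers.
    """
    taxa = sorted(set(t for _, seqs in loci for t in seqs))
    parts = {t: [] for t in taxa}
    boundaries = []
    start = 1
    for name, seqs in loci:
        lengths = set(len(s) for s in seqs.values())
        if len(lengths) != 1:
            raise ValueError("locus %s is empty or not aligned" % name)
        length = lengths.pop()
        for t in taxa:
            # Taxa with no sequence for this locus get a block of missing data.
            parts[t].append(seqs.get(t, missing * length))
        boundaries.append((name, start, start + length - 1))
        start += length
    matrix = {t: "".join(p) for t, p in parts.items()}
    return matrix, boundaries


def to_relaxed_phylip(matrix):
    """Render the matrix as sequential relaxed PHYLIP text."""
    nchar = len(next(iter(matrix.values())))
    lines = ["%d %d" % (len(matrix), nchar)]
    for taxon in sorted(matrix):
        lines.append("%s %s" % (taxon, matrix[taxon]))
    return "\n".join(lines) + "\n"


def to_partition_lines(boundaries, model="DNA"):
    """One 'MODEL, locus = start-end' line per locus."""
    return ["%s, %s = %d-%d" % (model, name, s, e) for name, s, e in boundaries]
```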

Jamie

Patricia Sanchez-Baracaldo

Apr 19, 2012, 5:33:41 AM
to SATe User
Hi Jamie,

Thank you for the very helpful email with suggestions. Some of your
points I had already gathered from the sample files.

I suppose there is no easy way to format a large data set. At the
moment I have 139 genes and 259 taxa :-(
I have them all in single files and they are aligned, but I need to
convert them into fasta format. I'll figure out a way.
I do have a master file - but it is so big that hardly anything can
read it - it is in nexus format.

Going back to Sate:

I thought I was using the latest version - but when looking at "About
Sate" it says 2.0.3 - this is what was downloaded from
http://phylo.bio.ku.edu/software/sate/sate.html - I use a Mac - and
the website says it is 2.1.0. Why would it say 2.1.0 there? I am not
sure what is going on. How can I get the latest version for Mac if it
is not on that website?

Missing data:
I was using the same data set I sent Mark (which I believe you now
have) - and I 'deleted' a sequence to test that it can handle missing
data. Sate didn't recognise it and gave me error messages. It would be
great if it could handle data sets with missing data. Shall I do
this again and send you what the error message is telling me?

Going back to bootstraps:

I am using raxMLGUI. Would this be OK? - or do I need to use the
other one (command line).
If Sate uses raxML for tree estimation - it seems like an awful lot of
effort to format data for sate and then reformat data just to
calculate the bootstrap values. Unless there is an easier way to use
what Sate is producing and import it into raxML - especially for large
data sets - I'll wait to hear from you on this one - this is crucial.
I found an alignment file with an aln extension - is this the phy
format file you mentioned?

Alignments and 'bad sectors':
Another question is that - while sate improves alignments for
phylogenetic reconstruction - in real life examples and especially if
organisms are distantly related - there would be some sections that
need to be excluded otherwise this would bring noise into the analyses
- there is no way to do this? Is there?

Taxa sets:
With the format we are using, there is no easy way to include or
exclude taxa sets. Does it mean that if this is something I'd like to
do, I would have to do it manually across, say, 139 gene files? This
seems rather impractical. Let me know if I am missing something.

Thanks again - I hope all the questions I have sent are clear - let me
know.

Cheers,
P



Mark Holder

Apr 19, 2012, 11:23:50 PM
to sate...@googlegroups.com
Hi Patricia,
We're hoping to have a new version released early next week. That should help with some of these issues (e.g. the piggybacking of RAxML bootstrapping after SATé). Thanks for reporting bugs and your issues with the software. I'm sorry to hear of the difficulties, but it is a huge help to us to get the feedback. Thanks for taking the time to send the emails!

I have a few specific comments below.

On Apr 19, 2012, at 4:33 AM, Patricia Sanchez-Baracaldo wrote:

> Hi Jaim


> I supposed there is no easy way to formatting a large data set. At the
> moment I have a 139 gene and 259 taxa :-(

> ...


> I do have Master file - but it is so big hardly anything can read it
> - it is in nexus format.

You can email me the data file off list ( mtho...@ku.edu ), and I can break it up (the master file will work, if you have the character numbers that correspond to each gene's boundaries). Or you can zip compress a directory with the individual files; I can convert each one. (I'm afraid that juggling file formats is a recurring headache with bioinformatics tools. I tend to use PAUP or NCLconverter in the NCL package, but PAUP is not free and NCLconverter is not well documented. There are several programming-based approaches to do this type of operation for a large number of files, such as Biopython and BioPerl. I'm less familiar with MEGA and SeaView, but they may be good options.)
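A small sketch of the splitting step, assuming the master matrix has already been read into a dictionary (parsing the NEXUS file itself is not shown) and that the gene boundaries are known as 1-based, inclusive column numbers. The function name and the choice to skip taxa whose slice holds only gap or missing symbols are illustrative assumptions, not anything taken from PAUP or NCLconverter.

```python
def split_by_boundaries(matrix, boundaries, empty_chars="-?"):
    """Cut a concatenated alignment into per-gene FASTA text.

    matrix: {taxon: concatenated_sequence}
    boundaries: {gene: (start, end)} with 1-based, inclusive columns
    Returns {gene: fasta_text}. A taxon is left out of a gene's file when
    its slice contains nothing but gap/missing symbols.
    """
    per_gene = {}
    for gene, (start, end) in boundaries.items():
        records = []
        for taxon, seq in matrix.items():
            piece = seq[start - 1:end]
            if piece.strip(empty_chars):
                records.append(">%s\n%s\n" % (taxon, piece))
        per_gene[gene] = "".join(records)
    return per_gene
```

Because every per-gene file is cut from the same matrix, the taxon names are identical across files by construction.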


> Going back to Sate:
> Sate it says 2.0.3

I suspect this is a bug in the "about" menu item causing it to report the wrong version. We'll get this fixed in the next release. Sorry about that.


> Missing data:
...
Jamie and I are trying to replicate this, but have not been able to yet. We may ask you for more details if we can't replicate it tomorrow.

>
> Going back to bootstraps:
>
> I am using raxMLGUI. Would this be OK? - or do I need to use the
> other one (command line).
> If Sate uses raxML for tree estimation - it seems like an awful lot of
> effort to format data for sate and then reformat data just to
> calculate the bootstrap values.

We're going to look into exporting a relaxed phylip version of the alignment in the next release. An issue with this is that it will require us to change some names (FASTA allows basically any name for a sequence, but relaxed phylip doesn't, which is why we've been using FASTA). Changing the sequence names will be a bit confusing, I'm afraid, but it is hard to avoid (this step is done internally during SATé, and then the names are changed back for export at the very end).
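The rename-and-restore idea can be sketched as follows. This is not SATé's internal code: the function name, the rule of replacing every character other than letters, digits and underscore, and the numeric suffix used to break ties are all assumptions chosen for illustration.

```python
import re


def make_safe_names(names):
    """Build aliases without whitespace or punctuation, plus the reverse map.

    Returns (to_safe, to_orig): to_safe maps original name -> alias and
    to_orig maps alias -> original name, so aliases can be swapped back
    after the external program has run.
    """
    to_safe, to_orig = {}, {}
    for name in names:
        if name in to_safe:
            continue  # same original name listed twice
        base = re.sub(r"[^A-Za-z0-9_]", "_", name)
        safe, n = base, 1
        # Two different originals can collapse to the same alias; add a
        # numeric suffix until the alias is unique.
        while safe in to_orig:
            n += 1
            safe = "%s_%d" % (base, n)
        to_safe[name] = safe
        to_orig[safe] = name
    return to_safe, to_orig
```

Keeping `to_orig` alongside the exported file is what allows the original names to be put back on the final tree and alignment.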

>
> Alignments and 'bad sectors':
> Another question is that - while sate improves alignments for
> phylogenetic reconstruction - in real life examples and especially if
> organisms are distantly related - there would be some sections that
> need to be excluded otherwise this would bring noise into the analyses
> - there is no way to do this? Is there?

No. We don't have any filtering or exclusion steps. You can put the output of SATé into other software (Guidance or GBlocks) to filter, if you like.
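As a rough illustration of what post-hoc filtering means, the sketch below drops alignment columns in which more than a chosen fraction of taxa have a gap. This is a deliberately simple stand-in and is not the algorithm used by Guidance or GBlocks; the function name, the 0.5 default threshold, and the set of gap symbols are assumptions.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-?"):
    """Remove columns whose gap fraction exceeds max_gap_fraction.

    alignment: {taxon: aligned_sequence}, all sequences the same length.
    Returns (filtered_alignment, kept_column_indices) with 0-based indices.
    """
    taxa = list(alignment)
    ncol = len(alignment[taxa[0]])
    keep = []
    for i in range(ncol):
        gaps = sum(1 for t in taxa if alignment[t][i] in gap_chars)
        if gaps / len(taxa) <= max_gap_fraction:
            keep.append(i)
    filtered = {t: "".join(alignment[t][i] for i in keep) for t in taxa}
    return filtered, keep
```

Returning the kept column indices makes it possible to record exactly which sites were excluded.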


> Taxa sets:
> With the format we are using. There is no easy way to include exclude
> taxa sets. Does it mean if this is something I'd like to do - I would
> have to do it manually say accross 139 gene files - this seems rather
> impractical. Let me know if i am missing something.

You are correct that SATé will produce an alignment and tree for the data that you give it. There is no exclude option (for taxa or characters).
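Until such an option exists, excluding taxa has to happen before the files reach SATé. A minimal sketch of doing that across a directory of FASTA files rather than by hand; the function names and the "*.fasta" pattern are assumptions, and taxon names are compared after trimming surrounding whitespace.

```python
from pathlib import Path


def exclude_taxa(fasta_text, excluded):
    """Return FASTA text with every record whose name is in `excluded` removed."""
    out = []
    keep = True
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            keep = line[1:].strip() not in excluded
        if keep:
            out.append(line)
    return "\n".join(out) + ("\n" if out else "")


def exclude_from_directory(src_dir, dst_dir, excluded, pattern="*.fasta"):
    """Write a filtered copy of each matching file into dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob(pattern)):
        (dst / path.name).write_text(exclude_taxa(path.read_text(), excluded))
```

Writing to a separate output directory leaves the original files untouched, so different taxon sets can be tried from the same source data.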

all the best,
Mark

Mark Holder

mtho...@gmail.com
http://phylo.bio.ku.edu/mark-holder

Patricia Sanchez-Baracaldo

Apr 20, 2012, 12:37:18 PM
to SATe User
Hi Mark,

Thanks for the email.

I have done the painful task of turning my big PAUP file into smaller
ones - and they are now in nexus. I think I will use Mesquite to
change the format to fasta. My main concern at the moment is the
missing data in Sate, because my data set ranges from bacteria to
angiosperms, so as you can imagine some genes are not present across
all taxa. Let me know if you want me to send you something related to
this and the error message I am getting. This is crucial for me.

The other concern is not being able to include or exclude character
sets or taxa sets - which can be done in PAUP. People here say it
is an antiquated programme, but in my opinion it has some really good
things.

Including and excluding taxa help us see what happens to long branch
attraction in some lineages - which can be problematic and a big issue
determining deep branch relationships. So having this facility is of
great use for us. Also we use the tree for character analyses - so it
is only the first step - it just needs to be robust.

I am looking forward to seeing the new version and to be able to get
some branch support. But yes, something that can be easily imported
to get values would really help.

All of my data is already aligned - if anything, I would like to
exclude regions that are likely to introduce noise and do not align
properly.

Perhaps during the workshop I could get some help with these programs,
or hints on how to use NCLconverter in the NCL package - since I am
not familiar with it.

Thanks again,

best wishes,

Patricia





Jamie Oaks

Apr 24, 2012, 4:17:44 PM
to sate...@googlegroups.com
Hi Patricia,

We found a bug in the current version of SATe associated with multi-locus analyses of amino acid data.  It explains your long branches.  SATe was inserting 'N's into the concatenated alignment for missing taxa, which is fine for nucleotide data, but for amino acid sequences is creating erroneous sequences of Asparagine.
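To make the problem concrete: 'N' means "any nucleotide" in the DNA alphabet but is the one-letter code for asparagine in the protein alphabet, where 'X' is the usual "unknown residue" symbol. The sketch below shows padding that depends on the datatype, plus a quick check for blocks that consist only of 'N'. It is an illustration under those assumptions, with hypothetical function names, and is not the fix applied in SATe.

```python
# Ambiguity symbol meaning "unknown" in each alphabet.
UNKNOWN = {"dna": "N", "protein": "X"}


def pad_missing(length, datatype):
    """Return a block of 'unknown' symbols appropriate to the datatype."""
    return UNKNOWN[datatype.lower()] * length


def looks_like_n_padding(seq, start, end):
    """True if columns start..end (1-based, inclusive) contain only 'N'.

    In a protein alignment such a block reads as a run of asparagine,
    which is what produced the spurious long branches described above.
    """
    block = seq[start - 1:end]
    return len(block) > 0 and set(block) == {"N"}
```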

We are correcting this error and will post a new release soon, but in the meantime, do not use SATe for multi-locus amino acid analyses.

Sorry for your trouble.  We'll get the new release out ASAP.

Thanks!

Jamie

~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~
Jamie Oaks
Biodiversity Institute
Department of Ecology & Evolutionary Biology
University of Kansas
Dyche Hall, 1345 Jayhawk Blvd
Lawrence, KS 66045-7561

Office Phone:  785-864-3439
Office Fax:  785-864-5335
E-mail:  joa...@ku.edu
~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~=~




Patricia Sanchez-Baracaldo

Apr 25, 2012, 10:13:24 AM
to SATe User
Hi Jamie,

Thanks for letting me know. I look forward to seeing the new
version.

An additional question. I will be combining aa and ribosomal RNA (with
A, G, C and U). I am wondering in advance if U would be an issue? Or
shall I look into converting it into a T?

all the best,

Patricia

Jamie Oaks

Apr 25, 2012, 5:25:31 PM
to sate...@googlegroups.com
Hi Patricia,

Currently, SATe does not recognize the nucleotide symbol "U", so you would need to convert to "T". Also, multi-locus analyses in SATe currently do not support mixed datatypes (i.e., amino acid and DNA sequence data). However, you can run SATe on each locus individually, and subsequently concatenate the resulting alignments for a mixed-datatype analysis in RAxML.
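The U-to-T conversion is a one-line translation as long as the header lines are left alone. A small sketch in plain Python; the function name is hypothetical, and it should only be applied to nucleotide files, since amino acid sequences can legitimately contain the letter U.

```python
def rna_to_dna_fasta(fasta_text):
    """Replace U/u with T/t in sequence lines, leaving '>' header lines as-is."""
    table = str.maketrans("Uu", "Tt")
    out = []
    for line in fasta_text.splitlines():
        out.append(line if line.startswith(">") else line.translate(table))
    return "\n".join(out) + "\n"
```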

Cheers,

Jamie

Mark Holder

Apr 25, 2012, 5:58:25 PM
to sate...@googlegroups.com
Hi,
We should be able to fix both of those issues in the next couple of weeks. RAxML will deal with mixed data, but I don't think that FastTree will (I'll double-check that). So if we are able to get the mixed-datatype support in SATé then it is likely that RAxML will be the only choice for the 'tree estimator' setting.

thanks for the feedback and feature requests,
Mark