--
You received this message because you are subscribed to the Google Groups "SATe User" group.
There were some bugs associated with the handling of taxon names in a previous version of SATé, so please make sure that you are running the most recent version of the software, 2.1.0 (available from http://phylo.bio.ku.edu/software/sate/sate.html ).
all the best,
Mark
If you do decide to send the data files to Mark, please also send the trees returned by sate that contain the duplicated taxa, and those that will not open in FigTree. We'll try to resolve the issues ASAP.
Many Thanks,
Jamie
On Apr 17, 2012, at 9:15 AM, Patricia Sanchez-Baracaldo wrote:
Sorry we missed some of your questions in the original post. Here is my best attempt at answers:
> 1) What is the best way to format large data sets? I can see becoming
> a problem when dealing a very large multigene analyses (My data set is
> 139 genes with over 220 taxa). Do I need single gene files or
> interleve? - Single files is what is shown as examples. How to cross
> reference all the names across all the files?
In general, there are a lot of options out there for formatting sequence data (you can confirm this with a google search for something like "convert between sequence formats"). But, I suspect the number of good options is much smaller. Erick Matsen has an impressively comprehensive sequence management command line program, seqmagick (http://fhcrc.github.com/seqmagick/). I've never used it. I always mean to, but then think of it after I've already solved my current formatting problem with an on-the-fly bash or python script. seqmagick is a python program that uses the BioPython library, so I'm not sure how well it will scale up for really large datasets.
Specific to sate, you want each locus in a separate fasta file (they can be aligned or unaligned), all within the same directory. Making sure the taxon names are consistent across the files is important. So, in whatever data management scheme you use, it is probably best to have the IDs originate from one source. One way to do this is to have a master concatenated data file from which the locus-specific fasta files are created.
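As a rough illustration of checking that taxon names are consistent across the per-locus files, here is a minimal Python sketch. It is not part of sate; the function names and the `*.fasta` pattern are assumptions made for this example. It reports names that are duplicated within a file and names that occur in only one file (with many loci, those are often typos).

```python
from collections import Counter
from pathlib import Path


def fasta_names(lines):
    """Return the sequence names (the text after '>') from FASTA lines, in order."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def audit_names(names_by_file):
    """Find duplicated names per file and names that appear in only one file.

    names_by_file maps a file name to the list of sequence names it contains.
    """
    duplicates = {}
    for fname, names in names_by_file.items():
        repeated = sorted(n for n, c in Counter(names).items() if c > 1)
        if repeated:
            duplicates[fname] = repeated
    # Count in how many files each name occurs (once per file at most).
    files_per_name = Counter(n for names in names_by_file.values() for n in set(names))
    singletons = sorted(n for n, c in files_per_name.items() if c == 1)
    return duplicates, singletons


def audit_directory(directory, pattern="*.fasta"):
    """Run audit_names over every file matching pattern in directory."""
    names_by_file = {}
    for path in sorted(Path(directory).glob(pattern)):
        with open(path) as handle:
            names_by_file[path.name] = fasta_names(handle)
    return audit_names(names_by_file)
```

A name reported as occurring in only one file is not necessarily wrong, since loci need not cover every taxon, but it is worth a second look.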
> 2) When genes are NOT present across all taxa what can be done? I
> tested this and Sate doesn’t accept any missing gene data. It gives
> me an error message.
Sate should be able to handle multi-locus data when the taxa for each locus are not completely overlapping. Make sure you are using the most recent version, and please send us any error messages that arise.
> 3) In order to calculate bootstrap values – what is the best way to
> go
> about it. Raxml? I have very little experience with this program,
> but it seems that Raxml needs a PHYLIP format. Does Raxml need ALL
> the
> raw data (e.g.139 genes) to calculate bootstrap values? If this is
> the case – does this mean I need to format all data into a PHYLIP
> format just to get bootstraps values? I am using RaxmlGUI.
Correct: sate does not perform bootstrapping, so you will need to use other software. RAxML is a good option, but yes, it does require a concatenated PHYLIP file. Sate creates such a file, and a character-subsets file, in the temporary file directory (which you can have sate preserve). I will talk with the other developers about an option to have these files placed in the output directory for multi-locus analyses.
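For anyone who would rather build the concatenated file by hand than dig it out of the temporary directory, the idea can be sketched in Python as below. This is an assumption-laden sketch, not the code sate uses: it assumes DNA data, already-aligned loci held in memory, and names without whitespace; the function name is hypothetical. Taxa missing from a locus are padded with `?`, and the second return value follows the RAxML partition-file layout (`DNA, name = start-end`).

```python
def concatenate(loci):
    """Concatenate aligned loci into relaxed-PHYLIP text plus a partition list.

    loci maps a locus name to a dict of {taxon: aligned sequence}.
    """
    taxa = sorted({t for seqs in loci.values() for t in seqs})
    pieces = {t: [] for t in taxa}
    partitions = []
    start = 1
    for locus, seqs in loci.items():
        lengths = {len(s) for s in seqs.values()}
        if len(lengths) != 1:
            raise ValueError(f"{locus}: sequences are not aligned to one length")
        length = lengths.pop()
        for t in taxa:
            # Pad taxa that lack this locus with missing-data characters.
            pieces[t].append(seqs.get(t, "?" * length))
        partitions.append(f"DNA, {locus} = {start}-{start + length - 1}")
        start += length
    phylip = [f"{len(taxa)} {start - 1}"]
    phylip += [f"{t} {''.join(pieces[t])}" for t in taxa]
    return "\n".join(phylip) + "\n", "\n".join(partitions) + "\n"
```

The two returned strings can be written to files and passed to RAxML as the alignment and the partition file, respectively.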
Jamie
I have a few specific comments below.
On Apr 19, 2012, at 4:33 AM, Patricia Sanchez-Baracaldo wrote:
> Hi Jaim
> I supposed there is no easy way to formatting a large data set. At the
> moment I have a 139 gene and 259 taxa :-(
> ...
> I do have Master file - but it is so big hardly anything can read it
> - it is in nexus format.
You can email me the data file off list ( mtho...@ku.edu ), and I can break it up (the master file will work if you have the character numbers that correspond to each gene's boundaries). Or you can zip a directory containing the individual files, and I can convert each one. (I'm afraid that juggling file formats is a recurring headache with bioinformatics tools. I tend to use PAUP or NCLconverter from the NCL package, but PAUP is not free and NCLconverter is not well documented. There are also programming-based approaches for doing this type of operation on a large number of files, such as Biopython and BioPerl. I'm less familiar with MEGA and SeaView, but they may be good options.)
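To make the "character numbers for each gene's boundaries" idea concrete, here is a small Python sketch of the splitting step. It assumes the master alignment has already been read into a dict of taxon name to concatenated sequence (parsing the nexus file is not shown), and that boundaries are 1-based and inclusive; the function names are made up for this example.

```python
def split_by_boundaries(alignment, boundaries):
    """Cut a concatenated alignment into one alignment per gene.

    alignment maps taxon -> concatenated sequence.
    boundaries maps gene -> (start, end), 1-based and inclusive.
    """
    per_gene = {}
    for gene, (start, end) in boundaries.items():
        per_gene[gene] = {t: s[start - 1:end] for t, s in alignment.items()}
    return per_gene


def to_fasta(seqs, drop_all_missing=True):
    """Format {taxon: sequence} as FASTA text.

    With drop_all_missing, taxa whose sequence is only '-' or '?' are skipped,
    so a taxon that lacks a gene is left out of that gene's file.
    """
    records = []
    for taxon, seq in seqs.items():
        if drop_all_missing and not seq.strip("-?"):
            continue
        records.append(f">{taxon}\n{seq}\n")
    return "".join(records)
```

Writing `to_fasta(per_gene[gene])` to one file per gene, all in the same directory, gives the layout described earlier in the thread, with every name coming from the single master file.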
> Going back to Sate:
> Sate it says 2.0.3
I suspect this is a bug in the "about" menu item causing it to report the wrong version. We'll get this fixed in the next release. Sorry about that.
> Missing data:
...
Jamie and I are trying to replicate this, but have not been able to yet. We may ask you for more details if we can't replicate it tomorrow.
>
> Going back to bootstraps:
>
> I am using raxMLGUI. Would this be OK? - or do I need to use the
> other one (command line).
> If Sate uses raxML for tree estimation - it seems like an awful lot of
> effort to format data for sate and then reformat data just to
> calculate the bootstrap values.
We're going to look into exporting a relaxed PHYLIP version of the alignment in the next release. One issue is that this will require us to change some names: FASTA allows basically any name for a sequence, but relaxed PHYLIP doesn't, which is why we've been using FASTA. Changing the sequence names will be a bit confusing, I'm afraid, but it is hard to avoid. (This step is done internally during SATé, and the names are changed back for export at the very end.)
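The rename-then-restore step can be illustrated with a short Python sketch. This is only an illustration of the general technique; the placeholder scheme (`T000001`, `T000002`, ...) and the function names are assumptions for this example and are not taken from SATé's internals.

```python
import re


def make_safe_names(names):
    """Map each original name to a unique placeholder such as 'T000001'.

    Returns (forward, backward) dicts. Raises if the input has duplicates,
    because duplicates could not be restored unambiguously.
    """
    if len(set(names)) != len(names):
        raise ValueError("sequence names must be unique")
    forward = {name: f"T{i:06d}" for i, name in enumerate(names, start=1)}
    backward = {safe: name for name, safe in forward.items()}
    return forward, backward


def restore_names(text, backward):
    """Replace every placeholder in text (e.g. a Newick tree) with its original name."""
    return re.sub(
        r"\bT\d{6}\b",
        lambda match: backward.get(match.group(0), match.group(0)),
        text,
    )
```

If the original names contain characters that are special in Newick (spaces, parentheses, colons, commas), they would also need quoting when restored into a tree; that is left out of this sketch.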
>
> Alignments and 'bad sectors':
> Another question is that - while sate improves alignments for
> phylogenetic reconstruction - in real life examples and especially if
> organisms are distantly related - there would be some sections that
> need to be excluded otherwise this would bring noise into the analyses
> - there is no way to do this? Is there?
No. We don't have any filtering or exclusion steps. You can put the output of SATé into other software (GUIDANCE or Gblocks) to filter, if you like.
> Taxa sets:
> With the format we are using. There is no easy way to include exclude
> taxa sets. Does it mean if this is something I'd like to do - I would
> have to do it manually say accross 139 gene files - this seems rather
> impractical. Let me know if i am missing something.
You are correct that SATé will produce an alignment and tree for the data that you give it. There is no option to exclude taxa or characters.
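Since the exclusion has to happen before the data reach SATé, one way to avoid editing 139 files by hand is a small script. The Python sketch below is an assumption-based example, not a SATé feature; the function names and the `*.fasta` pattern are hypothetical. It copies every FASTA file in a directory to a new directory, leaving out the listed taxa.

```python
from pathlib import Path


def filter_fasta(lines, exclude):
    """Yield FASTA lines, skipping every record whose name is in exclude."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # A header line decides whether this whole record is kept.
            keep = line[1:].strip() not in exclude
        if keep:
            yield line


def filter_directory(src, dst, exclude, pattern="*.fasta"):
    """Write filtered copies of the files in src into dst (a different directory)."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src).glob(pattern)):
        with open(path) as fin, open(dst / path.name, "w") as fout:
            fout.writelines(filter_fasta(fin, exclude))
```

The output directory should differ from the input directory, so that the original files are never overwritten while they are being read.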
all the best,
Mark
Mark Holder