Subsets with 3 or fewer states causing RAxML error

Christopher Hann-Soden

Sep 8, 2015, 7:46:08 PM

to PartitionFinder

Hi there,

I'm running PartitionFinder on a genome-scale dataset: around 12 Mbp of alignment across 15 genomes, split into 14k partitions. Each partition represents a colinear segment of the genomes, as synteny is not conserved between these species.

The analysis seems to nearly complete, and then PartitionFinder tries to call RAxML with a phylip file generated by PartitionFinder (./analysis/phylofiles/da1bbae8b31a782da69f66f9e44285bd.phy) as input. This phylip file contains a 30 bp alignment that has only A, T, and C bases. RAxML is unable to work with this kind of data and throws an error, stopping the whole analysis. The log is too large to attach, so here's a link: https://drive.google.com/file/d/0B5Nt-F8HRgRZWEl1ZjFWR3BzMFk/view?usp=sharing

I am not certain where this phylip file was generated from. Is this supposed to be one of the partitions I designated, or something else? Because I filtered out all such partitions in this data set (or thought I did).

I find this error more than slightly ironic, because I am attempting to use PartitionFinder to circumvent such limitations. I am able to run a phylogenetic analysis on this dataset using ExaML, but when I attempt a bootstrap analysis using thousands of (sometimes very short) partitions, the probability of randomly bootstrapping a partition so that it has only 3 of the 4 bases becomes very high, especially across hundreds of bootstrap replicates. I therefore want to use PartitionFinder to group partitions that are evolving similarly, so that I can bootstrap them together and reduce the probability of this happening to a negligible level.
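
To put a rough number on that risk, here is a minimal sketch. It assumes that a bootstrap replicate resamples a partition's columns with replacement, and that a base is lost when none of the columns containing it is drawn. The function names and the example numbers are illustrative only; nothing here is taken from RAxML or ExaML.

```python
def prob_state_lost(n_sites, n_sites_with_state):
    """Chance that one bootstrap replicate of a partition contains no copy
    of a given base, when only n_sites_with_state of its n_sites columns
    carry that base (columns are resampled with replacement)."""
    return (1.0 - n_sites_with_state / n_sites) ** n_sites


def prob_any_replicate_fails(p_single, n_replicates):
    """Chance that at least one of n_replicates independent replicates
    loses the base, given the per-replicate probability p_single."""
    return 1.0 - (1.0 - p_single) ** n_replicates


# Hypothetical example: a 30 bp partition in which G appears in 2 columns.
p = prob_state_lost(30, 2)
print(round(p, 3))
print(round(prob_any_replicate_fails(p, 100), 3))
```

The per-partition risk falls quickly as the number of columns carrying the rare base grows, which is the motivation for grouping short partitions before bootstrapping.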

Thank you,

Christopher

Brett Calcott

Sep 8, 2015, 8:05:51 PM

to partiti...@googlegroups.com

Hi Christopher,

There's not much chance we can debug this without access to the config / alignment etc. Can you provide a link to a zipfile of the analysis?

Cheers,

Brett

--
You received this message because you are subscribed to the Google Groups "PartitionFinder" group.
To unsubscribe from this group and stop receiving emails from it, send an email to partitionfind...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Rob Lanfear

Sep 8, 2015, 9:00:50 PM

to PartitionFinder

Hi Christopher,

I can see from your log file that PartitionFinder is not getting past your starting scheme. So, as far as I can tell, you have a subset in your starting scheme (i.e. one you defined in your data blocks) that has only 3 states. As you point out, RAxML cannot analyse this subset, so you will need to either remove this from your analysis, or join it together with another subset.
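
One way to find such subsets before starting a run is to count the states present in each data block. The sketch below is not PartitionFinder code: the function name and the toy data are hypothetical, and it assumes simple 1-based inclusive ranges with no codon-position strides.

```python
def subsets_missing_states(alignment, data_blocks, alphabet="ACGT"):
    """Return {block_name: sorted missing states} for every data block that
    does not contain all states in `alphabet`.

    alignment   -- dict mapping taxon name to aligned sequence (equal lengths)
    data_blocks -- dict mapping block name to (start, end), 1-based inclusive
    """
    problems = {}
    for name, (start, end) in data_blocks.items():
        seen = set()
        for seq in alignment.values():
            # Collect every character observed in this block for this taxon.
            seen.update(seq[start - 1:end].upper())
        missing = set(alphabet) - seen
        if missing:
            problems[name] = sorted(missing)
    return problems


# Toy data (hypothetical): block "b2" never contains a G.
aln = {
    "taxon1": "ACGTACGTAACCTT",
    "taxon2": "ACGAACGTATCCTA",
}
blocks = {"b1": (1, 8), "b2": (9, 14)}
print(subsets_missing_states(aln, blocks))
```

Any block reported here would need to be removed or joined with a neighbouring block before the analysis.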

Cheers,

Rob

--

Rob Lanfear
School of Biological Sciences,
Macquarie University,

Sydney

phone: +61 (0)2 9850 8204

www.robertlanfear.com

Karen Meusemann

Sep 9, 2015, 12:35:14 AM

to PartitionFinder

Hi all, just a short note:

FYI: we are currently facing a related problem when analysing large datasets (with either rcluster or kmeans) from insects (aa) and eucalypts (nt): the resulting partitions (more often when kmeans is used) do not fulfil the RAxML/ExaML requirements in subsequent tree reconstructions. The resulting partitions are too uniform (sometimes very small, but not always) for downstream tree reconstruction. PartitionFinder itself runs through correctly, though. The problem becomes much more serious when generating BS replicates. It probably can't be solved in PF, but maybe someone has a good idea.

Karen

Rob Lanfear

Sep 9, 2015, 6:55:37 AM

to PartitionFinder

Hi Karen,

I'm a bit confused about your statement that partitions don't work in RAxML. PartitionFinder wouldn't run (as Christopher experienced) if any partitions don't run in RAxML (assuming you run it with --raxml). So, as far as I know, if PartitionFinder successfully analyses your dataset with the --raxml setting, you should be able to run an analysis (no bootstraps, see below) in RAxML without any issues. Can you clarify? If you have a dataset you can send me where this is an issue, I'd be interested to see it.

This is something I'd like to get right, and indeed it's one reason we still use RAxML and PhyML, rather than switching to other possible solutions (like PLL, or PyCogent). The other solutions might be quicker, but if we use RAxML and PhyML the hope is that we can generate partitioning schemes that will then work in those programs, at least for getting the single ML tree.

I can see the bootstrap issue - this is a real problem. It's obviously something that would be nice to address, but I'm not sure how we could hope to do it in PF, since we cannot control how bootstrap replicates are made. However, you can imagine a simple solution in RAxML would be to make sure that when resampling a particular partition, one would only accept bootstrap replicates in which all four bases were present. That would slightly reduce the flexibility of the bootstrapping, but it would certainly solve the issue.
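
The accept-only-complete-replicates idea can be sketched as a rejection sampler. This is an illustration only, not a RAxML feature: the function name, the `max_tries` cap, and the toy data are all assumptions.

```python
import random


def bootstrap_with_all_states(columns, alphabet="ACGT", max_tries=1000, rng=None):
    """Resample alignment columns with replacement, rejecting any replicate
    in which some state of `alphabet` is absent.

    columns -- list of strings, one per alignment column (one char per taxon)
    Returns the accepted list of columns, or None if no replicate passed
    within max_tries (e.g. because the original partition lacks a state).
    """
    rng = rng or random.Random()
    required = set(alphabet)
    for _ in range(max_tries):
        replicate = [rng.choice(columns) for _ in columns]
        present = set("".join(replicate).upper())
        if required <= present:
            return replicate
    return None


# Toy partition of 6 columns for 3 taxa (hypothetical data).
cols = ["AAA", "CCC", "GGA", "TTT", "ACA", "CCT"]
rep = bootstrap_with_all_states(cols, rng=random.Random(1))
print(rep)
```

Because replicates are filtered, the accepted ones are no longer an unconditioned bootstrap sample; columns carrying a rare state are over-represented relative to plain resampling.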

Cheers,

Rob

Karen Meusemann

Sep 10, 2015, 6:58:19 AM

to PartitionFinder, bmi...@uni-bonn.de

Hi Rob,

For BS replicates, see below.

Tree searches: taking results from PartitionFinder (depending on the dataset, of course), it can happen that
RAxML exits while, e.g., generating a starting tree (which is needed for ExaML), when trying to run an analysis,
or when using ExaML (the parser and examl itself), e.g. with messages like this:

./parse-examl* -m PROT -s File.phy -q File_kmeans.charset -n output
* holds also for: ./raxml-HPC-SSE3 or ./examl and other commands
...
Partition Subset78 number 77 has a problem, the number of expected states is 20 the number of states that are present is 19.
Please go and fix your data!
Partition Subset161 number 160 has a problem, the number of expected states is 20 the number of states that are present is 19.
Please go and fix your data!
....

etc.

(The same happened to Christopher with his nt data; there the message is that the number of expected
states is 4 and the number of states that are present is 3 (or 2 or 1), and it happens during the PF run.)

For our datasets, PF currently runs and finishes successfully; the problems start afterwards.
So far, the problems above occur more often with kmeans results (aa) than with rcluster.

Looking at the results (as an example, only the last subsets, of > 50 altogether for this dataset):

DCMUT, Subset246 = 261721, 398258, 640546, 1136799, 1232082
DCMUT, Subset247 = 265835, 346969, 764136
LG, Subset248 = 298437, 390628, 447063, 542248, 649876, 688525, 737105, 788420
WAG, Subset249 = 386138, 510986, 539885, 754232, 912908, 925226
JTT, Subset250 = 500653, 1109415
WAG, Subset251 = 508149, 542232, 1122737, 1198047, 1242122
DCMUT, Subset252 = 530198, 783329, 1177314
LG, Subset253 = 835154
WAG, Subset254 = 1199066
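
A listing in this form can be scanned for very small subsets with a few lines of Python. This is a minimal sketch, assuming each number after the `=` is a single alignment column; the function names and the `min_sites` threshold are hypothetical and not part of PartitionFinder, RAxML, or ExaML.

```python
def parse_subsets(lines):
    """Parse lines such as 'LG, Subset253 = 835154' into
    {subset_name: (model, [site numbers])}."""
    subsets = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        left, right = line.split("=", 1)
        model, name = [part.strip() for part in left.split(",", 1)]
        sites = [int(tok) for tok in right.replace(";", "").split(",") if tok.strip()]
        subsets[name] = (model, sites)
    return subsets


def small_subsets(subsets, min_sites):
    """Names of subsets with fewer than min_sites sites, in input order."""
    return [name for name, (_, sites) in subsets.items() if len(sites) < min_sites]


listing = """
DCMUT, Subset252 = 530198, 783329, 1177314
LG, Subset253 = 835154
WAG, Subset254 = 1199066
""".splitlines()

parsed = parse_subsets(listing)
print(small_subsets(parsed, min_sites=2))
```

Subset size alone does not show whether all 20 amino acid states are present, so a check like this only narrows down the candidates.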

If it's true that RAxML should actually fail here within PF, I don't understand how these results were produced.
Might the check be circumvented because the -O option is used when RAxML is called (ignoring warnings), at least in
the PF version I use (just a guess)? I could use -O in RAxML as well, but for ExaML it's not
possible, and probably also not recommended/sensible?

Looking at the results: I doubt that partitions with 1 or very few sites, or partitions that are too uniform, are good to use (even if it were possible with RAxML/ExaML).
Maybe this points in the direction that rcluster is more suitable for aa (ignoring AICc values)? With rcluster this occurs rarely. Nevertheless, it can occur.
(It happened to me last year for 2 preliminary datasets. Only 2 partitions were affected, which was easy to fix manually, and luckily it worked out for the BS.
With bad luck, the resulting subsets can again be too uniform for RAxML/ExaML (probably not too small).)

(I emailed already last week with Paul about this issue)

Anyway, we currently have to check the resulting PF partitions afterwards (despite using the RAxML flag), as we do now; otherwise we can't reconstruct a tree.

Bootstrap replicates:
I guess that if we suggest solving this in RAxML/ExaML, they will probably point back and say it is a problem of the partitioning schemes.
Currently we try to keep sampling BS replicates and fix them manually, but this is time consuming, defeats the original purpose, and appears impossible to handle. I like your idea
to resample until the conditions are fulfilled automatically, but at the same time it is biased (in a worse scenario it might turn out that the BS replicates are similar to the original ones, partition-wise).

I think (gut feeling) it would be better to then make partitions (manually if needed) that fulfil the conditions for both the tree search and the BS (although this goes against everything the algorithm did for us before).
The situation is a bit weird at the moment; currently I don't have any idea how to solve all this in a smart way.

####

I will send you the dataset separately (phy file + kmeans result + the charset for RAxML/ExaML), not modified.
(Modified means we manually merged the subsets that are problematic for RAxML/ExaML to get it to run at least for a
tree search, which I think makes the whole procedure not very sophisticated.)

:) Maybe you can figure out the problem (starting after PF) or you have a good idea, would be awesome :)

Cheers Karen

Paul Frandsen

Sep 10, 2015, 7:18:08 AM

to partiti...@googlegroups.com, Bernhard Misof

Karen,

Have you experienced this issue with nucleotides? Or are all of these amino acid alignments?

Paul

sent from my mobile phone, apologies for any typos

Karen Meusemann

Sep 10, 2015, 12:57:27 PM

to PartitionFinder, bmi...@uni-bonn.de

Hi Paul:

For the insects, yes: only on aa so far, because aa makes sense to me for resolving deep nodes. Later we planned to maybe look at the 2nd codon position, or the 1st and 2nd (but that does not solve the problem per se).
When we started 2-3 months ago, we decided to go for preliminary results only at the aa level, because doing the protein domain stuff etc. correspondingly for nt is quite a lot of effort.
Looking back, it might have been better to do both, but we were not prepared to face such problems.

For plants, a side project (Eucalyptus): I ran a nt set (200 genes, 50 species) with rcluster and kmeans, took kmeans because of the better AICc and considerations about rates (all codon positions),
and ran into the same problem as above for at least 90 out of 100 BS replicates. The tree search was OK.

Cheers Karen

Rob Lanfear

Sep 10, 2015, 6:56:15 PM

to PartitionFinder, Bernhard Misof

Hi All,

My 2 cents. Bootstrap replicates are really not something we will deal with in PartitionFinder. There are lots of reasons, but the main one is that if we find a good partitioning scheme that has some small subsets (which we often do), it doesn't seem sensible to reject that just because bootstrapping is difficult for those small subsets. More generally, I am not convinced about bootstrapping on two fronts: (1) with large datasets I'm not sure it's a sensible approach at all; (2) with small subsets (even inside large datasets) it doesn't seem sensible either. For example, at the limit, bootstrapping small partitions is completely pointless: a partition of a single base cannot be meaningfully bootstrapped in the way that RAxML builds bootstraps. I'd suggest that a more sensible approach here is to analyse branch support with methods that don't require bootstraps - these could be the aLRT (implemented in PhyML) which just tests for 0 length branches, or use MCMC methods (e.g. ExaBayes). Both of these would work fine on large datasets, and would also (in my opinion) give results that are easier to interpret than bootstraps in any case (especially bootstraps that are made by resampling each partition).

The stuff with PF I will look into. It will take a few weeks, because I'm pretty much booked up until the end of the month with other things and deadlines. My aim here is that if PF spits out a partitioning scheme, it should run in RAxML. Thus, the most useful test dataset you can send me is one for which PF produces a partitioning scheme that will not run in RAxML. It might be as simple as turning on the -O flag in RAxML. If that's the case, I don't really see that I have anything to fix in PF, since it is already doing what it should.

However, here are some options of things I could implement, that folks here can comment on:

1. I could put in an option to remove the -O flag inside PartitionFinder. That way we could guarantee that what PF spits out will indeed run in RAxML without the -O flag.

2. If you can tell me exactly what the problems are in EXaML, I could look into putting in catches for these inside PF, with the aim that what PF spits out will work in EXaML too (though we could cross our fingers that option 1 would fix this straight away!).

Finally, if you are using kmeans, there is a better option for bootstraps than what you are currently doing, which would get around the problem of RAxML's bootstrap algorithm creating issues when resampling small subsets. Instead of bootstrapping a partitioned dataset, you could do this:

1. Create 1000 bootstraps of the entire dataset (i.e. as one partition).

2. Run each through the kmeans algorithm

3. Run RAxML / EXaML on each dataset

This may take some time, but as long as we can solve the issue of what PF spits out, it is guaranteed to solve your problem. I also think it's a more rigorous solution in general, because I'm really not sure what bootstraps mean when you resample partitioned datasets, particularly when there are a lot of partitions. I think that 100 bootstraps with this approach might be more meaningful than more bootstraps with some other approach.
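
Step 1 of the procedure above can be sketched as follows. This is a minimal illustration, not PartitionFinder or RAxML code: the function names and toy data are hypothetical, and steps 2 and 3 (running kmeans and RAxML / EXaML on each replicate) are left to the external programs.

```python
import random


def bootstrap_alignment(alignment, rng=None):
    """Resample the columns of a whole alignment with replacement.

    alignment -- dict mapping taxon name to aligned sequence (equal lengths)
    Returns a new dict with the same taxa and the same alignment length.
    """
    rng = rng or random.Random()
    n_sites = len(next(iter(alignment.values())))
    # One shared list of column indices keeps every column intact across taxa.
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}


def to_phylip(alignment):
    """Format an alignment as a simple sequential PHYLIP string."""
    n_sites = len(next(iter(alignment.values())))
    rows = ["%d %d" % (len(alignment), n_sites)]
    rows += ["%s %s" % (taxon, seq) for taxon, seq in alignment.items()]
    return "\n".join(rows) + "\n"


# Step 1 on toy data (hypothetical); each replicate would then be
# partitioned and analysed separately in steps 2 and 3.
aln = {"taxon1": "ACGTACGTAA", "taxon2": "ACGAACGTAT"}
rng = random.Random(42)
for rep in range(3):
    replicate = bootstrap_alignment(aln, rng)
    print(to_phylip(replicate))
```

Because the resampling happens before any partitioning, no pre-defined subset can lose a state during the bootstrap itself; whether the subsets found afterwards are usable still depends on what the partitioning step produces.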

Thoughts?

Cheers,

Rob

Christopher Hann-Soden

Sep 11, 2015, 5:39:58 PM

to PartitionFinder, bmi...@uni-bonn.de

Hi all,

Thank you so much for your input and discussion.

It looks like there was a problem in filtering out low complexity partitions on my end generating the alignments, so I will work on that.

This discussion is very helpful, though. Especially Rob's suggestion about how to tackle bootstraps using a different approach. I will definitely give this a shot.

Best,

Christopher

Rob Lanfear

Sep 12, 2015, 6:18:59 PM

to PartitionFinder, Bernhard Misof

Hi All,

I have released a new version of PartitionFinder that has an option which addresses this problem a little bit. The kmeans algorithm now has a minimum subset size, which can be determined by the user. The default is 100, since this seems like a sensible lower bound for algorithmically determined subsets.

The parameter can be controlled from the commandline with the --min-subset-size option, e.g.

```
--min-subset-size 1
```

would be the same as the old version of the algorithm, and

```
--min-subset-size 250
```

would give a minimum subset size of 250 columns.

The new version is here:

https://github.com/brettc/partitionfinder/releases/tag/v2.0.0-pre7

Rob
