Sorry, to add another take on this.
The Pythia score is relatively low; anything below 0.5 is manageable in a tree environment. What Johannes pointed out is valid but may, in practice, be less of a concern. One cannot really tell, as it all depends on the actual signal the matrix produces for certain subtrees. Every data set has its specialities, and these values only point us to its strengths or weaknesses.
As Oleksiy pointed out, you probably have tips (or groups of tips) that are difficult to tree because they are near-identical and result in accordingly flat, poorly resolved subtrees. They are detrimental to inference: they increase the Pythia score, have little branch support, and waste computation time, while providing no additional phylogenetic information. The simple solution to the problem is informed pruning to get a backbone tree, and then to stuff that backbone tree.
What I would do is
- Run a quick-and-dirty comprehensive tree. More important than the tree topology is the BS support for its branches.
- Prune that tree to a set of representative sequences, representing terminal clades defined by e.g. a distance threshold (for the members of a subtree), a branch support threshold, or algorithms that define species (I don't know whether they work for protein data, though). E.g. start at the tips and keep only one representative per branch that has BS > 40, or whatever threshold gives you a sizeable reduction in the number of tips. Or keep the 2 most distant tips per terminal subtree whose members have a pairwise distance below 0.05.
- Calculate the Pythia score of the pruned data set and, if you like, iterate (e.g. start with a distance threshold of 0.05 and increase in 0.05 steps, or with a BS of 10, 20, 30, ...). The aim is to get a tree that is as pruned as possible, with a Pythia score as low as possible, but that still covers as many as possible of the clades seen in the total tree.
- Generate a backbone tree with the rep. seqs. You will probably get a tree with appreciable branch support along all branches and fine phylogenetic distances (i.e. via the tree) between tips, which makes a good reference tree for EPA.
- Place all pruned tips in that backbone tree using EPA-ng. If the LWR << 1, take a look at the jPlace file and pin the split support on your reference tree. Maybe add some representatives for overlooked clades (over-pruning). Note: queries with split LWR on a highly discriminative reference tree are the rogues that can increase Pythia scores, decrease branch supports, and trigger polytomies in the widely used but insufficient consensus trees (see below).
- If you want a total tree:
  - Generate subalignments including the pruned queries and the rep. seqs. of the subtrees in which they were placed (LWR ~ 1).
  - Run local trees.
  - Supertree them with the tree from Step 4, the reference tree, as the core, the phylogenetic backbone.
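To make the pruning and iteration steps above a bit more concrete, here is a minimal Python sketch. It is a simplification under stated assumptions, not the exact procedure: it groups tips directly from a pairwise distance table rather than walking terminal subtrees of the quick-and-dirty tree, and it does not compute Pythia scores (you would run Pythia on each pruned alignment yourself). The function names, the toy tip labels and the toy distances are all made up for illustration.

```python
from itertools import combinations

def cluster_tips(tips, dist, threshold):
    """Greedy grouping: a tip joins the first cluster in which it is closer
    than `threshold` to every member, otherwise it starts a new cluster."""
    clusters = []
    for tip in tips:
        for cluster in clusters:
            if all(dist[frozenset((tip, m))] < threshold for m in cluster):
                cluster.append(tip)
                break
        else:
            clusters.append([tip])
    return clusters

def medoid(cluster, dist):
    """Single representative: the member with the smallest summed distance."""
    return min(cluster,
               key=lambda t: sum(dist[frozenset((t, m))] for m in cluster if m != t))

def most_distant_pair(cluster, dist):
    """Two representatives: the two most distant members of the cluster."""
    if len(cluster) < 2:
        return list(cluster)
    return list(max(combinations(cluster, 2), key=lambda p: dist[frozenset(p)]))

# Hypothetical toy data: two near-identical groups and one singleton.
tips = ["a1", "a2", "a3", "b1", "b2", "c1"]
close = {("a1", "a2"): 0.01, ("a1", "a3"): 0.03,
         ("a2", "a3"): 0.02, ("b1", "b2"): 0.04}
dist = {frozenset(p): close.get(p, 0.30) for p in combinations(tips, 2)}

# Iterate over thresholds; each pruned set would then go to Pythia.
for threshold in (0.05, 0.10, 0.15):
    reps = [medoid(c, dist) for c in cluster_tips(tips, dist, threshold)]
    print(threshold, len(reps), reps)
```

Swapping `medoid` for `most_distant_pair` gives the "2 most distant per terminal subtree" variant. With real data you would take the distances from the tree (patristic) or from the alignment, whichever fits your question.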
And, on a general note, please stop using consensus trees: they are inferior summaries of phylogenetic trees, and their polytomies can have different data (signal) or biological reasons and are, hence, meaningless in a phylogenetic context (PS: consensus trees, being summaries, are not phylogenetic trees; only fully dichotomous trees are phylogenetic trees by definition). Decreased branch support can have two main reasons: ancestor-descendant relationships (hard polytomies), or a lack of discriminating signal and internal signal conflict (soft polytomies). If you want to summarise topologically different trees, use consensus networks. E.g. you can run RF distances to compare your initial 50 trees, but you can also visualize their differences with a consensus network or an Adams consensus tree: the Adams consensus only retains polytomies inflicted by rogues.
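For those who want to see what sits underneath an RF comparison or a consensus network, here is a small self-contained Python sketch. It is only an illustration: the Newick reader is deliberately naive (no quoted labels, all trees must share the same tip set), the function names are my own, and for real work you would use the established tools. The point is that both the RF distance and a consensus network start from the same thing, the set of splits (bipartitions) found in each tree.

```python
import re
from collections import Counter

def clades(newick):
    """Return (all tip names, list of tip sets, one per internal node)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", newick))
    stack, found, all_tips, prev = [], [], set(), None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            found.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok not in ",;" and prev != ")":
            # A tip label; labels directly after ")" are support values/lengths.
            name = tok.split(":")[0]
            all_tips.add(name)
            stack[-1].add(name)
        prev = tok
    return frozenset(all_tips), found

def splits(newick):
    """Non-trivial splits, each stored as the side without the reference tip."""
    tips, found = clades(newick)
    ref = min(tips)
    out = set()
    for clade in found:
        side = clade if ref not in clade else tips - clade
        if len(side) >= 2 and len(tips - side) >= 2:
            out.add(side)
    return out

def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: number of splits found in only one tree."""
    return len(splits(tree_a) ^ splits(tree_b))

def split_frequencies(trees):
    """How many trees contain each split; the raw material of a consensus network."""
    counts = Counter()
    for tree in trees:
        counts.update(splits(tree))
    return counts

# Hypothetical toy sample of three trees on five tips.
trees = ["((A,B),(C,D),E);", "((A,C),(B,D),E);", "((A:0.1,B:0.2)90:0.05,(C,D),E);"]
print(rf_distance(trees[0], trees[1]), rf_distance(trees[0], trees[2]))
for split, n in split_frequencies(trees).most_common():
    print(sorted(split), n / len(trees))
```

A majority-rule consensus tree would keep only the splits found in more than half of the trees and collapse the rest; a consensus network keeps every split above the chosen frequency, including mutually incompatible ones, which is exactly the information you want to look at.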
Rogues may act rogue-ish for one reason or another. E.g. an ancestral protein type X that evolved into types A and B will trigger a (quasi-)hard polytomy because our phylogenetic trees cannot depict ancestor-descendant relationships; they only resolve sister relationships. In the best case you end up with a highly supported X-A-B polytomy, in the worst with an X + (A + B) subtree because of long-branch attraction between A and B. Or the subtree is genuine because X is not the last (or a late) common ancestor (LCA, close to the hypothetical MRCA, most-recent common ancestor, the node connecting the A and B roots) of A and B but an early precursor type (ECA). In EPA, if the reference only includes A and B, an LCA-X can have split LWR for the internodes (branches) representing the A root, the B root and the A+B root, but an ECA-X could have split LWR involving the next deeper internode(s) and only the A+B root.
Soft and misleading polytomies can be inflicted by a recombinant sequence. Imagine an AxB, a recombinant of A and B: it will be attracted by both clades. If A and B are sisters, the soft polytomy triggered in the consensus tree is not a big problem because we still have an AxB+A+B clade. But if they are cousins, the recombinant will be placed in a root-proximal ("basal") position to A or B, decreasing the support for A+AxB or AxB+B. If you then have a sister C to B, you may end up with a well-supported AxB-B-C polytomy, which is wrong because AxB is not a possible sister of C; in fact, it is unrelated to C beyond the fact that it is half-B. Again, EPA can be a great asset here, because a recombinant query will have accordingly split LWRs if there is enough signal in the query and high discrimination capacity in the reference. Queries with split LWR to non-connected internodes must be excluded when inferring phylogenetic trees, naturally. Consensus networks, on the other hand, don't collapse everything into polytomies but represent all alternative splits in a tree sample (strict consensus networks), or in a certain fraction of them (e.g., BS support networks). See Schliep et al. (2017; http://dx.doi.org/10.1111/2041-210X.12760) for examples, R vignettes, and further literature.
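Finding the queries with split LWR is easy to script, since a jPlace file is plain JSON. Below is a minimal Python sketch, assuming the standard jplace layout ("fields", "placements" with "p" rows and "n" or "nm" names). The 0.9 cut-off, the function name and the toy placements are my own choices for illustration. The sketch only lists the candidate branches per query; it does not check whether those branches are connected in the reference tree, so that part is still up to you.

```python
def split_lwr_queries(jplace, min_top_lwr=0.9):
    """Return {query name: [(edge_num, LWR), ...]} for every query whose best
    placement has an LWR below `min_top_lwr`, sorted by decreasing LWR."""
    edge_i = jplace["fields"].index("edge_num")
    lwr_i = jplace["fields"].index("like_weight_ratio")
    flagged = {}
    for placement in jplace["placements"]:
        ranked = sorted(((row[edge_i], row[lwr_i]) for row in placement["p"]),
                        key=lambda pair: pair[1], reverse=True)
        if ranked[0][1] < min_top_lwr:
            # Names are stored either as "n" or as "nm" (name, multiplicity).
            names = placement.get("n") or [nm[0] for nm in placement.get("nm", [])]
            for name in names:
                flagged[name] = ranked
    return flagged

# Hypothetical toy result; with real data use json.load() on your jPlace file.
toy_jplace = {
    "tree": "((A:0.1{0},B:0.1{1}):0.05{2},C:0.2{3});",
    "fields": ["edge_num", "likelihood", "like_weight_ratio",
               "distal_length", "pendant_length"],
    "placements": [
        {"p": [[0, -1000.0, 0.98, 0.05, 0.01],
               [2, -1004.0, 0.02, 0.02, 0.03]], "n": ["q_clean"]},
        {"p": [[0, -1200.0, 0.48, 0.05, 0.02],
               [3, -1200.1, 0.45, 0.10, 0.02],
               [2, -1202.0, 0.07, 0.01, 0.04]], "n": ["q_split"]},
    ],
    "version": 3,
    "metadata": {},
}

for name, ranked in split_lwr_queries(toy_jplace).items():
    print(name, ranked)
```

In the toy case, q_split divides its weight between the branches leading to A and to C, which are not neighbours in the reference tree; that is the pattern you would expect from a recombinant or another signal conflict.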
Good inference
/G