Looking at the skylines in the spreadsheet – which are intriguing in themselves, as they mimic Bayesian skylines and prove Alexi's point about how important more than 10 starting trees are for particular datasets to get the best-possible topology, possibly also speeding up the final optimisation; you should keep them for the supplement – I'll predict (boldly) that some of your gene tree samples have not yet met the convergence criterion Alexi mentioned; specifically trnLF, L73, L194, L373 and L382.
However, since your objective is to get ML trees, I would report this (criterion met or not after 1000 starting trees > signal issue leading to inherent topological ambiguity) to illustrate the basic issue with your matrix, but not bother going higher, because it may be a long way to go: note that even your plateaus are slightly inclined.
A reason for the only slight inclination may be that the first tree of the flattened curve and the last before reaching stationarity (which may be the 1001st or the 1 millionth) differ only by quite irrelevant branches given your data's signal, but already agree in all major aspects. They also show that while there is some coherent signal in your data to infer ML trees, it is obscured by a lot of (terminal) noise.
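If you want to check the plateau programmatically rather than by eye, here's a minimal Python sketch; the input format (one final log-likelihood per starting tree, in generation order) and the 0.1 lnL tolerance are my assumptions, not your pipeline – tune them to your runs:

```python
# Minimal plateau check: find the last starting tree that still improved the
# best score by more than a tolerance. Everything after that index is the
# quasi-plateau sample. File format and tolerance are assumptions.
def plateau_start(loglik_path, tol=0.1):
    with open(loglik_path) as fh:
        scores = [float(line) for line in fh if line.strip()]
    best = float("-inf")
    last_improvement = 0
    for i, lnl in enumerate(scores):
        if lnl > best + tol:      # still climbing, not yet flat
            best = lnl
            last_improvement = i
    return last_improvement

# e.g. print(plateau_start("trnLF_startingtree_lnL.txt"))
```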
A trick to nonetheless get ML trees despite the obscuring noise is to prune the full tip set down to a subset of meaningful placeholders, following the general rule in phylogenetics: the fewer tips, the cleaner the trees. Also the more error-prone, so knowing the phylogenetic groups prior to selecting proper placeholders is critical. The starting tree samples you generated can serve as a basis for this.
First, run a consensus network of your starting tree samples, excluding the pre-plateau ones, with a high cut-off of 25 or 33%, i.e. only those edges are kept whose corresponding branch is found in at least every 4th (or 3rd) quasi-plateau starting tree. This will show you which tips in your set just mess things up (increasing inference time; represented as collapsed fans in the consensus network) while revealing the main (signal-wise trivial) groups informed by each gene region and the corresponding splits (branches).
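SplitsTree will draw the network itself, but you can pre-compute which splits survive the cut-off with a few lines of DendroPy; the tree file name is a placeholder, and I'm assuming your quasi-plateau starting trees sit in one Newick file:

```python
from collections import Counter
import dendropy

# Count how often each bipartition (split) occurs across the quasi-plateau
# starting trees; only splits above the cut-off would appear as edges in
# the consensus network. File name is a placeholder.
trees = dendropy.TreeList.get(path="trnLF_plateau_trees.nwk", schema="newick")
counts = Counter()
for tree in trees:
    tree.encode_bipartitions()
    for bp in tree.bipartition_encoding:
        if not bp.is_trivial():        # skip single-tip splits, always present
            counts[bp.split_bitmask] += 1

cutoff = len(trees) / 4.0              # 25% cut-off: every 4th tree
frequent = {m: n for m, n in counts.items() if n >= cutoff}
print(f"{len(frequent)} of {len(counts)} observed splits pass the cut-off")
```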
Then you just go back to the highest-scoring starting tree already generated and pick one (or two) placeholders for each gene's main group, e.g. the shortest- and longest-branched tip in each group, or just pick one per group at random. The idea is to get a tip set that covers all aspects of the consensus network.
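Picking them can be scripted too; a sketch (tree file and group assignments here are dummies – you'd take the groups from the consensus network / your a-priori knowledge):

```python
import dendropy

# Pick the shortest- and longest-branched tip per group from the best-scoring
# starting tree. Tree file and group dict are hypothetical placeholders.
best = dendropy.Tree.get(path="trnLF_best_startingtree.nwk", schema="newick")
groups = {"cladeA": {"sp1", "sp2", "sp3"}, "cladeB": {"sp4", "sp5"}}

placeholders = set()
for name, members in groups.items():
    leaves = [lf for lf in best.leaf_node_iter() if lf.taxon.label in members]
    if not leaves:
        continue
    leaves.sort(key=lambda lf: lf.edge.length or 0.0)
    placeholders.add(leaves[0].taxon.label)    # shortest terminal branch
    placeholders.add(leaves[-1].taxon.label)   # longest terminal branch
print(sorted(placeholders))
```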
You do this for all your genes and will end up with a smaller and (signal-wise) more tree-like concatenated matrix. There may be different placeholders to keep for each gene due to different groupings, so the number will still be comprehensive enough (i.e. you keep the entire data of each gene-selected placeholder in the concatenated matrix).
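Building that reduced concatenated matrix is then just bookkeeping; a sketch with Biopython (all file names and the per-gene placeholder sets are placeholders), padding a tip with '?' where a gene has no data for it:

```python
from Bio import SeqIO

gene_files = {"trnLF": "trnLF.fasta", "L73": "L73.fasta"}     # placeholders
per_gene = {"trnLF": {"sp1", "sp4"}, "L73": {"sp2", "sp4"}}   # from the step above

keep = set().union(*per_gene.values())   # union of placeholders across genes
concat = {tip: "" for tip in keep}
for gene, path in gene_files.items():
    seqs = {r.id: str(r.seq) for r in SeqIO.parse(path, "fasta")}
    length = len(next(iter(seqs.values())))
    for tip in keep:
        # keep the tip's entire data for this gene if present, else pad
        concat[tip] += seqs.get(tip, "?" * length)

with open("concat_reduced.fasta", "w") as out:
    for tip, seq in sorted(concat.items()):
        out.write(f">{tip}\n{seq}\n")
```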
You can then easily infer sensible trees and branch support, gene-wise and for the concatenated matrix. This gives you a phylogenetic backbone that will capture all the deeper aspects a most-optimal all-tip tree would show.
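For instance with RAxML-NG (standard flags; file name, model and tree numbers are placeholders – a modest number of starting trees should now suffice):

```python
import subprocess

# ML search + bootstrapping + support mapping in one go on the reduced matrix.
subprocess.run([
    "raxml-ng", "--all",
    "--msa", "concat_reduced.fasta",
    "--model", "GTR+G",                # or a partition file for the concat matrix
    "--tree", "rand{10},pars{10}",     # back to a modest starting-tree sample
    "--bs-trees", "200",
    "--seed", "42",
], check=True)
```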
If you specifically need to know where the filtered tips place (ML-wise), EPA is always a quick option. If you need more resolution, use subsets limited to the data of one group (you then have the required ML trees and use the networks only for ingroup differentiation/details, which should be possible).
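A sketch of the EPA step using EPA-ng, the stand-alone successor of RAxML's EPA (file names are placeholders, and the filtered tips have to be aligned against the reference alignment first):

```python
import subprocess

# Place the filtered-out tips onto the reduced ML backbone; the result is a
# .jplace file with the ML placement (and likelihood weights) of each tip.
subprocess.run([
    "epa-ng",
    "--tree", "backbone.nwk",            # ML tree from the reduced matrix
    "--ref-msa", "concat_reduced.fasta",
    "--query", "filtered_tips.fasta",    # aligned to the reference alignment
], check=True)
```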
This would allow you to have your entire data phylogenetically placed/grouped using ML, without having to bother about the topological uncertainties of a (possibly inevitably) noisy 500 × 20000 tree (did one use enough starting trees or not?). And you can run all analyses on a stand-alone machine (in fact, I have run larger matrices with RAxML on a quad-core PC; it took only a few days using the duplicates-eliminated matrices, but it of course depends on how coherent the signal in the individual gene partitions is).
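Eliminating the duplicates is trivial to script as well (file names are placeholders); keep a record of what each kept sequence stands for, so nothing is lost:

```python
from Bio import SeqIO

seen, kept, stands_for = {}, [], {}
for rec in SeqIO.parse("concat_reduced.fasta", "fasta"):
    key = str(rec.seq).upper()
    if key in seen:
        stands_for[seen[key]].append(rec.id)   # duplicate: note it, drop it
    else:
        seen[key] = rec.id
        stands_for[rec.id] = []
        kept.append(rec)
SeqIO.write(kept, "concat_dedup.fasta", "fasta")
```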
Important application note: given your gene sample (two coding plastid genes mostly resolving only mid-/deep splits; a noncoding plastid barcode working only in tropical and highly divergent herbaceous plants; I suppose the L... are single-copy nuclear genes, so they may generally tell a different story), in addition to tree obscurity due to a lack of discriminating signal – depending on which hierarchical level you're looking at, you may actually have ancestor-descendant pairs on top of random third-codon-position noise, leading to very flat, ambiguous subtrees – there may simply be no universal tree for the concatenated matrix at all.
/G.
PPS Together with Markus Göker, Alexi and I once tested the effect of group-based filtering on all (at the time) available rbcL data for angiosperms (open access), which can be used as a theoretical basis for why to filter at all: by filtering group-wise, we saw an increase not only in branch support but also in tree resolution, ending up with trees not much different anymore from much more recent complete-plastome trees – so, apparently, better trees than we could infer using the entire (all-tip) data (the plastid backbone for angiosperms is mostly matK, seconded by rbcL – there's a dedicated Genealogical World post on this; and rpoC2 alone recovers most of the full plastome tree, see Walker et al. 2019, PeerJ).
It's always good to assemble broad data, but for many questions one should really experiment with all-data-backed/informed filtering to get fewer-tip matrices, and hence trees one can actually work with and from which the main information can then be propagated to the entire data. Filtering is only bad if one ignores the dropped data entirely.