Hi Gustavo,
The number of aligned sites is irrelevant; what matters is only the number of "distinct alignment patterns". It may be quite low in a case like yours, depending on how homologous the harvested protein data really are and on the taxonomic breadth of the source organisms. The critical thing is the sample structure: since you blast against a single reference to decide what is harvested, one doesn't know whether the 50, 200 and 450 extra and putatively less similar sequences belong to substantially diverse groups or are very similar among themselves. In the last step (so far: 500), you may have added 250 sequences that have only 65% similarity to your reference but near-100% similarity with each other. Or one that is most dissimilar, just above the threshold, while the other 249 are near-identical to (part of) the 150 added in the filtering step before (or to those from the steps before that; in fact, many of them could be just a little bit more dissimilar than the first 50).
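To get a feeling for what "distinct alignment patterns" means, here is a minimal Python sketch that counts the unique columns of an alignment. It assumes the alignment is already loaded as equal-length strings, and it is only a raw column count: the tree-inference program may treat gaps and ambiguity codes differently, so its reported number can deviate. The function name and the toy alignment are made up for illustration.

```python
def count_distinct_patterns(aligned_seqs):
    """Count unique alignment columns (site patterns).

    aligned_seqs: list of equal-length strings, one per sequence.
    Illustrative helper, not part of any phylogenetics package.
    """
    if not aligned_seqs:
        return 0
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all sequences must have the same aligned length")
    # zip(*...) yields one tuple per column; the set keeps each pattern once.
    return len(set(zip(*aligned_seqs)))


# Toy alignment: 5 sites, but three of the columns are identical (all "A").
alignment = ["AAKVA", "AAKVA", "AARVA", "AAKIA"]
n_patterns = count_distinct_patterns(alignment)
print(len(alignment[0]), "sites,", n_patterns, "distinct patterns")
```

In the toy case, adding more copies of the first sequence would add tips but not a single new pattern, which is exactly the situation to avoid.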
The easiest way to change the ratio is simply to reduce the number of tips to a meaningful and representative set by eliminating the groups of highly similar sequences, which, in a phylogenetic tree-inference context, contribute only terminal (often random) noise and non-discriminating data. Right now you may be looking at a thicket, which first needs to be weeded out before you can see a tree. E.g. do a blast all vs all and use a fixed threshold to define groups of near-identical sequences in your total sample (e.g. 95% identity, 90%, or further increments, whatever substantially reduces the number of tips), and then randomly select two placeholders from each group to establish a phylogenetic backbone for the putative homologues. Once you have that, you just use EPA-ng to place all the others in the backbone tree as a check-up. EPA is the reason to keep two tips per group, as the jplace result is more straightforward to interpret than when the reference tree includes only a single tip per high-similarity cluster.
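The grouping and placeholder selection could be sketched like this in Python. It is only one possible way to do it: it assumes you have already parsed the all-vs-all BLAST result into (query, subject, percent identity) triples, and it uses simple single-linkage grouping, which can chain sequences together, so a stricter clustering may suit your data better. It also ignores alignment coverage, which you would probably want to filter on first. All names below are hypothetical.

```python
import random


def cluster_by_identity(ids, pairs, threshold=95.0):
    """Single-linkage groups: two sequences are linked if identity >= threshold.

    ids: all sequence names; pairs: iterable of (query, subject, pct_identity).
    Every name used in pairs must also be listed in ids.
    """
    parent = {name: name for name in ids}

    def find(x):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, pct in pairs:
        if query != subject and pct >= threshold:
            parent[find(query)] = find(subject)

    groups = {}
    for name in ids:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())


def pick_placeholders(groups, per_group=2, seed=0):
    """Randomly pick up to per_group members from each group."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    backbone = []
    for members in groups:
        backbone.extend(rng.sample(members, min(per_group, len(members))))
    return backbone


# Toy input standing in for parsed all-vs-all BLAST hits.
ids = ["a1", "a2", "a3", "b1", "b2", "c1"]
pairs = [("a1", "a2", 99.0), ("a2", "a3", 96.5), ("b1", "b2", 97.0),
         ("a1", "b1", 70.0), ("c1", "a1", 66.0)]
groups = cluster_by_identity(ids, pairs, threshold=95.0)
backbone = pick_placeholders(groups)
print(len(ids), "tips reduced to", len(backbone), "backbone tips")
```

Singletons such as c1 are always kept, so the rare, divergent sequences survive while the big near-identical groups shrink to two tips each.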
You could also pick random placeholders from the sequence groups that have e.g. 95, 90, ...% identity scores with the reference to get a tip-reduced dataset with a better ratio. And then again use EPA-ng to place the remainder of the 5000.
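A rough Python sketch of this binning variant, assuming you have a table of each hit's percent identity to the reference (the dictionary below is invented). The bin width and the number of placeholders per bin are arbitrary choices to be tuned.

```python
import random


def sample_by_identity_bin(identity_to_ref, bin_width=5.0, per_bin=2, seed=0):
    """Pick up to per_bin random sequences from each identity bin.

    identity_to_ref: dict mapping sequence name -> percent identity to the
    reference. Bins are [0, w), [w, 2w), ... with w = bin_width.
    """
    rng = random.Random(seed)
    bins = {}
    for name, pct in identity_to_ref.items():
        bins.setdefault(int(pct // bin_width), []).append(name)

    picked = []
    for key in sorted(bins, reverse=True):  # most similar bin first
        members = sorted(bins[key])  # sorted so the draw is reproducible
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return picked


# Hypothetical identities of six hits to the reference.
hits = {"s1": 99.1, "s2": 98.0, "s3": 96.4, "s4": 91.0, "s5": 84.9, "s6": 66.0}
picked = sample_by_identity_bin(hits, bin_width=5.0, per_bin=2)
print(picked)
```

Note that two sequences in the same bin can still be very different from each other, since the bin only reflects their identity to the reference; the all-vs-all grouping is the safer option in that respect.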
You could also go for a stepwise addition: instead of adding the next 50, 100, etc. BLAST hits, only add sequences that have a minimum dissimilarity to all of those already included at the step before. This should also boil down the number substantially while not losing tips that may be important for understanding the protein's genealogy.
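As a sketch, such a stepwise filter can be written as a simple greedy loop in Python. The identity function here is a naive per-position comparison of already aligned toy sequences, used only so the example runs on its own; in practice you would plug in identities from your BLAST or alignment results. The 85% cut-off and all names are assumptions for illustration.

```python
def stepwise_select(candidates, identity, max_identity=90.0):
    """Greedy selection: keep a candidate only if its identity to every
    sequence kept so far is below max_identity.

    candidates: names in the order they should be considered
                (e.g. by decreasing BLAST score to the reference).
    identity:   function (name_a, name_b) -> percent identity.
    """
    kept = []
    for name in candidates:
        if all(identity(name, other) < max_identity for other in kept):
            kept.append(name)
    return kept


def pct_identity(a, b):
    """Naive percent identity of two aligned, equal-length strings."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy aligned sequences; "ref" stands in for the BLAST reference.
seqs = {
    "ref": "MKVLAAGIVK",
    "h1": "MKVLAAGIVR",  # 90% identical to ref
    "h2": "MKVLSSGIVK",  # 80% identical to ref
    "h3": "MKVLSSGIVR",  # 90% identical to h2
    "h4": "MRILSTGLVR",  # divergent from all of the above
}
kept = stepwise_select(
    ["ref", "h1", "h2", "h3", "h4"],
    lambda a, b: pct_identity(seqs[a], seqs[b]),
    max_identity=85.0,
)
print(kept)
```

The result depends on the order of the candidates, so it is worth deciding deliberately which sequences get considered first.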
In any case, there's no point in inferring a 5000-tip tree for a matrix that triggers such a warning. It's much better to have a (much) smaller tree with tips that cover the range of sequence variation in the putative homologues, and to place all others in this backbone protein tree using, e.g., EPA-ng:
https://github.com/pierrebarbera/epa-ng
Cheers, Guido.
PS Here are three papers worth a look in the context of pruning down tip sets with too many, too similar sequences to a meaningful sample, and of the treeability of data:
Haag J, Höhler D, Bettisworth B, Stamatakis A. 2022. From easy to hopeless - predicting the difficulty of phylogenetic analyses. Molecular Biology and Evolution. doi:10.1093/molbev/msac254
Mavian et al. 2020. Sampling bias and incorrect rooting make phylogenetic network tracing of SARS-COV-2 infections unreliable. Proceedings of the National Academy of Sciences.
https://doi.org/10.1073/pnas.2007295117
Stamatakis A, Göker M, Grimm GW. 2010. Maximum likelihood analysis of 3,490 rbcL sequences: Scalability of comprehensive inference versus group-specific taxon sampling. Evolutionary Bioinformatics 6:73-90.
http://dx.doi.org/10.4137/EBO.S4528