Hi Gaetan,
you should consider the (always much appreciated!) suggestions by Guido and try to reduce your
dataset. Here are a couple of hints from my side:
> We are not experts in phylogenetic reconstruction, so we wanted to have your advice on our problem.
> - Do you think our usage of raxml-ng is correct?
I'd try "--search --tree pars{1}" instead: a search started from a single parsimony tree will
probably converge (much) faster than one started from a random tree.
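
As a rough sketch of what such a call could look like, the snippet below assembles the command line
in Python without running it. Only "--search --tree pars{1}" comes from the suggestion above; the
alignment file name, the BIN+G model, the thread count, the seed and the output prefix are
placeholder assumptions that you would replace with your own values.

```python
import shlex


def build_raxml_ng_command(msa_path, model="BIN+G", threads=12,
                           prefix="explore", seed=1):
    """Assemble a raxml-ng tree search that starts from one parsimony tree.

    All default values are illustrative assumptions, not recommendations.
    """
    return [
        "raxml-ng",
        "--search",
        "--msa", msa_path,
        "--model", model,
        "--tree", "pars{1}",        # one parsimony starting tree
        "--threads", str(threads),
        "--seed", str(seed),
        "--prefix", prefix,
    ]


# "alignment.phy" is a hypothetical file name.
cmd = build_raxml_ng_command("alignment.phy")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell or a cluster job script; changing the `threads`
argument makes it easy to compare a few thread counts on short test runs.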
> - We are currently running the process on a single machine but we have access to a large datacenter,
> but from my understanding it's not a good idea to increase the threads too much (raxml suggested 12
> and we used 10, because I was afraid of being out of memory, we have only 128GB on this machine)
I'd try more threads, since memory consumption will most likely grow rather moderately for this
analysis.
> - We generate the binary sequences ourselves, so we can change the data if needed. We thought about
> removing characteristics present in only one sequence, but we are unsure of the impact.
This will obviously reduce computation overhead a bit, but will also reduce parallelization
scalability, because fewer alignment sites remain to be distributed across threads. So depending on
how many threads/cores you have in your system, you might not gain too much.
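
If you want to test the effect, here is a minimal sketch of such a filter. It assumes the alignment
is held as a dict mapping taxon name to a string of '0'/'1' characters of equal length, and that
"present in only one sequence" means exactly one sequence has a '1' in that column; the function
name and the toy data are made up for illustration.

```python
def drop_singleton_columns(sequences):
    """Drop columns in which exactly one sequence carries the '1' state.

    Returns the filtered alignment and the indices of the columns kept.
    """
    if not sequences:
        return {}, []

    length = len(next(iter(sequences.values())))
    if any(len(seq) != length for seq in sequences.values()):
        raise ValueError("all sequences must have the same length")

    # Keep a column unless its '1' count is exactly one.
    kept = [
        i for i in range(length)
        if sum(seq[i] == "1" for seq in sequences.values()) != 1
    ]
    filtered = {
        name: "".join(seq[i] for i in kept)
        for name, seq in sequences.items()
    }
    return filtered, kept


# Toy alignment: columns 0 and 2 are each present in a single sequence.
toy = {"A": "11010", "B": "01100", "C": "01010"}
reduced, kept = drop_singleton_columns(toy)
print(reduced, kept)
```

Comparing the number of kept columns with the original alignment length gives a quick idea of how
much work per thread would be left after filtering.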
In general, for any large dataset, I always recommend doing some exploration before submitting a
final run and waiting for weeks until it finishes. For instance, if a single SPR round takes many
days, or the number of rounds goes into hundreds (or even above 30-40), this is a clear sign that
you should have a second look at your dataset / analysis parameters / parallelization.
Hope this helps,
Alexey