Hi Pablo,
There are two possible reasons for this to happen:
1) Under the Dirichlet process, the expected number of clusters grows roughly as log(n), where n is the number of mutations, so more mutations naturally lead to more clusters.
2) The sampler may not be mixing well. By default, PyClone initially puts every mutation in its own cluster. With a large number of mutations, the MCMC run can take quite a while to merge the SNVs into the right number of clusters.
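To give a feel for point 1), here is a rough sketch of the expected number of clusters under a Dirichlet process (Chinese restaurant process) prior, which is the sum of alpha / (alpha + i) for i = 0..n-1. The function name and the concentration value alpha = 1.0 are my own choices for illustration, not PyClone's settings, and this describes the prior only, not what the posterior will give on your data.

```python
import math


def expected_clusters(n, alpha=1.0):
    """Expected number of occupied clusters for n items under a
    Chinese restaurant process with concentration parameter alpha.

    alpha=1.0 is an illustrative assumption, not a PyClone default.
    """
    # Item i+1 starts a new cluster with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(n))


if __name__ == "__main__":
    # Compare the exact prior expectation with the log(n) rule of thumb.
    for n in (100, 1000, 10000):
        print(n, round(expected_clusters(n), 2), round(math.log(n), 2))
```

The point is that the prior expectation grows slowly with n, so hundreds of clusters for a few thousand SNVs would be far more than the prior alone suggests.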
I suspect it is case 2) that is the problem here. Two solutions:
1) Add the `--init_method connected` flag when sampling. This initializes the sampler so that all SNVs start in the same cluster. For large numbers of mutations this will speed things up, at the risk of getting stuck in a local optimum. In practice this does not seem to be a big problem, though.
2) Increase the number of iterations of the sampler. The default of 10,000 is probably fine for ~100 SNVs, but you will need many more iterations for thousands of SNVs.
Best wishes,
Andy