Hi,
Thanks for the interest! First I'll put a couple of links in case you didn't find the relevant bit of the manual. Also, I think I talk mostly there about speed, as opposed to memory, but the problems and solutions tend to be closely correlated.
The answer depends quite a bit on what you're running: first, on whether it's parameter caching or partitioning, but also on whether it's the supervisory python process that's using the most memory, or the subsidiary smith-waterman/bcrham processes. But I can guess, and describe the two cases where I have run into memory limitations.
First, given the numbers you mention, it's quite likely it's just during parameter caching. In this case there's nothing complicated going on (as opposed to in partitioning, where it's in principle possible for some pathological repertoire structure to wreak havoc with the various strategies for avoiding all-against-all comparison). Looking at an output file I have lying around for 240k sequences, it's 350 MB, so for 10 million seqs that'd be about 14 GB (350 MB x 10,000,000 / 240,000 ≈ 14.6 GB). And there's quite a bit of information in the in-memory dictionaries that we don't write to disk (otherwise we'd have to keep recalculating it), so I could easily see just one copy of the annotations taking up two or three times that in memory. (I actually spent quite a bit of time in the past trying to measure how much memory different things consumed, and the conclusion seemed to be that there isn't a reliable way to measure this, i.e. a way that does a good job of counting subsidiary referenced objects.) And off the top of my head, we keep around at least two copies of the annotations (smith-waterman plus hmm), as well as the input sequence information, in memory at once. And that's before trying to do anything with it (which'll use more), plus inefficiencies from memory allocation. So I could see that getting into the hundreds of GB range, which is consistent with my fairly high confidence that there isn't any sort of memory leak, just potential for memory optimization.
As to optimizing more for memory -- I think the main avenues would be either to write more of the intermediate info to disk, or to reduce the info stored in each annotation in memory. But both of these would involve significant speed tradeoffs (either i/o time, or recalculating the info when we need it). So while there's for sure always scope for optimization, I've spent quite a bit of time in the past going through things to make sure there isn't a ton of wastage, so I'm not very optimistic that the footprint here can be reduced a lot without compromising speed way too much.
If, on the other hand, partitioning is the problem, the cause is typically different. Here, as with most any clustering algorithm, there are a lot of shenanigans involved in figuring out whether each sequence should be grouped together with each other sequence without actually doing the all-against-all comparison. And there are certainly ways a (fairly pathological) repertoire can be structured that circumvent some of these -- for instance, if it consists of a large number of sequences whose inferred naive sequences are similar enough that they can't be quickly split apart, but far enough apart that the full hmm calculation is required for most of them. If a clustering step is taking a long time, you can check what's going on by looking in the work dir (ps auxw will show it), where we write a .progress file that shows details of memory usage, log probs, annotations calculated, and cluster structure (see the sketch below). That said, in partis versions from the last several years I haven't had issues with any funny business here, other than some repertoire structures taking longer than others.
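Something along these lines works for poking at a running partition step (the grep pattern and paths are just placeholders; the actual work dir is whatever shows up in the ps output):

```
# find the work dir from the running partis/bcrham processes (it shows up in their command lines)
ps auxw | grep -e partis -e bcrham
# then locate and watch the progress file(s) written there ('<workdir>' is whatever path the ps output shows)
find <workdir> -name '*.progress' | xargs tail -f
```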
ok, none of which you actually asked about ;-), I'm just thinking out loud. So if it's parameter caching:
- subsample. The parameters don't really get much more accurate past even tens of thousands of sequences, so caching parameters on several hundred thousand sequences would for sure be fine (the largest I've had cause to do was I think 3 million, which worked fine on a server similar to yours, but I think was also using a significant fraction of the memory). A good way to check the accuracy would be to run several subsamples with --plotdir <plotdir> set, and compare the plots (and germline sets) from the different subsamples. If sed'ing or some such isn't appealing, there are several options for selecting various subsets of the input file: --n-max-queries, --n-random-queries, and --istartstop. If you wanted independent 300k subsets of the input file, for instance, you could run with --istartstop 0:300000, --istartstop 300000:600000, and --istartstop 600000:900000 (see the sketch below). If you're not absolutely certain the input file is randomized, it might be safer to use --n-random-queries, in which case the subsets wouldn't be independent, but that shouldn't really matter.
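For instance, something like this (file and directory names are just placeholders, and double check the current action/flag names against the manual):

```
# cache parameters on three independent 300k subsets, writing plots for each so they can be compared
./bin/partis cache-parameters --infname repertoire.fa --istartstop 0:300000      --parameter-dir _output/params-0 --plotdir _output/params-0/plots
./bin/partis cache-parameters --infname repertoire.fa --istartstop 300000:600000 --parameter-dir _output/params-1 --plotdir _output/params-1/plots
./bin/partis cache-parameters --infname repertoire.fa --istartstop 600000:900000 --parameter-dir _output/params-2 --plotdir _output/params-2/plots
```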
if on the other hand it's partitioning (which uses more memory than parameter caching, so if it *was* parameter caching causing problems, it soon *will* be partitioning):
- read the bit of the manual linked above, but the short answer is that vsearch clustering is much faster and less memory intensive. It's also much less accurate, but it's still way more accurate than other methods (see paper), since it clusters on inferred naive sequences, rather than on mature sequences, and uses shm-rate-dependent thresholds. A rough sketch of the invocation is below.
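Something like this, I think -- treat the vsearch flag name as an assumption on my part and check it against the manual section linked above:

```
# vsearch-based (fast, approximate) partitioning -- the --naive-vsearch flag name is from memory, so verify it in the manual
./bin/partis partition --infname repertoire.fa --parameter-dir _output/params --outfname _output/partition-vsearch.yaml --naive-vsearch
```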
- subsample. If you're doing diversity studies of the whole repertoire, unless you're concerned with the far tail of the frequency distribution, there's typically little benefit to running on the entire sample as opposed to several subsamples, and several subsamples are in any case required in order to measure uncertainties. While the time required depends entirely on repertoire structure, as a rough guide, these days I typically run on 100k-ish subsamples, on a server similar to yours, and these take several hours to a day (see the sketch below). It would for sure be great if we could run on millions of sequences in this amount of time, but it's tricky, because all that changes as the sample size gets really large is the far tails of the frequency distribution, and those tails are precisely where you really need the vastly better accuracy of partis compared to more heuristic methods (like vsearch partis, or the other methods in the paper).
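For example, something like this for a couple of 100k subsamples (again, names are placeholders, and --parameter-dir points at wherever you cached parameters above):

```
# partition a few 100k subsamples, each with its own output file, so uncertainties can be estimated across them
./bin/partis partition --infname repertoire.fa --parameter-dir _output/params --istartstop 0:100000      --outfname _output/partition-0.yaml
./bin/partis partition --infname repertoire.fa --parameter-dir _output/params --istartstop 100000:200000 --outfname _output/partition-1.yaml
```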
- use seed sequences. Even if you're not interested in specific lineages to start with, there are a lot of ways you can pull seed sequences out of the repertoire (say, choose a cluster with vsearch partis, or with subsampling, and use its inferred naive sequence as a seed), and then run full partis with a seed sequence on the whole sample (rough sketch below). Since it entirely avoids the all-against-all problem inherent to full clustering, seed clustering can be run on samples tens to hundreds of times larger in the same time/memory.
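A rough sketch of what I mean -- I believe the relevant options are --seed-unique-id (to name a sequence already in the file) or --seed-seq (to pass the sequence itself), but treat those names as assumptions and check the manual:

```
# seed partitioning on the whole sample: only clusters containing the seed get fully worked out
./bin/partis partition --infname repertoire.fa --parameter-dir _output/params --outfname _output/seed-partition.yaml --seed-seq <inferred-naive-sequence>
```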