Some general partis questions


Jesse Connell

Mar 1, 2023, 5:31:23 PM
to partis
I'm trying to use partis to help guide my search for a lineage of interest in a bunch of bulk MiSeq human antibody sequencing data, and I realized there are some basics I'm still fuzzy on, mainly around partitioning.  Thanks in advance for any advice here!

My starting point is a set of per-timepoint FASTA files, already cleaned up a bit (productive, complete antibody sequences only) and dereplicated, where I have a few tens of thousands of sequences per timepoint.  I already have a set of lineage members identified at each timepoint, but I'm now searching more deeply to try to see if I can find more that might fill in the tree between those and some from a related dataset.  I partitioned one later timepoint where I know our lineage is already quite mutated, and saw partis group the lineage members into three clonal families.  That raised some questions about how I should get the most accurate groupings, aiming to group all the members together but avoid including non-members:
  • Am I better off pooling all the sequences across timepoints and partitioning in one big set (around 200k or so), so it has as much information to work with as possible during the clustering?  That makes intuitive sense to me (speaking for myself and my semi-manual method, I need the earlier lineage members to help confirm the later more-divergent members, after all) but I have only a vague sense of how the clustering actually works.  (Is 10.1371/journal.pcbi.1005086 the best place to dig into that part more?)
  • If I use --seed-unique-id to focus partitioning on my lineage of interest, is the end result for that lineage any different than if I just pull the clonal family/families for the existing sequence from the default output, or is it just faster?  Are multiple IDs supported for this feature?
  • Does partis have any support for inferring VDDJ recombinations?  It's only tangentially related, but I spotted a few strange lineages here and there in that partition output that look like disparate lineages that were combined, and possibly some instances of VDDJ tripped things up.  We tend to work with lineages with long CDRH3s, and from what I hear (e.g. 10.1101/gr.259598.119) VDDJ tends to crop up more frequently in that context.  Both IgBLAST and IMGT V-QUEST were completely befuddled when trying even to annotate one of these cases, so I can hardly blame partis for being confused.
Thanks!

Jesse

Duncan Ralph

Mar 1, 2023, 8:07:10 PM
to Jesse Connell, partis
Thanks for the thoughtful questions. First, it sounds like you're looking to find any seqs that *might* reasonably be clonal, in which case you'd probably want to look at the overclustering options, --n-final-clusters and --min-largest-cluster-size, which make it keep merging beyond the best partition (as always there's still a list of partitions in the output file; the list just keeps going further than it normally would've).
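
For concreteness, an overclustered run could look something like the following (the file names and the size threshold here are just placeholders):

partis partition --infname all-timepoints.fa --outfname partition.yaml --min-largest-cluster-size 5000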

1. This hasn't been established, but I agree it seems likely that having the less-mutated sequences "connecting" disparate branches might help. It shouldn't make a significant difference, though: unlike a pure distance-based approach, which just connects blobs (so an intermediate blob would join to the blobs on either side), partis kind of replaces each blob with its inferred naive sequence, so the difference would mostly be that it'd probably infer a more accurate naive sequence. But it might help. Yes, that paper is where you'd start, but the more recent paired clustering paper also has a lot of info about single chain clustering (e.g. S2 fig, attached).

2. Results are the same to within uncertainties, which is to say that if you only look at the best partition they can look quite different in terms of family size, just from shuffling around within those uncertainties. No, you can't do multiple ids at once, unfortunately. If you want a whole family, it's best to seed with the naive sequence (or maybe the consensus of the observed seqs? Could try both).
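
If it's useful, here's a minimal sketch of one way to build such a consensus (a simple per-position majority vote over aligned, equal-length seqs; this is standalone illustration code, not a partis function):

from collections import Counter

def consensus(seqs):
    # per-position majority vote; assumes seqs are aligned and all the same length
    assert len(set(len(s) for s in seqs)) == 1
    return ''.join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

print(consensus(['ATGCA', 'ATGCT', 'ATACA']))  # --> ATGCA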

3. Nope, unfortunately. You could give it some minimal support by adding some fake DD genes (concatenated pairs of D genes) to the D germline set, though, at least if you didn't care too much about representing lots of different deletion lengths in the middle between the two D genes.
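
To sketch what that could look like (the file names here are hypothetical, and this ignores the deletion-length question entirely; it just writes every ordered concatenated pair alongside the original genes):

from itertools import permutations

def read_fasta(fname):
    # minimal fasta reader --> {name : seq}
    seqs, name = {}, None
    with open(fname) as f:
        for line in f:
            line = line.strip()
            if line.startswith('>'):
                name = line[1:].split()[0]
                seqs[name] = ''
            elif name is not None:
                seqs[name] += line
    return seqs

d_genes = read_fasta('ighd.fasta')  # hypothetical input: your existing D germline set
with open('ighd-with-dd.fasta', 'w') as outf:
    for name, seq in d_genes.items():  # keep the real D genes
        outf.write('>%s\n%s\n' % (name, seq))
    for (n1, s1), (n2, s2) in permutations(d_genes.items(), 2):  # add the fake DD genes
        outf.write('>%s_%s\n%s%s\n' % (n1, n2, s1, s2))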

[Attachment: p.jpg (S2 fig from the paired clustering paper)]

Jesse Connell

Mar 2, 2023, 5:05:31 PM
to Duncan Ralph, partis
Perfect, thanks very much!  I'm trying all-in-one partitioning now.  I think this sounds promising based on the paper, particularly the bits about unbalanced trees and the subcluster annotation.  (That figure S2 captures a lot of what's so hard about all this, and why I keep pushing back against ideas like "group based on >=80% CDR3 similarity" and the like when talking with colleagues!)  I won't worry about seeded partitioning for the moment since it's easy enough to just let it run on everything and come back when it's ready.

From my one-timepoint case earlier, I got the default set of ten partitions, each labeled with a logprob of -inf.  I'd first assumed that was just an issue with writing a bunch of really tiny (pre-log) probabilities to YAML, but is seeing "-inf" there a red flag?  (And if it's not a problem, are the partitions ordered according to probability, or can I not assume a particular order?)

I'll check out those overclustering options too.  (You're right, my approach here is to grab hold of any sequences plausibly related and then rule out the ones that aren't, so those options definitely look relevant.)  With those options, is the idea that the output list of partitions gets as long as needed, until at least one partition in the list meets one of those criteria?

Jesse

Duncan Ralph

Mar 2, 2023, 8:29:27 PM
to Jesse Connell, partis
I just made a note to update the docs with more detail on the partitions, but for now: it clusters with hierarchical agglomeration, so its final result is a list of partitions with a decreasing number of clusters. Calculating the probability of an entire partition is expensive, and we don't need it for the earlier partitions, since at that stage there are lots of heuristic merges still to do (e.g. naive hamming merges that we're totally certain about). So the earlier partitions have their logprob set to -inf. By default, partitioning proceeds until the next merge would decrease the total logprob, so usually the most likely partition is the last or next-to-last one. If you set e.g. --n-final-clusters or --min-largest-cluster-size, it keeps merging past where the logprob starts decreasing, and writes all these additional partitions to the output file.

The helper class I use for reading the partition lists is ClusterPath, which for instance could be used by inserting the following at line 15 in bin/parse-output.py:

import sys  # for the sys.exit() below (if it isn't already imported at that point)
from clusterpath import ClusterPath
cp = ClusterPath(fname='test/ref-results/partition-new-simu.yaml')
ptn = cp.best()  # the best partition in the file
cp.print_partitions()  # print a summary of all the partitions in the file
sys.exit()  # stop before the rest of parse-output.py runs

Generally the logprobs will thus increase until the best partition and then decrease, but I doubt that's really guaranteed; ClusterPath reads all the partitions in the list and marks the one with the highest logprob as the best.
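
In other words, the best-partition logic amounts to roughly this (a standalone toy sketch, not ClusterPath's actual code):

import math

def best_partition_index(logprobs):
    # index of the highest finite logprob; if they're all -inf, fall back to
    # the last (i.e. most-merged) partition
    finite = [(lp, i) for i, lp in enumerate(logprobs) if not math.isinf(lp)]
    return max(finite)[1] if finite else len(logprobs) - 1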

Jesse Connell

Mar 3, 2023, 1:13:29 PM
to Duncan Ralph, partis
Thanks, I really appreciate the detailed info, and these links straight into the codebase.

So it keeps agglomerating things until the logprob would decrease instead of increase, stops there, and writes out that partition plus however many we've asked for (default of ten total, right?) to the file.  And if I give --n-final-clusters or --min-largest-cluster-size, it instead goes as far as it needs to to meet those criteria before stopping, rather than following the "next merge decreasing total logprob" rule.  Is that right?

But, how does it decide when the logprob is increasing or decreasing when making a new partition if it's not calculating the logprob of each partition?  In my file it looks like ClusterPath just falls back on marking the last partition in the list as the best, since they're all -inf and math.isinf(logprob) keeps being True (see below).  I guess my question for this part is, how does it decide when it's "early" versus when it should actually calculate the logprob?  Is that logic within bcrham?

              logprob   delta   clusters  n_procs      
                                                       
             -inf                  15339     2         
             -inf        nan       15338     2         
             -inf        nan       15337     2         
             -inf        nan       15336     2         
             -inf        nan       15335     2         
             -inf        nan       15334     2         
             -inf        nan       15333     2         
             -inf        nan       15332     2         
             -inf        nan       15331     1         
     best*   -inf        nan       15330               



Duncan Ralph

Mar 3, 2023, 3:16:44 PM
to Jesse Connell, partis
> So it keeps agglomerating things until the logprob would decrease instead of increase, stops there, and writes out that partition plus however many we've asked for (default of ten total, right?) to the file.  And if I give --n-final-clusters or --min-largest-cluster-size, it instead goes as far as it needs to to meet those criteria before stopping, rather than following the "next merge decreasing total logprob" rule.  Is that right?
Yes, that's right.
 
> But, how does it decide when the logprob is increasing or decreasing when making a new partition if it's not calculating the logprob of each partition?  In my file it looks like ClusterPath just falls back on marking the last partition in the list as the best, since they're all -inf and math.isinf(logprob) keeps being True (see below).  I guess my question for this part is, how does it decide when it's "early" versus when it should actually calculate the logprob?  Is that logic within bcrham?

After collapsing all seqs with identical naive seqs, it starts hier agglom. During hier agglom it first does any naive hamming merges (merging clusters with very similar naive seqs) before calculating any forward probs. If you're seeing all partitions with -inf logprob, the last one is definitely the best, but I'd say it's basically a bug if it's not setting any logprobs, since that's confusing. I'll have to dig in to figure it out, though, since I haven't seen it do that. It probably means there were no potential merges left after finishing the naive hamming merges. And yes, this is all in ham.
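
To make the naive hamming merge step concrete, here's a toy, runnable sketch (the threshold and the single-linkage simplification are made up; in reality the merged cluster's naive sequence would get re-inferred rather than reusing the originals):

def hamming_fraction(a, b):
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / float(len(a))

def naive_hamming_merges(naive_seqs, threshold=0.08):
    # greedily merge clusters (indices into naive_seqs) whose naive seqs are
    # within <threshold> hamming fraction of each other
    groups = [{i} for i in range(len(naive_seqs))]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if any(hamming_fraction(naive_seqs[a], naive_seqs[b]) < threshold
                       for a in groups[i] for b in groups[j]):
                    groups[i] |= groups.pop(j)
                    merged = True
                    break
            if merged:
                break
    return groups

print(naive_hamming_merges(['ATGCA', 'ATGCT', 'TCAGG'], threshold=0.25))  # --> [{0, 1}, {2}]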

Jesse Connell

Mar 3, 2023, 5:09:43 PM
to Duncan Ralph, partis
Interesting.  I'll try to narrow down a minimal example of the "all -inf" thing for you.  And thanks again, I think I have a better sense of the whole process now.