Proper format of input for paired case (and in general)

Jesse Connell

Feb 23, 2022, 1:43:36 PM

to partis

Hello,

Are there any specific requirements for supplying the sequences and metadata for the paired heavy/light chain case when not necessarily working with 10x data? The paired-loci documentation mentions running a command like the partition step and just giving --paired-loci directly with a FASTA as input, but I don't see how it'll know what's what without either separate metadata or a particular sequence ID format (like 10x for example).

I've watched what the test script does in supplying a .fa and a .yaml with the pairing info supplied via the YAML file, and I can run that myself successfully, but I get tripped up trying that approach with my own data. I tried first with my own arbitrary sequence IDs but it seems to also expect an <ID>-<locus> format with the same ID between heavy and light chains and only the locus suffix varying. I got it working after modifying my original IDs to follow that format and also supplying YAML, but I'm still not quite sure I'm handling this right, so I'd appreciate any guidance there.

I'm also curious about what sort of inputs are more or less appropriate for partis in general. Most of what I work with (and what I'm also eager to try partis on!) is bulk/unbiased unpaired rep-seq data from MiSeq runs, but in this case I'm working with a few hundred paired Sanger sequences from B cells selected for a particular phenotype, and I'm wondering if the relatively lower quality and the low sequence count will be a problem here, especially for the parameter estimation step.

Thanks!

Jesse

Duncan Ralph

Feb 23, 2022, 6:06:22 PM

to Jesse Connell, partis

ah yeah, i'll make a note to update that part of the docs. The safest thing is for sure to specify your own pairings with --input-metafnames with the key 'paired-uids'. You'd want it to look like test/paired/ref-results/test/simu/meta.yaml (altho that uses the json subset of yaml, which is faster but less easily human readable; either works). I also rewrote the pair info extraction/guessing fcn recently to be more configurable/general so you might be able to use that depending what your data looks like. Use --droplet-id-separators and --droplet-id-indices either in this script (if you want to run a separate step by hand first) or in the main bin/partis script (which just runs that other script as a first step). lmk if those help messages aren't good enough, of course, not sure how clear they are.

--
You received this message because you are subscribed to the Google Groups "partis" group.
To unsubscribe from this group and stop receiving emails from it, send an email to partis+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/partis/63e09a59-2be5-497b-8a7d-19789b929d1an%40googlegroups.com.

Duncan Ralph

Feb 23, 2022, 8:24:05 PM

to Jesse Connell, partis

ah, and as regards the other question -- running on small samples (less than a few thousand seqs, say) should always work fine, but yeah, the parameter estimation will be less accurate. Partly just because there's less information, but also I haven't put a ton of effort into having super accurate priors. The biggest place where you might want to check this is the germline inference, since it basically doesn't have any priors (it's been a todo since forever, and there may actually exist enough long read germline sequencing these days to make it feasible). I.e. it can't really infer [m]any alleles on small samples, so it usually will remove too many. You can play around to see how things behave with --leave-default-germline or maybe --dont-remove-unlikely-alleles, maybe combined with using your own curated germline set.

Jesse Connell

Feb 24, 2022, 12:53:39 PM

to partis

Thanks! Right now I'm prepping my own YAML/JSON based on that meta.yaml you listed, and providing that with a separate cache-parameters call. That works, but only if my sequence IDs are modified to follow the sort of formatting that file has. So for example if my FASTA looks like:

>1-igh

...

>1-igk

...

and my metadata file looks like:

{"1-igh": {"paired-uids": ["1-igk"], "locus": "igh"}, "1-igk": {"paired-uids": ["1-igh"], "locus": "igk"},...

Then I can run something like this, and all is well:

partis cache-parameters --paired-loci --infname input.fasta --input-metafnames input.meta.yaml --parameter-dir params --paired-outdir data

partis partition --paired-loci --paired-indir data --parameter-dir params --paired-outdir output
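A metadata file in the shape shown above can be generated with a short script rather than written by hand. The following is a minimal Python sketch and not part of partis: the helper names `build_pairing_meta` and `write_pairing_meta` and the input tuples are hypothetical, and the only thing assumed about the format is the layout quoted above ('paired-uids' and 'locus' per sequence ID), written as JSON since the json subset of yaml was said to work.

```python
import json


def build_pairing_meta(pairs):
    """Build a {uid: {"paired-uids": [...], "locus": ...}} dict.

    `pairs` is an iterable of (heavy_id, light_id, light_locus) tuples,
    where light_locus is "igk" or "igl".
    Hypothetical helper for illustration, not a partis API.
    """
    meta = {}
    for heavy_id, light_id, light_locus in pairs:
        if light_locus not in ("igk", "igl"):
            raise ValueError(f"unexpected light chain locus: {light_locus!r}")
        if heavy_id == light_id or heavy_id in meta or light_id in meta:
            raise ValueError(f"duplicate sequence ID in pair ({heavy_id}, {light_id})")
        # Each chain lists its partner, so the pairing is recorded in both directions.
        meta[heavy_id] = {"paired-uids": [light_id], "locus": "igh"}
        meta[light_id] = {"paired-uids": [heavy_id], "locus": light_locus}
    return meta


def write_pairing_meta(meta, path):
    """Write the metadata as JSON text to `path`."""
    with open(path, "w") as handle:
        json.dump(meta, handle, indent=2)


# Example pairs following the <ID>-<locus> naming used in the FASTA above.
example_pairs = [("1-igh", "1-igk", "igk"), ("2-igh", "2-igl", "igl")]
example_meta = build_pairing_meta(example_pairs)
print(json.dumps(example_meta, indent=2))

# To produce the file passed to --input-metafnames in the commands above:
# write_pairing_meta(example_meta, "input.meta.yaml")
```

The sequence IDs in the tuples need to match the FASTA headers exactly, whatever naming scheme they follow.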

But if my sequence IDs are different (my originals used a naming scheme like "H1" paired with "K1", and then "H2" paired with, say, "L2", etc.) then I get this in my output from cache-parameters, even though the FASTA and metadata match up:

writing to paired subdirs
0/363 igh seqs pair with igk (warning)
0/363 igh seqs pair with igl (warning)

The first pair of entries in my FASTA and YAML here are like:

>H1

...

>K1

...

{"H1": {"paired-uids": ["K1"], "locus": "igh"}, "K1": {"paired-uids": ["H1"], "locus": "igk"}, ...

So it looks as though it's still insisting that my sequence IDs are formatted like the "guess" option for parsing droplet IDs, but wouldn't giving the paired-uids via the meta.yaml skip the guessing? Or does it still need to run the extract pairing info step even with metadata supplied? (Am I just conflating droplet IDs and UIDs incorrectly? I don't expect any parsing should need to happen since I supply the metadata, but I'm not totally clear on the droplet ID versus UID terminology.) It does work when I've reformatted my IDs, so I'm not stuck at the moment, at least. I just want to make sure I'm not screwing up something basic.
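Independent of how partis resolves the pairing, a quick consistency check between the FASTA IDs and the metadata can rule out simple mismatches as the cause of a "0/363 ... seqs pair with ..." warning. The following is a rough Python sketch under stated assumptions: the function names `read_fasta_ids` and `check_pairing` are hypothetical, nothing here calls partis, and the metadata is assumed to have the 'paired-uids' layout shown above.

```python
def read_fasta_ids(lines):
    """Return the sequence IDs (first word after '>') from FASTA lines."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            parts = line[1:].split()
            if parts:
                ids.append(parts[0])
    return ids


def check_pairing(fasta_ids, meta):
    """Return a list of human-readable problems (empty if none found).

    Hypothetical helper for illustration, not a partis API.
    """
    problems = []
    known = set(fasta_ids)
    for uid in fasta_ids:
        if uid not in meta:
            problems.append(f"{uid}: in FASTA but missing from metadata")
    for uid, info in meta.items():
        if uid not in known:
            problems.append(f"{uid}: in metadata but missing from FASTA")
        for partner in info.get("paired-uids", []):
            if partner not in meta:
                problems.append(f"{uid}: partner {partner} has no metadata entry")
            elif uid not in meta[partner].get("paired-uids", []):
                problems.append(f"{uid}: pairing with {partner} is not reciprocal")
    return problems


# Example using the first pair of entries quoted above.
fasta_lines = [">H1", "ACGT", ">K1", "ACGT"]  # placeholder sequences
meta = {
    "H1": {"paired-uids": ["K1"], "locus": "igh"},
    "K1": {"paired-uids": ["H1"], "locus": "igk"},
}
problems = check_pairing(read_fasta_ids(fasta_lines), meta)
print(problems if problems else "FASTA IDs and metadata are consistent")
```

If a check like this comes back clean and the warning still appears, the mismatch is more likely in how the pairing info is being read than in the input files themselves.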

Jesse

Jesse Connell

Feb 24, 2022, 12:58:47 PM

to partis

And thanks for this! Even with these limitations in mind I can already see that I'm getting a more accurate partition result compared to other tools for a few cases I've manually reviewed so far. This is very promising already, and all the more so for our rep-seq datasets when I give it a try with those.

Duncan Ralph

Feb 24, 2022, 8:08:13 PM

to Jesse Connell, partis

whoops, darn it. Yeah, thanks, 'paired-uids' in --input-metafnames clearly ought to take precedence over values from extract-pairing-info.py, but it turns out i never actually implemented that. Just pushed a fix that checks for 'paired-uids' in any of --input-metafnames.

Jesse Connell

Feb 25, 2022, 3:28:47 PM

to partis

Not sure if my message a minute ago went through, so just to reiterate, it works now! Thanks. I get a "droplet id group check failed" warning from the partition command, but the partition output is the same as when I was altering my IDs before, so I think it's doing the right thing. Thanks for your help.
