Custom germline database


Kristian Davidsen

Aug 12, 2016, 10:53:57 AM
to partis
I have two questions regarding the flexibility of the germline sets in partis:

1) Can I replace the current human germline database with a custom one containing only the germlines that my antibody sequences can derive from? I am getting data from a transgenic humanized mouse, so I know the full germline set beforehand and want to restrict partis to choosing from it.

2) Is it possible to add a new germline database for a different animal? I may soon get data from chicken antibody repertoires, and of course I would like to use partis in my analysis.

Best regards,
Kristian

Duncan Ralph

Aug 12, 2016, 5:07:31 PM
to Kristian Davidsen, partis

1) Yeah, just use --initial-germline-dir (sp? I'm on my phone). Look at the default to see the format, it's fairly simple.

2) Also yeah. I added a species flag (only mouse and human so far), but as you say, this just changes the germline set.


--
You received this message because you are subscribed to the Google Groups "partis" group.
To unsubscribe from this group and stop receiving emails from it, send an email to partis+unsubscribe@googlegroups.com.
To post to this group, send email to par...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/partis/ec66c833-d4ca-4379-8be8-074fe4fe1238%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

aagata....@gmail.com

Nov 9, 2018, 9:04:21 AM
to partis
@Duncan Does this also require an example.csv file for each chain type, or just fasta files with the germline sequences of interest from IMGT? Could you explain the contents of this file and how/where it can be obtained? I want to use just a specific set of reference sequences from IMGT, and that would be very helpful since I'm just starting with partis :)



Duncan Ralph

Nov 9, 2018, 11:10:33 AM
to aagata....@gmail.com, par...@googlegroups.com
Hi,

Yeah, you will need extras.csv as well as the fastas. It has the cyst/tryp/phe position info. But the functions in python/glutils.py are fairly clever about figuring out info for missing sequences, so typically you can just use an existing extras.csv and it will align things correctly automatically, as long as there are some genes in common.

Backing up a little: you'd want to copy the relevant locus directory (e.g. data/germlines/human/igh if you want human igh) to another directory:

mkdir new-germline-dir
cp -r data/germlines/human/igh new-germline-dir/

Note that it's important that igh/ ends up as a subdir with the name 'igh'. Then modify the fasta files in there to your liking, and run partis with

[...] --initial-germline-dir new-germline-dir --locus igh [...]
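
To make the "modify the fasta files" step concrete, here is a minimal Python sketch (not part of partis) that trims a fasta file down to a known list of gene names. The helper names read_fasta and restrict_fasta are made up for illustration, and the toy file stands in for whichever fasta files sit in new-germline-dir/igh/. It assumes plain fasta with the gene name as the whole header line.

```python
import os
import tempfile


def read_fasta(path):
    """Return a list of (name, sequence) tuples from a FASTA file."""
    records, name, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith('>'):
                if name is not None:
                    records.append((name, ''.join(chunks)))
                name, chunks = line[1:].strip(), []
            else:
                chunks.append(line)
    if name is not None:
        records.append((name, ''.join(chunks)))
    return records


def restrict_fasta(path, allowed_genes):
    """Rewrite the FASTA at path in place, keeping only names in allowed_genes."""
    kept = [(n, s) for n, s in read_fasta(path) if n in allowed_genes]
    with open(path, 'w') as handle:
        for name, seq in kept:
            handle.write('>%s\n%s\n' % (name, seq))
    return [n for n, _ in kept]


# Toy demonstration on a throwaway file (invented names and sequences,
# not real germline genes).
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy.fasta')
with open(toy_path, 'w') as handle:
    handle.write('>geneA\nACGT\nACGT\n>geneB\nTTTT\n>geneC\nGGGG\n')

kept_names = restrict_fasta(toy_path, allowed_genes={'geneA', 'geneC'})
print(kept_names)  # ['geneA', 'geneC']
```

Any gene whose name is not in the allowed set is dropped without comment, so it is worth comparing the returned list against the list you expected, in case a naming mismatch removed something you wanted to keep.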

If you want to be more careful, so you can see how it's figuring out the extras.csv info, you could, after modifying the new dir, add these lines to, say, the top of bin/example-output-processing.py (just after it imports glutils):

ref_glfo = glutils.read_glfo('data/germlines/human', locus='igh', debug=True)
new_glfo = glutils.read_glfo('new-germline-dir', locus='igh', template_glfo=ref_glfo, debug=2)  # adds missing codon info
glutils.print_glfo(new_glfo)  # print the cleaned up glfo
glutils.write_glfo('cleaned-up-new-germline-dir', new_glfo, debug=True)
sys.exit()


Passing ref_glfo as template_glfo makes it more likely that it can figure out the cyst/tryp/phe info. And you'll probably want to pipe the output to '|less -RS', especially if you're on a laptop.


Duncan Ralph

Nov 9, 2018, 11:13:44 AM
to aagata....@gmail.com, par...@googlegroups.com
Oh, and in cases where you just want to restrict it to a subset of the existing genes, you might also look at --only-genes.

aagata....@gmail.com

Nov 13, 2018, 7:14:17 AM
to partis
Thanks a lot Duncan!

I tried both solutions and have some questions I'd appreciate your help with. First I tried setting a subset of genes of interest, but I got a warning that a lot of my genes are not in glfo to begin with. I am assuming there are no references for those in the default germlines, so they would not be taken into account?

Then I tried supplying a folder where I keep extras.csv but replace the germline sequences, as you suggested. It seems to be working, but I am getting a warning saying "1 bad cyst positions" and another saying that some of the sequences have multiple names (I assume this refers to genes where some of the regions are identical but they differ in one or more of the other regions?). How could those impact my results?

I also came across the problem with the pysam version, which I had to update, but I suppose you are aware of this one since it was described previously :)

Thanks for the fast reply!

Duncan Ralph

Nov 13, 2018, 4:38:57 PM
to Agata Szymanek, par...@googlegroups.com
Great! See inline.

On Tue, Nov 13, 2018 at 4:14 AM <aagata....@gmail.com> wrote:
> Thanks a lot Duncan!
>
> I tried both solutions and have some questions I'd appreciate your help with. First I tried setting a subset of genes of interest, but I got a warning that a lot of my genes are not in glfo to begin with. I am assuming there are no references for those in the default germlines, so they would not be taken into account?

Yeah, --only-genes is just for restricting to specific genes that are in whatever germline set it's using for that run already.
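
One way to see in advance which names will trigger that warning is to compare your gene list against the fasta headers of the germline directory you are running with. A minimal sketch, assuming the gene name is the whole header line. fasta_names and missing_genes are hypothetical helpers, not partis functions, and the toy file and names are invented:

```python
import os
import tempfile


def fasta_names(path):
    """Return the set of sequence names (header lines minus '>') in a FASTA file."""
    with open(path) as handle:
        return {line[1:].strip() for line in handle if line.startswith('>')}


def missing_genes(requested, fasta_paths):
    """Return, sorted, the requested names that appear in none of the FASTA files."""
    available = set()
    for path in fasta_paths:
        available |= fasta_names(path)
    return sorted(set(requested) - available)


# Toy demonstration (invented names and sequences).
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, 'toy.fasta')
with open(toy_path, 'w') as handle:
    handle.write('>geneA\nACGT\n>geneB\nTTTT\n')

not_found = missing_genes(['geneA', 'geneX', 'geneB', 'geneY'], [toy_path])
print(not_found)  # ['geneX', 'geneY']
```

Anything in the returned list would have to be added through a custom germline directory rather than selected with --only-genes.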

 

> Then I tried supplying a folder where I keep extras.csv but replace the germline sequences, as you suggested. It seems to be working, but I am getting a warning saying "1 bad cyst positions" and another saying that some of the sequences have multiple names (I assume this refers to genes where some of the regions are identical but they differ in one or more of the other regions?). How could those impact my results?

The bad cyst position is telling you that one of the genes has a cysteine (start of cdr3) that is mutated, out of frame, or otherwise messed up. Most people I think are skeptical that germline genes without functional cysteines are expressed, or at least that they're common, so unless I had a strong reason to think such a sequence was actually in the subject's germline, I'd probably remove it.
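
As a rough illustration of what a "bad cyst position" means, the sketch below checks whether the codon at a given position encodes cysteine (TGT or TGC). It assumes the position is a 0-based index of the codon's first base within the V gene sequence, which is an assumption about how extras.csv stores it and worth checking against the default files. has_cysteine_at is a hypothetical helper, and partis's own check may be stricter (for instance about reading frame).

```python
CYS_CODONS = {'TGT', 'TGC'}  # the two codons that encode cysteine


def has_cysteine_at(seq, cyst_position):
    """True if the codon starting at 0-based index cyst_position encodes cysteine.

    A sequence too short to contain a full codon at that position counts as bad.
    """
    codon = seq[cyst_position:cyst_position + 3].upper()
    return codon in CYS_CODONS


# Toy sequences (invented, not real germline genes).
print(has_cysteine_at('AAATGTGGG', 3))  # True: codon is TGT
print(has_cysteine_at('AAATGAGGG', 3))  # False: TGA is a stop codon
print(has_cysteine_at('AAATG', 3))      # False: sequence ends mid-codon
```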

Multiple names means that it was given two different gene names that correspond to the same sequence. Presumably this was because the new germline info you passed it had a sequence that was already in imgt under a different name, or something? Generally speaking, the nomenclature in ig is so messed up that you can never assume you know the sequence just because someone tells you a name, so in that sense this isn't a big deal, since your code kind of already needs to know not to trust names. On the other hand if you were assuming unique names, this'd be a problem.
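
If your downstream code does depend on names, a quick way to find such collisions is to group the gene names by sequence. A minimal sketch with a hypothetical helper and invented toy records (not partis code):

```python
from collections import defaultdict


def names_by_shared_sequence(records):
    """Map each sequence that occurs under more than one name to its list of names.

    records is an iterable of (name, sequence) tuples; comparison ignores case.
    """
    groups = defaultdict(list)
    for name, seq in records:
        groups[seq.upper()].append(name)
    return {seq: names for seq, names in groups.items() if len(names) > 1}


# Toy records (invented names and sequences).
toy_records = [('geneA', 'ACGT'), ('geneB', 'acgt'), ('geneC', 'GGGG')]
duplicates = names_by_shared_sequence(toy_records)
print(duplicates)  # {'ACGT': ['geneA', 'geneB']}
```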

 

> I also came across the problem with the pysam version, which I had to update, but I suppose you are aware of this one since it was described previously :)

I've had two other people have issues with pysam versions, but I haven't really gotten enough information to figure out what actually happened in either case or what I should do to avoid other people running into it. What happened, and how did you fix it?


 

Agata Szymanek

Nov 20, 2018, 11:13:58 AM
to dkr...@gmail.com, par...@googlegroups.com
Thanks a lot for your explanations, this was very helpful.

Regarding pysam version issue, I just updated partis' version in the container as follows:

docker exec -it partis conda update -y -c bioconda pysam

I believe it would be possible to upgrade pysam's version in the container by modifying the Dockerfile? Hope this helps.


Duncan Ralph

Nov 20, 2018, 11:44:27 AM
to agata.s...@ardigen.com, partis
Yes! That's exactly right, what I needed to add was the biocore channel, which has a much newer version of pysam. I think moving most of the conda installs to one line also helps. I actually finally replicated it last week and pushed a fix. 

On Tue, Nov 20, 2018, 8:14 AM <agata.s...@ardigen.com> wrote:

> Actually what I meant was updating pysam's version, not partis' :)