Understanding the naming scheme for inferred alleles

zacand...@gmail.com

Oct 24, 2018, 11:28:47 AM

to partis

Hi,

I'm trying to understand how the inferred alleles are named. For instance, from my sample, partis gives the following V genes among others:

(1) IGHV1-1*01
(2) IGHV1-AAU*01
(3) IGHV1-ABW*01+A102C
(4) IGHV4-12037*01+C117T.G163C.C170T.A175G

(1) follows the standard IMGT naming scheme.

For (2), what does the AAU refer to or mean and how is (2) different from (1)?

For (3), what does the ABW and +A102C mean and how is (3) different from (2)?

For (4), what does the 12037 refer to and what does +C117T.G163C.C170T.A175G mean and how is it different from IGHV4-1*01?

Thanks,

Zach Montague

Duncan Ralph

Oct 26, 2018, 2:44:49 PM

to par...@googlegroups.com

Oops, I missed hitting reply all. I may as well add: I just learned yesterday that the macaque germline set in igblast includes a bunch of non-IMGT genes curated by... I forget who, but it sounded like it would be a good supplement/alternative to the genes from the paper I mention below.

---------- Forwarded message ---------
From: Duncan Ralph <dkr...@gmail.com>
Date: Wed, Oct 24, 2018 at 11:40 AM
Subject: Re: Understanding the naming scheme for inferred alleles
To: <zacand...@gmail.com>

So, what we try to do when choosing the name for a new gene is start from the gene from which we inferred it (the "template" gene) and append a C117T-style token for each mutation. This is pretty standard, I think? C117T just means that the base at position 117 (0-indexed) is a T, instead of the C that occurs in the template gene. So your example (4) above has four mutations from its template.
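To make the suffix convention concrete, here is a minimal Python sketch of how a name like the ones above could be split into its template and its point mutations, and how those mutations could be applied to a template sequence. This is not code from partis: the function names `parse_inferred_name` and `apply_mutations` are hypothetical, and the sketch assumes that `+` separates the template from the mutations and `.` separates the mutations from each other, as in the examples in this thread.

```python
import re

# One mutation token: original base, 0-indexed position, new base (e.g. C117T).
MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_inferred_name(name):
    """Split an inferred allele name into (template name, list of mutations).

    Hypothetical helper: assumes '+' separates the template from the
    mutation suffix and '.' separates individual mutations.
    """
    template, _, suffix = name.partition("+")
    mutations = []
    if suffix:
        for token in suffix.split("."):
            match = MUTATION_RE.match(token)
            if match is None:
                raise ValueError(f"unrecognized mutation token: {token!r}")
            original, position, new = match.groups()
            mutations.append((original, int(position), new))
    return template, mutations


def apply_mutations(template_seq, mutations):
    """Return the template sequence with each point mutation applied."""
    bases = list(template_seq)
    for original, position, new in mutations:
        # Positions are 0-indexed, so they can be used directly as indices.
        if bases[position] != original:
            raise ValueError(
                f"expected {original} at position {position}, found {bases[position]}"
            )
        bases[position] = new
    return "".join(bases)


template, mutations = parse_inferred_name("IGHV4-12037*01+C117T.G163C.C170T.A175G")
print(template, mutations)
```

A name without a suffix, such as IGHV1-1*01, simply comes back with an empty mutation list.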

Where it gets hard is when you have either 1) insertions or deletions from the template or 2) so many point mutations that the name gets unwieldy. For these cases we create a new "sub-version" (the stuff after the dash but before the asterisk; everyone calls it something different, but that's what I'll call it in this paragraph). This new after-the-dash string is supposed to be a [truncated] hash of the new sequence, e.g. '12037'. This looks like what happened in your (4) above: a distant gene was inferred starting from an IGHV4 gene, so it created the new IGHV4-12037*01. Then, in a subsequent run, it inferred a new allele that was separated by four point mutations from this new gene.

As an aside, if you're getting multiple runs with a current version of partis, that means it's running clustering-style inference first, and then running fitting-style inference after. That is only turned on by default for non-humans, so I assume you've got non-human samples. That would also explain why you've got a ton of new alleles, and a concomitantly large amount of uncertainty in what the germline set is. So it goes ;-)
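As a rough illustration of the hashed sub-version idea, here is a small Python sketch. The thread does not say which hash function partis uses or how it truncates the result, so everything here is an assumption: SHA-256 from the standard library, reduced to five decimal digits, with hypothetical function names `hashed_sub_version` and `name_for_distant_gene`. It will not reproduce the actual names partis generates.

```python
import hashlib


def hashed_sub_version(sequence, n_digits=5):
    """Derive a short, deterministic numeric label from a sequence.

    Assumption for illustration only: SHA-256 reduced modulo 10**n_digits.
    This is not the hash that partis actually uses.
    """
    digest = hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()
    return str(int(digest, 16) % 10**n_digits).zfill(n_digits)


def name_for_distant_gene(template_name, new_sequence):
    """Build a new gene name that keeps the template's part before the dash."""
    # "IGHV4-1*01" -> "IGHV4"; the sub-version is replaced by the hash.
    family = template_name.split("-", 1)[0]
    return f"{family}-{hashed_sub_version(new_sequence)}*01"


# Toy sequence, not a real germline gene.
print(name_for_distant_gene("IGHV4-1*01", "CAGGTGCAGCTGCAGGAGTCGGGC"))
```

The point of hashing the sequence is that the same inferred sequence gets the same label on every run, with no central registry of names. The cost of truncating is that two different sequences can occasionally collide on the same label.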

I don't think the three-capital-letters sub-version thing is from partis, so I wonder if maybe it was in the input germline set? They match gene names in the germline set I have here from the Ramesh paper. But I don't think I have those in partis anywhere yet? So you probably passed them in by hand? I do recommend using them, since the IMGT macaque germline is super incomplete, and the Ramesh genes fill it in a *lot*. I just haven't gotten around to adding them to the default germline sets. And if you did add them by hand, you might find it useful to know that the reason I didn't finish that yet is that they seemed to have a bunch of the genes mislabelled as to functionality in the place linked from the paper (SRA, I think it was?). I was mailing back and forth for a bit, but they didn't seem to plan to fix it, so I still need to write some code to guess the right functionality info.

All that said! I strongly advise against thinking about the names as anything more than a rough guide. Everybody uses different inferred allele naming schemes, at least for the probably-intractable indel/lots-of-SNPs cases, and even worse, the "standard" IMGT names change in both directions: the name that goes with a given sequence can change, and so can the sequence that goes with a given name. So, within a given controlled environment (e.g. running a study on one machine with one version of software), sure, look at the name for an idea of what's going on. But for any kind of careful comparison you really have to use the sequences themselves.
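A minimal sketch of what comparing by sequence rather than by name could look like, assuming each germline set has already been loaded into a plain dict mapping gene name to nucleotide sequence (for instance from a fasta file). The function names and the toy sequences below are hypothetical and are not taken from partis or from any real germline set.

```python
from collections import defaultdict


def names_by_sequence(germline):
    """Group gene names by (upper-cased) sequence."""
    grouped = defaultdict(set)
    for name, seq in germline.items():
        grouped[seq.upper()].add(name)
    return grouped


def compare_germline_sets(set_a, set_b):
    """Compare two {name: sequence} dicts by sequence, not by name.

    Returns three things:
      shared  - sequences present in both sets, with the names used in each
      renamed - the subset of shared whose names differ between the sets
      reused  - names present in both sets but attached to different sequences
    """
    a, b = names_by_sequence(set_a), names_by_sequence(set_b)
    shared = {seq: (a[seq], b[seq]) for seq in a.keys() & b.keys()}
    renamed = {seq: pair for seq, pair in shared.items() if pair[0] != pair[1]}
    reused = {
        name
        for name in set_a.keys() & set_b.keys()
        if set_a[name].upper() != set_b[name].upper()
    }
    return shared, renamed, reused


# Toy data: short made-up sequences standing in for real germline genes.
run_a = {"IGHV1-1*01": "ACGT", "IGHV1-AAU*01": "ACGA"}
run_b = {"IGHV1-1*01": "ACGG", "IGHV1-X*01": "ACGA"}
shared, renamed, reused = compare_germline_sets(run_a, run_b)
print(renamed)
print(reused)
```

The two failure modes in the paragraph above map onto `renamed` (same sequence, different name) and `reused` (same name, different sequence). An exact-match comparison like this is deliberately strict: sequences that differ only in length or by a single base are treated as different.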
