Hi, with the help of a collaborator, we found that the CDR3 returned in "cdr3_seqs" is sometimes missing part of the actual CDR3. Here is an example mouse heavy chain BCR. It looks like it happens when there are insertions, and definitely when the insertion is in the CDR3. See the example below:
Input FASTA sequence:
GAGGTTCACCTGCAGCAGTCTGGGGCTGAGCTTGTGAGGCCAGGGGCCTCAGTCAAGTTGTCCTGCACAGCTTCTGGCTTTAACATTAAAGACGACTATATGCACTGGGTGAAACAGAGGCCTGAACAGGGCCTGGAGTGGATTGGATGGATTGATCCTGAGAATGATTATACTGAATATGCCTCGAAGTTCCAGGGCAAGGCCACTTTAACAGCAGACACATCCTCCAACACAGCCTACCTGCAGCTCAGCAGCCTGACATCTGAGGACACTGCCGTCTATTACTGTATAATTTATTACTACGGTAGTAGCGGGGTGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCA
Amino acid sequence:
EVHLQQSGAELVRPGASVKLSCTASGFNIKDDYMHWVKQRPEQGLEWIGWIDPENDYTEYASKFQGKATLTADTSSNTAYLQLSSLTSEDTAVYYCIIYYYGSSGVDYWGQGTSVTVSS
This CDR3 should be 14 * 3 = 42 nt (14 amino acids, CIIYYYGSSGVDYW). However, partis returns:
"cdr3_length": 33,
"cdr3_seqs": ["TGTATAATTTATTACTACGGTATGGACTACTGG"]
The bracketed part of the true CDR3 is missing: TGTATAATTTATTACTACGGTA[GTAGCGGGG]TGGACTACTGG. These 9 nt account for the difference between 42 and 33. If you look at the gapped germline sequence ("gl_gap_seqs"), it looks like an insertion has been inferred in the J-encoded part of the CDR3, and that insertion has been left out of the CDR3 that is returned. Note that the returned CDR3 matches the corresponding stretch of "indel_reversed_seqs" rather than the input sequence.
Below is the full output from running partis annotation on this single sequence as heavy chain with a black6-only mouse germline; it shouldn't be very different with the default mouse germline.
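To make the comparison reproducible, here is a minimal Python sketch that locates the segment missing from the returned CDR3. It only does string comparison on the two sequences quoted above; the function name `find_missing_segment` is made up for this illustration and is not part of partis.

```python
def find_missing_segment(full, partial):
    """Return (start, segment) such that removing `segment` from `full`
    at index `start` yields `partial`. Hypothetical helper, not a partis function."""
    if len(partial) > len(full):
        raise ValueError("partial must not be longer than full")
    # Length of the longest common prefix.
    prefix = 0
    while prefix < len(partial) and full[prefix] == partial[prefix]:
        prefix += 1
    # Length of the longest common suffix that does not overlap the prefix.
    suffix = 0
    while (suffix < len(partial) - prefix
           and full[len(full) - 1 - suffix] == partial[len(partial) - 1 - suffix]):
        suffix += 1
    if prefix + suffix != len(partial):
        raise ValueError("partial is not full minus one contiguous segment")
    return prefix, full[prefix:len(full) - suffix]


# True CDR3 (from the input sequence) and the CDR3 reported in "cdr3_seqs".
true_cdr3 = "TGTATAATTTATTACTACGGTAGTAGCGGGGTGGACTACTGG"
returned_cdr3 = "TGTATAATTTATTACTACGGTATGGACTACTGG"

start, missing = find_missing_segment(true_cdr3, returned_cdr3)
print(len(true_cdr3), len(returned_cdr3))  # 42 33
print(start, missing)                      # 22 GTAGCGGGG
```

The 9 nt it reports have the same length as the run of gap characters in "gl_gap_seqs", which is what suggests the inferred insertion is being dropped.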
{"version-info": {"partis-git": {"commit": "99205b5da0e13d0743b4bef8cd8174ec113d8690", "n_ahead_of_tag": "1215", "tag": "0.16.0"}, "partis-yaml": 0.1}, "germline-info": {"seqs": {"j": {"IGHJ4*01": "ATTACTATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAG"}, "d": {"IGHD1-1*01": "TTTATTACTACGGTAGTAGCTAC"}, "v": {"IGHV14-4*01": "GAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTTGTGAGGCCAGGGGCCTCAGTCAAGTTGTCCTGCACAGCTTCTGGCTTTAACATTAAAGACGACTATATGCACTGGGTGAAGCAGAGGCCTGAACAGGGCCTGGAGTGGATTGGATGGATTGATCCTGAGAATGGTGATACTGAATATGCCTCGAAGTTCCAGGGCAAGGCCACTATAACAGCAGACACATCCTCCAACACAGCCTACCTGCAGCTCAGCAGCCTGACATCTGAGGACACTGCCGTCTATTACTGTACTACA"}}, "tryp-positions": {"IGHJ4*01": 20}, "cyst-positions": {"IGHV1-69*03": 252, "IGHV1S103*01": 253, "IGHV1-74*02": 252, "IGHV1-74*03": 252, "IGHV1S122*01": 252, "IGHV7-4*03": 291, "IGHV2-6-8*01": 282, "IGHV1-62-3*02": 252, "IGHV1-64*02": 252, "IGHV1S113*01": 252, "IGHV1-18*02": 253, "IGHV1-18*03": 252, "IGHV1-55*03": 252, "IGHV1-55*02": 252, "IGHV1S113*02": 253, "IGHV1-55*04": 252, "IGHV1S111*01": 252, "IGHV1-71*01": 285, "IGHV14-4*01": 285, "IGHV1S118*01": 253, "IGHV5-6-1*01": 285, "IGHV1S120*02": 252, "IGHV12-2-1*01": 288, "IGHV1-62-1*01": 283, "IGHV1S121*01": 252, "IGHV1S20*01": 242, "IGHV13-1*02": 291, "IGHV1S100*01": 253, "IGHV1S108*01": 253, "IGHV1S112*02": 253, "IGHV1-72*05": 252, "IGHV1-72*02": 252, "IGHV1-72*03": 252, "IGHV5-9-5*01": 285, "IGHV1-42*02": 253, "IGHV1S120*01": 252, "IGHV1-53*04": 252, "IGHV1-53*03": 252, "IGHV1-53*02": 252, "IGHV1S107*01": 253, "IGHV1S21*02": 222, "IGHV1S21*01": 230}, "functionalities": {}, "locus": "igh"}, "events": [{"input_seqs": ["GAGGTTCACCTGCAGCAGTCTGGGGCTGAGCTTGTGAGGCCAGGGGCCTCAGTCAAGTTGTCCTGCACAGCTTCTGGCTTTAACATTAAAGACGACTATATGCACTGGGTGAAACAGAGGCCTGAACAGGGCCTGGAGTGGATTGGATGGATTGATCCTGAGAATGATTATACTGAATATGCCTCGAAGTTCCAGGGCAAGGCCACTTTAACAGCAGACACATCCTCCAACACAGCCTACCTGCAGCTCAGCAGCCTGACATCTGAGGACACTGCCGTCTATTACTGTATAATTTATTACTACGGTAGTAGCGGGGTGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAN"], "d_5p_del": 0, "mut_freqs": 
[0.020114942528735632], "duplicates": [[]], "vd_insertion": "TAA", "has_shm_indels": [true], "stops": [false], "d_3p_del": 20, "j_gene": "IGHJ4*01", "v_5p_del": 0, "codon_positions": {"j": 315, "v": 285}, "naive_seq": "GAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTTGTGAGGCCAGGGGCCTCAGTCAAGTTGTCCTGCACAGCTTCTGGCTTTAACATTAAAGACGACTATATGCACTGGGTGAAGCAGAGGCCTGAACAGGGCCTGGAGTGGATTGGATGGATTGATCCTGAGAATGGTGATACTGAATATGCCTCGAAGTTCCAGGGCAAGGCCACTATAACAGCAGACACATCCTCCAACACAGCCTACCTGCAGCTCAGCAGCCTGACATCTGAGGACACTGCCGTCTATTACTGTATAATTTATTACTATGCTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAG", "cdr3_length": 33, "dj_insertion": "", "j_5p_del": 0, "invalid": false, "cdr3_seqs": ["TGTATAATTTATTACTACGGTATGGACTACTGG"], "qr_gap_seqs": ["GAGGTTCACCTGCAGCAGTCTGGGGCTGAGCTTGTGAGGCCAGGGGCCTCAGTCAAGTTGTCCTGCACAGCTTCTGGCTTTAACATTAAAGACGACTATATGCACTGGGTGAAACAGAGGCCTGAACAGGGCCTGGAGTGGATTGGATGGATTGATCCTGAGAATGATTATACTGAATATGCCTCGAAGTTCCAGGGCAAGGCCACTTTAACAGCAGACACATCCTCCAACACAGCCTACCTGCAGCTCAGCAGCCTGACATCTGAGGACACTGCCGTCTATTACTGTATAATTTATTACTACGGTAGTAGCGGGGTGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAN"], "in_frames": [true], "n_mutations": [7], "fv_insertion": "", "mutated_invariants": [false], "j_3p_del": 0, "v_gene": "IGHV14-4*01", "indel_reversed_seqs": ["GAGGTTCACCTGCAGCAGTCTGGGGCTGAGCTTGTGAGGCCAGGGGCCTCAGTCAAGTTGTCCTGCACAGCTTCTGGCTTTAACATTAAAGACGACTATATGCACTGGGTGAAACAGAGGCCTGAACAGGGCCTGGAGTGGATTGGATGGATTGATCCTGAGAATGATTATACTGAATATGCCTCGAAGTTCCAGGGCAAGGCCACTTTAACAGCAGACACATCCTCCAACACAGCCTACCTGCAGCTCAGCAGCCTGACATCTGAGGACACTGCCGTCTATTACTGTATAATTTATTACTACGGTATGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAN"], "unique_ids": ["RBS3_d12_HAmi_P22B03"], "v_3p_del": 5, "d_per_gene_support": {"IGHD1-1*01": 1.0}, "gl_gap_seqs": 
["GAGGTTCAGCTGCAGCAGTCTGGGGCTGAGCTTGTGAGGCCAGGGGCCTCAGTCAAGTTGTCCTGCACAGCTTCTGGCTTTAACATTAAAGACGACTATATGCACTGGGTGAAGCAGAGGCCTGAACAGGGCCTGGAGTGGATTGGATGGATTGATCCTGAGAATGGTGATACTGAATATGCCTCGAAGTTCCAGGGCAAGGCCACTATAACAGCAGACACATCCTCCAACACAGCCTACCTGCAGCTCAGCAGCCTGACATCTGAGGACACTGCCGTCTATTACTGTATAATTTATTACTATGCTA.........TGGACTACTGGGGTCAAGGAACCTCAGTCACCGTCTCCTCAG"], "v_per_gene_support": {"IGHV14-4*01": 1.0}, "lengths": {"j": 54, "d": 3, "v": 289}, "j_per_gene_support": {"IGHJ4*01": 1.0}, "jf_insertion": "", "d_gene": "IGHD1-1*01", "regional_bounds": {"j": [295, 349], "d": [292, 295], "v": [0, 289]}}], "partitions": [{"n_procs": 1, "logprob": 0.0, "n_clusters": 1, "partition": [["RBS3_d12_HAmi_P22B03"]]}]}
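In case it is useful as a workaround, here is a rough Python sketch of how the full CDR3 could be recovered from the gapped sequences in the output above. It assumes (this is my reading of the output, not something I have confirmed in the partis code) that "codon_positions" index into the indel-reversed sequence, i.e. the same coordinates as "gl_gap_seqs" with the "." characters removed, and that "." marks gap columns. The function `cdr3_from_gapped` is hypothetical, and to keep it short the example uses only the fragment of each gapped sequence starting at the cysteine (position 285), so the codon positions are shifted to 0 and 315 - 285 = 30.

```python
def cdr3_from_gapped(qr_gap, gl_gap, cys_pos, trp_pos, gap_chars="."):
    """Extract the CDR3 from the gapped query sequence.

    cys_pos / trp_pos are assumed to be the first bases of the conserved
    codons in ungapped germline (indel-reversed) coordinates.
    Hypothetical helper, not part of partis.
    """
    if len(qr_gap) != len(gl_gap):
        raise ValueError("gapped sequences must have the same length")
    # Alignment column of every ungapped germline position.
    col_of = [col for col, base in enumerate(gl_gap) if base not in gap_chars]
    start = col_of[cys_pos]
    end = col_of[trp_pos + 2] + 1  # include the third base of the Trp codon
    # Keep inserted bases (columns where the germline has a gap), drop query gaps.
    return "".join(b for b in qr_gap[start:end] if b not in gap_chars)


# Fragments of "qr_gap_seqs" and "gl_gap_seqs" from position 285 onwards.
qr_gap = "TGTATAATTTATTACTACGGTAGTAGCGGGGTGGACTACTGGGGTCAAGG"
gl_gap = "TGTATAATTTATTACTATGCTA.........TGGACTACTGGGGTCAAGG"

cdr3 = cdr3_from_gapped(qr_gap, gl_gap, cys_pos=0, trp_pos=30)
print(len(cdr3), cdr3)  # 42 TGTATAATTTATTACTACGGTAGTAGCGGGGTGGACTACTGG
```

With the full sequences from the JSON, the same call would use cys_pos=285 and trp_pos=315, under the same assumptions.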