cleavage products in pyteomics

27 views
Skip to first unread message

Axel Petzold

unread,
Dec 21, 2022, 9:47:10 AM12/21/22
to Pyteomics
How can a list of all possible cleavage products be produces in pyteomics if one:

(a) has the protein sequence in FASTA format in one file 
(b) a list of cleavage sides (by number of amino acid position in (a) in a different file. 

axel

Pyteomics

unread,
Dec 21, 2022, 12:16:19 PM12/21/22
to Pyteomics
Hi Axel,

Pyteomics will help you read the FASTA file, but the cleave function needs the cleavage rule, not numerical positions of cleavage sites. If the positions are already known, though, you only need to slice your sequence string using these indices. What exactly are you having trouble with?
If you need to account for the possibility of miscleavage, then this does become a bit trickier, but we can borrow some inspiration from the cleave function:

In [1]: from collections import deque

In [2]: sequence = '123456789'

In [3]: sites = [3, 7]

In [4]: MC = 1


Let's say we already have the sequence and the cleavage positions. Then we can do this:

In [5]: cl = 1

In [6]: trange = range(MC + 2)

In [7]: current = deque([0], maxlen=MC+2)

In [8]: for end in sites + [None]:
  ...:     if cl < MC + 2:
  ...:         cl += 1
  ...:     current.append(end)
  ...:     for j in trange[:cl - 1]:
  ...:         seq = sequence[current[j]:current[-1]]
  ...:         if seq:
  ...:             print(seq)
  ...:  
123
1234567
4567
456789
89


This is based on the code from Pyteomics, but doesn't use the library itself.
Pyteomics can get you the sequence from a FASTA file, though:

from pyteomics import fasta

with fasta.read('singe_sequence.fasta') as f:
    sequence = next(f).sequence

I hope this helps!

Best regards,
Lev

Axel Petzold

unread,
Dec 21, 2022, 12:45:08 PM12/21/22
to Pyteomics

Dear Lev


thanks for your super rapid response and neat coding.
The first challenge was to come up with all possible combinations of the cleavage products and your code does it.

The reason for using the aminoacid positions were the may repeats in the sequence.
Here is the sequence of the neurofilament protein. It has many intrinsically unstructured domains.


MMSFGGADALLGAPFAPLHGGGSLHYALARKGGAGGTRSAAGSSSGFHSWTRTSVSSVSA
SPSRFRGAGAASSTDSLDTLSNGPEGCMVAVATSRSEKEQLQALNDRFAGYIDKVRQLEA
HNRSLEGEAAALRQQQAGRSAMGELYEREVREMRGAVLRLGAARGQLRLEQEHLLEDIAH
VRQRLDDEARQREEAEAAARALARFAQEAEAARVDLQKKAQALQEECGYLRRHHQEEVGE
LLGQIQGSGAAQAQMQAETRDALKCDVTSALREIRAQLEGHAVQSTLQSEEWFRVRLDRL
SEAAKVNTDAMRSAQEEITEYRRQLQARTTELEALKSTKDSLERQRSELEDRHQADIASY
QEAIQQLDAELRNTKWEMAAQLREYQDLLNVKMALDIEIAAYRKLLEGEECRIGFGPIPF
SLPEGLPKIPSVSTHIKVKSEEKIKVVEKSEKETVIVEEQTEETQVTEEVTEEEEKEAKE
EEGKEEEGGEEEEAEGGEEETKSPPAEEAASPEKEAKSPVKEEAKSPAEAKSPEKEEAKS
PAEVKSPEKAKSPAKEEAKSPPEAKSPEKEEAKSPAEVKSPEKAKSPAKEEAKSPAEAKS
PEKAKSPVKEEAKSPAEAKSPVKEEAKSPAEVKSPEKAKSPTKEEAKSPEKAKSPEKAKS
PEKEEAKSPEKAKSPVKAEAKSPEKAKSPVKAEAKSPEKAKSPVKEEAKSPEKAKSPVKE
EAKSPEKAKSPVKEEAKTPEKAKSPVKEEAKSPEKAKSPEKAKTLDVKSPEAKTPAKEEA
RSPADKFPEKAKSPVKEEVKSPEKAKSPLKEDAKAPEKEIPKKEEVKSPVKEEEKPQEVK
VKEPPKKAEEEKAPATPKTEEKKDSKKEEAPKKEAPKPKVEEKKEPAVEKPKESKVEAKK
EEAEDKKKVPTPEKEAPAKVEVKEDAKPKEKTEVAKKEPDDAKAKEPSKPAEKKEAAPEK
KDTKEEKAKKPEEKPKTEAKAKEDDKTLSKEPSKPKAEKAEKSSSTDQKDSKPPEKATED
KAAKGK

Here are the cleavage sites:
1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 37, 38, 39, 40, 41, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 74, 75, 76, 77, 78, 79, 80, 82, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 184, 185, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 244, 245, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 281, 282, 283, 286, 287, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 338, 339, 340, 341, 342, 343, 344, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 366, 367, 368, 369, 370, 371, 372, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 412, 413, 414, 415, 418, 419, 420, 422, 423, 424, 426, 427, 428, 429, 431, 432, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 489, 490, 491, 492, 493, 494, 495, 497, 498, 499, 500, 501, 502, 505, 506, 507, 508, 509, 510, 512, 513, 514, 515, 516, 517, 519, 520, 521, 522, 523, 524, 525, 527, 528, 529, 530, 531, 533, 534, 535, 536, 537, 538, 539, 541, 542, 543, 544, 545, 547, 548, 549, 550, 551, 553, 554, 555, 556, 557, 558, 559, 562, 563, 564, 565, 567, 568, 569, 570, 571, 572, 573, 575, 576, 577, 578, 579, 581, 582, 583, 584, 585, 587, 588, 589, 590, 591, 592, 593, 595, 596, 597, 598, 599, 601, 602, 603, 604, 605, 607, 608, 609, 610, 611, 612, 613, 615, 616, 617, 618, 619, 621, 622, 623, 624, 625, 626, 627, 629, 630, 631, 632, 633, 635, 636, 637, 638, 639, 642, 643, 644, 645, 646, 647, 649, 650, 651, 652, 653, 655, 656, 657, 658, 659, 661, 662, 663, 664, 665, 666, 667, 669, 670, 671, 672, 673, 675, 676, 677, 678, 679, 680, 681, 683, 684, 685, 686, 687, 689, 690, 691, 692, 693, 694, 695, 697, 698, 699, 700, 701, 703, 704, 705, 706, 707, 708, 709, 711, 712, 713, 714, 715, 717, 718, 719, 720, 721, 722, 723, 725, 726, 727, 728, 729, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 745, 746, 747, 748, 749, 750, 751, 753, 754, 755, 756, 757, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 794, 795, 796, 797, 798, 799, 800, 802, 803, 804, 805, 806, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026

This gives n=424,580 combinations (if I am not mistaken).

Best wishes
axel

Axel Petzold

unread,
Dec 22, 2022, 8:31:48 AM12/22/22
to Pyteomics
OK here is how it works. MC in your code, Lev, needs to be adjusted to the total number of cleavage sites, doesn't it.
For NfH there are 920. The code here below worked for me.

Thank you for putting me on the right track.

print('Cleavage for NfH at all sites gives these peptides: ')
from collections import deque
sequence = 'MMSFGGADALLGAPFAPLHGGGSLHYALARKGGAGGTRSAAGSSSGFHSWTRTSVSSVSASPSRFRGAGAASSTDSLDTLSNGPEGCMVAVATSRSEKEQLQALNDRFAGYIDKVRQLEAHNRSLEGEAAALRQQQAGRSAMGELYEREVREMRGAVLRLGAARGQLRLEQEHLLEDIAHVRQRLDDEARQREEAEAAARALARFAQEAEAARVDLQKKAQALQEECGYLRRHHQEEVGELLGQIQGSGAAQAQMQAETRDALKCDVTSALREIRAQLEGHAVQSTLQSEEWFRVRLDRLSEAAKVNTDAMRSAQEEITEYRRQLQARTTELEALKSTKDSLERQRSELEDRHQADIASYQEAIQQLDAELRNTKWEMAAQLREYQDLLNVKMALDIEIAAYRKLLEGEECRIGFGPIPFSLPEGLPKIPSVSTHIKVKSEEKIKVVEKSEKETVIVEEQTEETQVTEEVTEEEEKEAKEEEGKEEEGGEEEEAEGGEEETKSPPAEEAASPEKEAKSPVKEEAKSPAEAKSPEKEEAKSPAEVKSPEKAKSPAKEEAKSPPEAKSPEKEEAKSPAEVKSPEKAKSPAKEEAKSPAEAKSPEKAKSPVKEEAKSPAEAKSPVKEEAKSPAEVKSPEKAKSPTKEEAKSPEKAKSPEKAKSPEKEEAKSPEKAKSPVKAEAKSPEKAKSPVKAEAKSPEKAKSPVKEEAKSPEKAKSPVKEEAKSPEKAKSPVKEEAKTPEKAKSPVKEEAKSPEKAKSPEKAKTLDVKSPEAKTPAKEEARSPADKFPEKAKSPVKEEVKSPEKAKSPLKEDAKAPEKEIPKKEEVKSPVKEEEKPQEVKVKEPPKKAEEEKAPATPKTEEKKDSKKEEAPKKEAPKPKVEEKKEPAVEKPKESKVEAKKEEAEDKKKVPTPEKEAPAKVEVKEDAKPKEKTEVAKKEPDDAKAKEPSKPAEKKEAAPEKKDTKEEKAKKPEEKPKTEAKAKEDDKTLSKEPSKPKAEKAEKSSSTDQKDSKPPEKATEDKAAKGK'
sites = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 34, 37, 38, 39, 40, 41, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 64, 65, 66, 67, 68, 69, 70, 71, 74, 75, 76, 77, 78, 79, 80, 82, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 184, 185, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 226, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 244, 245, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 281, 282, 283, 286, 287, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 338, 339, 340, 341, 342, 343, 344, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 366, 367, 368, 369, 370, 371, 372, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 412, 413, 414, 415, 418, 419, 420, 422, 423, 424, 426, 427, 428, 429, 431, 432, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 489, 490, 491, 492, 493, 494, 495, 497, 498, 499, 500, 501, 502, 505, 506, 507, 508, 509, 510, 512, 513, 514, 515, 516, 517, 519, 520, 521, 522, 523, 524, 525, 527, 528, 529, 530, 531, 533, 534, 535, 536, 537, 538, 539, 541, 542, 543, 544, 545, 547, 548, 549, 550, 551, 553, 554, 555, 556, 557, 558, 559, 562, 563, 564, 565, 567, 568, 569, 570, 571, 572, 573, 575, 576, 577, 578, 579, 581, 582, 583, 584, 585, 587, 588, 589, 590, 591, 592, 593, 595, 596, 597, 598, 599, 601, 602, 603, 604, 605, 607, 608, 609, 610, 611, 612, 613, 615, 616, 617, 618, 619, 621, 622, 623, 624, 625, 626, 627, 629, 630, 631, 632, 633, 635, 636, 637, 638, 639, 642, 643, 644, 645, 646, 647, 649, 650, 651, 652, 653, 655, 656, 657, 658, 659, 661, 662, 663, 664, 665, 666, 667, 669, 670, 671, 672, 673, 675, 676, 677, 678, 679, 680, 681, 683, 684, 685, 686, 687, 689, 690, 691, 692, 693, 694, 695, 697, 698, 699, 700, 701, 703, 704, 705, 706, 707, 708, 709, 711, 712, 713, 714, 715, 717, 718, 719, 720, 721, 722, 723, 725, 726, 727, 728, 729, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740, 741, 742, 743, 745, 746, 747, 748, 749, 750, 751, 753, 754, 755, 756, 757, 759, 760, 761, 762, 763, 764, 765, 766, 767, 768, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779, 780, 781, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792, 794, 795, 796, 797, 798, 799, 800, 802, 803, 804, 805, 806, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818, 819, 820, 821, 822, 823, 824, 825, 826, 827, 829, 830, 831, 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857, 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870, 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883, 884, 885, 886, 887, 888, 889, 890, 891, 892, 893, 894, 895, 896, 897, 898, 899, 900, 901, 902, 903, 904, 905, 906, 907, 908, 909, 911, 912, 913, 914, 915, 916, 917, 918, 919, 920, 921, 922, 923, 924, 925, 926, 927, 928, 929, 930, 931, 932, 933, 934, 935, 936, 937, 938, 939, 940, 941, 942, 943, 944, 945, 946, 948, 949, 950, 951, 952, 953, 954, 955, 956, 957, 958, 959, 960, 961, 962, 963, 964, 965, 966, 967, 968, 969, 970, 971, 972, 973, 974, 975, 976, 977, 978, 979, 980, 981, 982, 983, 984, 985, 986, 987, 988, 989, 990, 991, 993, 994, 995, 996, 997, 998, 999, 1000, 1001, 1002, 1006, 1007, 1008, 1009, 1010, 1011, 1012, 1014, 1015, 1016, 1017, 1018, 1019, 1020, 1021, 1022, 1023, 1024, 1025, 1026]
MC = 920
cl = 1

trange = range(MC + 2)
current = deque([0], maxlen=MC+2)
for end in sites + [None]:
    if cl < MC + 2:
        cl += 1
        current.append(end)

        for j in trange[:cl - 1]:
            seq = sequence[current[j]:current[-1]]
            if seq:
                print(seq)


Lev Levitsky

unread,
Dec 22, 2022, 12:59:32 PM12/22/22
to pyte...@googlegroups.com, axel...@gmail.com
Hi Axel,

The code is intended to filter the combinations by limiting the number of missed cleavage sites allowed. Indeed you can get all combinations by setting MC to the total number of sites. However, if you want all combinations, it can be achieved with much shorter code:

import itertools
for start, end in itertools.product(sites + [None], repeat=2):
    if start is None or end is None or start < end:
        print(sequence[start:end])


--
You received this message because you are subscribed to the Google Groups "Pyteomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyteomics+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pyteomics/768714ff-e6c4-48bc-8884-c16a9e150550n%40googlegroups.com.

Axel Petzold

unread,
Jan 17, 2023, 7:17:25 AM1/17/23
to Pyteomics
Thanks Lev, it really works beautiful and quick. 
The pyteomics tool for weight is useful (and very quick), too. 

For the set of sequences I got I used the following code which just gives the molecular weight (MW). Multiply with 0.99986 if to be converted to Dalton (Da). I did not need m/z and retention time, but that should be straight forward to code if required.

Here is the code:

from pyteomics import fasta, parser, mass

with fasta.read ('unique.fasta') as all_peptides : # this is how I named the output file of the cleavage sequences
    for descr, seq in all_peptides :
        peptide = mass.fast_mass (seq)
        with open ('weight.txt', mode = 'a') as file_object : # this is the name for the weight file 
            print (descr,seq,peptide, file = file_object)

For clarity, a bit of formatting was needed for the FASTA file first. Now it starts like this (only the first 3 cleavage products are shown):
>sp|1NfH
AAAL
>sp|2NfH
AAALR
>sp|3NfH
AAALRQQQ

Hope it helps others, too. Thanks for the great tool Lev !
Reply all
Reply to author
Forward
0 new messages