PEFF parsing

brett...@gmail.com

Aug 5, 2018, 6:36:05 PM

to Pyteomics

G'day team,

I was just wondering if there are any plans to add PEFF parsing to the already awesome library? It seems like a good addition to the FASTA parsing section.

It might be something we can help with if that's useful.

Cheers,
Brett

Lev Levitsky

Aug 6, 2018, 2:24:07 PM

to pyteomics, brett...@gmail.com

Hi Brett,

Thank you for writing and indicating your interest. Adding PEFF has come up, and it shouldn't be too hard to add to the existing parsers.

The current priority is implementing some structural changes to all parsers in preparation for the next release, which I want to finish in September. I will look more closely into it and try to get something done on PEFF before the release.

If you have any thoughts or suggestions, you are most welcome to share them here or in a personal email to me.

Best regards,

Lev

--

---
You received this message because you are subscribed to the Google Groups "Pyteomics" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyteomics+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Lev Levitsky
Institute for Energy Problems of Chemical Physics RAS
Laboratory of Physical and Chemical Methods for Structure Analysis
Leninsky pr. 38, bld. 2 119334 Moscow Russia
tel: +7 499 1378257 fax: +7 499 1378257, +7 499 1378258

Joshua Klein

Aug 6, 2018, 4:23:53 PM

to pyteomics, brett...@gmail.com

Since this came up again, I thought I’d post my current PEFF Defline parser:

import re
from collections import OrderedDict

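class PEFFDeflineParser(object):
    """Parse a PEFF defline into an ordered mapping of loosely typed values."""
    # \Key=value pairs; a value runs until whitespace followed by the next "\" or the end of the line.
    # The key excludes "=" so that a value containing "=" cannot leak into the key.
    kv_pattern = re.compile(r"\\(?P<key>[^\s=]+)=(?P<value>.+?)(?:\s(?=\\)|$)")
    # A PEFF defline starts with "prefix:identifier", optionally preceded by ">".
    detect_pattern = re.compile(r"^>?\S+:\S+")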

    def __init__(self, validate=True):
        self.validate = validate

    def extract_parenthesis_list(self, text):
        """Split "(a)(b(c))" into ["a", "b(c)"], keeping nested parentheses inside each chunk."""
        chunks = []
        chunk = []
        paren_level = 0
        i = 0
        n = len(text)
        while i < n:
            c = text[i]
            i += 1
            if c == "(":
                if paren_level > 0:
                    chunk.append(c)
                paren_level += 1
            elif c == ")":
                if paren_level > 1:
                    chunk.append(c)
                paren_level -= 1
                if paren_level == 0:
                    if chunk:
                        chunks.append(chunk)
                    chunk = []
            else:
                chunk.append(c)
        chunks = list(map(''.join, chunks))
        return chunks

    def split_pipe_separated_tuple(self, text):
        parts = text.split("|")
        return parts

    def coerce_types(self, key, value):
        if "|" in value:
            value = self.split_pipe_separated_tuple(value)
            result = []
            for i, v in enumerate(value):
                result.append(self._coerce_value(key, v, i))
            return tuple(result)
        else:
            return self._coerce_value(key, value, 0)

    def _coerce_value(self, key, value, index):
        try:
            return int(value)
        except ValueError:
            pass
        try:
            return float(value)
        except ValueError:
            pass
        return str(value)

    def parse(self, line):
        if self.validate:
            match = self.detect_pattern.match(line)
            if not match:
                raise ValueError(
                    "Failed to parse {!r} using {!r}".format(
                        line, self))
        storage = OrderedDict()
        prefix = None
        db_uid = None
        if line.startswith(">"):
            line = line[1:]
        prefix, line = line.split(":", 1)
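        # partition() rather than split(): a defline may hold only the identifier, with no fields after it
        db_uid, _, line = line.partition(" ")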
        storage['Prefix'] = prefix
        storage['DbUniqueId'] = db_uid
        for key, value in self.kv_pattern.findall(line):
            if not value.startswith("(") or " (" in value:
                storage[key] = self.coerce_types(key, value)
            else:
                # multi-value
                storage[key] = [self.coerce_types(key, v) for v in self.extract_parenthesis_list(value)]
        return storage

There are still a few issues with it, the largest being the lack of support for the index-labeled features that specify proteoform variants. The other large issue is the lack of proper feature structure unpacking. PEFF seems to expect you to have parsed psi-ms.obo in order to look up the data types for some features; instead, I settle for making tuples of loosely typed values.
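
To make "tuples of loosely typed values" concrete, here is a minimal, self-contained sketch of the same approach applied to one defline. It is not the parser above: the helper names (`loose_coerce`, `split_groups`) and the sample defline are made up for illustration, and no psi-ms.obo lookup is attempted.

```python
import re

# Same idea as the parser above: "\Key=value" pairs separated by whitespace.
KV_PATTERN = re.compile(r"\\(?P<key>[^\s=]+)=(?P<value>.+?)(?:\s(?=\\)|$)")


def loose_coerce(text):
    """Try int, then float, and fall back to the original string."""
    for caster in (int, float):
        try:
            return caster(text)
        except ValueError:
            pass
    return text


def split_groups(value):
    """Split '(a|b)(c|d)' into ['a|b', 'c|d'] using top-level parentheses only."""
    groups, depth, start = [], 0, 0
    for i, ch in enumerate(value):
        if ch == "(":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                groups.append(value[start:i])
    return groups


# Hypothetical defline, invented for this example.
defline = (
    r">nxp:NX_P00001 \Length=12 \PName=Example protein "
    r"\ModResPsi=(3|MOD:00046|O-phospho-L-serine)(7|MOD:00047|O-phospho-L-threonine)"
)

fields = {}
# Drop the ">prefix:identifier" part, then walk the key-value pairs.
for key, value in KV_PATTERN.findall(defline.split(" ", 1)[1]):
    if value.startswith("("):
        # Multi-valued feature: one tuple per parenthesised group.
        fields[key] = [
            tuple(loose_coerce(part) for part in group.split("|"))
            for group in split_groups(value)
        ]
    else:
        fields[key] = loose_coerce(value)
```

Without the ontology there is no way to know that the first element of a `ModResPsi` group is a position, so the types come purely from what each string happens to look like.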

Joshua Klein

Sep 8, 2018, 6:07:40 PM

to pyteomics, Brett Tully

Just to follow up, we’ve put a draft version of a peff module into the 4.0 branch that builds on the new random-access FASTA mechanism. The PEFF parser can unpack and type the PEFF feature fields, and provides a mapping-like interface similar to the defline parsers in the fasta module. It can also read the header block to extract database metadata, but at the moment no additional special processing is done. I’m waiting for discussion from the specification authors before attempting to do more with feature data structures, and we haven’t decided whether to implement a first-class object for iterating over proteoforms.

Any feedback would be useful.
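
For readers unfamiliar with the header block mentioned above, the following is a rough standard-library sketch of collecting its metadata. It is not the peff module's API. It assumes that header lines start with `#`, carry `key=value` pairs, and that a `# //` line closes each block; the function name `read_peff_header` and the sample lines are made up for illustration.

```python
def read_peff_header(lines):
    """Collect '# key=value' header lines into one dict per header block.

    Assumption: header lines start with '#', and a '# //' line closes a block.
    """
    blocks, current = [], {}
    for raw in lines:
        line = raw.strip()
        if not line.startswith("#"):
            break  # first non-header line: sequence entries begin here
        body = line[1:].strip()
        if body == "//":
            if current:
                blocks.append(current)
            current = {}
        elif "=" in body:
            key, _, value = body.partition("=")
            current[key.strip()] = value.strip()
        # Anything else (such as a format version line) is ignored in this sketch.
    if current:
        blocks.append(current)
    return blocks


# Hypothetical file contents, invented for this example.
sample = [
    "# PEFF 1.0",
    "# //",
    "# DbName=ExampleDB",
    "# Prefix=ex",
    "# NumberOfEntries=2",
    "# //",
    ">ex:P1 \\Length=4",
    "PEPT",
]
blocks = read_peff_header(sample)
```

The values are left as strings here, which matches the "no additional special processing" stage described in the post.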
