pydna read method not parsing Genbank header with long name: how to modify code?

har...@gmail.com

unread,

Mar 20, 2020, 10:29:59 PM3/20/20

to pydna

Hi all,

I was wondering how to change the default behavior in pyDNA to handle long name fields in Genbank files.

When the Genbank name is non-standard, pyDNA read function fails to get set the circular property correctly.

I want to use pyDNA to work on a large library of non-standard genbank files and need to modify pydna parser to support them correctly.

Here is my test

from pydna.readers import read
myseq = read("T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO-mungedheader.gb",ds=True)
myseq_long = read("T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO.gb",ds=True)

print(myseq.circular)
print(myseq_long.circular)

Gives:

True
False

These files are identical other than for their different LOCUS lines.

# This gets parsed correctly and shows up as circular

T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO-mungedheader.gb

LOCUS T5pr-6xHis 9838 bp ds-DNA circular UNA 03-JAN-2019

#This does not get parsed correctly as circular

T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO.gb

LOCUS T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO 9838 bp DNA circular UNA 03-JAN-2019

Both of their names get parsed correctly inside Biopython ( see below). 

I was wondering how I can change the pyDNA code to handle them correctly so as to not have to roundtrip the sequence through Biopython and then into pyDNA.

I did see some code which uses pyparsing to handle the parsing , but was hoping for some pointers on where to make changes.

Really like the pyDNA package and would love to contribute my change back if successful.

Thanks

Hari

Biopython handling of the files:

from Bio import SeqIO
files = ["T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO-mungedheader.gb","T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO.gb"]
for _file in files:
    seqr = SeqIO.parse(_file,format="genbank")
    for _rec in seqr:
        print(_rec.name)
T5pr-6xHis
T5pr-6xHis-MBP-HRV_3C-fzq1-spprot-2xHJHJNLS-Link8-MIND10SO 

har...@gmail.com

unread,

Apr 5, 2020, 1:24:05 PM4/5/20

to pydna

Hi all

I managed to fix this behavior by two minor changes to pyDNA in the genbankfixer.py and parsers.py.

1) The first was a modification to have the toGB function to handle my case where the input name is longer than 24 characters--I added a space between the name and size.

(base) hari@Haris-MacBook-Pro:~/pydna/tests/broken_genbank_files (master)$ git diff ../../pydna/genbankfixer.py
diff --git a/pydna/genbankfixer.py b/pydna/genbankfixer.py
index 0d4943d..285f816 100644
--- a/pydna/genbankfixer.py
+++ b/pydna/genbankfixer.py
@@ -363,7 +363,7 @@ def toGB(jseq):
     divcode  = jseq["divcode"] or "   "
     date = jseq["date"] or "19-MAR-1970"


-    locusstr = "LOCUS       {name:<24}{size:>4} bp {prefix}{seqtype:<4}    {topology:<8} {divcode} {date}\n".format(name=name,
+    locusstr = "LOCUS       {name:<24} {size:>4} bp {prefix}{seqtype:<4}    {topology:<8} {divcode} {date}\n".format(name=name,
                                                                                                                     size=size,
                                                                                                                     prefix=prefix,
                                                                                                                     seqtype=seqtype,

2) The second was to read the DNA type from topology annotation which also seems to be in the Genbank spec.

(base) hari@Haris-MacBook-Pro:~/pydna/tests/broken_genbank_files (master)$ git diff ../../pydna/parsers.py
diff --git a/pydna/parsers.py b/pydna/parsers.py
index 44e811c..a851d30 100644
--- a/pydna/parsers.py
+++ b/pydna/parsers.py
@@ -86,7 +86,7 @@ def parse(data, ds = True):
                     handle.seek(0)
                     parser = _RecordParser()
                     residue_type = parser.parse(handle).residue_type
-                    if "circular" in residue_type:
+                    if "circular" in residue_type or parsed.annotations["topology"]=="circular" :
                         circular = True
                     else:
                         circular = False

With these my broken genbank files with extra long non-standard LOCUS names are parsed and correctly recognized as dseq.

Thanks a tonne for this powerful library

Hari

har...@gmail.com

unread,

Apr 5, 2020, 3:54:50 PM4/5/20

to pydna

Created a pull request with the changes:https://github.com/BjornFJohansson/pydna/pulls

This was my first time doing this so hope I did it correctly.

Hari

On Friday, March 20, 2020 at 7:29:59 PM UTC-7, har...@gmail.com wrote:

Reply all

Reply to author

Forward