python grep

Mag Gam

Apr 8, 2010, 7:21:11 AM

to pytho...@python.org

I am in the process of reading a gzipped file which is about 6 GB.

I would like to know if there is something similar to grep in Python,
because I would like to emulate the -A and -B options of GNU grep.

Let's say I have this:

083828.441,AA
093828.441,AA
094028.441,AA
094058.441,CC
094828.441,AA
103828.441,AA
123828.441,AA

If I do grep -A2 -B2 "CC"

I get 2 lines before and 2 lines after the line containing "CC".

Is there an easy way to do this in python?

TIA

Stefan Behnel

Apr 8, 2010, 7:31:47 AM

to pytho...@python.org

Mag Gam, 08.04.2010 13:21:

Sure, just use a sliding window.
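A minimal sketch of what such a sliding window could look like, assuming the input is any iterable of lines. The name grep_context and its parameters are made up for illustration and do not come from this thread:

```python
from collections import deque


def grep_context(lines, ismatch, before=2, after=2):
    """Yield matching lines plus up to `before`/`after` lines of context."""
    window = deque(maxlen=before)  # most recent lines not yet emitted
    remaining_after = 0            # trailing context lines still owed
    for line in lines:
        if ismatch(line):
            # Flush the buffered leading context, then the match itself.
            yield from window
            window.clear()
            yield line
            remaining_after = after
        elif remaining_after > 0:
            yield line
            remaining_after -= 1
        else:
            window.append(line)


sample = [
    "083828.441,AA", "093828.441,AA", "094028.441,AA", "094058.441,CC",
    "094828.441,AA", "103828.441,AA", "123828.441,AA",
]
for line in grep_context(sample, lambda s: "CC" in s):
    print(line)
```

Only `before` lines are held in memory at any time, so the size of the file does not matter for memory use, only for run time.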

However, for a 6 GB file, you won't really like the performance. It's
basically impossible to beat the speed of (f)grep.

I'd use the subprocess module to run zfgrep over the file and parse the
output in Python.
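A rough sketch of that approach, under a few assumptions: the function name iter_context_groups is hypothetical, the zfgrep on your system is assumed to pass -A/-B through to grep, and data.csv.gz is a placeholder file name. GNU grep prints a line containing only "--" between non-adjacent groups of context, which the sketch uses to split the output:

```python
import subprocess


def iter_context_groups(cmd):
    """Run `cmd` and yield lists of output lines, split on "--" separators."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        group = []
        for line in proc.stdout:
            line = line.rstrip("\n")
            if line == "--":
                # Separator between two groups of context lines.
                if group:
                    yield group
                group = []
            else:
                group.append(line)
        if group:
            yield group


# Hypothetical invocation; the file name is an assumption:
# for group in iter_context_groups(["zfgrep", "-A2", "-B2", "CC", "data.csv.gz"]):
#     print(group)
```

The output is read line by line from the pipe, so the full result never has to fit in memory. Note that grep exits with status 1 when nothing matches, so a nonzero exit code is not necessarily an error.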

Stefan

Mag Gam

Apr 8, 2010, 8:21:10 AM

to Stefan Behnel, pytho...@python.org

Oh, that's nice to know!

But I use the csv module with the gzip module. Is it still possible to do
it with subprocess?
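For reference, the csv-on-top-of-gzip setup mentioned here might look roughly like the following. This is only a guess at the arrangement, and iter_gzipped_rows is a name invented for the sketch:

```python
import csv
import gzip


def iter_gzipped_rows(path):
    """Yield CSV rows from a gzip-compressed file, one row at a time."""
    # "rt" decompresses and decodes to text; newline="" is what the csv
    # module expects from the file objects it reads.
    with gzip.open(path, "rt", newline="") as stream:
        for row in csv.reader(stream):
            yield row
```

Rows are produced lazily, so the 6 GB file is never loaded as a whole.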


Stefan Behnel

Apr 8, 2010, 11:40:36 AM

to pytho...@python.org

Mag Gam, 08.04.2010 14:21:

> Oh, that's nice to know!
>
> But I use the CSV module with gzip module. Is it still possible to do
> it with the subprocess?

Depends on what you do with the csv module and how it interacts with the
search above. Giving more detail may allow us to answer your question and
to provide better advice.
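One possible combination, sketched under the assumption that the grep output consists of plain CSV lines: csv.reader accepts any iterable of strings, so lines coming from a grep subprocess can be handed to it directly once the "--" group separators are dropped. The helper name rows_from_grep_output is hypothetical:

```python
import csv


def rows_from_grep_output(lines):
    """Parse grep output lines as CSV rows, skipping "--" group separators."""
    # Caveat: a data line consisting only of "--" would be dropped as well.
    data_lines = (line for line in lines if line.rstrip("\n") != "--")
    return csv.reader(data_lines)


output = ["094028.441,AA\n", "094058.441,CC\n", "--\n", "123828.441,AA\n"]
for row in rows_from_grep_output(output):
    print(row)
```

In practice `lines` could be the stdout pipe of a subprocess.Popen object opened in text mode.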

Stefan

Peter Otten

Apr 9, 2010, 2:55:13 PM

Mag Gam wrote:

from itertools import islice, groupby
from collections import deque

def grep(instream, ismatch, before, after):
    # Yield (state, item) pairs; state is "before", "match" or "after".
    # groupby() splits the stream into alternating runs of matching and
    # non-matching items.
    items_before = None
    for key, group in groupby(instream, ismatch):
        if key:
            if items_before is not None:
                for item in items_before:
                    yield "before", item
            else:
                # Stream started with a match: flag that one has been seen.
                items_before = not None # ;)
            for item in group:
                yield "match", item
        else:
            if items_before is not None:
                for item in islice(group, after):
                    yield "after", item
            # Keep only the last `before` items of the remaining run.
            items_before = deque(group, maxlen=before)

def demo1():
    # Grep this script itself for lines containing "item".
    with open(__file__) as instream:
        for state, (index, line) in grep(enumerate(instream, 1),
                                         ismatch=lambda pair: "item" in pair[1],
                                         before=2, after=2):
            print("%3d %-6s %s" % (index, state + ":", line), end="")

def demo2():
    from io import StringIO
    import csv
    lines = StringIO("""\
083828.441,AA
093828.441,AA
094028.441,AA
094058.441,CC
094828.441,AA
103828.441,AA
123828.441,AA
""")

    rows = csv.reader(lines)
    for state, row in grep(rows, lambda r: r[-1] == "CC", 1, 2):
        print(row)

if __name__ == "__main__":
    demo1()
    demo2()

Probably too slow; badly needs testing.

Peter