Parsing multiple lines from text file using regex

Marc

Oct 27, 2013, 5:09:46 PM

to pytho...@python.org

Hi,
I am having an issue with something that would seem to have an easy solution, but which escapes me. I have configuration files that I would like to parse. The data I am having issue with is a multi-line attribute that has the following structure:

banner <option> <banner text delimiter>
Banner text
Banner text
Banner text
...
<banner text delimiter>

The regex 'banner\s+(\w+)\s+(.+)' captures the command nicely and banner.group(2) captures the delimiter nicely.

My issue is that I need to capture the lines between the delimiters (both delimiters are the same).

I have tried various permutations of

Delimiter=banner.group(2)
re.findall(Delimiter'(.*?)'Delimiter, line, re.DOTALL|re.MULTILINE)

with no luck

Examples I have found online all assume that the starting and ending delimiters are different and are defined directly in re.findall(). I would like to use the original regex extracting the banner.group(2), since it is already done, if possible.
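A minimal sketch of the two-step idea described above, reusing the delimiter captured by the first regex. The sample text and the names `config_text`, `header`, and `body` are assumptions made for illustration, and `(\S+)` is swapped in for `(.+)` so the delimiter stops at the first whitespace. The pattern is assembled with `+` (placing string literals next to a variable is a syntax error), and `re.escape` keeps characters such as `^` from being read as regex syntax.

```python
import re

# Hypothetical sample configuration; "^C" stands in for the banner delimiter.
config_text = """hostname router1
banner motd ^C
Authorized access only
Violators will be logged
^C
interface eth0
"""

# Step 1: find the banner command and capture the option and the delimiter.
header = re.search(r'banner\s+(\w+)\s+(\S+)', config_text)
delimiter = header.group(2)

# Step 2: build the second pattern by concatenation, escaping the delimiter
# so that it is matched literally.
pattern = re.escape(delimiter) + r'\n(.*?)' + re.escape(delimiter)

# Search from the first delimiter onward; DOTALL lets "." cross line breaks.
body = re.search(pattern, config_text[header.start(2):], re.DOTALL).group(1)

print(repr(body))
```

This assumes the delimiter does not also occur inside the banner text; if it does, the non-greedy `(.*?)` stops at the first occurrence.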

Any help in pointing me in the right direction would be most appreciated.

Thank you,

Marc

Rhodri James

Oct 27, 2013, 6:19:41 PM

On Sun, 27 Oct 2013 21:09:46 -0000, Marc <ma...@marcd.org> wrote:

> Hi,
> I am having an issue with something that would seem to have an easy
> solution, but which escapes me. I have configuration files that I would
> like to parse. The data I am having issue with is a multi-line attribute
> that has the following structure:
>
> banner <option> <banner text delimiter>
> Banner text
> Banner text
> Banner text
> ...
> <banner text delimiter>
>
> The regex 'banner\s+(\w+)\s+(.+)' captures the command nicely and
> banner.group(2) captures the delimiter nicely.
>
> My issue is that I need to capture the lines between the delimiters (both
> delimiters are the same).

I really, really wouldn't do this with a single regexp. You'll get a much
easier to understand program if you implement a small state machine
instead. In rough pseudocode:

collecting_banner = False
for line in configuration_file:
    if not collecting_banner:
        if found banner start:
            get delimiter
            collecting_banner = True
            banner_lines = []
        elif found other stuff:
            do other stuff
    elif found delimiter:
        collecting_banner = False
    else:
        banner_lines.append(line)
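
One way the pseudocode above might look as runnable Python. This is only a sketch: the function name `parse_banners`, the sample data, and the assumption that the closing delimiter sits on a line of its own are not from the original post.

```python
import re

# Matches the line that opens a banner: "banner <option> <delimiter>".
BANNER_START = re.compile(r'banner\s+(\w+)\s+(\S+)')

def parse_banners(lines):
    """Return a list of (option, banner_lines) tuples found in lines."""
    banners = []
    collecting_banner = False
    for line in lines:
        line = line.rstrip('\n')
        if not collecting_banner:
            match = BANNER_START.match(line)
            if match:
                option, delimiter = match.groups()
                collecting_banner = True
                banner_lines = []
            # Any other configuration line would be handled here.
        elif line.strip() == delimiter:
            # Closing delimiter reached: store the banner and reset the state.
            banners.append((option, banner_lines))
            collecting_banner = False
        else:
            banner_lines.append(line)
    return banners

# Hypothetical sample input; a real program would pass an open file instead.
sample = """hostname router1
banner motd ^C
Authorized access only
Violators will be logged
^C
interface eth0
""".splitlines()

print(parse_banners(sample))
```

Because the delimiter is compared as a plain string, characters that are special in regular expressions need no escaping here.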

--
Rhodri James *-* Wildebeest Herder to the Masses

Mark Lawrence

Oct 27, 2013, 6:26:02 PM

to pytho...@python.org

On 27/10/2013 21:09, Marc wrote:
> Hi,
> I am having an issue with something that would seem to have an easy
> solution, but which escapes me. I have configuration files that I would
> like to parse. The data I am having issue with is a multi-line
> attribute that has the following structure:
>
> banner <option> <banner text delimiter>
> Banner text
> Banner text
> Banner text
> ...
> <banner text delimiter>
>
> The regex 'banner\s+(\w+)\s+(.+)' captures the command nicely and
> banner.group(2) captures the delimiter nicely.
>
> My issue is that I need to capture the lines between the delimiters
> (both delimiters are the same).
>

> I have tried various permutations of
>
> Delimiter=banner.group(2)
> re.findall(Delimiter'(.*?)'Delimiter, line, re.DOTALL|re.MULTILINE)
>
> with no luck
>
> Examples I have found online all assume that the starting and ending
> delimiters are different and are defined directly in re.findall(). I
> would like to use the original regex extracting the banner.group(2),
> since it is already done, if possible.
>
>
> Any help in pointing me in the right direction would be most appreciated.
>
> Thank you,
>
> Marc
>

What was wrong with the answer Peter Otten gave you earlier today on the
tutor mailing list?

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence

Roy Smith

Oct 27, 2013, 6:43:16 PM

In article <op.w5mwa3iaa8ncjz@gnudebeest>,

"Rhodri James" <rho...@wildebst.demon.co.uk> wrote:

> I really, really wouldn't do this with a single regexp. You'll get a much
> easier to understand program if you implement a small state machine
> instead.

And what is a regex if not a small state machine?

Ben Finney

Oct 27, 2013, 7:34:55 PM

to pytho...@python.org

Regex is not a state machine implemented by the original poster :-)

Or, in other words, I interpret Rhodri as saying that the right way to
do this is by implementing a *different* small state machine, which will
address the task better than the small state machine of regex.

--
\ “Pinky, are you pondering what I'm pondering?” “I think so, |
`\ Brain, but three round meals a day wouldn't be as hard to |
_o__) swallow.” —_Pinky and The Brain_ |
Ben Finney

Marc

Oct 27, 2013, 8:35:09 PM

to Mark Lawrence, pytho...@python.org

>What was wrong with the answer Peter Otten gave you earlier today on the
>tutor mailing list?
>
>--
>Python is the second best programming language in the world.
>But the best has yet to be invented. Christian Tismer
>
>Mark Lawrence
>

I did not receive any answers from the Tutor list, so I thought I'd ask
here. If an answer was posted to the Tutor list, it never made it to my
inbox. Thanks to all that responded.

Mark Lawrence

Oct 27, 2013, 8:40:53 PM

to pytho...@python.org

Okay, the following is taken directly from Peter's reply to you. Please
don't shoot the messenger :)

You can reference a group in the regex with \N, e.g.:

>>> import re
>>> text = """banner option delim
... banner text
... banner text
... banner text
... delim
... """
>>> re.compile(r"banner\s+(\w+)\s+(\S+)\s+(.+?)\2", re.MULTILINE |
...            re.DOTALL).findall(text)
[('option', 'delim', 'banner text\nbanner text\nbanner text\n')]
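
The same backreference idea can be sketched with named groups, which may be easier to read than `\2`. The group names (`option`, `delim`, `text`) and the sample data are assumptions for illustration. `re.MULTILINE` is left out because it only changes how `^` and `$` behave, and the pattern uses neither.

```python
import re

# (?P=delim) is a backreference: it matches whatever (?P<delim>...) captured.
BANNER = re.compile(
    r'banner\s+(?P<option>\w+)\s+(?P<delim>\S+)\s+(?P<text>.+?)(?P=delim)',
    re.DOTALL,  # let "." match newlines so the banner text can span lines
)

# Hypothetical sample with two banners that use different delimiters.
config_text = """banner motd #
Welcome
#
banner login %
Authorized users only
%
"""

results = [(m.group('option'), m.group('text'))
           for m in BANNER.finditer(config_text)]
print(results)
```

As with any non-greedy match, the text capture ends at the first later occurrence of the delimiter, so a delimiter character inside the banner text would cut the banner short.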

Oscar Benjamin

Oct 28, 2013, 5:30:42 AM

to Marc, Python List

On 28 October 2013 00:35, Marc <ma...@marcd.org> wrote:
>>What was wrong with the answer Peter Otten gave you earlier today on the
>>tutor mailing list?
>>
>>--
>>Python is the second best programming language in the world.
>>But the best has yet to be invented. Christian Tismer
>>
>>Mark Lawrence
>>
>
>
> I did not receive any answers from the Tutor list, so I thought I'd ask
> here. If an answer was posted to the Tutor list, it never made it to my
> inbox. Thanks to all that responded.

Hi Marc, did you actually subscribe to the tutor list or did you just
send an email there? Peter replied to you and you can see the reply
here:
https://mail.python.org/pipermail/tutor/2013-October/098156.html

He only sent the reply back to the tutor list and didn't email it
directly to you because it is assumed that you would also be
subscribed to the list.

Oscar

Marc

Oct 28, 2013, 12:53:10 PM

to Oscar Benjamin, Python List

>Hi Marc, did you actually subscribe to the tutor list or did you just
>send an email there? Peter replied to you and you can see the reply
>here:
>https://mail.python.org/pipermail/tutor/2013-October/098156.html
>
>He only sent the reply back to the tutor list and didn't email it
>directly to you because it is assumed that you would also be subscribed
>to the list.
>
>
>Oscar

Thanks Oscar - yes, I am subscribed to both lists - the python-list through
this email address and the tutor list through my work email. I think my
work spam filter may be the issue - I will subscribe to the tutor list
through this email and unsubscribe from my work email.

Thanks again,

Marc

Marc

Oct 28, 2013, 12:56:52 PM

to Oscar Benjamin, Python List

>Hi Marc, did you actually subscribe to the tutor list or did you just
>send an email there? Peter replied to you and you can see the reply
>here:
>https://mail.python.org/pipermail/tutor/2013-October/098156.html
>
>He only sent the reply back to the tutor list and didn't email it
>directly to you because it is assumed that you would also be subscribed
>to the list.
>
>
>Oscar

Hi again Oscar - apparently I am already subscribed to the Tutor list with
this email also:

'An attempt was made to subscribe your address to the mailing list
tu...@python.org. You are already subscribed to this mailing list.'

I control the spam filter for this domain, so I am not sure why I am not
getting the updates.

Marc

Jason Friedman

Nov 3, 2013, 6:12:37 AM

to Marc, python-list

> Hi,
> I am having an issue with something that would seem to have an easy solution, but which escapes me. I have configuration files that I would like to parse. The data I am having issue with is a multi-line attribute that has the following structure:
>
> banner <option> <banner text delimiter>
> Banner text
> Banner text
> Banner text
> ...
> <banner text delimiter>

This is an alternative solution someone else posted on this list for a similar problem I had:

#!/usr/bin/python3

from itertools import groupby

def get_lines_from_file(file_name):
    with open(file_name) as reader:
        for line in reader.readlines():
            yield(line.strip())

counter = 0

def key_func(x):
    if x.strip().startswith("banner") and x.strip().endswith("<banner text delimiter>"):
        global counter
        counter += 1
    return counter

for key, group in groupby(get_lines_from_file("my_data"), key_func):
    print(list(group)[1:-1])

Marc

Nov 3, 2013, 8:28:04 PM

to Jason Friedman, python-list

> This is an alternative solution someone else posted on this list for a similar problem I had:

> #!/usr/bin/python3
>
> from itertools import groupby
>
> def get_lines_from_file(file_name):
>     with open(file_name) as reader:
>         for line in reader.readlines():
>             yield(line.strip())
>
> counter = 0
>
> def key_func(x):
>     if x.strip().startswith("banner") and x.strip().endswith("<banner text delimiter>"):
>         global counter
>         counter += 1
>     return counter
>
> for key, group in groupby(get_lines_from_file("my_data"), key_func):
>     print(list(group)[1:-1])

Thanks Jason,

banner = re.compile(r'banner\s+(\w+)\s+(.+)(.*?)\2', re.DOTALL).findall(lines)

worked nicely to get what I needed:

outfile.write("Banner type: %s Banner Delimiter: %s\n" % (banner[0][0], banner[0][1]))
outfile.write("Banner Text:\n")
outfile.write(banner[0][2])

Probably not the prettiest, most concise code, but it gets the job done.
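
A small comparison, offered only as a sketch with made-up sample data, of how the delimiter group behaves in the pattern above versus the `(\S+)` variant posted earlier in the thread. With `re.DOTALL`, `(.+)` is free to run across line breaks, so the captured delimiter can include the trailing newline.

```python
import re

# Hypothetical one-banner sample; "^C" stands in for the delimiter.
lines = "banner motd ^C\nWelcome\n^C\n"

# Pattern from the post above: the delimiter group is (.+).
greedy = re.compile(r'banner\s+(\w+)\s+(.+)(.*?)\2', re.DOTALL).findall(lines)

# Variant from earlier in the thread: the delimiter group is (\S+).
strict = re.compile(r'banner\s+(\w+)\s+(\S+)\s+(.+?)\2', re.DOTALL).findall(lines)

print(greedy)  # delimiter captured together with its newline
print(strict)  # delimiter captured on its own
```

Both patterns extract the same banner text for this input; the difference only shows in the delimiter that gets written out.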

Thanks again,

Marc