python regex: variable length of positive lookbehind assertion

Yubin Ruan

Jun 14, 2016, 11:28:37 PM

Hi everyone,
I am struggling to write a regex that matches what I want:

Problem Description:

Given a string like this:

>>> string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
... true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail"

I want to match all the text surrounded by those "<a> </a>",
but only if those "<a> </a>" are located **at some distance** after "true_head". That is, I expect the result to be like this:

>>> import re
>>> result = re.findall("the_regex", string)
>>> print(result)
['ccc', 'ddd', 'eee']

How can I write a regex to match that?
I have tried to use a **positive lookbehind assertion** in Python's regex,
but it does not allow a variable-length lookbehind.
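
For reference, the limitation can be reproduced with the standard `re` module. This is a minimal sketch; the pattern is only an illustration of what one might try first, and the exact error text may differ between Python versions:

```python
import re

# A lookbehind containing ".*" has no fixed width, so the stdlib
# re module refuses to compile the pattern at all.
try:
    re.compile(r"(?<=true_head.*)<a>([^<]+)</a>")
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)
```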

Thanks in advance,
Ruan

Lawrence D’Oliveiro

Jun 15, 2016, 12:18:31 AM

On Wednesday, June 15, 2016 at 3:28:37 PM UTC+12, Yubin Ruan wrote:

> I want to match all the text surrounded by those "<a> </a>",

You are trying to use regex (type 3 grammar) to parse HTML (type 2 grammar) <https://en.wikipedia.org/wiki/Formal_grammar#The_Chomsky_hierarchy>?

No can do <http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454>.

Yubin Ruan

Jun 15, 2016, 12:38:48 AM

Yes. I think you are correct. Thanks.

Jussi Piitulainen

Jun 15, 2016, 1:10:30 AM

Don't.

Don't even try to do it all in one regex. Keep your regexen simple and
match in two steps.

For example, capture all such elements together with your marker:

re.findall(r'true_head|<a>[^<]+</a>', string)
==>
['<a>aaa</a>', '<a>bbb</a>',
'true_head', '<a>ccc</a>', '<a>ddd</a>', '<a>eee</a>']

Then filter the result in the obvious way (not involving any regex any
more, unless needed to recognize the true 'true_head' again). I've kept
the tags at this stage, so a possible '<a>true_head</a>' won't look like
'true_head' yet.
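
The filtering step might look something like the sketch below. It assumes the sample string from the question, and it adds 'true_tail' to the pattern as a second marker so that collection can stop there; that addition is an assumption, not part of the pattern above.

```python
import re

string = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
          "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail")

# Step 1: collect the markers and the elements in document order.
tokens = re.findall(r'true_head|true_tail|<a>[^<]+</a>', string)

# Step 2: keep only the elements seen after true_head and before true_tail.
result = []
inside = False
for token in tokens:
    if token == 'true_head':
        inside = True
    elif token == 'true_tail':
        inside = False
    elif inside:
        # Strip the surrounding <a> and </a> tags.
        result.append(token[len('<a>'):-len('</a>')])

print(result)  # ['ccc', 'ddd', 'eee']
```

Because the tags are still attached during the comparison, a token such as '<a>true_head</a>' is treated as an element and never mistaken for the marker.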

Another way is to find 'true_head' first (if you can recognize it safely
before also recognizing the elements), and then capture the elements in
the latter half only.
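
A minimal sketch of this second approach, again assuming the sample string from the question and assuming that a plain substring search is enough to recognize 'true_head' (and, optionally, 'true_tail'):

```python
import re

string = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
          "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail")

# Split at the first true_head; sep is '' if the marker is missing.
_, sep, tail = string.partition('true_head')

# Optionally cut the tail off at true_tail (an assumption about the data;
# if true_tail is absent, partition returns the whole tail unchanged).
body, _, _ = tail.partition('true_tail')

# Only search for elements when the head marker was actually found.
result = re.findall(r'<a>([^<]+)</a>', body) if sep else []

print(result)  # ['ccc', 'ddd', 'eee']
```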

Vlastimil Brom

Jun 15, 2016, 4:31:39 AM

Hi,
html-like data is generally not very suitable for parsing with regex,
as was explained in the previous answers (especially if comments and
nesting are massively involved).
However, if this suits your data and the use case, you can use
variable-length lookarounds with the much enhanced "regex" library
for Python:
https://pypi.python.org/pypi/regex

Your pattern might then simply have the form you most likely
intended, e.g.:
>>> regex.findall(r"(?<=true_head.*)<a>([^<]+)</a>(?=.*true_tail)", "false_head <a>aaa</a> <a>bbb</a> false_tail true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail <a>fff</a> another_false_tail")
['ccc', 'ddd', 'eee']
>>>

If you are accustomed to using regular expressions, I'd certainly
recommend this excellent library (besides unlimited lookarounds, there
are repeated and recursive patterns, many Unicode-related
enhancements, powerful character set operations, even fuzzy matching
and much more).

hth,
vbr

alister

Jun 15, 2016, 8:28:05 AM

don't try to use regex to parse html it wont work reliably
i am surprised no one has mentioned beautifulsoup yet, which is probably
what you require.

--
What we anticipate seldom occurs; what we least expect generally happens.
-- Bengamin Disraeli

Jussi Piitulainen

Jun 15, 2016, 8:55:52 AM

Nothing in the question indicates that the data is HTML.

Marko Rauhamaa

Jun 15, 2016, 10:34:06 AM

Jussi Piitulainen <jussi.pi...@helsinki.fi>:

> alister writes:
>
>> On Tue, 14 Jun 2016 20:28:24 -0700, Yubin Ruan wrote:
>>> Given a string like this:
>>>
>>> >>>string = "false_head <a>aaa</a> <a>bbb</a> false_tail \
>>> true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a>
>>> true_tail"
>>>
>>> I want to match all the text surrounded by those "<a> </a>",

>>> [...]

>>
>> don't try to use regex to parse html it wont work reliably

>> [...]

>
> Nothing in the question indicates that the data is HTML.

And nothing in alister's answer suggests that.

Marko

Jussi Piitulainen

Jun 15, 2016, 10:57:22 AM

Marko Rauhamaa writes:

Now *I'm* surprised.

alister

Jun 15, 2016, 11:31:53 AM

the <a></a> tags are a pretty good indicator though
even if it is not HTML the same advice stands for XML (the quote example
would be invalid if it was XML)

if it is neither of these formats but still using a similar tag
structure then I would say that regex is still unsuitable & the OP would
need to write a full parser for the format if one does not already exist

--
Farewell we call to hearth and hall!
Though wind may blow and rain may fall,
We must away ere break of day
Far over wood and mountain tall.

To Rivendell, where Elves yet dwell
In glades beneath the misty fell,
Through moor and waste we ride in haste,
And whither then we cannot tell.

With foes ahead, behind us dread,
Beneath the sky shall be our bed,
Until at last our toil be passed,
Our journey done, our errand sped.

We must away! We must away!
We ride before the break of day!
-- J. R. R. Tolkien

Jussi Piitulainen

Jun 15, 2016, 12:04:43 PM

I can see how they point that way, but to me that alone seemed pretty
weak.

> even if it is not HTML the same advice stands for XML (the quote
> example would be invalid if it was XML)

It's not valid HTML either, for similar reasons. Or is it? I don't even
want to know.

> if it is neither of these formats but still using a similar tag
> structure then I would say that regex is still unsuitable & the OP
> would need to write a full parser for the format if one does not
> already exist

That depends on details that weren't provided.

I work with a data format that mixes element tags with line-oriented
data records, and having a dedicated parser would be more of a hassle. A
couple of very simple regexen are useful in making sure that start tags
have a valid form and extracting attribute-value pairs from them. I'm
not at all experiencing "two problems" here. Some uses of regex are
good. (And now I may be about to experience the third problem. That
makes me sad.)
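
To give a rough idea of that kind of use, here is a sketch with two small patterns. The tag shape (a name followed by key="value" pairs) and the names START_TAG, ATTRIBUTE and parse_start_tag are hypothetical, chosen only for illustration; the real format is not described in this thread.

```python
import re

# Hypothetical start-tag shape: <name key="value" key2="value2">
START_TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*>')
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

def parse_start_tag(line):
    """Return (name, attributes) for a well-formed start tag, else None."""
    match = START_TAG.fullmatch(line.strip())
    if match is None:
        return None
    name, attr_text = match.groups()
    # The attribute text was already validated by START_TAG,
    # so a plain findall is enough to pull out the pairs.
    return name, dict(ATTRIBUTE.findall(attr_text))

print(parse_start_tag('<text id="t1" lang="fi">'))
# ('text', {'id': 't1', 'lang': 'fi'})
print(parse_start_tag('<text id=t1>'))
# None, because the attribute value is not quoted
```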

Anyway, I think you and another person guessed correctly that the OP is
indeed really considering HTML, and then your suggestion is certainly
helpful.
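
If the data really is HTML, the parser-based route could be sketched roughly as follows. This uses the standard library's `html.parser` as a stand-in rather than BeautifulSoup itself, so that it runs without extra packages. The class name AnchorText and the assumption that 'true_head' and 'true_tail' appear as plain text between elements are illustrative only.

```python
from html.parser import HTMLParser

class AnchorText(HTMLParser):
    """Collect the text of <a> elements found between two text markers."""

    def __init__(self):
        super().__init__()
        self.in_region = False   # True after true_head, until true_tail
        self.in_anchor = False   # True while inside an <a> element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.in_anchor = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_anchor = False

    def handle_data(self, data):
        if self.in_anchor:
            # Text inside <a> is never treated as a marker.
            if self.in_region:
                self.texts.append(data)
        elif 'true_head' in data:
            self.in_region = True
        elif 'true_tail' in data:
            self.in_region = False

string = ("false_head <a>aaa</a> <a>bbb</a> false_tail "
          "true_head some_text_here <a>ccc</a> <a>ddd</a> <a>eee</a> true_tail")

parser = AnchorText()
parser.feed(string)
parser.close()
print(parser.texts)  # ['ccc', 'ddd', 'eee']
```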

Michael Torrie

Jun 15, 2016, 8:41:06 PM

On 06/15/2016 08:57 AM, Jussi Piitulainen wrote:

> Marko Rauhamaa writes:
>> And nothing in alister's answer suggests that.
>
> Now *I'm* surprised.

He simply said, here's a regex that can parse the example string the OP
gave us (which maybe looked a bit like HTML, but like you say, may not
be), but don't try to use this method to parse actual HTML because it
won't work reliably.

Jussi Piitulainen

Jun 16, 2016, 1:39:38 AM

Interesting how differently we can read alister's answer. It was only
two sentences, one of which Marko replaced with "[...]" before adding
his own one-liner that is still quoted above. Let me quote alister's
response in full here, the way I see it in Gnus:

# don't try to use regex to parse html it wont work reliably
# i am surprised no one has mentioned beautifulsoup yet, which is probably
# what you require.

That followed the fully quoted original message, and then there was an
attributed citation from a Bengamin Disraeli, separated as a .sig.

Where in alister's original response do you see a regex that can parse
OP's example? I don't see any regex there. (The text where you seem to
me to say that there is one is still quoted above in the normal way.)

Instead of giving any direct answer to the question, alister expresses
surprise at nobody having suggested an HTML parser. (Marko snipped that,
but I've quoted alister's response in full above, so you can check it
without looking up the original messages.)

A surprise calls for an explanation. Or should I say that I felt that
this particular expression of surprise seemed to me to call for an
explanation, or at the very least that an explanation would not do much
harm and might even be considered mildly interesting. And I saw a fully
adequate explanation: that the question was not about parsing HTML. So I
said so.

Marko Rauhamaa

Jun 16, 2016, 2:03:15 AM

Jussi Piitulainen <jussi.pi...@helsinki.fi>:

> Michael Torrie writes:
>
>> On 06/15/2016 08:57 AM, Jussi Piitulainen wrote:
>>> Marko Rauhamaa writes:
>>>> And nothing in alister's answer suggests that.
>>>
>>> Now *I'm* surprised.
>>
>> He simply said, here's a regex that can parse the example string the OP
>> gave us (which maybe looked a bit like HTML, but like you say, may not
>> be), but don't try to use this method to parse actual HTML because it
>> won't work reliably.
>
> Interesting how differently we can read alister's answer. It was only
> two sentences, one of which Marko replaced with "[...]" before adding
> his own one-liner that is still quoted above.

> [...]

>
> That followed the fully quoted original message, and then there was an
> attributed citation from a Bengamin Disraeli, separated as a .sig.
>

> [...]

>
> A surprise calls for an explanation. Or should I say that I felt that
> this particular expression of surprise seemed to me to call for an
> explanation, or at the very least that an explanation would not do much
> harm and might even be considered mildly interesting. And I saw a fully
> adequate explanation: that the question was not about parsing HTML. So I
> said so.

This is so meta.

Marko