Use regular expression to retrieve all image tags from a given content


Mo Mughrabi

Jun 30, 2012, 8:37:13 AM
to django...@googlegroups.com
Hello, 

I'm really a noob with regular expressions. I tried to do this on my own, but I couldn't work out from the manuals how to approach it. I'm trying to find all img tags in a given piece of content. I wrote the code below, but it's returning None:

content = i.content[0].value
prog = re.compile(r'^<img')
result = prog.match(content)
print result

Any suggestions?


Sunny Nanda

Jun 30, 2012, 9:23:12 AM
to django...@googlegroups.com
You can try the following two suggestions:

1. Try removing the "^" from the pattern and match only r"<img". I believe that the image tag might not be coming at the start of the string.
2. Try printing the value of "content" to check that the "<img" pattern exists in it. The match is case sensitive, so even <IMG will not be matched.

On a side note, you should not be using regular expressions if you are doing anything more complex than what you are doing right now.
HTML is not a regular language, so you will be better off using an XML parser (like lxml or ElementTree) or an HTML parser (BeautifulSoup).
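
As a rough illustration of the parser route, here is a minimal sketch that uses only the standard library's html.parser module (Python 3) rather than any of the libraries named above. The class name ImgCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect the attributes of every img tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG SRC=...>
        # arrives here as tag "img" with an attribute named "src".
        # Self-closing tags (<img ... />) are routed here as well by default.
        if tag == "img":
            self.images.append(dict(attrs))


collector = ImgCollector()
collector.feed('<p>Hi <IMG SRC="a.png" alt="A"> and <img src="b.gif"/></p>')
sources = [img.get("src") for img in collector.images]
print(sources)  # ['a.png', 'b.gif']
```

The parser takes care of case and of the self-closing form, which a hand-written pattern has to handle explicitly.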

-Sandeep

Maksim Schepelin

Jun 30, 2012, 11:31:15 AM
to django...@googlegroups.com
Why not use an HTML parsing library? BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), for example:

# find_all is the BeautifulSoup 4 (bs4) name; BeautifulSoup 3 calls it findAll
from bs4 import BeautifulSoup
soup = BeautifulSoup('put_your_html_code_as_string')
images = soup.find_all('img')

If you specifically need regular expressions, watch this video: http://www.youtube.com/watch?v=kWyoYtvJpe4

On Saturday, June 30, 2012, 20:37:13 UTC+8, user mo.mughrabi wrote:

Melvyn Sopacua

Jul 3, 2012, 1:57:33 PM
to django...@googlegroups.com
On 30-6-2012 15:23, Sunny Nanda wrote:
> You can try the following two suggestions:
>
> 1. Try removing the "^" from the pattern and match only r"<img". I believe
> that the image tag might not be coming at the start of the string.

That, and re.match is bound to the start of the string. See:
http://docs.python.org/release/2.7.2/library/re.html#search-vs-match
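
A small sketch of that difference, using a made-up content string in place of the OP's data:

```python
import re

# Made-up sample; the img tag is deliberately not at the start of the string.
content = '<p>intro</p><img src="a.png"/>'
prog = re.compile(r'<img')

at_start = prog.match(content)    # match() only tries position 0
anywhere = prog.search(content)   # search() scans the whole string

# Keeping the ^ anchor defeats search() in the same way.
anchored = re.search(r'^<img', content)

print(at_start)          # None
print(anywhere.start())  # 12
print(anchored)          # None
```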

What you're looking for is:
prog = re.compile(r'<img.*?/>')
matches = prog.findall(content)  # content is the HTML string from the first post
for match in matches:
    print match

> On a side note, you should not be using regular expressions if you are doing
> anything more complex than what you are doing right now.

This isn't complex. The email validator in django is complex. Using an
XML parser for this is overkill. If you need several elements
based on their nesting and/or sibling elements, then an XML parser makes
more sense, or, better, XPath queries. This is simple stuff for regular
expressions and what they're made for.

--
Melvyn Sopacua


Tim Chase

Jul 3, 2012, 2:38:18 PM
to django...@googlegroups.com, Melvyn Sopacua
The reason for using a true parser is to avoid obscure edge cases.
Your example fails on both

<IMG ... >

and

< img ... >

Also, depending on the use-case (such as stripping them out of
validated code), a use-case such as

<i<img>mg src="evil.gif">

could get partly stripped out and leave the evil <img> tag in the text.
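
A quick sketch of that failure, assuming a naive single-pass stripper; the pattern and the name strip_once are hypothetical, chosen only to show the effect:

```python
import re

# Hypothetical naive stripper: remove anything that looks like an img tag.
IMG_TAG = re.compile(r'<img[^>]*>', re.IGNORECASE)


def strip_once(text):
    """Remove img tags in a single pass over the text."""
    return IMG_TAG.sub('', text)


payload = '<i<img>mg src="evil.gif">'
print(strip_once(payload))  # <img src="evil.gif">
```

Removing the inner <img> splices the surrounding fragments together into a new tag that the single pass never sees.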

-tkc



Melvyn Sopacua

Jul 3, 2012, 2:55:49 PM
to django...@googlegroups.com
On 3-7-2012 20:38, Tim Chase wrote:
> On 07/03/12 12:57, Melvyn Sopacua wrote:
>> On 30-6-2012 15:23, Sunny Nanda wrote:
>> What you're looking for is:
>> prog = re.compile(r'<img.*?/>')
>> matches = prog.findall(content)  # content is the HTML string from the first post
>> for match in matches:
>>     print match
>>
>>> On a side note, you should not be using regular expressions if you are doing
>>> anything more complex than what you are doing right now.
>>
>> This isn't complex. The email validator in django is complex. Using an
>> XML parser for this is quite overkill. If you need several elements
>> based on their nesting and/or sister elements, then an XML parser makes
>> more sense, or better xpath queries. This is simple stuff for regular
>> expressions and what they're made for.
>
> The reason for using a true parser is to avoid obscure edge cases.
> Your example fails on both
>
> <IMG ... >

Which is easily corrected with either <[Ii][Mm][Gg] or case-insensitive.
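
Both options can be compared side by side. This is only a sketch with made-up sample strings:

```python
import re

explicit = re.compile(r'<[Ii][Mm][Gg]')       # character classes
flagged = re.compile(r'<img', re.IGNORECASE)  # case-insensitive flag

samples = ['<img src="a.png">', '<IMG SRC="a.png">', '<Img>', '<p>']

# For each sample: (matched by explicit, matched by flagged)
results = [(bool(explicit.search(s)), bool(flagged.search(s))) for s in samples]
print(results)  # [(True, True), (True, True), (True, True), (False, False)]
```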
>
> and
>
> < img ... >

Which should fail.

> Also, depending on the use-case (such as stripping them out of
> validated code), a use-case such as
>
> <i<img>mg src="evil.gif">
>
> could get part stripped out and leave the evil <img> tag in the text.

r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner
cases. The point is that if you want nothing but the tags (stripped or
matched), regular expressions can do the job just fine. It's actually
more complex to do this with parsers, as you have to deal with syntax
errors, keep state and rejoin the tags with the attributes for SAX based
parsers and the only advantageous parser is a DOM tree, which has a
large memory footprint on complex/large documents.
It's a trade-off you should make a decision on, not just blatantly
dismiss regular expressions when a document contains tags or call them
complex when they contain more than two characters. The call can even be
swayed in favor of either by the "I want to learn (regex|XML parsing)"
argument.
--
Melvyn Sopacua


Tim Chase

Jul 3, 2012, 9:03:19 PM
to django...@googlegroups.com, Melvyn Sopacua
On 07/03/12 13:55, Melvyn Sopacua wrote:
> On 3-7-2012 20:38, Tim Chase wrote:
>> < img ... >
>
> Which should fail.

It depends on what the OP is using it for. If it's just for
extraction of images on the page to list them out, and such a tag
comes through, then the OP may be willing to let such
peculiarly-formed tags slide. However, if the eventual purpose is
to prevent users from adding image tags to a text-area, then it's
perfectly valid (or at least widely accepted by multiple browsers).

>> Also, depending on the use-case (such as stripping them out of
>> validated code), a use-case such as
>>
>> <i<img>mg src="evil.gif">
>>
>> could get part stripped out and leave the evil <img> tag in the text.
>
> r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>' will leave very few corner
> cases.

And yet I keep coming up with those corner cases, as would any
attacker that wanted to inject an image into the page (again, if the
goal is preventing image injection). We could still have tags like

< img class="spoon" src="evil.gif">

and since the HTML spec is pretty lazy/sloppy about ignoring
unknown attributes, one can even have garbage attributes introduced
anywhere you want:

<img micturations=plurdled src="evil.gif" gruntbuggly=freddled >
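
To check how the quoted pattern behaves on these inputs, here is a small sketch. The first two cases are the tags from this post; the third (whitespace around the equals sign, which HTML permits) is an extra case added for illustration.

```python
import re

pattern = re.compile(r'<[Ii][Mm][Gg][^>]+[Ss][Rr][Cc]=[^>]+>')

cases = {
    'leading space': '< img class="spoon" src="evil.gif">',
    'garbage attributes':
        '<img micturations=plurdled src="evil.gif" gruntbuggly=freddled >',
    # Extra illustrative case, not from the thread.
    'spaces around equals': '<img src = "evil.gif">',
}

matched = {name: bool(pattern.search(tag)) for name, tag in cases.items()}
for name, hit in matched.items():
    print(name, hit)
```

The pattern copes with the garbage attributes because [^>]+ swallows them, but it misses the other two forms.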

> The point is that if you want nothing but the tags (stripped or
> matched), regular expressions can do the job just fine.

The level of concern varies radically depending on whether one just
wants to extract/gather the sane(ish) image tags from a source, or
if the purpose is to sanitize input. A rough estimate for
extraction could be done something like the following untested regexp:

r = re.compile(r"""
    <                        # tag opening
    \s*                      # optional whitespace
    img                      # the tag
    \b                       # must end here
    (?:                      # one of these things:
        \s+                  # whitespace
        (?:[a-z][a-z0-9]+:)? # an optional namespace
        src                  # a "src" attribute
        \s*                  # optional whitespace
        =                    # the equals sign
        \s*                  # optional whitespace
        (                    # capture the value
            "[^"]*"          # a double-quoted string
            |                # or
            '[^']*'          # a single-quoted string
            |                # or
            [-a-z0-9._:]+    # an unquoted value, per HTML spec
        )                    # end of the captured src
    |                        # or something that's not src
        \s+                  # whitespace
        (?:[a-z][a-z0-9]+:)? # an optional namespace
        [a-z0-9]+            # the attribute name
        (?:                  # an optional value
            \s*              # optional whitespace
            =                # an assignment
            \s*              # optional whitespace
            (?:              # the value
                "[^"]*"      # a double-quoted string
                |            # or
                '[^']*'      # a single-quoted string
                |            # or
                [-a-z0-9._:]+ # an unquoted value, per HTML spec
            )                # end of the value
        )?                   # end of the optional value
    )*                       # zero or more attributes
    \s*                      # optional whitespace
    (?:/\s*)?                # an optional self-closing slash
    >                        # closing >
    """, re.I | re.VERBOSE)

So, that said, I'm not sure it's much more complex to use a real
parsing library :-)
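
For anyone who wants to try the idea, here is a sketch that assembles the same structure from named fragments and pulls out the src values. The fragment names and image_sources are hypothetical. It also assumes the unquoted-value alternative is meant to be a run of [-a-z0-9._:] characters and that the attribute value and the self-closing slash are optional, as the comments in the verbose pattern describe.

```python
import re

# Hypothetical fragment names; the structure mirrors the verbose pattern.
NAMESPACE = r"(?:[a-z][a-z0-9]+:)?"
# Assumption: an unquoted value is a run of these characters.
VALUE = r"""(?:"[^"]*"|'[^']*'|[-a-z0-9._:]+)"""
SRC_ATTR = r"\s+" + NAMESPACE + r"src\s*=\s*(" + VALUE + r")"
OTHER_ATTR = r"\s+" + NAMESPACE + r"[a-z0-9]+(?:\s*=\s*" + VALUE + r")?"

IMG_TAG = re.compile(
    r"<\s*img\b(?:" + SRC_ATTR + r"|" + OTHER_ATTR + r")*\s*(?:/\s*)?>",
    re.IGNORECASE,
)


def image_sources(html):
    """Return the raw (still quoted) src value of each matched img tag."""
    return [m.group(1) for m in IMG_TAG.finditer(html)
            if m.group(1) is not None]


sample = '<p><IMG alt=x src="a.png"/> and < img src=\'b.gif\' > <img></p>'
print(image_sources(sample))  # ['"a.png"', "'b.gif'"]
```

Attribute names containing hyphens (data-foo, for instance) still slip past this sketch, which is the kind of gap a tested parser would not have.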

> It's a trade-off you should make a decision on, not just blatantly
> dismiss regular expressions when a document contains tags or call them
> complex when they contain more than two characters.

I'm not blithely dismissing them, just making sure that

1) the use case is understood (I haven't heard the OP chime in
beyond the simple code snippet in the first post)

2) the complexity of adequately catching the wide variety of
edge cases one can encounter when using regexps is appreciated, and

3) it's clear that parsing HTML/XML is often better left to
battle-tested libraries, as I'm sure there are missing bits in the
above regex...things like the actual allowable character-sets for
attribute names, and there might yet be some sociopathic code that
could slip by, but this catches most of the cases that occur from my
understanding of the HTML spec.

> The call can even be swayed in favor for either by the "I want to
> learn (regex|XML parsing)" argument.

THAT, I can't give you any grief about--I frequently use that
argument myself :-D

-tim





Melvyn Sopacua

Jul 4, 2012, 9:30:45 AM
to django...@googlegroups.com
Aside from the \b matching positive against ><, this is a syntax
validator and of course when validating syntax you'll use a validating
parser. For all other cases, you'd be more interested in "image tags
with a src attribute", making the use-case quite a bit simpler.

My main beef with modern software is that for the simplest of things one
flees to full-blown libraries which happen to provide some utilities,
but the other 98% of the code from that library is unused. Case in
point, PIL to verify if a file is an image. But my rant alarm went off. :)

>> It's a trade-off you should make a decision on, not just blatantly
>> dismiss regular expressions when a document contains tags or call them
>> complex when they contain more than two characters.
>
> I'm not blithely dismissing them,

But the author I replied to did. I think you and I are on the same page.
Either way, the OP now has some nice examples of how to /refine/ regular
expressions and that's the real craft :).

--
Melvyn Sopacua


Tim Chase

Jul 4, 2012, 1:34:38 PM
to django...@googlegroups.com, Melvyn Sopacua
On 07/04/12 08:30, Melvyn Sopacua wrote:
> On 4-7-2012 3:03, Tim Chase wrote:
>> [snip Tim's obscene regex]
>
> Aside from the \b matching positive against ><,

I'm not sure I follow...the \b just requires that a word-boundary
occur there, preventing it from matching something like "imgood".
It might even be vestigial as I originally had "*" rather than "+"
for the whitespace before attributes, so it could likely be removed
now without impacting the regexp.
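
A small sketch of what the \b buys at the start of the pattern, with made-up sample tags:

```python
import re

with_boundary = re.compile(r'<\s*img\b', re.IGNORECASE)
without_boundary = re.compile(r'<\s*img', re.IGNORECASE)

samples = ['<img>', '<img src="a.png">', '<imgood>']

# For each sample: (matched with \b, matched without \b)
results = [(bool(with_boundary.search(s)), bool(without_boundary.search(s)))
           for s in samples]
print(results)  # [(True, True), (True, True), (False, True)]
```

Only the version with \b rejects the "imgood" prefix at this point in the pattern.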

> My main beef with modern software is that for the simplest of things one
> flees to full-blown libraries which happen to provide some utilities,
> but the other 98% of the code from that library is unused. Case in
> point, PIL to verify if a file is an image. But my rant alarm went off. :)

I've done enough work where pathological client/user data comes
through that I swear sometimes they're TRYING to break the app.
Like the person who uploaded an Excel file with a .jpg extension to
try and post the contained graph (re. your PIL image-detection comment).

So it all boils down to the use case, and how much work you want to
spend maintaining it every time it breaks. If security is involved,
I want the best-tested library I can get so I don't have to make all
their mistakes myself. If it's just a dirty "get me adequate data
as fast as possible" (especially if it's a one-off for a single
data-source), then I'll often just hammer it out using whatever is
easiest.

> But the author I replied to did.

Ah. I was looking for a 2nd post in the thread from the OP ("Mo
Mughrabi") and didn't see anything. At least via gmane where I read
the list.

> I think you and I are on the same page.
> Either way, the OP now has some nice examples of how to /refine/ regular
> expressions and that's the real craft :).

Amen! :-)

-tim


