Remove all HTML but keep <p> tags

Rob

Feb 10, 2012, 4:24:57 PM

I am looking for a perl REGEX statement to remove all the HTML from a
string except for the <p> tags. It would have to leave the <p> (and
the </p>) tags but also longer ones such as etc. I
haven't been able to find anything similar online for this.

Can anyone help with a suitable REGEX for this? I have tried a few
things but had no success.

Any help would be much appreciated.

Rob

J. Gleixner

Feb 10, 2012, 5:02:55 PM

Depending on how complex the 'string' is, you probably want to
avoid a regular expression solution and use a parser, e.g.
HTML::Parser.

Read the documentation and take a look at some of the examples
in the distribution, like hstrip and htext.
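
A minimal sketch of the parser approach, assuming the version 3 event API of HTML::Parser. The helper name keep_p_tags and the sample input are made up for illustration and are not taken from hstrip or htext. Start and end tags are copied to the output only when the element is "p"; text is always copied; comments and declarations get no handler and are therefore dropped.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return $html with every tag removed
# except <p ...> and </p>. Text between tags is kept.
sub keep_p_tags {
    my ($html) = @_;
    my $out = '';

    # Shared handler for start and end tags: "tagname" is the
    # lower-cased element name, "text" is the tag exactly as written.
    my $keep = sub {
        my ($tagname, $text) = @_;
        $out .= $text if $tagname eq 'p';
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $keep, 'tagname, text' ],
        end_h       => [ $keep, 'tagname, text' ],
        text_h      => [ sub { $out .= $_[0] }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;
    return $out;
}

print keep_p_tags('<div><p class="x">Hello <b>world</b></p><!-- note --></div>'), "\n";
```

Note that this sketch also keeps the text inside <script> and <style> elements, and leaves entities such as &amp; untouched; whether that is wanted depends on the input.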

Jürgen Exner

Feb 10, 2012, 7:41:32 PM

Rob <rdw...@gmail.com> wrote:
>I am looking for a perl REGEX statement to remove all the HTML from a

Please see the FAQ and the many, many archived posts why HTML and REGEX
is not a viable combination.

>string except for the <p> tags. It would have to leave the <p> (and
>the </p>) tags but also longer ones such as etc. I
>haven't been able to find anything similar online for this.
>
>Can anyone help with a suitable REGEX for this? I have tried a few
>things but had no success.

That is not surprising because it cannot be done for arbitrary HTML. For
further details please read up on the Chomsky hierarchy of languages.

jue

Peter J. Holzer

Feb 11, 2012, 4:06:50 AM

[As J. Gleixner has already pointed out, there are HTML parsers
available for perl - doing this with a regexp is almost certainly not
the best way to do this]

On 2012-02-11 00:41, Jürgen Exner <jurg...@hotmail.com> wrote:
> Rob <rdw...@gmail.com> wrote:
>>I am looking for a perl REGEX statement to remove all the HTML from a
>
> Please see the FAQ and the many, many archived posts why HTML and REGEX
> is not a viable combination.
>
>>string except for the <p> tags. It would have to leave the <p> (and
>>the </p>) tags but also longer ones such as etc. I
>>haven't been able to find anything similar online for this.

What exactly do you mean by "remove all html except tags"?

What would the result of processing the following (simple) file be?

<html>
<head>
<title>
A test
</title>
</head>
<body>
<h1> A test </h1> <h2> for Robs script </h2>

The quick brown fox jumps over the lazy dog.

<table>
<tr>
<td>

upper left


lower left

</td>
<td>

right

</td>
</tr>
</table>


Over & out!

</body>
</html>

>>Can anyone help with a suitable REGEX for this? I have tried a few
>>things but had no success.

Well, what have you tried?

Some tips:

* Start with a formal grammar of what you want to match.
I usually use some form of BNF.
* Don't try to write the whole Regexp at once. Use one Regexp
for every production in your grammar and use variable substitution
to build more complex regexps (there is a parallel thread about
matching RFC5322 headers with some examples).
* Use /x and comments.
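
The tips above can be sketched as follows. The three-production grammar in the comment is an assumption made up for illustration (it only describes <p> tags, not HTML in general), and the variable names are arbitrary. Each production becomes one qr// object, and the larger pattern is built by interpolating the smaller ones.

```perl
use strict;
use warnings;

# Assumed toy grammar, one regexp per production:
#   p_open  ::= "<" "p" ( whitespace attribute-text )? ">"
#   p_close ::= "</" "p" whitespace? ">"
#   p_tag   ::= p_open | p_close
my $ws      = qr/ \s+ /x;
my $p_open  = qr/ < p (?: $ws [^>]* )? > /xi;   # <p> or <p attr=...>
my $p_close = qr/ < \/ p \s* > /xi;             # </p> or </p >
my $p_tag   = qr/ $p_open | $p_close /x;

for my $candidate ('<p>', '<P class="intro">', '</p >', '<pre>', '<b>') {
    my $verdict = $candidate =~ /\A $p_tag \z/x ? 'p tag' : 'other';
    printf "%-20s %s\n", $candidate, $verdict;
}
```

Requiring whitespace after "p" before any attribute text is what keeps <pre> from being taken for a <p> tag. The attribute-text production is deliberately crude: an attribute value containing ">" would end the match early.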

> That is not surprising because it cannot be done for arbitrary HTML. For
> further details please read up on the Chomsky hierarchy of languages.

Care to explain how the difference between regular and context-free
grammars is relevant to the task at hand? And you know of course that
Perl regexps are a superset of regular expressions, so that even if the
task is impossible with a regular expression, it may still be possible
with a regexp (has anyone tried to prove that regexps are/are not
equivalent to context-free grammars lately?).
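
A small illustration of the "superset" point, assuming Perl 5.10 or later for the (?1) recursion construct. The language a^n b^n (n copies of "a" followed by the same number of "b") is a textbook example of a context-free language that is not regular, yet a Perl regexp can match it. This shows only that Perl regexps go beyond regular expressions; it says nothing about whether they cover all context-free grammars.

```perl
use strict;
use warnings;

# (?1) recurses into capture group 1, so the group matches
# "a", optionally a nested copy of itself, then "b".
my $anbn = qr/ \A ( a (?1)? b ) \z /x;

for my $s (qw(ab aabb aaabbb aab abb abab)) {
    printf "%-7s %s\n", $s, $s =~ $anbn ? 'matches' : 'does not match';
}
```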

hp

--
_ | Peter J. Holzer | Deprecating human carelessness and
|_|_) | Sysadmin WSR | ignorance has no successful track record.
| | | h...@hjp.at |
__/ | http://www.hjp.at/ | -- Bill Code on as...@irtf.org

George Mpouras

Feb 14, 2012, 1:14:53 PM

# try this
use strict;
use warnings;
my $htm=sub{local $/=undef;$_=$_[0];<$_>}->(\*DATA);
while( $htm =~/<p[^>]*?>(.*?)<\/p>/gi ) {
print "*$^N*\n"
}

__DATA__

Earth blah1 Sun blah1
Moon blah2 
Venus
Hermesblah2
Jupiter

Eli the Bearded

Feb 14, 2012, 7:19:20 PM

In comp.lang.perl.misc,
George Mpouras <nospam.gravit...@hotmail.com.nospam> wrote:
> # try this

> while( $htm =~/<p[^>]*?>(.*?)<\/p>/gi ) {
> print "*$^N*\n"
> }

That won't be effective if there are <PRE> (or other <P\w+>) tags,
missing tags, if there is markup inside the paragraphs, if
the close paragraph has optional whitespace like "</p >", or if
the paragraph contains newlines.

That's all the errors I can find at a first glance.
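
Some of these cases can be checked directly against the quoted pattern. The inputs below are made-up test strings, not data from the thread.

```perl
use strict;
use warnings;

my $re = qr/<p[^>]*?>(.*?)<\/p>/i;   # the pattern quoted above

# Hypothetical inputs, one per failure mode.
my %case = (
    'pre element'      => '<pre>code</pre><p>text</p>',
    'newline in body'  => "<p>line one\nline two</p>",
    'space in end tag' => '<p>text</p >',
    'nested markup'    => '<p>some <b>bold</b> text</p>',
);

for my $name (sort keys %case) {
    # In list context with /g, the match returns every captured group.
    my @found = $case{$name} =~ /$re/g;
    my $shown = @found ? join(' | ', map { "[$_]" } @found) : 'no match';
    print "$name: $shown\n";
}
```

The <pre> start tag is accepted as if it were a paragraph, so the capture runs from inside the <pre> element to the first </p>. The newline and "</p >" cases do not match at all, and the nested-markup case matches but the captured text still contains the <b> tags.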

Elijah
------
thinks it is html comments that make the original problem tricky
