[As J. Gleixner has already pointed out, there are HTML parsers
available for Perl, and a regexp is almost certainly not the best
way to do this]
On 2012-02-11 00:41, Jürgen Exner <jurg...@hotmail.com> wrote:
> Rob <rdw...@gmail.com> wrote:
>>I am looking for a perl REGEX statement to remove all the HTML from a
>
> Please see the FAQ and the many, many archived posts why HTML and REGEX
> is not a viable combination.
>
>>string except for the <p> tags. It would have to leave the <p> (and
>>the </p>) tags but also longer ones such as <p style=...> etc. I
>>haven't been able to find anything similar online for this.
What exactly do you mean by "remove all HTML except <p> tags"?
What would the result of processing the following (simple) file be?
<html>
<head>
<title>
A test
</title>
</head>
<body>
<h1> A test </h1> <h2> for Robs script </h2>
<p>
The quick brown fox jumps over the lazy dog.
</p>
<table>
<tr>
<td>
<p>
upper left
</p>
<p>
lower left
</p>
</td>
<td>
<p>
right
</p>
</td>
</tr>
</table>
<!--
<p>
This is not a paragraph
</p>
-->
<p>
Over & out!
</p>
</body>
</html>
>>Can anyone help with a suitable REGEX for this? I have tried a few
>>things but had no success.
Well, what have you tried?
Some tips:
* Start with a formal grammar of what you want to match.
  I usually use some form of BNF.
* Don't try to write the whole regexp at once. Use one regexp
  for every production in your grammar and use variable substitution
  to build more complex regexps (there is a parallel thread about
  matching RFC 5322 headers with some examples).
* Use /x and comments.
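
The tips above can be sketched roughly as follows. This is only an
illustration of the technique, not a solution to the original problem:
the grammar is a deliberately simplified, hypothetical one for a single
tag, the names ($name, $value, $attr, $tag, strip_tags_except_p) are
made up for this example, and it knows nothing about comments, CDATA or
<script> content, so it would still keep the commented-out <p> in the
test file above.

```perl
use strict;
use warnings;

# Simplified grammar (an assumption for this sketch, not real HTML):
#   tag   ::= '<' '/'? name (ws attr)* ws? '/'? '>'
#   attr  ::= name (ws? '=' ws? value)?
#   value ::= '"' [^"]* '"' | "'" [^']* "'" | [^\s"'>]+

# One regexp per production, combined by variable substitution.
my $name  = qr/ [A-Za-z] [A-Za-z0-9:-]* /x;
my $value = qr/ " [^"]* "          # double-quoted value
              | ' [^']* '          # single-quoted value
              | [^\s"'>]+          # unquoted value
              /x;
my $attr  = qr/ $name (?: \s* = \s* $value )? /x;
my $tag   = qr/ < [\/]?            # optional slash of a closing tag
                ($name)            # capture the element name
                (?: \s+ $attr )*   # any number of attributes
                \s* [\/]? >        # optional self-closing slash
              /x;

sub strip_tags_except_p {
    my ($html) = @_;
    # $1 is the whole tag, $2 the element name captured inside $tag.
    # Keep the tag when the element is "p", drop it otherwise.
    $html =~ s/($tag)/ lc($2) eq 'p' ? $1 : '' /ge;
    return $html;
}

print strip_tags_except_p(qq{<h1>A test</h1>\n<p style="x">fox</p>\n});
```

Because each production lives in its own variable, it can be tested on
its own before it is combined into a larger regexp.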
> That is not surprising because it cannot be done for arbitrary HTML. For
> further details please read up on the Chomsky hierarchy of languages.
Care to explain how the difference between regular and context-free
grammars is relevant to the task at hand? And you know of course that
Perl regexps are a superset of regular expressions, so that even if the
task is impossible with a regular expression, it may still be possible
with a regexp (has anyone tried to prove that regexps are/are not
equivalent to context-free grammars lately?).
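
As a small illustration of the "superset" point: since Perl 5.10 a
regexp can recurse into a named group, which is enough to match a
single group of balanced parentheses, a textbook example of a language
that is not regular. The sketch below shows only that much; it says
nothing about whether regexps are equivalent to context-free grammars
in general. The names $balanced and "group" are made up for this
example.

```perl
use strict;
use warnings;

# Needs Perl 5.10 or later for named groups and (?&name) recursion.
my $balanced = qr/
    \A
    (?<group>
        \(                  # opening parenthesis
        (?: [^()]++         # run of non-parenthesis characters
          | (?&group)       # or a nested balanced group
        )*
        \)                  # matching closing parenthesis
    )
    \z
/x;

for my $s ('(a(b)c)', '((()))', '(()', '())(') {
    printf "%-8s %s\n", $s, $s =~ $balanced ? 'balanced' : 'not balanced';
}
```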
hp
--
   _  | Peter J. Holzer    | Deprecating human carelessness and
|_|_) | Sysadmin WSR       | ignorance has no successful track record.
| |   | h...@hjp.at         |
__/   | http://www.hjp.at/ |  -- Bill Code on as...@irtf.org