HTML parsing

June Lee

Mar 15, 2008, 9:14:11 PM

any good way to extract the data?

I want to parse the following HTML page and extract TV listing data
using VC++

http://tvlistings.zap2it.com/tvlistings/ZCGrid.do

is easy for VC++ to call PERL script and do some regular expression?

Since the HTML page is not well-formed XML, I cannot use an XML parser,
right?

any other good ways to extract HTML page data?

Jürgen Exner

Mar 15, 2008, 8:35:48 PM

June Lee <iiu...@yahoo.com> wrote:
>any good way to extract the data?
> I want to parse the following HTML page

As has been mentioned many, many times in this NG: if you want to parse HTML
then use an HTML parser.
For further details please Read The Fine Manual: perldoc -q HTML
"How do I remove HTML from a string?"
While it doesn't address your question directly, there is really no big
difference between extracting the text (= 'removing HTML') and extracting
specific other pieces of information.
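
As a rough illustration of the "removing HTML" case, here is a minimal
sketch that collects only the text of a document. It assumes the CPAN
module HTML::Parser is installed; the helper name html_to_text and the
sample input are made up for this example and are not taken from the FAQ.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the visible text of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # "dtext" passes the text with entities already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );
    # Skip the contents of elements that are not visible text.
    $p->ignore_elements(qw(script style));

    $p->parse($html);
    $p->eof;

    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^\s+|\s+$//g;
    return $text;
}

print html_to_text('<p>Hello <b>world</b>!</p>'), "\n";
```

Extracting other pieces of information follows the same pattern: add a
start or end handler and decide there which text to keep.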

>and extract TV listing data
>using VC++

Ahemmmm, are you sure you are asking in the right NG? This NG is about Perl.

>is easy for VC++ to call PERL script and do some regular expression?

Are you asking? Well, Perl (Perl is not an acronym but a proper name) is
certainly good at processing text, and Perl's regular expression language is
very expressive. However, this is rather irrelevant for parsing HTML, because
HTML is not a regular language in the first place. Only a masochist would
try to parse HTML using REs.
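
To make the point concrete, here is a small sketch using only core Perl.
The input is a hypothetical table that leaves out some closing tags, as
many real pages do, and the pattern is a deliberately naive one that
assumes every cell is a tidy <td>...</td> pair.

```perl
use strict;
use warnings;

# A table that omits some closing </td> and </tr> tags; browsers accept this.
my $html = <<'HTML';
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
HTML

# Naive approach: capture whatever sits between <td> and the next </td>.
my @naive = $html =~ m{<td>(.*?)</td>}g;

# The table has four cells, but the pattern finds only two matches,
# and each capture still contains a stray <td> tag.
print scalar(@naive), " matches\n";
print "[$_]\n" for @naive;
```

The pattern can be patched for this one input, but each patch tends to
break on the next variation, which is the work an HTML parser already does.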

>Since the HTML page is not well-formed XML, I cannot use an XML parser,
>right?

Why would you want to use an XML parser on HTML source data? Oh, right, you
are also asking VC++ questions in a Perl NG.

>any other good ways to extract HTML page data?

Use one of the existing HTML parsers.

jue

Petr Vileta

Mar 16, 2008, 3:21:12 PM

Jürgen Exner wrote:
> June Lee <iiu...@yahoo.com> wrote:
>> any good way to extract the data?
>> I want to parse the following HTML page
>
> As has been mentioned many, many times in this NG: if you want to
> parse HTML then use an HTML parser.

You are right when the HTML page is valid, but on invalid pages Parser fails.
The HTML code below works in most browsers, but Parser fails on it.

--- example ---
<html><body>
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
</body>
</html>
--- example ---

--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)

Please reply to <petr AT practisoft DOT cz>

Joost Diepenmaat

Mar 16, 2008, 5:58:40 PM

"Petr Vileta" <sto...@practisoft.cz> writes:

> Jürgen Exner wrote:
>> June Lee <iiu...@yahoo.com> wrote:
>>> any good way to extract the data?
>>> I want to parse the following HTML page
>>
>> As has been mentioned many, many times in this NG: if you want to
>> parse HTML then use an HTML parser.
> You are right when the HTML page is valid, but on invalid pages Parser
> fails. The HTML code below works in most browsers, but Parser fails on
> it.

HTML parsers aren't XML parsers. If they were, life would be a lot
easier for people writing browsers, and 90% of the pages on the web
would not render at all.

> --- example ---
> <html><body>
> <table border=1>
> <tr><td>1<td>first row</td>
> <tr><td>2<td>second row</td>
> </table>
> </body>
> </html>
> --- example ---

Works fine in HTML::Parser:

#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Print one line for every start tag and end tag the parser reports.
my $p = HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&start, "tagname, attr"],
    end_h           => [\&end,   "tagname"],
    marked_sections => 1,
);
$p->parse_file(\*DATA);

sub start {
    print "START: @_\n";
}

sub end {
    print "END: @_\n";
}

__DATA__

<html><body>
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
</body>
</html>

--
Joost Diepenmaat | blog: http://joost.zeekat.nl/ | work: http://zeekat.nl/
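
Going one step further than printing tag events, the following sketch
collects the cell text of each table row with HTML::Parser. The helper
name extract_rows and the rule that a cell ends at the next <td>, <tr>, or
closing tag are assumptions made for this example, not something stated
in the thread.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return a list of rows, each an array ref of cell texts.
sub extract_rows {
    my ($html) = @_;
    my @rows;
    my $in_cell = 0;

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if ($tag eq 'tr') {
                push @rows, [];              # a new row also ends any open cell
                $in_cell = 0;
            }
            elsif ($tag eq 'td' || $tag eq 'th') {
                push @rows, [] unless @rows; # tolerate a cell without a <tr>
                push @{ $rows[-1] }, '';     # a new cell ends the previous one
                $in_cell = 1;
            }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_cell = 0 if $tag =~ /^(?:td|th|tr|table)$/;
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $rows[-1][-1] .= $text if $in_cell;
        }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    # Trim leading and trailing whitespace in every cell.
    for my $row (@rows) {
        s/^\s+|\s+$//g for @$row;
    }
    return @rows;
}

my $sample = "<table border=1>\n"
           . "<tr><td>1<td>first row</td>\n"
           . "<tr><td>2<td>second row</td>\n"
           . "</table>\n";
print join(' | ', @$_), "\n" for extract_rows($sample);
```

Because the handlers only track whether a cell is open, the missing
closing tags in the sample do not matter.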

Michele Dondi

Mar 16, 2008, 6:11:20 PM

On Sun, 16 Mar 2008 20:21:12 +0100, "Petr Vileta"
<sto...@practisoft.cz> wrote:

>> As has been mentioned many, many times in this NG: if you want to
>> parse HTML then use an HTML parser.
>You are right when the HTML page is valid, but on invalid pages Parser fails.
                                                                 ^^^^^^
>The HTML code below works in most browsers, but Parser fails on it.
                                                 ^^^^^^

I don't know of a single module called Parser, so it's hard to trust
your claim and in any case you should give evidence about it with an
example of the *real* module you're thinking of failing on your HTML
example.

Michele
--
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
.'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,

mirod

Mar 17, 2008, 12:53:23 PM

You can use HTML::TreeBuilder, maybe XML::LibXML's HTML mode, or
pre-process the HTML using tidy to generate XHTML.

In my experience, HTML::TreeBuilder is the easiest option.
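
For instance, a minimal sketch with HTML::TreeBuilder could look like the
following. It assumes the module is installed from CPAN; the helper name
table_rows and the sample table are invented for this example.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return a list of rows, each an array ref of cell texts.
sub table_rows {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    my @rows;
    for my $tr ($tree->look_down(_tag => 'tr')) {
        # as_text returns the text content of the element and its children.
        push @rows, [ map { $_->as_text } $tr->look_down(_tag => 'td') ];
    }

    $tree->delete;    # free the tree explicitly
    return @rows;
}

my $sample = <<'HTML';
<html><body>
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
</body>
</html>
HTML

print join(' | ', @$_), "\n" for table_rows($sample);
```

The tree interface lets you search by tag name or attribute instead of
tracking parser state yourself.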

If you need to navigate through the website to get to the page you need,
then WWW::Mechanize might help.
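
A rough sketch of that navigation step, assuming WWW::Mechanize is
installed. The helper name fetch_via_link is made up, and the start URL
and link text pattern are placeholders you would have to supply for the
real site.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: load a start page, follow the first link whose text
# matches $link_pattern, and return the HTML of the page reached.
sub fetch_via_link {
    my ($start_url, $link_pattern) = @_;

    # autocheck => 1 makes failed requests die instead of failing silently.
    my $mech = WWW::Mechanize->new(autocheck => 1);
    $mech->get($start_url);
    $mech->follow_link(text_regex => $link_pattern);

    return $mech->content;
}
```

The returned HTML can then be handed to HTML::TreeBuilder or another
HTML parser for the actual extraction.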

--
mirod