I want to parse the following HTML page and extract TV listing data
using VC++:
http://tvlistings.zap2it.com/tvlistings/ZCGrid.do
Is it easy for VC++ to call a PERL script and do some regular expression matching?
Since the HTML page is not well-formed XML, I cannot use an XML parser,
right?
Are there any other good ways to extract data from an HTML page?
As has been mentioned many, many times in this NG: if you want to parse HTML
then use an HTML parser.
For further details please Read The Fine Manual: perldoc -q HTML
"How do I remove HTML from a string?"
While it doesn't address your question directly, there is really no big
difference between extracting the text (= 'removing HTML') and extracting
other specific pieces of information.
>and extract TV listing data
>using VC++
Ahemmmm, are you sure you are asking in the right NG? This NG is about Perl.
>is easy for VC++ to call PERL script and do some regular expression?
Are you asking? Well, Perl (Perl is not an acronym but a proper name) is
certainly good at processing text, and Perl's regular expression language is
very expressive. However, this is rather irrelevant for parsing HTML because
HTML is not a regular language in the first place. Only a masochist would
try to parse HTML using REs.
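As a rough illustration of why REs get painful here, consider the usual naive "strip the tags" substitution. The input string below is made up for this sketch (it is not taken from the page in question); it merely contains a '>' inside an attribute value and inside a comment, both of which are legal in HTML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input, invented for this sketch: a '>' appears inside an
# attribute value and inside a comment.
my $html = '<p>See <img src="x.png" alt="a > b"> here <!-- 1 > 0 --> now</p>';

# Naive approach: treat everything from '<' up to the next '>' as a tag.
(my $naive = $html) =~ s/<[^>]*>//g;

# The substitution stops at the first '>' it meets, so pieces of the
# attribute value and of the comment leak into the "text".
print "$naive\n";
```

The pattern can be patched for each such case, but every patch adds another special case, which is the work an HTML parser has already done.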
>since the HTML page is not XML well formed, I cannot use a XML parser
>right?
Why would you want to use an XML parser on HTML source data? Oh, right, you
are also asking VC++ questions in a Perl NG.
>any other good ways to extract HTML page data?
Use one of the existing HTML parsers.
jue
--- example ---
<html><body>
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
</body>
</html>
--- example ---
--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)
Please reply to <petr AT practisoft DOT cz>
> Jürgen Exner wrote:
>> June Lee <iiu...@yahoo.com> wrote:
>>> any good way to extract the data?
>>> I want to parse the following HTML page
>>
>> As has been mentioned many, many times in this NG: if you want to
>> parse HTML then use an HTML parser.
> You are right when HTML page are valid, but on not valid pages Parser
> fail. The HTML code bellow work for most browsers but Parser fail on
> it.
HTML parsers aren't XML parsers. If they were, life would be a lot
easier for people writing browsers, and 90% of the pages on the web
would not render at all.
> --- example ---
> <html><body>
> <table border=1>
> <tr><td>1<td>first row</td>
> <tr><td>2<td>second row</td>
> </table>
> </body>
> </html>
> --- example ---
Works fine in HTML::Parser:
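#!/usr/local/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Report every start and end tag the parser sees.
my $p = HTML::Parser->new(
    api_version     => 3,
    start_h         => [\&start, "tagname, attr"],
    end_h           => [\&end,   "tagname"],
    marked_sections => 1,
);
$p->parse_file(\*DATA);

# Called with the tag name and a hash ref of the tag's attributes.
sub start {
    print "START: @_\n";
}

# Called with the tag name.
sub end {
    print "END: @_\n";
}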
__DATA__
<html><body>
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
</body>
</html>
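Going one step further than printing events, the same kind of handlers can collect the cell text into rows. This is only a sketch: the names @rows and $in_cell are made up for illustration, and it assumes a single, non-nested table like the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# The example table from this thread, with its unclosed <td> and <tr> tags.
my $html = <<'HTML';
<html><body>
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
</body>
</html>
HTML

my @rows;         # one array ref of cell strings per <tr>
my $in_cell = 0;  # true while we are inside a <td>

my $p = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag) = @_;
        if ($tag eq 'tr') {
            push @rows, [];           # a new row begins
            $in_cell = 0;
        }
        elsif ($tag eq 'td') {
            push @rows, [] unless @rows;   # tolerate a <td> without a <tr>
            push @{ $rows[-1] }, '';       # a new, empty cell
            $in_cell = 1;
        }
    }, 'tagname' ],
    end_h => [ sub {
        my ($tag) = @_;
        $in_cell = 0 if $tag =~ /^(?:td|tr|table)$/;
    }, 'tagname' ],
    # Text may arrive in several chunks, so append rather than assign.
    text_h => [ sub { $rows[-1][-1] .= $_[0] if $in_cell }, 'dtext' ],
);
$p->parse($html);
$p->eof;

# Trim stray whitespace around each cell.
for my $row (@rows) {
    s/^\s+|\s+$//g for @$row;
}

print join(' | ', @$_), "\n" for @rows;
```

A new <td> or <tr> simply starts the next cell or row, so the missing end tags do not matter to this approach.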
--
Joost Diepenmaat | blog: http://joost.zeekat.nl/ | work: http://zeekat.nl/
>> As has been mentioned many, many times in this NG: if you want to
>> parse HTML then use an HTML parser.
>You are right when HTML page are valid, but on not valid pages Parser fail.
>The HTML code bellow work for most browsers but Parser fail on it.
^^^^^^
^^^^^^
I don't know of a single module called Parser, so it's hard to trust
your claim. In any case, you should back it up with an example that
uses the *real* module you think is failing on your HTML
example.
Michele
--
{$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
(($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
.'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
You can use HTML::TreeBuilder, or maybe XML::LibXML's HTML mode, or
pre-process the HTML with tidy to generate XHTML.
In my experience, HTML::TreeBuilder is the easiest option.
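A minimal sketch of the HTML::TreeBuilder route, run on the small table posted earlier in this thread rather than on the real listings page (whose markup is not shown here, so the tag names to look for there are an open question):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The example table from this thread, with its unclosed <td> and <tr> tags.
my $html = <<'HTML';
<html><body>
<table border=1>
<tr><td>1<td>first row</td>
<tr><td>2<td>second row</td>
</table>
</body>
</html>
HTML

# Build a tree; the missing end tags are supplied by the parser.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Collect the trimmed text of every <td>, grouped by <tr>.
my @rows;
for my $tr ($tree->look_down(_tag => 'tr')) {
    push @rows, [ map { $_->as_trimmed_text } $tr->look_down(_tag => 'td') ];
}

# Free the tree explicitly once we are done with it.
$tree->delete;

print join(' | ', @$_), "\n" for @rows;
```

For a real page, look_down can also match on attributes (for example a class name) to narrow the search to the table of interest.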
If you need to navigate through the website to get to the page you need,
then WWW::Mechanize might help.
--
mirod