Extracting data from HTML


Hazel

May 31, 2002, 3:52:55 PM
Hi all,

I'm new to Python. Can someone help me with this problem?

How do I write a program that
will extract info from an HTML page and print
a list of TV programmes, with their times and durations,
using urllib?

Please help.

Thanx :>

Geoff Gerrietts

May 31, 2002, 3:54:09 PM
Quoting Hazel (lail...@hotmail.com):
> How do I write a program that
> will extract info from an HTML page and print
> a list of TV programmes, with their times and durations,
> using urllib?

You might check into htmllib -- it's got some basic parser structures
in there that can help you parse through the HTML.

You might check out http://www.python9.org/p9-zadka.ppt, which goes
over some of that.

And at the end of this message, I've affixed some (very sloppy, not
very good) Python code that I pounded out the other day to (more or
less) strip markup from a page, so you can see how I went about
prototyping a solution to a (somewhat) similar problem.

--
Geoff Gerrietts <geoff at gerrietts dot net> http://www.gerrietts.net/
"Politics, as a practice, whatever its professions, has always been the
systematic organization of hatreds." --Henry Adams


#!/usr/local/bin/python -i

import htmllib, formatter, string, urllib

class DataStorage:
    """ DataStorage
    helper class for the parser. effectively implements a string that
    changes in-place: htmllib appends to self.savedata with +, and our
    __add__ mutates this object rather than building a new string.
    """
    def __init__(self, weight=2):
        self.data = ""
        self.count = 0
        self.weight = weight

    def __add__(self, other):
        """ __add__
        the __add__ routine just appends. clean it later.
        """
        self.data = self.data + str(other)
        return self

    def purge(self):
        # repeat the accumulated text `weight' times, then reset
        dat = [self.data] * self.weight
        self.data = ""
        return string.join(dat)


class HTMLMunger(htmllib.HTMLParser):
    TITLE_WT = 5
    HEADING_WT = 3
    EMPH_WT = 2

    def __init__(self):
        htmllib.HTMLParser.__init__(self, formatter.NullFormatter())
        self.plaindata = DataStorage()
        self.storagestack = []

    def start_body(self, attrs):
        # htmllib routes character data into self.savedata when it is set
        self.savedata = self.plaindata

    def push_storage(self, stor):
        self.storagestack.append(self.savedata)
        self.savedata = stor

    def pop_storage(self):
        dat = self.savedata.purge()
        self.savedata = self.storagestack.pop()
        self.handle_data(dat)

    def start_h1(self, attrs):
        self.push_storage(DataStorage(self.HEADING_WT))
    start_h2 = start_h3 = start_h4 = start_h5 = start_h6 = start_h1

    def end_h1(self):
        self.pop_storage()
    end_h2 = end_h3 = end_h4 = end_h5 = end_h6 = end_h1

    def start_i(self, attrs):
        self.push_storage(DataStorage(self.EMPH_WT))
    start_b = start_i

    def end_i(self):
        self.pop_storage()
    end_b = end_i

    def anchor_end(self):
        # prevent the link number from showing up
        self.anchor = None

    def extract(self):
        # `or ""' guards against a page with no <title>
        dat = string.join(([self.title or ""] * self.TITLE_WT)
                          + [self.plaindata.data])
        return dat


class TextMunger:

    def __init__(self):
        self.data = ''

    def feed(self, data):
        self.data = self.data + data

    def extract(self):
        return self.data


class DocFetcherException(Exception):
    pass


class DocFetcher:
    handlers = {
        'text/html': HTMLMunger,
        'text/plain': TextMunger
    }

    def get_url(self, url):
        url_obj = urllib.urlopen(url)
        # strip any "; charset=..." suffix before the handler lookup
        ct = url_obj.info()['Content-Type'].split(';')[0]
        h = self.handlers.get(ct)
        if not h:
            raise DocFetcherException, "no handler for [%s] type [%s]" % (url, ct)
        dp = h()
        dp.feed(url_obj.read())
        return dp.extract()


if __name__ == '__main__':
    pm = HTMLMunger()

    print "Retrieving"
    dat = urllib.urlopen("http://www.yahoo.com/").read()

    print "Parsing"
    pm.feed(dat)

    # weighted heading/emphasis text is folded back into plaindata
    # by pop_storage(), so that's the only store to report on
    print "Plain data: ", len(pm.plaindata.data)
    print pm.plaindata.data


Ian Bicking

May 31, 2002, 6:25:52 PM
On Fri, 2002-05-31 at 14:52, Hazel wrote:
> How do I write a program that
> will extract info from an HTML page and print
> a list of TV programmes, with their times and durations,
> using urllib?

You can get the page with urllib. You can use htmllib to parse it, but
I often find that regular expressions (the re module) are an easier way
-- since you aren't looking for specific markup, but specific
expressions. You'll get lots of false negatives (and positives), but
when you are parsing a page that isn't meant to be parsed (like most web
pages) no technique is perfect.
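
For instance, a minimal sketch of the regex approach (the URL here is a
placeholder, and the pattern assumes the listing shows times in an
hh:mm:ss AM/PM format):

import urllib, re

page = urllib.urlopen("http://example.com/tv-listings.html").read()
# match anything that looks like an hh:mm:ss AM/PM time, allowing a
# line break between the digits and the AM/PM
times = re.findall(r"\d{1,2}:\d{2}:\d{2}\s*[AP]M", page)
print times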

Ian


Geoff Gerrietts

Jun 1, 2002, 12:55:08 AM
Quoting Ian Bicking (ia...@colorstudy.com):
> On Fri, 2002-05-31 at 14:52, Hazel wrote:
> > How do I write a program that
> > will extract info from an HTML page and print
> > a list of TV programmes, with their times and durations,
> > using urllib?
>
> You can get the page with urllib. You can use htmllib to parse it, but
> I often find that regular expressions (the re module) are an easier way
> -- since you aren't looking for specific markup, but specific
> expressions. You'll get lots of false negatives (and positives), but
> when you are parsing a page that isn't meant to be parsed (like most web
> pages) no technique is perfect.

Definitely agree with this sentiment.

I'll go a step farther, and do a little compare/contrast.

Once upon a time, I wanted to grab data from the
weatherunderground.com website. I know there are lots of better ways
to get this information these days, but I was not so well-informed
back then.

So I wanted to grab this information, and I tried using regular
expressions to mangle the page. But truthfully, it was just too hard
to do. I could guess about where in the file the table with all the
info would appear, but getting a regular expression that was inclusive
enough to catch all the quirks, yet exclusionary enough to filter out
all the other embedded tables, proved a very large challenge.

That's when the idea of a parser made a lot of sense.

I could push the whole page through a parser, looking for one
particular phrase in a <TH> element, and from that point forward, map
<TH> elements to <TD> elements effectively. It became a very simple
exercise, because I knew how to find that info.
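
To make that concrete, here's a rough sketch of the idea (this is an
illustration, not the code I actually used -- the class name, trigger
phrase, and sample table are invented):

import sgmllib, string

class TableMapper(sgmllib.SGMLParser):
    """ map <TH> headers to the <TD> values that follow them, starting
    from the first <TH> that mentions a trigger phrase """
    def __init__(self, trigger):
        sgmllib.SGMLParser.__init__(self)
        self.trigger = trigger
        self.active = 0      # have we seen the trigger phrase yet?
        self.current = None  # 'th' or 'td' while inside a cell
        self.pairs = []

    def start_th(self, attrs):
        self.current = 'th'
    def end_th(self):
        self.current = None

    def start_td(self, attrs):
        self.current = 'td'
    def end_td(self):
        self.current = None

    def handle_data(self, data):
        data = string.join(data.split())  # normalize whitespace
        if self.current == 'th':
            if not self.active and data.find(self.trigger) != -1:
                self.active = 1
            if self.active and data:
                self.pairs.append([data, ''])
        elif self.current == 'td' and self.active and self.pairs:
            self.pairs[-1][1] = self.pairs[-1][1] + data

p = TableMapper('Temperature')
p.feed('<table><tr><th>Temperature</th><td>72 F</td></tr>'
       '<tr><th>Humidity</th><td>40%</td></tr></table>')
p.close()
print p.pairs   # [['Temperature', '72 F'], ['Humidity', '40%']]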

But as Ian rightly points out, htmllib and a real parser can be very
heavy if you're just looking to grab unformatted info -- or if you
can't count on the formatting being consistent.

Both techniques are worth knowing -- but better than either would be
finding a way to get the information you're after via XML-RPC or some
other protocol that's designed to carry data rather than rendering
instructions.

Best of luck,
--G.

--
Geoff Gerrietts "If life were measured by accomplishments,
<geoff at gerrietts net> most of us would die in infancy."
http://www.gerrietts.net/ --A.P. Gouthey


Hazel

Jun 1, 2002, 6:01:48 AM
Geoff Gerrietts <ge...@gerrietts.net> wrote in message news:<mailman.1022908262...@python.org>...
Dear Geoff, Ian,

I'm relying on sgmllib to do the work,
since htmllib requires heavy coding.
Here's an instance of what I want to extract:
the time of the TV programme >> 12:15:00 AM

"<TR>
<TD align=right bgColor=#000033><FONT color=#ffffff
face="verdana, arial, helvetica" size=1>12:15:00
AM</FONT></TD>"

So what do you think?

-Thanx
Hazel

Kragen Sitaker

Jun 2, 2002, 11:41:04 PM
Geoff Gerrietts <ge...@gerrietts.net> writes:
> Both techniques are worth knowing -- but better than either would be
> finding a way to get the information you're after via XML-RPC or some
> other protocol that's designed to carry data rather than rendering
> instructions.

You seem to imply that XML-RPC is better suited than HTTP to carrying
data rather than rendering instructions. I disagree with this
implication, and I adduce the following evidence:
- the thousands of RSS feeds (see www.syndic8.com) using HTTP
- people downloading Python via HTTP
- the fact that XML-RPC runs over HTTP

XML-RPC is better suited to expressing RPC than HTTP is, but "getting
some data" is probably better done over HTTP GET, where you can take
advantage of things like caching and URL linkability.

There's a *reason* we added MIME headers in HTTP 1.0 about ten years
ago, boy.
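
(To make that concrete with a two-liner -- python.org is just an
arbitrary example URL:)

import urllib

u = urllib.urlopen("http://www.python.org/")
print u.info()['Content-Type']   # the MIME header tells you what kind of data you got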

Kragen Sitaker

Jun 4, 2002, 10:31:51 PM
Ian Bicking writes:
> I think you are misinterpreting Geoff's response, and you seem to
> have a chip on your shoulder about it.

Sorry about the attitude. I'm frustrated that, twelve years into the
World Wide Web, there are still so many people who fail to learn from
its lessons.

> He did not compare XML-RPC to HTTP, but to HTML (at least, that's
> clearly implicit because this thread was talking about HTML
> parsing). HTML is clearly a poor way to exchange machine-readable
> information, there are too many layout-related tags that are usually
> only appreciated by humans.

I don't really agree that HTML is a poor way to exchange
machine-readable information; that is, after all, what the language is
designed for. But if the author of the HTML doesn't have
machine-readability in mind, and many don't, the HTML usually won't be
very machine-readable.

Nevertheless, Geoff said "XML-RPC or some other protocol that's
designed to carry data". HTTP and XML-RPC are protocols, although
they each define data formats; HTML is a language.

**

With regard to your earlier question about htmllib and HTML::Parse: I
hadn't actually tried to use htmllib, but had only read that it was
unreliable on malformed HTML. I haven't actually been able to feed it
HTML that's malformed enough to break it, though.

Here are link-extraction scripts in Perl using standard CPAN libraries
and in Python using standard Python libraries. The Perl script is
more featureful. The Python script was more painful to write, partly
because htmllib.HTMLParser has a more poorly designed interface than
HTML::Parser (hard as that may be to believe), but mostly because
Perl's library already has an HTML::LinkExtor and Python does not.

Perl version:

#!/usr/bin/perl -w
use strict;
require HTML::LinkExtor;
my $p = HTML::LinkExtor->new(\&cb, "http://www.sn.no/");
sub cb {
    my ($tag, %links) = @_;
    print "$tag @{[%links]}\n";
}
$p->parse_file("pathological.html");
__END__


Python version:

#!/usr/bin/python
import htmllib, formatter

class x(htmllib.HTMLParser):
    def dump(self, tag, attrs):
        print tag,
        for a, v in attrs:
            if a in ['action', 'src', 'href']:
                print a, v,
        print
    def do_img(self, attrs):
        self.dump('img', attrs)
    def start_a(self, attrs):
        self.dump('a', attrs)
    def start_form(self, attrs):
        self.dump('form', attrs)

y = x(formatter.NullFormatter())
y.feed(open('pathological.html').read())
y.close()


Here's pathological.html, which I guess is not very pathological,
because it didn't break either script. I'd be very interested to see
HTML pathological enough to break one or the other, but still accepted
by Netscape >3.0 or MSIE >4.0.

<!-- look -- this is my web page, tidy finds 20 errors in it, more than
1 per line -->
<ul>
Moroon & I are --&gt<b><B><i>GETTING MARRIED</b></i><-- next year.
<p> Look at my <a href=http://example.com/~moron>website.</a>
<body bgcolor=#ff7777>
<script>
This stuff shouldn't get displayed.
x = 1; y = 3;
if (x<y) x = y;
</script>
<IMG SRC='MYPIC.GIF'/>LOOK AT ME!
Sign my guestbook: <table><tr><td><form action=guestbook.cgi method=post>
<input name="yourname"><td><input type="submit"></form></table>

--
<kra...@pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
The power didn't go out on 2000-01-01 either. :)


Paul Boddie

Jun 5, 2002, 7:21:37 AM
lail...@hotmail.com (Hazel) wrote in message news:<82096df4.02060...@posting.google.com>...

>
> I'm relying on sgmllib to do the work,
> since htmllib requires heavy coding.

First of all, take a look at this document:

http://www.boddie.org.uk/python/HTML.html

The first section describes the use of sgmllib, although you could
also try various XML package classes instead. If you understand the
document, you barely need to read the rest of this message.

> Here's an instance of what I want to extract:
> the time of the TV programme >> 12:15:00 AM
>
> "<TR>
> <TD align=right bgColor=#000033><FONT color=#ffffff
> face="verdana, arial, helvetica" size=1>12:15:00
> AM</FONT></TD>"
>
> So what do you think?

The most important structural detail is probably the 'TD' element,
even though the data you need is found within the 'FONT' element. What
you could do, therefore, is to set up a handler method for the 'TD'
element called 'start_td' which sets a flag in your parser object
noting that the information inside the element may be interesting; you
could do more checks on the attributes ('align', 'bgColor') if you
believe that they help to distinguish the cells in the table which
contain times from the other cells. You also need an 'end_td' method
which unsets the flag, and you should always beware of nested tables,
too.

def start_td(self, attributes):
    if <some test with the attributes>:
        self.inside_cell = 1

def end_td(self):
    self.inside_cell = 0

Once "inside" the 'TD' element, you might then want to check for a
'FONT' element. Again, set up a handler method called 'start_font'
which firstly checks to see if that flag was set, indicating that the
parser is currently "inside" the 'TD' element of interest. Then, you
might want to check for some interesting attributes, but only if you
think you can rely on them - I get suspicious about multiple
presentational attributes (especially when they could have used a
stylesheet), and that's partly why I advocate checking for the
presence of more than one element type (in this case, the 'TD' and the
'FONT' elements) before mining away at the data.

The 'start_font' method will also set a flag indicating that a time
should be extracted from the text inside the element (between the
start and end tags), and again, it's important to implement an
'end_font' method which unsets this new flag.

def start_font(self, attributes):
    if self.inside_cell and <some test with the attributes>:
        self.ready_to_read = 1

def end_font(self):
    if self.inside_cell:  # Arguably not necessary.
        self.ready_to_read = 0

Finally, you should implement the 'handle_data' method which checks
that this new flag is set before reading the textual data and storing
it somewhere (such as another attribute in your parser class).

def handle_data(self, data):
    if self.ready_to_read:
        self.my_programme_times.append(data)
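
Put together as a complete sgmllib parser, a minimal runnable sketch
might look like this (I've filled the '<some test>' placeholders with
guesses based on your fragment -- adjust them to whatever reliably
distinguishes the time cells on the real page):

import sgmllib, string

class TimeExtractor(sgmllib.SGMLParser):
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.inside_cell = 0
        self.ready_to_read = 0
        self.my_programme_times = []

    def start_td(self, attributes):
        # guess: the time cells are the ones with bgColor=#000033
        # (sgmllib lower-cases attribute names for us)
        if ('bgcolor', '#000033') in attributes:
            self.inside_cell = 1

    def end_td(self):
        self.inside_cell = 0

    def start_font(self, attributes):
        # guess: the time text sits in a size=1 FONT element
        if self.inside_cell and ('size', '1') in attributes:
            self.ready_to_read = 1

    def end_font(self):
        self.ready_to_read = 0

    def handle_data(self, data):
        if self.ready_to_read:
            # collapse the internal line break: "12:15:00\nAM" -> "12:15:00 AM"
            self.my_programme_times.append(string.join(data.split()))

p = TimeExtractor()
p.feed('''<TR>
<TD align=right bgColor=#000033><FONT color=#ffffff
face="verdana, arial, helvetica" size=1>12:15:00
AM</FONT></TD>''')
p.close()
print p.my_programme_times    # ['12:15:00 AM']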

There are lots of issues with doing the parsing this way, and having
parsed some pretty complicated pages, I can certainly recommend the
XML approach instead, since it provides much better ways of testing
the structure than setting flags here and there. Unfortunately, you
may well need something like mxTidy to deal with severely broken HTML,
of which there seems to be a lot around.

Paul

Giulio Cespuglio

Jun 6, 2002, 5:39:11 PM
> http://www.boddie.org.uk/python/HTML.html

Hi Paul,

Thanks a lot for this resource. To be honest, I wonder how you managed
to work out how to use htmllib; IMHO the documentation is very poor.
Actually, I am happily using the flexibility of regular expressions to
do these things ATM, but I'm willing to give this library a try.

Cheers,
Giulio

Paul Boddie

Jun 7, 2002, 7:41:35 AM
Giulio Cespuglio <giulio.agosti...@libero.it> wrote in message news:<4slvfucpesg4ikfs3...@4ax.com>...
> > http://www.boddie.org.uk/python/HTML.html

>
> Thanks a lot for this resource. To be honest, I wonder how you managed
> to work out how to use htmllib; IMHO the documentation is very poor.
> Actually, I am happily using the flexibility of regular expressions to
> do these things ATM, but I'm willing to give this library a try.

I think that each technique has its advantages and disadvantages:

Regular expressions:          good for mining for data without caring
                              about document structure;
                              bad for detecting and reasoning about
                              the document structure

sgmllib
(and SAX-like technologies):  good for mining for data whilst
                              "binding" that data to specific
                              elements;
                              bad for dealing with complicated
                              document structures

DOM-like technologies:        good for insisting on particular
                              document structures and for keeping
                              these structures intact;
                              bad for casual mining of data - effort
                              is required to find data before it can
                              be extracted

So, if regular expressions aren't giving you the control you require,
you may want to consider one of the other technologies.
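
For the DOM-like route, a tiny sketch (it assumes the markup has
already been cleaned up into something well-formed, e.g. with mxTidy):

import xml.dom.minidom

doc = xml.dom.minidom.parseString(
    '<tr><td align="right"><font size="1">12:15:00 AM</font></td></tr>')
for font in doc.getElementsByTagName('font'):
    print font.firstChild.data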

Paul
