
Ask how to use HTMLParser


Water Lin

Jan 7, 2010, 10:44:48 PM

I am new to Python, but I want to parse an HTML page. I
tried to use HTMLParser. Here is my sample code:
----------------------
from HTMLParser import HTMLParser
from urllib2 import urlopen

class MyParser(HTMLParser):
    title = ""
    is_title = ""
    def __init__(self, url):
        HTMLParser.__init__(self)
        req = urlopen(url)
        self.feed(req.read())

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and attrs[0][1] == 'articleTitle':
            print "Found link => %s" % attrs[0][1]
            self.is_title = 1

    def handle_data(self, data):
        if self.is_title:
            print "here"
            self.title = data
            print self.title
            self.is_title = 0
-----------------------

For the tag
-------
<div class="articleTitle">open article title</div>
-------

I use my code to parse it. I can locate the div tag, but I don't know how
to get the text inside the tag, which is "open article title" in my example.

How can I get the HTML content? What's wrong with my handle_data function?

Thanks

Water Lin

--
Water Lin's notes and pencils: http://en.waterlin.org
Email: Wate...@ymail.com

h0uk

Jan 7, 2010, 11:17:17 PM
> Email: Water...@ymail.com

Hi.

Do you get any errors or anything else? What goes wrong?

Vardan.

h0uk

Jan 7, 2010, 11:42:58 PM
On Jan 8, 08:44, Water Lin <Water...@ymail.invalid> wrote:
> Email: Water...@ymail.com

I meant to say that your code works well.

Water Lin

Jan 8, 2010, 1:44:16 AM
h0uk <vardan....@gmail.com> writes:

But in handle_data I can't print self.title. I don't know why I can't set
self.title in handle_data.

Thanks

Water Lin


h0uk

Jan 8, 2010, 3:24:00 AM
On Jan 8, 11:44, Water Lin <Water...@ymail.invalid> wrote:
> Email: Water...@ymail.com

I have tested your code as follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    title = ""
    is_title = ""

    def __init__(self, data):
        HTMLParser.__init__(self)
        self.feed(data)

    def handle_starttag(self, tag, attrs):
        if tag == 'div' and attrs[0][1] == 'articleTitle':
            print "Found link => %s" % attrs[0][1]
            self.is_title = 1

    def handle_data(self, data):
        if self.is_title:
            print "here"
            self.title = data
            print self.title
            self.is_title = 0


if __name__ == "__main__":

    m = MyParser(""" <div class="secttlbarwrap">
<table cellpadding=0 cellspacing=0 width="100%"><tr><td>
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) 0px 0px; width: 4px; height: 4px">
<td bgcolor="#999999" width="100%" height="4"><img alt=""
width=1 height=1><td>
<div style="background: url(/groups/roundedcorners?
c=999999&bc=white&w=4&h=4&a=af) -4px 0px; width: 4px; height: 4px">
</div></table></div>


<div class="articleTitle">open article title</div>

<div class="secttlbar">
<div class="lf secttl">
<span id="thread_subject_site">
Ask how to use HTMLParser
</span>
</div>
<div class="rf secmsg frtxt padt2">
<a class="uitl" id="showoptions_lnk2" href="#"
onclick="TH_ToggleOptionsPane(); return false;">Parametrs</a>
</div>
<div class="hght0 clear" style="font-size:0;"></div>
</div>""")

Everything was printed and handled fine. The 'print self.title'
statement also works fine.
Try running my code.

Vardan.

Dave Angel

Jan 8, 2010, 5:34:08 AM
to Water Lin, pytho...@python.org
I don't know HTMLParser, but I see a possible confusion point in your
class definition.

You have both class-attributes and instance-attributes of the same names
(title and is_title). So if you have more than one instance of MyParser,
then they won't see each other's changes. Normally, I'd move the
initialization of such attributes into the __init__() method, so the
behavior is clear.

When an instance-attribute has the same name as a class-attribute, the
instance-attribute takes precedence, and "hides" the class-attribute,
for further processing in that same instance. So effectively, the
class-attribute acts as a default value.
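
A minimal sketch of the shadowing described above. The class names Demo and InitDemo are hypothetical and exist only for illustration; they are not part of the original code.

```python
class Demo(object):
    # Class attribute: shared by every instance, acts as the default value.
    title = ""

    def set_title(self, text):
        # Assigning through self creates an instance attribute that hides
        # the class attribute for this one instance only.
        self.title = text


class InitDemo(object):
    def __init__(self):
        # Initializing in __init__ makes it explicit that each instance
        # owns its own copy of the attribute from the start.
        self.title = ""


a = Demo()
b = Demo()
a.set_title("open article title")

print(repr(a.title))     # value stored on the instance a
print(repr(b.title))     # b still falls back to the class default
print(repr(Demo.title))  # the class attribute itself is unchanged
```

With this layout, a change made through one instance is never visible through another, which is the behavior the __init__() variant makes obvious.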


Nobody

Jan 10, 2010, 3:16:28 AM
On Fri, 08 Jan 2010 11:44:48 +0800, Water Lin wrote:

> I am a new guy to use Python, but I want to parse a html page now. I
> tried to use HTMLParse. Here is my sample code:
> ----------------------
> from HTMLParser import HTMLParser

Note that HTMLParser only tokenises HTML; it doesn't actually *parse* it.
You just get a stream of tag, text, entity, text, tag, ..., not a parse
tree.
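
To make the "stream of events" idea concrete, here is a small sketch that only records what the parser reports. It assumes Python 3, where the module is html.parser rather than HTMLParser; the EventLogger class name is made up for this example.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every callback in order instead of building a tree."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data))


logger = EventLogger()
logger.feed('<div class="articleTitle">open article title</div>')
logger.close()

for event in logger.events:
    print(event)
```

The text of the div arrives as a separate data event between the start and end events, so relating it to its element is left to the caller.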

In particular, if an element has its start and/or end tags omitted, you
won't get any notification about the start and/or end of the element;
you have to figure that out yourself from the fact that you're getting a
tag which wouldn't be allowed outside or inside the element.

E.g. if the document has omitted </p> tags, if you get a <p> tag when
you are (or *thought* that you were) already within a paragraph, you can
infer the omitted </p> tag.
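
A rough sketch of that inference, again assuming Python 3's html.parser. The ParagraphCollector class is hypothetical and handles only the <p> case described above, not other elements that can implicitly end a paragraph.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect paragraph text, inferring omitted </p> tags."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._current = None  # text pieces of the open paragraph, or None

    def _close_paragraph(self):
        if self._current is not None:
            self.paragraphs.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # A new <p> while one is still open implies an omitted </p>.
            self._close_paragraph()
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def close(self):
        # Flush any buffered text first, then end a paragraph that was
        # still open when the input ran out.
        super().close()
        self._close_paragraph()


collector = ParagraphCollector()
collector.feed("<p>first<p>second</p><p>third")
collector.close()
print(collector.paragraphs)
```

The parser itself never reports the missing end tags; the bookkeeping in _close_paragraph is what recovers the paragraph boundaries.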

If you want an actual parser, look at BeautifulSoup. This also does
a good job of handling invalid HTML (which seems to be far more
common than genuine HTML).
