Using Beautiful Soup with lists

Scott

Dec 16, 2009, 9:04:11 AM

to beautifulsoup

I am new to both Python and Beautiful Soup, so I am still in the
hacking stage. Here is my issue:

The website I am trying to parse has lists on it, and they are tripping
up my Python script. For each list I only get the first item instead of
all of them. In the following example I get "Page one" properly, as
well as "TextName1" and all its info. I do not get "TextName2" or
"TextName3". Later on I need to be able to decide whether an item is a
sectionHeader or a linklist, so it is convenient to select on the class
attribute when making the soup. Thanks for any help!

The Python is:

for div in soup.findAll(True, attrs={'class': ['sectionHeader', 'linklist']},
                        recursive=True):
    ...  # handle each matching tag here

The basic structure of the HTML is:

<h3 class="sectionHeader">Page one</h3>
<ul class=linklist>
  <li><a href="......">TextName1</a> <span class="attr">More Text</span></li>
  <li><a href="......">TextName2</a> <span class="attr">More Text</span></li>
  <li><a href="......">TextName3</a> <span class="attr">More Text</span></li>
</ul>

Aaron DeVore

Dec 16, 2009, 3:37:52 PM

to beauti...@googlegroups.com

Scott,
Am I correct that there are multiple lists in the page and you need to
get the contents of each one? That affects which functions and methods
are usually used with Beautiful Soup.

-Aaron DeVore


Scott

Dec 16, 2009, 3:41:47 PM

to beautifulsoup

Aaron,
You are correct. There are many such sections within the page. All
sections have the same basic setup: a "sectionHeader", then a few
lines later a "linklist".

Thanks,
Scott

Aaron DeVore

Dec 16, 2009, 4:10:36 PM

to beauti...@googlegroups.com

Do you need to get both the <h3> contents and the list contents?

-Aaron DeVore

Scott

Dec 16, 2009, 4:20:07 PM

to beautifulsoup

Yes, I need both. Later in the code I will do something like:

if div['class'] == 'sectionHeader':
    ...

if div['class'] == 'linklist':
    ...

I think this part of the code will need to change as well, since the
soup does not seem to recognize TextName2 (the second list element) as
being of class linklist, correct?

Scott

Aaron DeVore

Dec 16, 2009, 7:55:03 PM

to beauti...@googlegroups.com

Okay, here's an idea. I'll just do code because it's easier to
describe by example. Note that Tag.string gives the inner string (look
it up in the Beautiful Soup documentation). Also, something like
Tag.tagName searches for the first <tagName> tag inside that tag, if
there is one.
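
A quick illustration of those two features, as a sketch rather than
anything definitive. It assumes the newer bs4 package with Python's
built-in html.parser, and the one-item snippet is made up to mirror
your page:

```python
from bs4 import BeautifulSoup

# Made-up snippet shaped like one list item from the page.
html = '<li><a href="#">TextName1</a> <span class="attr">More Text</span></li>'
soup = BeautifulSoup(html, "html.parser")

li = soup.li                 # Tag.tagName: the first <li> in the document
link_text = li.a.string      # first <a> inside that <li>, then its inner string
span_text = li.span.string   # same idea for the first <span>

print(link_text)
print(span_text)
print(li.string)  # None here, because the <li> has more than one child
```

So for a tag with several children, like the <li> above, you go to the
child you want first and ask that child for its .string.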

pages = []  # list of (header text, labels) tuples, one per section
for header in soup.findAll('h3', 'sectionHeader'):
    headerText = header.string
    ul = header.findNext('ul', 'linklist')  # the first linklist after this header

    labels = []  # list of the contents of the <a>'s
    for li in ul.findAll('li'):  # only the <li> tags, not the whitespace between them
        labels.append(li.a.string)  # inner string of the first <a> tag
    pages.append((headerText, labels))

That gives you a nice list of tuples of the pages. When you're using
it later on, you can just do iteration like this:

for header, labels in pages:
    # work with header
    for label in labels:
        pass  # work with label

I probably made a mistake in there somewhere, but that should help
explain things (hopefully).
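
If you want something you can paste and run, here is a self-contained
sketch of the same idea. It is written against the newer bs4 package
(where findAll and findNext are spelled find_all and find_next) using
the built-in html.parser. The sample HTML and the collect_pages name
are made up for illustration, so adjust them to your real page:

```python
from bs4 import BeautifulSoup

# Made-up sample with two sections, following the structure posted above.
html = """
<h3 class="sectionHeader">Page one</h3>
<ul class=linklist>
  <li><a href="#">TextName1</a> <span class="attr">More Text</span></li>
  <li><a href="#">TextName2</a> <span class="attr">More Text</span></li>
</ul>
<h3 class="sectionHeader">Page two</h3>
<ul class=linklist>
  <li><a href="#">TextName3</a> <span class="attr">More Text</span></li>
</ul>
"""


def collect_pages(markup):
    """Return a list of (header text, [link labels]) tuples."""
    soup = BeautifulSoup(markup, "html.parser")
    pages = []
    for header in soup.find_all("h3", class_="sectionHeader"):
        # The list that belongs to a header is the next linklist after it.
        ul = header.find_next("ul", class_="linklist")
        labels = [li.a.string for li in ul.find_all("li")] if ul else []
        pages.append((header.string, labels))
    return pages


pages = collect_pages(html)
for header, labels in pages:
    print(header, labels)
```

The search matches the <h3> and the <ul> themselves; the <li> items
have no class of their own and are reached through the <ul>. One thing
to watch if you do move to bs4: tag['class'] is a list there, so a
check like div['class'] == 'linklist' becomes 'linklist' in div['class'].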

-Aaron DeVore

Scott

Dec 16, 2009, 9:31:15 PM

to beautifulsoup

Aaron,
Thanks for the help. I am working on it, but not having a great deal
of success. I think that what I really need to do is get a better
handle on the Python language in general. Back to the tutorial, and to
get an O'Reilly book.

Scott

Aaron DeVore

Dec 16, 2009, 9:40:23 PM

to beauti...@googlegroups.com

An excellent idea. Good luck!

-Aaron DeVore
