Some basic help needed

Chris_147

Apr 21, 2008, 6:56:13 AM
to beautifulsoup
Hi all,

I recently learned about BeautifulSoup, which seems to be exactly what
I need.
I want to build a program to catalog all my SF books, but these books
are not in English and are older so they cannot be found on Amazon or
other sites.
However, there are some enthusiast sites with all the information I
need.
http://www.deboekenplank.nl/naslag/aut/s/stephenson_n.htm is a
reasonably simple one.
http://www.deboekenplank.nl/naslag/aut/v/vance_j.htm is a more
complicated one.

So the idea is that the user types in the author and title, and the
program comes back with all fields like ISBN, # of pages,
publisher, year, etc. filled in, ready to be approved by the user.

Now, the basic content of the pages is this:
<blockquote>
<p><strong><a href="img/stephenson_n_babelvirus_1995_1.jpg">HET
BABELVIRUS</a></strong><br>
1995, Amsterdam: Luitingh-Sijthoff, 447pag., ISBN
90-245-1217-4<br>
vert.van: <a href="img/stephenson_n_eng_snowcrash_1993.jpg">Snow
Crash</a> (1992), vert.door: Alistair Schuchart</p>

<p><strong><a href="img/stephenson_n_alchemist_1998_1.jpg">ALCHEMIST</a></strong><br>
1998, Amsterdam: Luitingh-Sijthoff, 543pag., ISBN
90-245-1545-9<br>
vert.van: The Diamond Age (1995), vert.door: Irene Ketman</p>

<p><strong><a href="img/stephenson_n_cryptonomicon_2001_1e_hcdj.jpg">CRYPTONOMICON</a></strong><br>
2001, Amsterdam: Luitingh-Sijthoff,&nbsp;1085pag., ISBN
90-245-3718-5, hardcover met stofomslag, 1e druk mei 2001<br>
vert.van: Cryptonomicon (1999), vert.door: Irene Ketman</p>
<p>niet vertaald:</p>

<p><b><a href="img/stephenson_n_zodiac.jpg">ZODIAC</a> : An eco-thriller</b><br>
1995, New York: Bantam, 308pag., ISBN 0-553-57386-1 (first printing:
Atlantic Monthly Press, 1988)</p>
</blockquote>

So the titles are in a blockquote tag, and usually in <p><strong>.
But the last one, for example, uses <b> instead of <strong>.
My idea was to go deeper and search for the <a> tag with the correct
title.
For example:
link = soup.findAll(name='a', text='HET BABELVIRUS')[0].findParent()
gives me
<a href="img/stephenson_n_babelvirus_1995_1.jpg">HET BABELVIRUS</a>

But how can I now search further based on this?
I would like to find the rest of the paragraph:
<p><strong><a href="img/stephenson_n_babelvirus_1995_1.jpg">HET
BABELVIRUS</a></strong><br>
1995, Amsterdam: Luitingh-Sijthoff, 447pag., ISBN 90-245-1217-4<br>
vert.van: <a href="img/stephenson_n_eng_snowcrash_1993.jpg">Snow
Crash</a> (1992), vert.door: Alistair Schuchart</p>

And if I find it, can BeautifulSoup help me parse the <br> tag? Or
should I try that with a regex?

thanks for the help,

Chris

Leonard Richardson

Apr 26, 2008, 4:27:45 PM
to beauti...@googlegroups.com
Chris,

> So the titles are in a blockquote tag, and usually in <p><strong>.
> But for example the last one doesn't have <strong> but <b>
> My idea was to go deeper and search for <a> tag with the correct
> title.
> For example:
> link = soup.findAll(name='a', text='HET BABELVIRUS')[0].findParent()
> gives me
> <a href="img/stephenson_n_babelvirus_1995_1.jpg">HET BABELVIRUS</a>
>
> But how can I now search further based on this?
> I would like to find the rest of the paragraph:
> <p><strong><a href="img/stephenson_n_babelvirus_1995_1.jpg">HET
> BABELVIRUS</a></strong><br>
> 1995, Amsterdam: Luitingh-Sijthoff, 447pag., ISBN 90-245-1217-4<br>
> vert.van: <a href="img/stephenson_n_eng_snowcrash_1993.jpg">Snow
> Crash</a> (1992), vert.door: Alistair Schuchart</p>
>
> And if I find it, can BeautifulSoup help me parse the <br> tag? Or
> should I try that with a regex?

Since you found the <a> tag inside the paragraph, you can find the
paragraph with a_tag.findParent('p'):

<p><strong><a href="img/stephenson_n_babelvirus_1995_1.jpg">HET
BABELVIRUS</a></strong><br />
1995, Amsterdam: Luitingh-Sijthoff, 447pag., ISBN
90-245-1217-4<br />
vert.van: <a href="img/stephenson_n_eng_snowcrash_1993.jpg">Snow
Crash</a> (1992), vert.door: Alistair Schuchart</p>
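
For anyone reading along later, here is a minimal sketch of that lookup.
It assumes the current bs4 package (where findAll and findParent are
spelled find_all and find_parent) rather than the BeautifulSoup 3 module
used in this thread. The helper names find_entry and br_separated_lines
are made up for illustration, and the HTML is a trimmed copy of the
sample from the first post. It also cuts the paragraph at its <br> tags,
which is roughly the point where BeautifulSoup's knowledge of the
structure ends.

```python
from bs4 import BeautifulSoup, Tag  # assumption: bs4, not BeautifulSoup 3

# Trimmed copy of the sample markup from the first post.
html = """
<blockquote>
<p><strong><a href="img/stephenson_n_babelvirus_1995_1.jpg">HET
BABELVIRUS</a></strong><br>
1995, Amsterdam: Luitingh-Sijthoff, 447pag., ISBN
90-245-1217-4<br>
vert.van: <a href="img/stephenson_n_eng_snowcrash_1993.jpg">Snow
Crash</a> (1992), vert.door: Alistair Schuchart</p>
<p><b><a href="img/stephenson_n_zodiac.jpg">ZODIAC</a> : An eco-thriller</b><br>
1995, New York: Bantam, 308pag., ISBN 0-553-57386-1</p>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")


def squash(text):
    """Collapse runs of whitespace (including line wraps) to single spaces."""
    return " ".join(text.split())


def find_entry(soup, title):
    """Return the <p> containing an <a> whose text equals title, else None."""
    wanted = squash(title)
    for a_tag in soup.find_all("a"):
        if squash(a_tag.get_text()) == wanted:
            # Works whether the link sits in <strong> or <b>.
            return a_tag.find_parent("p")
    return None


def br_separated_lines(p_tag):
    """Split a paragraph's text into the pieces separated by <br> tags."""
    lines, current = [], []
    for child in p_tag.children:
        if isinstance(child, Tag) and child.name == "br":
            lines.append(squash("".join(current)))
            current = []
        elif isinstance(child, Tag):
            current.append(child.get_text())
        else:
            current.append(str(child))
    lines.append(squash("".join(current)))
    return lines


entry = find_entry(soup, "HET BABELVIRUS")
for line in br_separated_lines(entry):
    print(line)
```

Comparing whitespace-squashed text instead of an exact string is a design
choice here: the title in the page source is wrapped across two lines, so
an exact match on 'HET BABELVIRUS' could miss it.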

I'm not sure what you mean by parsing the <br> tag. BS has parsed the
<br> tags and knows that there's a <br> tag, then some text, then
another <br> tag, then more text. But it doesn't know that "1995,
Amsterdam: Luitingh-Sijthoff, 447pag., ISBN 90-245-1217-4" is five
pieces of information. To express the structure of that text you
should use a regex.
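
As a rough illustration of the regex step, here is one possible pattern
for that line. It is only a sketch built from the few entries quoted in
this thread: the field names, the parse_entry helper, and the "extra"
catch-all for trailing notes are assumptions, and other pages on the site
may well need a looser pattern.

```python
import re

# Matches lines shaped like:
#   YEAR, CITY: PUBLISHER, NNNpag., ISBN N-NNN-NNNN-N[ trailing notes]
ENTRY_RE = re.compile(
    r"""
    (?P<year>\d{4}),\s*
    (?P<city>[^:]+):\s*
    (?P<publisher>.+?),\s*
    (?P<pages>\d+)\s*pag\.,\s*
    ISBN\s+(?P<isbn>[\dXx-]+)
    (?P<extra>.*)
    """,
    re.VERBOSE,
)


def parse_entry(line):
    """Return a dict of the five fields (plus any trailing notes), or None."""
    # Normalise whitespace; one sample entry contains a non-breaking space.
    line = " ".join(line.replace("\xa0", " ").split())
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pages"] = int(fields["pages"])
    fields["extra"] = fields["extra"].strip(" ,") or None
    return fields


book = parse_entry(
    "1995, Amsterdam: Luitingh-Sijthoff, 447pag., ISBN 90-245-1217-4"
)
print(book["year"], book["city"], book["publisher"], book["pages"], book["isbn"])
# prints: 1995 Amsterdam Luitingh-Sijthoff 447 90-245-1217-4
```

Returning None on a failed match, rather than raising, fits the plan of
letting the user approve or correct the fields by hand.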

Hope this helps,
Leonard

Chris_147

Apr 28, 2008, 7:54:50 AM
to beautifulsoup
On Apr 26, 10:27 pm, "Leonard Richardson" wrote:
Thanks, that was what I needed!
Sounds mighty simple now that you've said it, but I could not really figure
it out.

> I'm not sure what you mean by parsing the <br> tag. BS has parsed the
> <br> tags and knows that there's a <br> tag, then some text, then
> another <br> tag, then more text. But it doesn't know that "1995,
> Amsterdam: Luitingh-Sijthoff, 447pag., ISBN 90-245-1217-4" is five
> pieces of information. To express the structure of that text you
> should use a regex.

I thought so, but I wasn't completely sure.

Thanks for the help.