Extracting text in between <title></title>

Zeynel

Nov 17, 2009, 12:03:07 PM

To: beautifulsoup

Hello,

Can anyone explain to this newbie how to use beautifulsoup to extract
text in between <title></title> tags? I have Python 2.6.1 installed.
But reading the documentation I was not clear how I was supposed to
run the program. Say I have an html file test.html and I want to
extract the title and put it in a text file extract.text. How do I do
this?

Thank you for your help.

Aaron DeVore

Nov 17, 2009, 1:37:48 PM

To: beauti...@googlegroups.com

Here's a basic idea of how to do it

from BeautifulSoup import BeautifulSoup

# parse with the BeautifulSoup class
soup = BeautifulSoup (file("test.html").read())

# find the first (and only) title tag
title = soup.find('title')

# get the first and only text node from the title. It is
# also a unicode object so you can treat it like any
# other text.
titleString = title.string

# Write the title string to 'extract.text'
open('extract.text').write(titleString)

The ultra-extreme-compact-and-confusing version is

open('extract.text').write(BeautifulSoup(file("test.html").read()).title.string)

Cheers!
Aaron DeVore


Zeynel

Nov 17, 2009, 3:56:18 PM

To: beautifulsoup

Hi,

Thanks for the help. I made 2 minor changes:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup (file("test.html").read())

title = soup.findAll('title')
titleString = title.string
open('extract.text', 'w').write(titleString)

I added 'w' because I was getting an IOError, and I changed it to soup.findAll.

But for some reason extract.text file remains blank. test.html has
<title> and </title> tags in it. What am I doing wrong?

Aaron DeVore

Nov 17, 2009, 4:59:59 PM

To: beauti...@googlegroups.com

Ah, yes. 'w' certainly does help. In this case, though, you'll want to
use find (which returns a match or None) instead of findAll (which
returns a list). To be honest, I haven't a clue why you didn't get an
AttributeError with the line 'titleString = title.string'.

-Aaron

Zeynel

Nov 17, 2009, 5:34:06 PM

To: beautifulsoup

I am sorry, I don't understand. Why would I get an AttributeError?

What I am doing is pasting the code you supplied into IDLE. It gives
no error but it writes nothing on the file. Does the script work for
you?

Aaron DeVore

Nov 17, 2009, 7:00:36 PM

To: beauti...@googlegroups.com

findAll returns a list, which has no string attribute. So let's say
you have this HTML:

<html>
<head><title>A Title</title></head>
<body></body>
</html>

If you parse you should have a tree that looks like this:
html
  head
    title
      "A Title"
  body

Then you call findAll('title'):

>>> title = soup.findAll('title')
>>> title
[<title>A Title</title>]
>>> title.string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'string'

If you instead use find, you get this:

>>> title = soup.find('title')
>>> title
<title>A Title</title>

Or, if there wasn't a title tag, you would get this:

>>> title = soup.find('title')
>>> type(title)
<type 'NoneType'>

You should see that AttributeError, hence my confusion.

-Aaron
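
The difference between a single match and a list of matches can be tried even without BeautifulSoup installed. The sketch below is only an analogue: it swaps in the standard library's xml.etree.ElementTree, whose find and findall behave in a similar way (one element or None, versus a list). It works here only because the sample page happens to be well-formed XML, and ElementTree exposes an element's text as .text rather than .string.

```python
import xml.etree.ElementTree as ET

# The sample page from the message above; it happens to be well-formed XML.
PAGE = """<html>
<head><title>A Title</title></head>
<body></body>
</html>"""

root = ET.fromstring(PAGE)

# findall returns a list, even when there is only one match.
titles = root.findall(".//title")

# A list has no per-element attributes, so this raises AttributeError.
try:
    titles.text
    list_error = None
except AttributeError as exc:
    list_error = str(exc)

# find returns the first matching element, or None when nothing matches.
title = root.find(".//title")
title_text = title.text
missing = root.find(".//h1")
```

Here title_text holds the title's text, missing is None because the page has no h1 element, and list_error holds the message of the AttributeError raised by the list.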

Zeynel

Nov 17, 2009, 8:30:18 PM

To: beautifulsoup

I am confused too. But by trial and error I made this script title1.py
work:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(file('test.html').read())

title = soup.find('title')

titleString = title.string
f = open('extract.txt', 'w')
f.write(titleString)
f.close()

This indeed writes the string inside the first title tags into
extract.txt. Can you give me a clue about how to loop through the file
to collect all the tags? Or is there a better way of doing this?

Thank you!
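
The working script above depends on Python 2: the file() builtin no longer exists in Python 3, and the BeautifulSoup 3 package only runs on Python 2. For readers on Python 3, a rough equivalent might look like the sketch below. It deliberately swaps in the standard library's html.parser for BeautifulSoup, so it is an analogue and not the library's API. The names TitleParser, extract_title and write_title are hypothetical, and the UTF-8 encoding is an assumption about the input file.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first title has been closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the first title's text, or None if no complete <title> was found."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts) if parser.done else None


def write_title(html_path, out_path):
    """Read html_path, write its title (or nothing) to out_path, return the title."""
    with open(html_path, encoding="utf-8") as src:  # assumption: UTF-8 input
        title = extract_title(src.read())
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(title or "")
    return title
```

With the file names from this thread, the call would be write_title("test.html", "extract.txt"). The with blocks close both files, which replaces the explicit f.close().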

Aaron DeVore

Nov 17, 2009, 11:58:13 PM

To: beauti...@googlegroups.com

There should only be a single <title> tag at
<head><title></title></head>. If you're trying to write the contents
of all tags of a particular name (tag-name in the example below) then
you can just use a for loop.

import os

for tag in soup.findAll('tag-name'):
    f.write(tag.string + os.linesep)

I might still be confused on what you need, though...
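
One thing to watch in a loop like the one above: in BeautifulSoup a tag's .string is None when the tag is empty or has more than one child, so tag.string + os.linesep can raise a TypeError. A minimal, library-agnostic sketch of a guard follows. write_lines is a hypothetical helper, not part of BeautifulSoup, and it writes "\n" instead of os.linesep because files opened in text mode already translate newlines.

```python
import io


def write_lines(strings, out):
    """Write each non-None string on its own line; return how many were written."""
    count = 0
    for text in strings:
        if text is None:  # e.g. a tag whose .string is None
            continue
        out.write(text + "\n")
        count += 1
    return count


# Demonstration with an in-memory file and stand-in values.
buffer = io.StringIO()
written = write_lines(["First", None, "Second"], buffer)
```

With the soup and f from this thread, the call might be write_lines((tag.string for tag in soup.findAll('tag-name')), f).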
