Extracting text in between <title></title>


Zeynel

Nov 17, 2009, 12:03:07 PM
to beautifulsoup
Hello,

Can anyone explain to this newbie how to use beautifulsoup to extract
the text in between <title></title> tags? I have Python 2.6.1 installed.
But reading the documentation, I was not clear how I was supposed to
run the program. Say I have an html file test.html and I want to
extract the title and put it in a text file extract.text. How do I do
this?

Thank you for your help.

Aaron DeVore

Nov 17, 2009, 1:37:48 PM
to beauti...@googlegroups.com
Here's a basic idea of how to do it

from BeautifulSoup import BeautifulSoup

# parse with the BeautifulSoup class
soup = BeautifulSoup (file("test.html").read())

# find the first (and only) title tag
title = soup.find('title')

# get the first and only text node from the title. It is
# also a unicode object so you can treat it like any
# other text.
titleString = title.string

# Write the title string to 'extract.text'
open('extract.text').write(titleString)

The ultra-extreme-compact-and-confusing version is

open('extract.text').write(BeautifulSoup(file("test.html").read()).title.string)

Cheers!
Aaron DeVore

Zeynel

Nov 17, 2009, 3:56:18 PM
to beautifulsoup
Hi,

Thanks for the help. I made 2 minor changes:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup (file("test.html").read())
title = soup.findAll('title')
titleString = title.string
open('extract.text', 'w').write(titleString)

I added 'w' because I was getting an IOError, and I changed find to
soup.findAll.

But for some reason the extract.text file remains blank. test.html has
<title> and </title> tags in it. What am I doing wrong?

Aaron DeVore

Nov 17, 2009, 4:59:59 PM
to beauti...@googlegroups.com
Ah, yes. 'w' certainly does help. In this case, though, you'll want to
use find (which returns a match or None) instead of findAll (which
returns a list). To be honest, I haven't a clue why you didn't get an
AttributeError with the line 'titleString = title.string'.

-Aaron
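
The IOError mentioned above comes from open() defaulting to read mode when no mode is given. Here is a minimal sketch of the difference. It assumes Python 3 (the thread uses Python 2.6, where the same failures surface as IOError), and the helper names and temporary directory are made up for illustration:

```python
import os
import tempfile

def write_title(path, text):
    # Mode 'w' creates (or truncates) the file; leaving the with-block
    # closes it, which also flushes the text to disk.
    with open(path, 'w') as f:
        f.write(text)

def try_write_without_mode(path, text):
    # open(path) is the same as open(path, 'r'): read-only.
    # Returns the exception that was raised, or None if nothing failed.
    try:
        with open(path) as f:
            f.write(text)
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        return exc
    return None

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'extract.text')

missing_error = try_write_without_mode(path, 'A Title')  # file does not exist yet
write_title(path, 'A Title')
readonly_error = try_write_without_mode(path, 'Other')   # file exists but is read-only

with open(path) as f:
    content = f.read()
```

Closing the file, either with a with-block or an explicit close(), is what guarantees that the written text actually reaches the file.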

Zeynel

Nov 17, 2009, 5:34:06 PM
to beautifulsoup
I am sorry, I don't understand. Why would I get an AttributeError?

What I am doing is pasting the code you supplied into IDLE. It gives
no error but it writes nothing to the file. Does the script work for
you?

Aaron DeVore

Nov 17, 2009, 7:00:36 PM
to beauti...@googlegroups.com
findAll returns a list, which has no string attribute. So let's say
you have this HTML:

<html>
<head><title>A Title</title></head>
<body></body>
</html>

If you parse you should have a tree that looks like this:
html
    head
        title
            "A Title"
    body


Then you call findAll('title'). It returns a list, and asking the list
for .string fails:

>>> title = soup.findAll('title')
>>> title
[ u"<title>A Title</title>" ]
>>> title.string
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'list' object has no attribute 'string'

If you instead use find you get the tag itself, which does have a
string attribute:

>>> title = soup.find('title')
>>> title
u"<title>A Title</title>"
>>> title.string
u'A Title'

Or, if there wasn't a title tag, you would get this

>>> title = soup.find('title')
>>> type(title)
<type 'NoneType'>

You should see that AttributeError, hence my confusion.

-Aaron
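
The difference between the two calls can be checked directly. The sketch below assumes the newer bs4 package (Beautiful Soup 4) on Python 3, not the BeautifulSoup 3 module used in this thread. In bs4, findAll is spelled find_all, and the parser backend ('html.parser' here) is named explicitly:

```python
from bs4 import BeautifulSoup

html = """<html>
<head><title>A Title</title></head>
<body></body>
</html>"""

soup = BeautifulSoup(html, 'html.parser')

# find_all always returns a list-like result, even for a single match,
# so an element has to be picked out before asking for .string.
matches = soup.find_all('title')
first_from_list = matches[0].string if matches else None

# find returns the first matching tag, or None when nothing matches.
title = soup.find('title')
title_text = title.string if title is not None else None

# A tag that is absent from the document yields None, so guard before
# touching .string.
missing = soup.find('h1')
```

Checking for None before reading .string avoids an AttributeError when a page has no title.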

Zeynel

Nov 17, 2009, 8:30:18 PM
to beautifulsoup
I am confused too. But by trial and error I made this script title1.py
work:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(file('test.html').read())
title = soup.find('title')
titleString = title.string
f = open('extract.txt', 'w')
f.write(titleString)
f.close()

This indeed writes the string inside the first title tag into
extract.txt. Can you give me a clue about how to loop through the file
to collect all the tags? Or is there a better way of doing this?

Thank you!
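
For comparison, here is a rough port of the working script above to Python 3. It assumes the bs4 package and UTF-8 encoded files. The extract_title helper and the temporary demo files are hypothetical names added for illustration:

```python
import os
import tempfile
from bs4 import BeautifulSoup

def extract_title(html_path, out_path):
    # Read the whole HTML file; the with-block closes it afterwards.
    with open(html_path, encoding='utf-8') as src:
        soup = BeautifulSoup(src.read(), 'html.parser')

    title = soup.find('title')
    # find returns None when there is no <title>, and .string is None
    # when the tag is empty; fall back to an empty string in both cases.
    if title is not None and title.string is not None:
        text = str(title.string)
    else:
        text = ''

    with open(out_path, 'w', encoding='utf-8') as dst:
        dst.write(text)
    return text

# Demo files in a temporary directory (made-up sample content).
workdir = tempfile.mkdtemp()
html_path = os.path.join(workdir, 'test.html')
out_path = os.path.join(workdir, 'extract.txt')
with open(html_path, 'w', encoding='utf-8') as f:
    f.write('<html><head><title>A Title</title></head><body></body></html>')

result = extract_title(html_path, out_path)
with open(out_path, encoding='utf-8') as f:
    written = f.read()
```

The with-blocks play the same role as the explicit f.close() in the script above.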

Aaron DeVore

Nov 17, 2009, 11:58:13 PM
to beauti...@googlegroups.com
There should only be a single <title> tag at
<head><title></title></head>. If you're trying to write the contents
of all tags of a particular name (tag-name in the example below) then
you can just use a for loop.

import os

for tag in soup.findAll('tag-name'):
    f.write(tag.string + os.linesep)

I might still be confused on what you need, though...
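
One caveat with a loop like this is that tag.string is None when a tag is empty or has more than one child, so tag.string + os.linesep can raise a TypeError. Here is a minimal sketch of a more forgiving loop. It assumes the bs4 package on Python 3 (find_all instead of findAll), and the sample HTML is invented:

```python
from bs4 import BeautifulSoup

html = """<html><body>
<p>First paragraph</p>
<p>Second with <b>bold</b> text</p>
<p></p>
</body></html>"""

soup = BeautifulSoup(html, 'html.parser')

lines = []
for tag in soup.find_all('p'):
    # .string is None for the second and third <p> here; get_text()
    # always returns a str (possibly empty) with all nested text joined.
    lines.append(tag.get_text())

# One entry per line, ready to pass to f.write().
output = '\n'.join(lines) + '\n'
```

A plain '\n' is used because files opened in text mode already translate it to the platform's line ending on write.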