extracting text from html-documents

Julian

Nov 28, 2007, 12:35:31 PM

to beautifulsoup

hi there,

i want to extract the text from some html-documents. this is how i do
it:

bsoup is my instance of BeautifulSoup.BeautifulSoup:

# collect all comment nodes
comments = bsoup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
# remove them from the tree
for comment in comments:
    comment.extract()
# keep only the text nodes under <body>
body = bsoup.body(text=True)
# text now holds the plain text of the html-doc
text = ''.join(body)

this works quite well, with one problem: some comment-like blocks
are not deleted (due to the "//" inside them, i think).

is this a (known) bug/problem? how can i solve/get around this?

thanks for answers!

Yurietc

Nov 29, 2007, 3:55:30 AM

to beauti...@googlegroups.com

Hello, Julian.
I can't reproduce the problem :

>>> s="""<html><body><p>text1</p><p>text2</p></body></html>"""
>>> from BeautifulSoup import BeautifulSoup as BS
>>> from BeautifulSoup import Comment
>>> bsoup=BS(s)
>>> comments = bsoup.findAll(text=lambda text:isinstance(text, Comment))

>>> [comment.extract() for comment in comments]

[None]
>>> body = bsoup.body(text=True)
>>> text = ''.join(body)
>>> print text
text1text2

All comments with // were deleted. Maybe your problem is caused by
some invalid HTML.
One possible way (but not the best) is to delete the comments using
regular expressions:

>>> import re
>>> s="""<html><body><p>text1</p><p>text2</p></body></html>"""
>>> s2 = re.compile('<!--.*?-->', re.DOTALL).sub('', s)
>>> print s2
<html><body><p>text1</p><p>text2</p></body></html>
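
For anyone reading this later, here is a minimal Python 3 sketch of the same regex idea. The `strip_comments` helper and the sample string are made up for illustration and are not from the thread; the pattern assumes comments are plain `<!-- ... -->` blocks.

```python
import re

# Non-greedy match so each comment is removed on its own; DOTALL lets
# "." span line breaks inside multi-line comments.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Return html with every <!-- ... --> block removed."""
    return COMMENT_RE.sub("", html)


# Hypothetical sample, not taken from the thread.
sample = "<p>text1</p><!-- // note\nmore --><p>text2</p>"
cleaned = strip_comments(sample)
print(cleaned)  # <p>text1</p><p>text2</p>
```

Note that a pattern like this does not know about tags, so it would also cut `<!-- ... -->` out of the middle of a `<script>` element, which a parser treats differently.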

Windows XP; python 2.5.1; BS 3.0.4

Best regards, Yuriy.

Julian

Nov 29, 2007, 8:10:33 AM

to beautifulsoup

On 29 Nov., 09:55, Yurietc <Yuri...@gmail.com> wrote:
> Hello, Julian.
> I can't reproduce the problem :

oh, but try this:

import urllib2, BeautifulSoup

url = "http://www.spiegel.de"
request = urllib2.urlopen(url)
bsoup = BeautifulSoup.BeautifulSoup(request)

comments = bsoup.findAll(text=lambda text:isinstance(text,
BeautifulSoup.Comment))

[comment.extract() for comment in comments]

body = bsoup.body(text=True)
text = ''.join(body)

print text[-100:]

output:

Leserbriefe
Nachdrucke
Impressum

...?

Julian

Nov 29, 2007, 8:12:41 AM

to beautifulsoup

and i have to add that i want to use beautifulsoup precisely because i
DON'T want to use self-made regular expressions. they aren't as well
tested or as smart, i think.

Yurietc

Nov 29, 2007, 9:26:50 AM

to beauti...@googlegroups.com

I think it is because BS knows that "<!-- ... -->" inside
"<script>" tags is not a comment:

s='''<html><body><script type="text/javascript">

</script> <script type="text/javascript">

</script><br class="spBreakNoHeight" clear="all" /> 
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)

>>> comments = bsoup.findAll(text=lambda text:isinstance(text,
BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]

[None]

>>> body = bsoup.body(text=True)
>>> text = ''.join(body)
>>> print text

<!--
spFramebuster();

-->

<!--
OAS_RICH('x70');

-->

and without <script>:

s='''<html><body>

<br class="spBreakNoHeight" clear="all" /> 
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)

>>> comments = bsoup.findAll(text=lambda text:isinstance(text,
BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]

[None, None, None]

>>> body = bsoup.body(text=True)
>>> text = ''.join(body)

>>> text
u'\n\n\n \n'

and with <script> but without "//":

s='''<html><body><script type="text/javascript">

</script> <script type="text/javascript">

</script><br class="spBreakNoHeight" clear="all" /> 
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)

>>> comments = bsoup.findAll(text=lambda text:isinstance(text,
BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]

[None]

>>> body = bsoup.body(text=True)
>>> text = ''.join(body)
>>> print text

<!--
spFramebuster();

-->

<!--
OAS_RICH('x70');

-->
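
The same behaviour can be seen without BeautifulSoup. Below is a small sketch using Python 3's standard `html.parser`, which is a different parser from the one used in this thread, so treat it only as an illustration. `CommentProbe`, the sample markup and `hypotheticalCall()` are invented names. The probe records what the parser reports as a comment and what it reports as script text.

```python
from html.parser import HTMLParser


class CommentProbe(HTMLParser):
    """Record what the parser reports as comments and as script text."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_comment(self, data):
        self.comments.append(data)

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect them all.
        if self._in_script:
            self.script_chunks.append(data)


# Hypothetical markup: one real comment, one "<!-- -->" wrapper in a script.
sample = (
    "<body><!-- real comment -->"
    "<script>\n<!--\nhypotheticalCall();\n// -->\n</script></body>"
)
probe = CommentProbe()
probe.feed(sample)
probe.close()
script_text = "".join(probe.script_chunks)

print(probe.comments)
print("<!--" in script_text)
```

Only the comment outside the script should show up in `probe.comments`; the `<!-- ... -->` wrapper inside `<script>` comes back as ordinary script text, which matches what the sessions above show.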

So you have to remove all <script> tags, together with their content,
just as you removed the comments:

>>> s='''<html><body><script type="text/javascript">

</script> <script type="text/javascript">

</script><br class="spBreakNoHeight" clear="all" /> 
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)

>>> comments = bsoup.findAll(text=lambda text:isinstance(text,
BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]

[None]
>>> c=bsoup.findAll('script')
>>> for i in c:
...     i.extract()

>>> body = bsoup.body(text=True)
>>> text = ''.join(body)

>>> text
u' \n'
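
As a rough standard-library counterpart to the steps above (drop comments, drop scripts, join the remaining text), here is a Python 3 sketch. It is not the BeautifulSoup approach from this thread: `VisibleText`, `visible_text` and the sample markup are assumptions for illustration, and skipping `<style>` as well is my own addition.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so only skipped tags need a check.
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(html):
    """Return the concatenated text nodes of html, minus script/style content."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Hypothetical sample, not taken from the thread.
sample = (
    "<html><body><p>text1</p><!-- // note -->"
    "<script>\n<!--\nhypotheticalCall();\n// -->\n</script>"
    "<p>text2</p></body></html>"
)
text = visible_text(sample)
print(text)  # text1text2
```

Unlike `bsoup.body(text=True)`, this sketch walks the whole document, so text in `<head>` (a `<title>`, for example) would be included too.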

Julian

Nov 29, 2007, 9:56:59 AM

to beautifulsoup

GREAT thank you very much!
