extracting text from html-documents

Julian

Nov 28, 2007, 12:35:31 PM
to beautifulsoup
Hi there,

I want to extract the text from some HTML documents. This is how I do
it (bsoup is my instance of BeautifulSoup.BeautifulSoup):

# collect all comments
comments = bsoup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
# delete all comments
[comment.extract() for comment in comments]
# get only the text from the <body>
body = bsoup.body(text=True)
# text now holds the text of the HTML document
text = ''.join(body)

This works quite well, with one problem: something like

<!--
OAS_RICH('x70');
// -->

is not deleted (because of the "//", I think).

Is this a known bug? How can I work around it?

Thanks for any answers!

Yurietc

Nov 29, 2007, 3:55:30 AM
to beauti...@googlegroups.com
Hello, Julian.
I can't reproduce the problem:

>>> s="""<html><body><p>text1</p><!--
OAS_RICH('x70');
// --><p>text2</p></body></html>"""
>>> from BeautifulSoup import BeautifulSoup as BS
>>> from BeautifulSoup import Comment
>>> bsoup=BS(s)
>>> comments = bsoup.findAll(text=lambda text: isinstance(text, Comment))
>>> [comment.extract() for comment in comments]
[None]
>>> body = bsoup.body(text=True)
>>> text = ''.join(body)
>>> print text
text1text2

All comments containing // were deleted. Maybe your problem is caused by
some invalid HTML.
One possible way (but not the best) is to delete the comments with a
regular expression:

>>> s="""<html><body><p>text1</p><!--
OAS_RICH('x70');
// --><p>text2<!--other comment // --></p></body></html>"""
>>> import re
>>> s2 = re.compile('<!--.*?-->', re.DOTALL).sub('', s)
>>> print s2
<html><body><p>text1</p><p>text2</p></body></html>

Windows XP; python 2.5.1; BS 3.0.4


Best regards, Yuriy.

Julian

Nov 29, 2007, 8:10:33 AM
to beautifulsoup


On 29 Nov., 09:55, Yurietc <Yuri...@gmail.com> wrote:
> Hello, Julian.
> I can't reproduce the problem :

oh, but try this:

import urllib2, BeautifulSoup

url = "http://www.spiegel.de"
request = urllib2.urlopen(url)
bsoup = BeautifulSoup.BeautifulSoup(request)
comments = bsoup.findAll(text=lambda text:isinstance(text,
BeautifulSoup.Comment))
[comment.extract() for comment in comments]
body = bsoup.body(text=True)
text = ''.join(body)

print text[-100:]

output:

Leserbriefe
Nachdrucke
Impressum



<!--
spFramebuster();
// -->

<!--
OAS_RICH('x70');
// -->


...?

Julian

Nov 29, 2007, 8:12:41 AM
to beautifulsoup
And I should add that I want to use BeautifulSoup precisely because I
DON'T want to use self-made regular expressions. They aren't as well
tested or as smart, I think.

Yurietc

Nov 29, 2007, 9:26:50 AM
to beauti...@googlegroups.com
I think this is because BS knows that "<!--something-->" inside a
<script> element is not a comment, so it is kept as ordinary text:

s='''<html><body><script type="text/javascript">
<!--
spFramebuster();
-->
</script> <script type="text/javascript">
<!--
OAS_RICH('x70');
-->
</script><br class="spBreakNoHeight" clear="all" /> <!-- ##SPONTAG:
LAYER## -->
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)
>>> comments = bsoup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]
[None]
>>> body = bsoup.body(text=True)
>>> text = ''.join(body)
>>> print text

<!--
spFramebuster();


-->

<!--
OAS_RICH('x70');

-->

and without <script>:

s='''<html><body>
<!--
spFramebuster();
// -->

<!--
OAS_RICH('x70');
// -->

<br class="spBreakNoHeight" clear="all" /> <!-- ##SPONTAG: LAYER## -->
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)
>>> comments = bsoup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]
[None, None, None]
>>> body = bsoup.body(text=True)
>>> text = ''.join(body)

>>> text
u'\n\n\n \n'

and with <script> but without "//":

s='''<html><body><script type="text/javascript">
<!--
spFramebuster();
-->
</script> <script type="text/javascript">
<!--
OAS_RICH('x70');
-->
</script><br class="spBreakNoHeight" clear="all" /> <!-- ##SPONTAG:
LAYER## -->
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)
>>> comments = bsoup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]
[None]
>>> body = bsoup.body(text=True)
>>> text = ''.join(body)
>>> print text

<!--
spFramebuster();


-->

<!--
OAS_RICH('x70');

-->
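
A side note for readers on Python 3: the behaviour in the sessions above can also be observed without BeautifulSoup. This is only a minimal sketch, assuming the standard library's html.parser (not the BS 3.0.4 parser used in this thread); NodeLogger is a made-up name for illustration. It records which pieces of the input the parser reports as comments and which as plain data.

```python
from html.parser import HTMLParser


class NodeLogger(HTMLParser):
    """Hypothetical helper: record how the parser classifies non-tag content."""

    def __init__(self):
        super().__init__()
        self.comments = []  # bodies of real <!-- ... --> comment nodes
        self.data = []      # everything else, including raw <script> content

    def handle_comment(self, text):
        self.comments.append(text)

    def handle_data(self, text):
        self.data.append(text)


# Same shape as the spiegel.de snippet: a "comment" inside <script>,
# followed by a real comment outside of it.
html = (
    '<html><body><script type="text/javascript">\n'
    "<!--\nOAS_RICH('x70');\n// -->\n"
    "</script><!-- ##SPONTAG: LAYER## --></body></html>"
)

logger = NodeLogger()
logger.feed(html)
logger.close()

print(logger.comments)
print(logger.data)
```

With this parser, the "<!-- ... -->" inside <script> should show up in logger.data, and only the SPONTAG comment in logger.comments, which matches what BS does above.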

So you have to remove all <script> tags together with their content,
just as you removed the comments:

>>> s='''<html><body><script type="text/javascript">
<!--
spFramebuster();
// -->
</script> <script type="text/javascript">


<!--
OAS_RICH('x70');
// -->

</script><br class="spBreakNoHeight" clear="all" /> <!-- ##SPONTAG:
LAYER## -->
</body>
</html> '''
>>> bsoup = BeautifulSoup.BeautifulSoup(s)
>>> comments = bsoup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
>>> [comment.extract() for comment in comments]
[None]
>>> c = bsoup.findAll('script')
>>> for i in c:
...     i.extract()
...
>>> body = bsoup.body(text=True)
>>> text = ''.join(body)

>>> text
u' \n'
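
For anyone reading this on Python 3: the same two steps can be sketched with the successor package bs4. This is not what was run in this thread (that was BS 3.0.4 on Python 2.5.1), and it assumes bs4 is installed; the sample HTML is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup, Comment

# Shortened stand-in for the page: one script-wrapped "comment",
# one real comment, and two paragraphs of visible text.
html = """<html><body><p>text1</p><script type="text/javascript">
<!--
OAS_RICH('x70');
// -->
</script><!-- ##SPONTAG: LAYER## --><p>text2</p></body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: remove real comment nodes.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

# Step 2: remove <script> elements together with their content.
for script in soup.find_all("script"):
    script.decompose()

# Only the visible text of <body> is left.
text = soup.body.get_text()
print(text)
```

Removing the <script> elements explicitly keeps the result independent of how a given parser or bs4 version treats script content.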

Julian

Nov 29, 2007, 9:56:59 AM
to beautifulsoup
GREAT thank you very much!