Slight Bug report : Soup.text and inner tags: no spacing


c2_4b

Apr 19, 2012, 1:23:41 PM
to beautifulsoup
Hello,

I've been using BS for a while and I noticed something problematic,
but I think it should be quite easy to fix.

When I extract the text of an HTML tag I normally use the .text
attribute. It correctly retrieves the text I need, BUT with no
spacing where the inner tags were, and that's quite problematic.
When stripping an inner tag, it should replace the tag with a single
space to preserve the separation.


Example:
soup = soup.find("span", { "class" : "OrigineDefinition" })
#<span class="OrigineDefinition">(mot espagnol, du tagal <i>abaka</i>)</span>
result = soup.text
#(mot espagnol, du tagalabaka)

In this case I would like to get the following result, with the space,
instead:
#(mot espagnol, du tagal abaka)

Does anyone know an easy workaround? Or a way to fix this small detail
inside BeautifulSoup?

Thanks in advance.




Leonard Richardson

Apr 26, 2012, 10:20:15 AM
to beauti...@googlegroups.com
It looks like you're using Beautiful Soup 3. In Beautiful Soup 4, the
default behavior of .text is what you asked for:

>>> from bs4 import BeautifulSoup
>>> markup = '<span class="OrigineDefinition">(mot espagnol, du tagal <i>abaka</i>)</span>'
>>> soup = BeautifulSoup(markup)
>>> soup.find("span", { "class" : "OrigineDefinition" }).text
u'(mot espagnol, du tagal abaka)'

The .text attribute is an alias for the get_text() method, which is
fairly customizable:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text
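
As a rough sketch of that customization (assuming Beautiful Soup 4 is installed and using the "html.parser" backend, which is a choice made here for illustration), get_text() accepts a separator string and a strip flag. The variable names below are illustrative only:

```python
from bs4 import BeautifulSoup

markup = '<span class="OrigineDefinition">(mot espagnol, du tagal <i>abaka</i>)</span>'
# "html.parser" is an assumption; any installed parser should behave the same here.
soup = BeautifulSoup(markup, "html.parser")
span = soup.find("span", {"class": "OrigineDefinition"})

# Default: the text pieces are concatenated as-is, so whitespace that is
# already present in the markup (the space after "tagal") is preserved.
plain = span.get_text()
print(repr(plain))

# With a separator and strip=True, each piece is stripped first and the
# pieces are then joined with the separator. This guarantees a gap between
# pieces, but it also puts one before the closing parenthesis.
spaced = span.get_text(" ", strip=True)
print(repr(spaced))
```

The second form is useful when the markup has no whitespace around inner tags, at the cost of occasionally adding a space where none was wanted.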

Leonard