How to handle �


Link Swanson

Apr 10, 2012, 4:33:52 PM
to beautifulsoup
Some data that I am scraping contains characters like this:

�

How do I remove them?


Leonard Richardson

Apr 10, 2012, 5:56:06 PM
to beauti...@googlegroups.com
On Tue, Apr 10, 2012 at 4:33 PM, Link Swanson <li...@mustbuilddigital.com> wrote:
> Some data that I am scraping contains characters like this:
>
> �
>
> How do I remove them?

� is called REPLACEMENT CHARACTER. Beautiful Soup may use REPLACEMENT
CHARACTER to replace characters that can't be converted to Unicode, as
described here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings

You can do search-and-replace with code like this:

>>> for s in list(soup.strings):
...     s.replace_with(s.replace(u"\N{REPLACEMENT CHARACTER}", ""))

However, it's possible that your terminal is printing � to represent
*other* characters that it can't display. In that case, your document
does not actually contain REPLACEMENT CHARACTER. You'll have to
identify what those characters actually are, and replace them instead.
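
One way to identify the characters is to list the code point and Unicode name of every non-ASCII character in a string. The following is a minimal sketch using only the standard library; `describe_non_ascii` and the sample text are hypothetical names invented for illustration, not part of Beautiful Soup.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (character, code point, Unicode name) for each distinct
    non-ASCII character in text, in order of first appearance."""
    seen = []
    for ch in text:
        if ord(ch) > 127 and ch not in seen:
            seen.append(ch)
    # Control characters have no Unicode name, so supply a placeholder.
    return [(ch, "U+%04X" % ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in seen]

# Hypothetical sample: a stray C1 control character and a real
# REPLACEMENT CHARACTER.
sample = u"It\u0092s broken \ufffd here"
for ch, codepoint, name in describe_non_ascii(sample):
    print(codepoint, name)
```

If the output shows something other than U+FFFD, the document does not contain REPLACEMENT CHARACTER and the terminal is substituting it for display.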

Leonard

Link Swanson

Apr 10, 2012, 6:06:41 PM
to beauti...@googlegroups.com
Ok, thanks, I will read that part of the doc more carefully.

That character was from the source HTML as rendered in a browser. My terminal actually prints them like this:

’

which I believe in Python would be u"\u0092", as per http://www.fileformat.info/info/unicode/char/92/index.htm, or a "curly apostrophe".

Ideally I would like to replace all the curly stuff with their non-curly equivalents using Python.
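
Replacing curly punctuation with straight equivalents can be sketched with `str.translate`. This assumes Python 3 and assumes the text already contains the real typographic quotation marks (U+2018, U+2019, U+201C, U+201D); a stray U+0092 is a different character and is not touched by this table. `STRAIGHTEN` and `straighten_quotes` are hypothetical names.

```python
# Map typographic ("curly") quotation marks to plain ASCII equivalents.
STRAIGHTEN = str.maketrans({
    u"\u2018": u"'",   # LEFT SINGLE QUOTATION MARK
    u"\u2019": u"'",   # RIGHT SINGLE QUOTATION MARK
    u"\u201c": u'"',   # LEFT DOUBLE QUOTATION MARK
    u"\u201d": u'"',   # RIGHT DOUBLE QUOTATION MARK
})

def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ones."""
    return text.translate(STRAIGHTEN)

print(straighten_quotes(u"\u201cdon\u2019t\u201d"))
```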

Thanks again for responding to my ignorance

--
Link Swanson
Must Build Digital


LunkRat

Apr 24, 2012, 3:20:14 PM
to beautifulsoup
I have discovered that what I am dealing with is "Windows CP1252
Characters" or 'gremlins' - I would like to just map them all with
something like this:

http://effbot.org/zone/unicode-gremlins.htm

Can anyone recommend a good way to implement this so that when these
characters appear in my soup tree, they are replaced with their Unicode
equivalents?

I am just not sure of the best way to send my parsed data through a
function like this.

Thanks!
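
A gremlin mapping of this kind can be sketched without a hand-written table, by reinterpreting each C1 control character (U+0080 to U+009F) as the Windows-1252 byte with the same value. This is a minimal Python 3 sketch, not the effbot code; `fix_cp1252_gremlins` is a hypothetical name.

```python
def fix_cp1252_gremlins(text):
    """Replace C1 control characters (U+0080..U+009F) with the characters
    that the same byte values represent in Windows-1252."""
    fixed = []
    for ch in text:
        if u"\u0080" <= ch <= u"\u009f":
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # Bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in
                # Python's cp1252 codec; leave those characters unchanged.
                pass
        fixed.append(ch)
    return u"".join(fixed)

print(repr(fix_cp1252_gremlins(u"It\u0092s")))
```

To send parsed data through it, the search-and-replace pattern from earlier in this thread should carry over: for each `s` in `list(soup.strings)`, call `s.replace_with(fix_cp1252_gremlins(s))`.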

On Apr 10, 5:06 pm, Link Swanson <l...@mustbuilddigital.com> wrote:
> Ok, thanks, I will read that part of the doc more carefully.
>
> That character was from the source HTML as rendered in a browser. My
> terminal actually prints them like this:
>
> ’
>
> which I believe is what in python would be u"\u0092" as per
> http://www.fileformat.info/info/unicode/char/92/index.htm or a "curly
> apostrophe"
>
> Ideally I would like to replace all the curly stuff with their non-curly
> equivalents using python.
>
> Thanks again for responding to my ignorance

Leonard Richardson

Apr 25, 2012, 9:17:27 AM
to beauti...@googlegroups.com
Beautiful Soup is supposed to turn all input characters into Unicode
characters. It sounds like your document is in Windows-1252, but
Beautiful Soup isn't detecting it as Windows-1252.

Look at `soup.original_encoding` and see what it says.

You can try to force an encoding by specifying the 'from_encoding'
argument to the Beautiful Soup constructor:

>>> BeautifulSoup(markup, from_encoding="windows-1252")

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
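
Why the detected encoding matters can be shown with plain byte decoding, independent of Beautiful Soup. The sketch below assumes, as one plausible explanation rather than a diagnosis, that the page was decoded as ISO-8859-1 when it was really Windows-1252; the sample bytes are hypothetical.

```python
# Byte 0x92 as it might appear in a Windows-1252 encoded page.
raw = b"It\x92s"

# Decoded as ISO-8859-1, 0x92 becomes the control character U+0092.
as_latin1 = raw.decode("iso-8859-1")

# Decoded as Windows-1252, 0x92 becomes U+2019, the curly apostrophe.
as_cp1252 = raw.decode("windows-1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

If `soup.original_encoding` reports something other than Windows-1252, the same mismatch may be happening inside the parser, which is what `from_encoding` is meant to override.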

If you show me your code and the markup you're trying to parse, I can
make more specific recommendations.

Leonard