how to remove 'FFFD' character

webcomm

Jan 9, 2009, 12:46:44 PM
Does anyone know a way to remove the 'FFFD' character with Python?

You can see the browser output I'm dealing with here:
http://webcomm.webfactional.com/htdocs/fffd.JPG
I deleted a big chunk out of the middle of that JPG to protect
sensitive data.

I don't know what the character encoding of this data is and don't
know what the 'FFFD' represents. I guess it is something that can't
be represented in whatever this particular encoding is, or maybe it is
something corrupt that can't be represented in any encoding. I just
want to scrub it out. I tried this...

clean = txt.encode('ascii','ignore')

...but the 'FFFD' still comes through. Other ideas?

Thanks,
Ryan

Carsten Haese

Jan 9, 2009, 2:12:44 PM
webcomm wrote:
> I don't know what the character encoding of this data is and don't
> know what the 'FFFD' represents.

The code point 0xFFFD (U+FFFD) is the so-called 'REPLACEMENT CHARACTER'. It is
used to replace an incoming character whose value is unknown or
unrepresentable in Unicode. The browser might display these if, for
example, a page is encoded in latin-1 but claims to be utf-8, because the
byte stream will then contain byte sequences that can't be decoded into
Unicode code points.
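
To make that concrete, here is a minimal sketch of how such a mismatch can produce the replacement character. It assumes Python 3, and the sample text and variable names are made up for illustration; they are not taken from the data in question.

```python
# Bytes for "café au lait" encoded as latin-1 (é becomes the single byte 0xE9).
latin1_bytes = 'caf\u00e9 au lait'.encode('latin-1')

# Decoding those bytes as UTF-8 fails at 0xE9, which is not a valid
# UTF-8 sequence here. With errors='replace' the decoder substitutes U+FFFD.
decoded = latin1_bytes.decode('utf-8', errors='replace')

print(ascii(decoded))
```

A browser does roughly the same substitution when the declared encoding doesn't match the actual bytes, which is one plausible way the character could end up on the page.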

> I just
> want to scrub it out. I tried this...
>
> clean = txt.encode('ascii','ignore')
>
> ...but the 'FFFD' still comes through.

You must be doing something wrong, then:

py> u'Hello,\ufffd World'.encode('ascii', 'ignore')
'Hello, World'
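
If the goal is only to drop U+FFFD and keep everything else, a narrower approach may be worth trying. This is a minimal sketch, assuming Python 3 and assuming the text is already a decoded string; the helper name strip_replacement_chars is hypothetical.

```python
REPLACEMENT_CHAR = '\ufffd'  # U+FFFD REPLACEMENT CHARACTER


def strip_replacement_chars(text):
    """Return text with every U+FFFD removed; other characters are untouched."""
    return text.replace(REPLACEMENT_CHAR, '')


clean = strip_replacement_chars('Hello,\ufffd World')
print(clean)
```

Unlike encode('ascii', 'ignore'), this leaves other non-ASCII characters in place. It does not recover whatever the original character was; that information is already gone by the time U+FFFD appears.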

HTH,

--
Carsten Haese
http://informixdb.sourceforge.net
