how to remove 'FFFD' character


webcomm

Jan 9, 2009, 12:46:44 PM
Does anyone know a way to remove the 'FFFD' character with python?

You can see the browser output I'm dealing with here:
http://webcomm.webfactional.com/htdocs/fffd.JPG
I deleted a big chunk out of the middle of that JPG to protect
sensitive data.

I don't know what the character encoding of this data is and don't
know what the 'FFFD' represents. I guess it is something that can't
be represented in whatever this particular encoding is, or maybe it is
something corrupt that can't be represented in any encoding. I just
want to scrub it out. I tried this...

clean = txt.encode('ascii','ignore')

...but the 'FFFD' still comes through. Other ideas?

Thanks,
Ryan

Carsten Haese

Jan 9, 2009, 2:12:44 PM
webcomm wrote:
> I don't know what the character encoding of this data is and don't
> know what the 'FFFD' represents.

The code point 0xFFFD is the so-called 'REPLACEMENT CHARACTER'. It is
used to replace an incoming character whose value is unknown or
unrepresentable in Unicode. The browser might display these if, for
example, a page is encoded in latin-1 but claims to be utf-8, so the
byte stream contains byte sequences that can't be decoded into
Unicode code points.
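
To illustrate that mismatch, here is a minimal sketch in Python 3 syntax (the
session further down uses Python 2). It assumes the latin-1-served-as-utf-8
scenario described above; the sample word and the variable names are made up
for the example and do not come from the original data.

```python
# Assumption: the text was written out as Latin-1 ...
raw = "café".encode("latin-1")            # the bytes b'caf\xe9'

# ... but the consumer decodes it as UTF-8. The lone byte 0xE9 is not a
# valid UTF-8 sequence, so errors="replace" substitutes U+FFFD for it.
decoded = raw.decode("utf-8", errors="replace")

print(repr(decoded))                      # shows the replacement character
print("\ufffd" in decoded)
```

If this is what happened upstream, the original character is already lost by
the time U+FFFD shows up; removing it only tidies the output.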

> I just
> want to scrub it out. I tried this...
>
> clean = txt.encode('ascii','ignore')
>
> ...but the 'FFFD' still comes through.

You must be doing something wrong, then:

py> u'Hello,\ufffd World'.encode('ascii', 'ignore')
'Hello, World'
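
Note that encoding to ASCII with 'ignore' drops every non-ASCII character,
not just U+FFFD. If the rest of the text should survive, one possible
alternative is to remove only the replacement character. This is a sketch in
Python 3 syntax; the helper name and sample string are hypothetical, and it
assumes txt is already a decoded text string rather than raw bytes.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def strip_replacement_chars(text):
    """Return text with every U+FFFD removed; other characters are kept."""
    return text.replace(REPLACEMENT_CHAR, "")


txt = "Hello,\ufffd W\u00f6rld"           # hypothetical sample input
clean = strip_replacement_chars(txt)
print(clean)
```

Unlike the ASCII round trip, this leaves accented letters such as the "ö" in
the sample untouched.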

HTH,

--
Carsten Haese
http://informixdb.sourceforge.net
