UnicodeDecodeError: 'utf8' codec can't decode


Yateen

Jul 1, 2010, 10:55:35 AM
to Django users
Hi,

I am using a PostgreSQL database with Django. My database holds an
HTTP URL that contains some non-ASCII characters, and a plain SQL
query on the table returns it successfully.

select http_url from mytable limit 1;
http_url
----------------------------------------
http://östrogenfrei.de/verhuetung.html


If I fetch the same data through Django, I get the following error:
>>> from util import *
>>> cursor = connection.cursor()
>>> query="select http_url from cfedr_raw_data_20100526_24860_1277981101 where http_url like '%rogenfrei%'"
>>> cursor.execute(query)
>>> data = []
>>> for item in cursor.fetchall():
...     print item
...     data.append(item)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/firstpool/yjoshi/permanent/starbi/python2.6.1/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 7-10: invalid data

Can anyone please shed some light on this? Why is this occurring, and
what is the solution?

Thanks in advance,

Yateen..

Bill Freeman

Jul 1, 2010, 3:30:02 PM
to django...@googlegroups.com
What's in the database probably isn't legal UTF-8. It is easily possible
to have a sequence of characters in some other encoding which only
results in the wrong characters if treated as UTF-8, but it is also possible
to violate the UTF-8 structure with such a sequence. PostgreSQL, if
set for UTF-8, may still not care about the contents of character or
text columns, so long as they don't offend the quoting.

Look at the actual byte sequence in that particular entry, for example
in hexadecimal. If a byte in which the 0x80 bit is zero is followed by
a byte in which the 0x80 bit is set, that second byte is the beginning
of a multi-byte encoding of a Unicode code point. In that lead byte,
the number of contiguous one bits, starting with and including the
0x80 bit and going from most significant to least significant, before
the first zero bit, is the number of bytes in the character. There
must be at least 2 bytes, so the 0x40 bit must also be set. Any
additional bytes required must have their 0x80 bit a one and their
0x40 bit a zero (continuation bytes). Continuation bytes contribute
6 bits each to the construction of an integer; the lead byte of an
n-byte sequence contributes 7-n bits. Byte values 0xFE and 0xFF are
never valid. Bytes not part of a multi-byte sequence may not have
a one in their 0x80 bit.
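
The byte-level rules above can be sketched as a small checker. This is
only an illustration, not how Python or Django decode UTF-8: the
function name is made up, it checks structure only (it does not reject
overlong forms or surrogates), and it caps sequences at 4 bytes, which
is the limit in current UTF-8. The sample input assumes the stored URL
holds ö as the single Latin-1 byte 0xF6; the thread never shows the
raw bytes, so that is a guess.

```python
def utf8_structure_errors(data):
    """Return (offset, reason) pairs for bytes that break the UTF-8 layout.

    Hypothetical helper: a structural check only, following the rules
    described in the post above.
    """
    data = bytearray(data)
    problems = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # 0x80 bit clear: a plain single-byte (ASCII) character.
            i += 1
            continue
        # Count the contiguous one bits starting at the 0x80 bit.
        ones = 0
        mask = 0x80
        while mask and (byte & mask):
            ones += 1
            mask >>= 1
        if ones == 1:
            problems.append((i, "continuation byte without a lead byte"))
            i += 1
            continue
        if ones > 4:
            # Covers 0xF8..0xFF, including the never-valid 0xFE and 0xFF.
            problems.append((i, "invalid lead byte"))
            i += 1
            continue
        tail = data[i + 1:i + ones]
        if len(tail) != ones - 1 or any((b & 0xC0) != 0x80 for b in tail):
            problems.append(
                (i, "lead byte for %d bytes lacks continuation bytes" % ones))
            i += 1
            continue
        i += ones
    return problems


# Assumed sample data: the same URL prefix in Latin-1 and in UTF-8.
latin1_url = b"http://\xf6strogenfrei.de"
utf8_url = b"http://\xc3\xb6strogenfrei.de"
print(utf8_structure_errors(latin1_url))
print(utf8_structure_errors(utf8_url))
```

Under that assumption, 0xF6 has four leading one bits, so it claims a
4-byte sequence starting at offset 7. That would be consistent with
the "position 7-10" in the traceback.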

Perhaps some other piece of software has dumped something into
PostgreSQL using, say, Latin-1 or Latin-8, etc.


Yateen

Jul 5, 2010, 8:52:51 AM
to Django users
Thanks Bill.

Do you mean that Postgres should have thrown errors as well?
My worry is different here. The characters that I am getting are valid
contents of an HTTP URL, and my parser is able to parse them and put
them in the database. However, the Django interface is not able to
read them. If I have to accept that the data might be in some
different encoding (and I will not know what type of encoded URLs I
will be getting), how do I handle that? Is there any particular
setting that can be applied here?

Thanks in advance.

Yateen..

Bill Freeman

Jul 6, 2010, 10:17:04 AM
to django...@googlegroups.com
I doubt that you can fault Postgres. It doesn't need to care about
the encoding of contents, other than it must find the end of the string
(and the conventions used may depend on the interface and
connection settings).

When you say contents of a url, do you mean the url itself, or the
page referred to?

Your parser is written in what? And what encoding does it use
when it talks to Postgres? (Hint: it needs to match the encoding
that Django is using.)

Have you looked at what byte sequence is in the DB? If you have
data in the database which, according to Django's understanding
of the DB settings, is invalid, then it's not Django's job to fix it for
you.
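
One way to look at a byte sequence is a plain hex dump. The sketch
below is only an illustration: hex_bytes is a made-up helper, and the
sample string stands in for a value fetched from the database as raw
bytes. It shows how the same URL looks in two encodings, so a dump of
the stored value can be compared against both.

```python
import binascii


def hex_bytes(data):
    """Format a byte string as space-separated hex pairs (hypothetical helper)."""
    hexed = binascii.hexlify(data).decode("ascii")
    return " ".join(hexed[i:i + 2] for i in range(0, len(hexed), 2))


# Assumed sample value; in practice this would be the raw column contents.
url = u"http://\xf6strogenfrei.de"
print(hex_bytes(url.encode("latin-1")))  # ö is the single byte f6
print(hex_bytes(url.encode("utf-8")))    # ö is the two bytes c3 b6
```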

N.B.: There's no guarantee that the web page correctly identified its
encoding either.

Bill

Yateen

Jul 6, 2010, 11:03:12 AM
to Django users
Hi Bill,

Thanks.
You were right. The Postgres encoding and Django encoding are
different.
The parser is written in Python. The Postgres encoding was SQL_ASCII.
I changed it to UTF8, and the parser then failed to insert into the
DB! I believe I need to fix that first.
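
A possible reason for the failed inserts is that the parser hands over
bytes that are not valid UTF-8, which a UTF8 database refuses. One
common workaround, sketched here under assumptions, is to normalize
every value to UTF-8 before inserting. The helper name and the Latin-1
fallback are guesses, not something the thread confirms. Latin-1 maps
every byte to some character, so the fallback never fails, but it can
silently produce the wrong characters if the data was in another
encoding.

```python
def to_utf8_bytes(raw, fallback="latin-1"):
    """Return raw re-encoded as UTF-8 (hypothetical helper).

    Bytes that already decode as UTF-8 are kept as they are; anything
    else is assumed to be in the fallback encoding.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(fallback)
    return text.encode("utf-8")


# Assumed sample inputs: the same URL prefix in Latin-1 and in UTF-8.
print(repr(to_utf8_bytes(b"http://\xf6strogenfrei.de")))
print(repr(to_utf8_bytes(b"http://\xc3\xb6strogenfrei.de")))
```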


Yateen

Jul 22, 2010, 8:58:08 AM
to Django users
OK, I made some changes and things appear to be working.

My intention was to receive URLs, parse them to get the base URL,
store them in the database (Postgres), and then, in response to an
HTTP query, retrieve these URLs through the Django interface (via
psycopg2) and display them to the user in a table in the browser.

I do not know whether I should really be worrying about encoding and
decoding here, as I just wanted to get the characters as they came in.
Hence, I tried the changes below, and they worked fine.

Here are some samples of the URLs that I processed, which work with
the changes below.

http://003-sexo-mulheres-nuas.ck7.net
http://live.žšcr.com/host  << some characters before "cr" cannot be copied here
http://östrogenfrei.de/verhuetung.html


I needed the following changes:

- Keep the Postgres client encoding as sql_ascii.

- Make the changes below in the following Django modules:

./python2.6.1/lib/python2.6/site-packages/django//contrib/syndication/feeds.py
This change replaces the Unicode conversion of our URLs and data,
which is probably unnecessary, with iri_to_uri. iri_to_uri is harmless
and works for us.

135,136c135
< url = iri_to_uri(enc_url),
< #url = smart_unicode(enc_url),
---
> url = smart_unicode(enc_url),
138,139c137
< mime_type = iri_to_uri(self.__get_dynamic_attr('item_enclosure_mime_type', item))
< #mime_type = smart_unicode(self.__get_dynamic_attr('item_enclosure_mime_type', item))
---
> mime_type = smart_unicode(self.__get_dynamic_attr('item_enclosure_mime_type', item))


./python2.6.1/lib/python2.6/site-packages/django//db/backends/postgresql/base.py
Same philosophy as above.
Additionally, sql_ascii is used as the character set wherever possible.

46,47c46
< #result[smart_str(key, charset)] = smart_str(value, charset)
< result[smart_str(key, charset)] = iri_to_uri(value)
---
> result[smart_str(key, charset)] = smart_str(value, charset)
50,51c49
< return tuple([iri_to_uri(p) for p in params])
< #return tuple([smart_str(p, self.charset, True) for p in params])
---
> return tuple([smart_str(p, self.charset, True) for p in params])
54,55c52
< return self.cursor.execute(iri_to_uri(sql), self.format_params(params))
< #return self.cursor.execute(smart_str(sql, self.charset), self.format_params(params))
---
> return self.cursor.execute(smart_str(sql, self.charset), self.format_params(params))
128c125
< cursor = UnicodeCursorWrapper(cursor, 'sql_ascii')
---
> cursor = UnicodeCursorWrapper(cursor, 'utf-8')
137,138c134
< #return smart_unicode(s)
< return iri_to_uri(s)
---
> return smart_unicode(s)


./python2.6.1/lib/python2.6/site-packages/django//db/backends/postgresql_psycopg2/base.py
Disabled the registration of psycopg2's UNICODE type extension, as it
is not needed. We can safely expect whatever data we get from the
Django interface to be usable as is.
25c25
< #psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
---
> psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)

The change below looks redundant now; things work even without it.
./python2.6.1/lib/python2.6/site-packages/django//db/models/base.py
Sets the encoding to ascii.
277,278c277
< return force_unicode(self).encode('ascii')
< #return force_unicode(self).encode('utf-8')
---
> return force_unicode(self).encode('utf-8')



For processing, my views.py needed to handle the URLs slightly
differently before rendering the response back to HTML:
import urllib
url = urllib.quote_plus(received_url)

Also, in the HTML file where I was processing the URL, I needed to
'unescape' my URL.
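
The escape and unescape round trip described above can be sketched as
follows. The thread uses Python 2, where the function is
urllib.quote_plus; on Python 3 the same function lives in
urllib.parse, so the import below tries both. The sample URL and the
variable names are assumptions for illustration, and the unescaping is
done in Python here, although the post does it in the HTML page.

```python
try:
    from urllib.parse import quote_plus, unquote_plus  # Python 3
except ImportError:
    from urllib import quote_plus, unquote_plus  # Python 2, as in the thread

# Assumed sample value standing in for a URL read from the database.
received_url = u"http://\xf6strogenfrei.de/verhuetung.html"

# Encode to UTF-8 bytes first so every non-ASCII character becomes %NN.
escaped = quote_plus(received_url.encode("utf-8"))

# Reverse the escaping; Python 2 returns bytes here, Python 3 returns text.
restored = unquote_plus(escaped)
if isinstance(restored, bytes):
    restored = restored.decode("utf-8")

print(escaped)
```

Note that quote_plus also escapes ":" and "/", so the escaped value is
suitable as a query-string parameter rather than as a link target.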


Here is my request:

Can you please review these changes and this approach? Do you see any
major issue here?
I am sure there must be a reason this approach was not taken earlier,
but I am wondering what it is.

Thanks,

Bill Freeman

Jul 23, 2010, 10:51:18 AM
to django...@googlegroups.com
I don't really have enough context (or, at the moment, time) to do a
serious review. It may well be that you are safe. iri_to_uri() looks
like the key, since you almost certainly will trip over a value that isn't
eligible as a url (clients cut and paste from MS Word and equivalent
all the time).

But I would like to refute your thought that you shouldn't worry about
encoding and decoding. Everything is encoded. Even UTF-32 is an
encoding. You already have a database, and changing can be a pain,
but I'd like to lobby you, in future projects, to select UTF-8 for your
database encoding. For true ASCII characters (ord(c) < 128), in a
world of 8 bit bytes, the UTF-8 encoding requires the same number of
bytes as the ASCII encoding for both transmission and storage (they
are the same byte value). In environments where the local accented
characters are common, you might make an argument that latin-N will
save you space, but it also means that there are characters that you
can't represent. The most trouble free situation is when you use
unicode strings in python, as Django tries hard to do, and properly
configure your interfaces (http, database) to do the appropriate
encoding and decoding.
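
The size and coverage claims above can be checked with a few lines.
This is only a sketch: the helper name is made up, and the sample
characters are ö (U+00F6) and ž (U+017E), chosen because the first
exists in Latin-1 and the second does not.

```python
def encodable(text, codec):
    """Return True if text can be represented in codec (hypothetical helper)."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Pure ASCII text has identical bytes in ASCII and UTF-8.
ascii_url = u"http://003-sexo-mulheres-nuas.ck7.net"
same_bytes = ascii_url.encode("ascii") == ascii_url.encode("utf-8")

# An accented character is smaller in Latin-1 than in UTF-8 ...
latin1_size = len(u"\xf6".encode("latin-1"))
utf8_size = len(u"\xf6".encode("utf-8"))

# ... but Latin-1 cannot represent every character, while UTF-8 can.
z_in_latin1 = encodable(u"\u017e", "latin-1")
z_in_utf8 = encodable(u"\u017e", "utf-8")

print(same_bytes, latin1_size, utf8_size, z_in_latin1, z_in_utf8)
```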

Bill

Yateen

Aug 4, 2010, 6:40:09 AM
to Django users
Hi Bill, thanks for the valuable input. I found a better solution, and
I believe it is the simplest one. Better still, the solution is on the
application side and not on the Django side.
What I did was this:
When my parser starts reading data from files (whose encoding I don't
know), it first converts the URL information from the file into
appropriately encoded values using iri_to_uri. Also, the encoding of
the Postgres database is set to utf-8 instead of the default
sql_ascii. This setting in Postgres creates another problem for the
psycopg2 connection, so whenever I create a psycopg2 connection
object, I need to set the encoding on it. For example, if conn is my
connection object, then the moment it is created I call
conn.set_client_encoding('utf-8'). Things work fine thereafter.
The only change compared to the previous implementation is that, for
those special non-ASCII characters, I now see the corresponding
encoding in my display (as %NN), whereas in the earlier implementation
I saw the characters themselves in the GUI.
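
The application-side approach described above can be sketched roughly
like this. It is not the actual parser code: percent_encode_iri is a
made-up stand-in that approximates Django's iri_to_uri with the
standard library (the set of safe characters is an assumption), and
open_utf8_connection and its dsn argument are hypothetical. In a
Django project, django.utils.encoding.iri_to_uri itself would be the
function to call.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2, as in the thread


def percent_encode_iri(iri):
    """Percent-encode the non-ASCII parts of a URL (hypothetical stand-in).

    Reserved URL characters are left alone; everything else outside
    ASCII is written as %NN escapes of its UTF-8 bytes.
    """
    return quote(iri.encode("utf-8"), safe="/#%[]=:;$&()+,!?*@'~")


def open_utf8_connection(dsn):
    """Open a psycopg2 connection and set its client encoding right away."""
    import psycopg2  # imported lazily so the helper above works without it
    conn = psycopg2.connect(dsn)
    conn.set_client_encoding("UTF8")
    return conn


# Assumed sample value; this is the %NN form that ends up in the database.
print(percent_encode_iri(u"http://\xf6strogenfrei.de/verhuetung.html"))
```

Because the stored value is then plain ASCII, every component that
reads it sees the same bytes, at the cost of showing %NN escapes
instead of the original characters.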

One might ask which method is preferable.
I would go with the second one (changes on the application side). In
the former case, the story does not end there. If another component in
your application needs to fetch the data and process it (for example,
to export it to Excel), you have to handle that interface too, and
keep doing the same for every new interface that you introduce. On the
other hand, the changes in this method ensure that the data in the
database has a proper encoding, so that all the components can process
it comfortably.

I believe I should close this thread, having found an appropriate
solution. Still, comments and suggestions are welcome.

Thanks.
