Look at the actual byte sequence in that particular entry, for example
in hexadecimal. If a byte in which the 0x80 bit is zero is followed by
a byte in which the 0x80 bit is set, that second byte is the beginning
of a multi-byte encoding of a Unicode code point. In that first byte of
the sequence, the number of contiguous one bits, starting with and
including the 0x80 bit and working from most significant to least
significant, up to the first zero bit, is the number of bytes (n) in
the sequence. There must be at least 2 bytes, so the 0x40 bit must also
be set. Each additional byte must have its 0x80 bit a one and its 0x40
bit a zero (continuation bytes). Continuation bytes contribute 6 bits
each to the construction of the integer code point; the first byte
contributes 7-n bits. Byte values 0xFE and 0xFF are never valid (and,
under the current definition of UTF-8, neither are 0xC0, 0xC1, or 0xF5
through 0xFD). Bytes that are not part of a multi-byte sequence may not
have a one in their 0x80 bit.
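
The rules above can be sketched in a few lines of Python 3. This is an
illustration only, not what Django or PostgreSQL actually run; the
function names utf8_sequence_length and decode_one are made up for the
example. It assumes the current 4-byte limit on UTF-8 sequences, and it
does not check for overlong forms, surrogates, or code points above
U+10FFFF.

```python
def utf8_sequence_length(lead):
    """Return the number of bytes announced by a lead byte (hypothetical helper)."""
    if lead & 0x80 == 0:
        return 1  # 0x80 bit clear: a plain ASCII byte, not part of a sequence
    # Count contiguous one bits from the 0x80 bit downwards.
    n = 0
    mask = 0x80
    while mask and lead & mask:
        n += 1
        mask >>= 1
    if n == 1:
        raise ValueError("0x%02X is a continuation byte, not a lead byte" % lead)
    if n > 4:
        # Assumption: modern UTF-8 stops at 4 bytes, so 0xF8..0xFF are rejected.
        raise ValueError("0x%02X is not a valid lead byte" % lead)
    return n


def decode_one(data, i=0):
    """Decode one code point starting at data[i]; return (code_point, next_index)."""
    lead = data[i]
    n = utf8_sequence_length(lead)
    if n == 1:
        return lead, i + 1
    tail = data[i + 1:i + n]
    # Every continuation byte must look like 10xxxxxx.
    if len(tail) != n - 1 or any(b & 0xC0 != 0x80 for b in tail):
        raise ValueError("bad or missing continuation byte after index %d" % i)
    value = lead & (0x7F >> n)  # keep the low 7-n bits of the lead byte
    for b in tail:
        value = (value << 6) | (b & 0x3F)  # each continuation byte adds 6 bits
    return value, i + n
```

For example, decode_one(b"\xc3\xa9") gives (0xE9, 2), which is the code
point for a lowercase e with an acute accent.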
Perhaps some other piece of software has dumped something into
PostgreSQL using, say, Latin-1 or Latin-8, etc.
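
A quick way to test that suspicion, once you have the raw bytes in
hand, is to ask Python whether they decode as UTF-8. This is a rough
Python 3 sketch; the diagnose function and the sample value are
invented for illustration, and how you pull the raw bytes out of your
own table is left to you.

```python
def diagnose(raw):
    """Report whether a byte string is valid UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return "valid UTF-8"
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return "invalid UTF-8 at byte %d (0x%02X)" % (exc.start, raw[exc.start])


# Sample data (an assumption): the same word stored by two different writers.
as_latin1 = "caf\u00e9".encode("latin-1")  # accented e stored as the single byte 0xE9
as_utf8 = "caf\u00e9".encode("utf-8")      # accented e stored as the bytes 0xC3 0xA9

print(as_latin1.hex(), diagnose(as_latin1))
print(as_utf8.hex(), diagnose(as_utf8))
```

A lone byte with the 0x80 bit set and no continuation bytes after it, as
in the Latin-1 case here, is the typical sign that something wrote
non-UTF-8 data into a UTF-8 database.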
When you say contents of a URL, do you mean the URL itself, or the
page it refers to?
Your parser is written in what? And what encoding does it use
when it talks to Postgres? (Hint: it needs to match the encoding
that Django is using.)
Have you looked at what byte sequence is in the DB? If you have
data in the database which, according to Django's understanding
of the DB settings, is invalid, then it's not Django's job to fix it for
you.
N.B.: There's no guarantee that the web page correctly identified its
encoding either.
Bill
But I would like to dispute your thought that you shouldn't worry about
encoding and decoding. Everything is encoded. Even UTF-32 is an
encoding. You already have a database, and changing can be a pain,
but I'd like to lobby you, in future projects, to select UTF-8 for your
database encoding. For true ASCII characters (ord(c) < 128), in a
world of 8-bit bytes, the UTF-8 encoding requires the same number of
bytes as the ASCII encoding for both transmission and storage (they
are the same byte value). In environments where the local accented
characters are common, you might make an argument that Latin-N will
save you space, but it also means that there are characters that you
can't represent. The most trouble-free situation is when you use
Unicode strings in Python, as Django tries hard to do, and properly
configure your interfaces (HTTP, database) to do the appropriate
encoding and decoding.
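
To make the size trade-off concrete, here is a small Python 3 sketch
(the encoded_sizes helper and the sample strings are mine, purely for
illustration). It compares byte counts under ASCII, Latin-1, and UTF-8,
using None where an encoding cannot represent the text at all.

```python
def encoded_sizes(text):
    """Return the byte length of text in each encoding, or None if unrepresentable."""
    sizes = {}
    for name in ("ascii", "latin-1", "utf-8"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None  # this encoding has no byte for some character
    return sizes


# Pure ASCII text: byte for byte identical in ASCII and UTF-8.
assert "Django".encode("ascii") == "Django".encode("utf-8")

print(encoded_sizes("Django"))     # same size everywhere
print(encoded_sizes("caf\u00e9"))  # Latin-1 saves one byte over UTF-8
print(encoded_sizes("\u20ac"))     # the euro sign does not exist in Latin-1
```

The last line is the catch: Latin-1 is one byte shorter for the accented
word, but it has no way to store the euro sign at all.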
Bill