
Re: (Simple?) Unicode Question


Rami Chowdhury

Aug 27, 2009, 12:44:41 PM8/27/09
to Shashank Singh, pytho...@python.org
> Further, does anything, except a printing device need to know the
> encoding of a piece of "text"?

I may be wrong, but I believe that's part of the idea behind the separation
of the string and bytes types in Python 3.x. I believe, if you are using
Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

If you're using Python 2.x, though, I believe that if you simply open the
file in binary mode, the data you read() will be treated as an array of
bytes, although you may encounter issues trying to access the n'th
character.
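In Python 3 terms, the byte-level round trip the original post describes looks roughly like this (a small sketch; the temporary file paths are illustrative):

```python
import os
import tempfile

# Create a 10-byte file -- raw data, no text, no encoding anywhere.
fd, path = tempfile.mkstemp()
os.write(fd, bytes(range(10)))
os.close(fd)

# Read it back in binary mode: you get bytes, not characters.
with open(path, 'rb') as f:
    data = bytearray(f.read())

data[1] = 0xFF  # replace the 2nd byte -- pure byte manipulation

# Write the modified bytes to a new file, still in binary mode.
out_path = path + '.out'
with open(out_path, 'wb') as f:
    f.write(bytes(data))

with open(out_path, 'rb') as f:
    result = f.read()

os.unlink(path)
os.unlink(out_path)
print(result)  # b'\x00\xff\x02\x03\x04\x05\x06\x07\x08\t'
```

No encoding is ever consulted: the data goes in as bytes and comes out as bytes.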

Please do correct me if I'm wrong, anyone.

On Thu, 27 Aug 2009 09:39:06 -0700, Shashank Singh
<shashank.s...@gmail.com> wrote:

> Hi All!
>
> I have a very simple (and probably stupid) question eluding me.
> When exactly is the char-set information needed?
>
> To make my question clear consider reading a file.
> While reading a file, all I get is basically an array of bytes.
>
> Now suppose a file has 10 bytes in it (all is data, no metadata,
> forget the BOM and stuff for a little while). I read it into an array
> of 10 bytes, replace, say, the 2nd byte, and write all the bytes back
> to a new file.
>
> Do I need the character encoding mumbo jumbo anywhere in this?
>
> Further, does anything, except a printing device need to know the
> encoding of a piece of "text"? I mean, as long as we are not trying
> to get a symbolic representation of a "text" or get "i"th character
> of it, all we need to do is to carry the intended encoding as
> an auxiliary information to the data stored as byte array.
>
> Right?
>
> --shashank

--
Rami Chowdhury
"Never attribute to malice that which can be attributed to stupidity" --
Hanlon's Razor
408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD)

Albert Hopkins

Aug 27, 2009, 12:49:36 PM8/27/09
to pytho...@python.org
On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote:
> Hi All!
>
> I have a very simple (and probably stupid) question eluding me.
> When exactly is the char-set information needed?
>
> To make my question clear consider reading a file.
> While reading a file, all I get is basically an array of bytes.
>
> Now suppose a file has 10 bytes in it (all is data, no metadata,
> forget the BOM and stuff for a little while). I read it into an array
> of 10 bytes, replace, say, the 2nd byte, and write all the bytes back
> to a new file.
>
> Do I need the character encoding mumbo jumbo anywhere in this?
>
> Further, does anything, except a printing device need to know the
> encoding of a piece of "text"? I mean, as long as we are not trying
> to get a symbolic representation of a "text" or get "i"th character
> of it, all we need to do is to carry the intended encoding as
> an auxiliary information to the data stored as byte array.

If you are just reading and writing bytes then you are just reading and
writing bytes. Where you need to worry about unicode, etc. is when you
start treating a series of bytes as TEXT (e.g. how many *characters* are
in this byte array).*

This is no different, IMO, from treating data as a byte stream vs. as an
image file. You don't need to worry about resolution, palette, bit-depth,
etc. if you are only treating it as a stream of bytes. The only difference between
the two is that in Python "unicode" is a built-in type and "image"
isn't ;)

* Just make sure that if you are manipulating byte streams independently
of their textual representation, you open files in binary mode.
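The bytes-vs-characters distinction is easy to demonstrate (a small sketch; the byte string below is assumed to be UTF-8):

```python
raw = b'h\xc3\xa9llo'  # 'héllo' encoded as UTF-8 ('é' takes two bytes)

# As a byte stream, no encoding knowledge is needed:
print(len(raw))                   # 6 bytes
print(raw.replace(b'llo', b'y'))  # b'h\xc3\xa9y'

# Counting *characters* is where the encoding comes in:
print(len(raw.decode('utf-8')))   # 5 characters
```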

-a


Thorsten Kampe

Aug 29, 2009, 3:34:43 AM8/29/09
to
* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)

> > Further, does anything, except a printing device need to know the
> > encoding of a piece of "text"?

Python needs to know if you are processing the text.



> I may be wrong, but I believe that's part of the idea behind the separation
> of string and bytes types in Python 3.x. I believe, if you are using
> Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

Nothing has changed in that regard. You still need to decode and encode
text and for that you have to know the encoding.
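For instance (a minimal sketch; the byte string is assumed to be UTF-8):

```python
data = b'Stra\xc3\x9fe'  # the bytes of 'Straße' in UTF-8

# Turning bytes into text requires knowing the encoding:
text = data.decode('utf-8')
print(text, len(text))   # Straße 6

# Guessing wrong doesn't fail loudly -- it just produces mojibake:
print(repr(data.decode('latin-1')))
```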

Thorsten

Steven D'Aprano

Aug 29, 2009, 4:26:54 AM8/29/09
to
On Sat, 29 Aug 2009 09:34:43 +0200, Thorsten Kampe wrote:

> * Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
>> > Further, does anything, except a printing device need to know the
>> > encoding of a piece of "text"?
>
> Python needs to know if you are processing the text.

Python only needs to know when you convert the text to or from bytes. I
can do this:

>>> s = "hello"
>>> t = "world"
>>> print(' '.join([s, t]))
hello world

and not need to care anything about encodings.

So long as your terminal has a sensible encoding, and you have a good
quality font, you should be able to print any string you can create.

>> I may be wrong, but I believe that's part of the idea behind the
>> separation of string and bytes types in Python 3.x. I believe, if you
>> are using Python 3.x, you don't need the character encoding mumbo jumbo
>> at all ;-)
>
> Nothing has changed in that regard. You still need to decode and encode
> text and for that you have to know the encoding.

You only need to worry about encoding when you convert from bytes to
text, and vice versa. Admittedly, the most common time you need to do
that is when reading input from files, but if all your text strings are
generated by Python, and not output anywhere, you shouldn't need to care
about encodings.

If all your text contains nothing but ASCII characters, you should never
need to worry about encodings at all.
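To illustrate (a sketch; 'wörld' here just stands in for any non-ASCII text):

```python
s = 'hello'
t = 'wörld'

# Pure string processing -- no encodings anywhere:
combined = ' '.join([s, t]).upper()
print(combined)                     # HELLO WÖRLD

# Encoding appears only at the byte boundary:
encoded = combined.encode('utf-8')
print(len(combined), len(encoded))  # 11 characters, 12 bytes
assert encoded.decode('utf-8') == combined
```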


--
Steven

Nobody

Aug 29, 2009, 3:09:12 PM8/29/09
to
On Sat, 29 Aug 2009 08:26:54 +0000, Steven D'Aprano wrote:

> Python only needs to know when you convert the text to or from bytes. I
> can do this:
>
>>>> s = "hello"
>>>> t = "world"
>>>> print(' '.join([s, t]))
> hello world
>
> and not need to care anything about encodings.
>
> So long as your terminal has a sensible encoding, and you have a good
> quality font, you should be able to print any string you can create.

UTF-8 isn't a particularly sensible encoding for terminals.

And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
together and stuff them into a TTF file; that doesn't make them "a font",
though.

>>> I may be wrong, but I believe that's part of the idea behind the
>>> separation of string and bytes types in Python 3.x. I believe, if you
>>> are using Python 3.x, you don't need the character encoding mumbo jumbo
>>> at all ;-)
>>
>> Nothing has changed in that regard. You still need to decode and encode
>> text and for that you have to know the encoding.
>
> You only need to worry about encoding when you convert from bytes to
> text, and vice versa. Admittedly, the most common time you need to do
> that is when reading input from files, but if all your text strings are
> generated by Python, and not output anywhere, you shouldn't need to care
> about encodings.

Why would you generate text strings and not output them anywhere?

The main advantage of using Unicode internally is that you can associate
encodings with the specific points where data needs to be converted
to/from bytes, rather than having to carry the encoding details around the
program.
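That pattern -- decode at the input boundary, work with str internally, encode once at the output boundary -- might look like this (the two source encodings are assumed for illustration):

```python
# Two inputs arriving in different (known) encodings:
from_legacy_system = 'café'.encode('latin-1')
from_web = 'café'.encode('utf-8')

# Decode at the boundary; the encoding details stop here.
a = from_legacy_system.decode('latin-1')
b = from_web.decode('utf-8')
assert a == b == 'café'   # internally, just one str type

# Encode exactly once, at the output boundary.
output = ' & '.join([a, b]).encode('utf-8')
print(output)
```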

Steven D'Aprano

Aug 29, 2009, 10:36:49 PM8/29/09
to
On Sat, 29 Aug 2009 20:09:12 +0100, Nobody wrote:

> On Sat, 29 Aug 2009 08:26:54 +0000, Steven D'Aprano wrote:
>
>> Python only needs to know when you convert the text to or from bytes. I
>> can do this:
>>
>>>>> s = "hello"
>>>>> t = "world"
>>>>> print(' '.join([s, t]))
>> hello world
>>
>> and not need to care anything about encodings.
>>
>> So long as your terminal has a sensible encoding, and you have a good
>> quality font, you should be able to print any string you can create.
>
> UTF-8 isn't a particularly sensible encoding for terminals.

Did I mention UTF-8?

Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?


> And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
> together and stuff them into a TTF file; that doesn't make them "a
> font", though.

I never mentioned "Unicode font" either. In any case, there's no reason
why a skillful designer can't make a single font which covers the entire
Unicode range in a consistent style.


>>>> I may be wrong, but I believe that's part of the idea behind the
>>>> separation of string and bytes types in Python 3.x. I believe, if you
>>>> are using Python 3.x, you don't need the character encoding mumbo
>>>> jumbo at all ;-)
>>>
>>> Nothing has changed in that regard. You still need to decode and
>>> encode text and for that you have to know the encoding.
>>
>> You only need to worry about encoding when you convert from bytes to
>> text, and vice versa. Admittedly, the most common time you need to do
>> that is when reading input from files, but if all your text strings are
>> generated by Python, and not output anywhere, you shouldn't need to
>> care about encodings.
>
> Why would you generate text strings and not output them anywhere?

Who knows? It doesn't matter -- the point is that you can if you want to.
You only need to worry about encodings at input and output, therefore
logically if you don't do I/O you can process strings all day long and
never worry about encodings at all.


> The main advantage of using Unicode internally is that you can associate
> encodings with the specific points where data needs to be converted
> to/from bytes, rather than having to carry the encoding details around
> the program.

Surely the main advantage of Unicode is that it gives you a full and
consistent range of characters not limited to the 128 characters provided
by ASCII?
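For example (a small sketch mixing scripts that no single unibyte encoding could hold):

```python
s = 'naïve Σ 日本 →'   # Latin, Greek, Han, and a symbol in one str

print(len(s))            # 12 characters, regardless of byte count
assert s.encode('utf-8').decode('utf-8') == s

# ASCII's 128 characters can't represent it:
try:
    s.encode('ascii')
except UnicodeEncodeError:
    print('not representable in ASCII')
```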

--
Steven

Nobody

Aug 30, 2009, 1:30:48 PM8/30/09
to
On Sun, 30 Aug 2009 02:36:49 +0000, Steven D'Aprano wrote:

>>> So long as your terminal has a sensible encoding, and you have a good
>>> quality font, you should be able to print any string you can create.
>>
>> UTF-8 isn't a particularly sensible encoding for terminals.
>
> Did I mention UTF-8?
>
> Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?

I don't think I've ever seen a terminal (whether an emulator running on a
PC or a hardware terminal) which supports anything like the entire Unicode
repertoire, along with right-to-left writing, complex scripts, etc. Even
support for double-width characters is uncommon.

If your terminal can't handle anything outside of ISO-8859-1, there isn't
any advantage to using UTF-8, and some disadvantages; e.g. a typical Unix
tty driver will delete the last *byte* from the input buffer when you
press backspace (Linux 2.6.* has the IUTF8 flag, but this is non-standard).
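The backspace problem is easy to reproduce at the byte level (a sketch simulating what a byte-oriented tty driver does to the input buffer):

```python
buf = '€'.encode('utf-8')   # the Euro sign is three bytes in UTF-8
print(buf, len(buf))        # b'\xe2\x82\xac' 3

# A driver that deletes one *byte* per backspace leaves behind
# a truncated, invalid UTF-8 sequence:
buf = buf[:-1]
try:
    buf.decode('utf-8')
except UnicodeDecodeError as e:
    print('broken sequence left in buffer:', e.reason)
```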

Historically, terminal I/O has tended to revolve around unibyte encodings,
with everything except the endpoints being encoding-agnostic. Anything
which falls outside of that is a dog's breakfast; it's no coincidence
that the word for "messed-up text" (arising from an encoding mismatch)
was borrowed from Japanese (mojibake).

Life is simpler if you can use a unibyte encoding. Apart from anything
else, the failure modes tend to be harmless. E.g. you get the wrong glyph
rather than two glyphs where you expected one. On a 7-bit channel, you get
the wrong printable character rather than a control character (this is why
ISO-8859-* reserves \x80-\x9F as control codes rather than using them as
printable characters).

>> And "Unicode font" is an oxymoron. You can merge a whole bunch of fonts
>> together and stuff them into a TTF file; that doesn't make them "a
>> font", though.
>
> I never mentioned "Unicode font" either. In any case, there's no reason
> why a skillful designer can't make a single font which covers the entire
> Unicode range in a consistent style.

Consistency between unrelated scripts is neither realistic nor
desirable.

E.g. Latin fonts tend to use uniform stroke widths unless they're
specifically designed to look like handwriting, whereas Han fonts tend to
prefer variable-width strokes which reflect the direction.

>> The main advantage of using Unicode internally is that you can associate
>> encodings with the specific points where data needs to be converted
>> to/from bytes, rather than having to carry the encoding details around
>> the program.
>
> Surely the main advantage of Unicode is that it gives you a full and
> consistent range of characters not limited to the 128 characters provided
> by ASCII?

Nothing stops you from using other encodings, or from using multiple
encodings. But using multiple encodings means keeping track of the
encodings. This isn't impossible, and it may produce better results (e.g.
no information loss from Han unification), but it can be a lot more work.
