[Python-ideas] Add encoding attribute to bytes

Terry Reedy

Nov 5, 2009, 8:15:36 PM

to python...@python.org

A Python interpreter has one encoding for floats, ints, and strings.
sys.float_info and sys.int_info give details about the first two,
although they are mostly invisible to user code. (I presume they are
attached to sys rather than float and int precisely because of this.) A
couple of recent posts have discussed making the unicode encoding (UCS2
v 4) both less visible and more discoverable to extensions.
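
A minimal sketch of the interpreter-wide information mentioned above. It only
reads attributes that exist on sys.float_info and sys.int_info; the values
printed depend on the platform and build, so none are shown here.

```python
import sys

# Interpreter-wide details of the float representation (a C double).
print(sys.float_info.mant_dig)   # bits in the significand
print(sys.float_info.max)        # largest representable finite float

# Interpreter-wide details of the int representation (CPython-specific).
print(sys.int_info.bits_per_digit)  # bits held in each internal digit
print(sys.int_info.sizeof_digit)    # size in bytes of the C type for a digit
```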

Bytes are nearly always an encoding of *something*, but the particular
encoding used is instance-specific. As Guido has said, the programmer
must keep track. But how? In an OO language, one obvious way is as an
attribute of the instance. That would be carried with the instance and
make it self-identifying.

What I do not know is whether it is feasible to give an immutable instance of a
builtin class a mutable attribute slot. If it were, I think this could
make 3.x bytes easier and more transparent to use. When a string is
encoded to bytes, the attribute would be set. If it were then pickled,
the attribute would be stored with it and restored with it, and less
easily lost. If it were then decoded, the attribute would be used. If it
were sent to the net, the attribute would be used to set the appropriate
headers. The reverse process would apply from net to bytes to (unicode)
text.

Bytes representing other types of data, such as media, could also be
tagged, not just those representing text.

This would be a proposal for 3.3 at the earliest. It would involve
revising stdlib modules, as appropriate, to use the new info.

Terry Jan Reedy

_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

MRAB

Nov 5, 2009, 9:19:35 PM

to python...@python.org

You said "give an immutable instance of a builtin class a mutable
attribute slot". Why would the slot be mutable? Surely if the attribute
said that the bytes represented a certain type of data then you
shouldn't be able to change it. ("The attribute says that the bytes are
UTF-8, but I'm going to change it so that it says they are ISO-8859-1.")
I think that the attribute should be immutable.

Stephen J. Turnbull

Nov 5, 2009, 11:18:11 PM

to MRAB, python...@python.org

MRAB writes:

> You said "give an immutable instance of a builtin class a mutable
> attribute slot". Why would the slot be mutable?

I think the idea is that in many cases you won't know what the
encoding is until after you've read the bytes.

But I don't really see this idea as that useful either way. The
obvious use case for me would be in the email module. So you read in
a message and create a bytes object, which you stash away for later
use as necessary. The header and the body, each MIME part, each MIME
part header and payload, and so on recursively are identified as
slices of the BigBytesObject you read in at the beginning, which is
implicitly a binary blob and doesn't need an encoding (strike one).
Each header identifies the encoding (which here would have to refer
ambiguously to Content-Type or Content-Transfer-Encoding, strike two)
of the corresponding payload. And you'll need to deal with cases
where Content-Type and Content-Transfer-Encoding are both relevant,
strike three. You may as well keep the various layers of encoding
explicitly in email-specific objects, so use case: email strikes out.

That's only one use case, of course. But we can see what a use case
would have to look like: you read in a bytes object, just enough to
enable you to accurately parse the rest of the stream in the same way
and tag each bytes part with an appropriate encoding. What are they?

Nick Coghlan

Nov 6, 2009, 4:13:26 AM

to Terry Reedy, python...@python.org

Terry Reedy wrote:
> Bytes are nearly always an encoding of *something*, but the particular
> encoding used is instance-specific. As Guido has said, the programmer
> must keep track. But how? In an OO language, one obvious way is as an
> attribute of the instance. That would be carried with the instance and
> make it self-identifying.

I work in comms and spend a lot of time shuttling bytes from one place
to another without caring in the least about the encoding. Caring about
that kind of detail is application layer stuff and belongs in
application layer objects.

More importantly, such an attribute implies a defined responsibility for
keeping it accurate. For application layer objects, it is possible to
define that. For a low level data structure like bytes, it isn't.

Attaching metadata to something without defining a responsible entity
for keeping that metadata accurate and up to date is a recipe for trouble.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
---------------------------------------------------------------

Georg Brandl

Nov 6, 2009, 6:17:36 PM

to python...@python.org

Terry Reedy schrieb:

> A Python interpreter has one encoding for floats, ints, and strings.
> sys.float_info and sys.int_info give details about the first two,
> although they are mostly invisible to user code. (I presume they are
> attached to sys rather than float and int precisely because of this.) A
> couple of recent posts have discussed making the unicode encoding (UCS2
> v 4) both less visible and more discoverable to extensions.
>
> Bytes are nearly always an encoding of *something*, but the particular
> encoding used is instance-specific. As Guido has said, the programmer
> must keep track. But how? In an OO language, one obvious way is as an
> attribute of the instance. That would be carried with the instance and
> make it self-identifying.
>
> What I do not know is whether it is feasible to give an immutable instance of a
> builtin class a mutable attribute slot.

As soon as you can mutate an instance, it is not an immutable type anymore.
Calling it "immutable" despite that will cause trouble. (The same bytes instance
could be used somewhere else transparently, e.g. as a function default
argument, or cached as a constant local.)
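
A small sketch of the sharing hazard described above. Real bytes objects do
not accept new attributes, so this uses a hypothetical subclass, TaggedBytes,
whose instances do; the function name send and the attribute name encoding
are likewise made up for illustration.

```python
class TaggedBytes(bytes):
    """Hypothetical bytes subclass; instances of a subclass get a __dict__,
    so an 'encoding' attribute can be attached and later changed."""


def send(payload=TaggedBytes(b"caf\xc3\xa9")):
    # The default object is created once and shared by every call.
    return payload


first = send()
first.encoding = "utf-8"        # one caller tags the shared object

second = send()                 # an unrelated caller gets the same object
second.encoding = "latin-1"     # ...and silently retags it for everyone

print(first is second)          # True
print(first.encoding)           # latin-1
```

The first caller never touched its object again, yet the tag it reads back
has changed, which is the kind of trouble a mutable slot on a shared
"immutable" value invites.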

As for the usefulness, I often have to work with proprietary communication
protocols between computer and devices, and there the bytes have no encoding
whatsoever (though I agree that most bytes do have a meaningful encoding).
However, a class as fundamental as "bytes" should not be burdened with an
attribute that may not even apply -- it's easy to make a custom class to
represent a (bytes, encoding) pair.

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less.
Four shall be the number of spaces thou shalt indent, and the number of thy
indenting shall be four. Eight shalt thou not indent, nor either indent thou
two, excepting that thou then proceed to four. Tabs are right out.

Jim Jewett

Nov 7, 2009, 4:07:06 PM

to Terry Reedy, python...@python.org

On Thu, Nov 5, 2009 at 8:15 PM, Terry Reedy <tjr...@udel.edu> wrote:
> A Python interpreter has one encoding for floats, ints, and strings.
> sys.float_info and sys.int_info give details about the first two.

(Instead of changing bytes,)

This suggests a sys.string_info that contains information about the
default string representation --including whether the internal
encoding is UCS2 or UCS4 or something else.

That should at least make it possible to give better diagnostic messages.

-jJ

Terry Reedy

Nov 9, 2009, 9:15:40 PM

to python...@python.org

Jim Jewett wrote:
> On Thu, Nov 5, 2009 at 8:15 PM, Terry Reedy <tjr...@udel.edu> wrote:
>> A Python interpreter has one encoding for floats, ints, and strings.
>> sys.float_info and sys.int_info give details about the first two.
>
> (Instead of changing bytes,)
>
> This suggests a sys.string_info that contains information about the
> default string representation --including whether the internal
> encoding is UCS2 or UCS4 or something else.
>
> That should at least make it possible to give better diagnostic messages.

What to do about interpreter-wide unicode string info, if anything, is
related but separate from what to do about instance-specific bytes info.

Terry Reedy

Nov 9, 2009, 9:22:14 PM

to python...@python.org

As Stephen said, in case the info is initially missing or determined to
be erroneous.

> Surely if the attribute
> said that the bytes represented a certain type of data then you
> shouldn't be able to change it. ("The attribute says that the bytes are
> UTF-8, but I'm going to change it so that it says they are ISO-8859-1.")
> I think that the attribute should be immutable.

Encoding set by unicode.encode or a wrapper thereof is definitionally
correct and should not be changed. Encoding inferred by mimetype header
or file extension might be erroneous. I had in mind that the difference
might be indicated somehow: 'utf8' versus 'utf8?', for instance.
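
One way the 'utf8' versus 'utf8?' convention could be read back, as a sketch
only. The helper name split_tag and the rule that a trailing '?' means
"inferred, possibly wrong" are assumptions made for illustration.

```python
def split_tag(tag):
    """Split a tag such as 'utf8?' into (codec_name, is_certain)."""
    if tag.endswith("?"):
        return tag[:-1], False
    return tag, True


print(split_tag("utf8"))    # ('utf8', True)
print(split_tag("utf8?"))   # ('utf8', False)
```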

Terry Jan Reedy

Terry Reedy

Nov 9, 2009, 9:44:11 PM

to python...@python.org

Georg Brandl wrote:

>> What I do not know is whether it is feasible to give an immutable instance of a
>> builtin class a mutable attribute slot.
>
> As soon as you can mutate an instance, it is not an immutable type anymore.
> Calling it "immutable" despite that will cause trouble. (The same bytes instance
> could be used somewhere else transparently, e.g. as a function default
> argument, or cached as a constant local.)

OK, scratch that implementation of my idea.

>
> As for the usefulness, I often have to work with proprietary communication
> protocols between computer and devices, and there the bytes have no encoding
> whatsoever

Random bits? It seems to me that protocol means some sort of encoding,
formatting, or structuring, some sort of agreed on interpretation, even
if private.

> (though I agree that most bytes do have a meaningful encoding).
> However, a class as fundamental as "bytes" should not be burdened with an
> attribute that may not even apply -- it's easy to make a custom class to
> represent a (bytes, encoding) pair.

The fundamental problem I am interested in is the separation of raw data
from how-to-use-it info. Text encoding of bytes is only one instance,
though the most common one that pops up on python-list. I had also thought
of something like (incomplete):

class Textbytes:
    def __init__(self, text, code):
        if type(text) is str:
            text = text.encode(code)
        if type(text) is bytes:
            self.text = text
            self.code = code
        else:
            raise ValueError()

    def __str__(self):
        return self.text.decode(self.code)

b = Textbytes('abc', 'utf8')
print(b)

One problem is that it is a lot bulkier than a raw bytes. Leaving that
aside, a custom class is just that: custom. Stdlib modules will neither
accept nor produce such a wrapper rather than bytes.

My underlying idea is that maybe the standard Python distribution should
promote encapsulation of encoding info with raw bytes to make bug-free
usage easier. Adding an attribute was one implementation idea. Adding a
standardized wrapper class (at least in a module) would be another.

Terry Jan Reedy

MRAB

Nov 9, 2009, 9:54:45 PM

to python...@python.org

I was thinking more along the lines of saying that the attribute
(default None) is specified when the bytes object is created. You
wouldn't be able to change it, but you could create a new bytes object
with a different attribute:

new_bytes = bytes(old_bytes, "utf8")

The actual bytes themselves wouldn't need to be copied; they could be
safely shared because 'bytes' objects are immutable.

There then comes the question of whether new_bytes == old_bytes.
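
A sketch of how the equality question plays out, using a hypothetical frozen
dataclass rather than a changed bytes type (the two-argument call shown above
is the proposed behaviour, not what bytes does today). The class name
EncodedBytes and its field names are assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class EncodedBytes:
    """Hypothetical immutable (bytes, encoding) pair."""
    data: bytes
    encoding: Optional[str] = None


old = EncodedBytes(b"abc")
new = replace(old, encoding="utf8")   # new object, same underlying bytes

print(new.data is old.data)   # True: the payload is shared, not copied
print(new == old)             # False: generated __eq__ compares the tag too
print(new.data == old.data)   # True: the raw bytes are equal
```

Here the generated comparison treats the tag as part of the value. A design
that ignored the tag in == would be equally defensible, which is why the
question has no obvious answer.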

Stephen J. Turnbull

Nov 9, 2009, 11:30:22 PM

to Terry Reedy, python...@python.org

Terry Reedy writes:

> The fundamental problem I am interested in is the separation of raw data
> from how to use it info.

But this is ambiguous. Take reStructuredText. It *is* text/plain.
But it also *is* application/x-structuredtext. Not to forget
application/octet-stream. An MUA will treat it as the first, docutils
as the second, and gzip as the third.
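
A rough sketch of the same bytes being read three ways. The title check is a
toy stand-in for a real reStructuredText parser such as docutils, and the
sample content is made up.

```python
import gzip

raw = b"Title\n=====\n\nSome *emphasised* text.\n"

# Read as text/plain: decode and display.
as_text = raw.decode("utf-8")

# Read as application/octet-stream: compress without looking inside.
packed = gzip.compress(raw)

# Read as markup: a (hypothetical, toy) check for a reST section title,
# standing in for what a real parser such as docutils would do.
lines = as_text.splitlines()
has_title = len(lines) > 1 and set(lines[1]) == {"="}

print(gzip.decompress(packed) == raw)
print(has_title)
```

No single tag on raw would be right for all three consumers.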

> My underlying idea is that maybe the standard Python distribution
> should promote encapsulation of encoding info with raw bytes to
> make bug-free usage easier.

I think you will find that every use case makes different demands on
this feature, and that it typically interacts with higher-level needs
of the application. There's a reason that ASN.1 is insanely complex
and only applications that really need it ever use it. This feature
will either be too simple to serve most practical needs, or too
complex to serve most practical programmers.<wink>

And "bug-free" usage is hopeless. Much, perhaps the vast majority, of
the coding information will be automatically derived from sources you
deprecate as "heuristic", like MIME Content-Type headers. It will get
attached to the bytes as an attribute, and after that you can't know
how reliable it is.

If you have a practical example of such a simple class (bytes +
encoding attribute) that serves as a base for more complex
applications, I'd really like to see them. But until there are real
use cases on the table, I have to say I can't see the proposed
facility as being particularly useful to the email package, for
example.

Georg Brandl

Nov 10, 2009, 3:20:15 AM

to python...@python.org

Terry Reedy schrieb:

> Georg Brandl wrote:
>
>>> What I do not know is whether it is feasible to give an immutable instance of a
>>> builtin class a mutable attribute slot.
>>
>> As soon as you can mutate an instance, it is not an immutable type anymore.
>> Calling it "immutable" despite that will cause trouble. (The same bytes instance
>> could be used somewhere else transparently, e.g. as a function default
>> argument, or cached as a constant local.)
>
> OK, scratch that implementation of my idea.
>>
>> As for the usefulness, I often have to work with proprietary communication
>> protocols between computer and devices, and there the bytes have no encoding
>> whatsoever
>
> Random bits? It seems to me that protocol means some sort of encoding,
> formatting, or structuring, some sort of agreed on interpretation, even
> if private.

Sure, but nothing you could map entirely onto a string of Unicode characters.

Georg

Nick Coghlan

Nov 10, 2009, 5:41:26 AM

to Terry Reedy, python...@python.org

Terry Reedy wrote:
>> As for the usefulness, I often have to work with proprietary
>> communication
>> protocols between computer and devices, and there the bytes have no
>> encoding
>> whatsoever
>
> Random bits? It seems to me that protocol means some sort of encoding,
> formatting, or structuring, some sort of agreed on interpretation, even
> if private.

This is true, but the encoding scheme *isn't* a property of the binary
data in and of itself. It's metadata about it that guides the
application as to how the stream should be interpreted.

For a lot of the things I've done in the past, I haven't cared at all
about the encoding of binary data - I've just been schlepping bits from
point A to point B and back without caring what they actually *meant*.
Other times I didn't have to guess or pass any metadata around because
the comms port was hardwired to a particular device that only knew one
way of communicating - the definition of the protocol was implicit in
the implementation of the interface software.

In fact, one of the key features typically desired in a communications
protocol is for it to be content neutral: you push binary data in one
end and get the same binary data out of the other end. Peer applications
using the channel to communicate with each other don't need to care what
the channel is doing with the data, but equally importantly, the
software implementing the comms channel doesn't need to know how to
interpret the bits it is transporting*.

For other applications, the Unicode encoding might be important to know.
Some will care more about the MIME type, or use some other defined
binary encoding (what is the Unicode encoding of an sqlite or bsddb
database file?). Other applications may be interested in a proprietary
binary format that is formally defined solely by the code that knows how
to read and write it.

Can bytes be used to store encoded Unicode data? Sure they can. But they
can be used for a whole host of other things as well, so burdening them
with an attribute that is occasionally helpful, but more often dead weight
or even outright misleading would be a mistake.

Cheers,
Nick.

* Sometimes a bit more coupling makes sense when there are engineering
advantages to be had, but this is usually an application specific thing
(e.g. IP has a protocol field that identifies different upper-layer
protocols such as TCP, UDP and ESP which have different network
performance expectations. This allows IP network routers to apply
different rules without having to peek inside the payload of each IP packet)
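
A minimal sketch of reading that protocol field, assuming a 20-byte IPv4
header with no options. The header below is built by hand with example
addresses and a zero checksum, so it is illustrative rather than a valid
packet.

```python
import struct

# Hypothetical 20-byte IPv4 header with no options; addresses are examples.
header = struct.pack(
    "!BBHHHBBH4s4s",
    0x45,                    # version 4, header length 5 words
    0,                       # type of service
    20,                      # total length (header only here)
    0, 0,                    # identification, flags/fragment offset
    64,                      # time to live
    17,                      # protocol field
    0,                       # checksum (left as zero in this sketch)
    bytes([192, 0, 2, 1]),   # source address
    bytes([192, 0, 2, 2]),   # destination address
)

PROTOCOLS = {6: "TCP", 17: "UDP", 50: "ESP"}

protocol_number = header[9]          # the protocol field is the tenth byte
print(PROTOCOLS.get(protocol_number, "unknown"))   # UDP
```

The router-side code only looks at one fixed byte of the header and never
needs to interpret the payload that follows.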

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
---------------------------------------------------------------

Terry Reedy

Nov 10, 2009, 3:10:13 PM

to python...@python.org

Georg Brandl wrote:
> Terry Reedy schrieb:

>> Random bits? It seems to me that protocol means some sort of encoding,
>> formatting, or structuring, some sort of agreed on interpretation, even
>> if private.
>
> Sure, but nothing you could map entirely onto a string of Unicode characters.

My idea is not limited to unicode encodings. But I see that one
field/attribute can be either too many or too few, and hence not a
universal solution.

Terry Reedy

Nov 10, 2009, 3:12:50 PM

to python...@python.org

Your experience has been different from mine. Thanks for the exposition.
I can see why you prefer metadata to either be in the stream itself or
as part of a wrapper object.

Terry Jan Reedy

Nick Coghlan

Nov 10, 2009, 4:26:44 PM

to Terry Reedy, python...@python.org

Terry Reedy wrote:
> Your experience has been different from mine. Thanks for the exposition.
> I can see why you prefer metadata to either be in the stream itself or
> as part of a wrapper object.

One of the things I've learned on python-list/-dev/-ideas is that the
*kind* of software one writes regularly makes a big difference to what
seems like a good idea. I tend to write fairly low level hardware
control code, so that's the way I tend to think. Others come from the
financial world or from an academic/scientific background or are
interested in Python for education purposes or in building big
frameworks that try to solve the world (or at least a particular problem
space within it ;).

It says a lot about Python's flexibility as a language that it applies
so well to so many different problem domains, but it can lead to some
interesting discussions when we try to align the interests of all those
different ill-defined groups :)

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia
---------------------------------------------------------------

Greg Ewing

Nov 11, 2009, 12:42:41 AM

to python...@python.org

Nick Coghlan wrote:

> It says a lot about Python's flexibility as a language that it applies
> so well to so many different problem domains, but it can lead to some
> interesting discussions when we try to align the interests of all those
> different ill-defined groups :)

Yes, and I think that because of this diversity of requirements,
it's very important to keep the basic building blocks of the
language as simple and focused as possible. The fundamental
types should each concentrate on doing just one thing and
doing it well.

Seems to me the bytes type is just right as it is -- basic
raw data that you can use any way you see fit. Anything more
specialised should be built by the user to suit their use
case.

--
Greg
