
About size of Unicode string


Frank Abel Cancio Bello

Jun 6, 2005, 1:21:30 PM
to pytho...@python.org
Hi all!

I need to know the size of a string object independently of its encoding. For
example:

len('123') == len('123'.encode('utf_8'))

while the size of the '123' object is different from the size of
'123'.encode('utf_8')

More:
I need to send a string in an HTTP request, so I need to know the length of the
string to set the "content-length" header independently of its encoding.

Any idea?

Thanks in advance
Frank


Laszlo Zsolt Nagy

Jun 6, 2005, 1:42:49 PM
to Frank Abel Cancio Bello, pytho...@python.org

This is from the RFC:

>
> The Content-Length entity-header field indicates the size of the
> entity-body, in decimal number of OCTETs, sent to the recipient or, in
> the case of the HEAD method, the size of the entity-body that would
> have been sent had the request been a GET.
>
> Content-Length = "Content-Length" ":" 1*DIGIT
>
>
> An example is
>
> Content-Length: 3495
>
>
> Applications SHOULD use this field to indicate the transfer-length of
> the message-body, unless this is prohibited by the rules in section
> 4.4 <http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.4>.
>
> Any Content-Length greater than or equal to zero is a valid value.
> Section 4.4 describes how to determine the length of a message-body if
> a Content-Length is not given.
>
It looks to me like the Content-Length header has nothing to do with the
encoding. It is very low-level stuff. The content length is given in
OCTETs and it represents the size of the body. Clearly, it has nothing
to do with MIME/encoding etc. It is about the number of bytes transferred
in the body. Try writing your unicode strings into a StringIO and taking
its length....

Laci

Frank Abel Cancio Bello

Jun 6, 2005, 2:48:53 PM
to pytho...@python.org
Well, I will repeat the question:

Can I get the number of bytes in a string object independently of its encoding?
Is the "len" function the right way to get it?

Laci, look at the following code:

import urllib2
request = urllib2.Request(url= 'http://localhost:6000')
data = 'data to send\n'.encode('utf_8')
request.add_data(data)
request.add_header('content-length', str(len(data)))
request.add_header('content-encoding', 'UTF-8')
file = urllib2.urlopen(request)

Is it always true that "the size of the entity-body" is "len(data)"
independently of the encoding of "data"?

Andrew Dalke

Jun 6, 2005, 4:02:40 PM
Frank Abel Cancio Bello wrote:
> Can I get the number of bytes in a string object independently of its encoding?
> Is the "len" function the right way to get it?

No. len(unicode_string) returns the number of characters in the
unicode_string.

The number of bytes depends on how the unicode characters are represented.
Different encodings will use different numbers of bytes.

>>> u = u"G\N{Latin small letter A with ring above}"
>>> u
u'G\xe5'
>>> len(u)
2
>>> u.encode("utf-8")
'G\xc3\xa5'
>>> len(u.encode("utf-8"))
3
>>> u.encode("latin1")
'G\xe5'
>>> len(u.encode("latin1"))
2
>>> u.encode("utf16")
'\xfe\xff\x00G\x00\xe5'
>>> len(u.encode("utf16"))
6
>>>

> Laci, look at the following code:
>
> import urllib2
> request = urllib2.Request(url= 'http://localhost:6000')
> data = 'data to send\n'.encode('utf_8')
> request.add_data(data)
> request.add_header('content-length', str(len(data)))
> request.add_header('content-encoding', 'UTF-8')
> file = urllib2.urlopen(request)
>
> Is it always true that "the size of the entity-body" is "len(data)"
> independently of the encoding of "data"?

For this case it is true because the logical length of 'data'
(which is a byte string) is equal to the number of bytes in the
string, and the utf-8 encoding of a byte string with character
values in the range 0-127, inclusive, is unchanged from the
original string.

In general, such as when 'data' is a unicode string, no.

len() returns the logical length of 'data'. That number does
not need to be the number of bytes used to represent 'data'.
To get the bytes you must encode the object.
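For readers on Python 3, where str holds characters and bytes holds the encoded octets, the same point can be sketched as follows. This is only an illustration under that assumption; the helper name encoded_size is hypothetical and does not come from the thread or from any library.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
# encoded_size is a hypothetical helper name used for illustration only.

def encoded_size(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies once encoded with `encoding`."""
    return len(text.encode(encoding))


text = "G\N{LATIN SMALL LETTER A WITH RING ABOVE}"

# len() on the str counts characters, not bytes.
print("characters:", len(text))

# The byte count only exists once an encoding has been chosen.
for enc in ("utf-8", "latin-1", "utf-16"):
    print(enc, encoded_size(text, enc))
```

The value to put in a Content-Length header would be the result for whichever encoding is actually used to produce the body.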

Andrew
da...@dalkescientific.com

Leif K-Brooks

Jun 6, 2005, 4:28:52 PM
Frank Abel Cancio Bello wrote:
> request.add_header('content-encoding', 'UTF-8')

The Content-Encoding header is for things like "gzip", not for
specifying the text encoding. Use the charset parameter to the
Content-Type header for that, as in "Content-Type: text/plain;
charset=utf-8".
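A minimal sketch of that suggestion, assuming Python 3's urllib.request rather than the urllib2 module used in the thread. The URL is the one from the original post; the request is only built here, not sent.

```python
# Python 3 sketch (assumption: urllib.request stands in for urllib2).
import urllib.request

# Encode once; these bytes are what actually goes on the wire.
body = "data to send\n".encode("utf-8")

request = urllib.request.Request(
    "http://localhost:6000",  # URL from the original post
    data=body,
    # The text encoding is declared via the charset parameter of Content-Type.
    headers={"Content-Type": "text/plain; charset=utf-8"},
)

# No Content-Encoding header is set, because the body is not compressed.
print(request.get_header("Content-type"))
print(request.has_header("Content-encoding"))
```

Request stores header names with only the first letter capitalized, which is why the lookups use "Content-type" and "Content-encoding".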

Frank Abel Cancio Bello

Jun 6, 2005, 5:43:40 PM
to pytho...@python.org
Thanks to all. Andrew's answer was an excellent explanation. Thanks Leif for
your suggestion.

Fredrik Lundh

Jun 13, 2005, 6:48:07 AM
to pytho...@python.org
Frank Abel Cancio Bello wrote:

> Can I get the number of bytes in a string object independently of its encoding?

strings hold characters, not bytes. an encoding is used to convert a
stream of characters to a stream of bytes. if you need to know the
number of bytes needed to hold an encoded string, you need to know
the encoding.

(and in some cases, including UTF-8, you need to *do* the encoding
before you can tell how many bytes you get)
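To illustrate why UTF-8 needs the encoding step, here is a small Python 3 sketch (an assumption, since the thread predates Python 3). The helper name utf8_widths is hypothetical. It shows that characters take between one and four bytes each in UTF-8, while a fixed-width encoding such as UTF-32 lets you compute the size from the character count alone.

```python
# Python 3 sketch; utf8_widths is a hypothetical helper for illustration.

def utf8_widths(text: str) -> list[int]:
    """Return the number of UTF-8 bytes used by each character of `text`."""
    return [len(ch.encode("utf-8")) for ch in text]


# ASCII letters, then a + a-ring + euro sign, then one emoji.
samples = ["abc", "a\u00e5\u20ac", "\U0001f600"]

for s in samples:
    widths = utf8_widths(s)
    # Variable width: the total is only known after looking at every character.
    print(len(s), "chars ->", widths, "=", sum(widths), "bytes in UTF-8")
    # Fixed width: UTF-32 (without a BOM) is always 4 bytes per character.
    print(len(s), "chars ->", len(s.encode("utf-32-be")), "bytes in UTF-32")
```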

> Is the "len" function the right way to get it?

len() on the encoded string, yes.

> Laci, look at the following code:
>
> import urllib2
> request = urllib2.Request(url= 'http://localhost:6000')
> data = 'data to send\n'.encode('utf_8')
> request.add_data(data)
> request.add_header('content-length', str(len(data)))
> request.add_header('content-encoding', 'UTF-8')
> file = urllib2.urlopen(request)
>
> Is it always true that "the size of the entity-body" is "len(data)"
> independently of the encoding of "data"?

your data variable contains bytes, not characters, so the answer is "yes".

on the other hand, that add_header line isn't really needed -- if you leave
it out, urllib2 will add the content-length header all by itself.

</F>
