Welcome to UTF-8.
This is something I consult on all the time. The days when encoding
length, character count, and display width were all the same are long
gone. It's a habit you have to break your mind of (and it doesn't
help that languages like C and C++ call a byte a "char").
A single character can take anywhere from 1 to 4 bytes.
Basically:
U+000000 to U+00007F (basic Latin) = 1 byte - the graceful part of
UTF-8 is that it is directly equivalent to ASCII in that range.
U+000080 to U+0007FF - 2 bytes
U+000800 to U+00FFFF - 3 bytes
U+010000 to U+10FFFF - 4 bytes
U+10FFFF is the highest valid code point, so 4 bytes is the maximum
(the original design allowed sequences up to 6 bytes, but RFC 3629
capped it at 4).
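The ranges above are easy to check for yourself. A quick sketch in Python (the specific sample characters are just illustrative picks, one from each range):

```python
# Encode one sample character from each UTF-8 length class and
# print its code point alongside its encoded byte length.
samples = [
    "A",   # U+0041, basic Latin  -> 1 byte (same as ASCII)
    "é",   # U+00E9               -> 2 bytes
    "€",   # U+20AC               -> 3 bytes
    "😀",  # U+1F600              -> 4 bytes
]

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):06X} -> {len(encoded)} byte(s)")
```

Note that len() on the string gives the code point count, while len() on the encoded bytes gives the storage size; the two only agree in the ASCII range.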
See:
http://en.wikipedia.org/wiki/UTF-8
Zac Bowling
http://zbowling.com/