What is 140 characters?

Craig Hockenberry

Mar 6, 2009, 12:43:45 PM
to Twitter Development Talk
Some discussion about this thread popped up on Twitter yesterday:

<http://groups.google.com/group/twitter-development-talk/browse_thread/thread/44be91d5ec5850fa>

Alex states that it's 140 bytes per tweet. So, of course, Loren
Brichter and I tried to prove that. With the following results:

1) 140 characters, including ones that require HTML entities:
<http://twitter.com/gnitset/status/1286202252>

At the time of posting, this tweet showed up on the site and in feeds
with all 140 characters. After a few hours, the "<" was converted to
"&lt;", increasing the count per character from one to four bytes and
decreasing the tweet length from 140 characters to 69. (You can see
this truncation at the end of the tweet: the "&" is from "&lt;")

Presumably, this happens as tweets in the memcache are written through
to the backing store.

I also see a lot of Twitter clients that don't realize how special the
&lt; and &gt; entities are. It took me a LONG time to figure out what
was going on here.

2) 140 Unicode _multi-byte_ characters:
<http://twitter.com/atebits/status/1286199010>

What's curious is that Loren's example with 140 characters uses the
Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
truncated? This seems to contradict Alex's statement in the thread
mentioned above.
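
To make the two experiments concrete, here is a minimal Python sketch of three ways the same tweet could be measured. The test strings are stand-ins (140 copies of "<" and 140 copies of U+27A1), not the exact text of the linked tweets, and the rule that only < and > become entities is an assumption for illustration; nothing here is confirmed back-end behavior. The helper names are made up for this example.

```python
# Sketch only: compares three candidate definitions of "length".
# The escaping rule (only < and > become entities) is an assumption,
# not confirmed back-end behavior.

def escape_angle_brackets(text: str) -> str:
    """Replace < and > with &lt; and &gt; (hypothetical storage format)."""
    return text.replace("<", "&lt;").replace(">", "&gt;")

def measure(text: str) -> dict:
    """Return the length of text under each candidate definition."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf8_bytes_escaped": len(escape_angle_brackets(text).encode("utf-8")),
    }

angle_tweet = "<" * 140        # stand-in for experiment 1
arrow_tweet = "\u27a1" * 140   # stand-in for experiment 2 (U+27A1)

print(measure(angle_tweet))
# {'code_points': 140, 'utf8_bytes': 140, 'utf8_bytes_escaped': 560}
print(measure(arrow_tweet))
# {'code_points': 140, 'utf8_bytes': 420, 'utf8_bytes_escaped': 420}
```

Under a strict 140-byte rule the second string would not fit, yet the tweet went through, which is why the two results look inconsistent.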

As people start to use things like Emoji, tinyarro.ws and generally
figure out that Unicode (UTF-8) is a valid type of data on Twitter,
our clients should adapt and display more accurate "characters
remaining" counts. I can count bytes instead of characters, but I'm
not sure if I should or not.

No one likes a truncated tweet: we need an explicit statement on how
to count and submit multi-byte characters and entities.

-ch

Alex Payne

Mar 6, 2009, 2:18:57 PM
to twitter-deve...@googlegroups.com
I'm taking this email to our Service Team, the folks who work on the
back-end of the service. The whole "message body changing as it moves
from cache to backing store" thing is totally unacceptable. Answers
soon.

--
Alex Payne - API Lead, Twitter, Inc.
http://twitter.com/al3x

Craig Hockenberry

Mar 6, 2009, 2:38:05 PM
to Twitter Development Talk
This truncation as data moves throughout your system occurs in other
places. I've seen the same behavior when setting a user's location and
bio, for example.

-ch

Matt Sanford

Mar 6, 2009, 2:43:37 PM
to twitter-deve...@googlegroups.com
I deployed a batch of explicit length checks this week to try to stop that madness. I didn't do the same for status text because it has another validation routine altogether. The Service Team should be able to help make that more sane.

— Matt

Cameron Kaiser

Mar 6, 2009, 2:47:56 PM
to twitter-deve...@googlegroups.com
> 2) 140 Unicode _multi-byte_ characters:
> <http://twitter.com/atebits/status/1286199010>
>
> What's curious is that Loren's example with 140 characters uses the
> Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
> truncated? This seems to contradict Alex's statement in the thread
> mentioned above.
>
> As people start to use things like Emoji, tinyarro.ws and generally
> figure out that Unicode (UTF-8) is a valid type of data on Twitter,
> our clients should adapt and display more accurate "characters
> remaining" counts. I can count bytes instead of characters, but I'm
> not sure if I should or not.

FWIW, I had a number of users complain about truncated UTF-8 sequences a
while back, which may be a symptom of this same problem. TTYtter now counts
bytes explicitly, and this seems to have dealt with the issue.
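
For readers wondering how a truncated UTF-8 sequence comes about, the sketch below shows a hard 140-byte cut landing in the middle of a three-byte character, plus one way to cut on a character boundary instead. It is an illustration in Python, not TTYtter's actual code, and `truncate_utf8` is a name invented for this example.

```python
# Sketch only: a hard 140-byte cut can split a multi-byte UTF-8 sequence.

def truncate_utf8(text: str, max_bytes: int = 140) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    raw = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete sequence left at the cut point.
    return raw.decode("utf-8", errors="ignore")

text = "\u27a1" * 50                    # 50 characters, 150 bytes
naive = text.encode("utf-8")[:140]      # ends two bytes into a character
try:
    naive.decode("utf-8")
except UnicodeDecodeError:
    print("naive cut produced an invalid sequence")

safe = truncate_utf8(text)
print(len(safe), len(safe.encode("utf-8")))   # 46 138
```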

--
------------------------------------ personal: http://www.cameronkaiser.com/ --
Cameron Kaiser * Floodgap Systems * www.floodgap.com * cka...@floodgap.com
-- Make welfare as hard to get as building permits. ---------------------------

Alex Payne

Mar 9, 2009, 6:17:47 PM
to twitter-deve...@googlegroups.com
Just to keep the group updated: one of our engineers has claimed this
issue. It will be dealt with with EXTREME prejudice.

--

atebits

Mar 9, 2009, 7:27:36 PM
to Twitter Development Talk
Just to confirm: "EXTREME prejudice" as in "140 *bytes* as defined by
UTF-8 with HTML entity encoding only for special (< > &) characters"?

So my tweet should *NOT* have worked?
http://twitter.com/atebits/status/1286199010

Would love to see some kind of official *this is how we determine how
long some hunk of unicode is* blurb on the API docs.

Loren

On Mar 9, 3:17 pm, Alex Payne <a...@twitter.com> wrote:
> Just to keep the group updated: one of our engineers has claimed this
> issue. It will be dealt with with EXTREME prejudice.
>
>
>
>
>
> On Fri, Mar 6, 2009 at 12:47, Cameron Kaiser <spec...@floodgap.com> wrote:
>
> >> 2) 140 Unicode _multi-byte_ characters:
> >> <http://twitter.com/atebits/status/1286199010>
>
> >> What's curious is that Loren's example with 140 characters uses the
> >> Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
> >> truncated? This seems to contradict Alex's statement in the thread
> >> mentioned above.
>
> >> As people start to use things like Emoji, tinyarro.ws and generally
> >> figure out that Unicode (UTF-8) is a valid type of data on Twitter,
> >> our clients should adapt and display more accurate "characters
> >> remaining" counts. I can count bytes instead of characters, but I'm
> >> not sure if I should or not.
>
> > FWIW, I had a number of users complain about truncated UTF-8 sequences a
> > while back, which may be a symptom of this same problem. TTYtter now counts
> > bytes explicitly, and this seems to have dealt with the issue.
>
> > --
> > ------------------------------------ personal:http://www.cameronkaiser.com/--
> >  Cameron Kaiser * Floodgap Systems *www.floodgap.com* ckai...@floodgap.com

Alex Payne

Mar 9, 2009, 8:03:02 PM
to twitter-deve...@googlegroups.com
Once the guys on the backend team get back to me, I'll provide as much.

TjL

Mar 10, 2009, 8:11:20 PM
to twitter-deve...@googlegroups.com
On Mon, Mar 9, 2009 at 7:27 PM, atebits <loren.b...@gmail.com> wrote:
>
> Just to confirm: "EXTREME prejudice" as in "140 *bytes* as defined by
> UTF-8 with HTML entity encoding only for special (< > &) characters?

Just to interject: & has not been specially encoded except for during
a brief time when " was also converted to &quot; and counted as 5
characters and &amp; equalled 4. This was un-done in a matter of days,
if not less.

I'd reiterate that there's no need to encode > as &gt;

If you are encoding < as &lt; there's no risk of someone getting an
<img> tag or <a href tag to work, so maybe there is an argument for a
left tag, but there's really no need to encode a right tag.

Figured I'd throw that out there FWIW

TjL

Nicolas Steenhout

Mar 10, 2009, 8:21:30 PM
to twitter-deve...@googlegroups.com
> I'd reiterate that there's no need to encode > as &gt;

What about &eacute; or &agrave; ?? :)

Nic

Alex Payne

Mar 10, 2009, 8:26:08 PM
to twitter-deve...@googlegroups.com
We consider the issue neither acute nor grave.

--

Cameron Kaiser

Mar 10, 2009, 8:32:10 PM
to twitter-deve...@googlegroups.com
> > > I'd reiterate that there's no need to encode > as &gt;
> >
> > What about &eacute; or &agrave; ?? :)
>
> We consider the issue neither acute nor grave.

You were waiting all day for that, weren't you?

--
------------------------------------ personal: http://www.cameronkaiser.com/ --
Cameron Kaiser * Floodgap Systems * www.floodgap.com * cka...@floodgap.com

-- I use my C128 because I am an ornery, stubborn, retro grouch. -- Bob Masse -

TjL

Mar 10, 2009, 9:07:00 PM
to twitter-deve...@googlegroups.com
On Tue, Mar 10, 2009 at 8:26 PM, Alex Payne <al...@twitter.com> wrote:
> We consider the issue neither acute nor grave.
>

UNFOLLOW.

Oh, wait, crap.

Andrew Badera

Mar 10, 2009, 9:24:56 PM
to twitter-deve...@googlegroups.com
if you listen real hard, you can hear *groan*'s from the East Coast.


On Tue, Mar 10, 2009 at 8:26 PM, Alex Payne <al...@twitter.com> wrote:
>

Nicolas Steenhout

Mar 10, 2009, 11:46:59 PM
to twitter-deve...@googlegroups.com
You're all a bunch of degenera.... errr, geeks, right, that's right, GEEKS! :)

<shaking head>

Nic

Dossy Shiobara

Mar 11, 2009, 12:09:08 AM
to twitter-deve...@googlegroups.com
On 3/10/09 11:46 PM, Nicolas Steenhout wrote:
> You're all a bunch of degenera.... errr, geeks, right, that's right, GEEKS! :)
>
> <shaking head>

"All" is such an inclusive term, isn't it? :-)

ONE OF US! ONE OF US!

--
Dossy Shiobara | do...@panoptic.com | http://dossy.org/
Panoptic Computer Network | http://panoptic.com/
"He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on." (p. 70)

Nicolas Steenhout

Mar 11, 2009, 1:13:03 AM
to twitter-deve...@googlegroups.com
> "All" is such an inclusive term, isn't it?  :-)
>
> ONE OF US!  ONE OF US!

Hey! I resemble that :)

But I fear we're somewhat West of the OT's question :)

And we still don't know what to do about encoding html entities used
for accents in languages such as French, Spanish, etc ;)

Nic

Dossy Shiobara

Mar 11, 2009, 1:49:57 AM
to twitter-deve...@googlegroups.com
On 3/11/09 1:13 AM, Nicolas Steenhout wrote:
> And we still don't know what to do about encoding html entities used
> for accents in languages such as French, Spanish, etc ;)

Transcode to their UTF-8 codepoints and punt. Let the Twitter
developers figure out how to handle the data. :-)
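
One possible reading of "transcode and punt", sketched in Python: turn any HTML character references into the characters they name before submitting, so the text goes out as plain UTF-8. The sample string and the helper name `entities_to_text` are made up for this example.

```python
import html

def entities_to_text(text: str) -> str:
    """Replace HTML character references with the characters they name."""
    return html.unescape(text)

raw = "caf&eacute; &agrave; la carte"
plain = entities_to_text(raw)

print(plain)                        # café à la carte
print(len(plain))                   # 15 code points
print(len(plain.encode("utf-8")))   # 17 bytes in UTF-8
```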

Craig Hockenberry

Mar 11, 2009, 11:39:54 AM
to Twitter Development Talk
DO NOT ENCODE WITH HTML ENTITIES.

The only reason that < and > are encoded as &lt; and &gt; is because
these values are represented within an XML <text> element. This is
invalid XML:

<text>This <-- is a test</text>

And this is valid XML:

<text>This &lt;-- is a test</text>

If you use HTML entities, they will only show up correctly in a web
browser. SMS and other media will display &crap;.
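
The XML point can be checked with a few lines of Python using only the standard library. This is a sketch of the general rule, not a description of how Twitter builds its feeds; `is_well_formed` is a helper invented for the example. It also shows that an HTML-only entity such as &eacute; is not defined in plain XML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

status = "This <-- is a test"

# A raw "<" inside element content is rejected by the parser.
print(is_well_formed("<text>" + status + "</text>"))           # False
# escape() rewrites &, < and > so the same text parses cleanly.
print(is_well_formed("<text>" + escape(status) + "</text>"))   # True
# HTML entities other than the XML built-ins are undefined in XML.
print(is_well_formed("<text>caf&eacute;</text>"))              # False

# Parsing the escaped form gives back the original text.
print(ET.fromstring("<text>" + escape(status) + "</text>").text)
# This <-- is a test
```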

-ch

Craig Hockenberry

Mar 24, 2009, 10:56:37 AM
to Twitter Development Talk
Any news from the Service Team? I'd really like to get the counters
right in an upcoming release...

-ch

Alex Payne

Mar 24, 2009, 9:36:50 PM
to twitter-deve...@googlegroups.com
Unfortunately, nothing definitive. We're still looking into this.

Bill Robertson

Mar 24, 2009, 10:08:22 PM
to Twitter Development Talk
I have been wondering too. If it's a character, it should be a
character, whether it's an 'A', 'À' or '的'

TjL

Sep 8, 2009, 5:05:09 PM
to twitter-deve...@googlegroups.com
It's been nearly 6 months. Has this question been answered? If so I missed it.

Matt Sanford

Sep 9, 2009, 1:07:14 AM
to Twitter Development Talk
Hi There,

I'm sorry this never got updated. Some changes have been made and
are waiting to go out now. When I switched from working on the
Platform (formerly API) team to my focus on international I took over
this issue.
Once this current fix is deployed (probably in a week or so since
I'm traveling at the moment) the definition of a character will be
consistent throughout our API. The new change will always compute
length based on the Unicode NFC [1] version of the string. Using the
NFC form makes the 140 character limit based on the length as
displayed rather than some under-the-cover byte arithmetic.
I more than agree with the above statement that a character is a
character and Twitter shouldn't care. Data should be data. The main
issue with that is that some clients compose characters and some
don't. My common example of this is é. Depending on your client
Twitter could get:

é - 1 byte
- URL Encoded UTF-8: %C3%A9
- http://www.fileformat.info/info/unicode/char/00e9/index.htm

-- or --

é - 2 bytes
- URL Encoded UTF-8: %65%CC%81
- http://www.fileformat.info/info/unicode/char/0065/index.htm
+ plus: http://www.fileformat.info/info/unicode/char/0301/index.htm

So, my fix will make it so that no matter the client if the user
sees é it counts as a single character. I'll announce something in the
change log once my fix is deployed.

Thanks;
— Matt Sanford / @mzsanford

[1] - http://www.unicode.org/reports/tr15/
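
For client authors who want to mirror this rule in a "characters remaining" counter, a minimal Python sketch follows. It assumes the count is the number of code points after NFC normalization, which is one reading of the description above rather than the deployed server code; `nfc_length` and `LIMIT` are names chosen for the example.

```python
import unicodedata

LIMIT = 140  # the limit discussed in this thread

def nfc_length(text: str) -> int:
    """Count code points after NFC normalization."""
    return len(unicodedata.normalize("NFC", text))

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(len(composed), len(decomposed))                 # 1 2
print(nfc_length(composed), nfc_length(decomposed))   # 1 1
print(LIMIT - nfc_length("caf" + decomposed))         # 136
```

Note that NFC only composes sequences that have a precomposed form in Unicode, so the result will not always equal the number of glyphs a user sees.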

Charles A. Lopez

Sep 9, 2009, 5:49:58 AM
to twitter-deve...@googlegroups.com


2009/9/9 Matt Sanford <ma...@twitter.com>

> Hi There,
>
>    I'm sorry this never got updated. Some changes have been made and
> are waiting to go out now. When I switched from working on the
> Platform (formerly API) team to my focus on international I took over
> this issue.
>    Once this current fix is deployed (probably in a week or so since
> I'm traveling at the moment) the definition of a character will be
> consistent throughout our API. The new change will always compute
> length based on the Unicode NFC [1] version of the string. Using the
> NFC form makes the 140 character limit based on the length as
> displayed rather than some under-the-cover byte arithmetic.
>    I more than agree with the above statement that a character is a
> character and Twitter shouldn't care. Data should be data. The main
> issue with that is that some clients compose characters and some
> don't. My common example of this is é. Depending on your client
> Twitter could get:
>
> é - 1 byte
>   - URL Encoded UTF-8: %C3%A9
>   - http://www.fileformat.info/info/unicode/char/00e9/index.htm

isn't that 2 bytes?

> -- or --
>
> é - 2 bytes
>   - URL Encoded UTF-8: %65%CC%81
>   - http://www.fileformat.info/info/unicode/char/0065/index.htm
>     + plus: http://www.fileformat.info/info/unicode/char/0301/index.htm

and this three bytes?
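
Going by the URL-encoded forms quoted above, the byte counts do appear to be 2 and 3; the figures 1 and 2 match the number of code points instead. A quick Python check, where `describe` is a helper made up for this note:

```python
def describe(text: str) -> tuple:
    """Return (code points, UTF-8 bytes, hex dump of the bytes) for text."""
    encoded = text.encode("utf-8")
    return len(text), len(encoded), encoded.hex(" ")

print(describe("\u00e9"))    # (1, 2, 'c3 a9')
print(describe("e\u0301"))   # (2, 3, '65 cc 81')
```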

TjL

Sep 10, 2009, 2:14:12 PM
to twitter-deve...@googlegroups.com
On Wed, Sep 9, 2009 at 1:07 AM, Matt Sanford <ma...@twitter.com> wrote:
>    I more than agree with the above statement that a character is a
> character and Twitter shouldn't care. Data should be data. The main
> issue with that is that some clients compose characters and some
> don't. My common example of this is é. Depending on your client
> Twitter could get:
>
> é - 1 byte
>   - URL Encoded UTF-8: %C3%A9
>   - http://www.fileformat.info/info/unicode/char/00e9/index.htm
>
> -- or --
>
> é - 2 bytes
>   - URL Encoded UTF-8: %65%CC%81
>   - http://www.fileformat.info/info/unicode/char/0065/index.htm
>     + plus: http://www.fileformat.info/info/unicode/char/0301/index.htm
>
>    So, my fix will make it so that no matter the client if the user
> sees é it counts as a single character. I'll announce something in the
> change log once my fix is deployed.

Sorry for being picky about this, I'm just trying to make sure that
I'm understanding the terms correctly as you are using them.

I tend to think of Twitter as 140 "characters" (rather than bytes). I
realize that "character" may not have a precise definition, but to me,
each of these is "one character":

e é < & >

Am I understanding you correctly that Twitter is moving to standardize
where you can send a message with 140 "characters" regardless of
whether that's 140 e or 140 é or 140 < or 140 & or 140 > ?

I think that's what is being said, I just want to make sure I'm
understanding properly.

Thanks!

TjL

Cameron Kaiser

Oct 16, 2009, 9:00:45 AM
to twitter-deve...@googlegroups.com
Just to follow up on Matt's note,

> I'm sorry this never got updated. Some changes have been made and
> are waiting to go out now. When I switched from working on the
> Platform (formerly API) team to my focus on international I took over
> this issue.
> Once this current fix is deployed (probably in a week or so since
> I'm traveling at the moment) the definition of a character will be
> consistent throughout our API. The new change will always compute
> length based on the Unicode NFC [1] version of the string. Using the
> NFC form makes the 140 character limit based on the length as
> displayed rather than some under-the-cover byte arithmetic.

Has this change occurred yet? I have a fix in TTYtter ready to go to enable
140 *character* posts instead of bytes, but I don't want to deploy it until
I know the path is clear.

--
------------------------------------ personal: http://www.cameronkaiser.com/ --
Cameron Kaiser * Floodgap Systems * www.floodgap.com * cka...@floodgap.com

-- armadillo, n. the act of providing weapons to a Spanish pickle. ------------
