At the time of posting, this tweet showed up on the site and in feeds
with all 140 characters. After a few hours, the "<" was converted to
"&lt;", increasing the count per character from one to four bytes and
decreasing the tweet length from 140 characters to 69. (You can see
this truncation at the end of the tweet: the "&" is from "&lt;".)
Presumably, this happens as tweets in the memcache are written through
to the backing store.
I also see a lot of Twitter clients that don't realize how special the
&lt; and &gt; entities are. It took me a LONG time to figure out what
was going on here.
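A minimal sketch of the effect described above, assuming the back end escapes "<" and ">" as entities and then cuts the stored text at 140 units. The function names and the truncation step are illustrative guesses, not Twitter's actual code:

```python
def escape_angle_brackets(text: str) -> str:
    """Escape only '<' and '>' (assumed back-end behaviour)."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def store(text: str, limit: int = 140) -> str:
    """Hypothetical write-through step: escape first, then truncate."""
    return escape_angle_brackets(text)[:limit]


# A 140-character tweet that ends in "<" grows to 143 units once escaped,
# so the stored copy is cut off in the middle of the entity.
tweet = "a" * 139 + "<"
stored = store(tweet)
print(len(tweet), len(escape_angle_brackets(tweet)), repr(stored[-1]))
```

Here the stored text ends in a bare "&", which is the same kind of tail as the truncated tweet in the report.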
What's curious is that Loren's example with 140 characters uses the
Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
truncated? This seems to contradict Alex's statement in the thread
mentioned above.
As people start to use things like Emoji, tinyarro.ws and generally
figure out that Unicode (UTF-8) is a valid type of data on Twitter,
our clients should adapt and display more accurate "characters
remaining" counts. I can count bytes instead of characters, but I'm
not sure if I should or not.
No one likes a truncated tweet: we need an explicit statement on how
to count and submit multi-byte characters and entities.
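To make the characters-versus-bytes question concrete, here is a small sketch of a "remaining" counter that reports both measures. The 140 limit comes from the thread; which measure the service enforces is exactly the open question, so neither is presented as correct:

```python
def remaining(text: str, limit: int = 140) -> dict:
    """Return the remaining budget by code points and by UTF-8 bytes."""
    return {
        "characters": limit - len(text),              # Unicode code points
        "bytes": limit - len(text.encode("utf-8")),   # encoded size
    }


arrow = "\u27a1"                      # the U+27A1 glyph mentioned above
print(len(arrow.encode("utf-8")))     # 3 bytes in UTF-8
print(remaining(arrow * 140))         # 0 characters left, far over in bytes
```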
I'm taking this email to our Service Team, the folks who work on the back-end of the service. The whole "message body changing as it moves from cache to backing store" thing is totally unacceptable. Answers soon.
This truncation as data moves throughout your system occurs in other
places. I've seen the same behavior when setting a user's location and
bio, for example.
-ch
On Mar 6, 11:18 am, Alex Payne <a...@twitter.com> wrote:
> I'm taking this email to our Service Team, the folks who work on the
> back-end of the service. The whole "message body changing as it moves
> from cache to backing store" thing is totally unacceptable. Answers
> soon.
I deployed a batch of explicit length checks this week to try and stop that madness. I didn't do the same for status text because it has another validation routine altogether. The Service Team should be able to help out with making that more sane.
— Matt
On Mar 6, 2009, at 11:38 AM, Craig Hockenberry wrote:
> This truncation as data moves throughout your system occurs in other
> places. I've seen the same behavior when setting a user's location and
> bio, for example.
> -ch
> On Mar 6, 11:18 am, Alex Payne <a...@twitter.com> wrote:
>> I'm taking this email to our Service Team, the folks who work on the
>> back-end of the service. The whole "message body changing as it moves
>> from cache to backing store" thing is totally unacceptable. Answers
>> soon.
>> On Fri, Mar 6, 2009 at 09:43, Craig Hockenberry
> What's curious is that Loren's example with 140 characters uses the
> Unicode 27A1 glyph. It uses 3 bytes in UTF-8. Why didn't it get
> truncated? This seems to contradict Alex's statement in the thread
> mentioned above.
> As people start to use things like Emoji, tinyarro.ws and generally
> figure out that Unicode (UTF-8) is a valid type of data on Twitter,
> our clients should adapt and display more accurate "characters
> remaining" counts. I can count bytes instead of characters, but I'm
> not sure if I should or not.
FWIW, I had a number of users complain about truncated UTF-8 sequences a while back, which may be a symptom of this same problem. TTYtter now counts bytes explicitly, and this seems to have dealt with the issue.
-- ------------------------------------ personal: http://www.cameronkaiser.com/ -- Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckai...@floodgap.com -- Make welfare as hard to get as building permits. ---------------------------
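One way a client could count bytes and avoid sending a sequence that gets cut mid-character is sketched below. This is not TTYtter's code; the function name and the 140-byte budget are assumptions made for illustration:

```python
def truncate_utf8(text: str, max_bytes: int = 140) -> str:
    """Trim text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # Cutting the byte string may leave an incomplete multi-byte sequence
    # at the end; errors="ignore" drops those dangling bytes on decode.
    return data[:max_bytes].decode("utf-8", errors="ignore")


clipped = truncate_utf8("\u27a1" * 50)          # 150 bytes before trimming
print(len(clipped), len(clipped.encode("utf-8")))
```

Each U+27A1 takes 3 bytes, so only 46 whole glyphs (138 bytes) fit in a 140-byte budget; the two leftover bytes of the 47th are discarded rather than sent as a broken sequence.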
> Would love to see some kind of official *this is how we determine how
> long some hunk of Unicode is* blurb on the API docs.
> Loren
> On Mar 9, 3:17 pm, Alex Payne <a...@twitter.com> wrote:
>> Just to keep the group updated: one of our engineers has claimed this
>> issue. It will be dealt with with EXTREME prejudice.
>> On Fri, Mar 6, 2009 at 12:47, Cameron Kaiser <spec...@floodgap.com> wrote:
On Mon, Mar 9, 2009 at 7:27 PM, atebits <loren.brich...@gmail.com> wrote:
> Just to confirm: "EXTREME prejudice" as in "140 *bytes* as defined by
> UTF-8 with HTML entity encoding only for special (< > &) characters"?
Just to interject: & has not been specially encoded except during a brief time when " was also converted to &quot; and counted as 5 characters and & equalled 4. This was undone in a matter of days, if not less.
I'd reiterate that there's no need to encode > as &gt;
If you are encoding < as &lt; there's no risk of someone getting an <img> tag or <a href tag to work, so maybe there is an argument for encoding the left angle bracket, but there's really no need to encode the right one.
> > > I'd reiterate that there's no need to encode > as &gt;
> > What about &eacute; or &agrave; ?? :)
> We consider the issue neither acute nor grave.
You were waiting all day for that, weren't you?
-- ------------------------------------ personal: http://www.cameronkaiser.com/ -- Cameron Kaiser * Floodgap Systems * www.floodgap.com * ckai...@floodgap.com -- I use my C128 because I am an ornery, stubborn, retro grouch. -- Bob Masse -
> You're all a bunch of degenera.... errr, geeks, right, that's right, GEEKS! :)
> <shaking head>
"All" is such an inclusive term, isn't it? :-)
ONE OF US! ONE OF US!
-- Dossy Shiobara | do...@panoptic.com | http://dossy.org/ Panoptic Computer Network | http://panoptic.com/ "He realized the fastest way to change is to laugh at your own folly -- then you can let go and quickly move on." (p. 70)
> And we still don't know what to do about encoding html entities used > for accents in languages such as French, Spanish, etc ;)
Transcode to their UTF-8 codepoints and punt. Let the Twitter developers figure out how to handle the data. :-)
-- Dossy Shiobara | do...@panoptic.com | http://dossy.org/ Panoptic Computer Network | http://panoptic.com/ "He realized the fastest way to change is to laugh at your own folly -- then you can let go and quickly move on." (p. 70)
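A minimal sketch of the "transcode and punt" idea: turn named HTML entities into the characters they stand for, then send UTF-8. It uses Python's standard `html.unescape`; the wrapper name is made up for the example, and whether a client should also decode `&lt;` and `&gt;` this way is left open by the thread:

```python
import html


def transcode(text: str) -> bytes:
    """Replace HTML entities with their characters and encode as UTF-8."""
    return html.unescape(text).encode("utf-8")


payload = transcode("caf&eacute; &agrave; la carte")
print(payload.decode("utf-8"), len(payload))
```

Note that `html.unescape` decodes every entity it recognises, including `&lt;` and `&amp;`, so a client that wants to keep those escaped would need to handle them separately.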
On Tue, Mar 24, 2009 at 9:36 PM, Alex Payne <a...@twitter.com> wrote:
> Unfortunately, nothing definitive. We're still looking into this.
> On Tue, Mar 24, 2009 at 07:56, Craig Hockenberry
> <craig.hockenbe...@gmail.com> wrote:
>> Any news from the Service Team? I'd really like to get the counters
>> right in an upcoming release...
>> -ch
I'm sorry this never got updated. Some changes have been made and
are waiting to go out now. When I switched from working on the
Platform (formerly API) team to my focus on international I took over
this issue.
Once this current fix is deployed (probably in a week or so since
I'm traveling at the moment) the definition of a character will be
consistent throughout our API. The new change will always compute
length based on the Unicode NFC [1] version of the string. Using the
NFC form makes the 140 character limit based on the length as
displayed rather than some under-the-cover byte arithmetic.
I more than agree with the above statement that a character is a
character and Twitter shouldn't care. Data should be data. The main
issue with that is that some clients compose characters and some
don't. My common example of this is é. Depending on your client
Twitter could get:
So, my fix will make it so that, no matter the client, if the user
sees é it counts as a single character. I'll announce something in the
change log once my fix is deployed.
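The composed-versus-decomposed case can be sketched with Python's standard `unicodedata` module. This only illustrates counting after NFC normalization as described above; the helper name is hypothetical and it is not Twitter's implementation:

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Count code points after normalizing to Unicode NFC."""
    return len(unicodedata.normalize("NFC", text))


composed = "\u00e9"       # é as one precomposed code point
decomposed = "e\u0301"    # "e" followed by a combining acute accent

for s in (composed, decomposed):
    print(len(s), len(s.encode("utf-8")), nfc_length(s))
```

The two strings render identically but differ in raw code points (1 vs 2) and UTF-8 bytes (2 vs 3); after NFC both count as a single character.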
> On Sep 9, 6:05 am, TjL <luo...@gmail.com> wrote:
> > It's been nearly 6 months. Has this question been answered? If so I
> > missed it.