Alex Payne - API Lead, Twitter, Inc.
FWIW, I had a number of users complain about truncated UTF-8 sequences a
while back, which may be a symptom of this same problem. TTYtter now counts
bytes explicitly, and this seems to have dealt with the issue.
------------------------------------ personal: http://www.cameronkaiser.com/ --
Cameron Kaiser * Floodgap Systems * www.floodgap.com * cka...@floodgap.com
-- Make welfare as hard to get as building permits. ---------------------------
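For illustration only, a minimal Python sketch of how a byte-based limit can truncate a UTF-8 sequence, and one way to cut on a character boundary instead. This is not TTYtter's actual code; the name truncate_utf8 and the byte budget used here are assumptions made for the example.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text to at most max_bytes of UTF-8 without splitting a character."""
    raw = text.encode("utf-8")
    if len(raw) <= max_bytes:
        return text
    # A plain raw[:max_bytes] may end in the middle of a multi-byte sequence;
    # errors="ignore" drops that incomplete trailing sequence when decoding.
    return raw[:max_bytes].decode("utf-8", errors="ignore")


sample = "\u00e9" * 3  # three characters, six bytes in UTF-8
print(len(sample), len(sample.encode("utf-8")))
print(truncate_utf8(sample, 5))  # keeps two whole characters (four bytes)
```

Counting bytes explicitly, as described above, is what makes it possible to see that a five-byte budget holds only two of these characters.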
Just to interject: & has not been specially encoded as &amp; except
during a brief time when " was also converted to &quot; and counted as 6
characters and &amp; counted as 5. This was undone in a matter of days,
if not less.
I'd reiterate that there's no need to encode > as &gt;
If you are encoding < as &lt; there's no risk of someone getting an
<img> tag or <a href> tag to work, so maybe there is an argument for
encoding the left bracket, but there's really no need to encode the right one.
Figured I'd throw that out there FWIW
What about &eacute; or &agrave; ?? :)
You were waiting all day for that, weren't you?
-- I use my C128 because I am an ornery, stubborn, retro grouch. -- Bob Masse -
Oh, wait, crap.
On Tue, Mar 10, 2009 at 8:26 PM, Alex Payne <al...@twitter.com> wrote:
"All" is such an inclusive term, isn't it? :-)
ONE OF US! ONE OF US!
Dossy Shiobara | do...@panoptic.com | http://dossy.org/
Panoptic Computer Network | http://panoptic.com/
"He realized the fastest way to change is to laugh at your own
folly -- then you can let go and quickly move on." (p. 70)
Hey! I resemble that :)
But I fear we're somewhat West of the OT's question :)
And we still don't know what to do about encoding html entities used
for accents in languages such as French, Spanish, etc ;)
Transcode to their UTF-8 codepoints and punt. Let the Twitter
developers figure out how to handle the data. :-)
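A rough sketch of that "transcode and punt" step in Python, assuming the input uses standard HTML named entities. The helper name transcode_entities is made up for this example; html.unescape is the standard-library call doing the work. Note that it also decodes &lt;, &gt; and &amp;, which may or may not be wanted given the discussion above.

```python
import html


def transcode_entities(text: str) -> str:
    """Replace HTML character references with the characters they name."""
    return html.unescape(text)


decoded = transcode_entities("caf&eacute; &agrave; la carte")
utf8 = decoded.encode("utf-8")

print(decoded)
# Each accented letter is one character but two bytes in UTF-8.
print(len(decoded), len(utf8))
```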
I'm sorry this never got updated. Some changes have been made and
are waiting to go out now. When I switched from working on the
Platform (formerly API) team to my focus on international I took over
this issue.
Once this current fix is deployed (probably in a week or so since
I'm traveling at the moment) the definition of a character will be
consistent throughout our API. The new change will always compute
length based on the Unicode NFC version of the string. Using the
NFC form makes the 140 character limit based on the length as
displayed rather than some under-the-cover byte arithmetic.
I more than agree with the above statement that a character is a
character and Twitter shouldn't care. Data should be data. The main
issue with that is that some clients compose characters and some
don't. My common example of this is é. Depending on your client
Twitter could get:
é - 1 code point (2 bytes in UTF-8)
- URL Encoded UTF-8: %C3%A9
-- or --
é - 2 code points (3 bytes in UTF-8)
- URL Encoded UTF-8: %65%CC%81
+ plus: http://www.fileformat.info/info/unicode/char/0301/index.htm
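To make the two cases concrete, here is a minimal Python sketch. It assumes "length" means the number of Unicode code points after NFC normalization, as described above; the helper name nfc_length is hypothetical and is not part of any Twitter API.

```python
import unicodedata
from urllib.parse import unquote


def nfc_length(text: str) -> int:
    """Count code points after NFC normalization (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


composed = unquote("%C3%A9")       # U+00E9, precomposed
decomposed = unquote("%65%CC%81")  # U+0065 followed by combining U+0301

for label, s in (("composed", composed), ("decomposed", decomposed)):
    # bytes in UTF-8, raw code points, code points after NFC
    print(label, len(s.encode("utf-8")), len(s), nfc_length(s))
```

Both inputs come out with an NFC length of 1, even though they differ in raw code points and in bytes.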
Sorry for being picky about this, I'm just trying to make sure that
I'm understanding the terms correctly as you are using them.
I tend to think of Twitter as 140 "characters" (rather than bytes). I
realize that "character" may not have a precise definition, but to me,
each of these is "one character":
e é < & >
Am I understanding you correctly that Twitter is moving to standardize
where you can send a message with 140 "characters" regardless of
whether that's 140 e or 140 é or 140 < or 140 & or 140 > ?
I think that's what is being said, I just want to make sure I'm
> I'm sorry this never got updated. Some changes have been made and
> are waiting to go out now. When I switched from working on the
> Platform (formerly API) team to my focus on international I took over
> this issue.
> Once this current fix is deployed (probably in a week or so since
> I'm traveling at the moment) the definition of a character will be
> consistent throughout our API. The new change will always compute
> length based on the Unicode NFC version of the string. Using the
> NFC form makes the 140 character limit based on the length as
> displayed rather than some under-the-cover byte arithmetic.
Has this change occurred yet? I have a fix in TTYtter ready to go to enable
140 *character* posts instead of bytes, but I don't want to deploy it until
I know the path is clear.
-- armadillo, n. the act of providing weapons to a Spanish pickle. ------------