counting messages: characters vs. bytes, HTML entities


leoboiko

Oct 5, 2007, 11:48:49 AM
to Twitter Development Talk
Hi. I'm adding stuff to twit.el and ran into two issues when
counting text size:

1. The API docs talk about a maximum of 160 (recommended 140)
*characters*, but as far as my tests go, Twitter seems to count 140
*bytes*, which of course are not the same thing in UTF-8. When I
use accented, typographical or Japanese characters, the "recent" page
truncates updates at far fewer than 140 characters.

If this is intended, I think the documentation should be clearer.

2. Two characters are converted to HTML entities: '<' and
'>'. What's weird is that the counting algorithm counts the entity
size, not the character size, so that '<>' is an update of length
8 ("&lt;&gt;"). Curiously, '&' is not converted to "&amp;".

So to count characters in twit.el I'm doing this: 1) expand '<' and
'>' into their entities and 2) count the byte size of the resulting
string, encoded as UTF-8. Is that correct? It sounds hackish to me, but
as far as I have tested, it gives me exact results on Twitter.com.
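
The two steps above can be sketched in Python for illustration. The function name `twitter_length` is hypothetical, and the rule it encodes (only '<' and '>' expand, '&' does not) is just what the tests described here suggested, not documented Twitter behaviour.

```python
def twitter_length(text: str) -> int:
    """Length of an update as Twitter appeared to count it (an assumption)."""
    # Step 1: expand '<' and '>' into their HTML entities.
    # '&' is deliberately left alone, matching the observation above.
    expanded = text.replace("<", "&lt;").replace(">", "&gt;")
    # Step 2: count the bytes of the UTF-8 encoding, not the characters.
    return len(expanded.encode("utf-8"))


if __name__ == "__main__":
    print(twitter_length("<>"))         # 8
    print(twitter_length("caf\u00e9"))  # 5: 'e' with acute accent takes 2 bytes
```

An update would then fit only when `twitter_length(text) <= 140`.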

leoboiko

Oct 8, 2007, 9:40:55 AM
to Twitter Development Talk
Er... I'm sorry for insisting, but could someone confirm whether these
observations are correct? I want to post my modifications of twit.el to
emacswiki, and I'm afraid that the Twitter behaviour I described
might change suddenly.

Piers Karsenbarg

Oct 8, 2007, 9:43:08 AM
to twitter-deve...@googlegroups.com
Should be 140 characters, as that's the optimum (probably not the correct word, but hey) length for an SMS message.

leoboiko

Oct 8, 2007, 10:35:53 AM
to Twitter Development Talk
Piers Karsenbarg wrote:
> Should be 140 characters as that's the optimum (probably not the correct
> word, but hey) length for an SMS message.

I'm not familiar with the SMS standard, but it's *very* important to
clarify whether that's 140 characters, or 140 bytes in a given
encoding. Characters in UTF-8 may take up to 4 bytes each. 140
characters of UTF-8 Japanese take much more than 140 bytes.
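
To make the difference concrete, here is a small Python sketch. The helper names `utf8_size` and `fit_to_bytes` are made up for this example, and it does not claim to reproduce how Twitter actually truncates; it only shows how many whole characters fit in a 140-byte budget.

```python
def utf8_size(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def fit_to_bytes(text: str, limit: int = 140) -> str:
    """Longest prefix of text whose UTF-8 encoding fits in `limit` bytes."""
    clipped = text.encode("utf-8")[:limit]
    # errors="ignore" drops the partial character the byte cut may leave
    # at the very end; the rest of the input is valid UTF-8.
    return clipped.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    update = "\u3042" * 140  # 140 copies of hiragana 'a', 3 bytes each
    print(len(update), utf8_size(update))  # 140 420
    print(len(fit_to_bytes(update)))       # 46
```

So a 140-character update written entirely in kana needs 420 bytes, and only 46 of those characters fit in 140 bytes.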

There's also the issue of HTML entities - '<' and '>' taking 4 bytes
each - which seems like a bug to me.

Compare e.g. http://twitter.com/eru/statuses/312496202 to
http://twitter.com/eru/statuses/312496292 ; both have exactly 140
characters, but only the latter was truncated on the Twitter "recent"
page (and only the latter gave a "message too long" warning in the IM
interface). For entities, see
http://twitter.com/eru/statuses/320385382 ; it is only 47 characters
long, but was truncated.

Piers Karsenbarg

Oct 8, 2007, 10:46:30 AM
to twitter-deve...@googlegroups.com
Well, when you try to send more than 140 characters both on the web and via text, it tells you that you've gone over. Not rocket science.


Cameron Kaiser

Oct 8, 2007, 10:51:59 AM
to twitter-deve...@googlegroups.com
> > Should be 140 characters as that's the optimum (probably not the correct
> > word, but hey) length for an SMS message.
>
> I'm not familiar with the SMS standard, but it's *very* important to
> clarify whether that's 140 characters, or 140 bytes in a given
> encoding. Characters in UTF-8 may take up to 4 bytes each. 140
> characters of UTF-8 Japanese take much more than 140 bytes.

This is an excellent point to clarify. GSM SMS is 7-bit only, so as far
as I'm aware, it should be 140 bytes, not characters.
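
For reference, the arithmetic linking the two figures in this thread, as a short Python sketch. It assumes the standard single-SMS payload of 140 octets and the 7-bit GSM default alphabet; the constant names are made up for this example.

```python
SMS_PAYLOAD_OCTETS = 140  # payload of a single SMS, in 8-bit bytes
GSM7_BITS_PER_CHAR = 7    # the GSM default alphabet packs 7 bits per character

payload_bits = SMS_PAYLOAD_OCTETS * 8
max_gsm7_chars = payload_bits // GSM7_BITS_PER_CHAR

if __name__ == "__main__":
    print(payload_bits)    # 1120
    print(max_gsm7_chars)  # 160
```

In other words, 160 seven-bit characters and 140 bytes describe the same payload, which may be where the 160 figure in the API docs comes from.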

I agree that HTML escaped entities should not count as any more than 1 byte.

--
------------------------------------ personal: http://www.cameronkaiser.com/ --
Cameron Kaiser * Floodgap Systems * www.floodgap.com * cka...@floodgap.com
-- there's a dance or two in the old dame yet. -- mehitabel -------------------

leoboiko

Oct 8, 2007, 10:52:49 AM
to Twitter Development Talk
Piers Karsenbarg wrote:
> Well, when you try to send more than 140 characters both on the web and via
> text, it tells you that you've gone over. Not rocket science.

No, it doesn't. It tells me that when I try to send fewer than 140
characters but more than 140 *bytes*, contrary to what's plainly
stated in the API documentation. Which means that either the
documentation is wrong, or the implementation is defective. I just
want to know which one.

And also there's the issue with HTML entities. Is it a bug, or
undocumented behavior?

Alex Payne

Oct 8, 2007, 1:59:18 PM
to twitter-deve...@googlegroups.com
It's 140 bytes, as that's how Ruby counts string length unless you use
a bunch of experimental UTF-8 support that messes with our stack.

We have to encode HTML entities to prevent XSS attacks. Sorry about
the lost characters.



--
Alex Payne
http://twitter.com/al3x

Manuel González Noriega

Oct 8, 2007, 3:21:29 PM
to twitter-deve...@googlegroups.com

You're right, anyone twittering in any language other than English
(Spanish in my case) bumps into this behaviour fast.

--
Manuel, who
thinks you are an excellent person and thrives around
http://simplelogica.net and/or http://simplelogica.net/logicola/
Remember to eat plenty of fruit and vegetables.
