HTML escaping by Twitter is really a bug


Bjoern

Jul 17, 2009, 7:15:52 AM
to Twitter Development Talk
Hi,

probably it is too late to change it now, but someone has to say it: I
think it is the wrong approach to do HTML escaping in the API on the
Twitter side. For starters, not every consumer is a website. Secondly,
even if I am a website, now I have to rely on Twitter getting the
escaping right.

I'd much rather rely on my own HTML escaping algorithm, and get the
data in pure form without assumptions about its use. It should be a
natural reflex for web developers to escape everything, so putting
Twitter data on my website without escaping it leaves me with a very
uneasy feeling (my nervous system wants to escape the strings).

The workaround is to first unescape the HTML and then escape it again,
I suppose? I haven't thought it through 100% to see if that would be a
fail-safe approach.
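
A minimal sketch of the unescape-then-escape workaround, in Python. It assumes the API text contains only standard HTML entities; `normalize_for_html` is a hypothetical helper name, not part of any Twitter library.

```python
import html

def normalize_for_html(api_text: str) -> str:
    """Undo the API's escaping, then re-apply our own escaping for HTML output."""
    raw = html.unescape(api_text)        # "&lt;b&gt;" becomes "<b>"
    return html.escape(raw, quote=True)  # "<b>" becomes "&lt;b&gt;" again

# Text as the API is reported to return it (already escaped once).
from_api = "test html escaping by twitter &lt;b&gt;bold&lt;/b&gt;"
once = normalize_for_html(from_api)
twice = normalize_for_html(once)
```

In this example the result is the same after one pass and after two. Whether the approach is fail-safe in general depends on the API escaping every status consistently, which is an open question here.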

I just noticed that bit.ly, for example, is bitten by this: they seem to
escape the data from Twitter again, so that the text comes out ugly on
their website.

Björn

Bjoern

Jul 17, 2009, 7:38:55 AM
to Twitter Development Talk
Just had an idea: maybe Twitter could add an optional parameter to
switch off HTML escaping (&escapeHTML=false or something like that).
That way developers who are unaware of the issue would get the escaped
HTML, and the developers who are aware could get the proper data.

Jeff Dairiki

Jul 17, 2009, 10:23:23 AM
to twitter-deve...@googlegroups.com, Bjoern
On Fri, Jul 17, 2009 at 04:15:52AM -0700, Bjoern wrote:
>
> probably it is too late to change it now, but someone has to say it: I
> think it is the wrong approach to do HTML escaping in the API on the
> Twitter side.

What data are you referring to that is being HTML-escaped?

From what I can tell, the text of status messages, at least, is not escaped
by the API. For example, look at:

http://twitter.com/statuses/show/2688630329.json

or

http://twitter.com/statuses/show/2688630329.xml

In the JSON format, non-ASCII characters are properly escaped Unicode
in the JavaScript strings; in the XML format, non-ASCII characters are
encoded as XML numeric character entities. Either way, once you've
(properly) decoded the message, you should have plain old Unicode.
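
A small Python sketch of that point: a JSON `\u` escape and an XML numeric character reference decode to the same Unicode string. Both payloads are hypothetical, cut-down stand-ins for the real API responses.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical payloads shaped like the two formats described above.
json_payload = '{"text": "Bj\\u00f6rn says hi"}'
xml_payload = "<status><text>Bj&#246;rn says hi</text></status>"

# Each parser removes its own transport-level encoding.
text_from_json = json.loads(json_payload)["text"]
text_from_xml = ET.fromstring(xml_payload).findtext("text")
```

After parsing, `text_from_json` and `text_from_xml` hold the same string.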

If one (incorrectly) posts (already encoded) HTML entities in a status
update, the twitter.com web page is lenient about not double-encoding
them. In other words, if you post a status update of "A &amp; B", the
twitter.com web interface will display this as "A & B", even though the
API (correctly) will report the status text to be "A &amp; B".

E.g. compare status 2688630329 (links above) to:

http://twitter.com/statuses/show/2688620445.json
http://twitter.com/statuses/show/2688620445.xml


... Or were you talking about something else altogether?

Jeff

Bjoern

Jul 17, 2009, 10:53:27 AM
to Twitter Development Talk
(somehow got the response above as email, too - sorry for replying
twice...)

Hi,

look for example at this: http://twitter.com/statuses/show/2689100482.json

My status update was "test html escaping by twitter <b>bold</b>" but
Twitter sends me "test html escaping by twitter &lt;b&gt;bold&lt;\/b&gt;"

So it has transformed the "<" and ">" into the HTML entities &lt; and
&gt;, which is a different thing from URL escaping.
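
To make the two layers visible, here is a hedged Python sketch that decodes the text quoted in this post. The surrounding JSON object is a hypothetical, cut-down payload; only the string value comes from the example.

```python
import html
import json

# "\/" is a legal JSON escape for "/", so it belongs to the JSON layer,
# while &lt; and &gt; belong to the HTML entity layer.
payload = r'{"text": "test html escaping by twitter &lt;b&gt;bold&lt;\/b&gt;"}'

as_delivered = json.loads(payload)["text"]  # JSON layer removed, entities remain
as_typed = html.unescape(as_delivered)      # HTML entity layer removed
```

`as_delivered` still contains the entities, and `as_typed` is the status as it was originally entered.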

Hope that clarifies it?

Best wishes,

Björn

Bjoern

Jul 17, 2009, 11:03:30 AM
to Twitter Development Talk
I have now also created a ticket for this:
http://code.google.com/p/twitter-api/issues/detail?id=845

My apologies for writing both at the issue tracker and in the forum. I
did not plan to create an issue at first, because I thought it
unlikely that it would be fixed. When I thought about adding a
parameter to switch escaping on and off I changed my mind.

Björn

Matt Sanford

Jul 17, 2009, 11:07:20 AM
to twitter-deve...@googlegroups.com
Hi Bjoern,

Short Answer: It's working as designed for security reasons. We
don't like it any more than you do.

Long Answer: This has come up on the list quite a bit in the
past. As with a great many things, spammers, scammers and unkind people
are the reason we can't have nice things. When we discussed allowing
non-escaped data, the main argument against it was that the majority of
tweets are displayed via HTML and that failing to escape them correctly
poses a security risk to everyone. We erred on the side of security
and caution, returning the data in a way suitable for display on a web
page rather than trusting each and every developer to handle it
correctly. That would make each developer a single point of failure
for security … and that's a whole lot of possible failure. As it
stands now a web developer has to go out of their way to enable XSS
attacks in tweets. The feeling was that security should be the
default, and disabling should be an exercise left to the reader. We're
well aware that this is not ideal, and that it's a bit of a pain for
non-web applications. We wish we didn't have to do this sort of thing,
but sometimes you have to find a balance between standards, data
purity, and protection.

Thanks;
– Matt Sanford / @mzsanford
Twitter Dev

Jeff Dairiki

Jul 17, 2009, 11:27:05 AM
to twitter-deve...@googlegroups.com
On Fri, Jul 17, 2009 at 07:53:27AM -0700, Bjoern wrote:
>
> look for example at this: http://twitter.com/statuses/show/2689100482.json
>
> My status update was "test html escaping by twitter <b>bold</b>" but
> Twitter sends me "test html escaping by twitter &lt;b&gt;bold&lt;\/
> b&gt;"
>
> So it has transformed the "<" and ">" into HTML entities &lt; and &gt;
> [...]
> Hope that clarifies it?

Yes it does. It seems the API encodes &lt;, &gt;, &amp;, and &quot;.
(I should have realized that was what you meant in the first place ---
haven't had enough coffee yet this morning.)

And I see your point.

Though I can see the reason for the encoding. Imagine the havoc which
could ensue if some unknowing app developer forgets to encode texts,
allowing nefarious parties to post raw HTML to their site via twitter.

As you stated at the top of the thread --- it's easy enough to decode
the entities yourself, if you want the raw text.
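
For a non-web client, a narrow decoder in Python might look like the sketch below. It assumes the API produces only the four entities listed earlier in this message, which is an observation from this thread rather than documented behaviour; `decode_api_entities` is a hypothetical helper.

```python
# The four entities the API appears to produce, per the observation above.
# "&amp;" is deliberately last so that "&amp;lt;" decodes to "&lt;", not "<".
ENTITIES = (("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&amp;", "&"))

def decode_api_entities(text: str) -> str:
    """Reverse only the four replacements listed in ENTITIES."""
    for entity, char in ENTITIES:
        text = text.replace(entity, char)
    return text

raw_text = decode_api_entities("test html escaping by twitter &lt;b&gt;bold&lt;/b&gt;")
```

Decoding `&amp;` last matters: a user who literally typed `&lt;` would arrive as `&amp;lt;` and should come back as `&lt;`, not as `<`.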

Sorry for the interruption... carry on!

Jeff


Joe Bowman

Jul 17, 2009, 11:36:49 AM
to Twitter Development Talk
From a security standpoint, I'd hope the information is stored
pre-escaped, and that's why the API returns it that way. I'd like to
offer a +1: I like the idea that the data I get from the API is
escaped for me.

Bjoern

Jul 17, 2009, 6:30:27 PM
to Twitter Development Talk
On Jul 17, 5:07 pm, Matt Sanford <m...@twitter.com> wrote:

>      Short Answer: It's working as designed for security reasons. We  
> don't like it any more than you do.

Thank you for your answer. There are pros and cons for both
approaches, and you had to make a decision.

Björn