What data are you referring to that is being HTML-escaped?
From what I can tell, the text of status messages, at least, is not escaped
by the API. For example, look at:
http://twitter.com/statuses/show/2688630329.json
or
http://twitter.com/statuses/show/2688630329.xml
In the JSON format, non-ASCII characters are properly escaped as Unicode
escapes in the JavaScript strings; in the XML format, non-ASCII characters
are encoded as XML numeric character references. Either way, once you've
(properly) decoded the message, you should have plain old Unicode.
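A minimal sketch of that decoding step in Python, using only the standard library. The two payloads are made-up stand-ins for the JSON and XML responses (the real responses carry many more fields); the point is only that both formats decode to the same Unicode string.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down payloads; not copied from the real API responses.
json_payload = '{"text": "caf\\u00e9 \\u2603"}'                    # \uXXXX escapes
xml_payload = "<status><text>caf&#233; &#9731;</text></status>"    # numeric references

# A JSON parser resolves the \uXXXX escapes for you.
text_from_json = json.loads(json_payload)["text"]

# An XML parser resolves the numeric character references for you.
text_from_xml = ET.fromstring(xml_payload).findtext("text")

# Both routes end at the same plain Unicode string.
same_text = text_from_json == text_from_xml
```

Letting the parser do the decoding, rather than handling escapes by hand, is what "properly" means here.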
If one (incorrectly) posts (already encoded) HTML entities in a status
update, the twitter.com web page is lenient and does not double-encode
them. In other words, if you post a status update of "A &amp; B", the
twitter.com web interface will display this as "A & B", even though the
API (correctly) will report the status text to be "A &amp; B".
E.g. compare status 2688630329 (links above) to:
http://twitter.com/statuses/show/2688620445.json
http://twitter.com/statuses/show/2688620445.xml
... Or were you talking about something else altogether?
Jeff
Short Answer: It's working as designed for security reasons. We
don't like it any more than you do.
Long Answer: This has come up on the list quite a bit in the
past. As with a great many things, spammers, scammers and unkind people
are the reason we can't have nice things. When we discussed allowing
non-escaped data, the main argument against it was that the majority of
tweets are displayed via HTML and that failing to escape them correctly
poses a security risk to everyone. We erred on the side of security
and caution, returning the data in a way suitable for display on a web
page rather than trusting each and every developer to handle it
correctly. That would make each developer a single point of failure
for security … and that's a whole lot of possible failure. As it
stands now, a web developer has to go out of their way to enable XSS
attacks in tweets. The feeling was that security should be the
default, and disabling it should be an exercise left to the reader. We're
well aware that this is not ideal, and that it's a bit of a pain for
non-web applications. We wish we didn't have to do this sort of thing,
but sometimes you have to find a balance between standards, data
purity, and protection.
Thanks;
– Matt Sanford / @mzsanford
Twitter Dev
Yes, it does. It seems the API encodes <, >, &, and ".
(I should have realized that was what you meant in the first place ---
haven't had enough coffee yet this morning.)
And I see your point.
Though I can see the reason for the encoding. Imagine the havoc which
could ensue if some unknowing app developer forgets to encode texts,
allowing nefarious parties to post raw HTML to their site via twitter.
As you stated at the top of the thread --- it's easy enough to decode
the entities yourself, if you want the raw text.
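For a non-web client, decoding could look roughly like the Python sketch below. The helper name raw_status_text is made up for illustration, and the sample string is invented rather than taken from a real status; html.unescape is the standard-library call doing the work.

```python
import html

def raw_status_text(api_text: str) -> str:
    """Undo one level of HTML entity encoding (hypothetical helper name)."""
    return html.unescape(api_text)

# Invented sample using the four characters the API appears to encode.
escaped = "A &amp; B &lt;b&gt;bold&lt;/b&gt; &quot;quoted&quot;"
print(raw_status_text(escaped))  # A & B <b>bold</b> "quoted"
```

Decode exactly once. The result is raw text again, so it has to be re-escaped before it goes anywhere near an HTML page.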
Sorry for the interruption... carry on!
Jeff