illegal unicode character \uffff


braver

Nov 21, 2009, 5:10:04 PM
to Twitter Development Talk
I've tried loading the gardenhose via Perl's JSON, and it fails on
quite a few Asian tweets with \uffff in them, e.g. the tweet with id
5277460813:

{"text":"RT @RealLamarOdom \uffffIf you haven't heard it, go to
www.richsoilclothing.com and look under \"updates\". Tell me what you
think. It's hot!",...}

Is this an artifact of downloading, or does Twitter serve illegal UTF-8?
Here's an example of what Perl says about it, for another tweet:

*** json ENCODING error: malformed or illegal unicode character in
string [�Artest l], cannot convert to JSON at /home/alexyk/twitter/
loader/jwilter.pl line 30, <> line 44817003.

{"in_reply_to_screen_name":null,"text":"RT @TheLakersNation
\uffffArtest looked great. Lamar dominated the boards. Kobe is Kobe.
And most importantly, the Lakers take the WIN!","source":"<a href=
\"http://mobileways.de/gravity\" rel=\"nofollow\">Gravity</
a>","in_reply_to_user_id":null,"in_reply_to_status_id":null,"truncated":false,"geo":null,"created_at":"Mon
Nov 02 05:55:49 +0000 2009","user":
{"profile_background_tile":false,"profile_sidebar_border_color":"BDDCAD","following":null,"statuses_count":
243,"followers_count":33,"profile_image_url":"http://a3.twimg.com/
profile_images/406146987/Real_Force_normal.jpg","friends_count":
93,"description":"My Love:Kobe Bryant,Los Angeles
Lakers,NBA,Twitter,Music,Movie.I Love This Game.Determination:Let's
again!","location":"CN","geo_enabled":false,"profile_background_color":"9AE4E8","screen_name":"Real_Force","favourites_count":
4,"verified":false,"notifications":null,"profile_text_color":"333333","time_zone":"Beijing","protected":false,"url":"http://
hi.baidu.com/real_force/","created_at":"Wed Sep 09 12:41:22 +0000
2009","profile_link_color":"0084B4","name":"Zhang
Yuhao","profile_background_image_url":"http://a1.twimg.com/
profile_background_images/36003404/
photo_manipulation_photo_art_the_mansion.jpg","id":
72842359,"utc_offset":
28800,"profile_sidebar_fill_color":"DDFFCC"},"favorited":false,"id":
5357163705}
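
One way to separate a download artifact from the data itself, as a minimal Python sketch rather than anything taken from the loader above: if the raw line contains the six ASCII characters `\uffff`, the bytes on the wire are valid UTF-8 at that point, and U+FFFF (a Unicode noncharacter) only appears once a JSON parser decodes the escape. The `raw_line` value is a shortened stand-in for the tweet quoted above, and `is_noncharacter` is a hypothetical helper name.

```python
import json


def is_noncharacter(ch: str) -> bool:
    """True for the 66 code points Unicode reserves as noncharacters."""
    cp = ord(ch)
    # U+FDD0..U+FDEF, plus the last two code points of every plane
    # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ..., U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


# Shortened stand-in for the tweet above; only the text field is kept.
raw_line = b'{"text":"RT @TheLakersNation \\uffffArtest looked great."}'

# The escape is plain ASCII on the wire, so a strict UTF-8 decode succeeds.
raw_line.decode("utf-8", errors="strict")
has_escape = b"\\uffff" in raw_line

# Parsing the JSON turns the escape into the single code point U+FFFF.
tweet = json.loads(raw_line)
bad = [
    (index, f"U+{ord(ch):04X}")
    for index, ch in enumerate(tweet["text"])
    if is_noncharacter(ch)
]
print(has_escape, bad)
```

If `has_escape` is true for the lines that fail, the download step is unlikely to be the culprit, because an ASCII escape survives any byte-preserving transfer unchanged.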

PostgreSQL raises a similar complaint on a text field in a UTF8
database. Please clarify what you do to Unicode here!
Cheers,
Alexy

braver

Dec 1, 2009, 10:22:22 PM
to Twitter Development Talk
Gardenhose apparently returns illegal Unicode, as confirmed by
PostgreSQL and Perl's Encode, which is very trusted, high-mileage code.
We can certainly trap illegal Unicode errors, but we need to know whether
you're aware of the issue, the rationale, and the plan of action, if any.
-- Alexy
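
Trapping the errors on the consumer side could look roughly like the Python sketch below. It is an illustration under assumptions, not the Perl loader from the first post: the names `needs_replacing`, `scrub`, and `load_statuses` are hypothetical, and replacing offending code points with U+FFFD is just one possible policy (skipping the whole status is another).

```python
import json
import sys
from typing import Any, Iterable, Iterator

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def needs_replacing(ch: str) -> bool:
    """True for code points a strict UTF-8 consumer is likely to refuse."""
    cp = ord(ch)
    return (
        0xD800 <= cp <= 0xDFFF        # unpaired surrogate left by a \uD8xx escape
        or 0xFDD0 <= cp <= 0xFDEF     # noncharacter block
        or (cp & 0xFFFE) == 0xFFFE    # U+nFFFE / U+nFFFF in any plane
    )


def scrub(value: Any) -> Any:
    """Recursively replace offending code points in every string of a JSON value."""
    if isinstance(value, str):
        return "".join(REPLACEMENT if needs_replacing(c) else c for c in value)
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {scrub(key): scrub(item) for key, item in value.items()}
    return value


def load_statuses(lines: Iterable[bytes]) -> Iterator[Any]:
    """Yield one scrubbed status per line; report and skip lines that do not parse."""
    for lineno, line in enumerate(lines, 1):
        try:
            yield scrub(json.loads(line))
        except ValueError as exc:  # covers JSON syntax errors and malformed UTF-8
            print(f"line {lineno}: skipped ({exc})", file=sys.stderr)
```

Scrubbing after parsing, rather than editing the raw bytes, avoids mistaking an escaped backslash followed by `uffff` for the escape itself.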


John Kalucki

Dec 1, 2009, 10:35:10 PM
to Twitter Development Talk
This isn't the Streaming API's doing. That encoding is almost
certainly what was presented to Twitter, probably exactly as encoded
by the client. In this case, I'd complain to:
http://mobileways.de/products/gravity/gravity/

If you request the Tweet via the REST API, you'll see the same data
and the same encoding error.

-John Kalucki
http://twitter.com/jkalucki
Services, Twitter Inc.



braver

Dec 1, 2009, 10:41:09 PM
to Twitter Development Talk
John -- thanks for the clarification! Certainly it's the data in
Twitter's database as a whole, not just the Streaming API. One
question is whether you should accept illegal Unicode at all. Accepting
it is probably the safer choice, to avoid scaring the clients, but maybe
you'd want to apply some filter before storing it in the database? I.e.,
is it reasonable to have a policy of accepting or storing only legal
Unicode? I know some folks use Twitter for machine/sensor data, but
perhaps that's not intended? I can envision Twitter allowing non-Unicode
data if marked as such, perhaps on a closed stream, for machines talking
to each other, but not for humans.
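
To make the "only legal Unicode" policy concrete, here is a hedged Python sketch of what such a gate might look like at ingestion time. Nothing here describes Twitter's actual pipeline: `check_status_text` is a hypothetical name, and treating noncharacters as grounds for rejection is the assumed policy under discussion, not an established rule.

```python
import re

# Noncharacters above the BMP: U+nFFFE and U+nFFFF for planes 1 through 16.
_PLANE_ENDS = "".join(
    chr((plane << 16) | low) for plane in range(1, 17) for low in (0xFFFE, 0xFFFF)
)
# Surrogates need no entry here: a strict UTF-8 decode already refuses them.
_NONCHARACTERS = re.compile(r"[\ufdd0-\ufdef\ufffe\uffff" + _PLANE_ENDS + "]")


def check_status_text(body: bytes):
    """Return (text, None) if the body passes the policy, else (None, reason)."""
    try:
        text = body.decode("utf-8")  # strict mode rejects malformed sequences
    except UnicodeDecodeError as exc:
        return None, f"malformed UTF-8 at byte {exc.start}"
    match = _NONCHARACTERS.search(text)
    if match:
        code_point = f"U+{ord(match.group()):04X}"
        return None, f"disallowed code point {code_point} at index {match.start()}"
    return text, None
```

Returning a reason instead of silently dropping the text would let a client such as the one named in the `source` field learn what it sent wrong.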

Cheers,
Alexy

John Kalucki

Dec 1, 2009, 10:49:57 PM
to Twitter Development Talk
Perhaps someone from Platform could weigh in on this?

-John

braver

Dec 3, 2009, 11:21:00 PM
to Twitter Development Talk
On Dec 1, 10:49 pm, John Kalucki <jkalu...@gmail.com> wrote:
> Perhaps someone from Platform could weigh in on this?

In [vulgar] Russian, I'd say it seems Platform retracted its tongue
into a [bodily cavity]. :) Platform, hey! :)

Cheers,
Alexy

Mark McBride

Dec 3, 2009, 11:24:18 PM
to twitter-deve...@googlegroups.com
We are taking a look and hope to have an update soon.
--
---Mark

http://twitter.com/mccv

braver

Dec 4, 2009, 8:29:44 PM
to Twitter Development Talk
Mark, great to see you here! Now I trust the platform is in the right
hands. :)

Cheers,
Alexy