I've tried loading the gardenhose stream with Perl's JSON module, and it
fails on quite a few Asian tweets with \uffff in them, e.g. the tweet with
id 5277460813:
{"text":"RT @RealLamarOdom \uffffIf you haven't heard it, go to
www.richsoilclothing.com and look under \"updates\". Tell me what you
think. It's hot!",...}
Is this an artifact of downloading, or is Twitter serving illegal UTF-8?
Here's an example of what Perl says about it, for another tweet:
*** json ENCODING error: malformed or illegal unicode character in string [�Artest l], cannot convert to JSON at /home/alexyk/twitter/loader/jwilter.pl line 30, <> line 44817003.
{"in_reply_to_screen_name":null,
"text":"RT @TheLakersNation \uffffArtest looked great. Lamar dominated the boards. Kobe is Kobe. And most importantly, the Lakers take the WIN!",
"source":"<a href=\"http://mobileways.de/gravity\" rel=\"nofollow\">Gravity</a>",
"in_reply_to_user_id":null,"in_reply_to_status_id":null,"truncated":false,"geo":null,
"created_at":"Mon Nov 02 05:55:49 +0000 2009",
"user":{"profile_background_tile":false,"profile_sidebar_border_color":"BDDCAD","following":null,
"statuses_count":243,"followers_count":33,
"profile_image_url":"http://a3.twimg.com/profile_images/406146987/Real_Force_normal.jpg",
"friends_count":93,
"description":"My Love:Kobe Bryant,Los Angeles Lakers,NBA,Twitter,Music,Movie.I Love This Game.Determination:Let's again!",
"location":"CN","geo_enabled":false,"profile_background_color":"9AE4E8","screen_name":"Real_Force",
"favourites_count":4,"verified":false,"notifications":null,"profile_text_color":"333333",
"time_zone":"Beijing","protected":false,"url":"http://hi.baidu.com/real_force/",
"created_at":"Wed Sep 09 12:41:22 +0000 2009","profile_link_color":"0084B4","name":"Zhang Yuhao",
"profile_background_image_url":"http://a1.twimg.com/profile_background_images/36003404/photo_manipulation_photo_art_the_mansion.jpg",
"id":72842359,"utc_offset":28800,"profile_sidebar_fill_color":"DDFFCC"},
"favorited":false,"id":5357163705}
PostgreSQL raises a similar complaint when this goes into a UTF-8 text
field. Please clarify how you handle Unicode here!
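
In the meantime, one workaround I'm considering is to scrub Unicode
noncharacters such as U+FFFF from the strings after decoding. Below is a
minimal sketch in Python (not my actual Perl loader). The names
is_noncharacter and scrub are hypothetical, the sample line is an
abbreviated version of the tweet above, and it assumes that replacing the
offending code point with U+FFFD is acceptable:

```python
import json


def is_noncharacter(ch):
    """True for Unicode noncharacters: U+FDD0..U+FDEF and any U+xxFFFE / U+xxFFFF."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def scrub(value, replacement="\ufffd"):
    """Recursively replace noncharacters in every string value of decoded JSON."""
    if isinstance(value, str):
        return "".join(replacement if is_noncharacter(c) else c for c in value)
    if isinstance(value, list):
        return [scrub(v, replacement) for v in value]
    if isinstance(value, dict):
        return {k: scrub(v, replacement) for k, v in value.items()}
    return value


# Abbreviated sample: the raw string keeps \uffff as a JSON escape sequence.
line = r'{"text":"RT @TheLakersNation \uffffArtest looked great.","id":5357163705}'
tweet = scrub(json.loads(line))
print(tweet["text"])
```

This only hides the symptom on my side; it does not tell me whether the
\uffff comes from the stream itself or from my download.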
Cheers,
Alexy