Twitter Scrape (rough draft)


Philip (flip) Kromer

Dec 22, 2008, 11:40:52 AM
to Get.TheInfo Group, Jennifer Golbeck, Hayes Davis, Tom White
Hey y'all,

I've gathered a massive scrape of the Twitter friend graph: about 2.7M users (and slowing, meaning I'm starting to find the edge), 10M tweets, 58M edges, with pretty-near complete edge data for users with more than a dozen followers.

Big huge thanks to twitter.com who have given permission to share this freely.  Please go build tools with this data that make both twitter.com and yourself rich and famous. This will convince more corporations to free their data.

I'm just pushing what I have up there and have _not_ done all the double-checking I'd like. I'm getting out of town for a week, though, and I thought people might like to play with it even rough as it is. I'll do a better release after new year's.

The scrape has full metadata on users and relationships, and I've calculated pagerank for the users. I'll post data on the 2-cliques and local density when I get back.

Happy festivus,
flip

PS If anyone can point me towards map-reduce algorithms to efficiently 
* calculate all-pairs distances (or ballpark it)
* assign clusters
at this scale please advise.

=========================================================
NOTES
=========================================================

I've posted it at
  username: 'theinfo.org'
... the password is the ramanujan taxicab number followed by the word 'kennedy', all one word.  If that doesn't work or doesn't make sense email me.

Approx. cardinality:
   8 068 820    twitter_user_partial.tsv # partial users as found in other users' tweets
   2 675 458    ...         ... giving info on 2675458 unique users
   2 173 417    twitter_user.tsv         # full user records
   2 168 569    twitter_user_profile.tsv 
   2 168 739    twitter_user_style.tsv   
     219 388    hashtag.tsv              # hashtags collected from tweets
  58 010 471    a_follows_b.tsv          # following relationships
  10 168 919    tweet.tsv                # unique tweets
   2 071 290    tweet_url.tsv            # urls collected from tweets
   2 997 735    a_atsigns_b.tsv          # all @atsigns collected from tweets (anywhere in tweet, but screen_name only and not threaded)
   2 494 807    a_replied_b.tsv          # @replies collected from tweets (only those that appear first, but threaded)
  90 542 155    total     

(the user_partial thing: when you ask for a user's following / friends list, or in the public timeline tweets, you get a partial listing of each user.  I've kept, for these partial users, each unique state observed.  So if <mrflip> was seen on the 10th, the 15th, and the 16th and had (everything else the same) 80, 80 and 82 followers resp. you'll get the user_partial records of the 10th and the 16th.)
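
The "each unique state observed" rule can be illustrated with a small sketch. This is not the scraper's actual code: `unique_states` is a hypothetical helper, and it assumes observations arrive sorted by scrape date and that a state is a hashable tuple of the user's fields.

```python
def unique_states(observations):
    """Keep the first observation of each distinct state.

    `observations` is an iterable of (scraped_at, state) pairs, assumed to be
    sorted by scraped_at; `state` is a hashable tuple of the user's fields.
    """
    seen = set()
    kept = []
    for scraped_at, state in observations:
        if state not in seen:
            seen.add(state)
            kept.append((scraped_at, state))
    return kept


# The <mrflip> example from the paragraph above: 80, 80 and 82 followers.
obs = [
    ("10th", ("mrflip", 80)),
    ("15th", ("mrflip", 80)),
    ("16th", ("mrflip", 82)),
]
print([day for day, _ in unique_states(obs)])  # ['10th', '16th']
```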
  
===========================================================

Layout of each file (all tab-delimited):
   
# class_name       [key_field] [scraped_at] ... attributes ...
TwitterUserPartial [:id],  :id,  :screen_name, :followers_count, :protected, :name, :url, :location, :description, :profile_image_url )
TwitterUser        [:id],  :id,  :screen_name, :created_at, :statuses_count, :followers_count, :friends_count, :favourites_count, :protected )
TwitterUserProfile [:id],  :id,  :name, :url, :location, :description, :time_zone, :utc_offset )
TwitterUserStyle   [:id],  :id,  :profile_background_color, :profile_text_color, :profile_link_color, :profile_sidebar_border_color, :profile_sidebar_fill_color, :profile_background_image_url, :profile_image_url, :profile_background_tile )
Tweet              [:id],  :id,  :created_at, :twitter_user_id, :text, :favorited, :truncated, :tweet_len, :in_reply_to_user_id, :in_reply_to_status_id, :fromsource, :fromsource_url )
AFollowsB          [:user_a_id, :user_b_id],               :user_a_id, :user_b_id  )
ARepliedB          [:user_a_id, :user_b_id,   :status_id], :user_a_id, :user_b_id,   :status_id, :in_reply_to_status_id )
AAtsignsB          [:user_a_id, :user_b_name, :status_id], :user_a_id, :user_b_name, :status_id ) # note we have no user_b_id for @foo
Hashtag            [:user_a_id, :hashtag],                 :user_a_id, :hashtag,     :status_id )
TweetUrl           [:user_a_id, :tweet_url],               :user_a_id, :tweet_url,   :status_id )
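
A rough sketch of reading one of these files in Python. The field names are copied from the listing above, but the exact on-disk column order is an assumption: this sketch guesses that the named attributes are the trailing columns of each line, so any leading class_name / key / scraped_at columns are ignored. `read_tsv` is a hypothetical helper and the sample line is made up.

```python
import csv
import io

# Attribute names as given in the layout listing for AFollowsB.
A_FOLLOWS_B_FIELDS = ["user_a_id", "user_b_id"]


def read_tsv(handle, fields):
    """Yield one dict per line, mapping the last len(fields) columns to names.

    Values are kept as strings so zero-padded IDs are not reinterpreted.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < len(fields):
            continue  # skip short or ill-formed lines
        yield dict(zip(fields, row[-len(fields):]))


# Made-up sample line with one leading column before the attributes.
sample = io.StringIO("a_follows_b\t000000000072\t000000000415\n")
for record in read_tsv(sample, A_FOLLOWS_B_FIELDS):
    print(record["user_a_id"], "follows", record["user_b_id"])
```

Check a few real lines against the listing before trusting the column mapping.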

Pagerank files are
  user_id pagerank ids_that_user_follows
(and 'dummy' if we haven't gotten their follower list, or if that list is empty.)
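
A minimal sketch of parsing one pagerank line. The three fields come from the description above; that they are tab-separated, and that the followed ids are comma-separated within the third field, are both assumptions (the `id_sep` parameter is there so the guess is easy to change). `parse_pagerank_line` is a hypothetical helper.

```python
def parse_pagerank_line(line, id_sep=","):
    """Split a line into (user_id, pagerank, ids_that_user_follows).

    A third field of 'dummy' (or an empty one) is returned as an empty list.
    """
    user_id, rank, follows = line.rstrip("\n").split("\t", 2)
    ids = [] if follows in ("dummy", "") else follows.split(id_sep)
    return user_id, float(rank), ids


# Made-up sample lines.
print(parse_pagerank_line("72\t1.5\t415,416\n"))
print(parse_pagerank_line("73\t0.15\tdummy\n"))
```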

===========================================================

Notes:

* If you use the .tsv form make sure your language doesn't interpret the
  zero-padded twitter_user.id of '000000000072' as octal (072 octal is decimal 58).
  
* I think that there may be inconsistent user data for all-numeric
  screen_names: see
  That is, I think that the data in this scrape may commingle information about
  the user having screen_name '415' with that of the user having id #415. Not much I can do about it, but I plan to scrub that data later.

* Watch out for some ill-formed screen_names: see
    
* For the parsed data: act as if I've double-checked none of this. If you
  have questions please ask, though.
  
* The scraped files (ripd-_xxxxxxxx.tar.bz2) *are* in their final form, and are
  exactly as they came off the server.
  
* pagerank is non-normalized -- divide by N and take the log.
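
Two of the notes above (zero-padded ids and pagerank normalization) can be sketched in Python. Both function names are hypothetical. The choice of natural log is an assumption, since the note does not name a base, and N is taken here to mean the number of users in the graph.

```python
import math


def parse_user_id(field):
    """Parse a zero-padded id explicitly as base 10, never as octal."""
    return int(field, 10)


def log_normalized_pagerank(rank, n_users):
    """Divide a raw pagerank score by N and take the natural log."""
    return math.log(rank / n_users)


print(parse_user_id("000000000072"))  # 72
print(log_normalized_pagerank(1.0, 2675458))
```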

The files are huge, and the ripd-_xxxx directories will make your filesystem cry; I recommend hadoop.

Philip (flip) Kromer

Dec 22, 2008, 12:07:45 PM
to Get.TheInfo Group, Jennifer Golbeck, Hayes Davis, Tom White, Dhruv Bansal, Mark McGranaghan
...sorry for the double post.  Two more notes:

* It will take a while for the files to all migrate over, so check later tonight. Each of the files is in the 300M-1.6GB range.

* Also, I should mention what the files are:
  ripd-_xxxxx* -- raw .json files as pulled from twitter.com
  20081220-sorted-uff/ -- parsed data from the scrape as explained below
  20081220-sorted-tweets/ -- parsed data from tweets off the public datamining firehose
  20081220-replied_pagerank, 20081220-follows_pagerank -- pagerank calculated for replied and follows relationships (10 iterations).
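
For readers new to the computation, here is a minimal in-memory sketch of a non-normalized PageRank run for a fixed 10 iterations. It is not necessarily how the posted files were produced: the damping factor of 0.85, the starting rank of 1.0 per user, and the even redistribution of rank from users with no outgoing edges are all assumptions, and a graph of this size would need a map-reduce version rather than a dict.

```python
def pagerank(follows, iterations=10, damping=0.85):
    """Non-normalized PageRank: ranks start at 1.0 and sum to the node count.

    `follows` maps each user id to the list of ids that user follows.
    """
    nodes = set(follows)
    for targets in follows.values():
        nodes.update(targets)
    rank = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        new = {n: 1.0 - damping for n in nodes}
        for n in nodes:
            targets = follows.get(n, [])
            if targets:
                # Split this user's damped rank evenly among the users followed.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # No outgoing edges: spread the damped rank over all nodes.
                share = damping * rank[n] / len(nodes)
                for t in nodes:
                    new[t] += share
        rank = new
    return rank


# Tiny made-up graph: a and b follow each other, c follows a.
ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
print(ranks)
```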

flip
--
http://www.infochimps.org
Connected Open Free Data

Paul Böhm

Dec 22, 2008, 5:17:25 PM
to get.theinfo
can you put this on http://aws.amazon.com/publicdatasets/ ?


Philip (flip) Kromer

Dec 22, 2008, 7:56:22 PM
to Get. TheInfo
Sure thing. I'm planning to post it to archive.org too: I only put it
on that rickety old server because the data's still so half-baked.
Neither will probably happen til next week: my only Internet right now
is iPhone.

---

To folks arriving from hacker news &c: hi! A proper release is coming.
Either follow @infochimps or check blog.infochimps.org, or look for a
post to the Twitter developers google list.

flip
----
http://infochimps.org
Connected Open Free Data

LightStpsHere59

Dec 23, 2008, 12:38:28 AM
to get.theinfo
Great work! Now to make real sense of it.

DS


Erich Morisse

Dec 29, 2008, 6:24:32 PM
to get.theinfo
flip, wow.


jhofman

Jan 3, 2009, 9:43:51 AM
to get.theinfo
hi all,

any news of when this might be re-posted?

thanks.

-j

Philip (flip) Kromer

Jan 3, 2009, 1:50:52 PM
to Get. TheInfo
I'll keep you up to date. I'm confident Twitter gets it: there are so,
so many interesting apps that can only be built with local access to a
bulk slice (apps that will enhance the ecosystem and/or pay for
firehose access).

flip
----
http://infochimps.org
Connected Open Free Data

nitinborwankar

Jan 17, 2009, 6:56:23 PM
to get.theinfo
Hi Philip,

Any updates on this? Has it been pulled indefinitely?

Nitin


mozTom

Jan 24, 2009, 3:33:18 PM
to get.theinfo
Your confidence is misplaced, Philip.

On Jan 3, 1:50 pm, "Philip (flip) Kromer" <f...@infochimps.org> wrote:
> I'll keep you up to date. I'm confident Twitter gets it: there are so,  
> so many interesting apps that can only be built with local access to a  
> bulk slice (apps that will enhance the ecosystem and/or pay for  
> firehose access).
>
> flip
> ----
> http://infochimps.org
> Connected Open Free Data

Philip (flip) Kromer

Jan 24, 2009, 10:52:35 PM
to get-t...@googlegroups.com, mozTom
I think, and I hope, that my confidence is very well-placed.

The fact is that this bulk data existed and exists quite apart from my scrape... Their API limits are set to just under half a million requests per day, not to mention whatever HTML scraping people get away with. Every piece of this data is freely available to the public from their API and from any of a dozen other independent APIs.

The question at hand is: do they want random people collecting and distributing sub-rosa copies of the data, outside of their view and with no formal agreement, at great cost in servers and monitoring? /We/ asked and got permission, having offered to coordinate any necessary restrictions, and immediately took it down when they reconsidered.  The next person might not do any of that.

Our offer is to host, share and index the data; to put it in view of the smartest researchers in the world (I've been contacted by a number of prominent researchers eager to get at this); to ensure that users understand the associated terms of service; to place it in an environment where others will link it with the rest of the world's free data; and to make it easily available to people building tools that need SOME now and will gladly pay for ALL later.

You pointed to the Waldman brouhaha to show something something about their user-facing Terms of Service. I actually side with how Twitter handled it; regardless, they've grown and thrived by courageously adopting very open policies.  Their dev team are engineers and understand the value of opening this data.  The hunger for this data is enormous, and the larger community is one that values openness and is not shy about applying pressure in that direction.

If there's further interest in debating the politics of what Twitter might/should/not do, I think it would be better to restrict it to the relevant blog post (http://blog.infochimps.org/2008/12/29/massive-scrape-of-twitters-friend-graph/).

flip
--