Collecting 40GB of government data, publishing it on the Internet Archive


Eric Mill

Nov 7, 2014, 12:58:32 PM
to dc-cfa-...@googlegroups.com
Hi CFDC (and sorry if you're also on the Sunlight Labs list),

I wrote for Sunlight today on a project that's collecting just about every US federal inspector general report:


One thing worth pointing out, and that I'd love to see more people take advantage of, is that the Internet Archive lets you upload anything.

So for example, we have our 40GB bulk dataset in a 34GB zip file, for anyone to download at no cost to us:

They have an S3-compatible API that they don't talk about (or document!) nearly enough. I spent a Saturday with their Python API wrapper and wrote a script to back up our entire collection into the Internet Archive. This gives each report its own landing page on the Archive and gets it entered into the Archive's search engine.
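For anyone curious what an upload through the S3-style API looks like, here is a rough sketch. It is not our actual script (that uses the Python wrapper); it only builds the raw HTTP PUT with the standard library. The endpoint and the `LOW` authorization, `x-amz-auto-make-bucket`, and `x-archive-meta-*` headers reflect my understanding of the Archive's S3-like API, so check them against the Archive's own docs. The function name, item identifier, filename, and metadata values are all made up for illustration.

```python
import urllib.request
from urllib.parse import quote

# Assumed endpoint for the Internet Archive's S3-like API.
IA_S3_ENDPOINT = "https://s3.us.archive.org"


def build_upload_request(identifier, filename, data, metadata,
                         access_key, secret_key):
    """Build (but do not send) a PUT request that uploads one file to an item.

    identifier: the item name, which becomes archive.org/details/<identifier>
    metadata:   dict of item metadata, e.g. {"title": ..., "mediatype": ...}
                (ASCII values only in this sketch)
    """
    url = f"{IA_S3_ENDPOINT}/{quote(identifier)}/{quote(filename)}"
    headers = {
        # Credentials are the S3-style key pair from your archive.org account.
        "Authorization": f"LOW {access_key}:{secret_key}",
        # Ask the Archive to create the item if it does not exist yet.
        "x-amz-auto-make-bucket": "1",
    }
    # Each metadata field travels as its own x-archive-meta-* header.
    for key, value in metadata.items():
        headers[f"x-archive-meta-{key}"] = str(value)
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


# Hypothetical usage: one report becomes one item with its own landing page.
# req = build_upload_request(
#     "example-ig-report-0001", "report.pdf", open("report.pdf", "rb").read(),
#     {"title": "Example inspector general report", "mediatype": "texts"},
#     access_key="YOUR_KEY", secret_key="YOUR_SECRET",
# )
# urllib.request.urlopen(req)  # actually performs the upload
```

Building the request separately from sending it makes it easy to inspect the URL and headers before anything touches the network.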

To get a formal "collection", you have to upload 50+ items and then write to them requesting it. We did this, and so far have ~100 items in it: https://archive.org/details/usinspectorsgeneral

We'll get the other 18,000 reports their own landing pages soon, and set up cronjobs that automatically add new reports and keep the bulk download up to date.

More public data projects should do this! The Internet Archive is an incredible resource. I'm very glad we have a permalink to our bulk data, and don't have to stress over the download costs. 

-- Eric
