Bulk Data to supplement API?


Shawn Johnson

May 27, 2015, 2:30:57 PM
to us-govern...@googlegroups.com
I'm looking for some resources on Bulk Data to supplement an API.

Our primary use case: New users of the existing API are often looking to 'download' a comprehensive data set for offline analysis.  This creates a lot of work for them, and a lot of resource usage for us.

A few things I'm contemplating:
- How often to update? New data is continuously added to the data set. Currently ~5 million 'documents', with ~3k/week (~12k/month) being added.
- Offer one large set, or break it down by some increment?
- I'm preferring JSON because of its self-describing nature. Also, we will have fields with potentially very long text, so I'm thinking CSV isn't really a good fit.
- There is a lot of binary content - PDF, Word, etc.  We have this text indexed in a full-text search engine, so we're considering including the contents as text - as our users are primarily looking to get at this text more so than the metadata.
- Should we instead export the metadata with API links to the binary downloads? (See the sketch after this list.)
- Should we also offer the binary files as bulk zipped downloads - say in a tree structure?
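
To make the JSON/metadata/links idea concrete, here's a rough sketch of what one exported record might look like (field names and URLs are hypothetical, not our actual schema):

    import json

    # Hypothetical record shape: metadata, extracted text, and links back to the
    # API/binary for anyone who needs the original file.
    record = {
        "document_id": "ABC-2015-0001-0042",  # made-up ID format
        "title": "Comment on proposed rule",
        "posted_date": "2015-05-27",
        "formats": ["pdf", "docx"],
        "extracted_text": "Full text pulled from our search index...",
        "api_url": "https://api.example.gov/documents/ABC-2015-0001-0042",
        "binary_urls": [
            "https://api.example.gov/documents/ABC-2015-0001-0042/content.pdf",
        ],
    }

    # Newline-delimited JSON handles very long text fields and streams well.
    with open("documents-2015-05.jsonl", "a", encoding="utf-8") as out:
        out.write(json.dumps(record) + "\n")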

Any feedback is greatly appreciated!

Greg Gershman

May 27, 2015, 3:01:07 PM
to Shawn Johnson, us-govern...@googlegroups.com
I'm personally in favor of bulk data first, API second. More often than not, bulk data is going to be more useful to developers than an API.

Answers to your other questions depend on your data and how frequently it updates. It's always good to zip things.

Greg

--
Greg Gershman
Principal, Ad Hoc LLC

Burns, Martin

May 27, 2015, 3:08:18 PM
to Greg Gershman, Shawn Johnson, us-govern...@googlegroups.com

FWIW, in our Green Button architecture we have data identified in packages called “resources”. The resources are linked via metadata. They are available individually and in collections through a REST API. In addition, they can be accumulated into “bulk” data sets that are also available by API and SFTP through the common interface.

 

In Green Button, the data provider is typically an electric utility, which publishes electric/gas meter data daily. Third-party service providers can have many thousands of customers with a single utility, so they fetch the daily data via the bulk interface.

 

Green Button data is described in an XSD and usually transferred as XML so that the data can be validated against the schema. A JSON export is planned for the future. We also provide an XSLT that can transform the XML to CSV for those who want what is essentially time-stamped measurement data in that form.
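
As a rough sketch of what consuming that looks like (the file names here are placeholders, not the actual Green Button artifacts), validation and the CSV transform with lxml might be:

    from lxml import etree

    # Placeholder file names; substitute the published Green Button XSD/XSLT
    # and your downloaded XML export.
    schema = etree.XMLSchema(etree.parse("greenbutton.xsd"))
    doc = etree.parse("usage_export.xml")

    if not schema.validate(doc):
        print(schema.error_log)  # lists any schema violations

    # Optional: flatten the XML to CSV with the provided XSLT.
    to_csv = etree.XSLT(etree.parse("greenbutton_to_csv.xslt"))
    print(str(to_csv(doc)))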


HTH,

Marty

 

Dr. Martin J. Burns,

National Institute of Standards and Technology

Smart Grid & Cyber Physical Systems Program Office

martin...@nist.gov

Tel: 301-975-6283

Cel: 202-379-8021

Shawn Johnson

May 28, 2015, 11:06:19 AM
to us-govern...@googlegroups.com, shawnjo...@gmail.com
Thanks, Greg. I agree that bulk data ideally comes first, but the API is already there at this point.

Eric Mill

Jun 8, 2015, 12:44:14 AM
to Shawn Johnson, us-government-apis
On Wed, May 27, 2015 at 2:30 PM, Shawn Johnson <shawnjo...@gmail.com> wrote:
I'm looking for some resources on Bulk Data to supplement an API.

Our primary use case: New users of the existing API are often looking to 'download' a comprehensive data set for offline analysis.  This creates a lot of work for them, and a lot of resource usage for us.

A few things I'm contemplating:
- How often to update? New data is continuously added to the data set. Currently ~5 million 'documents', with ~3k/week (~12k/month) being added.

How often are your users currently updating? And, could you make it simple enough to use the bulk data to get up to the last week, and then use the API to "catch up" to wherever in the weekly cycle the data happens to be?

I imagine the nightmare scenario for people staying "in sync" with your dataset would be downloading the entire thing via the API. Once most of the data is collected locally, using the API to keep up to date over a small window of time probably isn't too bad.
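
Something like this, for example (the endpoint and "modified_since" parameter below are made up, just to illustrate the catch-up step):

    import requests

    API = "https://api.example.gov/documents"  # hypothetical endpoint

    def catch_up(last_bulk_date, page_size=1000):
        """Yield only the documents added or changed since the last bulk snapshot."""
        page = 1
        while True:
            resp = requests.get(API, params={
                "modified_since": last_bulk_date,  # made-up parameter name
                "per_page": page_size,
                "page": page,
            })
            resp.raise_for_status()
            results = resp.json().get("results", [])
            if not results:
                break
            yield from results
            page += 1

    # e.g. the bulk file is current through 2015-05-22; the API fills in the rest.
    new_docs = list(catch_up("2015-05-22"))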
 
- Offer one large set, or break it down by some increment?

When I've published bulk data before, it's been by year/session or some other appropriate milestone, with the current milestone updating in place. That way the update process for the main bulk file isn't overwhelmingly huge each time, and you can name/order previous milestone sets by their milestone (e.g. year) so the URLs are predictable.
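
Predictable naming can be as simple as this (the URL pattern is purely illustrative):

    # Purely illustrative: one gzipped JSON-lines file per year, with the current
    # year's file regenerated in place on each update.
    BASE = "https://bulk.example.gov/documents"

    def bulk_url(year):
        return f"{BASE}/{year}/documents-{year}.jsonl.gz"

    urls = [bulk_url(y) for y in range(2010, 2016)]
    # ['https://bulk.example.gov/documents/2010/documents-2010.jsonl.gz', ...]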
 
- I'm preferring JSON because of its self-describing nature. Also, we will have fields with potentially very long text, so I'm thinking CSV isn't really a good fit.

Good call.
 
- There is a lot of binary content - PDF, Word, etc.  We have this text indexed in a full-text search engine, so we're considering including the contents as text - as our users are primarily looking to get at this text more so than the metadata.

Totally, as long as you also include the metadata you already have stored that isn't format-specific, like the filename.
 
- Should we instead export the metadata with API links to the binary downloads?
- Should we also offer the binary files as bulk zipped downloads - say in a tree structure?

I'd say both of the above make sense, as separate and separately useful things. The metadata alone will go a long way. If you didn't offer the binary files at all, you'd be asking your API to absorb a ton of hits as people who do need them go fetch them one by one.

As two separate caches, you can modify each independently -- if the metadata format changes or you have fixes, you can regenerate only those files. The binary files can be linked by unique ID and use their own structure.
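
For instance, the document ID in the metadata could be all you need to locate the corresponding binary archive (the layout below is illustrative only):

    # Illustrative only: metadata and binaries live in separate trees that can be
    # regenerated independently, joined by the document ID.
    def metadata_path(year, month):
        return f"metadata/{year}/{month:02d}.jsonl.gz"

    def binary_archive_path(doc_id):
        # Shard by ID prefix so no single directory or zip gets enormous.
        return f"binaries/{doc_id[:3]}/{doc_id}.zip"

    print(metadata_path(2015, 5))                     # metadata/2015/05.jsonl.gz
    print(binary_archive_path("ABC-2015-0001-0042"))  # binaries/ABC/ABC-2015-0001-0042.zip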
 

Any feedback is greatly appreciated!

I think this is a really fantastic initiative that could set a great precedent for other government APIs, in the US and around the world. If it's hard, that's because it's worth it!

-- Eric
 


Andrew Pendleton

Jun 8, 2015, 11:30:54 AM
to us-govern...@googlegroups.com
This is exciting! We (Sunlight Labs/Docket Wrench) are probably in something of a unique position as both a likely consumer of this bulk offering once it becomes available, and a current provider of one-off dumps of our copy of your data to journalists, academics, etc., so I can probably answer with a couple of different hats on (and might have different answers for each, for better or for worse).


- How often to update? New data is continuously added to the data set. Currently ~5 million 'documents', with ~3k/week (~12k/month) being added.

For users that also consume the API (like us), any frequency is fine if, as Eric suggested, it's possible to get caught up with the API. For non-API users, probably as frequently as possible (people ask us for data as of right now on a pretty regular basis), though I know full well that that's a really hard problem, having tried to solve it myself.

- Offer one large set, or break it down by some increment?

As a consumer, we'd probably want either everything at once or something time-delimited. That's virtually never what people asking *us* for data want, though; we most frequently get requests for a single docket in bulk, or a set of dockets; more rarely, people want a whole agency. But it's pretty much always in some sort of logical unit like that, rather than a time-delimited chunk. People want it to write a story, or support a dissertation, or whatever, so they're usually focused on one particular rulemaking, policy area, etc.
 
- I'm preferring JSON because of its self-describing nature. Also, we will have fields with potentially very long text, so I'm thinking CSV isn't really a good fit.

Yep, CSV doesn't fit well. The data is insufficiently flat. We usually supply YAML to requesters because I've found it's a little easier to read for non-programmers, but JSON is fine, and more universal.
 
- There is a lot of binary content - PDF, Word, etc.  We have this text indexed in a full-text search engine, so we're considering including the contents as text - as our users are primarily looking to get at this text more so than the metadata.
- Should we instead export the metadata with API links to the binary downloads?

The text would be fine as long as it's possible to get the binary files if we need them (especially if you didn't get any text from the document, because it's a bad scan or something). The dumps we provide to requesters usually have text if we have it, but we include download URLs for the original binary regardless so that people can retrieve the source files for themselves if necessary.
 
- Should we also offer the binary files as bulk zipped downloads - say in a tree structure?

The disadvantage of putting everything in a big zip file is that it's hard to do anything incrementally (get all the new stuff since yesterday, say). You end up having to re-download the whole thing. So compression is good, as is not having to make 5 million requests to get everything, but those benefits are what you'd have to weigh against the lack of incremental updates.
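
One possible middle ground (the URL scheme below is made up, just to sketch the idea): a full snapshot plus small dated incremental archives, so consumers only re-download what's new.

    import datetime

    BASE = "https://bulk.example.gov/documents"  # hypothetical

    def incremental_urls(last_sync, today=None):
        """URLs for the daily incremental zips a consumer still needs to fetch."""
        today = today or datetime.date.today()
        day = last_sync + datetime.timedelta(days=1)
        while day <= today:
            yield f"{BASE}/incremental/{day.isoformat()}.zip"
            day += datetime.timedelta(days=1)

    # A consumer last synced on May 25th only needs the three daily archives since then.
    needed = list(incremental_urls(datetime.date(2015, 5, 25), datetime.date(2015, 5, 28)))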

Anyhow, having tried to figure out how to do this for ourselves, I'm well aware that it's super-challenging, and don't blame you for going API-first (we did, too). Kudos for giving it a go.

Andrew Pendleton
Docket Wrench Project Lead
Sunlight Foundation