What is "government data"?


Jason May

Dec 18, 2007, 6:15:36 PM
to open-go...@googlegroups.com
Hi all-

I'm wondering if this group has clarified the meaning of the term
"government data" which I've seen used here a number of times.

There has been a lot of attention in the last couple of years to the
question of transparency in regards to political campaign finance,
issue positions, votes by elected officials and so on; excellent work
by Project Vote Smart, MAPLight and others.

But there is a lot of other digital material produced by government
agencies. I'm particularly interested in the huge quantities of
numeric data generated by the US Census, Bureau of Labor Statistics,
Energy Information Administration and other entities. Is this stuff
within the scope of the open government initiatives that are being
discussed on this list?

The recent comments here by Ethan Zuckerman and others regarding cost
recovery, licenses, ransom and so forth are very relevant to data
that is not in the form of text documents. There are different
technical challenges related to publication of numeric data from
those that exist for publishing documents.

Cheers,
-Jason May
Numbrary.com

Carl Malamud

Dec 18, 2007, 6:37:59 PM
to Open Government
Hi Jason -

>
> I'm wondering if this group has clarified the meaning of the term
> "government data" which I've seen used here a number of times.

That was left somewhat open. The principles only apply to public data,
and I think "government" is any entity in the public governance
business. So, these principles could just as easily be applied to,
e.g., the American National Standards Institute (ANSI) or ICANN.

>
> But there is a lot of other digital material produced by government
> agencies. I'm particularly interested in the huge quantities of
> numeric data generated by the US Census, Bureau of Labor Statistics,
> Energy Information Administration and other entities. Is this stuff
> within the scope of the open government initiatives that are being
> discussed on this list?

Absolutely! As you say, different challenges for different entities,
but all the information you cite are good examples of government
data that should be more broadly available.

Carl

Steven Mandzik

Dec 18, 2007, 11:25:26 PM
to open-go...@googlegroups.com
As a citizen and taxpayer, I would like to consider all data "government data," and I think it should all be made available. Even the most secretive data, that of the Intelligence Community, is still required to be released after 25 years. I would suggest thinking of the data in terms of a spectrum: starting with the data that is already available online, then the data that is available to us but not online, and ending with data that is somewhat restricted access (sensitive information, intel information).

One of the first things I would like to see happen is a focus on the data that is already available. A recent Senate hearing talked about how that data is restricted from search engine crawling, for practically no reason. This is definitely a piece of low-hanging fruit that could make thousands of pieces of information available (via search engines). After that, I would hope to push for more digitization of information. There is a ton of publicly available information sitting on shelves collecting dust.

Anyway, I hope everyone heard about this Senate hearing; it was really cool. If not, feel free to check out a review I wrote about it, which summarizes the hearing pretty well:

http://www.swordplay.tv/2007/12/12/senate-hearing-discusses-web-20-to-improve-our-democracy/

Steve

Ethan Zuckerman

Dec 20, 2007, 1:27:27 PM
to Open Government
As Carl mentioned, the decision was to keep the definition quite open.
Part of the reason is that if you use a standard like "any information
taxpayers have helped pay for", you end up incredibly broad. I raised
the issue of whether we wanted to include research funded by entities
like the NIH, NSF, etc., which would have us making common cause with
the Open Access publishing folks. It was suggested that this might be
overbroad and would make our efforts more politically difficult. (That
said, Open Access is about to get a major win in the omnibus
appropriations bill, which should require OA publishing for NIH-funded
research.)

-Ethan

Joseph Lorenzo Hall

Dec 24, 2007, 11:31:25 AM
to open-go...@googlegroups.com
On Dec 18, 2007 9:25 PM, Steven Mandzik <steven....@gmail.com> wrote:
>
> One of the first things that I would like to see happen is focusing on that
> data that is already available. A recent Senate Hearing talked about how
> that data is restricted from search engine crawling, for practically no
> reason. This is definitely a piece of low hanging fruit that could make
> thousands of pieces of information available (via search engines). After
> that I would hope to push for more digitization of information. There is
> tons of publicly available information sitting on shelves collecting dust.

It seems like a bill that targeted bogus robots.txt files for gov
sites might be a good thing and easy to pass into law, no? Can anyone
think of legitimate reasons for gov sites to have robots.txt (when
they should probably be using a more sophisticated form of access
control instead)? The only thing I can think of would be
government-hosted stuff for kids. best, Joe
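For what it's worth, the lock-out effect Joe is describing is easy to demonstrate with Python's standard-library robots.txt parser. The file below is a made-up example of a maximally restrictive policy, not any real agency's:

```python
from urllib import robotparser

# A hypothetical, maximally restrictive robots.txt of the kind
# being discussed (the rules here are invented for illustration):
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Every compliant crawler is now locked out of the entire site,
# even though the content may all be public record:
print(rp.can_fetch("*", "/some/public/document.pdf"))  # False
```

Two lines in the file are enough to make an entire agency's site invisible to any search engine that honors the convention.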

--
Joseph Lorenzo Hall
UC Berkeley School of Information
http://josephhall.org/

Greg Palmer

Dec 24, 2007, 11:40:38 AM
to open-go...@googlegroups.com
I think that legislation may already exist; personally, I read the E-Government Reauthorization Act (S. 2321) as saying that bogus and unduly restrictive robots.txt files are discouraged:

"Not later than 1 year after the date of enactment of the E-Government Reauthorization Act of 2007, the Director shall promulgate guidance and best practices to ensure that publicly available online Federal Government information and services are made more accessible to external search capabilities, including commercial and governmental search capabilities. The guidance and best practices shall include guidelines for each agency to test the accessibility of the websites of that agency to external search capabilities."

I know that Google and some others have taken to calling this the "Sitemaps" provision, but I'm not sure it goes quite that far. The language certainly isn't particularly strong, but it's a step in the right direction and gives a mandate to agencies to test the accessibility of their sites against external search products.
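To make the "Sitemaps" reading concrete, here is a rough sketch of what compliance could look like: emitting a minimal sitemap per the sitemaps.org protocol so external search engines can find every public page. The URL and function name are invented for illustration:

```python
from xml.sax.saxutils import escape

def make_sitemap(urls):
    """Build a minimal sitemap.xml, per the sitemaps.org protocol,
    listing each public page so external crawlers can discover it."""
    entries = "\n".join(
        "  <url><loc>%s</loc></url>" % escape(u) for u in urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>"
    )

print(make_sitemap(["http://www.example.gov/report.pdf"]))
```

Whether the statutory language actually requires anything this specific is, as Greg says, debatable; this is just what the Google-style reading would amount to.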

--Greg

Carl Malamud

Dec 24, 2007, 11:47:34 AM
to Open Government
On Dec 24, 8:40 am, "Greg Palmer" <jgpal...@gmail.com> wrote:
> I think that legislation may already exist; personally I read the
> E-Government Reauthorization Act (S. 2321) to say that bogus and unduly
> restrictive robots.txt files are discouraged:

The laws are pretty clear that government information should be widely
distributed unless there is a compelling reason not to do so. That was
originally codified in the OMB A-130 circular, but many/most
agencies have not taken that guidance to heart.

It is worth remembering that a robots.txt file is advisory and has
no force in law. Search engines are free to ignore the robots.txt
files, being careful of course not to have their crawl turn into a
denial of service.
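To illustrate Carl's caveat: a crawler that chooses to ignore robots.txt still has to throttle itself. A minimal sketch, where the fetch callable and the delay are placeholders rather than a real crawler:

```python
import time

def polite_crawl(urls, fetch, delay_seconds=2.0):
    """Visit each URL via the supplied fetch() callable, pausing
    between requests so that ignoring robots.txt never turns the
    crawl into a de facto denial of service against the host."""
    pages = []
    for url in urls:
        pages.append(fetch(url))
        time.sleep(delay_seconds)  # throttle between requests
    return pages
```

In practice you would plug in a real HTTP fetch (e.g. urllib.request.urlopen) and tune the delay to what the host can absorb.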

Joseph Lorenzo Hall

Dec 24, 2007, 11:51:38 AM
to open-go...@googlegroups.com
On Dec 24, 2007 9:47 AM, Carl Malamud <carl+...@resource.org> wrote:
>
> It is worth remembering that a robots.txt file is advisory and has
> no force in law. Search engines are free to ignore the robots.txt
> files, being careful of course not to have their crawl turn into a
> denial of service.

Carl I remember you saying something about an LoC letter/decision on
robots.txt ... can you point us to that? best, Joe

Steven.mandzik

Dec 24, 2007, 12:03:05 PM
to open-go...@googlegroups.com
I'm glad to see that the letter of the law is correct. However, it
will take years for the gov to comply.

The fact of the matter is that gov IT departments are very limited in
knowledge and capability (I work for the gov, helping to push these
initiatives).

I think that what is needed to make this happen quicker is a strong
voice from us, backed by many thousands of citizens...

Sent from my iPhone!!

Steven.mandzik

Dec 24, 2007, 12:06:12 PM
to open-go...@googlegroups.com
If one wanted to start or sign a virtual petition regarding this
robots.txt issue, the recent EPA ruling, and other initiatives,
where would one go to do so?

Does anyone know of a site, blog, or wiki for this purpose? - Steve

Sent from my iPhone!!


Carl Malamud

Dec 24, 2007, 12:15:57 PM
to Open Government
Here's the letter:
http://public.resource.org/letter_from_marybeth_peters.pdf

http://bulk.resource.org/copyright has the initial crawl.
http://rss.resource.org/ contains updates.

And, here is where we mirrored our data from:
http://cocatalog.loc.gov/robots.txt

Same procedure applies to gpo.gov, fwiw.

Joseph Lorenzo Hall

Dec 24, 2007, 12:40:59 PM
to open-go...@googlegroups.com
On Dec 24, 2007 10:03 AM, Steven.mandzik <steven....@gmail.com> wrote:
>
> I think that what is needed to make this happen quicker is a strong
> voice from us but backed by many thousand citizens...

Yeah, I think this is wise... it would be neat to canvass search
engines and see who obeys robots.txt and, either way, why or why
not. I have a feeling that, despite the "encouraging, but not
mandatory" law and the LoC opinion that Carl mentioned in Sebastopol,
there are political considerations to which some aggregators might
have to bow. If it were a mandatory regulation -- "thou shalt not use
robots.txt, yo" -- then the Googles, etc. wouldn't have to worry
about the advisory nature of robots.txt.

And feel free to slap me and say, "this isn't a problem, Joe!". best, Joe

Carl Malamud

Dec 24, 2007, 4:43:04 PM
to Open Government
I'm not sure it is a huge problem ...

Here are a few robots.txt files in the government:

http://www.google.com/search?q=inurl%3Arobots.txt+site%3A.gov

Here is a particularly amazing example:

http://www.epa.gov/robots.txt

If somebody mirrored epa.gov, ignoring and excluding the
robots.txt file, the mirror would quickly be crawled by search
engines. As a result, when people searched for EPA-related
material, the mirror sites would have much higher search engine
ranking.

At some point, the EPA Administrator will search for something
and notice that they've been p3wned. As a result, the
EPA staff will grow concerned, many meetings will ensue,
and finally the robots.txt file will be changed to increase the
"pagerank" of the government site vis-a-vis the mirror.

Your scenario may vary ...

Joseph Lorenzo Hall

Dec 27, 2007, 11:57:25 AM
to open-go...@googlegroups.com
On Dec 24, 2007 2:43 PM, Carl Malamud <carl+...@resource.org> wrote:
>
> I'm not sure it is a huge problem ...

Well, I'd like to hear from for-profit search before I decide that
it's not a problem. :) best, Joe
