Last week the House Committee on House Administration (here in the U.S.) held a conference on legislative data and transparency. Reynold Schweickhardt, the committee’s director of technology policy, made an interesting observation at the start of the day that policy for public information often is framed in terms of 3 A's:
accessibility, authenticity, and accuracy.
I thought about that over the next few hours. They are good principles. And yet we data geeks so often find ourselves having to start from scratch explaining why clean data is so important. It seems contradictory: if accuracy is a concept practitioners in government get, and if 'clean' is a type of accuracy, then there must be some communications failure here if we're having a hard time explaining open data to government agencies. (To be clear, Reynold totally gets it.)
So I was thinking that morning, what other word do we need to add to those 3 As to work open data in there? At first I thought about adding "precision". Precision is one thing we're usually asking for when we ask for open data. Precision is basically granularity. Compared to say a PDF, XHTML is more granular because it is explicit about section boundaries, paragraphs, identifying where in the document the important things are like names and dollar amounts, etc. (It is more granular with respect to the meaning of the document, though not its pagination.)
But precision is too narrow. When Congress releases its institutional spending records, it does so in a PDF. That PDF has high precision --- it gets down practically to line items. The problem with the PDF is that it has low accuracy because getting it into a spreadsheet format and de-duping names introduces errors.
But accuracy is already one of the three As. So what's missing here?
The Association for Computing Machinery’s Recommendation on Open Government (February 2009) figured this out:
> "Data published by the government should be in formats and approaches that promote analysis and reuse of that data."
Not only is it right, but "analysis" starts with the letter A. Plus, in order to do any useful analysis on large amounts of information, we need automation --- another A word. That is fate if I ever saw it.
Proposing a whole 17 distinct principles of open government data (read the chapter!) might be, let's say, overwhelming in any practical situation. If we had to make do with just four words, maybe these will do:
accessible, authentic, accurate, and analyzable (using automation, because data is big these days).
Analyzable gives deeper meaning to the other three words. Accuracy is too vague alone. You can't measure accuracy in the absence of some process. In the computer science world, accuracy is how often something comes out right. I think government documents people have considered that 'something' to be whether a Xerox machine copies enough pixels correctly. That's not sufficient for analysis anymore. We can't go hiring thousands of interns to read all of the documents governments produce. We didn't build computers for nothing.
With analyzable added, the meaning of accuracy is that an *automated computer process* will get it right. If someone says a document is accurate because it is a scan, I'll say that's what accurate meant in the 1960s. If the fourth "A" of government information is analyzable, we can redefine accuracy for 2012.
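As a rough illustration of accuracy as "how often something comes out right" for an automated process, here is a minimal Python sketch. The function name, the sample payee names, and the hand-checked reference values are all hypothetical; none of them come from the actual House disbursement records.

```python
def extraction_accuracy(extracted, reference):
    """Fraction of records an automated process got right.

    Both arguments are equal-length lists. `reference` holds values
    verified by hand, which is an assumption of this sketch.
    """
    if len(extracted) != len(reference):
        raise ValueError("lists must be the same length")
    if not reference:
        raise ValueError("need at least one record")
    # Count positions where the automated value matches the checked value.
    correct = sum(1 for got, want in zip(extracted, reference) if got == want)
    return correct / len(reference)


# Hypothetical payee names pulled out of a PDF by a script,
# next to the same names checked by a person.
extracted = ["ACME SUPPLY CO", "J SMITH", "ACME SUPPLY C0", "CAPITOL PRINTING"]
reference = ["ACME SUPPLY CO", "J SMITH", "ACME SUPPLY CO", "CAPITOL PRINTING"]

accuracy = extraction_accuracy(extracted, reference)
print(accuracy)  # 3 of the 4 records match
```

A scan could be pixel-perfect and still score poorly on a measure like this, since a single misread character (the "C0" above) breaks de-duplication.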
But if you want the full 17 principles, read the rest of the chapter, which tackles data quality (accuracy & precision), machine processability, and other concepts in more detail. There's also a case study on the House disbursements documents, looking at whether and how it met the 17 principles:
This is a great point -- and I think there's a perfect A word for it:
Adaptability.
That captures the spirit of innovation that infuses so much of this work. And if data is adaptable, it is also capable of being analyzed -- or so I would think?
--
David Robinson
Knight Law and Media Scholar
Information Society Project
Yale Law School
JD Class of 2012
David.Robin...@Yale.edu
(202) 657-9892
On Sat, Feb 11, 2012 at 8:43 PM, Josh Tauberer <taube...@govtrack.us> wrote:
> --
> You received this message because you are subscribed to the Google Groups "Open House Project" group.
> To post to this group, send email to openhouseproject@googlegroups.com.
> To unsubscribe from this group, send email to openhouseproject+unsubscribe@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/openhouseproject?hl=en.
I'm on the fence about the 4 A’s....actually I’m starting to consider
the idea that the community is collectively attempting to impress too
many things upon the "open data" concept.
Strip it down to the very basic. What is open data in essence? What
does it mean to be "open"? I try to think about it in comparison to open
source, open access, open science. What makes something “open”? I
would argue often what makes things “open” in these areas is a
question of intellectual property/licensing/etc. Now think about the
principles of open data....are the principles of open data qualities
of open data? requirements? characteristics? or are they merely things
we desire from data? optimal conditions for data? desiderata?
In comparison to open source, we only ask that code be licensed to be
open source. We don’t ask that the code compiles, is well documented,
or works well or as intended. Those are things that might be
expected or desired but are certainly not required for it to be “open”. As
for data we have been adding other issues to the mix in addition to
issues of intellectual property/licensing. We talk about issues of
content, data quality, accuracy, timeliness, completeness, primary,
machine processable, etc. These are all important issues to the
dissemination and access of data but are they things that make data
open or are they data desiderata?
@justgrimes
On Jan 29, 8:37 pm, Josh Tauberer <taube...@govtrack.us> wrote:
What about 'API' for the fourth 'A'?
On the other hand, one might argue that ease of programmatic readability is another facet of 'Accessibility', since in the age of 'big data', data is not really accessible if it isn't formatted for programmatic access. In fact, one way of thwarting transparency is to overwhelm the user with enormous volumes of documents that effectively cannot be parsed, summarized and searched efficiently. Think of the last scene of 'Raiders of the Lost Ark'…
Anyway, I totally agree that programmatic machine readability is absolutely key for big data.
Thanks for thoughts,
- Greg Slater
I don't know that searching for words that start with "A" will necessarily lead to the best outcome... ;-)
For those that missed it, my paper "Publication Practices for Transparent Government" cited two "A" words, though only by happenstance: authoritative sourcing, availability, machine-discoverability, and machine-readability.
On Sun, Feb 12, 2012 at 1:21 PM, Gregory Slater <tenk...@pacbell.net> wrote:
I'm not sure about alliteration either. I came up with eight criteria for open data in general <http://on.fb.me/xuLADW>
W. David Stephenson
"Data Dynamite: how liberating information will transform our world"
On Sun, Feb 12, 2012 at 2:24 PM, WashingtonWatch.com <webmas...@washingtonwatch.com> wrote:
> I don't know that searching for words that start with "A" will necessarily lead to the best outcome... ;-)
> For those that missed it, my paper "Publication Practices for Transparent Government" cited two "A" words, though only by happenstance: authoritative sourcing, availability, machine-discoverability, and machine-readability.
> On Sun, Feb 12, 2012 at 1:21 PM, Gregory Slater <tenk...@pacbell.net> wrote:
>> What about 'API' for the fourth 'A'?
>> On the other hand, one might argue that ease of programmatic readability is another facet of 'Accessibility', since in the age of 'big data', data is not really accessible if it isn't formatted for programmatic access. In fact, one way of thwarting transparency is to overwhelm the user with enormous volumes of documents that effectively cannot be parsed, summarized and searched efficiently. Think of the last scene of 'Raiders of the Lost Ark'…
>> Anyway, I totally agree that programmatic machine readability is absolutely key for big data.
>> Thanks for thoughts,
>> - Greg Slater
-- W. David Stephenson | Principal | Stephenson Strategies D.Stephen...@stephensonstrategies.com 335 Main St., Medfield, MA 02052 | (508 ) 740-8918
My suggestions wouldn't fit with the whole "A" paradigm, but the first thing I think of when I think about machine readability is "queryable". Another thing that comes to mind when I think about deduplication problems is "normalized".
The nice thing about these definitions is that they have real (already defined) meaning, and can be tested or measured. Datasets could be tagged with their level of normalization, for example "1NF".
I'm thinking queryability is probably in the accessibility category and would be a baseline requirement, since big data isn't accessible unless it can be queried. Normalization would be part of the definition for accuracy, and could be objectively assessed as one of the normal forms.
Just my $0.02.
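To make the deduplication point concrete, here is a minimal sketch of normalization in the loose sense used above: each payee is stored once and payments refer to it by ID. The sample rows, the `canonical_key` matching rule, and the function names are all assumptions made up for illustration; this is not a formal normal-form check, and real name matching would need far more care.

```python
import re
from decimal import Decimal

# Hypothetical flat records, as might come out of a spreadsheet export.
FLAT_ROWS = [
    {"payee": "ACME Office Supply, Inc.", "amount": "120.50"},
    {"payee": "Acme Office Supply Inc", "amount": "75.00"},
    {"payee": "Capitol Catering", "amount": "310.25"},
]

def canonical_key(name):
    """Crude matching key: lowercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())

def normalize(rows):
    """Split flat rows into a payee table and a payment table."""
    payee_ids = {}    # canonical key -> payee id
    payee_names = {}  # payee id -> first spelling seen
    payments = []
    for row in rows:
        key = canonical_key(row["payee"])
        if key not in payee_ids:
            payee_ids[key] = len(payee_ids) + 1
            payee_names[payee_ids[key]] = row["payee"]
        payments.append({
            "payee_id": payee_ids[key],
            # Store money as integer cents to avoid float rounding.
            "amount_cents": int(Decimal(row["amount"]) * 100),
        })
    return payee_names, payments

payee_names, payments = normalize(FLAT_ROWS)
```

With the two spellings of the first payee collapsed into one record, totals per payee can be computed without the de-duping errors described earlier in the thread.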
(Some replies of course only went to one list or the other --- apologies if I'm replying to something you didn't see.)
On 02/11/2012 08:58 PM, David Robinson wrote:
> Adaptability.
> That captures the spirit of innovation that infuses so much of this work. And if data is adaptable, it is also capable of being analyzed -- or so I would think?
I like that this makes the focus broader than just analysis, closer to the meaning of transformation.
On 02/12/2012 12:57 AM, Justin Grimes wrote:
> In comparison to open source, we only ask that code be licensed to be open source. We don’t ask that code compiles? is well documented? works well or as intended? etc. Those are things that might be expected or desired but certainly not required of it to be “open”.
Even in the open source world, there are dozens of popular licenses. The minimal requirements for 'open source' aren't necessarily natural --- they no doubt came out of balancing different views and the pragmatic need for interoperability of licenses.
The pragmatic needs for data, and especially government data, are different. If data is meant to serve transparency, then it is important to be able to know what the bits mean, more so than interoperability (for instance).
On 02/12/2012 04:12 AM, innovation institute wrote:
> There is no accuracy in absolute terms.
That's exactly what I was saying. But in my experience, many agencies that produce data, or want to, do not have a well-defined sense of accuracy, or their definition is out of date with respect to data.
On 02/12/2012 01:21 PM, Gregory Slater wrote:
> What about 'API' for the fourth 'A' ?
On 02/12/2012 04:52 PM, Javier Muniz wrote:
> "queryable"
The fear that some of us have with those sorts of recommendations is that agencies will then skip publishing bulk data, and then we'll all have to start getting API keys and bending over backwards to get large slices of the underlying data for any large-scale analysis.
On 02/12/2012 04:52 PM, Javier Muniz wrote:
> The nice thing about these definitions is that they have real (already defined) meaning, and can be tested or measured. Datasets could be tagged with their level of normalization, for example "1NF"
"1NF" (or even 3NF) can be a useful definition and recommendation, but it is very narrow in the types of data it would make sense for (e.g. not documents).
When I said queryable I didn't necessarily mean an API, just a data format that is queryable (sets of CSV files, XML, etc.). If you can provide a schema, then you know your dataset is queryable.
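As a rough illustration of "queryable without an API", the sketch below loads a bulk CSV file into an in-memory SQLite database under a declared schema and runs an ordinary query against it. The column names, the sample rows, and the `disbursements` table are hypothetical, not taken from any real House data set.

```python
import csv
import io
import sqlite3

# Hypothetical bulk file; in practice this would be read from disk.
CSV_TEXT = """payee,office,amount_cents
Capitol Catering,Office A,31025
Acme Office Supply,Office A,12050
Acme Office Supply,Office B,7500
"""

# The schema is what makes the file queryable: it names and types each column.
SCHEMA = """
CREATE TABLE disbursements (
    payee TEXT NOT NULL,
    office TEXT NOT NULL,
    amount_cents INTEGER NOT NULL
)
"""

def load(csv_text):
    """Load CSV text into an in-memory SQLite table that follows SCHEMA."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    reader = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO disbursements VALUES (?, ?, ?)",
        [(r["payee"], r["office"], int(r["amount_cents"])) for r in reader],
    )
    return conn

def totals_by_payee(conn):
    """Return (payee, total cents) pairs, sorted by payee name."""
    return conn.execute(
        "SELECT payee, SUM(amount_cents) FROM disbursements "
        "GROUP BY payee ORDER BY payee"
    ).fetchall()

conn = load(CSV_TEXT)
totals = totals_by_payee(conn)
```

Nothing here needs an API key or a server on the publisher's side: the bulk file plus its schema is enough for anyone to run their own queries.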
Also, while I agree normalization isn't applicable to, say, PDF documents, it is important when you begin to look at documents as a dataset. We do this a lot at Granicus. Documents for us are simply a way of representing the results of a query from one or more datasets. When we produce, for instance, a minutes document, we usually generate it on the fly from all of the data we are able to query about a particular meeting.
This allows us to produce the documents that our customers expect as part of their process, but it also allows us to keep the data both queryable and normalized under the hood.
This type of structure is important to us because we actually want the ability to add new features on top of the data in the future, and having a ton of normal minutes documents would not be useful for that.
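The idea of a document as the rendered result of a query can be sketched as follows. This is only an illustration under assumed data: the `MEETING` and `AGENDA_ITEMS` records and the `render_minutes` function are hypothetical and say nothing about how Granicus actually implements it.

```python
# Hypothetical structured records for one meeting, as a query might return them.
MEETING = {"body": "City Council", "date": "2012-02-12"}
AGENDA_ITEMS = [
    {"number": 2, "title": "Budget amendment", "result": "Tabled"},
    {"number": 1, "title": "Approve prior minutes", "result": "Passed"},
]

def render_minutes(meeting, items):
    """Render a plain-text minutes document from structured records."""
    lines = [f"Minutes: {meeting['body']}, {meeting['date']}", ""]
    for item in sorted(items, key=lambda i: i["number"]):
        lines.append(f"{item['number']}. {item['title']}: {item['result']}")
    return "\n".join(lines)

minutes_text = render_minutes(MEETING, AGENDA_ITEMS)
```

Because the text is derived from the records rather than the other way around, the same data could later feed a different view, such as a vote tally or a search index, without parsing documents.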
The same can be done for legislative documents and workflow. It would require a ton of work to make the shift throughout the entire process, but starting to educate on it now could at least get the ball rolling and maybe address some of the lower hanging fruit.
________________________________________
From: openhouseproject@googlegroups.com [openhouseproject@googlegroups.com] on behalf of Josh Tauberer [taube...@govtrack.us]
Sent: Sunday, February 12, 2012 3:01 PM
To: openhouseproject@googlegroups.com; open-governm...@lists.okfn.org
Subject: Re: [openhouseproject] The Four "A"s of Open Government Data