To be sure, XML is a significant improvement over proprietary and
closed data formats. But it can be a pain to work with, especially
when compared to YAML, JSON, SQLite, or CSV (sometimes).
What do you think? In the face of other formats, should XML be
something to oppose? Are we, the open government community, at the
point where we can be picky about open data formats?
Moral certainty is always a sign of cultural inferiority. - H.L.Mencken
Big XML can be a problem, but it is simple enough (in most cases) to split
the data into smaller documents. All document formats would have similar
issues. In fact, it could be argued that SAX and StAX (as mentioned in one
of the comments) and other streaming parsers allow processes to work well
with large XML documents; the author could also have used Saxon to ease the
XSLT work (a bit limited, yes). His example was more a result of poor
engineering than of any issue intrinsic to XML itself. All of that being
said, for RESTful interfaces I believe that JSON often works better, and
frameworks like Axis2 allow for serving different flavors of data. I would
least prefer CSV and SQLite.
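For what it's worth, the streaming idea can be sketched with Python's standard library. This is only a sketch under assumptions: `xml.etree.ElementTree.iterparse` stands in for SAX/StAX here, and the `<bills>`/`<bill>` element names and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; a real export would be opened from disk instead.
SAMPLE = b"""<bills>
  <bill id="S1"><title>First bill</title></bill>
  <bill id="S2"><title>Second bill</title></bill>
  <bill id="S3"><title>Third bill</title></bill>
</bills>"""


def iter_bills(source):
    """Yield (id, title) pairs one <bill> at a time as the parser reaches them."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "bill":
            yield elem.get("id"), elem.findtext("title")
            # Drop the element's attributes and children once it has been handled,
            # so a large file does not accumulate fully populated elements.
            elem.clear()


bills = list(iter_bills(io.BytesIO(SAMPLE)))
print(bills)
```

The point is only that the file is consumed incrementally, so a large document does not need to be split up or loaded whole before processing.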
Just my two cents.
On Sun, Aug 23, 2009 at 8:18 PM, Luigi Montanez <luigi.monta...@gmail.com> wrote:
> To be sure, XML is a significant improvement over proprietary and
> closed data formats. But it can be a pain to work with, especially
> when compared to YAML, JSON, SQLite, or CSV (sometimes).
> What do you think? In the face of other formats, should XML be
> something to oppose? Are we, the open government community, at the
> point where we can be picky about open data formats?
> - Luigi
-- Derek Williams
Cell: 970.214.8928
Home Office: 970.416.8996
XML is by far the most widely supported data format. We shouldn't be *too* picky about data formats when we're still trying to convince folks that data is a good thing, but IMO XML is the format to push. To take the formats you mentioned, all besides JSON have, I think, a serious problem:
YAML- I don't know what it is (= not widely adopted)
SQLite - Binary, proprietary, only one implementation, and subject to obsoletion
CSV - Not well standardized. No character encoding. Often not generated properly.
XML and JSON are entirely equivalent as far as I can tell, except XML tools are more prevalent and XML has far deeper industry adoption. I haven't run across any advantage of JSON over XML.
I agree with the article that XML can get annoying for large data, but the alternatives make me think twice about recommending another format.
Not that I would complain if anyone used CSV for a large data set --- so long as it was done correctly and documented right. It's just that I wouldn't recommend CSV without being reasonably confident it wouldn't make things worse.
What would be nice would be an actual complete CSV standard (i.e. fully interpretable without anything besides the file). Here's one:
RFC 4180
*plus* the header line is mandatory
*plus* it is UTF-8 encoded
(Can we call this CCSV for "complete CSV"?)
Actually for the international community that uses commas as decimal separators, I think a generic character delimited values (ha, "CSV") standard might be a good idea to have.
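A rough sketch of what reading and writing such a "CCSV" file could look like in Python follows. The function names `write_ccsv` and `read_ccsv` are hypothetical, and the sketch assumes that the `csv` module's default dialect (comma delimiter, double-quote escaping, CRLF line endings) is close enough to RFC 4180 for this purpose.

```python
import csv
import io


def write_ccsv(fieldnames, rows):
    """Return UTF-8 bytes: header line first, minimal quoting, CRLF line endings."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\r\n")
    writer.writeheader()  # the header line is mandatory in this variant
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


def read_ccsv(data):
    """Decode as UTF-8 and treat the first line as the mandatory header."""
    text = data.decode("utf-8")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    if not reader.fieldnames:
        raise ValueError("missing header line")
    return [dict(row) for row in reader]


# Illustrative rows with non-ASCII text and an embedded comma.
rows = [{"name": "Zoë", "city": "Montréal, QC"}]
data = write_ccsv(["name", "city"], rows)
round_tripped = read_ccsv(data)
print(round_tripped)
```

Because the encoding and the header are fixed by the rules, a reader needs nothing besides the file itself to interpret it.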
json is flexible and easy to output in most languages. xml is tedious
because it's very verbose, but that also means it's largely
self-documenting.
if we documented our json as well as XML documents itself, how much
time/convenience would we actually save? :)
i think the reasons most of us like working with json are also the
reasons why in 20 years if you looked at the dataset you might have no idea
what it was or how to use it. for better and worse, as a community we're
(arguably) supposed to think about those aspects of the problem, as well.
as michael points out in his blog post, XML's structure is also very
repetitive, and extremely verbose. would something as simple as json
standardized with a comments section, and header section including name and
data types (for example), suffice for most of us? where does it fail?
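to make the question concrete, here is one possible shape for such a file, sketched in Python. everything in it is an assumption for illustration: the key names `header`, `comments`, `columns` and `rows`, the type names, and the sample figures are all made up, not an existing standard.

```python
import json

# Hypothetical envelope: a header with a name, free-text comments and
# column names/types, followed by the rows themselves.
DOC = """
{
  "header": {
    "name": "county_population",
    "comments": "Illustrative figures only.",
    "columns": [
      {"name": "county", "type": "string"},
      {"name": "population", "type": "integer"}
    ]
  },
  "rows": [
    ["Example County", 1200],
    ["Sample County", 3400]
  ]
}
"""

# Map the made-up type names onto Python types for a basic check.
TYPES = {"string": str, "integer": int, "number": (int, float)}


def load_rows(text):
    """Parse the envelope and check each row against the declared columns."""
    doc = json.loads(text)
    columns = doc["header"]["columns"]
    records = []
    for row in doc["rows"]:
        if len(row) != len(columns):
            raise ValueError("row length does not match header")
        for col, value in zip(columns, row):
            if not isinstance(value, TYPES[col["type"]]):
                raise TypeError(f"{col['name']} should be {col['type']}")
        records.append({col["name"]: value for col, value in zip(columns, row)})
    return records


records = load_rows(DOC)
print(records)
```

where it probably fails is anything the header cannot express, such as units, nested records, or relationships between data sets.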
perhaps, as a simple empirical study, there could be a running wiki page of
example data sets, and notes about where existing formats have failed them.
after 6 months, we can look it over, and come up with a list of suggested
characteristics and example formats which would address as many of them as
possible. has someone done that already?
jessy
(FWIW, as a bit of an afterthought, [CT]sv doesn't strike me as a good format
for arbitrary large data sets, in particular when there's large blobs of
text (or encoded binary, i suppose) involved, since in that case you're often
dealing with tens or hundreds (or more) lines of text (which of course may
or may not have commas, tabs, or even \n newlines in them), and it becomes a
more subtle problem to store properly, and parse.)
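a small sketch of the subtlety, using python's `csv` module with made-up sample text: a quoting-aware writer and reader can round-trip embedded commas, tabs, quotes and newlines, but anything that splits the file on line breaks sees more lines than there are records.

```python
import csv
import io

# Illustrative records; the second field holds free text with awkward characters.
rows = [
    ["id", "notes"],
    ["1", "first line\nsecond line, with a comma\tand a tab"],
    ["2", 'she said "hello"'],
]

# Write with the default dialect, which quotes fields only when needed.
buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)
encoded = buf.getvalue()

# A quoting-aware reader recovers the original records.
decoded = list(csv.reader(io.StringIO(encoded, newline="")))

# A naive line split does not: the embedded newline breaks one record in two.
naive_lines = encoded.splitlines()

print(len(decoded), "records,", len(naive_lines), "physical lines")
```

so the format can carry such text, but only if every producer and consumer handles quoting correctly, which is the part that tends to go wrong in practice.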
On Sun, Aug 23, 2009 at 8:09 PM, Josh Tauberer <taube...@govtrack.us> wrote:
> XML is by far the most widely supported data format.
The problem is that XML is by itself very little of a standard -- it is the specific schemas that are anything like a "data format". So when we "adopt XML" it means very little.
> We shouldn't be *too* picky about data formats when we're still trying to convince folks that data is a good thing, but IMO XML is the format to push.
I disagree; rather, I think a "full CSV" like you describe below is appropriate. CSV or JSON are FAR easier to program to (dare I say something like an order of magnitude) than even the most well described XML. We are trying to aggregate indicators for the Portland OR Metro Region, and programming to an XML format takes days or weeks versus hours for CSV. This is really important when programmer time is limited, and would make a difference in whether we chose a data stream or not for reporting.
I also think the supposed self documenting aspect of XML is completely overrated, bordering on the ridiculous. There is nothing that forces you to include the right metadata in a schema just because it is XML.
> To take the formats you mentioned, all besides JSON has I think a serious problem-
> YAML- I don't know what it is (= not widely adopted)
I agree not a good first choice, but not because you don't know what it is ;)
> SQLite - Binary, proprietary, only one implementation, and subject to obsoletion
Um... incorrect. SQLite is public domain (not even BSD licensed), so if it became important to maintain an old binary format, the community could fork the code. This is the beauty of open source. I think if there is a large multi-table database, SQLite or ASCII SQL code is an excellent way of transmitting it.
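As a rough illustration of the "SQLite or ASCII SQL" idea, here is a sketch using Python's built-in `sqlite3` module. The table names, columns and values are made up for the example; the only real API relied on is `Connection.iterdump()`, which produces the database as plain SQL text.

```python
import sqlite3

# Build a small two-table database in memory (a real export would use a file path).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE indicator (
        region_id INTEGER REFERENCES region(id),
        year INTEGER,
        value REAL
    );
    INSERT INTO region VALUES (1, 'Example Region');
    INSERT INTO indicator VALUES (1, 2008, 3.5);
    INSERT INTO indicator VALUES (1, 2009, 3.7);
""")
conn.commit()

# The plain-text SQL form, suitable for shipping alongside the binary file.
sql_dump = "\n".join(conn.iterdump())

# A recipient can rebuild the database from the text alone.
copy = sqlite3.connect(":memory:")
copy.executescript(sql_dump)
rows = copy.execute(
    "SELECT r.name, i.year, i.value "
    "FROM indicator AS i JOIN region AS r ON r.id = i.region_id "
    "ORDER BY i.year"
).fetchall()
print(rows)
```

The relationship between the two tables survives the trip, which is the part a single flat file has trouble with.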
> CSV - Not well standardized. No character encoding. Often not generated properly.
> XML and JSON are entirely equivalent as far as I can tell, except XML tools are more prevalent and XML has far deeper industry adoption. I haven't run across any advantage of JSON over XML.
Entirely equivalent ... except in terms of lines of code and complexity. And to be honest, "industry adoption" is not necessarily indicative of its engineering quality. At all.
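A tiny comparison, using only Python's standard library and a made-up record, gives a feel for the difference. It is a sketch rather than a benchmark, and the element and key names are assumptions for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
JSON_TEXT = '{"region": "Metro", "year": 2009, "value": 3.7}'
XML_TEXT = (
    "<indicator><region>Metro</region>"
    "<year>2009</year><value>3.7</value></indicator>"
)

# JSON: one call, and numbers arrive already typed.
from_json = json.loads(JSON_TEXT)

# XML: every value is text, so each field is looked up and converted by hand.
root = ET.fromstring(XML_TEXT)
from_xml = {
    "region": root.findtext("region"),
    "year": int(root.findtext("year")),
    "value": float(root.findtext("value")),
}

print(from_json == from_xml)
```

For a flat record the gap is small; it tends to grow with namespaces, attributes versus elements, and repeated children.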
> Not that I would complain if anyone used CSV for a large data set --- so long as it was done correctly and documented right. It's just that I wouldn't recommend CSV without being reasonably confident it wouldn't make things worse.
> What would be nice would be an actual complete CSV standard (i.e. fully interpretable without anything besides the file). Here's one:
> RFC 4180
> *plus* the header line is mandatory
> *plus* it is UTF-8 encoded
> (Can we call this CCSV for "complete CSV"?)
Good idea. I actually think a standard like HTTP might be a good approach, in which there is a header section with arbitrary key-value information, which includes column names (this is what they mean by "header" in the RFC), a little bit of metadata (name of the table, etc), the separator value, etc. Then two newlines, then the data.
(Encoding this header information in XML tags just makes them harder to parse, without any payoff. )
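Something like the following is what I have in mind, sketched in Python. The key names (`table`, `separator`, `columns`), the sample values, and the function name `parse_headed_csv` are all hypothetical; nothing here is an existing standard.

```python
import csv
import io

# Hypothetical file: "key: value" lines, a blank line, then the data.
# The separator is ";" so that commas can serve as decimal marks.
SAMPLE = (
    "table: indicators\n"
    "separator: ;\n"
    "columns: region;year;value\n"
    "\n"
    "Metro;2008;3,5\n"
    "Metro;2009;3,7\n"
)


def parse_headed_csv(text):
    """Split the header block from the data and return (metadata, rows)."""
    head, _, body = text.partition("\n\n")
    meta = {}
    for line in head.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip().lower()] = value.strip()
    sep = meta.get("separator", ",")  # fall back to a comma if none is declared
    columns = meta["columns"].split(sep)
    reader = csv.reader(io.StringIO(body, newline=""), delimiter=sep)
    rows = [dict(zip(columns, row)) for row in reader]
    return meta, rows


meta, rows = parse_headed_csv(SAMPLE)
print(meta["table"], rows)
```

One limitation of this sketch: because header values are stripped of surrounding whitespace, a tab separator would have to be spelled out in some escaped form.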
> Actually for the international community that uses commas as decimal separators, I think a generic character delimited values (ha, "CSV") standard might be a good idea to have.
See above paragraph
In summary, I would argue that XML should NOT be encouraged, but rather that an industrial strength CSV format would be best. As a programmer, I find that getting things into and out of XML adds a huge amount of time (even with libraries). It would make it harder for gov't agencies to serve data and harder for users to use data, add a lot of extra bandwidth (due to the tags and whitespace), and not give additional payoff in terms of metadata (since there is nothing intrinsic to a schema per se to make it self documenting).
There is a danger that poor technologies get used because they feel more technical -- XML, in my mind, is popular not for its intrinsic merits (which are slight), but for its emotional connotations as an "industry standard".
One more thing in favor of CSV -- a huge amount of modernity runs on spreadsheets, so getting a government employee to think in terms of exporting to CSV and copying to a directory would be fairly straightforward, but if there were an intermediate fancy data format in between it would be harder to get buy-in.
The non-technical government employee probably doesn't know what CSV is
either, and is going to think in terms of the cost & length of the contract
required to modify their database to be exportable. The technical
government employee is quite capable of thinking in terms of the fancy
formats.
On Mon, Aug 24, 2009 at 1:49 AM, Webb Sprague <webb.spra...@gmail.com> wrote:
> One more thing in favor of CSV -- a huge amount of modernity runs on
> spreadsheets, so getting a government employee to think in terms of
> exporting to CSV and copying to a directory would be fairly
> straightforward, but if there were an intermediate fancy data format
> in between it would be harder to get buy in.
> W
Mostly I just wanted to add in this quote: "XML is like violence: if
it doesn't solve your problem, you're not using enough of it"
But also, I think Driscoll's best point is his first one: XML does
seem to breed bureaucracy in a strange way. It's unfortunately common
to find yourself on a list witnessing a discussion of the relative
merits of various XML variants between people who've never written a
line of code. And there seems to be an assumption in much of the XML-
using world that publishing a DTD is just as good as -- probably
better than! -- publishing a sample document.
But I agree with Josh: XML is what we've got, it's a lingua franca,
and we shouldn't be too picky. I think part of the appeal of other
formats is the simplicity they enforce. If you're going to publish
complex data in CSV, you're going to have to make the format
understandable, and to think about unique identifiers. If you're
using JSON, the library author probably used XML first and realized
that the event-based mechanics of stream parsing are godawful and
should be hidden from the user (though in really extreme cases there's
no getting around it). XML is powerful enough to enable bad design
decisions, and old enough to have tools that suffer from many such
decisions themselves.
Tom
I'd like to point out that formats such as XML and JSON can communicate
parent-child relationships and multiple data types/objects within one
document while CSV cannot.
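To illustrate the point, here is a small Python sketch. The bill record and its field names are made up for the example and are not the OpenLeg schema: JSON carries the parent and its children in one document, while a CSV rendering needs two files tied together by a shared key.

```python
import csv
import io
import json

# Hypothetical bill with child records nested inside the parent.
bill = {
    "id": "S0001",
    "title": "An example bill",
    "sponsors": [{"name": "Sponsor A"}, {"name": "Sponsor B"}],
}

# JSON: parent and children travel together in a single document.
nested = json.dumps(bill)


def to_csv(header, rows):
    """Render one flat table as CSV text."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# CSV: the same data becomes two tables linked by bill_id.
bills_csv = to_csv(["bill_id", "title"], [[bill["id"], bill["title"]]])
sponsors_csv = to_csv(
    ["bill_id", "sponsor_name"],
    [[bill["id"], s["name"]] for s in bill["sponsors"]],
)

print(nested)
print(bills_csv)
print(sponsors_csv)
```

The CSV version works, but the relationship lives in a convention (the shared `bill_id` column) rather than in the format itself.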
On a related note, the OpenLeg effort by the NY Senate CIO team (which I am
a part of) has recognized the XML issue from the get-go, and offers a
variety of view renderings for bills. For instance:
Our system is modular enough to add any custom-requested format or variant
as necessary. In short, we aren't betting the farm on any one format or
schema, but instead building in flexibility and iterating. We'd love to
support a CSV++ format if it was defined, and will also be adding easy to
use/parse formats like RSS and KML as appropriate/useful.
+Nathan
On Mon, Aug 24, 2009 at 9:56 AM, Matt Brennan <matty.bren...@gmail.com> wrote:
> The non-technical government employee probably doesn't know what CSV is
> either, and is going to think in terms of the cost & length of the contract
> required to modify their database to be exportable. The technical
> government employee is quite capable of thinking in terms of the fancy
> formats.
> On Mon, Aug 24, 2009 at 1:49 AM, Webb Sprague <webb.spra...@gmail.com>wrote:
>> One more thing in favor of CSV -- a huge amount of modernity runs on
>> spreadsheets, so getting a government employee to think in terms of
>> exporting to CSV and copying to a directory would be fairly
>> straightforward, but if there were an intermediate fancy data format
>> in between it would be harder to get buy in.
>> W
>> On Sun, Aug 23, 2009 at 10:45 PM, Webb Sprague<webb.spra...@gmail.com>
>> wrote:
>> > A few notes below from an interested party.
>> > On Sun, Aug 23, 2009 at 8:09 PM, Josh Tauberer<taube...@govtrack.us>
>> wrote:
>> >> XML is by far the most widely supported data format.
>> > The problem is that XML is by itself very little of a standard -- it
>> > is the specific schemas that are anything like a "data format". So
>> > when we "adopt XML" it means very little.
>> >> We shouldn't be
>> >> *too* picky about data formats when we're still trying to convince
>> folks
>> >> that data is a good thing, but IMO XML is the format to push.
>> > I disagree, rather I think a "full CSV" like you describe below is
>> > appropriate. CSV or JSON are FAR easier to program to (dare I say
>> > something like an order of magnitude) than even the most well
>> > described XML. We are trying to aggregate indicators for the Portland
>> > OR Metro Region, and programming to an XML format takes days or weeks
>> > versus hours for CSV. This is really important when programmer time
>> > is limited, and would make a difference in whether we chose a data
>> > stream or not for reporting.
>> > I also think the supposed self documenting aspect of XML is completely
>> > overrated, bordering on the ridiculous. There is nothing that forces
>> > you to include the right metadata in a schema just because it is XML
>> >> To take
>> >> the formats you mentioned, all besides JSON has I think a serious
>> problem-
>> >> YAML- I don't know what it is (= not widely adopted)
>> > I agree not a good first choice, but not because you don't know what it
>> is ;)
>> >> SQLite - Binary, proprietary, only one implementation, and subject to
>> >> obsoletion
>> > Um... incorrect. SQLite public domain (not even BSD licensed), so if
>> > it became important to maintain an old binary format, the community
>> > could fork the code. This is the beauty of open source. I think if
>> > there is a large multi table database, SQLite or ASCII SQL code is an
>> > excellent way of transmitting it.
>> >> CSV - Not well standardized. No character encoding. Often not generated
>> >> properly.
>> >> XML and JSON are entirely equivalent as far as I can tell, except XML
>> >> tools are more prevalent and XML has far deeper industry adoption. I
>> >> haven't run across any advantage of JSON over XML.
>> > Entirely equivalent ... accept in terms of lines of code and
>> > complexity. And to be honest, "industry adoption" is not necessarily
>> > indicative of its engineering quality. At all.
>> >> Not that I would complain if anyone used CSV for a large data set ---
>> >> so long as it was done correctly and documented right. It's just that I
>> >> wouldn't recommend CSV without being reasonably confident it wouldn't
>> >> make things worse.
>> >> What would be nice would be an actual complete CSV standard (i.e. fully
>> >> interpretable without anything besides the file). Here's one:
>> >> RFC 4180
>> >> *plus* the header line is mandatory
>> >> *plus* it is UTF-8 encoded
>> >> (Can we call this CCSV for "complete CSV"?)
>> > Good idea. I actually think a standard like HTTP might be a good
>> > approach in which there is a header section with arbitrary key value
>> > information, which includes column names (this is what they mean by
>> > "header" in the RFC), a little bit of metadata (name of the table,
>> > etc), the separator value, etc. Then two newlines, then the data.
>> > (Encoding this header information in XML tags just makes them harder
>> > to parse, without any payoff. )
>> >> Actually for the international community that uses commas as decimal
>> >> separators, I think a generic character delimited values (ha, "CSV")
>> >> standard might be a good idea to have.
>> > See above paragraph
>> > In summary, I would argue that XML should NOT be encouraged, but
>> > rather an industrial strength CSV format would be best. As a
>> > programmer, getting things into and out of XML adds a huge amount of
>> > time (even with libraries), would make it harder for gov't agencies to
>> > serve data, harder for users to use data, add a lot of extra bandwidth
>> > (due to the tags and whitespace), and not give additional payoff in
>> > terms of metadata (since there is nothing intrinsic to a schema per se
>> > to make it self documenting).
>> > There is a danger that poor technologies get used because they feel
>> > more technical -- XML, in my mind, is popular not for its intrinsic
>> > merits (which are slight), but for its emotional connotations as an
>> > "industry standard".
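The HTTP-style layout quoted above (a block of key/value header lines, a blank line, then the data) can be sketched in a few lines of Python. No such standard exists; the header names `Table`, `Columns` and `Separator`, the function name and the sample values are all hypothetical, made up purely for illustration.

```python
import csv
import io


def parse_headed_csv(text):
    """Parse a hypothetical 'header block + blank line + data' document."""
    # Everything before the first blank line is metadata, the rest is data.
    header_block, _, body = text.partition("\n\n")
    meta = {}
    for line in header_block.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip().lower()] = value.strip()
    # The separator is declared in the header, so a decimal comma in the
    # data is harmless when the file says it uses ';'.
    delimiter = meta.get("separator", ",")
    columns = [c.strip() for c in meta["columns"].split(",")]
    reader = csv.reader(io.StringIO(body), delimiter=delimiter)
    rows = [dict(zip(columns, row)) for row in reader if row]
    return meta, rows


# Made-up sample document (the values are placeholders, not real indicators).
sample = (
    "Table: indicators\n"
    "Columns: region, year, value\n"
    "Separator: ;\n"
    "\n"
    "Portland;2009;3,5\n"
    "Salem;2009;2,1\n"
)
meta, rows = parse_headed_csv(sample)
```

Declaring the separator in the header is what would let the same reader handle both comma-delimited files and files from locales that use the comma as a decimal mark.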
Maybe some kind of "industrial strength CSV", or CSV++, format, would
be ideal, but the problem is that it doesn't exist. And you can't get
all these government agencies to coordinate in the way you'd have to
for them to create something new that meets all their needs. Nobody'd
do anything til it passed ISO standardization!
No, we have to go with what's out there, and the choice is between
JSON and XML. YAML is beautiful and terse (and, in fact, completely
compatible with JSON), but not so much so that it's worth picking over
JSON when there are so many more JSON parsing tools available. SQLite
is awesome, but it's binary, not easily "browsable" in a text editor
or browser, and you're going to have to form queries to take the data
out. They're not good candidates for universalizing government data
output.
We should be pushing as many agencies as possible to output in both
XML and JSON. But if there's not enough political capital or
technical comprehension, or whatever, to get an agency to output
both...well, then give XML. And we'll deal with it being large.
Just, nobody bother with DTDs. They are a waste of everyone's time and energy.
On Mon, Aug 24, 2009 at 10:25 AM, Nathan Freitas<nathanfrei...@gmail.com> wrote:
> I'd like to point out that formats such as XML and JSON can communicate
> parent-child relationships and multiple data types/objects within one
> document while CSV cannot.
> On a related note, the OpenLeg effort by the NY Senate CIO team (which I am
> a part of), has recognized the XML issue from the get-go, and offers a
> variety of view renderings for bills. For instance:
> Our system is modular enough to add in any custom requested or format
> variant necessary. In short, we aren't betting the farm on any one format or
> schema, but instead building in flexibility and iterating. We'd love to
> support a CSV++ format if it was defined, and will also be adding easy to
> use/parse formats like RSS and KML as appropriate/useful.
> +Nathan
> On Mon, Aug 24, 2009 at 9:56 AM, Matt Brennan <matty.bren...@gmail.com>
> wrote:
>> The non-technical government employee probably doesn't know what CSV is
>> either, and is going to think in terms of the cost & length of the contract
>> required to modify their database to be exportable. The technical
>> government employee is quite capable of thinking in terms of the fancy
>> formats.
>> On Mon, Aug 24, 2009 at 1:49 AM, Webb Sprague <webb.spra...@gmail.com>
>> wrote:
>>> One more thing in favor of CSV -- a huge amount of modernity runs on
>>> spreadsheets, so getting a government employee to think in terms of
>>> exporting to CSV and copying to a directory would be fairly
>>> straightforward, but if there were an intermediate fancy data format
>>> in between it would be harder to get buy in.
>>> W
>>> On Sun, Aug 23, 2009 at 10:45 PM, Webb Sprague<webb.spra...@gmail.com>
>>> wrote:
>>> > A few notes below from an interested party.
>>> > On Sun, Aug 23, 2009 at 8:09 PM, Josh Tauberer<taube...@govtrack.us>
>>> > wrote:
>>> >> XML is by far the most widely supported data format.
>>> > The problem is that XML is by itself very little of a standard -- it
>>> > is the specific schemas that are anything like a "data format". So
>>> > when we "adopt XML" it means very little.
>>> >> We shouldn't be
>>> >> *too* picky about data formats when we're still trying to convince
>>> >> folks
>>> >> that data is a good thing, but IMO XML is the format to push.
"The mission of the RDB2RDF Working Group, part of the Semantic Web Activity, is to standardize a language for mapping relational data and relational database schemas into RDF and OWL, tentatively called the RDB2RDF Mapping Language, R2RML."
There are also tools around already which can be configured (using their own custom languages) to expose an RDF view of non-RDF relational/tabular data. For example, see http://www4.wiwiss.fu-berlin.de/bizer/d2rq/
Somewhat similarly, the GRDDL standard explains how various non-RDF markups can be mapped to RDF using XSLT - http://en.wikipedia.org/wiki/GRDDL
So the story here is that different data providers can choose the formats that make sense to them, but increasingly can document their concrete formats using shared schemas/ontologies. Other parties can publish SQL, tabular dumps, XML, JSON or whatever, and have different mappings to the same basic terminology (e.g. http://www.oegov.us/blog/?p=234 and http://www.fao.org/countryProfiles/geoinfo.asp?lang=en).
This doesn't magically solve all interop and documentation practices, but it does suggest some ways of avoiding excessive fragmentation of the data without forcing a "one size fits all" solution on everyone. Anything that is mapped to RDF by one of these techniques can benefit from the SQL-ish SPARQL query language (http://www.w3.org/TR/rdf-sparql-query/), and can be mixed and merged with other mapped data, regardless of the concrete notation. So in theory this gives a way for data from RDFa/microformats, SQL, CSV and plain XML to be integrated...
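As a rough illustration of the "map everything to shared terminology, then merge" idea, here is a toy sketch using only the Python standard library. It is not R2RML, D2RQ or GRDDL, and it does not produce real RDF serializations; the `example.org` URIs, the function names and the sample records are hypothetical. It only shows how rows from a CSV dump and records from a JSON dump can end up as subject/predicate/object triples in one mergeable set.

```python
import csv
import io
import json

# Hypothetical shared vocabulary and identifier namespaces.
VOCAB = "http://example.org/vocab/"
IDS = "http://example.org/id/"


def triples_from_csv(text, id_column):
    """Yield (subject, predicate, object) triples from CSV text."""
    for row in csv.DictReader(io.StringIO(text)):
        subject = IDS + row[id_column]
        for column, value in row.items():
            if column != id_column:
                yield (subject, VOCAB + column, value)


def triples_from_json(text, id_key):
    """Yield triples from a JSON array of flat objects."""
    for record in json.loads(text):
        subject = IDS + str(record[id_key])
        for key, value in record.items():
            if key != id_key:
                yield (subject, VOCAB + key, str(value))


# Made-up data from two imaginary publishers using different formats.
csv_data = "id,name\nx1,Example County\nx2,Sample Town\n"
json_data = '[{"id": "x1", "population": 1000}]'

# Because both sources map to the same identifiers, merging is a set union.
graph = set(triples_from_csv(csv_data, "id")) | set(triples_from_json(json_data, "id"))
about_x1 = sorted((p, o) for s, p, o in graph if s == IDS + "x1")
```

In a real deployment the mapping would be declared in a mapping language and queried with SPARQL rather than hand-written like this; the sketch only shows why agreeing on identifiers and terms matters more than the concrete file format.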
I think Carrie is right that we can't be picky. I asked for Python in my project and now 75% of it's written in PHP. Open data is like open code: take what you can get and be happy anyone cares enough to do it at all. (Of course, the corollary is: if you don't like it you can fix it.) However, that is no reason not to express a strong preference.
As to what that preference should be: XML is wonderful for interoperability, but its verbosity has a number of unfortunate side-effects:
1) The sheer amount of metadata (tags) required to define a simple data format means it needs to be translated to be skimmable.
2) There are a million and one ways to iterate over the data, thus being able to understand the _data_ doesn't mean you can understand any code that _uses_ the data.
3) Webapp developers realized long ago that raw XML is too heavy for responsive AJAX calls--that's why JSON took off in popularity.
What this means is that if we "get" XML and we want to use it in certain ways it's a very taxing process to translate it into a more appropriate format--a process which could cause the loss of data if it's not done well and might be slow even if it is done well.
For all these reasons, I think XML is clearly not an ideal data format.
SQLite is binary--I think binary is a bad way to go for a transport file format. CSV is barely a format at all and offers none of the advantages of any of the other options.
JSON, on the other hand, has many of the same advantages as XML--nesting, self-naming, etc--without any of the bloat. It is easy to validate, ideal for the most common target platform (the web), easy to work with in all modern languages, and supports the subset of data types which are common to most use-cases.
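To make the comparison concrete, here is a small sketch that reads the same nested record from JSON and from XML using only the Python standard library. The record (a bill with an id and two sponsors) and both encodings are made up for illustration; a real feed would have its own schema.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
json_doc = '{"bill": {"id": "S1234", "sponsors": ["Smith", "Jones"]}}'
xml_doc = (
    "<bill id='S1234'><sponsors>"
    "<sponsor>Smith</sponsor><sponsor>Jones</sponsor>"
    "</sponsors></bill>"
)

# JSON parses straight into dicts and lists; nesting is plain indexing.
bill = json.loads(json_doc)["bill"]
json_sponsors = bill["sponsors"]

# XML parses into a tree; the reader has to know which values live in
# attributes, which in child elements, and how to walk to them.
root = ET.fromstring(xml_doc)
xml_id = root.get("id")
xml_sponsors = [el.text for el in root.findall("./sponsors/sponsor")]
```

Both versions carry the same information. The difference is that the XML reader embeds decisions (attribute vs. element, path expressions) that another developer might have made differently, which is the "million and one ways to iterate" point above.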
All that being the case, I would say my particular strong preferences are:
1) If possible, prefer to get multiple formats. (At least XML and JSON.)
2) If that's not possible, prefer JSON.
3) If that's not possible, prefer XML.
4) If that's not possible, prefer that they give you anything rather than nothing.
> To be sure, XML is a significant improvement over proprietary and
> closed data formats. But it can be a pain to work with, especially
> when compared to YAML, JSON, SQLite, or CSV (sometimes).
> What do you think? In the face of other formats, should XML be
> something to oppose? Are we, the open government community, at the
> point where we can be picky about open data formats?
From a getting-the-government-to-publish-more-stuff perspective, if the
choice is between:
A) ask them to publish relevant data as-is with whatever documentation is
available immediately and trust the community of developers to perform
deeper research and transformation as necessary and maintain public
documentation
or
B) decide on a specific data format or set of formats and boycott government
data unless it's in those formats
Path B seems to be the one most likely to lead to a web of bureaucracy and
buck-passing.
I've seen and worked with data published in probably a thousand stupid
antiquated formats. The most recent boneheaded data experience I've had was
with the USPS Zip+4 file, a fixed-width file with a 182-character-long
schema that's published without carriage returns or line feeds which is 8GB
of text on a single row. (Luckily there's a couple GNU tools which will
help -- both "fold" and "fmt" can be used to insert a CRLF after every 182
characters.)
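For anyone without the GNU tools handy, the same record splitting can be sketched in Python. The 182-character width comes from the description above; the function names are hypothetical, and the sketch assumes the file is plain single-byte text so that characters and bytes line up.

```python
import io

RECORD_WIDTH = 182  # record length described above


def iter_records(stream, width=RECORD_WIDTH):
    """Yield fixed-width records from a text stream with no line breaks."""
    while True:
        record = stream.read(width)
        if not record:
            break
        yield record


def split_records(src, dst, width=RECORD_WIDTH):
    """Copy src to dst, writing one record per line; return the count."""
    count = 0
    for record in iter_records(src, width):
        dst.write(record + "\n")
        count += 1
    return count


# Tiny in-memory demonstration with two fake records.
src = io.StringIO("A" * 182 + "B" * 182)
dst = io.StringIO()
n = split_records(src, dst)
```

Reading a fixed number of characters at a time keeps memory use flat, which matters when the whole 8GB file is a single "line". On a real file you would pass objects returned by `open()` instead of the `StringIO` stand-ins.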
I haven't received a physical tape in a while, but it was a little more than
a year ago when I last received data on microfiche.
Point is, XML, JSON, YAML, CSV, TSV, a SQL statement, a DBF, fixed-width,
whatever ... as long as it's got a persistent URI, I can write up a couple
paragraphs on what it is and how to make it usable to the community, and I
bet many of the other folks here can do the same. We can even clean it up
and publish it somewhere else in whatever format (or collection of formats)
we like. If it's hand-scrawled information in PDF or even TIFF, we can
mechanical-turk the data and put it in whatever format we want, or whatever
format the developer prefers. In that case, if you're the one busting your
butt to make the conversion and you think that XML is a bloated and
error-prone format, then publish it in JSON or TSV or SQLite or whatever you
think is superior. If somebody else wants it in XML, that person can
convert it and publish it themselves. Storage is cheap.
The situation we want to avoid is one where an agency only comfortable with
publishing things in DBF publishes nothing since they feel if they can't
publish to a modern format they can't publish at all. Instead, go ahead and
publish it in DBF and I'll convert to JSON and upload it somewhere, or
even better, write a shell script and instructions that allow other folks
to run the data transformation themselves.
On Mon, Aug 24, 2009 at 10:25 AM, Tom Lee <thomas.j....@gmail.com> wrote:
> Mostly I just wanted to add in this quote: "XML is like violence: if
> it doesn't solve your problem, you're not using enough of it"
> But also, I think Driscoll's best point is his first one: XML does
> seem to breed bureaucracy in a strange way. It's unfortunately common
> to find yourself on a list witnessing a discussion of the relative
> merits of various XML variants between people who've never written a
> line of code. And there seems to be an assumption in much of the XML-
> using world that publishing a DTD is just as good as -- probably
> better than! -- publishing a sample document.
> But I agree with Josh: XML is what we've got, it's a lingua franca,
> and we shouldn't be too picky. I think part of the appeal of other
> formats is the simplicity they enforce. If you're going to publish
> complex data in CSV, you're going to have to make the format
> understandable, and to think about unique identifiers. If you're
> using JSON, the library author probably used XML first and realized
> that the event-based mechanics of stream parsing are godawful and
> should be hidden from the user (though in really extreme cases there's
> no getting around it). XML is powerful enough to enable bad design
> decisions, and old enough to have tools that suffer from many such
> decisions themselves.
> Tom
> I think Carrie is right that we can't be picky. I asked for Python in my project and now 75% of it's written in PHP. Open data is like open code: take what you can get and be happy anyone cares enough to do it at all. (Of course, the corollary is: if you don't like it you can fix it.) However, that is no reason not to express a strong preference.
> As to what that preference should be: XML is wonderful for interoperability, but its verbosity has a number of unfortunate side-effects:
> 1) The sheer amount of metadata (tags) required to define a simple data format means it needs to be translated to be skimmable.
How about if every XML format used around here came with a default XSLT that converted it into human-friendly HTML?
(I'd be happy if it was HTML+RDFa, but the HTML part is more important...)
> 2) There are a million and one ways to iterate over the data, thus being able to understand the _data_ doesn't mean you can understand any code that _uses_ the data.
That's a good point
> 3) Webapp developers realized long ago that raw XML is too heavy for responsive AJAX calls--that's why JSON took off in popularity.
> What this means is that if we "get" XML and we want to use it in certain ways it's a very taxing process to translate it into a more appropriate format--a process which could cause the loss of data if it's not done well and might be slow even if it is done well.
> For all these reasons, I think XML is clearly not an ideal data format.
> SQLite is binary--I think binary is a bad way to go for a transport file format. CSV is barely a format at all and offers none of the advantages of any of the other options.
> JSON, on the other hand, has many of the same advantages as XML--nesting, self-naming, etc--without any of the bloat. It is easy to validate, ideal for the most common target platform (the web), easy to work with in all modern languages, and supports the subset of data types which are common to most use-cases.
> All that being the case, I would say my particular strong preferences are:
> 1) If possible, prefer to get multiple formats. (At least XML and JSON.)
Yup
> 2) If that's not possible, prefer JSON.
> 3) If that's not possible, prefer XML.
> 4) If that's not possible, prefer that they give you anything rather than nothing.
I guess a lot depends on what the point is. When we're looking at egov / transparency, a lot of the point is about various government or govt-related parties putting otherwise-hidden information more clearly "on the record". In which case the definition of the fields is almost as important as the actual data --- and matters such as what a NULL or empty field means, what an empty row means, who created the data, etc. Very subtle matters of interpretation can have rather large political and practical consequences. Without clear documentation about what the data means, we can still plot places on maps and generate pie charts, but translating that to policy / trend analysis or citizen activism is trickier...
The BNP (British National Party) is a far-right UK political party. The database contained (apparently) some membership records, but also various people who may have merely been contacts. It is widely reported in blogs etc as being their "membership database", but on wikileaks it is more responsibly reported as being "membership and contacts". Without clear metadata about what these records mean (not in some formal ontology language, but in simple human language!) the data risks being used poorly. Same with open data releases on topics from health, through house prices, to crime.
If all we see is a list of people records in "contacts.csv", we have no idea whether it's "parties that we've contacted" or "parties that've contacted us", or something else entirely. You can make mashups and maps without such metadata, but you can't make *decisions*.
So yep, ask for data in xml, csv, plain text, vcard, ... but don't let that flexibility mean the requirement for clear documentation is waived. I suggest that RDF, simple ontologies and HTML+RDFa might be part of the documentation story, but the principle here is more important than the tool.
I heartily agree with Christopher. I have been working with trying to
use XML effectively since 1995. At that time we were trying to
provide input into what would become the HL-7 standard for health
care. What year is it now?
There is a place for everything, but after over 25 years in this field
I have learned over and over again the HARD way that K.I.S.S. does
apply to software 90% of the time. The other 10% I would prefer to
leave to others in academia.
Cheers,
Owen
On Aug 24, 11:26 am, Christopher Groskopf <staringmon...@gmail.com>
wrote:
> That's my two pennies on the issue,
> Chris
On Mon, Aug 24, 2009 at 4:59 PM, Eric Mill<e...@sunlightfoundation.com> wrote:
> Maybe some kind of "industrial strength CSV", or CSV++, format, would be ideal, but the problem is that it doesn't exist. And you can't get all these government agencies to coordinate in the way you'd have to for them to create something new that meets all their needs. Nobody'd do anything til it passed ISO standardization!
..ooOO(Would something with a W3C stamp on it help there?)
> No, we have to go with what's out there, and the choice is between JSON and XML. YAML is beautiful and terse (and, in fact, completely compatible with JSON), but not so much so that it's worth picking over JSON when there are so many more JSON parsing tools available.
Yep - one gotcha I've heard w.r.t. YAML (vs both JSON and XML) is that it isn't syntactically evident if a YAML file is truncated, so data can go missing silently - e.g. network or server trouble during download of a huge file - and downstream tools mightn't notice. So "YAML? thanks but JSON" seems the right choice there.
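The JSON half of that claim is easy to check with the Python standard library: a document whose top level is an object or array cannot be cut short and still parse, because the closing bracket goes missing. The sample document and the helper name below are made up for illustration. The YAML half is not demonstrated here, since that would need a third-party parser.

```python
import json

# A made-up download and a copy cut off partway through.
complete = '{"records": [{"id": 1}, {"id": 2}, {"id": 3}]}'
truncated = complete[:30]


def load_or_none(text):
    """Return the parsed document, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


full = load_or_none(complete)
partial = load_or_none(truncated)
```

A consumer that gets `None` back knows the download failed and can retry, rather than silently working with a shortened list of records.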
> Just, nobody bother with DTDs. They are a waste of everyone's time and energy.
What conventions do you recommend for documentation of such data?
Thanks for the info., Dan. This is very interesting. These efforts
can only help the current situation, right?
In recent years I have encountered projects, in two separate
industries (visual special effects and business accounting), whose
only major stumbling block was the realization that "my XML won't play
nicely with your XML". Both projects came to a screeching halt until
specialists in each respective industry could create proprietary
adapters.
The love of XML was initially a move toward simplicity, but it fell
down that slippery slope to complexity...and XML "specialization"
creates yet another barrier to entry for open access to public data.
Carrie
On Aug 24, 2009, at 8:05 AM, Dan Brickley wrote:
> "The mission of the RDB2RDF Working Group, part of the Semantic Web
> Activity, is to standardize a language for mapping relational data and
> relational database schemas into RDF and OWL, tentatively called the
> RDB2RDF Mapping Language, R2RML."
> There are also tools around already which can be configured (using
> their own custom languages) to expose an RDF view of non-RDF
> relational/tabular data. For example, see
> http://www4.wiwiss.fu-berlin.de/bizer/d2rq/
> Somewhat similarly, the GRDDL standard explains how various non-RDF
> markups can be mapped to RDF using XSLT -
> http://en.wikipedia.org/wiki/GRDDL
> So the story here is that different data providers can choose the
> formats that make sense to them, but increasingly can document their
> concrete formats using shared schemas/ontologies. Other parties can
> publish SQL, tabular dumps, XML, JSON or whatever, and have different
> mappings to the same basic terminology (eg.
> http://www.oegov.us/blog/?p=234
> http://www.fao.org/countryProfiles/geoinfo.asp?lang=en etc).
> This doesn't magically solve all interop and documentation practices,
> but it does suggest some ways of avoiding excessive fragmentation of
> the data without forcing a "one size fits all" solution on everyone.
> Anything that is mapped to RDF by one of these techniques can benefit
> from the SQL-ish SPARQL query language
> (http://www.w3.org/TR/rdf-sparql-query/), and can be mixed and merged
> with other mapped data, regardless of the concrete notation. So in
> theory this gives a way for data from RDFa/microformats, SQL, CSV and
> plain XML to be integrated...
> cheers,
> Dan
"I retain all my vitamins because I am always steamed." -- Stephen
Colbert
>> Just, nobody bother with DTDs. They are a waste of everyone's time and energy.
> What conventions do you recommend for documentation of such data?
Have DTDs ever been sufficient documentation for someone to learn how
to use an XML document? I would always prefer a human-written web
page or document that describes what the dataset is and what fields to
expect. I don't think you can get around that requirement, and I'd
much rather see resources go into documentation meant for humans than
computers.
DTDs as API documentation do not make sense, but their use in object
serialization is quite useful. It's basically static typing for serialized
objects. DTDs allow development tools to automatically generate object and
client code that can consume the XML documents and services. In the
Java/.NET world this is extremely valuable.
I tend to think of XML and DTDs as being in the same cultural family as
statically typed languages. Even though I would rarely choose them over the
loose, dynamic nature of Python/Ruby and JSON, I appreciate what they are
trying to accomplish.
Jeremy
On Mon, Aug 24, 2009 at 12:35 PM, Eric Mill <e...@sunlightfoundation.com>wrote:
> >> Just, nobody bother with DTDs. They are a waste of everyone's time and
> energy.
> > What conventions do you recommend for documentation of such data?
> Have DTDs ever been sufficient documentation for someone to learn how
> to use an XML document? I would always prefer a human-written web
> page or document that describes what the dataset is and what fields to
> expect. I don't think you can get around that requirement, and I'd
> much rather see resources go into documentation meant for humans than
> computers.
I agree with everything you just wrote. Having a default, human-readable translation readily available for an XML document would go a long way toward reducing the onus of working with it. And the necessity of good documentation is not mitigated by a choice of format.
Also, your point about understanding what the data represents is very well taken and something everyone needs to keep close to their heart when working with datasets they did not generate. That said, I think that any of these formats being discussed /can/ be properly documented.
And I don't think the need for documentation should drive choice of data format, assuming there is a choice to be made. The tenor of the discussion almost makes me wonder if the whole reason that XML is in such wide use has nothing to do with it being a standard _data_ format and everything to do with it (theoretically) having a standard _documentation_ format. Does that mean that if we all just agreed on a standard way of documenting JSON that XML could go away tomorrow?
Dan Brickley wrote:
> On Mon, Aug 24, 2009 at 5:26 PM, Christopher
> Groskopf<staringmon...@gmail.com> wrote:
>> Interesting topic.
>> I think Carrie is right that we can't be picky. I asked for Python in
>> my project and now 75% of it's written in PHP. Open data is like open
>> code: take what you can get and be happy anyone cares enough to do it at
>> all. (Of course, the corollary is: if you don't like it you can fix
>> it.) However, that is no reason not to express a strong preference.
>> As to what that preference should be: XML is wonderful for
>> interoperability, but its verboseness has a number of
>> unfortunate side-effects:
>> 1) The sheer amount of metadata (tags) required to define a simple
>> data format means it needs to be translated to be skimmable.
> How about if every XML format used around here came with a default
> XSLT that converted it into human-friendly HTML?
> (I'd be happy if it was HTML+RDFa, but the HTML part is more important...)
>> 2) There are a million and one ways to iterate over the data, thus
>> being able to understand the _data_ doesn't mean you can understand any
>> code that _uses_ the data.
> That's a good point
>> 3) Webapp developers realized long ago that raw XML is too heavy for
>> responsive AJAX calls--thats why JSON took off in popularity.
>> What this means is that if we "get" XML and we want to use it in certain
>> ways its a very taxing process to translate it into a more appropriate
>> format--a process which could cause the loss of data if its not done
>> well and might be slow even if it is done well.
>> For all these reasons, I think XML is clearly not an ideal data format.
>> Sqlite is binary--I think binary is a bad way to go for a transport file
>> format. CSV is barely a format at all and offers none of the advantages
>> of any of the other options.
>> JSON, on the other hand, has many of the same advantages as
>> XML--nesting, self-naming, etc--without any of the bloat. It is easy to
>> validate, ideal for the most common target platform (the web), easy to
>> work with in all modern languages, and supports the subset of data types
>> which are common to most use-cases.
>> All that being the case, I would say my particular strong preferences are:
>> 1) If possible, prefer to get multiple formats. (At least XML and JSON.)
> Yup
>> 2) If that's not possible, prefer JSON.
>> 3) If that's not possible, prefer XML.
>> 4) If that's not possible, prefer that they give you anything rather than nothing.
> I guess a lot depends on what the point is. When we're looking at egov
> / transparency, a lot of the point is about various government or
> govt-related parties putting otherwise-hidden information more clearly
> "on the record". In which case the definition of the fields is almost
> as important as the actual data --- and matters such as what a NULL or
> empty field means, what an empty row means, who created the data, etc.
> Very subtle matters of interpretation can have rather large political
> and practical consequences. Without clear documentation about what
> the data means, we can still plot places on maps and generate pie
> charts, but translating that to policy / trend analysis or citizen
> activism is trickier...
> The BNP (British National Party) is a far-right UK political party.
> The database contained (apparently) some membership records, but also
> various people who may have merely been contacts. It is widely
> reported in blogs etc as being their "membership database", but on
> wikileaks it is more responsibly reported as being "membership and
> contacts". Without clear metadata about what these records mean (not
> in some formal ontology language, but in simple human language!) the
> data risks being used poorly. Same with open data releases on topics
> from health, through house prices, to crime.
> If all we see is a list of people records in "contacts.csv", we have no
> idea whether it's "parties that we've contacted" or "parties that've
> contacted us", or something else entirely. You can make mashups and
> maps without such metadata, but you can't make *decisions*.
> So yep, ask for data in xml, csv, plain text, vcard, ... but don't let
> that flexibility mean the requirement for clear documentation is
> waived. I suggest that RDF, simple ontologies and HTML+RDFa might be
> part of the documentation story, but the principle here is more
> important than the tool.
On Mon, Aug 24, 2009 at 6:35 PM, Eric Mill<e...@sunlightfoundation.com> wrote:
>>> Just, nobody bother with DTDs. They are a waste of everyone's time and energy.
>> What conventions do you recommend for documentation of such data?
> Have DTDs ever been sufficient documentation for someone to learn how
> to use an XML document?
Not in my experience. I wasn't suggesting DTDs did the job (let alone well), just that any documentation is better than no documentation, so I was curious what you'd rather see instead.
> I would always prefer a human-written web
> page or document that describes what the dataset is and what fields to
> expect. I don't think you can get around that requirement, and I'd
> much rather see resources go into documentation meant for humans than
> computers.
Yup (although having a common data model like RDF's reduces the cost of missing per-schema machine docs).
Personally I find DTDs and schemas hard to read, and am always happy when I stumble across example instances.
The experiment at http://examplotron.org/ is somewhat interesting in that direction - they start with instances and try to turn them into schemas, with a few additional decorations. I'm not advocating for it here, but it is a cute example of a very minimalistic XML schema language...
Like everyone else, I'm finding this an interesting discussion. Michael
Driscoll's article is excellent.
If we substituted "SGML" for "XML" we'd probably get a good sense that
that which is too cumbersome is eventually replaced--or
overlaid--by something more streamlined. So XML is likely to be in the
stack for a long time (just like C or even Fortran), but is not likely to be
used where agility is required.
CSV has its place, too, but, as Nathan stated, CSV has a real problem with any
data that is not simple and strictly tabular. The world is becoming more
name-value-pair oriented, which is why I think we see a rise in JSON
use. Also, JSON and XML are already used for configuration information,
even passing functionality. I haven't seen that done in CSV.
Driscoll's article specifically addresses how XML threatens BIG data. To me,
that means data that exceeds the 64K rows of a spreadsheet. Where data is a
few hundred or just a few thousand rows, CSV often makes very good sense.
I do want to take one point up with Driscoll regarding XML and big data. The
community has long since learned how to handle a scale of information bigger
than our containers, e.g., RAM paging, TCP packets, database shards, etc.
XML, as we handle it today, lacks a built-in type of automated "shard" to break up
big XML into smaller pieces. The ability to produce and consume XML in
shard-like chunks would dramatically reduce the size-related problems.
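As a rough illustration of consuming XML in chunks, here is a sketch using the incremental parser in Python's standard library (`xml.etree.ElementTree.iterparse`). This is streaming on the consumer side, not a sharding feature of XML itself, and the `<records>`/`<record>` element names and the helper `iter_records` are assumptions made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield one dict per <tag> element as soon as that element is complete."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            # Drop the children of the record we just handled so a large
            # file does not accumulate fully in memory.
            elem.clear()


# A tiny stand-in for what would be a very large file on disk.
sample = io.BytesIO(
    b"<records>"
    b"<record><id>1</id><name>alpha</name></record>"
    b"<record><id>2</id><name>beta</name></record>"
    b"</records>"
)

records = list(iter_records(sample, "record"))
for record in records:
    print(record)
```

In real use the records would be processed one at a time instead of being collected into a list, which is what keeps memory use roughly constant.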
And while we are discussing ponies, I sure wish the lazy web would build a
specialized, easy to use data-browser already...
On Mon, Aug 24, 2009 at 1:20 PM, Dan Brickley <dan...@danbri.org> wrote:
> On Mon, Aug 24, 2009 at 6:35 PM, Eric Mill<e...@sunlightfoundation.com>
> wrote:
> >>> Just, nobody bother with DTDs. They are a waste of everyone's time and
> energy.
> >> What conventions do you recommend for documentation of such data?
> > Have DTDs ever been sufficient documentation for someone to learn how
> > to use an XML document?
> Not in my experience. I wasn't suggesting DTDs did the job (let alone
> well), just that any documentation is better than no documentation, so
> I was curious what you'd rather see instead.
> > I would always prefer a human-written web
> > page or document that describes what the dataset is and what fields to
> > expect. I don't think you can get around that requirement, and I'd
> > much rather see resources go into documentation meant for humans than
> > computers.
> Yup (although having a common data model like RDF's reduces the cost
> of missing per-schema machine docs).
> Personally I find DTDs and schemas hard to read, and am always happy
> when I stumble across example instances.
> The experiment at http://examplotron.org/ is somewhat interesting in
> that direction - they start with instances and try to
> turn them into schemas, with a few additional decorations. I'm not
> advocating for it here, but it is a cute example of a very
> minimalistic XML schema language...
This is an interesting discussion! I have a couple of points.
1. We should remember that a hugely important group of downstream consumers of this type of govt data will be people who write scripts to aggregate, visualize, and screen the data. These will probably be written by a planner who knows some PHP, and probably not a comp sci graduate. To enable this, simple data formats are important -- hence my earlier point about hours to parse simple text like CSV versus days/weeks to work with XML. Also a converter to human readable HTML won't help with this pipeline at all.
2. Someone mentioned that we don't need to worry about the technical ability of office workers, since IT specialists are familiar with XML. I think that is wrong -- if we can build open data into day to day workflows by non IT people, it stands a better chance of actually happening. If sharing data requires a special budget and outside labor and is a pain in the A**, it will get dropped at the first excuse (i.e. a budget).
3. Standards discussions in the abstract which address data in the abstract tend toward ungrounded complexity. They attempt to encompass all possible data, yet usually don't work very well for any particular data (witness, ahem, XML; remember that HTML and HTTP were developed ad hoc by programmers in the beginning). Perhaps we should split this discussion into the various types of government data (tabular, image, full database, etc) and try to solve those particular problems. (And I'll bet that CSV + metadata would take care of 50% of govt data quite well, and get us rolling in a big way.)
4. Part of our challenge is to create an audience -- once this has happened, they will start demanding data. I think the most important thing to do is to create that data pipe and feedback loop. If it takes three years to develop the perfect govt schema of everything and to get a few agencies to start using it, versus a few months to get my local police dept to put their tables on line and have it become part of the civic discussion because of some scripter/planner making live graphs, then I would WAY prefer the latter. (I don't want to set up a false dichotomy here and claim that it is one or the other, I just want to explicate the poles.)
5. Someone stated in this discussion something to the effect that "because XML is verbose, it is self documenting" -- I think that is a fallacy. I have wasted hours futzing with XML full of "column begin" and "column end" tags -- quite verbose, quite useless.
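One possible reading of "CSV + metadata" from point 3, sketched in Python with the standard library only. Everything here is hypothetical: the precinct data, the field names, and the idea of shipping the field descriptions as a small JSON document alongside the CSV are assumptions for illustration, not an existing standard.

```python
import csv
import io
import json

# Hypothetical human-written field descriptions published next to the CSV.
metadata = json.loads("""
{
  "description": "Hypothetical incident counts by precinct",
  "fields": {
    "precinct": "Precinct name",
    "incidents": "Number of reported incidents; empty means not reported"
  }
}
""")

# Stand-in for the CSV file itself.
data = io.StringIO("precinct,incidents\nNorth,12\nSouth,\nEast,7\n")


def load_rows(handle, meta):
    """Read CSV rows, refusing any column the metadata does not describe."""
    rows = []
    for row in csv.DictReader(handle):
        unknown = set(row) - set(meta["fields"])
        if unknown:
            raise ValueError(f"undocumented columns: {sorted(unknown)}")
        # Keep "not reported" (empty) distinct from a count of zero.
        row["incidents"] = int(row["incidents"]) if row["incidents"] else None
        rows.append(row)
    return rows


rows = load_rows(data, metadata)
total = sum(r["incidents"] for r in rows if r["incidents"] is not None)
print(total)
```

The parsing stays a few lines long, while the metadata records what an empty field means, which is the kind of detail raised earlier in the thread about interpreting NULLs.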
Anyway, hope my brain dump is useful to someone. W
Data from everyday spreadsheets is never going to conform to a standard; the
structured data we're interested in is going to come out of a database.
And that database is probably proprietary, and it's probably going to
involve a contractor changing the system to get it out. Especially if we're
going to be picky about what format.
In most cases, sharing government data is going to require some budget,
outside labor, and be a pain in the a**. If it didn't, it would have
happened already. It's our job to convince the government that it's worth
it.
~M
On Mon, Aug 24, 2009 at 2:22 PM, Webb Sprague <webb.spra...@gmail.com>wrote:
> This is an interesting discussion! I have a couple of points.
> 1. We should remember that a hugely important group of downstream consumers of
> this type of govt data will be people who write scripts to aggregate,
> visualize, and screen the data. These will probably be written by a
> planner who knows some PHP, and probably not a comp sci graduate. To
> enable this, simple data formats are important -- hence my earlier
> point about hours to parse simple text like CSV versus days/ weeks to
> work with XML. Also a converter to human readable HTML won't help
> with this pipeline at all.
> 2. Someone mentioned that we don't need to worry about the technical
> ability of office workers, since IT specialists are familiar with XML.
> I think that is wrong -- if we can build open data into day to day
> workflows by non IT people, it stands a better chance of actually
> happening. If sharing data requires a special budget and outside
> labor and is a pain in the A**, it will get dropped at the first
> excuse (i.e. a budget).
> 3. Standards discussions in the abstract which address data in the
> abstract tend toward ungrounded complexity. They attempt to encompass all
> possible data, yet usually don't work very well for any particular
> data (witness, ahem, XML; remember that HTML and HTTP were developed
> ad hoc by programmers in the beginning). Perhaps we should split this
> discussion into the various types of government data (tabular, image,
> full database, etc) and try to solve those particular problems. (And
> I'll bet that CSV + metadata would take care of 50% of govt data quite
> well, and get us rolling in a big way.)
> 4. Part of our challenge is to create an audience -- once this has
> happened, they will start demanding data. I think the most important
> thing to do is to create that data pipe and feedback loop. If it
> takes three years to develop the perfect govt schema of everything and
> to get a few agencies to start using it, versus a few months to get my
> local police dept to put their tables on line and have it become part
> of the civic discussion because of some scripter/ planner making live
> graphs, then I would WAY prefer the latter. (I don't want to set up
> a false dichotomy here and claim that it is one or the other, I just
> want to explicate the poles.)
> 5. Someone stated in this discussion something to the effect that
> "because XML is verbose, it is self documenting" -- I think that is a
> fallacy. I have wasted hours futzing with XML full of "column begin" and
> "column end" tags -- quite verbose, quite useless.
> Anyway, hope my brain dump is useful to someone.
> W