How XML Threatens Big Data
Luigi Montanez  
From: Luigi Montanez <luigi.monta...@gmail.com>
Date: Sun, 23 Aug 2009 19:18:50 -0700 (PDT)
Local: Sun, Aug 23 2009 10:18 pm
Subject: How XML Threatens Big Data
I found these arguments to be rather thought-provoking:

http://dataspora.com/blog/xml-and-big-data/

To be sure, XML is a significant improvement over proprietary and
closed data formats. But it can be a pain to work with, especially
when compared to YAML, JSON, SQLite, or CSV (sometimes).

What do you think? In the face of other formats, should XML be
something to oppose? Are we, the open government community, at the
point where we can be picky about open data formats?

- Luigi


Carrie Oviatt
From: Carrie Oviatt <carrie.ovi...@gmail.com>
Date: Sun, 23 Aug 2009 19:31:47 -0700
Local: Sun, Aug 23 2009 10:31 pm
Subject: Re: [sunlightlabs] How XML Threatens Big Data

Picky no.  Pushing for more accessible, workable solutions, yes!!!!!

Carrie Oviatt

On Aug 23, 2009, at 7:18 PM, Luigi Montanez wrote:

> I found these arguments to be rather thought-provoking:

> http://dataspora.com/blog/xml-and-big-data/

> To be sure, XML is a significant improvement over proprietary and
> closed data formats. But it can be a pain to work with, especially
> when compared to YAML, JSON, SQLite, or CSV (sometimes).

> What do you think? In the face of other formats, should XML be
> something to oppose? Are we, the open government community, at the
> point where we can be picky about open data formats?

> - Luigi

Moral certainty is always a sign of cultural inferiority. - H.L.Mencken

Derek Williams
From: Derek Williams <dere...@gmail.com>
Date: Sun, 23 Aug 2009 20:38:41 -0600
Local: Sun, Aug 23 2009 10:38 pm
Subject: Re: [sunlightlabs] How XML Threatens Big Data

Big XML can be a problem, but it is simple enough (in most cases) to split
the data into smaller documents.  All document formats would have similar
issues. In fact, it could be argued that SAX and StAX (as mentioned in one
of the comments) and others allow processes to work well with large XML
documents; the author could also have used Saxon to help with the XSLT (a
bit limited, yes).  His example was more a result of poor engineering than
of any issue intrinsic to XML itself.   All of that being said, for RESTful
interfaces I believe that JSON often works better, and frameworks like Axis2
allow for serving different flavors of data.  CSV and SQLite would be my
least preferred options.
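
A minimal Python sketch of the streaming idea mentioned above. It uses the standard library's `xml.etree.ElementTree.iterparse`, which is a pull-style API in the same spirit as StAX rather than SAX or StAX themselves. The `record` element name and the sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, tag="record"):
    """Count <tag> elements, clearing each one after it has been handled."""
    count = 0
    # "end" events fire once an element has been read completely.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            elem.clear()  # drop the element's children and text to keep memory low
    return count

# Hypothetical sample document; a real one would be a large file on disk.
sample = io.BytesIO(b"<data><record id='1'/><record id='2'/><record id='3'/></data>")
print(count_records(sample))
```

The same loop works on a file path or an open file, so the document never has to be loaded as one string.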

Just my two cents.

On Sun, Aug 23, 2009 at 8:18 PM, Luigi Montanez <luigi.monta...@gmail.com> wrote:

--
Derek Williams
Cell: 970.214.8928
Home Office: 970.416.8996

Josh Tauberer
From: Josh Tauberer <taube...@govtrack.us>
Date: Sun, 23 Aug 2009 23:09:44 -0400
Local: Sun, Aug 23 2009 11:09 pm
Subject: Re: [sunlightlabs] How XML Threatens Big Data
XML is by far the most widely supported data format. We shouldn't be
*too* picky about data formats when we're still trying to convince folks
that data is a good thing, but IMO XML is the format to push. To take
the formats you mentioned, all besides JSON have, I think, a serious problem:

YAML- I don't know what it is (= not widely adopted)
SQLite - Binary, proprietary, only one implementation, and subject to
obsolescence
CSV - Not well standardized. No character encoding. Often not generated
properly.

XML and JSON are entirely equivalent as far as I can tell, except XML
tools are more prevalent and XML has far deeper industry adoption. I
haven't run across any advantage of JSON over XML.

I agree with the article that XML can get annoying for large data, but
the alternatives make me think twice about recommending another format.

Not that I would complain if anyone used CSV for a large data set --- so
long as it was done correctly and documented right. It's just that I
wouldn't recommend CSV without being reasonably confident it wouldn't
make things worse.

What would be nice would be an actual complete CSV standard (i.e. fully
interpretable without anything besides the file). Here's one:
    RFC 4180
    *plus* the header line is mandatory
    *plus* it is UTF-8 encoded
(Can we call this CCSV for "complete CSV"?)
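
A rough sketch of what a reader for such a "complete CSV" could look like in Python. The function name `read_ccsv` is hypothetical, and the standard `csv` module's default dialect is only assumed to be close to RFC 4180, not a strict implementation of it. A reader also cannot tell whether the first row really is a header; it can only require that one exists.

```python
import csv
import io

def read_ccsv(raw_bytes):
    """Parse bytes as comma-separated text: UTF-8 only, first row is the header."""
    text = raw_bytes.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    reader = csv.reader(io.StringIO(text, newline=""))
    try:
        header = next(reader)
    except StopIteration:
        raise ValueError("empty file, so the mandatory header line is missing")
    rows = []
    for row in reader:
        # Every record must have the same number of fields as the header.
        if len(row) != len(header):
            raise ValueError(
                f"line {reader.line_num}: expected {len(header)} fields, got {len(row)}"
            )
        rows.append(dict(zip(header, row)))
    return rows

# Made-up sample with a quoted comma and non-ASCII characters.
sample = 'name,city\r\n"Peña, Ana",Bogotá\r\n'.encode("utf-8")
print(read_ccsv(sample))
```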

Actually for the international community that uses commas as decimal
separators, I think a generic character-delimited values (ha, "CSV")
standard might be a good idea to have.

Josh

On 8/23/2009 10:18 PM, Luigi Montanez wrote:


Jessy Cowan-sharp
From: Jessy Cowan-sharp <jessy.cowansh...@gmail.com>
Date: Sun, 23 Aug 2009 23:30:54 -0700
Local: Mon, Aug 24 2009 2:30 am
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

json is flexible and easy to output in most languages. xml is tedious
because it's very verbose, but that also means it's largely
self-documenting.

if we documented our json as well as XML documents itself, how much
time/convenience would we actually save? :)

i think the reasons most of us like working with json are the same
reasons why, in 20 years, if you looked at the dataset you might have no idea
what it was or how to use it. for better or worse, as a community we're
(arguably) supposed to think about those aspects of the problem as well.

as michael points out in his blog post, XML's structure is also very
repetitive, and extremely verbose. would something as simple as json
standardized with a comments section, and header section including name and
data types (for example), suffice for most of us? where does it fail?

perhaps, as a simple empirical study, there could be a running wiki page of
example data sets, and notes about where existing formats have failed them.
after 6 months, we can look it over, and come up with a list of suggested
characteristics and example formats which would address as many of them as
possible. has someone done that already?

jessy

(FWIW, as a bit of an afterthought, [CT]SV doesn't strike me as a good format
for arbitrary large data sets, in particular when there are large blobs of
text (or encoded binary, i suppose) involved, since in that case you're often
dealing with tens or hundreds (or more) of lines of text (which of course may
or may not have commas, tabs, or even \n newlines in them), and it becomes a
more subtle problem to store properly, and parse.)
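
To make the blob problem above concrete, here is a small sketch using Python's standard `csv` module. The sample rows are invented. A writer that quotes fields can round-trip text containing commas, tabs, quotes and newlines, but only if both producer and consumer use a real CSV parser instead of splitting on line breaks and commas.

```python
import csv
import io

def round_trip(rows):
    """Write rows as CSV text, then parse that text back into lists of strings."""
    buffer = io.StringIO(newline="")   # newline="" leaves embedded newlines untouched
    csv.writer(buffer).writerows(rows)  # fields with , " or newlines get quoted
    buffer.seek(0)
    return list(csv.reader(buffer))

rows = [
    ["id", "body"],
    ["1", 'First line, with a comma\nsecond line with a "quote"\tand a tab'],
]
print(round_trip(rows) == rows)
```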

--
Jessy Cowan-Sharp
w: http://nebula.nasa.gov
p: http://blog.quaternio.net


Webb Sprague
From: Webb Sprague <webb.spra...@gmail.com>
Date: Sun, 23 Aug 2009 22:45:57 -0700
Local: Mon, Aug 24 2009 1:45 am
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data
A few notes below from an interested party.

On Sun, Aug 23, 2009 at 8:09 PM, Josh Tauberer<taube...@govtrack.us> wrote:

> XML is by far the most widely supported data format.

The problem is that XML is by itself very little of a standard -- it
is the specific schemas that are anything like a "data format".  So
when we "adopt XML" it means very little.

> We shouldn't be
> *too* picky about data formats when we're still trying to convince folks
> that data is a good thing, but  IMO XML is the format to push.

I disagree; rather, I think a "full CSV" like you describe below is
appropriate.  CSV or JSON are FAR easier to program to (dare I say
something like an order of magnitude) than even the most well
described XML.  We are trying to aggregate indicators for the Portland,
OR Metro Region, and programming to an XML format takes days or weeks
versus hours for CSV.  This is really important when programmer time
is limited, and would make a difference in whether we chose a data
stream or not for reporting.

I also think the supposed self-documenting aspect of XML is completely
overrated, bordering on the ridiculous.  There is nothing that forces
you to include the right metadata in a schema just because it is XML.

> To take
> the formats you mentioned, all besides JSON have, I think, a serious problem:

> YAML- I don't know what it is (= not widely adopted)

I agree not a good first choice, but not because you don't know what it is ;)

> SQLite - Binary, proprietary, only one implementation, and subject to
> obsolescence

Um... incorrect.  SQLite is public domain (not even BSD licensed), so if
it became important to maintain an old binary format, the community
could fork the code.  This is the beauty of open source.  I think if
there is a large multi table database, SQLite or ASCII SQL code is an
excellent way of transmitting it.
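
As an illustration of the "ASCII SQL code" option, here is a minimal sketch with Python's standard `sqlite3` module. The two tables and their contents are hypothetical; `Connection.iterdump()` produces the database as plain SQL text, which can then be replayed elsewhere.

```python
import sqlite3

# Build a small multi-table database in memory (schema and rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agency (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE dataset (
        id INTEGER PRIMARY KEY,
        agency_id INTEGER NOT NULL REFERENCES agency(id),
        title TEXT NOT NULL
    );
    INSERT INTO agency VALUES (1, 'Example Agency');
    INSERT INTO dataset VALUES (1, 1, 'Example dataset');
""")

# Dump the whole database, schema and data, as SQL statements.
sql_text = "\n".join(conn.iterdump())

# Anyone receiving the text can rebuild the database from it.
restored = sqlite3.connect(":memory:")
restored.executescript(sql_text)
print(restored.execute("SELECT title FROM dataset").fetchall())
```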

> CSV - Not well standardized. No character encoding. Often not generated
> properly.

> XML and JSON are entirely equivalent as far as I can tell, except XML
> tools are more prevalent and XML has far deeper industry adoption. I
> haven't run across any advantage of JSON over XML.

Entirely equivalent ... except in terms of lines of code and
complexity.  And to be honest, "industry adoption" is not necessarily
indicative of its engineering quality.  At all.

> Not that I would complain if anyone used CSV for a large data set --- so
> long as it was done correctly and documented right. It's just that I
> wouldn't recommend CSV without being reasonably confident it wouldn't
> make things worse.

> What would be nice would be an actual complete CSV standard (i.e. fully
> interpretable without anything besides the file). Here's one:
>    RFC 4180
>    *plus* the header line is mandatory
>    *plus* it is UTF-8 encoded
> (Can we call this CCSV for "complete CSV"?)

Good idea.  I actually think a standard like HTTP might be a good
approach, in which there is a header section with arbitrary key-value
information, which includes column names (this is what they mean by
"header" in the RFC), a little bit of metadata (name of the table,
etc), the separator value, etc.  Then two newlines, then the data.

(Encoding this header information in XML tags just makes them harder
to parse, without any payoff.  )
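
One way the HTTP-like layout described above might look, as a Python sketch. Nothing here is an existing standard: the key names `table`, `separator` and `columns`, the function name, and the sample data are all assumptions made for illustration, and the separator is assumed to be a single visible character.

```python
import csv
import io

def parse_headed_table(text):
    """Split 'key: value' header lines from the data that follows a blank line."""
    head, _, body = text.partition("\n\n")  # the first blank line ends the header
    meta = {}
    for line in head.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip().lower()] = value.strip()
    sep = meta.get("separator", ",")
    columns = meta["columns"].split(sep)
    reader = csv.reader(io.StringIO(body, newline=""), delimiter=sep)
    rows = [dict(zip(columns, row)) for row in reader if row]
    return meta, rows

# Hypothetical file using ';' so that decimal commas need no quoting.
sample = (
    "table: indicators\n"
    "separator: ;\n"
    "columns: region;year;value\n"
    "\n"
    "Portland;2008;3,5\n"
    "Salem;2008;4,1\n"
)
meta, rows = parse_headed_table(sample)
print(meta["table"], rows)
```

Declaring the separator in the header also covers the decimal-comma case raised further down in the quoted message.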

> Actually for the international community that uses commas as decimal
> separators, I think a generic character-delimited values (ha, "CSV")
> standard might be a good idea to have.

See above paragraph

In summary, I would argue that XML should NOT be encouraged, but
rather an industrial strength CSV format would be best.  As a
programmer, getting things into and out of XML adds a huge amount of
time (even with libraries), would make it harder for gov't agencies to
serve data, harder for users to use data, add a lot of extra bandwidth
(due to the tags and whitespace), and not give additional payoff in
terms of metadata (since there is nothing intrinsic to a schema per se
to make it self documenting).

There is a danger that  poor technologies get used because they feel
more technical -- XML, in my mind, is popular not for its intrinsic
merits (which are slight), but for its emotional connotations as an
"industry standard".

W


Webb Sprague
From: Webb Sprague <webb.spra...@gmail.com>
Date: Sun, 23 Aug 2009 22:49:20 -0700
Local: Mon, Aug 24 2009 1:49 am
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data
One more thing in favor of CSV -- a huge amount of modernity runs on
spreadsheets, so getting a government employee to think in terms of
exporting to CSV and copying to a directory would be fairly
straightforward, but if there were an intermediate fancy data format
in between it would be harder to get buy in.

W


Matt Brennan
From: Matt Brennan <matty.bren...@gmail.com>
Date: Mon, 24 Aug 2009 09:56:19 -0400
Local: Mon, Aug 24 2009 9:56 am
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

The non-technical government employee probably doesn't know what CSV is
either, and is going to think in terms of the cost & length of the contract
required to modify their database to be exportable.  The technical
government employee is quite capable of thinking in terms of the fancy
formats.

On Mon, Aug 24, 2009 at 1:49 AM, Webb Sprague <webb.spra...@gmail.com> wrote:

--
~
Matt Brennan

Tom Lee
From: Tom Lee <thomas.j....@gmail.com>
Date: Mon, 24 Aug 2009 07:25:07 -0700 (PDT)
Local: Mon, Aug 24 2009 10:25 am
Subject: Re: How XML Threatens Big Data
Mostly I just wanted to add in this quote: "XML is like violence: if
it doesn't solve your problem, you're not using enough of it"

But also, I think Driscoll's best point is his first one: XML does
seem to breed bureaucracy in a strange way.  It's unfortunately common
to find yourself on a list witnessing a discussion of the relative
merits of various XML variants between people who've never written a
line of code.  And there seems to be an assumption in much of the XML-
using world that publishing a DTD is just as good as -- probably
better than! -- publishing a sample document.

But I agree with Josh: XML is what we've got, it's a lingua franca,
and we shouldn't be too picky.  I think part of the appeal of other
formats is the simplicity they enforce.  If you're going to publish
complex data in CSV, you're going to have to make the format
understandable, and to think about unique identifiers.  If you're
using JSON, the library author probably used XML first and realized
that the event-based mechanics of stream parsing are godawful and
should be hidden from the user (though in really extreme cases there's
no getting around it).  XML is powerful enough to enable bad design
decisions, and old enough to have tools that suffer from many such
decisions themselves.

Tom

On Aug 24, 1:49 am, Webb Sprague <webb.spra...@gmail.com> wrote:


Nathan Freitas
From: Nathan Freitas <nathanfrei...@gmail.com>
Date: Mon, 24 Aug 2009 10:25:33 -0400
Local: Mon, Aug 24 2009 10:25 am
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

I'd like to point out that formats such as XML and JSON can communicate
parent-child relationships and multiple data types/objects within one
document while CSV cannot.
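
A small Python sketch of the difference. The bill record and sponsor names are placeholders, not real data. JSON holds the nesting in one document; a single CSV table can only carry it by flattening, which repeats the parent's fields on every child row (or by splitting the data across several files).

```python
import csv
import io
import json

# Hypothetical record: one bill with two sponsors (all values are placeholders).
bill = {
    "id": "EXAMPLE-1",
    "title": "An example bill",
    "sponsors": [{"name": "Sponsor A"}, {"name": "Sponsor B"}],
}

# JSON keeps the parent-child nesting in a single document.
as_json = json.dumps(bill, indent=2)

def flatten(bill):
    """One row per child; the parent's fields are repeated on every row."""
    return [
        {"bill_id": bill["id"], "title": bill["title"], "sponsor": s["name"]}
        for s in bill["sponsors"]
    ]

def to_csv(rows):
    """Serialize the flattened rows as CSV text with a header line."""
    out = io.StringIO(newline="")
    writer = csv.DictWriter(out, fieldnames=["bill_id", "title", "sponsor"])
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

print(as_json)
print(to_csv(flatten(bill)))
```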

On a related note, the OpenLeg effort by the NY Senate CIO team (which I am
a part of), has recognized the XML issue from the get-go, and offers a
variety of view renderings for bills. For instance:

for a collection of bills by committee:
http://open.nysenate.gov/openleg/api/xml/committee/MENTAL+HEALTH+AND+...
http://open.nysenate.gov/openleg/api/json/committee/MENTAL+HEALTH+AND...
http://open.nysenate.gov/openleg/api/csv/committee/MENTAL+HEALTH+AND+...

for a bill:
http://open.nysenate.gov/openleg/api/xml/bill/S1646
http://open.nysenate.gov/openleg/api/json/bill/S1646
http://open.nysenate.gov/openleg/api/csv/bill/S1646

Our system is modular enough to add in any custom-requested format or
variant necessary. In short, we aren't betting the farm on any one format or
schema, but instead building in flexibility and iterating. We'd love to
support a CSV++ format if it was defined, and will also be adding easy to
use/parse formats like RSS and KML as appropriate/useful.

+Nathan

On Mon, Aug 24, 2009 at 9:56 AM, Matt Brennan <matty.bren...@gmail.com> wrote:


Eric Mill
From: Eric Mill <e...@sunlightfoundation.com>
Date: Mon, 24 Aug 2009 10:59:33 -0400
Local: Mon, Aug 24 2009 10:59 am
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data
Maybe some kind of "industrial strength CSV", or CSV++, format, would
be ideal, but the problem is that it doesn't exist. And you can't get
all these government agencies to coordinate in the way you'd have to
for them to create something new that meets all their needs.  Nobody'd
do anything til it passed ISO standardization!

No, we have to go with what's out there, and the choice is between
JSON and XML. YAML is beautiful and terse (and, in fact, completely
compatible with JSON), but not so much so that it's worth picking over
JSON when there are so many more JSON parsing tools available.  SQLite
is awesome, but it's binary, not easily "browsable" in a text editor
or browser, and you're going to have to form queries to take the data
out.  They're not good candidates for universalizing government data
output.

We should be pushing as many agencies as possible to output in both
XML and JSON.  But if there's not enough political capital or
technical comprehension, or whatever, to get an agency to output
both...well, then give XML.  And we'll deal with it being large.

Just, nobody bother with DTDs.  They are a waste of everyone's time and energy.

-- Eric


Dan Brickley
From: Dan Brickley <dan...@danbri.org>
Date: Mon, 24 Aug 2009 17:05:18 +0200
Local: Mon, Aug 24 2009 11:05 am
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data
Cc:'ing some W3C folk, who might not have seen the original thread at
http://groups.google.com/group/sunlightlabs/browse_thread/thread/da91...

On the SQLite, CSV (and Tab-SV, etc) front, folk here might be
interested to look at recent and looming work at W3C:

http://www.w3.org/2009/03/rdb2rdf-charter
 - draft of a proposed charter for a Working Group working on mappings
between relational and RDF data:

"The mission of the RDB2RDF Working Group, part of the Semantic Web
Activity, is to standardize a language for mapping relational data and
relational database schemas into RDF and OWL, tentatively called the
RDB2RDF Mapping Language, R2RML."

The legwork for this was done in an earlier incubator group, whose
findings are online -
http://www.w3.org/2005/Incubator/rdb2rdf/XGR-rdb2rdf-20090126/

There are also tools around already which can be configured (using
their own custom languages) to expose an RDF view of non-RDF
relational/tabular data. For example, see
http://www4.wiwiss.fu-berlin.de/bizer/d2rq/

Somewhat similarly, the GRDDL standard explains how various non-RDF
markups can be mapped to RDF using XSLT -
http://en.wikipedia.org/wiki/GRDDL

There's also a json-grddl proposal at http://buzzword.org.uk/2008/jsonGRDDL/spec

So the story here is that different data providers can choose the
formats that make sense to them, but increasingly can document their
concrete formats using shared schemas/ontologies. Other parties can
publish SQL, tabular dumps, XML, JSON or whatever, and have different
mappings to the same basic terminology (e.g.
http://www.oegov.us/blog/?p=234
http://www.fao.org/countryProfiles/geoinfo.asp?lang=en etc).

This doesn't magically solve all interop and documentation practices,
but it does suggest some ways of avoiding excessive fragmentation of
the data without forcing a "one size fits all" solution on everyone.
Anything that is mapped to RDF by one of these techniques can benefit
from the SQL-ish SPARQL query language
(http://www.w3.org/TR/rdf-sparql-query/), and can be mixed and merged
with other mapped data, regardless of the concrete notation. So in
theory this gives a way for data from RDFa/microformats, SQL, CSV and
plain XML to be integrated...

cheers,

Dan


Christopher Groskopf
From: Christopher Groskopf <staringmon...@gmail.com>
Date: Mon, 24 Aug 2009 08:26:36 -0700
Local: Mon, Aug 24 2009 11:26 am
Subject: Re: [sunlightlabs] How XML Threatens Big Data
Interesting topic.

I think Carrie is right that we can't be picky.  I asked for Python in
my project and now 75% of it is written in PHP.  Open data is like open
code: take what you can get and be happy anyone cares enough to do it at
all.  (Of course, the corollary is: if you don't like it you can fix
it.)  However, that is no reason not to express a strong preference.

As to what that preference should be: XML is wonderful for
interoperability, but its verbosity has a number of
unfortunate side-effects:

    1) The sheer amount of metadata (tags) required to define a simple
data format means it needs to be translated to be skimmable.
    2) There are a million and one ways to iterate over the data, thus
being able to understand the _data_ doesn't mean you can understand any
code that _uses_ the data.
    3) Webapp developers realized long ago that raw XML is too heavy for
responsive AJAX calls--that's why JSON took off in popularity.

What this means is that if we "get" XML and we want to use it in certain
ways, it's a very taxing process to translate it into a more appropriate
format--a process which could cause the loss of data if it's not done
well and might be slow even if it is done well.

For all these reasons, I think XML is clearly not an ideal data format.
SQLite is binary--I think binary is a bad way to go for a transport file
format.  CSV is barely a format at all and offers none of the advantages
of any of the other options.

JSON, on the other hand, has many of the same advantages as
XML--nesting, self-naming, etc--without any of the bloat.  It is easy to
validate, ideal for the most common target platform (the web), easy to
work with in all modern languages, and supports the subset of data types
which are common to most use-cases.

All that being the case, I would say my particular strong preferences are:

    1) If possible, prefer to get multiple formats. (At least XML and JSON.)
    2) If that's not possible, prefer JSON.
    3) If that's not possible, prefer XML.
    4) If that's not possible, prefer that they give you anything rather
than nothing.

That's my two pennies on the issue,
Chris


Luke Peterson
From: Luke Peterson <luke.peter...@gmail.com>
Date: Mon, 24 Aug 2009 12:13:50 -0400
Local: Mon, Aug 24 2009 12:13 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

From a getting-the-government-to-publish-more-stuff perspective, if the
choice is between:

A) ask them to publish relevant data as-is with whatever documentation is
available immediately and trust the community of developers to perform
deeper research and transformation as necessary and maintain public
documentation

or

B) decide on a specific data format or set of formats and boycott government
data unless it's in those formats

Path B seems to be the one most likely to lead to a web of bureaucracy and
buck-passing.

I've seen and worked with data published in probably a thousand stupid
antiquated formats.  The most recent boneheaded data experience I've had was
with the USPS Zip+4 file, a fixed-width file with a 182-character-long
schema that's published without carriage returns or line feeds which is 8GB
of text on a single row.  (Luckily there's a couple GNU tools which will
help --  both "fold" and "fmt" can be used to insert a CRLF after every 182
characters.)
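
For anyone without those tools to hand, the same re-chunking can be sketched in a few lines of Python. This is roughly what `fold -w 182` does; the file names in the usage note are hypothetical, and only the 182-character record width comes from the description above.

```python
import io

RECORD_WIDTH = 182  # record length quoted in the message above

def split_records(stream, width=RECORD_WIDTH):
    """Yield fixed-width records from a stream that contains no line breaks."""
    while True:
        record = stream.read(width)  # reads at most `width` characters
        if not record:
            break
        yield record

# Tiny demonstration with a 4-character width instead of 182.
demo = io.StringIO("aaaabbbbcc")
print(list(split_records(demo, width=4)))
```

On the real file this would be used along the lines of `for rec in split_records(open("zip4.txt")): out.write(rec + "\n")`, reading one record at a time so the 8GB never has to fit in memory.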

I haven't received a physical tape in a while, but it was a little more than
a year ago when I last received data on microfiche.

Point is, XML, JSON, YAML, CSV, TSV, a SQL statement, a DBF, fixed-width,
whatever ... as long as it's got a persistent URI, I can write up a couple
paragraphs on what it is and how to make it usable to the community, and I
bet many of the other folks here can do the same.  We can even clean it up
and publish it somewhere else in whatever format (or collection of formats)
we like.  If it's hand-scrawled information in PDF or even TIFF, we can
mechanical-turk the data and put it in whatever format we want, or whatever
format the developer prefers.  In that case, if you're the one busting your
butt to make the conversion and you think that XML is a bloated and
error-prone format, then publish it in JSON or TSV or SQLite or whatever you
think is superior.  If somebody else wants it in XML, that person can
convert it and publish it themselves.  Storage is cheap.

The situation we want to avoid is one where an agency only comfortable with
publishing things in DBF publishes nothing since they feel if they can't
publish to a modern format they can't publish at all.  Instead, go ahead and
publish it in DBF and I'll convert to JSON and upload it somewhere, or
even better, write a shell script and instructions that allows other folks
to run the data transformation themselves.

Right?

-----
Luke Peterson


Dan Brickley
From: Dan Brickley <dan...@danbri.org>
Date: Mon, 24 Aug 2009 18:21:09 +0200
Local: Mon, Aug 24 2009 12:21 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data
On Mon, Aug 24, 2009 at 5:26 PM, Christopher

How about if every XML format used around here came with a default
XSLT that converted it into human-friendly HTML?

(I'd be happy if it was HTML+RDFa, but the HTML part is more important...)

>    2) There are a million and one ways to iterate over the data, thus
> being able to understand the _data_ doesn't mean you can understand any
> code that _uses_ the data.

That's a good point

>    3) Webapp developers realized long ago that raw XML is too heavy for
> responsive AJAX calls--that's why JSON took off in popularity.

Browser security models had some impact here too - the fact that they
could load "Javascript" (typically a thin callback wrapper around
JSON) from other domains (e.g. see
http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=060ca7c3-b03f-4...
http://simonwillison.net/2005/Dec/16/json/ )
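
The "thin callback wrapper" mentioned above is usually called JSONP. A minimal Python sketch of what the wrapping and unwrapping amount to; the callback name `handleData` and both helper functions are hypothetical, and the regular expression is a simplification that only handles a single call expression.

```python
import json
import re

def wrap_jsonp(callback, payload):
    """Serialize payload as JSON and wrap it in a JavaScript function call."""
    return f"{callback}({json.dumps(payload)});"

# callback name, then everything between the first "(" and the last ")"
_JSONP = re.compile(r"^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$", re.DOTALL)

def unwrap_jsonp(text):
    """Return (callback_name, decoded_payload) from a JSONP response body."""
    match = _JSONP.match(text)
    if match is None:
        raise ValueError("not a JSONP response")
    return match.group(1), json.loads(match.group(2))

body = wrap_jsonp("handleData", {"bill": "EXAMPLE-1", "ok": True})
print(body)
print(unwrap_jsonp(body))
```

Because the response is a script rather than a document, a browser of that era would load it from another domain through a script tag, which it would not do for an XML response fetched by XMLHttpRequest.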

Yup

>    2) If that's not possible, prefer JSON.
>    3) If that's not possible, prefer XML.
>    4) If that's not possible, prefer that they give you anything rather than nothing.

I guess a lot depends on what the point is. When we're looking at egov
/ transparency, a lot of the point is about various government or
govt-related parties putting otherwise-hidden information more clearly
"on the record". In which case the definition of the fields is almost
as important as the actual data --- and matters such as what a NULL or
empty field means, what an empty row means, who created the data, etc.
Very subtle matters of interpretation can have rather large political
and practical consequences.  Without clear documentation about what
the data means, we can still plot places on maps and generate pie
charts, but translating that to policy / trend analysis or citizen
activism is trickier...

Quick example - a few months ago, this data started floating around -
http://wikileaks.org/wiki/British_National_Party_membership_and_conta...

The BNP (British National Party) is a far-right UK political party.
The database contained (apparently) some membership records, but also
various people who may have merely been contacts. It is widely
reported in blogs etc as being their "membership database", but on
wikileaks it is more responsibly reported as being "membership and
contacts". Without clear metadata about what these records mean (not
in some formal ontology language, but in simple human language!) the
data risks being used poorly. Same with open data releases on topics
from health, through house prices, to crime.

If all we see is a list of people records in "contacts.csv", we have no
idea whether it's "parties that we've contacted" or "parties that've
contacted us", or something else entirely. You can make mashups and
maps without such metadata, but you can't make *decisions*.

So yep, ask for data in xml, csv, plain text, vcard, ... but don't let
that flexibility mean the requirement for clear documentation is
waived. I suggest that RDF, simple ontologies and HTML+RDFa might be
part of the documentation story, but the principle here is more
important than the tool.

cheers,

Dan


Owen
From: Owen <od...@csrees.usda.gov>
Date: Mon, 24 Aug 2009 09:22:07 -0700 (PDT)
Local: Mon, Aug 24 2009 12:22 pm
Subject: Re: How XML Threatens Big Data
I heartily agree with Christopher. I have been trying to
use XML effectively since 1995.  At that time we were trying to
provide input into what would become the HL-7 standard for health
care.  What year is it now?

There is a place for everything, but after over 25 years in this field
I have learned over and over again the HARD way that K.I.S.S. does
apply to software 90% of the time.  The other 10% I would prefer to
leave to others in academia.

Cheers,

Owen

On Aug 24, 11:26 am, Christopher Groskopf <staringmon...@gmail.com>
wrote:


Dan Brickley
From: Dan Brickley <dan...@danbri.org>
Date: Mon, 24 Aug 2009 18:28:23 +0200
Local: Mon, Aug 24 2009 12:28 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

On Mon, Aug 24, 2009 at 4:59 PM, Eric Mill<e...@sunlightfoundation.com> wrote:

> Maybe some kind of "industrial strength CSV", or CSV++, format, would
> be ideal, but the problem is that it doesn't exist. And you can't get
> all these government agencies to coordinate in the way you'd have to
> for them to create something new that meets all their needs.  Nobody'd
> do anything til it passed ISO standardization!

..ooOO(Would something with a W3C stamp on it help there?)

> No, we have to go with what's out there, and the choice is between
> JSON and XML. YAML is beautiful and terse (and, in fact, completely
> compatible with JSON), but not so much so that it's worth picking over
> JSON when there are so many more JSON parsing tools available.

Yep - one gotcha I've heard w.r.t. YAML (vs both JSON and XML) is that
it isn't syntactically evident if a YAML file is truncated, so data
can go missing silently - e.g. network or server trouble during download
of a huge file - and downstream tools mightn't notice. So "YAML?
thanks but JSON" seems the right choice there.
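
The truncation gotcha can be sketched with the Python standard library.
Since the standard library has no YAML parser, the sketch uses CSV as a
stand-in for a line-oriented format with no closing delimiter; whether a
given YAML file behaves the same way depends on how it is written (block
style versus bracketed flow style), so treat this as an illustration and
not a statement about any particular YAML tool. The record layout and
helper names are invented.

```python
import csv
import io
import json

records = [{"id": i, "amount": i * 10} for i in range(1, 6)]

# The same five records in a bracketed format and a line-oriented one.
as_json = json.dumps(records)
as_lines = "id,amount\n" + "".join(f"{r['id']},{r['amount']}\n" for r in records)


def parse_json(text):
    """Return the parsed records, or None if the text is not complete JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def parse_lines(text):
    """Parse line-oriented records; there is no closing delimiter to check."""
    return list(csv.DictReader(io.StringIO(text)))


# Simulate a download that stops partway through.
cut_json = as_json[: len(as_json) // 2]
cut_lines = "".join(as_lines.splitlines(keepends=True)[:4])  # header + 3 rows

print(parse_json(cut_json))         # None: the missing closing bracket is detected
print(len(parse_lines(cut_lines)))  # 3: two rows are gone, but nothing complains
```

A publisher of line-oriented data could cover the same gap by stating the
expected row count or a checksum alongside the file.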

> Just, nobody bother with DTDs.  They are a waste of everyone's time and energy.

What conventions do you recommend for documentation of such data?

cheers,

Dan


Carrie Oviatt
From: Carrie Oviatt <carrie.ovi...@gmail.com>
Date: Mon, 24 Aug 2009 09:35:09 -0700
Local: Mon, Aug 24 2009 12:35 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

Thanks for the info., Dan.  This is very interesting.  These efforts  
can only help the current situation, right?

In recent years I have encountered projects, in two separate  
industries (visual special effects and business accounting), whose  
only major stumbling block was the realization that "my XML won't play  
nicely with your XML".  Both projects came to a screeching halt until  
specialists in each respective industry could create proprietary  
adapters.

The love of XML was initially a move toward simplicity, but it fell
down that slippery slope to complexity...and XML "specialization"
creates yet another barrier to entry for open access to public data.

Carrie
On Aug 24, 2009, at 8:05 AM, Dan Brickley wrote:

"I retain all my vitamins because I am always steamed." -- Stephen  
Colbert

Eric Mill
From: Eric Mill <e...@sunlightfoundation.com>
Date: Mon, 24 Aug 2009 12:35:55 -0400
Local: Mon, Aug 24 2009 12:35 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

>> Just, nobody bother with DTDs.  They are a waste of everyone's time and energy.

> What conventions do you recommend for documentation of such data?

Have DTDs ever been sufficient documentation for someone to learn how
to use an XML document?  I would always prefer a human-written web
page or document that describes what the dataset is and what fields to
expect.  I don't think you can get around that requirement, and I'd
much rather see resources go into documentation meant for humans than
computers.

-- Eric


Jeremy Carbaugh
From: Jeremy Carbaugh <jcarba...@sunlightfoundation.com>
Date: Mon, 24 Aug 2009 13:10:38 -0400
Local: Mon, Aug 24 2009 1:10 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

DTDs as API documentation do not make sense, but their use in object
serialization is quite useful. It's basically static typing for serialized
objects. DTDs allow development tools to automatically generate object and
client code that can consume the XML documents and services. In the
Java/.NET world this is extremely valuable.
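
As a rough illustration of "static typing for serialized objects", here is
a hand-written Python stand-in for the kind of class and conversion code a
binding tool might generate from a DTD or schema. The `<grant>` vocabulary,
the `Grant` class and `grant_from_xml` are hypothetical. Note that Python's
`xml.etree.ElementTree` is a non-validating parser, so the checks a DTD
would express are written out by hand here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Grant:
    """Hand-written stand-in for a class a binding tool might generate."""
    grant_id: int
    recipient: str
    amount: float


def grant_from_xml(elem):
    """Convert one <grant> element, failing loudly on missing or mistyped data."""
    def text(tag):
        child = elem.find(tag)
        if child is None or child.text is None:
            raise ValueError(f"<grant> is missing <{tag}>")
        return child.text.strip()

    raw_id = elem.get("id")
    if raw_id is None:
        raise ValueError("<grant> is missing its id attribute")

    # int() and float() raise ValueError on text that is not a number.
    return Grant(
        grant_id=int(raw_id),
        recipient=text("recipient"),
        amount=float(text("amount")),
    )


doc = ET.fromstring(
    "<grants>"
    "<grant id='1'><recipient>Town Library</recipient><amount>2500.00</amount></grant>"
    "<grant id='2'><recipient>River Clinic</recipient><amount>120</amount></grant>"
    "</grants>"
)
grants = [grant_from_xml(g) for g in doc.findall("grant")]
print(grants[0])
```

Consumers then work with typed `Grant` objects and never touch the markup,
which is the convenience the generated code is meant to provide.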

I tend to think of XML and DTDs as being in the same cultural family as
statically typed languages. Even though I would rarely choose them over the
loose, dynamic nature of Python/Ruby and JSON, I appreciate what they are
trying to accomplish.

Jeremy

On Mon, Aug 24, 2009 at 12:35 PM, Eric Mill <e...@sunlightfoundation.com>wrote:


Christopher Groskopf
From: Christopher Groskopf <staringmon...@gmail.com>
Date: Mon, 24 Aug 2009 10:12:25 -0700
Local: Mon, Aug 24 2009 1:12 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data
Dan,

    I agree with everything you just wrote.  Having a default,
human-readable translation readily available for an XML document would
go a long way toward reducing the onus of working with it.  And the
necessity of good documentation is not mitigated by a choice of format.  
Also, your point about understanding what the data represents is very
well taken and something everyone needs to keep close to their heart
when working with datasets they did not generate.  That said, I think
that any of these formats being discussed /can/ be properly documented.  
And I don't think the need for documentation should drive choice of data
format, assuming there is a choice to be made.  The tenor of the
discussion almost makes me wonder if the whole reason that XML is in
such wide use has nothing to do with it being a standard _data_ format
and everything to do with it (theoretically) having a standard
_documentation_ format.  Does that mean that if we all just agreed on a
standard way of documenting JSON that XML could go away tomorrow?

    If so, I'm in.

Speak soon,
Chris


Dan Brickley
From: Dan Brickley <dan...@danbri.org>
Date: Mon, 24 Aug 2009 19:20:28 +0200
Local: Mon, Aug 24 2009 1:20 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

On Mon, Aug 24, 2009 at 6:35 PM, Eric Mill<e...@sunlightfoundation.com> wrote:

>>> Just, nobody bother with DTDs.  They are a waste of everyone's time and energy.

>> What conventions do you recommend for documentation of such data?

> Have DTDs ever been sufficient documentation for someone to learn how
> to use an XML document?

Not in my experience. I wasn't suggesting DTDs did the job (let alone
well), just that any documentation is better than no documentation, so
I was curious what you'd rather see instead.

>           I would always prefer a human-written web
> page or document that describes what the dataset is and what fields to
> expect.  I don't think you can get around that requirement, and I'd
> much rather see resources go into documentation meant for humans than
> computers.

Yup (although having a common data model like RDF's reduces the cost
of missing per-schema machine docs).

Personally I find DTDs and schemas hard to read, and am always happy
when I stumble across example instances.

The experiment at http://examplotron.org/ is somewhat interesting in
that direction - they start with instances and try to
turn them into schemas, with a few additional decorations. I'm not
advocating for it here, but it is a cute example of a very
minimalistic XML schema language...

cheers,

Dan


Greg Elin
From: Greg Elin <wiredb...@gmail.com>
Date: Mon, 24 Aug 2009 14:00:23 -0400
Local: Mon, Aug 24 2009 2:00 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

Like everyone else, I'm finding this an interesting discussion. Michael
Driscoll's article is excellent.

If we substituted "SGML" for "XML" we'd probably get a good sense that
whatever is too cumbersome is eventually replaced--or
overlaid--by something more streamlined. So XML is likely to be in the
stack for a long time (just like C or even Fortran), but is not likely to be
used where agility is required.

CSV has its place, too, but, as Nathan stated, CSV has a real problem with any
data that is not simple and strictly tabular. The world is becoming more
name-value-pair oriented, which is why I think we see a rise in JSON
use.  Also, JSON and XML are already used for configuration information,
even passing functionality. I haven't seen that done in CSV.

Driscoll's article specifically addresses how XML threatens BIG data. To me,
that means data that exceeds the 64K rows of a spreadsheet. Where data is a
few hundred or just a few thousand rows, CSV often makes very good sense.

I do want to take one point up with Driscoll regarding XML and big data. The
community has long since learned how to handle a scale of information bigger
than our containers, e.g., RAM paging, TCP packets, database shards, etc.
XML, as we handle it today, lacks a built-in type of automated "shard" to break up
big XML into smaller pieces. The ability to produce and consume XML in
shard-like chunks would dramatically reduce the size-related problems.
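
On the consuming side, chunk-at-a-time handling can be approximated today
with a streaming parser. Below is a minimal sketch using Python's
`xml.etree.ElementTree.iterparse`; the `<rows>`/`<row>` layout and the
`iter_records` helper are assumptions made up for the example, and the
in-memory buffer stands in for a large file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag):
    """Yield one dict per <tag> element without keeping its contents around."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield {child.tag: child.text for child in elem}
            elem.clear()  # drop the element's children once consumed


# Stand-in for a very large file; any binary file object works the same way.
big_xml = io.BytesIO(
    b"<rows>"
    + b"".join(b"<row><id>%d</id><amount>%d</amount></row>" % (i, i * 10) for i in range(1000))
    + b"</rows>"
)

total = sum(int(rec["amount"]) for rec in iter_records(big_xml, "row"))
print(total)  # 4995000
```

The emptied `<row>` elements stay attached to the root, so for truly huge
inputs the root's children would also need to be discarded periodically.
This only addresses reading; it does not give XML a way to be published in
independent pieces.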

And while we are discussing ponies, I sure wish the lazy web would build a
specialized, easy to use data-browser already...

Greg Elin
http://gregelin.com
g...@fotonotes.net
http://twitter.com/gregelin
skype: fotonotes
aim: wiredbike
DC: 202-713-WOOT
cell: 917-304-3488

For the immediate future, it will be with us an important as common tool

XML, in many instances, is simply too complex. When it is used with a deft
touch, it is a powerful lingua franca. And for the


Webb Sprague
From: Webb Sprague <webb.spra...@gmail.com>
Date: Mon, 24 Aug 2009 11:22:20 -0700
Local: Mon, Aug 24 2009 2:22 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data
This is an interesting discussion!  I have a couple of points.

1.  We should remember that a hugely important group of downstream consumers of
this type of govt data will be people who write scripts to aggregate,
visualize, and screen the data.  These will probably be written by a
planner who knows some PHP, and probably not a comp sci graduate.  To
enable this, simple data formats are important -- hence my earlier
point about hours to parse simple text like CSV versus days/ weeks to
work with XML.  Also a converter to human readable HTML won't help
with this pipeline at all.

2.  Someone mentioned that we don't need to worry about the technical
ability of office workers, since IT specialists are familiar with XML.
I think that is wrong -- if we can build open data into day to day
workflows by non IT people, it stands a better chance of actually
happening.  If sharing data requires a special budget and outside
labor and is a pain in the A**, it will get dropped at the first
excuse (ie a budget).

3.  Standards discussions in the abstract which address data in the
abstract tend toward ungrounded complexity. They attempt to encompass all
possible data, yet usually don't work very well for any particular
data (witness, ahem, XML; remember that HTML and HTTP were developed
ad hoc by programmers in the beginning).  Perhaps we should split this
discussion into the various types of government data (tabular, image,
full database, etc) and try to solve those particular problems.  (And
I'll bet that CSV + metadata would take care of 50% of govt data quite
well, and get us rolling in a big way.)

4.  Part of our challenge is to create an audience -- once this has
happened, they will start demanding data.  I think the most important
thing to do is to create that data pipe and feedback loop.  If it
takes three years to develop the perfect govt schema of everything and
to get a few agencies to start using it, versus a few months to get my
local police dept to put their tables on line and have it become part
of the civic discussion because of some scripter/ planner making live
graphs, then I would WAY prefer the latter.  ( I don't want to set up
a false dichotomy here and claim that it is one or the other, I just
want to explicate the poles.)

5.  Someone stated in this discussion something to the effect that
"because XML is verbose, it is self documenting" -- I think that is a
fallacy.  I have wasted hours futzing with XML full of "column begin" and
"column end" tags -- quite verbose, quite useless.
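
To give a sense of scale for point 1, here is the sort of short script a
planner might write against a plain CSV table. It is a sketch in Python
using only the standard library; the incident data, the column names and
the `count_by` helper are invented for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical extract a local police department might publish.
INCIDENTS_CSV = """date,precinct,category
2009-08-01,North,burglary
2009-08-01,South,vandalism
2009-08-02,North,vandalism
2009-08-03,North,burglary
"""


def count_by(csv_text, column):
    """Count rows per distinct value of one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


by_precinct = count_by(INCIDENTS_CSV, "precinct")
for precinct, n in by_precinct.most_common():
    print(f"{precinct}: {n}")
```

Reading from a real file only changes the first line of `count_by` to wrap
an opened file instead of a string.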

Anyway, hope my brain dump is useful to someone.
W


Matt Brennan
From: Matt Brennan <matty.bren...@gmail.com>
Date: Mon, 24 Aug 2009 15:49:05 -0400
Local: Mon, Aug 24 2009 3:49 pm
Subject: Re: [sunlightlabs] Re: How XML Threatens Big Data

re #2

Data from everyday spreadsheets is never going to conform to a standard; the
structured data we're interested in is going to come out of a database.

And that database is probably proprietary, and it's probably going to
involve a contractor changing the system to get it out.  Especially if we're
going to be picky about what format.
In most cases, sharing government data is going to require some budget,
outside labor, and be a pain in the a**. If it didn't, it would have
happened already. It's our job to convince the government that it's worth
it.

~M

On Mon, Aug 24, 2009 at 2:22 PM, Webb Sprague <webb.spra...@gmail.com>wrote:

--
~
Matt Brennan
