How to create a CSV dataset

Kevin McArthur

Jul 11, 2014, 12:24:49 PM

to opend...@googlegroups.com

Hi Open Data folks,

I'm wondering if anyone would be willing to put together a short guide,
aimed at data publishing agencies, on how to create CSV datasets.

I know many of us take the CSV format totally for granted, but I'm
increasingly finding CSV datasets whose publishers don't understand the
field/record/header/table relationship. I'll lightly pick on Victoria
here, but only because they're doing an awesome job in actually
publishing data. Take the Councillor expense datasets they recently
published, e.g.
http://www.victoria.ca/assets/City~Hall/Open~Data/C%20Thorton%20Joe%202013.csv
While well-meaning, they clearly don't understand the formatting
expected of a CSV file. Each Councillor is in a separate data file, none
of which is machine readable due to offset multi-line headers. We're
seeing this type of thing all over the place in different data catalogs
and from a ton of different issuing agencies -- so it's not just
Victoria. I think it's a lack of data format education, and that perhaps
as a tech-focused group we've taken the formats for granted.
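
To illustrate why offset multi-line headers get in the way, here is a minimal Python sketch. The sample text is hypothetical (it is not the contents of the Victoria file), and `find_header` is a helper invented for this example. It shows the bespoke searching a consumer has to do before the rows become usable as data.

```python
import csv
import io

# Hypothetical "document style" export: title lines, a blank row,
# then a header that is shifted one column to the right.
messy = (
    "Councillor Expenses 2013,,,\n"
    "Prepared by Finance,,,\n"
    ",,,\n"
    ",Date,Description,Amount\n"
    ",2013-01-15,Taxi,23.00\n"
    ",2013-02-03,Conference fee,150.00\n"
)

def find_header(rows, expected):
    """Return (row index, column offset) of the first row that contains
    the expected header names as a contiguous run of cells."""
    width = len(expected)
    for index, row in enumerate(rows):
        cells = [cell.strip() for cell in row]
        for offset in range(len(cells) - width + 1):
            if cells[offset:offset + width] == expected:
                return index, offset
    raise ValueError("header row not found")

rows = list(csv.reader(io.StringIO(messy)))
expected = ["Date", "Description", "Amount"]
header_index, offset = find_header(rows, expected)

# Keep only the data rows below the header, aligned to the header columns.
clean = [
    dict(zip(expected, row[offset:offset + len(expected)]))
    for row in rows[header_index + 1:]
    if any(cell.strip() for cell in row)
]
print(clean)
```

A file that starts with a single header row in the first column needs none of this: `csv.DictReader` can read it directly.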

In these situations, I'd like a guide I could send out that explains how
to format CSV data.

It would explain what a "field", "row", "table", "header", etc. are in
plain language, how to quote fields, standards for dates and money
values, and how to handle "null" values. Data normalization and nth
normal form would also be a good addition for referenced data sets.
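
As a starting point for such a guide, here is a minimal Python sketch of one well-formed table. The records and column names are made up, and the conventions are assumptions for illustration rather than an agreed standard: one header row, one record per row, ISO 8601 dates, money as a plain decimal with the currency in the column name, and an empty field for a missing ("null") value.

```python
import csv
import io

# Hypothetical expense records, invented for this example.
fieldnames = ["councillor", "date", "description", "amount_cad"]
records = [
    {"councillor": "Doe, Jane", "date": "2013-03-14",
     "description": 'Conference "Open Data Day"', "amount_cad": "125.50"},
    {"councillor": "Smith, John", "date": "2013-04-02",
     "description": "Parking", "amount_cad": None},  # amount unknown
]

def to_csv(rows):
    """Write rows as CSV text with a single header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\r\n")
    writer.writeheader()
    for row in rows:
        # Represent a missing value as an empty field, not "N/A" or "-".
        writer.writerow({key: "" if value is None else value
                         for key, value in row.items()})
    return buffer.getvalue()

csv_text = to_csv(records)
print(csv_text)
```

The writer quotes only the fields that need it: the names containing a comma are wrapped in double quotes, and the quotes inside the description are doubled.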

I'd write the guide myself, but I think I'm too close to the programming
for the language to translate well. Is anyone else equipped to translate
these concepts into everyday language?

--

Kevin

James McKinney

Jul 11, 2014, 12:29:27 PM

to opend...@googlegroups.com

The following is very technical, but an eventual guide can at least use it as a reference to define what a column, row, table, etc. are: http://w3c.github.io/csvw/syntax/index.html#core-tabular-data-model Governments may be more convinced to reformat their data to meet tabular CSV formatting standards when they see a W3C logo attached to them.

James

> --
> You received this message because you are subscribed to the Google Groups "OpenDataBC" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to opendatabc+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Gerry Tychon

Jul 11, 2014, 12:49:36 PM

to opend...@googlegroups.com

Kevin ...

I have seen this forever. I think part of the issue is the CSV files are
exported from a spreadsheet (mostly, of course, Excel) and the
information is put into the spreadsheet from the viewpoint of creating a
document -- that is instead of using a word processor they use a
spreadsheet.

It really boils down to data management. When folks are either creating
or acquiring the data they need to think about how the data will be
shared as "data" and not in a presentation form. And even when the data
is in pretty good CSV format, metadata is missing.

I know everyone reading this is aware of the issues. I have thought for
some time that open data could be a catalyst to much better data
management within organizations. Still a long ways to go.

... gerry

Greg Lawrance

Jul 11, 2014, 2:58:07 PM

to opend...@googlegroups.com

I thought something like this (a plain English best practices guide) might exist in the http://opendatahandbook.org/pdf/OpenDataHandbook.pdf - unfortunately it doesn't. It would make a great appendix.

Perhaps this could be a deliverable out of http://csvconf.com/ which launches in only 4 days!

Some other exciting things in the CSV space ...

This is a cool Python toolkit that I have just started to use to find issues with CSV files and to efficiently get data out of them.

CSVKit http://github.com/onyxfish/csvkit
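
For a sense of what "finding issues" can mean, here is a small sketch that uses only Python's standard library. It is not csvkit's API; `find_ragged_rows` and the sample text are hypothetical. It flags rows whose field count differs from the header's.

```python
import csv
import io

def find_ragged_rows(text):
    """Return the header width and a list of (line number, field count)
    for every row whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            problems.append((reader.line_num, len(row)))
    return len(header), problems

# Hypothetical sample: line 3 is missing a field, line 4 has an extra one.
sample = (
    "id,name,amount\n"
    "1,Taxi,23.00\n"
    "2,Lunch\n"
    "3,Hotel,200.00,extra\n"
)
width, problems = find_ragged_rows(sample)
print(width, problems)
```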

Also, once we actually have well-structured CSV files, there are efforts underway at the W3C to make them more useful:

W3C CSV on the Web Working Group - http://www.w3.org/2013/05/lcsv-charter.html

Paul Ramsey

Jul 11, 2014, 3:09:58 PM

to opend...@googlegroups.com

I wish one of the rules of CSV was “don’t use commas” :)

I was reduced to using Text::CSV [1] to read a file yesterday, because of multiple levels of escaping (“oops, it has a comma, better wrap it in “””, “ooops, there’s a “ in there, better escape that too!”, “oops, have to escape my escape character!”)
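
For comparison, here is a minimal sketch of the same round trip with Python's standard `csv` module, on made-up field values. In its default dialect, a field containing a comma or a double quote is wrapped in double quotes and any embedded double quote is doubled, so no separate escape character is needed.

```python
import csv
import io

# Made-up field values covering the awkward cases.
tricky = ["plain", "has, comma", 'has "quote"', 'has, both "of" them']

# Write one record; the writer decides which fields need quoting.
buffer = io.StringIO(newline="")
csv.writer(buffer, lineterminator="\n").writerow(tricky)
line = buffer.getvalue()
print(line, end="")

# Read it back; the reader undoes the quoting.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed == tricky)
```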

P.

[1] Yes, I use perl, don’t read anything into that, haters.

--
Paul Ramsey
http://cleverelephant.ca

http://postgis.net
