A simple csv parsing library

Jonas Enlund

May 25, 2010, 11:40:36 PM
to Clojure
Hi there

I built a simple CSV parsing library[1] last weekend which I want to
show you guys. It follows RFC 4180[2] pretty closely, but it allows
any character as separator and quote mark. It would be great if
someone would take the time to read the code. I would like to know:

a) Can performance still be improved?
b) Is it idiomatically written?
c) What should an idiomatic csv parsing API look like in Clojure?
Currently there is only one public function, 'parse' (like
clojure.xml).

The end of the file contains a few usage examples.

happy hacking!
/Jonas

[1] http://gist.github.com/414023
[2] http://tools.ietf.org/html/rfc4180

Jonas Enlund

Jun 8, 2010, 3:54:20 PM
to Clojure
I've added my work on the csv reader/writer library to github
(http://github.com/jonase/cljcsv). Please let me know if anyone finds
it useful.

Thanks,
Jonas

Kyle R. Burton

Jun 8, 2010, 11:54:13 PM
to clo...@googlegroups.com
> I've added my work on the csv reader/writer library to github
> (http://github.com/jonase/cljcsv). Please let me know if anyone finds
> it useful.

Thanks for the implementation, I'm very encouraged that you followed
the RFC (I've seen lots of implementations that haven't).

I took a quick look at both yours and clojure-csv [1]. I'm not using
the 1.2 snapshots so I wasn't able to try out your implementation, but
I did notice clojure-csv is lax about invalidly formatted files - if a
quoted field ends the file but is not terminated before eof, it does
not signal an error. I think I recognize the same behavior in cljcsv
as well (though as I said I could not try it). It might be nice to at
least have an option which allows an unterminated field to be
recognized.
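
To make the unterminated-field case concrete, here is a minimal sketch of how it could be detected, assuming quotes are escaped by doubling as in RFC 4180. The name `ends-in-quoted-state?` is made up for illustration and is not part of cljcsv or clojure-csv; a real parser would more likely inspect its own quote state when it reaches EOF rather than rescan the input.

```clojure
(defn ends-in-quoted-state?
  "Returns true if the CSV text `s` ends inside a quoted field.
  Hypothetical helper, not part of cljcsv or clojure-csv."
  ([s] (ends-in-quoted-state? s \"))
  ([s quote-char]
   ;; Every quote character toggles the state. A doubled quote inside a
   ;; quoted field toggles twice, so it leaves the state unchanged.
   (reduce (fn [in-quotes? ch]
             (if (= ch quote-char) (not in-quotes?) in-quotes?))
           false
           s)))

;; Properly terminated quoted field.
(ends-in-quoted-state? "a,\"b,c\"\n")

;; The last field is opened but never closed.
(ends-in-quoted-state? "a,\"b,c\n")
```

An option in the reader could then decide whether a true result is ignored or turned into an exception.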

Best Regards,

Kyle

[1] http://github.com/davidsantiago/clojure-csv

--
------------------------------------------------------------------------------
kyle....@gmail.com http://asymmetrical-view.com/
------------------------------------------------------------------------------

Jonas Enlund

Jun 9, 2010, 3:03:00 AM
to clo...@googlegroups.com
When I say that it supports the RFC I mean that the library should be
able to read any file that follows that standard. It might (and it
does) read files that do not follow the standard. Here are some
examples:

1) quotes can appear anywhere, not only as the first and last
character in a cell.
2) records can be of different length.
3) any character can be used as quote or separator.
4) The problem you raised, i.e., that the file might end in a "quoted"
state and still be read without exceptions.

I'll consider adding a :strict flag which would throw an exception for
points 2 and 4. However, I don't want to sacrifice performance since I
consider that to be the most important feature in any csv reading
library (people often have GB+ sized CSV files).
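
A strict check for point 2 could be layered on top of any parser that returns a sequence of records. The sketch below assumes records are vectors of strings; `check-record-lengths` is a hypothetical name, not an existing cljcsv function. It stays lazy, so it should not force a large file into memory.

```clojure
(defn check-record-lengths
  "Lazily passes `records` through unchanged, throwing as soon as a
  record's length differs from the first record's.
  Hypothetical helper, not part of cljcsv."
  [records]
  (let [expected (count (first records))]
    (map-indexed
     (fn [i record]
       (if (= (count record) expected)
         record
         (throw (ex-info "Record length differs from first record"
                         {:index i
                          :expected expected
                          :actual (count record)}))))
     records)))

;; All records have two fields, so they pass through untouched.
(doall (check-record-lengths [["a" "b"] ["1" "2"]]))
```

Because the sequence is lazy, the exception is only thrown when the offending record is actually consumed.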

Jonas

Kyle R. Burton

Jun 9, 2010, 8:47:47 AM
to clo...@googlegroups.com
I appreciate your needs; mine differ a bit. In some circumstances I
want the format enforced strictly. Ideally, for my own use cases, I'd
like to be able to detect #4 or reject the entire file when it occurs,
while still allowing #2. I'd imagine the additional logic will have
some impact on performance.

Regarding performance, have you considered adding a benchmarking
harness to the project?
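
As a rough idea of what such a harness could look like, here is a minimal sketch. The `bench` function and the `naive-parse` stand-in are assumptions for illustration only; in practice the thunk would call whichever parser is being measured, and it has to force any lazy result (e.g. with `doall`) so the work is really done inside the timed region.

```clojure
(require '[clojure.string :as str])

(defn bench
  "Calls (f) `warmup` times untimed, then `runs` times timed, and
  returns the mean wall-clock time per call in milliseconds."
  [f & {:keys [warmup runs] :or {warmup 5 runs 20}}]
  (dotimes [_ warmup] (f))
  (let [start (System/nanoTime)]
    (dotimes [_ runs] (f))
    (/ (- (System/nanoTime) start) 1e6 runs)))

;; Stand-in parser (no quoting support), used only to have something
;; to measure. Replace it with the real parse function.
(defn naive-parse [s]
  (mapv #(str/split % #",") (str/split-lines s)))

(def sample (str/join "\n" (repeat 1000 "a,b,c")))

(println "naive-parse:" (bench #(naive-parse sample)) "ms/call")
```

The warm-up calls are there to give the JIT a chance to compile the code before timing starts; the numbers are still only indicative.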

Thank you for publishing the library.


Regards,

Kyle

Jonas Enlund

Jun 9, 2010, 9:08:54 AM
to clo...@googlegroups.com
On Wed, Jun 9, 2010 at 3:47 PM, Kyle R. Burton <kyle....@gmail.com> wrote:
> I appreciate your needs, mine differ a bit, in some circumstances I
> want strictness for the format.  Ideally for my own use cases I'd like
> to be able to know about or reject the entire file if #4 occurs, but
> allow #2.  I'd imagine adding in the additional logic will have some
> impact on performance.

Checking #4 wouldn't have a performance hit, so I'll probably add
that. I have to think about the other points.

>
> Regarding performance, have you considered adding a benchmarking
> harness to the project?

Yes, I have thought about it. It would be fun to compare cljcsv,
clojure-csv, OpenCSV and SuperCSV (are there others?). I have looked
into it a bit, and if my measurements are correct cljcsv is faster than
clojure-csv and within 10% of OpenCSV.


>
> Thank you for publishing the library.
>

Thanks for showing interest!

Jonas


Andrzej

Jun 9, 2010, 12:27:55 PM
to clo...@googlegroups.com
On Wed, Jun 9, 2010 at 4:54 AM, Jonas Enlund <jonas....@gmail.com> wrote:
> I've added my work on the csv reader/writer library to github
> (http://github.com/jonase/cljcsv). Please let me know if anyone finds
> it useful.

Would it be possible to expose the output data as a "rel" (a set of
maps)? Or, possibly better, a "relation" type like the one provided by
Erik: http://gist.github.com/415538 ?

My .csv files are usually big "unions" of all the data I could gather,
and reading them in is just a first step in a data processing flow.
What follows later is a bunch of grouping/filtering operations that
extract data of interest based on some custom criteria. This requires
a fairly good support for indexing (fast selects of rows with some
particular values, or a range of values, or values matching a pattern,
etc) and relational operations. I think it formally falls under OLAP
but I've never really bothered learning about it.
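
For illustration, here is a minimal sketch of the "rel" idea using only `clojure.set`, assuming the parser hands back a header row followed by data rows as vectors of strings. `rows->rel` and the sample data are made up for the example and are not part of cljcsv.

```clojure
(require '[clojure.set :as set])

(defn rows->rel
  "Turns a header row plus data rows into a set of maps keyed by the
  header names. Hypothetical helper, not part of cljcsv."
  [[header & rows]]
  (let [ks (map keyword header)]
    (set (map #(zipmap ks %) rows))))

(def people
  (rows->rel [["name" "city" "age"]
              ["Ann" "Oslo" "31"]
              ["Bob" "Turku" "45"]
              ["Cid" "Oslo" "27"]]))

;; Filter rows with a predicate.
(set/select #(= "Oslo" (:city %)) people)

;; Build an index once, then look rows up by :city repeatedly.
(def by-city (set/index people [:city]))
(get by-city {:city "Oslo"})
```

Note that a set drops exact duplicate rows, and `set/index` only covers equality lookups; range or pattern queries would need something more.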

Thanks for your work,

-Andrzej

Daniel Werner

Jun 9, 2010, 6:01:17 PM
to Clojure
Jonas,

Thanks for stepping forward and publishing your work. From the short
glance I had at it already, your code seems very low-level (probably
for performance), but sound. The only thing I miss somewhat, compared
to other CSV libraries I've used, is explicit support for "dialects".
While the developer can set the separator, quote and newline chars
herself, it would be nice if there were a number of pre-made maps for
commonly used CSV dialects, e.g. Excel, whose values can just be
passed to the read and write functions.

To give an example, Python's approach is quite simple yet effective.
Their use of class inheritance to define new dialects that are similar
to old ones would translate well to Clojure maps, by assoc'ing or
merge'ing the changes in.

http://docs.python.org/library/csv.html#csv-fmt-params
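
A minimal sketch of how that could translate, assuming the read and write functions accept an options map. The key names (`:separator`, `:quote`, `:line-terminator`) and the `write-record` function are assumptions for illustration and not cljcsv's actual API; the values mirror Python's `excel` and `excel-tab` dialects.

```clojure
(require '[clojure.string :as str])

;; Base dialect: comma-separated, double-quote quoting, CRLF line ends.
(def excel
  {:separator \, :quote \" :line-terminator "\r\n"})

;; Derived dialects only state what differs from the base.
(def excel-tab
  (assoc excel :separator \tab))

(def pipe-unix
  (merge excel {:separator \| :line-terminator "\n"}))

;; Toy writer that takes a dialect map. It does no quoting at all and
;; exists only to show the map being passed in.
(defn write-record
  [fields {:keys [separator line-terminator]}]
  (str (str/join (str separator) fields) line-terminator))

(write-record ["a" "b" "c"] excel-tab)
```

A caller can also tweak a dialect on the spot, for example `(assoc excel :separator \;)`, without the library having to know about it.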

--Daniel

Jonas Enlund

Jun 11, 2010, 8:53:07 AM
to clo...@googlegroups.com

I'll take a look at how Python does things. Dialects sound like a good idea.

I have now added the lib to Clojars, so it should be ready for use with
either Leiningen or Maven.
