I built a simple CSV parsing library[1] last weekend that I'd like to
show you. It follows RFC 4180[2] fairly closely, but it allows any
character to be used as the separator or the quote mark. It would be
great if someone could take the time to read the code. I would like to know:
a) Can performance still be improved?
b) Is it idiomatically written?
c) What should an idiomatic CSV parsing API look like in Clojure?
Currently there is only one public function, 'parse' (like
clojure.xml).
The end of the file contains a few usage examples.
happy hacking!
/Jonas
[1] http://gist.github.com/414023
[2] http://tools.ietf.org/html/rfc4180
Thanks,
Jonas
Thanks for the implementation, I'm very encouraged that you followed
the RFC (I've seen lots of implementations that haven't).
I took a quick look at both yours and clojure-csv [1]. I'm not using
the 1.2 snapshots so I wasn't able to try out your implementation, but
I did notice that clojure-csv is lax about invalidly formatted files:
if a quoted field is still open when the file ends, it does not signal
an error. I think I recognize the same behavior in cljcsv as well
(though, as I said, I could not try it). It would be nice to have at
least an option that makes the parser detect and report an
unterminated quoted field.
Best Regards,
Kyle
[1] http://github.com/davidsantiago/clojure-csv
> On Wed, May 26, 2010 at 6:40 AM, Jonas Enlund <jonas....@gmail.com> wrote:
>> Hi there
--
------------------------------------------------------------------------------
kyle....@gmail.com http://asymmetrical-view.com/
------------------------------------------------------------------------------
1) quotes can appear anywhere, not only as the first and last
character in a cell.
2) records can be of different length.
3) any character can be used as quote or separator.
4) The problem you raised, i.e., that the file might end in a "quoted"
state and still be read without exceptions.
I'll consider adding a :strict flag which would throw an exception for
points 2 and 4. However, I don't want to sacrifice performance, since I
consider that to be the most important feature of any CSV reading
library (people often have CSV files of a gigabyte or more).
Jonas
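
To make point 4 and the proposed :strict flag concrete, here is a
minimal sketch. It is not the cljcsv code: the function name parse-csv
and the options :separator, :quote-char and :strict are assumptions
made up for this illustration. The strictness check runs only once, at
end of input, which fits the expectation that it should cost little.

```clojure
(defn parse-csv
  "Hypothetical sketch, not the cljcsv API. Parses the string s into a
  vector of records (vectors of strings). With :strict true, throws if
  the input ends while a quoted field is still open."
  [s & {:keys [separator quote-char strict]
        :or {separator \, quote-char \" strict false}}]
  (loop [chars   (seq s)
         field   (StringBuilder.)
         record  []
         records []
         quoted? false]
    (if-let [c (first chars)]
      (cond
        ;; Inside quotes, a doubled quote is one literal quote character.
        (and quoted? (= c quote-char) (= (second chars) quote-char))
        (recur (nnext chars) (.append field c) record records true)

        ;; A lone quote toggles the quoted state (quotes may appear anywhere).
        (= c quote-char)
        (recur (next chars) field record records (not quoted?))

        ;; An unquoted separator ends the current field.
        (and (not quoted?) (= c separator))
        (recur (next chars) (StringBuilder.)
               (conj record (str field)) records false)

        ;; An unquoted newline ends the current record.
        (and (not quoted?) (= c \newline))
        (recur (next chars) (StringBuilder.) []
               (conj records (conj record (str field))) false)

        ;; Drop carriage returns outside quotes so CRLF input works.
        (and (not quoted?) (= c \return))
        (recur (next chars) field record records false)

        ;; Anything else is field content.
        :else
        (recur (next chars) (.append field c) record records quoted?))
      ;; End of input: this is the only place the strict check runs.
      (cond
        (and quoted? strict)
        (throw (IllegalArgumentException.
                "input ended inside a quoted field"))

        ;; Flush a last record that has no trailing newline.
        (or (pos? (.length field)) (seq record))
        (conj records (conj record (str field)))

        :else records))))

;; Lenient by default, strict on request:
;; (parse-csv "a,\"b")              returns the partial record
;; (parse-csv "a,\"b" :strict true) throws IllegalArgumentException
```

Rejecting records of different lengths (point 2) would need a check
per record rather than one check at the end, so it is left out here.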
I appreciate your needs. Mine differ a bit: in some circumstances I
want the format to be enforced strictly. Ideally, for my own use cases,
I'd like to be able to detect #4 and reject the entire file when it
occurs, while still allowing #2. I'd imagine adding the additional
logic will have some impact on performance.
Regarding performance, have you considered adding a benchmarking
harness to the project?
Thank you for publishing the library.
Regards,
Kyle
Checking #4 wouldn't have a performance hit, so I'll probably add
that. I have to think about the other points.
>
> Regarding performance, have you considered adding a benchmarking
> harness to the project?
Yes, I have thought about it. It would be fun to compare cljcsv,
clojure-csv, OpenCSV and SuperCSV (are there others?). I have looked
into it a bit, and if my measurements are correct, cljcsv is faster
than clojure-csv and within 10% of OpenCSV.
>
> Thank you for publishing the library.
>
Thanks for showing interest!
Jonas
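
A benchmarking harness of the kind discussed above could start out as
small as the sketch below. The names bench, sample-csv and naive-parse
are hypothetical, and naive-parse is only a stand-in baseline that
ignores quoting. A real comparison would pass the parse functions of
cljcsv, clojure-csv, OpenCSV and SuperCSV instead; those calls are not
shown because their exact signatures are not given in this thread.

```clojure
(require '[clojure.string :as string])

(defn bench
  "Calls (f) warmup times so the JIT can settle, then runs more times,
  and returns the mean wall-clock time per call in milliseconds."
  [f & {:keys [warmup runs] :or {warmup 5 runs 10}}]
  (dotimes [_ warmup] (f))
  (let [start (System/nanoTime)]
    (dotimes [_ runs] (f))
    (/ (- (System/nanoTime) start) 1e6 runs)))

(defn sample-csv
  "Builds an n-line, three-column CSV string to use as test input."
  [n]
  (apply str (for [i (range n)] (str i ",name" i ",value" i "\n"))))

(defn naive-parse
  "Baseline only: splits on newlines and commas and ignores quoting."
  [data]
  (doall (map #(string/split % #",") (string/split-lines data))))

;; Each candidate parser is wrapped in a no-argument function. Lazy
;; results must be forced (naive-parse uses doall), otherwise only the
;; creation of the lazy seq is timed.
(let [data (sample-csv 10000)]
  (println "naive baseline, ms per run:" (bench #(naive-parse data))))
```

Wall-clock means are coarse, so differences of a few percent between
parsers should be read with care.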
Would it be possible to expose the output data as a "rel" (a set of
maps)? Or, possibly better, a "relation" type like the one provided by
Erik: http://gist.github.com/415538 ?
My .csv files are usually big "unions" of all the data I could gather,
and reading them in is just the first step in a data processing flow.
What follows is a series of grouping and filtering operations that
extract the data of interest based on some custom criteria. This
requires fairly good support for indexing (fast selects of rows with
particular values, a range of values, values matching a pattern,
etc.) and for relational operations. I think it formally falls under
OLAP, but I've never really bothered learning about it.
Thanks for your work,
-Andrzej
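
As a rough illustration of the "rel" idea, the sketch below turns
parsed rows into a set of maps and then uses clojure.set for an indexed
lookup and a filter. The function rows->rel and the sample data are
hypothetical; the sketch assumes the parser returns a sequence of
string vectors whose first row is the header.

```clojure
(require '[clojure.set :as set])

(defn rows->rel
  "Treats the first row as the header and returns a set of maps, one
  per remaining row, keyed by the header names as keywords."
  [[header & rows]]
  (let [ks (map keyword header)]
    (set (map #(zipmap ks %) rows))))

(def people
  (rows->rel [["name" "city"  "age"]
              ["Ann"  "Oslo"  "31"]
              ["Bob"  "Turku" "45"]
              ["Cid"  "Oslo"  "27"]]))

;; set/index groups the rel by the given keys, so rows with a
;; particular value can be fetched with a single map lookup.
(def by-city (set/index people [:city]))
(get by-city {:city "Oslo"})   ; the two rows whose :city is "Oslo"

;; set/select filters the rel with an arbitrary predicate.
(set/select #(> (Integer/parseInt (:age %)) 30) people)
```

Because a rel is a set, identical rows collapse into one, and all
values stay strings until they are converted explicitly.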
I'll take a look at how Python does things. Dialects sound like a good idea.
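
For reference, Python's csv module bundles settings such as the
delimiter and the quote character into named dialects like "excel" and
"excel-tab". One way this could look in Clojure is sketched below; the
names dialects and resolve-options and the option keys are assumptions
for illustration, not part of the library.

```clojure
;; Named bundles of parser options, modelled on Python's "excel" and
;; "excel-tab" dialects. Hypothetical names, not the library's API.
(def dialects
  {:excel     {:separator \,   :quote-char \"}
   :excel-tab {:separator \tab :quote-char \"}})

(defn resolve-options
  "Looks up a named dialect and lets explicit keyword options override
  individual settings."
  [dialect & {:as overrides}]
  (merge (get dialects dialect) overrides))

(resolve-options :excel-tab)
(resolve-options :excel :separator \|)
```

Plain maps keep a dialect easy to extend: a new one is just another
entry, and a one-off tweak is an override at the call site.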
I have now added the lib to Clojars, so it should be ready for use with
either Leiningen or Maven.