Parsing a CSV correctly

Edmondo Porcu

May 11, 2012, 2:50:28 AM
to scala-user
Dear all,
I am trying to parse a CSV (comma-separated) file in Scala, and I need
a way to handle commas inside quotation marks: they must not be
treated as field separators.

What strategy would you suggest?

Thank you for your help
Edmondo

Guillaume Yziquel

May 11, 2012, 3:09:06 AM
to Edmondo Porcu, scala-user
On Friday 11 May 2012 at 08:50:28 (+0200), Edmondo Porcu wrote:
> Dear all,

Hi.

> I am trying to parse a CSV comma separated file in Scala, and I need
> a way to handle commas inside quotation marks : they have to be
> ignored
>
> What strategy would you suggest?

import org.apache.commons.csv._

> Thank you for your help
> Edmondo

--
Guillaume Yziquel
Crossing-Tech
Parc Scientifique EPFL

Luke Vilnis

May 11, 2012, 3:48:46 AM
to Guillaume Yziquel, Edmondo Porcu, scala-user
Parser combinators! Hooray! Extend RegexParsers, add an implicit conversion from Char to accept(_), and an implicit conversion from String to regex(_.r) and brace yourself for some quotation marks...

def quoteDelimitedString = '"' ~> ("""[^"]""" | """\\.""").* <~ '"'
def csv = quoteDelimitedString | comma | newline



Daniel Sobral

May 11, 2012, 4:55:37 PM
to Luke Vilnis, Guillaume Yziquel, Edmondo Porcu, scala-user
I suggest Apache Commons as well. The next question is how to handle
escaped quotes inside quotes.
--
Daniel C. Sobral

I travel to the future all the time.

Luke Vilnis

May 11, 2012, 4:58:52 PM
to Daniel Sobral, Guillaume Yziquel, Edmondo Porcu, scala-user
My code sample does handle escaped quotes inside quotes

Daniel Sobral

May 11, 2012, 5:03:48 PM
to Luke Vilnis, Guillaume Yziquel, Edmondo Porcu, scala-user
On Fri, May 11, 2012 at 5:58 PM, Luke Vilnis <lvi...@gmail.com> wrote:
> My code sample does handle escaped quotes inside quotes

It does? I saw anything-but-double-quotes or double-backslash-and-dot.
In CSV, a quote inside a quoted field is escaped by doubling it ("").

Luke Vilnis

May 11, 2012, 5:07:42 PM
to Daniel Sobral, Guillaume Yziquel, Edmondo Porcu, scala-user
Ooooh, I didn't realize that. I thought escaped quotes were \". (The snippet above was not written for CSV; it came from an unrelated parser project.) I think just adding a | "\"\"" alternative to the parser should be enough, though? But I definitely take your point that it's usually better to use a real CSV parsing library: we had these exact same issues at my job a few years back, and it was glorious when we finally switched to a real library.
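
For readers who want to see the two rules discussed so far in one place (separators inside quotes are literal, and a doubled quote inside a quoted field stands for one quote), here is a minimal hand-written sketch in plain Scala. The object name CsvLine and its split method are hypothetical, made up for illustration; this is not the parser-combinator snippet above and not the API of any library. It assumes a record never spans more than one line.

```scala
// Hypothetical helper for illustration only; not part of any library.
object CsvLine {
  /** Splits a single CSV record into fields.
    * Assumes the record does not contain embedded line breaks.
    */
  def split(line: String, sep: Char = ',', quote: Char = '"'): Vector[String] = {
    val fields = Vector.newBuilder[String]
    val current = new StringBuilder
    var inQuotes = false
    var i = 0
    while (i < line.length) {
      val c = line.charAt(i)
      if (inQuotes) {
        if (c == quote) {
          if (i + 1 < line.length && line.charAt(i + 1) == quote) {
            current.append(quote) // doubled quote: one literal quote
            i += 1                // skip the second quote of the pair
          } else {
            inQuotes = false      // closing quote
          }
        } else {
          current.append(c)       // separators are literal inside quotes
        }
      } else if (c == quote) {
        inQuotes = true           // opening quote
      } else if (c == sep) {
        fields += current.toString // end of field
        current.clear()
      } else {
        current.append(c)
      }
      i += 1
    }
    fields += current.toString     // last field has no trailing separator
    fields.result()
  }
}
```

For example, CsvLine.split("a,\"b,c\",d") keeps the comma inside the quoted middle field. A sketch like this ignores malformed input (such as an unterminated quote), which is one reason the library suggestions in this thread are usually the safer route.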

Alex Cruise

May 11, 2012, 5:27:31 PM
to Edmondo Porcu, scala-user
On Thu, May 10, 2012 at 11:50 PM, Edmondo Porcu <edmond...@gmail.com> wrote:
> I am trying to parse a CSV comma separated file in Scala,  and I need
> a way to handle commas inside quotation marks : they have to be
> ignored
>
> What strategy would you suggest?

I've gotten good results from this one, which supports both delimited (including CSV) and fixed-width files. http://jsapar.tigris.org/Introduction.html

Parsing CSV yourself is a terrible, terrible idea.  It seems very simple, but there are so many standards to choose from! :)

-0xe1a

Eduardo Pareja Tobes

May 11, 2012, 5:37:28 PM
to Edmondo Porcu, scala-user
I've used http://opencsv.sourceforge.net for this

Eduardo Pareja Tobes
oh no sequences! <- I work there

Matthew Pocock

May 11, 2012, 6:45:14 PM
to Edmondo Porcu, scala-user
You may get some mileage out of a library I wrote for this:


It's highly configurable, and you can either use it to build a data-model, hook into your own model, or interact directly with the event stream.

Matthew
--
Dr Matthew Pocock
Integrative Bioinformatics Group, School of Computing Science, Newcastle University
skype: matthew.pocock
tel: (0191) 2566550

Tom Switzer

May 11, 2012, 6:47:39 PM
to Alex Cruise, Edmondo Porcu, scala-user
I find the exact opposite. I must parse the CSV myself, because every single one I get is slightly different. I've tried several libraries and none were flexible enough to cover all my use cases, though I haven't tried the one you suggested.

Currently, I have a GenericSeparatedValues trait that extends RegexParsers. I then usually specialize it for the various "dialects" of CSV I need to support. However, the combinator parsers tend to load everything into RAM and do a lot of subSequence calls, so getting them to parse CSV files that are hundreds of MB in size required some finagling.
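
One possible way to get both per-dialect configuration and bounded memory use, sketched here without parser combinators, is to pull characters from an Iterator[Char] and emit one record at a time. Everything below is an assumption made for illustration: Dialect and StreamingCsv are hypothetical names, not the GenericSeparatedValues trait described above, and the sketch simply drops carriage returns outside quotes.

```scala
// Hypothetical names throughout; none of this comes from an existing library.
final case class Dialect(separator: Char = ',', quote: Char = '"')

object StreamingCsv {
  /** Lazily yields one record at a time, so only the current record is held in memory. */
  def records(chars: Iterator[Char], d: Dialect = Dialect()): Iterator[Vector[String]] =
    new Iterator[Vector[String]] {
      private val in = chars.buffered // buffered gives one-character lookahead via head

      def hasNext: Boolean = in.hasNext

      def next(): Vector[String] = {
        val fields = Vector.newBuilder[String]
        val current = new StringBuilder
        var inQuotes = false
        var done = false
        while (!done && in.hasNext) {
          val c = in.next()
          if (inQuotes) {
            if (c == d.quote) {
              if (in.hasNext && in.head == d.quote) {
                in.next()               // doubled quote: consume the second one
                current.append(d.quote) // and keep a single literal quote
              } else {
                inQuotes = false        // closing quote
              }
            } else {
              current.append(c)         // separators and newlines are literal here
            }
          } else if (c == d.quote) {
            inQuotes = true
          } else if (c == d.separator) {
            fields += current.toString
            current.clear()
          } else if (c == '\n') {
            done = true                 // an unquoted newline ends the record
          } else if (c != '\r') {
            current.append(c)           // simplification: drop '\r' outside quotes
          }
        }
        fields += current.toString
        fields.result()
      }
    }
}
```

Since scala.io.Source is itself an Iterator[Char], something like StreamingCsv.records(scala.io.Source.fromFile(path), Dialect(separator = '\t')) should read a tab-separated file record by record, where path is a placeholder for your own file name.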

Edmondo Porcu

May 14, 2012, 4:01:21 AM
to Tom Switzer, Alex Cruise, scala-user
Commons.csv is not available on Maven :( What alternatives do you propose?


Guillaume Yziquel

May 14, 2012, 4:18:49 AM
to Edmondo Porcu, Tom Switzer, Alex Cruise, scala-user
On Monday 14 May 2012 at 10:01:21 (+0200), Edmondo Porcu wrote:
> Commons.csv is not available on Maven :( What alternatives do you propose?

True. But you do have an OSGi bundle, which should be just as useful.

http://mvnrepository.com/artifact/org.apache.servicemix.bundles/org.apache.servicemix.bundles.commons-csv/1.0-r706900_3