Looking for a rock-solid CSV file parser

Ramon F Herrera

Apr 20, 2007, 12:51:20 AM

I am looking for a strong parser to be used with CSV (Comma Separated
Values) files.

I would like to have operations such as counting the number of fields
in each line. No, I can't simply count commas in every line. I need to
make sure that every line is syntactically correct.

What I need is some sort of 'awk' for CSV files. The "strong"
requirement is important as I don't want some wise guy trying to sneak
executable code into my application.

Is there such a thing?

TIA,

-Ramon

ps: is there a formal spec for CSV files?

Ramon F Herrera

Apr 20, 2007, 1:15:29 AM


Now that I think about it, all I really need is a function (or set of
functions) that parses a single CSV line.

I can provide the enclosing loop. :-)

-Ramon
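
To make the request concrete, here is a minimal sketch of a single-line parser in Python. It is not taken from any library mentioned in this thread. The dialect is an assumption: comma separator, optional double-quoted fields, and a literal quote inside a quoted field written twice. The name parse_csv_line is made up for this example, and a quoted field that spans several physical lines is out of scope because the function only ever sees one line.

```python
def parse_csv_line(line, sep=",", quote='"'):
    """Split one CSV record into fields; raise ValueError if it is malformed.

    Assumed dialect (not specified in the thread): fields are separated by
    `sep`, a field may be wrapped in `quote`, and a literal quote inside a
    quoted field is doubled. The caller strips the line ending first.
    """
    fields, buf = [], []
    i, n = 0, len(line)
    while True:
        if i < n and line[i] == quote:
            # Quoted field: consume up to the matching closing quote.
            i += 1
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted field")
                if line[i] == quote:
                    if i + 1 < n and line[i + 1] == quote:
                        buf.append(quote)      # doubled quote -> one literal quote
                        i += 2
                        continue
                    i += 1                     # closing quote
                    break
                buf.append(line[i])
                i += 1
            if i < n and line[i] != sep:
                raise ValueError(f"unexpected character after closing quote at column {i}")
        else:
            # Unquoted field: runs to the next separator; quotes are rejected.
            while i < n and line[i] != sep:
                if line[i] == quote:
                    raise ValueError(f"stray quote at column {i}")
                buf.append(line[i])
                i += 1
        fields.append("".join(buf))
        buf = []
        if i >= n:
            return fields
        i += 1                                 # step over the separator


if __name__ == "__main__":
    sample = 'a,"b,c",d'
    print(sample.count(",") + 1)        # naive comma counting
    print(len(parse_csv_line(sample)))  # fields seen by the parser
```

The two counts differ for the sample line because one comma sits inside a quoted field, which is the reason counting commas is not enough. Rejecting stray quotes outright is a deliberately strict choice; many real-world files would need a more lenient rule.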


Chris F.A. Johnson

Apr 20, 2007, 1:50:00 AM
On 2007-04-20, Ramon F Herrera wrote:
>
> I am looking for a strong parser to be used with CSV (Comma Separated
> Values) files.
>
> I would like to have operations such as counting the number of fields
> in each line. No, I can't simply count commas in every line. I need to
> make sure that every line is syntactically correct.
>
> What I need is some sort of 'awk' for CSV files. The "strong"
> requirement is important as I don't want some wise guy trying to sneak
> executable code into my application.
>
> Is there such a thing?

There are many. I have some shell functions that do the job. The
question is, what job? See below.

> ps: is there a formal spec for CSV files?

No. That's one of the problems. There are many variants (e.g., how
are commas (or other separator) in a field handled?).

I've taken to using CTV (Character-Terminated Values) format, with
escaped terminator characters when embedded in a field. Among other
advantages, the last character of the record can be checked to
determine what separator is used.

--
Chris F.A. Johnson, author | <http://cfaj.freeshell.org>
Shell Scripting Recipes: | My code in this post, if any,
A Problem-Solution Approach | is released under the
2005, Apress | GNU General Public Licence
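
As an illustration of the character-terminated idea described above, here is a small Python sketch. The post does not say which escape character is used or exactly how a record is laid out, so two things are assumptions here: every field, including the last, ends with the terminator, and backslash is the escape character. The function name parse_ctv_record is hypothetical.

```python
def parse_ctv_record(record, escape="\\"):
    """Split a character-terminated record into (terminator, fields).

    Assumptions (not stated in the post): every field, including the last,
    ends with the terminator character, and a terminator or escape character
    that belongs to the data is preceded by `escape`.
    """
    if not record:
        raise ValueError("empty record")
    term = record[-1]          # the final character tells us the terminator
    if term == escape:
        raise ValueError("the terminator cannot be the escape character")
    fields, buf = [], []
    i, n = 0, len(record)
    while i < n:
        ch = record[i]
        if ch == escape:
            # The record never ends in the escape character (checked above),
            # so there is always a following character to take literally.
            buf.append(record[i + 1])
            i += 2
        elif ch == term:
            fields.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(ch)
            i += 1
    if buf:
        # Leftover data means the last terminator was escaped.
        raise ValueError("record does not end with an unescaped terminator")
    return term, fields


if __name__ == "__main__":
    print(parse_ctv_record("aa|11\\|22|cc|"))
    print(parse_ctv_record("x;y;"))
```

Because the terminator is read from the record itself, the same function handles files that use different separators without any configuration.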

James Antill

Apr 20, 2007, 1:04:41 PM
On Thu, 19 Apr 2007 21:51:20 -0700, Ramon F Herrera wrote:

> I am looking for a strong parser to be used with CSV (Comma Separated
> Values) files.
>
> I would like to have operations such as counting the number of fields in
> each line. No, I can't simply count commas in every line. I need to make
> sure that every line is syntactically correct.
>
> What I need is some sort of 'awk' for CSV files. The "strong"
> requirement is important as I don't want some wise guy trying to sneak
> executable code into my application.
>
> Is there such a thing?

A Google Groups search for "csv parser" gives useful information in the
second link:

http://groups.google.com/groups/search?q=csv+parser

--
James Antill -- ja...@and.org
http://www.and.org/and-httpd/ -- $2,000 security guarantee
http://www.and.org/vstr/

John-Paul Stewart

Apr 20, 2007, 3:17:53 PM
Ramon F Herrera wrote:
> On Apr 19, 11:51 pm, Ramon F Herrera <r...@conexus.net> wrote:
>> I am looking for a strong parser to be used with CSV (Comma Separated
>> Values) files.
[snip]

> Now that I think about it, all I really need is a (set of) function
> which parses a single CSV line.

Depending on what you're doing with the data, you might find the PHP
scripting language useful. In particular, I use its fgetcsv() function
(http://www.php.net/manual/en/function.fgetcsv.php) to fetch a row of
CSV values into an array, from which I can process the data in any way I
like. If you're already familiar with PHP scripting, it's quite handy.

(Note that while PHP is often used for website scripting, it is also
quite useful from the command line. On Debian this functionality
requires installing one of the php?-cli packages. YMMV.)

The Natural Philosopher

Apr 21, 2007, 5:57:40 AM

Or simply write your first C program. Jobs like this are really good
projects to teach yourself 'C' on.

Ramon F Herrera

Apr 21, 2007, 2:24:25 PM

I have been writing code (mainly C, but also Java) for a quarter of a
century, Nat the Philo :-). I have decided to take myself out of the
programming chair and hire some fresh kid to do the leg work while I
do the part that requires wisdom and gray hair.

In fact, this is part of a much bigger generalized program that
converts lots of files with all kinds of formats (fixed-width, CSV,
EBCDIC, etc.) into a normalized canonical format which is fed into
Oracle. I even designed a formal description language (with lex+yacc,
but I'll move it to Antlr as soon as I learn it) to describe the file
contents.

I figure that I don't have all the test cases for CSV and might miss
something, so why reinvent the wheel? Ergo: use somebody else's
fine-tuned code for CSV and only CSV.

As you philosophers say: QED.

-Ramon


Gordon Burditt

Apr 21, 2007, 7:19:12 PM
>I am looking for a strong parser to be used with CSV (Comma Separated
>Values) files.
>
>I would like to have operations such as counting the number of fields
>in each line. No, I can't simply count commas in every line. I need to
>make sure that every line is syntactically correct.

Is CSV actually well-defined enough so that "every line is syntactically
correct" is meaningful?

>What I need is some sort of 'awk' for CSV files. The "strong"
>requirement is important as I don't want some wise guy trying to sneak
>executable code into my application.

Here's a way to construct a test case:

Step 1: write a CSV line with a single value, a string, containing
all of the printable characters in the character set you are using
(ASCII?) in character-set-code order with particular attention to
comma, any kind of quote, backslash, space, and any other character
used to quote stuff. Extra credit: include *all* of the characters
with particular attention to carriage return, newline, tab and nul.

Step 2: write a CSV line with multiple numeric values, consisting of
all of the prime numbers between 2 and 100, inclusive and in order.

Step 3: write a CSV line with 10,000 string values, each one of them
consisting of the line from step 1. (Warning: this line will be over
1 meg long if there are more than about 100 printable characters.)

Step 4: write a CSV line with 3 string values, each value consisting of
the lines from steps 1, 2, and 3, respectively.

Step 5: write a CSV line with 4 string values, each value consisting of
the lines from steps 1, 2, 3, and 4, respectively.

Step 6: write a CSV line with 5 string values, each value consisting of
the lines from steps 1, 2, 3, 4, and 5, respectively. (Warning: if the
line from Step 3 is 1 meg, this one will be over 4 meg).

Now test these 6 lines against your parser.

Then test the parser using its own executable as input. It's allowed
to signal errors, but not crash the program.

Test the parser using the output of /dev/random as input. It's allowed
to signal errors, but not crash the program. On systems without /dev/random,
try using something large and encrypted with a solid encryption system.

Now, assuming that the character set is ASCII (character codes 0 -
127), and you're only going for the printable characters, can anyone
claim that there is exactly one correct sequence of characters on
a line resulting from Step 1? For the moment, forget about OS
differences in how lines are ended.
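
The steps above can be sketched in Python. Using the standard csv module as both the writer and the parser under test is an assumption made for this example, as are the helper names. The repeat count for step 3 is a parameter that defaults to far fewer than 10,000, because csv.reader refuses fields longer than csv.field_size_limit() unless that limit is raised. Only printable ASCII is generated; the "extra credit" control characters are left out.

```python
import csv
import io
import random


def to_csv_line(values):
    """Encode one row with csv.writer and return it without the line ending."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(values)
    return buf.getvalue()[:-1]


def build_test_lines(repeats=100):
    """Build the six lines from steps 1-6 as (values, encoded_line) pairs.

    `repeats` stands in for the 10,000 of step 3; it is kept small so that
    no field exceeds the reader's default field size limit.
    """
    printable = "".join(chr(c) for c in range(32, 127))    # printable ASCII
    primes = [n for n in range(2, 101) if all(n % d for d in range(2, n))]
    rows = [[printable], primes]                            # steps 1 and 2
    lines = [to_csv_line(rows[0]), to_csv_line(rows[1])]
    rows.append([lines[0]] * repeats)                       # step 3
    lines.append(to_csv_line(rows[2]))
    for _ in range(3):                                      # steps 4, 5, 6
        rows.append(list(lines))      # one value per line built so far
        lines.append(to_csv_line(rows[-1]))
    return list(zip(rows, lines))


def round_trips(values, line):
    """True if parsing `line` gives back the original values (as strings)."""
    return next(csv.reader([line])) == [str(v) for v in values]


def fuzz_once(rng, size=4096):
    """Feed random bytes to the parser; csv.Error is fine, a crash is not."""
    data = bytes(rng.getrandbits(8) for _ in range(size)).decode("latin-1")
    try:
        for _ in csv.reader(io.StringIO(data, newline="")):
            pass
    except csv.Error:
        return "rejected"
    return "parsed"


if __name__ == "__main__":
    for step, (values, line) in enumerate(build_test_lines(), start=1):
        print(f"step {step}: {len(line)} chars, round trip ok: {round_trips(values, line)}")
    print("fuzz:", fuzz_once(random.Random()))
```

A round trip through one writer and one reader only shows that this pair agrees with itself. It does not answer the closing question of whether there is exactly one correct encoding of the step 1 line; a different CSV implementation may well produce a different, equally defensible line.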

toby

Apr 22, 2007, 9:11:12 AM

Recognisably wisdom from experience :-)

NIH is a pathology.

The Natural Philosopher

Apr 22, 2007, 9:52:48 PM
Well I used to think like that, until I discovered that mostly other
people's wheels have square axles, bent shafts, fall apart..


I must have spent getting on for quarter of a million quid on database
software that ultimately never did what it was supposed to.

I could have written it better for less than that.

>> As you philosophers say: QED.
>>


If you want something done right, do it yourself.

The great advantage of home rolled programs, is you know the guy who
wrote them intimately, and he is always there to fix them for you.

>> -Ramon
>
>

Robert Gamble

Apr 23, 2007, 1:02:58 PM
On Apr 20, 12:51 am, Ramon F Herrera <r...@conexus.net> wrote:
> I am looking for a strong parser to be used with CSV (Comma Separated
> Values) files.
>
> I would like to have operations such as counting the number of fields
> in each line. No, I can't simply count commas in every line. I need to
> make sure that every line is syntactically correct.
>
> What I need is some sort of 'awk' for CSV files. The "strong"
> requirement is important as I don't want some wise guy trying to sneak
> executable code into my application.
>
> Is there such a thing?

Have you checked out libcsv yet? (See my response to your post at
<http://groups.google.com/group/comp.lang.c/browse_frm/thread/a5a3bef8d6b8d057/#>.)
You should be able to use it to write a program to accomplish what you
are trying to do pretty easily.

> ps: is there a formal spec for CSV files?

No, but there is a set of common conventions that the majority of
applications using CSV follow, described at
<http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm>. My CSV library
follows these conventions. There is also an RFC for CSV as a MIME type at
<http://tools.ietf.org/html/rfc4180>, but its description is overly
strict and doesn't reflect traditional conventions, making its
usefulness quite limited.

Robert Gamble
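
For comparison with the strict-versus-conventional distinction above, here is a hedged sketch of a file checker. It uses Python's standard csv module rather than libcsv, and the function name check_csv is made up for this example. The reader's strict mode rejects some malformed quoting, and the field-count check follows the RFC 4180 recommendation that every line have the same number of fields. This is not a full RFC 4180 validator: for instance, the csv module still accepts a quote in the middle of an unquoted field.

```python
import csv
import io


def check_csv(stream, expected_fields=None):
    """Yield (record_number, problem) for each record that looks wrong.

    Malformed quoting surfaces as csv.Error because the reader runs in
    strict mode. Checking stops at the first syntax error, since the reader
    cannot be relied on to resynchronise after one.
    """
    reader = csv.reader(stream, strict=True)
    record_no = 0
    while True:
        record_no += 1
        try:
            row = next(reader)
        except StopIteration:
            return
        except csv.Error as exc:
            yield record_no, f"syntax error: {exc}"
            return
        if expected_fields is None:
            expected_fields = len(row)      # the first record sets the width
        elif len(row) != expected_fields:
            yield record_no, f"{len(row)} fields, expected {expected_fields}"


if __name__ == "__main__":
    sample = 'a,b,c\n1,2\n"u"v,2,3\n'
    for record_no, problem in check_csv(io.StringIO(sample, newline="")):
        print(record_no, problem)
```

Opening the input with newline="" lets the csv module see the original line endings, which matters when a quoted field contains a line break.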

toby

Apr 23, 2007, 8:14:57 PM
On Apr 22, 10:52 pm, The Natural Philosopher <a...@b.c> wrote:
> toby wrote:
> > On Apr 21, 3:24 pm, Ramon F Herrera <r...@conexus.net> wrote:
> >> On Apr 21, 4:57 am, The Natural Philosopher <a...@b.c> wrote:
>
> >>> John-Paul Stewart wrote:
> >>>> Ramon F Herrera wrote:
> >>>>> On Apr 19, 11:51 pm, Ramon F Herrera <r...@conexus.net> wrote:
> >>>>>> I am looking for a strong parser to be used with CSV (Comma Separated
> >>>>>> Values) files....

> >> I figure that I don't have all the test cases for CSV, I might miss
> >> something, so why reinvent the wheel? Ergo: use somebody else's fine-
> >> tuned code for CSV and only CSV.
>
> > Recognisably wisdom from experience :-)
>
> > NIH is a pathology.
>
> Well I used to think like that, until I discovered that mostly other
> people's wheels have square axles, bent shafts, fall apart..
>
> I must have spent getting on for quarter of a million quid

That sounds like the problem right there :-)

Admittedly the industry has changed a LOT in the last 5-10-20 years.
Who changed it were the itch-scratchers: Richard, Linus, Monty and so
on; individuals who proved that they were the *right* people to be
solving particular problems. What holds it back are the Emperors:
starting with Gates.

> on database
> software that ultimately never did what it was supposed to.
>
> I could have written it better for less than that.

If you *had*, like Monty did, for instance, you could have GPL'd it
and been a) leading a platform today or b) an acquisition target like
Sleepycat and retired gracefully, with occasional trips to the
International Space Station.

toby

Apr 23, 2007, 8:17:20 PM
On Apr 22, 10:52 pm, The Natural Philosopher <a...@b.c> wrote:
> ...

> The great advantage of home rolled programs, is you know the guy who
> wrote them intimately, and he is always there to fix them for you.

The disadvantage is that he just may not be very smart, or very smart
about the problem domain. For most problems, someone else out there
knows more about it than you do. Google finds them pretty quickly...

>
> >> -Ramon


William Park

Apr 23, 2007, 9:25:41 PM

As for "a formal spec", no, because it's like a formal spec for a car.
There are VW, Toyota, etc. Most implementations differ in how they
treat whitespace around comma separators.

If you want a shell solution, then you can find my approach to this
problem in my .sig below. Essentially,

    while read -C a b c; do
        declare -p a b c
    done << EOF
    aa, "11, 22" ,cc
    EOF

--
William Park <openge...@yahoo.ca>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
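
The whitespace point can be seen with the same sample line in another implementation. The sketch below uses Python's standard csv module, which is an assumption of this example and unrelated to the shell approach above. It parses the line three ways to show how much the result depends on the rules chosen for blanks around separators and quotes.

```python
import csv

line = 'aa, "11, 22" ,cc'

# Default dialect: a quote that is not the first character of a field is
# ordinary data, so the comma inside the quotes splits the field.
default = next(csv.reader([line]))

# skipinitialspace=True drops blanks right after a separator, so the quote
# is seen at the start of the field and the embedded comma is protected.
skipping = next(csv.reader([line], skipinitialspace=True))

# strict=True additionally refuses the blank between the closing quote and
# the following separator.
try:
    next(csv.reader([line], skipinitialspace=True, strict=True))
    strict_rejected = False
except csv.Error:
    strict_rejected = True

if __name__ == "__main__":
    print(len(default), default)
    print(len(skipping), skipping)
    print("rejected in strict mode:", strict_rejected)
```

The same sixteen characters come out as a different number of fields, or as an error, depending on the settings, which is why a single notion of "syntactically correct" is hard to pin down for CSV.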
