
Parsing CSV files


Phil Bewig

Mar 27, 2002, 9:34:09 AM
Once again, the question of how to parse comma-separated values files
has been asked on c.l.a. It is appealing to try to set FS to some
complicated regular expression involving commas and quote marks, but
all such attempts are doomed to failure: the meaning of a comma
changes depending on whether it is inside or outside a quoted field,
and regular grammars sit below both context-free and context-sensitive
grammars in the Chomsky hierarchy.
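
The failure shows up on even the simplest data: with FS set to a
comma, a quoted comma is miscounted as a field boundary.

    $ echo '1,"Hello, world",3' | awk -F',' '{ print NF }'
    4

Three fields in, four fields out.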

There have been numerous previous attempts on comp.lang.awk to create
a general solution to the problem of parsing csv files, including my
own. Most fail. In fact, the hardest problem in writing a program
to parse csv files is that there is no formal definition of the
format, and several kinky character sequences admit multiple
reasonable interpretations. Add the need to handle the varying
newline representations of different operating systems, so that any
system can read a csv file created on another, and the unfortunate
fact that some European countries use a semicolon instead of a comma
to separate fields (because they use a comma to represent the radix
point in decimal numbers), and you get a mess. And any solution
written in awk must contend with the mismatch between awk's concept
of a record separator and the csv format's need to allow newlines
embedded in quoted fields, which invariably leads to awkward (and
frequently incorrect) code.
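
The record-separator mismatch is easy to demonstrate: with awk's
default RS, one csv record containing an embedded newline arrives as
two records.

    $ printf '1,"a\nb",2\n' | awk '{ print NR ": " $0 }'
    1: 1,"a
    2: b",2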

Because of the difficulty of dealing with csv files in awk, it is a
reasonable alternative to preprocess a csv file with a purpose-built
tool that creates a character-delimited file. As the example below
shows, this allows awk programs to remain simple, and the preprocessor
is useful with other unix text-processing tools that assume character-
delimited data.

The attached program is such a preprocessor, written in C. It
provides options to set the input and output field separators, and is
much faster than any pure-awk solution, even counting the extra pass
over the data. Simply extract the code from this message, save it in
a file called csv.c, compile it as shown,

DOS, Windows, OS/2          Unix systems
----------------------      -----------------
C> cc -o csv.exe csv.c      $ cc -o csv csv.c
                            $ chmod +x csv

and put it somewhere in your path.

I appreciate your comments.


/* NAME
*
* csv -- convert comma-separated values file to character-delimited
*
*
* SYNOPSIS
*
* csv [-Cc] [-Fc] [filename ...]
*
*
* DESCRIPTION
*
* Csv reads from standard input or from one or more files named on
* the command line a sequence of records in comma-separated values
* format and writes on standard output the same records in character-
* delimited format. Csv returns 0 on success, 1 for option errors,
* and 2 if any file couldn't be opened.
*
* The comma-separated values format has developed over time as a
* set of conventions that has never been formally defined, and some
* implementations are in conflict about some of the details. In
* general, the comma-separated values format is used by databases,
* spreadsheets, and other programs that need to write data consisting
* of records containing fields. The data is written as ascii text,
* with records terminated by newlines and fields containing zero or
* more characters separated by commas. Leading and trailing space in
* unquoted fields is preserved. Fields may be surrounded by double-
* quote characters (ascii \042); such fields may contain newlines,
* literal commas (ascii \054), and double-quote characters
* represented as two successive double-quotes. The examples shown
* below clarify many irregular situations that may arise.
*
* The field separator is normally a comma, but can be changed to an
* arbitrary character c with the command line option -Cc. This is
* useful in those European countries that use a comma as the decimal
* point, where the field separator is normally a semicolon.
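*
* For example, to convert a semicolon-separated file to the default
* \034-delimited output (the file name here is illustrative):
*
*     csv -C';' data.csv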
*
* Character-delimited format has records terminated by newlines and
* fields separated by a single character, which is \034 by default
* but may be changed with the -Fc option on the command line.
*
*
* EXAMPLE
*
* Each record below has five fields. For readability, the three-
* character sequence TAB represents a single tab character (ascii
* \011).
*
* $ cat testdata.csv
* 1,abc,def ghi,jkl,unquoted character strings
* 2,"abc","def ghi","jkl",quoted character strings
* 3,123,456,789,numbers
* 4, abc,def , ghi ,strings with whitespace
* 5, "abc","def" , "ghi" ,quoted strings with whitespace
* 6, 123,456 , 789 ,numbers with whitespace
* 7,TAB123,456TAB,TAB789TAB,numbers with tabs for whitespace
* 8, -123, +456, 1E3,more numbers with whitespace
* 9,123 456,123"456, 123 456 ,strange numbers
* 10,abc",de"f,g"hi,embedded quotes
* 11,"abc""","de""f","g""hi",quoted embedded quotes
* 12,"","" "",""x"",doubled quotes
* 13,"abc"def,abc"def","abc" "def",strange quotes
* 14,,"", ,empty fields
* 15,abc,"def
* ghi",jkl,embedded newline
* 16,abc,"def",789,multiple types of fields
*
* $ csv -F'|' testdata.csv
* 1|abc|def ghi|jkl|unquoted character strings
* 2|abc|def ghi|jkl|quoted character strings
* 3|123|456|789|numbers
* 4| abc|def | ghi |strings with whitespace
* 5| "abc"|def | "ghi" |quoted strings with whitespace
* 6| 123|456 | 789 |numbers with whitespace
* 7|TAB123|456TAB|TAB789TAB|numbers with tabs for whitespace
* 8| -123| +456| 1E3|more numbers with whitespace
* 9|123 456|123"456| 123 456 |strange numbers
* 10|abc"|de"f|g"hi|embedded quotes
* 11|abc"|de"f|g"hi|quoted embedded quotes
* 12|| ""|x""|doubled quotes
* 13|abcdef|abc"def"|abc "def"|strange quotes
* 14||| |empty fields
* 15|abc|def
* ghi|jkl|embedded newline
* 16|abc|def|789|multiple types of fields
*
* It is particularly easy to pipe the output from csv into any of
* the unix tools that accept character-delimited fielded text data
* files, such as sort, join, or cut. For example:
*
* csv datafile.csv | awk -F'\034' -f program.awk
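*
* As a further hypothetical example, to keep only those records
* whose third field is numerically greater than 100:
*
*     csv datafile.csv | awk -F'\034' '$3 > 100'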
*
*
* BUGS
*
* On DOS, Windows, and OS/2 systems, processing of each file stops
* at the first appearance of the ascii \032 (control-Z) end of file
* character.
*
* Because newlines embedded in quoted fields are treated literally,
* a missing closing quote can suck up all remaining input.
*
*
* LICENSE
*
* This program was written by Philip L. Bewig of Saint Louis,
* Missouri, United States of America on February 28, 2002 and
* placed in the public domain.
*/

#include <stdio.h>
#include <stdlib.h>     /* malloc, exit */
#include <string.h>     /* strlen, strcpy */

/* dofile -- convert one file from comma-separated to delimited */
void dofile(char ofs, char fs, FILE *f) {
    int c;              /* current input character */

START:                  /* start of file, or start of a new record */
    c = fgetc(f);
    if (c == EOF)  { return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == '\"') { goto QUOTED_FIELD; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

NOT_FIELD:              /* just emitted a separator; next field follows */
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == '\"') { goto QUOTED_FIELD; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

QUOTED_FIELD:           /* inside a double-quoted field */
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\"') { goto MAY_BE_DOUBLED_QUOTES; }
    /* default */  { putchar(c); goto QUOTED_FIELD; }

MAY_BE_DOUBLED_QUOTES:  /* saw '"' in a quoted field: doubled or closing? */
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == '\"') { putchar('\"'); goto QUOTED_FIELD; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

UNQUOTED_FIELD:         /* inside an unquoted field */
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { goto CARRIAGE_RETURN; }
    if (c == '\n') { goto LINE_FEED; }
    if (c == fs)   { putchar(ofs); goto NOT_FIELD; }
    /* default */  { putchar(c); goto UNQUOTED_FIELD; }

CARRIAGE_RETURN:        /* saw '\r': handles Mac (\r) and DOS (\r\n) endings */
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { putchar('\n'); goto CARRIAGE_RETURN; }
    if (c == '\n') { putchar('\n'); goto START; }
    if (c == '\"') { putchar('\n'); goto QUOTED_FIELD; }
    if (c == fs)   { printf("\n%c", ofs); goto NOT_FIELD; }
    /* default */  { printf("\n%c", c); goto UNQUOTED_FIELD; }

LINE_FEED:              /* saw '\n': handles Unix (\n) and \n\r endings */
    c = fgetc(f);
    if (c == EOF)  { putchar('\n'); return; }
    if (c == '\r') { putchar('\n'); goto START; }
    if (c == '\n') { putchar('\n'); goto LINE_FEED; }
    if (c == '\"') { putchar('\n'); goto QUOTED_FIELD; }
    if (c == fs)   { printf("\n%c", ofs); goto NOT_FIELD; }
    /* default */  { printf("\n%c", c); goto UNQUOTED_FIELD; }
}

/* main -- process command line, call appropriate conversion */
int main(int argc, char *argv[]) {
    char ofs = '\034';  /* output field separator */
    char fs = ',';      /* input field separator */
    int status = 0;     /* error status for return to operating system */
    char *progname;     /* name of program for error messages */

    FILE *f;
    int i;

    progname = (char *) malloc(strlen(argv[0]) + 1);
    strcpy(progname, argv[0]);

    while (argc > 1 && argv[1][0] == '-') {
        switch (argv[1][1]) {
            case 'c':
            case 'C':
                fs = argv[1][2];
                break;
            case 'f':
            case 'F':
                ofs = argv[1][2];
                break;
            default:
                fprintf(stderr, "%s: unknown argument %s\n",
                    progname, argv[1]);
                fprintf(stderr,
                    "usage: %s [-Cc] [-Fc] [filename ...]\n",
                    progname);
                exit(1);
        }
        argc--;
        argv++;
    }

    if (argc == 1)
        dofile(ofs, fs, stdin);
    else
        for (i = 1; i < argc; i++)
            if ((f = fopen(argv[i], "r")) == NULL) {
                fprintf(stderr, "%s: can't open %s\n",
                    progname, argv[i]);
                status = 2;
            } else {
                dofile(ofs, fs, f);
                fclose(f);
            }

    exit(status);
}

Dan Haygood

Mar 27, 2002, 11:58:01 AM
Phil -
FS-style regular-expression parsing of CSV is incredibly hard, but it CAN be
done. It is so complex, though, that I'd agree that it is doomed to
failure, even with things like FPAT. (OK, I will concede that there may be
some CSV grammars that are theoretically impossible to parse completely with
regex, without intermediate work. And awk's FS makes it impossible.)

> [From the bottom] And any solution written in awk...
My position is that writing a general solution IN awk is a waste of source
code, and that CSV handling should be BUILT INTO awk.

> ...it is a reasonable alternative to preprocess
> a csv file with a purpose-built tool that creates
> a character-delimited file.

Yes, it is. I would like to see some engine like yours incorporated into
awk--but without the translation part; just parse out the fields, and use
them for $n. Preprocessing is an extra OS load, more file/pipe I/O, et
cetera. But it's all we have now.

Some other observations...

> ...the meaning of a comma changes...
Yet we can still parse C, where an asterisk may be binary
multiplication or unary dereferencing.

>...there is no formal definition of the format...
Its formal definition is "comma" "separated" "values", just like it says.
There is a delimiter, usually a comma, that is detectable when outside of a
"value", that is used to "separate" (as opposed to terminate, lead, or
enclose) the "values". This isn't enough to write BNF for lex, but
it got you pretty darn far with your preprocessor.

>...several kinky character sequences that admit
> multiple reasonable interpretations...
I have not seen many of these (none I can remember), but the intended
consumer of the data dictates a deterministic interpretation of the data. A
parser would simply need to be told how to handle things. There might be
more options than FS and RS, which tell awk how to break a file into records
and fields, but it is still manageable.

For instance, suppose the quoting character is double-prime, and the
escape character is backslash. The parser would have to know: is "
within a string represented as "" (like Pascal's embedded
single-prime, or as Microsoft often does) or as \" (Unix-style)?
Once field-quoting is out of the way, though, the data within the
field can be interpreted by the awk program, so one wouldn't need to
worry about: are only some escaped characters "special"? Is
\t == <tab>, but \u == \u?
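
Concretely, the same logical value -- the string She said "hi" --
would be written two different ways:

    "She said ""hi"""      doubled-quote convention (Pascal, Microsoft)
    "She said \"hi\""      backslash convention (Unix)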

> ...varying newline representations...
awk already does this; I don't see how it would be more difficult in a CSV
environment.

> ...some...use a semicolon instead of a comma...
Allow a delimiter value to be set, like your variable fs.

> And any solution...must contend with
> the mismatch between awk's concept of a record
> separator and the csv format's need to allow
> newlines embedded in quoted fields

This is like saying, "C must contend with people who want to write 'and
then' instead of '&&'": No, it doesn't. (Even though yours does!) As the
designer of such an engine, one could (perhaps should) disallow things like
this--things that are usually described with question-marks and exclamation
points when people discuss them. The designer of any system can (and must)
set reasonable limits on it. Let the Excel people use whatever system they
have to determine when they are dealing with their own system's admittedly
bizarre behavior, and properly quote their new-line characters in their own
preprocessing...but let the rest of us have a working engine.

- Dan


"Phil Bewig" <pbe...@swbell.net> wrote in message
news:455f7154.02032...@posting.google.com...


Lyndon Nerenberg

Mar 27, 2002, 2:58:58 PM
>FS-style regular-expression parsing of CSV is incredibly hard, but it CAN be
>done. It is so complex, though, that I'd agree that it is doomed to
>failure, even with things like FPAT.

Years of fighting the "CSV + quoted strings" battle has brought me to
the point where now I don't even think of using awk to solve the
problem. Instead, I write the parser using lex. In almost every case
I code the productions directly in C (in the lex program source),
rather than trying to mung the data into something awk can deal with.
Which method (C productions vs. data munging) to use is a matter of
personal choice. If you're comfortable writing moderately complicated
awk programs, the migration to native lex is not difficult. Often, the
amount of time you spend trying to come up with that horribly complex
FS expression exceeds the time it takes to learn how to write simple
lex productions. The important thing is to use the tool that suits the
task. awk isn't *always* the right tool (believe it or not :-)
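
To give the flavor, here is a bare-bones sketch of such a lexer
(illustrative only; it assumes doubled-quote escaping and comma
separators, and merely labels the tokens it finds):

%{
#include <stdio.h>
%}
%%
\"([^\"]|\"\")*\"   { printf("quoted field: %s\n", yytext); }
[^,\"\n]+           { printf("bare field: %s\n", yytext); }
,                   { /* field separator */ }
\n                  { printf("end of record\n"); }
%%
int main(void) { yylex(); return 0; }
int yywrap(void) { return 1; }

Build with "lex csv.l; cc lex.yy.c". The longest-match rule does the
heavy lifting: a doubled quote inside a quoted field never terminates
the match, because lex always prefers the longer alternative.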

--lyndon

Christopher Hamel

Mar 27, 2002, 11:27:50 PM
I appreciate the insight provided here (really)... I had no idea a
simple inquiry (not mine, by the way -- I was just a schlep who tried
to help) would stir up such a hornet's nest. I admit, I only
recently discovered newsgroups and have obviously missed some heated
CSV discussion.

At any rate, my point is this: If you look at the poster's original
problem, it really wasn't that complex. Okay, the data was in CSV,
but what he was really after was determining uniqueness of records
from a certain field (#2) onward. Within the scope of that problem
and depending on whether or not a few other assumptions could be made
about his data, it's possible that AWK is, in fact, a fine solution.
I'll bet a dozen Krispy Kremes that it is in this case.

I think it's easy to get wrapped up in creating a portable, scalable
solution, all the while losing sight of the task at hand. If you know
your data and aren't trying to solve everyone else's problems along
with your own, you may find the solution is nearer at hand than you
imagined.

Read the original email and tell me if I'm full of poo poo.
Chris
