trouble parsing "kind of" comma-delimited text

nun

unread,

Oct 3, 2008, 1:37:01 PM10/3/08

to

I need to process a text file of product data that's supplied to me by a
vendor. This data is "kind of" comma-delimited.... some of the rows
contain commas within the "description" field, and in these cases (and
only in these cases) that field's data is enclosed by double quotes.
Here is some sample data:

SKU,DESCRIPTION,PRICE
12345,CABLE,21.25
56789,"CONNECTOR, LARGE",13.50

Rows which do not have the double-quote-enclosed comma issue can be
processed correctly using this code:

#################################
# reading data in from file
my (@AoA);
while ( <> ) {
chomp;
push @AoA, [ split /,/ ];
}
#################################

but of course the rows which do have that issue don't parse as desired.

So far I've been unable to come up with a good way to handle this, so I
thought I'd ask for suggestions from the gurus. Any ideas would be
appreciated.

DB

Tony Curtis

unread,

Oct 3, 2008, 1:43:06 PM10/3/08

to

nun wrote:
> I need to process a text file of product data that's supplied to me by a
> vendor. This data is "kind of" comma-delimited.... some of the rows
> contain commas within the "description" field, and in these cases (and
> only in these cases) that field's data is enclosed by double quotes.
> Here is some sample data:
>
> SKU,DESCRIPTION,PRICE
> 12345,CABLE,21.25
> 56789,"CONNECTOR, LARGE",13.50

Well, if you know for sure the data lines are of the form

INTEGER "," TEXT ", " REAL

you could use a regex instead of split() to grab the bits.

hth
t

J�rgen Exner

unread,

Oct 3, 2008, 1:51:43 PM10/3/08

to

nun <ju...@yahoo.com> wrote:
>I need to process a text file of product data that's supplied to me by a
>vendor. This data is "kind of" comma-delimited.... some of the rows
>contain commas within the "description" field, and in these cases (and
>only in these cases) that field's data is enclosed by double quotes.

Sounds like a normal CSV file to me.

[...]

>but of course the rows which do have that issue don't parse as desired.
>
>So far I've been unable to come up with a good way to handle this, so I
>thought I'd ask for suggestions from the gurus. Any ideas would be
>appreciated.

Why not just use the Text::CSV module or one of its relatives to parse
that file?

jue

Ben Morrow

unread,

Oct 3, 2008, 2:01:29 PM10/3/08

to

Quoth nun <ju...@yahoo.com>:

> I need to process a text file of product data that's supplied to me by a
> vendor. This data is "kind of" comma-delimited.... some of the rows
> contain commas within the "description" field, and in these cases (and
> only in these cases) that field's data is enclosed by double quotes.

This is usual in comma-delimited data. Use Text::CSV_XS.

Ben

--
It will be seen that the Erwhonians are a meek and long-suffering people,
easily led by the nose, and quick to offer up common sense at the shrine of
logic, when a philosopher convinces them that their institutions are not based
on the strictest morality. [Samuel Butler, paraphrased] b...@morrow.me.uk

s...@netherlands.com

unread,

Oct 4, 2008, 4:24:42 PM10/4/08

to

On Fri, 03 Oct 2008 13:37:01 -0400, nun <ju...@yahoo.com> wrote:

>I need to process a text file of product data that's supplied to me by a
>vendor. This data is "kind of" comma-delimited.... some of the rows
>contain commas within the "description" field, and in these cases (and
>only in these cases) that field's data is enclosed by double quotes.
>Here is some sample data:
>
>SKU,DESCRIPTION,PRICE
>12345,CABLE,21.25
>56789,"CONNECTOR, LARGE",13.50
>
>Rows which do not have the double-quote-enclosed comma issue can be
>processed correctly using this code:
>
>#################################
># reading data in from file
>my (@AoA);
> while ( <> ) {

my @tmpa = ();
while (/"+\s*([^"]*?)\s*"+|\s*([^,\n]+)\s*/g) # trims
{
my $val = defined $1 ? $1 : $2;
push @tmpa, $val;
}
push @AoA, \@tmpa;

# or ...

push @AoA, [(/("+[^"]*?"+|[^,\n]+)/g)]; # does not trim

> }
>#################################
>
>but of course the rows which do have that issue don't parse as desired.
>
>So far I've been unable to come up with a good way to handle this, so I
>thought I'd ask for suggestions from the gurus. Any ideas would be
>appreciated.
>
>DB

If you don't want to bring in a bunch of code, do it the easy way.
sln

use strict;
use warnings;

my $str = '
SKU,DESCRIPTION,PRICE
12345, CABLE , 21.25
56789,"CONNECTOR, LARGE"",13.50
';

while ($str =~ /("+[^"]*?"+|[^,\n]+)/g)
{
my $val = $1;
$val =~ s/^[\s'"]+//; $val =~ s/[\s'"]+$//;
print "val_a = $val\n";
# ... push @ary, $val;
}
print "\n\n\n";

## or ...

while ($str =~ /"+\s*([^"]*?)\s*"+|\s*([^,\n]+)\s*/g)
{
my $val = defined $1 ? $1 : $2;
print "val_b = $val\n";
# ... push @ary, $val;
}

print "\n\n\n";

## or ...

my @ary = ($str =~ /("+[^"]*?"+|[^,\n]+)/g); # but gets all the crap as well

print "@ary\n";

__END__

output:

val_a = SKU
val_a = DESCRIPTION
val_a = PRICE
val_a = 12345
val_a = CABLE
val_a = 21.25
val_a = 56789
val_a = CONNECTOR, LARGE
val_a = 13.50

val_b = SKU
val_b = DESCRIPTION
val_b = PRICE
val_b = 12345
val_b = CABLE
val_b = 21.25
val_b = 56789
val_b = CONNECTOR, LARGE
val_b = 13.50

SKU DESCRIPTION PRICE 12345 CABLE 21.25 56789 "CONNECTOR, LARGE"" 13.50

cartercc

unread,

Oct 6, 2008, 7:52:25 AM10/6/08

to

On Oct 3, 1:37 pm, nun <j...@yahoo.com> wrote:
> I need to process a text file of product data that's supplied to me by a
> vendor. This data is "kind of" comma-delimited.... some of the rows
> contain commas within the "description" field, and in these cases (and
> only in these cases) that field's data is enclosed by double quotes.
> Here is some sample data:
>
> SKU,DESCRIPTION,PRICE
> 12345,CABLE,21.25
> 56789,"CONNECTOR, LARGE",13.50

If you want, you can open the file in Excel and save as a file
delimited in whatever you want.

I often get Excel files, and my approach is to preprocess the file to
do two things: (1) replace all commas with the pipe symbol (|) unless
the comma is between the double quotes, and (2) remove the double
quotes.

CC

Jürgen Exner

unread,

Oct 6, 2008, 9:41:41 AM10/6/08

to

Which only replaces Scylla with Charybdis, because now your target file
will have double quotes around every value that contains a pipe symbol.

Why not just use a standard CSV-parser to read the data? Simple,
straight forward, no fuss, no problems, no pre-processing, and most
important correct.

jue

cartercc

unread,

Oct 6, 2008, 2:13:09 PM10/6/08

to

On Oct 6, 9:41 am, Jürgen Exner <jurge...@hotmail.com> wrote:
> Which only replaces Scylla with Charybdis, because now your target file
> will have double quotes around every value that contains a pipe symbol.

Except that my data files do not contain pipe symbols, and I can rely
on the fact that pipe symbols will not be in the file.

> Why not just use a standard CSV-parser to read the data? Simple,
> straight forward, no fuss, no problems, no pre-processing, and most
> important correct.

TIMTOWTDI. I have to do a lot of preprocessing, and it's easier for me
to incorporate as much of the proprocessing into one function as I
can. I agree that we are usually better off using a module that has
been time tested, but there are times when the reverse is true.

CC

nun

unread,

Oct 6, 2008, 9:45:17 PM10/6/08

to

Thanks - that worked great :)

DB

s...@netherlands.com

unread,

Oct 9, 2008, 4:18:50 PM10/9/08

to

On Sat, 04 Oct 2008 20:24:42 GMT, s...@netherlands.com wrote:

>On Fri, 03 Oct 2008 13:37:01 -0400, nun <ju...@yahoo.com> wrote:
>
>>I need to process a text file of product data that's supplied to me by a
>>vendor. This data is "kind of" comma-delimited.... some of the rows
>>contain commas within the "description" field, and in these cases (and
>>only in these cases) that field's data is enclosed by double quotes.
>>Here is some sample data:
>>
>>SKU,DESCRIPTION,PRICE
>>12345,CABLE,21.25
>>56789,"CONNECTOR, LARGE",13.50
>>
>>Rows which do not have the double-quote-enclosed comma issue can be
>>processed correctly using this code:
>>
>>#################################
>># reading data in from file
>>my (@AoA);
>> while ( <> ) {
>
> my @tmpa = ();
> while (/"+\s*([^"]*?)\s*"+|\s*([^,\n]+)\s*/g) # trims

^^ /\s*"+\s*([^"]*?)\s*"+\s*|\s*([^,\n]+)\s*/g
^
if the leading \s is not there, [^,\n] will hog before it gets to the
first double quote.
this will only happen on the very first ' "...",' entry.