Comparing two CSV files

Matt

unread,

Mar 15, 2006, 1:58:47 PM3/15/06

to

I'm attempting to write a simple script that will compare values of two
files in csv format. I want to separate file1 into an array and take a
look at the 1st column value of each row and compare it to the same
column in file2.

I'm guessing I need to use nested loops. One loop to process each row
of file1 as file2 is processed against it with another loop.

open(DATA1_FILE,"data.1") ||
die "Can't open file: $!";
open(DATA2_FILE,"data.2") ||
die "Can't open file: $!";

open(OUT_FILE, ">output.csv") ||
die "Can't open file";

while(<DATA1_FILE>) {
chomp;

my $brInt1 = 'notmodified';
my @csv1 = split(/,/, $_);

while(<DATA2_FILE>) {
chomp;

my @csv2 = split(/,/, $_);

if ( $csv1[0] == $csv2[0] )
{
$brInt1 = 'modified';
}

}

print OUT_FILE<<"EOF";
$csv1[0] $csv1[1],$csv1[2],$csv1[3],$csv1[4],$csv1[5],$brInt1
EOF
}

close DATA1_FILE;
close DATA2_FILE;
close OUT_FILE;

--
Regards,
Matt
(Non-programmer learning Perl)

use...@davidfilmer.com

unread,

Mar 15, 2006, 3:14:36 PM3/15/06

to

Matt wrote:
> I'm guessing I need to use nested loops.

Well, not like that. Your approach processes every line of file2
against every line of file1. But you question indicates that you only
want to process (say) line 14 of file2 against line14 of file1 (ie,
there is a one-to-one correspondance between the two files)

I'd use modules to simplify (and robust-ify) the parsing (this approach
assumes the CSV files are not huge and you don't mind holding them both
in memory):

#!/usr/bin/perl
use warnings; use strict;
use Text::CSV::Simple;
use List::MoreUtils qw{ each_array each_arrayref };

my $parser = Text::CSV::Simple->new;

my @csv1 = $parser->read_file('/tmp/csv1');
my @csv2 = $parser->read_file('/tmp/csv2');

my $each = each_array(@csv1, @csv2); #from List::MoreUtils

while ( my ($csv1, $csv2) = $each->() ) {
print join ",", @$csv1,
($$csv1[0] eq $$csv2[0] ? 'Not_' : '')
. "Modified\n";
}

__END__

--
http://DavidFilmer.com

Matt

unread,

Mar 15, 2006, 4:10:35 PM3/15/06

to

use...@DavidFilmer.com wrote:
> Matt wrote:
>> I'm guessing I need to use nested loops.
>
> Well, not like that. Your approach processes every line of file2
> against every line of file1. But you question indicates that you only
> want to process (say) line 14 of file2 against line14 of file1 (ie,
> there is a one-to-one correspondance between the two files)
>

A little more clarification. For each row of File1, I want to compare
column 1 with file 2 column 1 and cycle through the rows of file 2 until
a match is made. Once a match is made, then perform some conditional
tasks and move on. If a match isn't made, then skip to the next row of
file 1 and scan again.

So yes, I do want file1.r1.c1 to be correlated with one particular row,
but the row may be random (file2.r[x].c1).

> I'd use modules to simplify (and robust-ify) the parsing (this approach
> assumes the CSV files are not huge and you don't mind holding them both
> in memory):
>
> #!/usr/bin/perl

[snip...]

Thanks for this. Eventually, I do want to simplify by using modules
perhaps. But just for my understanding, I want to get the script
working without (or as little use) of built in modules/functions, etc.
Briefly looking at the code you attached, I'm already quite lost. :)

--
Regards,
Matt

use...@davidfilmer.com

unread,

Mar 15, 2006, 5:07:09 PM3/15/06

to

Matt wrote:
> A little more clarification. For each row of File1, I want to compare
> column 1 with file 2 column 1 and cycle through the rows of file 2 until
> a match is made. Once a match is made, then perform some conditional
> tasks and move on.

OK, then you need the outer/inner loops as you first said. You can
still parse the CSV files with a module (which simply does the same
thing your open/read/close stuff does, except it returns an array of
array references. That might sound confusing, but it's really not, and
you would do well to learn how this sort of thing works, since it's
very handy (and efficient, and common) in Perl.

===================================
The quick and dirty tutorial: The line:

my @csv1 = $parser->read_file('/tmp/csv1');

returns an array of array references. Each array reference equals one
parsed line of the CSV file. So $csv1[0] is a reference to an array
which contains the parsed values of the first line of your first CSV
file. Thus, you could do this:
my $first_line = $csv1[0];
Now, $first_line is (obviously) this same reference. It's not an array
- it's a reference to an array. You can de-reference the whole array
like this:
print @$first_line;
or address individual elements like this:
print $$first_line[3]; #the fourth element of the first line of
the first CSV
Or, as a shortcut, you can say this:
print $csv1[3]->[5];
which prints the sixth value (ie, [5]) of the fourth line (ie, [3]) of
the first CSV file. Hey, that's pretty cool, right?
===================================

Anyway, this should do what you want (I've got some "dummy processing"
for illustration that just prints some stuff).

#!/usr/bin/perl
use warnings; use strict;
use Text::CSV::Simple;

my $parser = Text::CSV::Simple->new;

my @csv1 = $parser->read_file('/tmp/csv1');
my @csv2 = $parser->read_file('/tmp/csv2');

OUTER_LOOP:
foreach my $csv1(@csv1) {
foreach my $csv2(@csv2) {
if ($$csv1[0] eq $$csv2[0]) { # use '==' if numeric
print "YAAAY - '$$csv1[0]' matches '$$csv2[0]'\n";
print "\tCSV2 stuff is: " , (join ",", @$csv2) . "\n";
print "\tThe third field from CSV1 is: '$$csv1[2]'\n";
next OUTER_LOOP;
}
}
print "BUMMER - no match for '$$csv1[0]'\n";
}

__END__

--
http://DavidFilmer.com

Matt

unread,

Mar 15, 2006, 9:31:27 PM3/15/06

to

use...@DavidFilmer.com wrote:
[snip]

> which prints the sixth value (ie, [5]) of the fourth line (ie, [3]) of
> the first CSV file. Hey, that's pretty cool, right?

David,

Very cool indeed. I installed the Text::CSV::Simple module and your
script worked as intended. Now I just need to modify it and add
conditionals and write to a file. My goal is to be able to write a
simple script like this without "hacking" someone else's code. I
appreciate your input.

>
> OUTER_LOOP:
> foreach my $csv1(@csv1) {
> foreach my $csv2(@csv2) {

Just wondering why you used foreach instead of while? Is it just a
matter of preference, or does foreach work better with arrays such as this?

--
Regards,
Matt

goo...@davidfilmer.net

unread,

Mar 15, 2006, 9:41:31 PM3/15/06

to

Matt wrote:

> Just wondering why you used foreach instead of while? Is it just a
> matter of preference, or does foreach work better with arrays such as this?

No - there's a functional difference: while() loops are for input
(such as reading a filehandle), foreach() is for everything else
(lists, arrays, etc). It matters sometimes (see
http://tinyurl.com/hslac for an example where I made this mistake and
was reminded of the difference between for/while by more expert
coders).

for() and foreach(), however, are the exact same thing. By convention,
I (and many others) use for() when not using an index variable (ie,
allowing $_ to represent the current value). But I (and many others)
consider looping with $_ to be bad practice in most cases, so I rarely
use for().

--
http://DavidFilmer.com

Matt

unread,

Mar 15, 2006, 11:47:12 PM3/15/06

to

goo...@davidfilmer.net wrote:
Thanks again for your help and insight. Just a bit more tweaking and
I'll be done.

#!/usr/bin/perl
use warnings; use strict;
use Text::CSV::Simple;

my $outfile = "comparing.output.csv";
unlink("$outfile");
open(OUT_FILE,">>$outfile") || die("Cannot Open $outfile");
my $parser = Text::CSV::Simple->new();

my @csv1 = $parser->read_file('data.1.csv');
my @csv2 = $parser->read_file('data.2.csv');

OUTER_LOOP:
foreach my $csv1(@csv1) {
foreach my $csv2(@csv2) {

if ($$csv1[0] eq $$csv2[0]) { # use '==' if numeric

$$csv1[3] = $$csv2[3];
print OUT_FILE (join ",", @$csv1) . "\n";
next OUTER_LOOP;
}
}
print OUT_FILE (join ",", @$csv1,"NO_MATCH") . "\n";

}

close(OUT_FILE);

--
Regards,
Matt