awk - 2-dimensional associative array

Harry

unread,

Jun 14, 2013, 9:39:00 PM6/14/13

to

Hi All,

I have a text file with data like this.
First column is Facility_ID, second column is date string,
third column is a count.

--
Fac Date total
7 20130607 2112
7 20130608 314
7 20130609 143
7 20130610 3782
7 20130611 2855
7 20130612 2851
7 20130613 2788
101 20130607 63910
101 20130608 26722
101 20130609 10320
101 20130610 49265
101 20130611 61710
101 20130612 65428
101 20130613 64092
102 20130607 85
102 20130608 15
102 20130609 5
102 20130610 130
102 20130611 119
102 20130612 87
102 20130613 137
...
--

Want to summarize the data with output like this.

Fac date1, ..., date7
7 2112,..., 2788
...

I have not used a two-dimensional associative array before ...
Not sure if the following are correct.

I tried with the following ...

awk 'BEGIN {OFS="\t";}
{fac=$1; date=$2; count=$3;a[fac][date]=a[fac][date]+count;}
END {for (i in a)
{print a[i];
for (j in a[i]) print a[i][j];
print "\n";}
}
' output.csv

But got this error ...
awk: cmd. line:4: (FILENAME=output.csv FNR=228) fatal:
attempt to use array `a["322070"]' in a scalar context

Line 228 seemed OK.

$ cat -n output.csv | tail -5
224 8832263 20130613 356
225 8832266 20130607 4
226 8832266 20130608 1
227 8832266 20130612 18
228 8832266 20130613 3

Why such error and how to fix it?

TIA

Bill Marcum

unread,

Jun 15, 2013, 2:41:14 AM6/15/13

to

On 06/14/2013 09:39 PM, Harry wrote:
> awk 'BEGIN {OFS="\t";}
> {fac=$1; date=$2; count=$3;a[fac][date]=a[fac][date]+count;}
> END {for (i in a)
> {print a[i];
> for (j in a[i]) print a[i][j];
> print "\n";}
> }
> ' output.csv
>
> But got this error ...
> awk: cmd. line:4: (FILENAME=output.csv FNR=228) fatal:
> attempt to use array `a["322070"]' in a scalar context
>
> Line 228 seemed OK.
>
> $ cat -n output.csv | tail -5
> 224 8832263 20130613 356
> 225 8832266 20130607 4
> 226 8832266 20130608 1
> 227 8832266 20130612 18
> 228 8832266 20130613 3
>
> Why such error and how to fix it?
>
> TIA
>

It looks like a bug which might be caused by running out of memory.
If you're using the default awk in Solaris, don't use it, try nawk,
/usr/xpg4/bin/awk, or gawk.

Manuel Collado

unread,

Jun 15, 2013, 6:23:14 AM6/15/13

to

^-------------------------- a[i] is an array

> for (j in a[i]) print a[i][j];
> print "\n";}
> }
> ' output.csv
>
> But got this error ...
> awk: cmd. line:4: (FILENAME=output.csv FNR=228) fatal:
> attempt to use array `a["322070"]' in a scalar context

The message is quite precise: `a["322070"]' is itself an array. You can't
print it directly.

>
> Line 228 seemed OK.
>
> $ cat -n output.csv | tail -5
> 224 8832263 20130613 356
> 225 8832266 20130607 4
> 226 8832266 20130608 1
> 227 8832266 20130612 18
> 228 8832266 20130613 3

No problems with the data file.

>
> Why such error and how to fix it?

In line 4, 'print a[i]' should probably be 'print i'.

>
> TIA
--
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado

Harry

unread,

Jun 15, 2013, 3:32:26 PM6/15/13

to

Manuel Collado wrote...
[...]

>> But got this error ...
>> awk: cmd. line:4: (FILENAME=output.csv FNR=228) fatal:
>> attempt to use array `a["322070"]' in a scalar context
>
>The message is quite precise: `a["322070"]' is itself an array. You can't
>print it directly.
>
>>
>> Line 228 seemed OK.
>>
>> $ cat -n output.csv | tail -5
>> 224 8832263 20130613 356
>> 225 8832266 20130607 4
>> 226 8832266 20130608 1
>> 227 8832266 20130612 18
>> 228 8832266 20130613 3
>
>No problems with the data file.
>
>>
>> Why such error and how to fix it?
>
>In line 4, 'print a[i]' should probably be 'print i'.

Ah ... that makes sense.

Thanks

Harry

unread,

Jun 16, 2013, 12:47:47 AM6/16/13

to

$ cat outfile.csv

7 20130607 2112
7 20130608 314
7 20130609 143
7 20130610 3782
7 20130611 2855
7 20130612 2851
7 20130613 2788
101 20130607 63910
101 20130608 26722
101 20130609 10320
101 20130610 49265
101 20130611 61710
101 20130612 65428
101 20130613 64092

$ awk 'BEGIN {OFS="\t";}

{fac=$1; date=$2; count=$3;a[fac][date]=a[fac][date]+count;}
END {for (i in a)

{line=i;
for (j in a[i]) line=line "\t" a[i][j];
print line
}
}
' outfile.csv

Output ...
7 3782 2855 2851 2788 2112 314 143
101 49265 61710 65428 64092 63910 26722 10320

Hmmmm... the output was not sorted horizontally.
How could I sort a row from 20130607 to 20130613 ?

TIA

Janis Papanagnou

unread,

Jun 16, 2013, 3:12:20 AM6/16/13

to

Since your initial question (the syntax error) is solved, you can
now consider using another approach. Your input data seems sorted
already so you don't need arrays (neither standard arrays nor the
2-dim extension that gawk supports).

awk '
oldkey && $1!=oldkey { print oldkey, data ; data = "" }
{ data = data "\t" $3 ; oldkey = $1 }
END { print oldkey, data }
' outfile.csv

The program preserves the original order.

Your program also counts the third column, but that count is not
used for the output. If you want to extend it to sum up the data
just add a bit code to handle the summation...

awk '
oldkey && $1!=oldkey { print oldkey, data" \ts= "sum ; data="" ; sum=0 }
{ data = data "\t" $3 ; sum += $3 ; oldkey = $1 }
END { print oldkey, data "\ts=" sum }
' outfile.csv

Janis

Harry

unread,

Jun 16, 2013, 10:37:38 AM6/16/13

to

Janis Papanagnou wrote...
[...]

>Since your initial question (the syntax error) is solved, you can
>now consider using another approach. Your input data seems sorted
>already so you don't need arrays (neither standard arrays nor the
>2-dim extension that gawk supports).
>
> awk '
> oldkey && $1!=oldkey { print oldkey, data ; data = "" }
> { data = data "\t" $3 ; oldkey = $1 }
> END { print oldkey, data }
> ' outfile.csv

[...]

Janis, thanks for the suggestion.

It worked well as long all all facility_id have had 7 rows.
But in my original outfile.csv, there were some facilities
without 7 rows of results.

$ egrep "^101|^111|^114" outfile.csv

101 20130607 63910
101 20130608 26722
101 20130609 10320
101 20130610 49265
101 20130611 61710
101 20130612 65428
101 20130613 64092

111 20130610 1
114 20130610 6
114 20130611 2
114 20130613 1

So, your code would yield the following results.

7 2112 314 143 3782 2855 2851 2788
101 63910 26722 10320 49265 61710 65428 64092
102 85 15 5 130 119 87 137
106 103 30 2 84 128 105 116
110 101 50 27 233 152 224 169
111 1
114 6 2 1

And the last two rows were shifted to the left.

Sorry my OP did not show this requirement.
Any suggestion to handle this?

TIA

Janis Papanagnou

unread,

Jun 16, 2013, 1:08:31 PM6/16/13

to

Sure, but it depends on the - yet unmentioned - further requirements.
Is there always an at most a 7-day period of data?
How to determine the start (or resp. end) of the period in cases where
the data samples are lacking the first (first few) or resp. the last
(last few) data entries in a row. (Is the date range known in advance?
Is it always starting on a certain week-day?)
Could all the data in some cases have just 2 or 3 columns?

With the mentioned new requirements you probably may indeed need arrays
(though not necessarily gawk's non-standard multi-dimensional arrays).

The problem is that you need the boundary dates, which (depending on
your data) might not be available until the file is completely read.

Here's a sample program that assumes a header line in the data file,
and the necessity for summation across each line, and for each column.
For other requirements that script may be simplified. The script uses
only standard awk features.

awk '
NR==1 { next }
!min || $2<min { min=$2 }
!max || $2>max { max=$2 }
!($1 in keyset) { keyset[$1] ; key[++nk]=$1 }
{ val[$1,$2]=$3 }
END {
for (k=1; k<=nk; k++) {
printf "%s", key[k]
s=0
for (d=min; d<=max; d++) {
v=val[key[k],d]
s+=v
sum[d]+=v
printf "\t" v
}
printf "\tsum=%s\n", s
}
printf "Sum"
for (d=min; d<=max; d++)
printf "\t%s", sum[d]
printf "\n"
}
'

Output:

101 63910 26722 10320 49265 61710 65428 64092 sum=341447
111 1 sum=1
114 6 2 1 sum=9
Sum 63910 26722 10320 49272 61712 65428 64093

Janis

>
> TIA
>

Harry

unread,

Jun 16, 2013, 3:00:02 PM6/16/13

to

Janis Papanagnou wrote...
[...]

>> Any suggestion to handle this?
>
>Sure, but it depends on the - yet unmentioned - further requirements.
>Is there always an at most a 7-day period of data?

Yes and no.
The outfile.csv was generated by an SQL query on an Excel.
The later was generated by yet another awk script to search
log files for last 7 days. So I could search for 7 days or
any arbitary numbers of days. There would always be 7-days data
for some large facilities, but not for small facilities.

>How to determine the start (or resp. end) of the period in cases where
>the data samples are lacking the first (first few) or resp. the last
>(last few) data entries in a row. (Is the date range known in advance?
>Is it always starting on a certain week-day?)
>Could all the data in some cases have just 2 or 3 columns?

No, large facilites would have data everyday.

>With the mentioned new requirements you probably may indeed need arrays
>(though not necessarily gawk's non-standard multi-dimensional arrays).

I am still intersted in sorting an 2-d associative array.

This worked perfectly.
Thank you so much for your help, Janis.

Harry

Harry

unread,

Jun 16, 2013, 6:10:55 PM6/16/13

to

Harry wrote...
[...]

>>With the mentioned new requirements you probably may indeed need arrays
>>(though not necessarily gawk's non-standard multi-dimensional arrays).
>
>I am still intersted in sorting an 2-d associative array.

[...]

I managed to sort the date keys k[] like this.

awk '
NR==1 { next; }
{fac=$1; date=$2; count=$3;a[fac][date]=a[fac][date]+count;k[date]=1}
END {n=0; for (j in k) b[n++]=j;
n=asort(b,c);
for (f in a)
{line=f;
for (i = 1; i <= n; i++) line=line "\t" a[f][c[i]];
print line;
}
}
' outfile.csv | sort -n

Output looked OK.

James

unread,

Aug 8, 2013, 6:39:40 PM8/8/13

to

A perl version:

while (<>) {
chomp;
($k, $t, $n) = split/\s+/;
$x{$t} = 1;
$h{$k}->{$t} = $n;
}
for $k (sort {$a<=>$b} keys %h) {
print "$k\t";
print "$h{$k}->{$_}\t" for sort {$a<=>$b} keys %x;
print "\n";

}

7 2112 314 143 3782 2855 2851 2788

101 63910 26722 10320 49265 61710 65428 64092

102 85 15 5 130 119 87 137

111 1
114 6 2 1

James

Kenny McCormack

unread,

Aug 8, 2013, 6:56:33 PM8/8/13

to

In article <8a3823a9-56f9-495c...@googlegroups.com>,
James <hsle...@yahoo.com> wrote:
...
>A perl version:

Where's the C# version?

--
Here's a simple test for Fox viewers:

1) Sit back, close your eyes, and think (Yes, I know that's hard for you).
2) Think about and imagine all of your ridiculous fantasies about Barack Obama.
3) Now, imagine that he is white. Cogitate on how absurd your fantasies
seem now.

See? That wasn't hard, was it?