I have to analyse data files that have the following format:
Each row corresponds to 1 day and the field logic is: $1 is the date,
followed by 24 value/flag pairs, representing measurements at 01:00,
02:00 ... 24:00 of the respective day. In short:
<date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
Some test data is available at:
http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
I have to sum up the values (per day and only valid data, i.e. with
flag>0) in order to calculate the mean. That's not too difficult.
However, I also need to know what the "maximum data gap" is, i.e. the
longest period with successive invalid measurements (i.e. values with
flag<=0).
My current code looks like this:
{
date=$1
for (i=1;i<=24;i++)
if ($(i*2+1)>0){
num_valid++
sum+=$(i*2)
} else {
# find out what the max_gap for this row is
}
if (num_valid>13 && max_gap<=6){
print date,sum/num_valid,1
} else {
# print something else
}
}
Could someone help me with the max_gap logic?
Thanks, Hermann
I'd have written the loop just a little differently, so instead of this:
for (i=1;i<=24;i++)
if ($(i*2+1)>0){
num_valid++
sum+=$(i*2)
}
I'd do this:
for (i=3;i<=NF;i+=2)
if ($i > 0){
num_valid++
sum+=$(i-1)
}
because I think it's clearer (and I expect it's also more efficient, but
I don't care so much about that).
Given that, it's an easy tweak to get the gap (assuming by "gap" you
mean number of intervals):
for (i=3;i<=NF;i+=2)
if ($i > 0){
num_valid++
sum+=$(i-1)
last_valid = i
} else {
this_gap = (i - last_valid) / 2
max_gap = (this_gap > max_gap ? this_gap : max_gap)
}
Regards,
Ed.
My approach was based on the idea that the i-value should always be
equal to the hour the measurement stands for. E.g. for i=1, I loop over
the value/flag pair for 01:00H, which are fields $2/$3 in my records.
>
> Given that, it's an easy tweak to get the gap (assuming by "gap" you
> mean number of intervals):
Correct. The intervals are the 24 hours of a day in my case.
>
> for (i=3;i<=NF;i+=2)
> if ($i > 0){
> num_valid++
> sum+=$(i-1)
> last_valid = i
> } else {
> this_gap = (i - last_valid) / 2
> max_gap = (this_gap > max_gap ? this_gap : max_gap)
> }
>
It looks to me that in a row with 1 bad value first and then 23
successive good values, the result would be this_gap = (3 - 0)/2 = 1.5?
I will try out at home and perhaps blend your code with mine.
Thanks, Hermann
>
> I'd have written the loop just a little differently, so instead of this:
> ...
I took lines from both my original code and yours. The best properly
working solution I was able to produce is:
{
num_valid=0
sum_valid=0
last_valid=0
this_gap=0
max_gap=0
date=$1
for (i=1;i<=24;i++)
if ($(i*2+1)>0){
num_valid++
sum_valid+=$(i*2)
last_valid = i
} else {
this_gap = (i - last_valid)
max_gap = (this_gap > max_gap ? this_gap : max_gap)
}
if (num_valid>=13 && max_gap<=6){
print date,sum_valid/num_valid,1,num_valid,max_gap
} else {
print date,(num_valid?sum_valid/num_valid:0),-9,num_valid,max_gap
}
}
I still can't believe that one needs so much code to find out that,
e.g. in the row below, the "longest period with successive invalid
values" (i.e. a "v" followed by a flag <=0) is: 3.
date v 1 v 0 v 1 v -1 v 0 v -1 v 1 v -1 v 1 v 1 v -1 v 0 v 1 v 1 v -1
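For the gap alone, the shortest version I can come up with is a counter
that resets on every valid flag (just a sketch, assuming the flags sit
in the odd fields $3, $5, ... as above):

{
    gap = max_gap = 0
    for (i = 3; i <= NF; i += 2)
        if ($i > 0) gap = 0                     # valid flag ends the run
        else if (++gap > max_gap) max_gap = gap # invalid flag extends it
    print $1, max_gap
}

On the row above it prints 3, as expected.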
Perhaps you would prefer to use Ruby:
$, = " " # Output field-separator.
ARGF.each { |line|
fields = line.split
date = fields.first
flags = fields.map{|x| Integer(x) rescue nil}.compact
valid = flags.select{|n| n > 0}
num_valid = valid.size
sum_valid = valid.inject{|a,b| a+b}
max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
join("").scan(/i+/).map{|s| s.size}.max
if num_valid > 12 and max_gap < 7
print date,sum_valid/num_valid,1,num_valid,max_gap
else
print date,(num_valid ? sum_valid/num_valid : 0),-9,
num_valid,max_gap
end
puts
}
Not really. See below.
Anyway: thanks for the hint,
Hermann
--------------
$ cat mycode
{
num_valid=0
sum_valid=0
last_valid=0
this_gap=0
max_gap=0
date=$1
for (i=1;i<=24;i++)
if ($(i*2+1)>0){
num_valid++
sum_valid+=$(i*2)
last_valid = i
} else {
this_gap = (i - last_valid)
max_gap = (this_gap > max_gap ? this_gap : max_gap)
}
if (num_valid>=13 && max_gap<=6){
print date,sum_valid/num_valid,1,num_valid,max_gap
} else {
print date,(num_valid?sum_valid/num_valid:0),-9,num_valid,max_gap
}
}
$ tr -d " " <mycode |wc -c
401
$ cat yourcode
$, = " " # Output field-separator.
ARGF.each { |line|
fields = line.split
date = fields.first
flags = fields.map{|x| Integer(x) rescue nil}.compact
valid = flags.select{|n| n > 0}
num_valid = valid.size
sum_valid = valid.inject{|a,b| a+b}
max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
join("").scan(/i+/).map{|s| s.size}.max
if num_valid > 12 and max_gap < 7
print date,sum_valid/num_valid,1,num_valid,max_gap
else
print date,(num_valid ? sum_valid/num_valid : 0),-9,
num_valid,max_gap
end
puts
}
$ tr -d " " <yourcode |wc -c
461
This is stunningly naive. However, if that is what you think
you want ...
$, = " "
while s = gets do
date, *flags = s.split
flags.map!{|x| Integer(x) rescue nil}.compact!
v = flags.select{|n| n > 0}
sum_valid = v.inject{|a,b| a+b}
max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
join.scan(/i+/).map{|s| s.size}.max
print date,(v.size > 0 ? sum_valid.to_f/v.size : 0),
(v.size > 12 && max_gap < 7 ? 1 : -9),
v.size, max_gap, "\n"
end
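(To try it, save it as, say, readings.rb and run: ruby readings.rb readings.txt)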
> Hi All,
>
> I have to analyse data files that have the following format:
> Each row corresponds to 1 day and the field logic is: $1 is the date,
> followed by 24 value/flag pairs, representing measurements at 01:00,
> 02:00 ... 24:00 of the respective day. In short:
>
> <date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
>
> Some test data is available at:
> http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
>
> I have to sum up the values (per day and only valid data, i.e. with
> flag>0) in order to calculate the mean. That's not too difficult.
> However, I also need to know what the "maximum data gap" is, i.e. the
> longest period with successive invalid measurements (i.e. values with
> flag<=0).
Hi Hermann.
A little late, I know, but I did enjoy this.
First, I selected a few lines from your data that allowed me to develop
my program.
egrep '1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)' readings.txt > readingsx.txt
The gawk program I called readings.awk; it produces the following
output on the reduced input:
$ gawk -f readings.awk readingsx.txt
Line: 1991-03-31 Reject: 0 Accept: 24 Line_tot: 565.000 Line_avg: 23.542
Line: 1991-03-31 Reject: 23 Accept: 1 Line_tot: 40.000 Line_avg: 40.000
Line: 1991-04-01 Reject: 1 Accept: 23 Line_tot: 534.000 Line_avg: 23.217
Line: 1991-04-02 Reject: 0 Accept: 24 Line_tot: 475.000 Line_avg: 19.792
Line: 1991-04-03 Reject: 0 Accept: 24 Line_tot: 335.000 Line_avg: 13.958
Line: 1991-04-04 Reject: 0 Accept: 24 Line_tot: 261.000 Line_avg: 10.875
Line: 1993-03-04 Reject: 24 Accept: 0 Line_tot: 0.000 Line_avg: 0.000
Line: 1993-03-05 Reject: 14 Accept: 10 Line_tot: 454.000 Line_avg: 45.400
Line: 2004-12-26 Reject: 1 Accept: 23 Line_tot: 154.600 Line_avg: 6.722
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057
File(s) = readingsx.txt
Total = 2865.900
Readings = 176
Average = 16.284
Maximum run(s) of 38 consecutive false readings ends at line starting with date(s): 1993-03-05
When run on the full dataset I get the following (only the tail shown):
$ gawk -f readings.awk readings.txt |tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057
File(s) = readings.txt
Total = 1358393.400
Readings = 129403
Average = 10.497
Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05
Note: the count of consecutive false readings deliberately carries over
across line boundaries (nodata is not reset at the start of a record).
Note: in gvim, the following search expression highlights all the
rejected readings:
/0\.0\+[\t ]\+\(-[0-9]\|\<0\)[^0-9.]
======Start readings.awk=========
# Author Donald 'Paddy' McCarthy Jan 01 2007
BEGIN{
nodata = 0;          # Current run of consecutive flags<0 in lines of file
nodata_max=-1; # Max consecutive flags<0 in lines of file
nodata_maxline="!"; # ... and line number(s) where it occurs
}
FNR==1 {
# Accumulate input file names
if(infiles){
infiles = infiles "," infiles
} else {
infiles = FILENAME
}
}
{
tot_line=0; # sum of line data
num_line=0; # number of line data items with flag>0
# extract field info, skipping initial date field
for(field=2; field<=NF; field+=2){
datum=$field;
flag=$(field+1);
if(flag<1){
nodata++
}else{
# check run of data-absent fields
if(nodata_max==nodata && (nodata>0)){
nodata_maxline=nodata_maxline ", " $1
}
if(nodata_max<nodata && (nodata>0)){
nodata_max=nodata
nodata_maxline=$1
}
# re-initialise run of nodata counter
nodata=0;
# gather values for averaging
tot_line+=datum
num_line++;
}
}
# totals for the file so far
tot_file += tot_line
num_file += num_line
printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f
Line_avg: %10.3f\n", \
$1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)?
tot_line/num_line: 0
# debug prints of original data plus some of the computed values
#printf "%s %15.3g %4i\n", $0, tot_line, num_line
#printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line,
nodata, nodata_max, nodata_maxline
}
END{
printf "\n"
printf "File(s) = %s\n", infiles
printf "Total = %10.3f\n", tot_file
printf "Readings = %6i\n", num_file
printf "Average = %10.3f\n", tot_file / num_file
printf "\nMaximum run(s) of %i consecutive false readings ends at
line starting with date(s): %s\n", nodata_max, nodata_maxline
}
======Finish readings.awk=========
http://paddy3118.blogspot.com/2007/01/data-mining-in-three-language05.html
- Paddy.
ITYM:
awk '/1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)/' readings.txt > readingsx.txt
;-).
No need to init a variable to zero.
> nodata_max=-1; # Max consecutive flags<0 in lines of file
> nodata_maxline="!"; # ... and line number(s) where it occurs
The above says you'll store line numbers in the variable
"nodata_maxline", but that doesn't seem to be what you're doing below
when you save a chain of "$1" values in it, and in your final printf you
seem to be saying its value is a date. In either case, why set it to
"!" when its default initial value of zero/null would work just fine?
> }
> FNR==1 {
> # Accumulate input file names
> if(infiles){
> infiles = infiles "," infiles
> } else {
> infiles = FILENAME
> }
ITYM:
if(infiles){
infiles = infiles "," FILENAME
} else {
infiles = FILENAME
}
but you could write it more simply as just:
infiles = infiles sep FILENAME
sep = ","
> }
> {
> tot_line=0; # sum of line data
> num_line=0; # number of line data items with flag>0
You don't need to end lines with semi-colons. I wouldn't, but if you
really feel a burning desire to do it, then do it consistently (some
lines below don't end in semi-colons).
> # extract field info, skipping initial date field
> for(field=2; field<=NF; field+=2){
> datum=$field;
> flag=$(field+1);
> if(flag<1){
> nodata++
The above, for example, doesn't have the useless trailing semi-colon.
> }else{
> # check run of data-absent fields
> if(nodata_max==nodata && (nodata>0)){
Again, be consistent. If you're going to put one term in brackets
"(nodata>0)" then put the other one "nodata_max==nodata" in brackets too.
> nodata_maxline=nodata_maxline ", " $1
> }
> if(nodata_max<nodata && (nodata>0)){
> nodata_max=nodata
> nodata_maxline=$1
> }
ITYM:
if(nodata_max==nodata && (nodata>0)){
nodata_maxline=nodata_maxline ", " $1
}
else if(nodata_max<nodata && (nodata>0)){
nodata_max=nodata
nodata_maxline=$1
}
but of course that could be written more efficiently as:
if (nodata>0){
if(nodata_max==nodata){
nodata_maxline=nodata_maxline ", " $1
}
else if(nodata_max<nodata ){
nodata_max=nodata
nodata_maxline=$1
}
}
which MAY actually be just:
if (nodata>0){
if(nodata_max==nodata){
nodata_maxline=nodata_maxline ", " $1
}
else {
nodata_max=nodata
nodata_maxline=$1
}
}
but what you're doing here isn't immediately obvious and I couldn't be
bothered to think about it...
Regards,
Ed.
Thanks, Ed, for your critique. I've added some notes to yours below.
> Paddy wrote:
> > Hermann Peifer wrote:
> >
> >
> >>Hi All,
> >>
> >>I have to analyse data files that have the following format:
> >>Each row corresponds to 1 day and the field logic is: $1 is the date,
> >>followed by 24 value/flag pairs, representing measurements at 01:00,
> >>02:00 ... 24:00 of the respective day. In short:
> >>
> >><date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
> >>
> >>Some test data is available at:
> >>http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
> >>
> >>I have to sum up the values (per day and only valid data, i.e. with
> >>flag>0) in order to calculate the mean. That's not too difficult.
> >>However, I also need to know what the "maximum data gap" is, i.e. the
> >>longest period with successive invalid measurements (i.e. values with
> >>flag<=0).
> >
> >
> > Hi Hermann.
> > A little late I know, but I did enjoy this.
> >
<SNIP>
> > ======Start readings.awk=========
> > # Author Donald 'Paddy' McCarthy Jan 01 2007
> >
> > BEGIN{
> > nodata = 0;          # Current run of consecutive flags<0 in lines of file
>
> No need to init a variable to zero.
... Except when I want to make a point of its (automatic) initial
value and hang a comment on it.
>
> > nodata_max=-1; # Max consecutive flags<0 in lines of file
> > nodata_maxline="!"; # ... and line number(s) where it occurs
>
> The above says you'll store line numbers in the variable
> "nodata_maxline", but that doesn't seem to be what you're doing below
> when you save a chain of "$1" values in it, and in your final printf you
> seem to be saying its value is a date. In either case, why set it to
> "!" when its default initial value of zero/null would work just fine?
You're right, it is the (almost) unique date field for each record.
I wanted a visually distinct initial value for nodata_maxline even
though I don't expect to see it printed, hence "!".
>
> > }
> > FNR==1 {
> > # Accumulate input file names
> > if(infiles){
> > infiles = infiles "," infiles
> > } else {
> > infiles = FILENAME
> > }
>
> ITYM:
>
> if(infiles){
> infiles = infiles "," FILENAME
> } else {
> infiles = FILENAME
> }
>
> but you could write it more simply as just:
>
> infiles = infiles sep FILENAME
> sep = ","
Neat trick - I think I'll use it in the future.
>
> > }
> > {
> > tot_line=0; # sum of line data
> > num_line=0; # number of line data items with flag>0
>
> You don't need to end lines in semi-colon. I wouldn't but if you really
> feel a burning desire to do it, then do it consistently (some lines
> below don't end in semi-colons).
You're right; I've never programmed in just awk, and on a first pass
I'm always switching statement-ending idioms.
>
> > # extract field info, skipping initial date field
> > for(field=2; field<=NF; field+=2){
> > datum=$field;
> > flag=$(field+1);
> > if(flag<1){
> > nodata++
>
> The above, for example, doesn't have the useless trailing semi-colon.
>
> > }else{
> > # check run of data-absent fields
> > if(nodata_max==nodata && (nodata>0)){
>
> Again, be consistent. If you're going to put one term in brackets
> "(nodata>0)" then put the other one "nodata_max==nodata" in brackets too.
>
> > nodata_maxline=nodata_maxline ", " $1
> > }
> > if(nodata_max<nodata && (nodata>0)){
> > nodata_max=nodata
> > nodata_maxline=$1
> > }
>
> ITYM:
ITYM??
>
> if(nodata_max==nodata && (nodata>0)){
> nodata_maxline=nodata_maxline ", " $1
> }
> else if(nodata_max<nodata && (nodata>0)){
> nodata_max=nodata
> nodata_maxline=$1
> }
>
> but of course that could be written more efficiently as:
>
> if (nodata>0){
> if(nodata_max==nodata){
> nodata_maxline=nodata_maxline ", " $1
> }
> else if(nodata_max<nodata ){
> nodata_max=nodata
> nodata_maxline=$1
> }
> }
>
> which MAY actually be just
One step too far: in the cases where nodata_max>nodata, the plain else
would wrongly overwrite nodata_max with the smaller run count.
>
> if (nodata>0){
> if(nodata_max==nodata){
> nodata_maxline=nodata_maxline ", " $1
> }
> else {
> nodata_max=nodata
> nodata_maxline=$1
> }
> }
>
> but what you're doing here isn't immediately obvious and I couldn't be
> bothered to think about it...
>
> Regards,
>
> Ed.
Thanks for your comments so far. I'll keep them for the refactoring
stage(s) that I'll do if I use the prog for a demo or something.
- Paddy.
(...)
> You're right, it is the (almost) unique date field for each record.
There are indeed 5 duplicate dates in the 5471 records of the testdata:
1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26
This looks like a data bug. I will do some checks on the full data set
in order to see if I can find more of these.
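E.g. something like this one-liner should find them (a sketch, assuming
the date is always in field 1, with readings.txt named as in your posting):

awk 'seen[$1]++ { print "duplicate date:", $1 }' readings.txt

Each date that occurs n times gets reported n-1 times.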
Thanks for the hint, Hermann
- Paddy.
You are welcome.
BTW: I found a total of 1567 duplicate dates in the complete data set,
all of which are bugs that I reported back to the data provider for
fixing. You have hence made a valuable contribution to improving the
quality of our data set.
Hermann