The maximum data gap

34 views

Hermann Peifer

Dec 14, 2006, 1:35:53 AM12/14/06
to
Hi All,

I have to analyse data files that have the following format:
Each row corresponds to 1 day and the field logic is: \$1 is the date,
followed by 24 value/flag pairs, representing measurements at 01:00,
02:00 ... 24:00 of the respective day. In short:

<date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>

Some test data is available at:
http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document

I have to sum up the values (per day and only valid data, i.e. with
flag>0) in order to calculate the mean. That's not too difficult.
However, I also need to know what the "maximum data gap" is, i.e. the
longest period with successive invalid measurements (i.e values with
flag<=0)

My current code looks like this:

{
date=\$1
for (i=1;i<=24;i++)
if (\$(i*2+1)>0){
num_valid++
sum+=\$(i*2)
} else {
# find out what the max_gap for this row is
}

if (num_valid>13 && max_gap<=6){
print date,sum/num_valid,1
} else {
# print something else
}
}

Could someone help me with the max_gap logic?

Thanks, Hermann

Ed Morton

Dec 14, 2006, 9:07:14 AM12/14/06
to

I'd have written the loop just a little differently, so instead of this:

for (i=1;i<=24;i++)
if (\$(i*2+1)>0){
num_valid++
sum+=\$(i*2)
}

I'd do this:

for (i=3;i<=NF;i+=2)
if (\$i){
num_valid++
sum+=\$(i-1)
}

because I think it's clearer (and I expect it's also more efficient but
I dobn't care so much about that).

Given that, it's an easy tweak to get the gap (assuming by "gap" you
mean number of intervals):

for (i=3;i<=NF;i+=2)
if (\$i){
num_valid++
sum+=\$(i-1)
last_valid = i
} else {
this_gap = (i - last_valid) / 2
max_gap = (this_gap > max_gap ? this_gap : max_gap)
}

Regards,

Ed.

Hermann

Dec 14, 2006, 10:56:43 AM12/14/06
to
Ed Morton wrote:
> I'd have written the loop just a little differently, so instead of this:
>
> for (i=1;i<=24;i++)
> if (\$(i*2+1)>0){
> num_valid++
> sum+=\$(i*2)
> }
>
> I'd do this:
>
> for (i=3;i<=NF;i+=2)
> if (\$i){
> num_valid++
> sum+=\$(i-1)
> }
>
> because I think it's clearer (and I expect it's also more efficient but
> I dobn't care so much about that).

My approach was based on the thought that the i-value should always be
equal to the time (hour) the measurement stands for. E.g. i=1, then I
loop over the value/flag for 01:00H which are fields \$2/\$3 in my
records.

>
> Given that, it's an easy tweak to get the gap (assuming by "gap" you
> mean number of intervals):

Correct. The intervals are the 24 hours of a day in my case.

>
> for (i=3;i<=NF;i+=2)
> if (\$i){
> num_valid++
> sum+=\$(i-1)
> last_valid = i
> } else {
> this_gap = (i - last_valid) / 2
> max_gap = (this_gap > max_gap ? this_gap : max_gap)
> }
>

It looks to me that in a row with 1 bad value first and then 23
successive good values, the result would be this_gap = ( 3 - 0) /2 =
1.5 ?

I will try out at home and perhaps blend your code with mine.

Thanks, Hermann

Hermann Peifer

Dec 15, 2006, 1:12:38 AM12/15/06
to
Ed Morton wrote:
> Hermann Peifer wrote:
>> Hi All,
>>
>> I have to analyse data files that have the following format:
>> Each row corresponds to 1 day and the field logic is: \$1 is the date,
>> followed by 24 value/flag pairs, representing measurements at 01:00,
>> 02:00 ... 24:00 of the respective day. In short:
>>
>> <date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
>>
>> Some test data is available at:
>> http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
>>
>>
>> I have to sum up the values (per day and only valid data, i.e. with
>> flag>0) in order to calculate the mean. That's not too difficult.
>> However, I also need to know what the "maximum data gap" is, i.e. the
>> longest period with successive invalid measurements (i.e values with
>> flag<=0)
>>
>> My current code looks like this:
>> ...

>
> I'd have written the loop just a little differently, so instead of this:

> ...

I took lines from both, my original code and yours. The best properly
working solution I was able to produce is:

{
num_valid=0
sum_valid=0
last_valid=0
this_gap=0
max_gap=0
date=\$1

for (i=1;i<=24;i++)
if (\$(i*2+1)>0){
num_valid++

sum_valid+=\$(i*2)

last_valid = i
} else {
this_gap = (i - last_valid)

max_gap = (this_gap > max_gap ? this_gap : max_gap)
}

if (num_valid>=13 && max_gap<=6){
print date,sum_valid/num_valid,1,num_valid,max_gap
} else {
print date,(num_valid?sum_valid/num_valid):0),-9,num_valid,max_gap
}
}

I still can't believe that one needs so much code to find out that e.g.
in the below row, the "longest period with successive invalid values"
(i.e. a "v" followed by a flag <=0) is: 3.

date v 1 v 0 v 1 v -1 v 0 v -1 v 1 v -1 v 1 v 1 v -1 v 0 v 1 v 1 v -1

William James

Dec 15, 2006, 7:23:18 AM12/15/06
to

Perhaps you would prefer to use Ruby:

\$, = " " # Output field-separator.
ARGF.each { |line|
fields = line.split
date = fields.first
flags = fields.map{|x| Integer(x) rescue nil}.compact
valid = flags.select{|n| n > 0}
num_valid = valid.size
sum_valid = valid.inject{|a,b| a+b}
max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
join("").scan(/i+/).map{|s| s.size}.max
if num_valid > 12 and max_gap < 7
print date,sum_valid/num_valid,1,num_valid,max_gap
else
print date,(num_valid ? sum_valid/num_valid : 0),-9,
num_valid,max_gap
end
puts
}

Hermann Peifer

Dec 15, 2006, 8:18:32 AM12/15/06
to
William James wrote:
>
> Perhaps you would prefer to use Ruby:
>

Not really. See below.
Anyway: thanks for the hint,
Hermann
--------------
\$ cat mycode

{
num_valid=0
sum_valid=0
last_valid=0
this_gap=0
max_gap=0
date=\$1
for (i=1;i<=24;i++)
if (\$(i*2+1)>0){
num_valid++
sum_valid+=\$(i*2)
last_valid = i
} else {
this_gap = (i - last_valid)
max_gap = (this_gap > max_gap ? this_gap : max_gap)
}
if (num_valid>=13 && max_gap<=6){
print date,sum_valid/num_valid,1,num_valid,max_gap
} else {
print date,(num_valid?sum_valid/num_valid):0),-9,num_valid,max_gap
}
}

\$ tr -d " " <mycode |wc -c
401

\$ cat yourcode

\$, = " " # Output field-separator.
ARGF.each { |line|
fields = line.split
date = fields.first
flags = fields.map{|x| Integer(x) rescue nil}.compact
valid = flags.select{|n| n > 0}
num_valid = valid.size
sum_valid = valid.inject{|a,b| a+b}
max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
join("").scan(/i+/).map{|s| s.size}.max
if num_valid > 12 and max_gap < 7
print date,sum_valid/num_valid,1,num_valid,max_gap
else
print date,(num_valid ? sum_valid/num_valid : 0),-9,
num_valid,max_gap
end
puts
}

\$ tr -d " " <yourcode |wc -c
461

William James

Dec 15, 2006, 12:18:02 PM12/15/06
to

This is stunningly naive. However, if that is what you think
you want ...

\$, = " "
while s = gets do
date, *flags = s.split
flags.map!{|x| Integer(x) rescue nil}.compact!
v = flags.select{|n| n > 0}
sum_valid = v.inject{|a,b| a+b}

max_gap = flags.map{|n| n > 0 ? "v" : "i" }.

join.scan(/i+/).map{|s| s.size}.max
print date,(v.size > 0 ? sum_valid.to_f/v.size : 0),
(v.size > 12 && max_gap < 7 ? 1 : -9),
v.size, max_gap, "\n"
end

Jan 1, 2007, 3:29:38 PM1/1/07
to

Hermann Peifer wrote:

> Hi All,
>
> I have to analyse data files that have the following format:
> Each row corresponds to 1 day and the field logic is: \$1 is the date,
> followed by 24 value/flag pairs, representing measurements at 01:00,
> 02:00 ... 24:00 of the respective day. In short:
>
> <date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
>
> Some test data is available at:
> http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
>
> I have to sum up the values (per day and only valid data, i.e. with
> flag>0) in order to calculate the mean. That's not too difficult.
> However, I also need to know what the "maximum data gap" is, i.e. the
> longest period with successive invalid measurements (i.e values with
> flag<=0)

Hi Hermann.
A little late I know, but I did enjoy this.

First, I selected a few lines from your data that allowed me to develop
my program.

egrep '1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)'

The gawk program I called readings.awk which produces the following
output on the
reduced input:

Line: 1991-03-31 Reject: 0 Accept: 24 Line_tot: 565.000
Line_avg: 23.542
Line: 1991-03-31 Reject: 23 Accept: 1 Line_tot: 40.000
Line_avg: 40.000
Line: 1991-04-01 Reject: 1 Accept: 23 Line_tot: 534.000
Line_avg: 23.217
Line: 1991-04-02 Reject: 0 Accept: 24 Line_tot: 475.000
Line_avg: 19.792
Line: 1991-04-03 Reject: 0 Accept: 24 Line_tot: 335.000
Line_avg: 13.958
Line: 1991-04-04 Reject: 0 Accept: 24 Line_tot: 261.000
Line_avg: 10.875
Line: 1993-03-04 Reject: 24 Accept: 0 Line_tot: 0.000
Line_avg: 0.000
Line: 1993-03-05 Reject: 14 Accept: 10 Line_tot: 454.000
Line_avg: 45.400
Line: 2004-12-26 Reject: 1 Accept: 23 Line_tot: 154.600
Line_avg: 6.722
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300
Line_avg: 2.057

Total = 2865.900
Average = 16.284

Maximum run(s) of 38 consecutive false readings ends at line starting
with date(s): 1993-03-05

When run on the full dataset I get the following (only summary shown)

Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300
Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300
Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300
Line_avg: 2.057

Total = 1358393.400
Average = 10.497

Maximum run(s) of 589 consecutive false readings ends at line starting
with date(s): 1993-03-05

Note: consecutive false readings count goes over multiple following
lines.
Note: in gvim, the following search expression highlights all the
/0\.0\+[\t ]\+\(-[0-9]\|\<0\)[^0-9.]

# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{
nodata = 0; # Curret run of consecutive flags<0 in lines
of file
nodata_max=-1; # Max consecutive flags<0 in lines of file
nodata_maxline="!"; # ... and line number(s) where it occurs
}
FNR==1 {
# Accumulate input file names
if(infiles){
infiles = infiles "," infiles
} else {
infiles = FILENAME
}
}
{
tot_line=0; # sum of line data
num_line=0; # number of line data items with flag>0

# extract field info, skipping initial date field
for(field=2; field<=NF; field+=2){
datum=\$field;
flag=\$(field+1);
if(flag<1){
nodata++
}else{
# check run of data-absent fields
if(nodata_max==nodata && (nodata>0)){
nodata_maxline=nodata_maxline ", " \$1
}
if(nodata_max<nodata && (nodata>0)){
nodata_max=nodata
nodata_maxline=\$1
}
# re-initialise run of nodata counter
nodata=0;
# gather values for averaging
tot_line+=datum
num_line++;
}
}

# totals for the file so far
tot_file += tot_line
num_file += num_line

printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f
Line_avg: %10.3f\n", \
\$1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)?
tot_line/num_line: 0

# debug prints of original data plus some of the computed values
#printf "%s %15.3g %4i\n", \$0, tot_line, num_line
#printf "%s\n %15.3f %4i %4i %4i %s\n", \$0, tot_line, num_line,
nodata, nodata_max, nodata_maxline

}

END{
printf "\n"
printf "File(s) = %s\n", infiles
printf "Total = %10.3f\n", tot_file
printf "Average = %10.3f\n", tot_file / num_file

printf "\nMaximum run(s) of %i consecutive false readings ends at
line starting with date(s): %s\n", nodata_max, nodata_maxline
}

Jan 1, 2007, 9:36:41 PM1/1/07
to
I also wrote versions in Perl and Python and timed their execution,
(although they are written to compare the language styles rather than
timing).
See my blog entry:

Ed Morton

Jan 2, 2007, 10:40:53 AM1/2/07
to
> Hermann Peifer wrote:
>
>
>>Hi All,
>>
>>I have to analyse data files that have the following format:
>>Each row corresponds to 1 day and the field logic is: \$1 is the date,
>>followed by 24 value/flag pairs, representing measurements at 01:00,
>>02:00 ... 24:00 of the respective day. In short:
>>
>><date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
>>
>>Some test data is available at:
>>http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
>>
>>I have to sum up the values (per day and only valid data, i.e. with
>>flag>0) in order to calculate the mean. That's not too difficult.
>>However, I also need to know what the "maximum data gap" is, i.e. the
>>longest period with successive invalid measurements (i.e values with
>>flag<=0)
>
>
> Hi Hermann.
> A little late I know, but I did enjoy this.
>
> First, I selected a few lines from your data that allowed me to develop
> my program.
>
> egrep '1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)'

ITYM:

awk '/1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)/'

;-).

No need to init a variable to zero.

> nodata_max=-1; # Max consecutive flags<0 in lines of file
> nodata_maxline="!"; # ... and line number(s) where it occurs

The above says you'll store line numbers in the variable
"nodata_maxline" but that doesn't seem to be what you're doing below
when you save a chain of "\$1" values in it and in your final printf you
seem tio be saying it's value is a date. In either case, why set it to
"!" when it's default initial value of zero/null would work just fine.

> }
> FNR==1 {
> # Accumulate input file names
> if(infiles){
> infiles = infiles "," infiles
> } else {
> infiles = FILENAME
> }

ITYM:

if(infiles){
infiles = infiles "," FILENAME
} else {
infiles = FILENAME
}

but you could write it more simply as just:

infiles = infiles sep FILENAME
sep = ","

> }
> {
> tot_line=0; # sum of line data
> num_line=0; # number of line data items with flag>0

You don't need to end lines in semi-colon. I wouldn't but if you really
feel a burning desire to do it, then do it consistently (some lines
below don't end in semi-colons).

> # extract field info, skipping initial date field
> for(field=2; field<=NF; field+=2){
> datum=\$field;
> flag=\$(field+1);
> if(flag<1){
> nodata++

The above, for example, doesn't have the useless trailing semi-colon.

> }else{
> # check run of data-absent fields
> if(nodata_max==nodata && (nodata>0)){

Again, be consistent. If you're going to put one term in brackets
"(nodata>0)" then put the other one "nodata_max==nodata" in brackets too.

> nodata_maxline=nodata_maxline ", " \$1
> }
> if(nodata_max<nodata && (nodata>0)){
> nodata_max=nodata
> nodata_maxline=\$1
> }

ITYM:

if(nodata_max==nodata && (nodata>0)){
nodata_maxline=nodata_maxline ", " \$1
}

else if(nodata_max<nodata && (nodata>0)){
nodata_max=nodata
nodata_maxline=\$1
}

but of course that could be written more efficiently as:

if (nodata>0)
if(nodata_max==nodata){
nodata_maxline=nodata_maxline ", " \$1
}
else if(nodata_max<nodata ){
nodata_max=nodata
nodata_maxline=\$1
}
}

which MAY actually be just:

if (nodata>0)
if(nodata_max==nodata){
nodata_maxline=nodata_maxline ", " \$1
}
else {
nodata_max=nodata
nodata_maxline=\$1
}
}

but what you're doing here isn't immediately obvious and I couldn't be

Regards,

Ed.

Jan 2, 2007, 1:02:41 PM1/2/07
to

Ed Morton wrote:

> > Hermann Peifer wrote:
> >
> >
> >>Hi All,
> >>
> >>I have to analyse data files that have the following format:
> >>Each row corresponds to 1 day and the field logic is: \$1 is the date,
> >>followed by 24 value/flag pairs, representing measurements at 01:00,
> >>02:00 ... 24:00 of the respective day. In short:
> >>
> >><date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
> >>
> >>Some test data is available at:
> >>http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
> >>
> >>I have to sum up the values (per day and only valid data, i.e. with
> >>flag>0) in order to calculate the mean. That's not too difficult.
> >>However, I also need to know what the "maximum data gap" is, i.e. the
> >>longest period with successive invalid measurements (i.e values with
> >>flag<=0)
> >
> >
> > Hi Hermann.
> > A little late I know, but I did enjoy this.
> >

<SNIP>

> > # Author Donald 'Paddy' McCarthy Jan 01 2007
> >
> > BEGIN{
> > nodata = 0; # Curret run of consecutive flags<0 in lines
> > of file
>
> No need to init a variable to zero.

... Except when I want to makke a point of its (automatic) initial
value and hang a comment on it.

>
> > nodata_max=-1; # Max consecutive flags<0 in lines of file
> > nodata_maxline="!"; # ... and line number(s) where it occurs
>
> The above says you'll store line numbers in the variable
> "nodata_maxline" but that doesn't seem to be what you're doing below
> when you save a chain of "\$1" values in it and in your final printf you
> seem tio be saying it's value is a date. In either case, why set it to
> "!" when it's default initial value of zero/null would work just fine.

Your right, it is the (almost) unique date field for each record.
I wanted a visually distinct initial value for nodata_maxline even
though I don't expect to see it printed, hence "!".

>
> > }
> > FNR==1 {
> > # Accumulate input file names
> > if(infiles){
> > infiles = infiles "," infiles
> > } else {
> > infiles = FILENAME
> > }
>
> ITYM:
>
> if(infiles){
> infiles = infiles "," FILENAME
> } else {
> infiles = FILENAME
> }
>
> but you could write it more simply as just:
>
> infiles = infiles sep FILENAME
> sep = ","

Neat trick - I think I'll use it in the future.

>
> > }
> > {
> > tot_line=0; # sum of line data
> > num_line=0; # number of line data items with flag>0
>
> You don't need to end lines in semi-colon. I wouldn't but if you really
> feel a burning desire to do it, then do it consistently (some lines
> below don't end in semi-colons).

Your right, I've never just programmed in awk and, first pass, I'm
always switching statement-ending idioms.

>
> > # extract field info, skipping initial date field
> > for(field=2; field<=NF; field+=2){
> > datum=\$field;
> > flag=\$(field+1);
> > if(flag<1){
> > nodata++
>
> The above, for example, doesn't have the useless trailing semi-colon.
>
> > }else{
> > # check run of data-absent fields
> > if(nodata_max==nodata && (nodata>0)){
>
> Again, be consistent. If you're going to put one term in brackets
> "(nodata>0)" then put the other one "nodata_max==nodata" in brackets too.
>
> > nodata_maxline=nodata_maxline ", " \$1
> > }
> > if(nodata_max<nodata && (nodata>0)){
> > nodata_max=nodata
> > nodata_maxline=\$1
> > }
>
> ITYM:

ITYM??

>
> if(nodata_max==nodata && (nodata>0)){
> nodata_maxline=nodata_maxline ", " \$1
> }
> else if(nodata_max<nodata && (nodata>0)){
> nodata_max=nodata
> nodata_maxline=\$1
> }
>
> but of course that could be written more efficiently as:
>
> if (nodata>0)
> if(nodata_max==nodata){
> nodata_maxline=nodata_maxline ", " \$1
> }
> else if(nodata_max<nodata ){
> nodata_max=nodata
> nodata_maxline=\$1
> }
> }
>
> which MAY actually be just

One step too far. for the cases when nodata_max>nodata:

>
> if (nodata>0)
> if(nodata_max==nodata){
> nodata_maxline=nodata_maxline ", " \$1
> }
> else {
> nodata_max=nodata
> nodata_maxline=\$1
> }
> }
>
> but what you're doing here isn't immediately obvious and I couldn't be
> bothered to think about it...
>
> Regards,
>
> Ed.

Thanks for your comments so far. I'll keep them for the refactoring
stage(s) that I'll do if I use the prog for a demo or something.

Ed Morton

Jan 2, 2007, 4:06:34 PM1/2/07
to

> Ed Morton wrote:
<snip>
>>ITYM:
>
> ITYM??

Ed.

Hermann Peifer

Jan 15, 2007, 3:48:04 PM1/15/07
to

(...)

> Your right, it is the (almost) unique date field for each record.

There are indeed 5 duplicate dates in the 5471 records of the testdata:

1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26

This looks like a data bug. I will do some checks on the full data set
in order to see if I can find more of these.

Thanks for the hint, Hermann

Jan 16, 2007, 12:47:18 AM1/16/07
to
And thank you for the problem and the dataset. Just right for an
evening challenge.

Hermann

Jan 16, 2007, 6:25:16 AM1/16/07
to
> And thank you for the problem and the dataset. Just right for an
> evening challenge.

You are welcome.

BTW: I found in total 1567 duplicate dates in the complete data set
which are all bugs that I reported back to the data provider for
fixing. You hence made a valuable contribution to improving the quality
of our data set.

Hermann