
The maximum data gap


Hermann Peifer

Dec 14, 2006, 1:35:53 AM
Hi All,

I have to analyse data files that have the following format:
Each row corresponds to 1 day and the field logic is: $1 is the date,
followed by 24 value/flag pairs, representing measurements at 01:00,
02:00 ... 24:00 of the respective day. In short:

<date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>

Some test data is available at:
http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document

I have to sum up the values (per day and only valid data, i.e. with
flag>0) in order to calculate the mean. That's not too difficult.
However, I also need to know what the "maximum data gap" is, i.e. the
longest period with successive invalid measurements (i.e values with
flag<=0)

My current code looks like this:

{
    date = $1
    for (i = 1; i <= 24; i++)
        if ($(i*2+1) > 0) {
            num_valid++
            sum += $(i*2)
        } else {
            # find out what the max_gap for this row is
        }

    if (num_valid > 13 && max_gap <= 6) {
        print date, sum/num_valid, 1
    } else {
        # print something else
    }
}

Could someone help me with the max_gap logic?

Thanks, Hermann

Ed Morton

Dec 14, 2006, 9:07:14 AM

I'd have written the loop just a little differently, so instead of this:

for (i=1; i<=24; i++)
    if ($(i*2+1) > 0) {
        num_valid++
        sum += $(i*2)
    }

I'd do this:

for (i=3; i<=NF; i+=2)
    if ($i > 0) {
        num_valid++
        sum += $(i-1)
    }

because I think it's clearer (and I expect it's also more efficient, but
I don't care so much about that). Note the test is "$i > 0", not just
"$i", since invalid flags can be negative as well as zero.

Given that, it's an easy tweak to get the gap (assuming by "gap" you
mean number of intervals):

for (i=3; i<=NF; i+=2)
    if ($i > 0) {
        num_valid++
        sum += $(i-1)
        last_valid = i
    } else {
        this_gap = (i - last_valid) / 2
        max_gap = (this_gap > max_gap ? this_gap : max_gap)
    }
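As a quick sanity check of that tweak, here is a made-up sample row of 8 value/flag pairs (a sketch of mine, assuming a POSIX awk on the PATH; any number of pairs works since the loop runs to NF):

```shell
#!/bin/sh
# Made-up sample row: a date plus 8 value/flag pairs (every value is 10).
# Flags: 1 0 0 1 0 0 0 1 -> longest run of invalid readings is 3.
row='2006-12-14 10 1 10 0 10 0 10 1 10 0 10 0 10 0 10 1'

out=$(printf '%s\n' "$row" | awk '{
    for (i = 3; i <= NF; i += 2)
        if ($i > 0) {                   # valid reading
            num_valid++
            sum += $(i-1)
            last_valid = i
        } else {                        # invalid: distance back to last valid flag
            this_gap = (i - last_valid) / 2
            max_gap = (this_gap > max_gap ? this_gap : max_gap)
        }
    print num_valid, sum, max_gap
}')
echo "$out"    # 3 30 3
```

Three valid readings summing to 30, and a maximum gap of 3 intervals, as expected.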

Regards,

Ed.

Hermann

Dec 14, 2006, 10:56:43 AM
Ed Morton wrote:
> I'd have written the loop just a little differently, so instead of this:
>
> for (i=1; i<=24; i++)
>     if ($(i*2+1) > 0) {
>         num_valid++
>         sum += $(i*2)
>     }
>
> I'd do this:
>
> for (i=3; i<=NF; i+=2)
>     if ($i > 0) {
>         num_valid++
>         sum += $(i-1)
>     }
>
> because I think it's clearer (and I expect it's also more efficient, but
> I don't care so much about that).

My approach was based on the thought that the i-value should always be
equal to the time (hour) the measurement stands for. E.g. i=1, then I
loop over the value/flag for 01:00H which are fields $2/$3 in my
records.

>
> Given that, it's an easy tweak to get the gap (assuming by "gap" you
> mean number of intervals):

Correct. The intervals are the 24 hours of a day in my case.

>
> for (i=3; i<=NF; i+=2)
>     if ($i > 0) {
>         num_valid++
>         sum += $(i-1)
>         last_valid = i
>     } else {
>         this_gap = (i - last_valid) / 2
>         max_gap = (this_gap > max_gap ? this_gap : max_gap)
>     }
>

It looks to me like, in a row with 1 bad value first and then 23
successive good values, the result would be this_gap = (3 - 0) / 2 =
1.5?
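One way around that edge case (my own assumption, not something settled in the thread) is to seed last_valid with 1, the index of the date field, so a leading gap still comes out as a whole number of intervals:

```shell
#!/bin/sh
# Made-up row whose FIRST reading is invalid, followed by 7 valid ones.
row='2006-12-14 10 0 10 1 10 1 10 1 10 1 10 1 10 1 10 1'

out=$(printf '%s\n' "$row" | awk '{
    last_valid = 1                            # seed with the date field index
    for (i = 3; i <= NF; i += 2)
        if ($i > 0) last_valid = i
        else {
            this_gap = (i - last_valid) / 2   # now always a whole number
            if (this_gap > max_gap) max_gap = this_gap
        }
    print max_gap
}')
echo "$out"    # 1
```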

I will try it out at home and perhaps blend your code with mine.

Thanks, Hermann

Hermann Peifer

Dec 15, 2006, 1:12:38 AM
Ed Morton wrote:
> Hermann Peifer wrote:
>> Hi All,
>>
>> I have to analyse data files that have the following format:
>> Each row corresponds to 1 day and the field logic is: $1 is the date,
>> followed by 24 value/flag pairs, representing measurements at 01:00,
>> 02:00 ... 24:00 of the respective day. In short:
>>
>> <date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
>>
>> Some test data is available at:
>> http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
>>
>>
>> I have to sum up the values (per day and only valid data, i.e. with
>> flag>0) in order to calculate the mean. That's not too difficult.
>> However, I also need to know what the "maximum data gap" is, i.e. the
>> longest period with successive invalid measurements (i.e values with
>> flag<=0)
>>
>> My current code looks like this:
>> ...

>
> I'd have written the loop just a little differently, so instead of this:

> ...

I took lines from both my original code and yours. The best properly
working solution I was able to produce is:

{
    num_valid = 0
    sum_valid = 0
    last_valid = 0
    this_gap = 0
    max_gap = 0
    date = $1

    for (i = 1; i <= 24; i++)
        if ($(i*2+1) > 0) {
            num_valid++
            sum_valid += $(i*2)
            last_valid = i
        } else {
            this_gap = (i - last_valid)
            max_gap = (this_gap > max_gap ? this_gap : max_gap)
        }

    if (num_valid >= 13 && max_gap <= 6) {
        print date, sum_valid/num_valid, 1, num_valid, max_gap
    } else {
        print date, (num_valid ? sum_valid/num_valid : 0), -9, num_valid, max_gap
    }
}


I still can't believe that one needs so much code to find out that,
e.g. in the row below, the "longest period with successive invalid
values" (i.e. a "v" followed by a flag <= 0) is 3.


date v 1 v 0 v 1 v -1 v 0 v -1 v 1 v -1 v 1 v 1 v -1 v 0 v 1 v 1 v -1
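For what it's worth, the per-row gap logic can be squeezed into a simple running counter — a sketch of mine (POSIX awk assumed), run against exactly that row:

```shell
#!/bin/sh
# The row from above: 15 value/flag pairs, longest invalid run = 3.
row='date v 1 v 0 v 1 v -1 v 0 v -1 v 1 v -1 v 1 v 1 v -1 v 0 v 1 v 1 v -1'

out=$(printf '%s\n' "$row" | awk '{
    run = max = 0
    for (i = 3; i <= NF; i += 2) {
        run = ($i > 0) ? 0 : run + 1   # extend or reset the current gap
        if (run > max) max = run
    }
    print max
}')
echo "$out"    # 3
```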

William James

Dec 15, 2006, 7:23:18 AM

Perhaps you would prefer to use Ruby:

$, = " "   # Output field-separator.
ARGF.each { |line|
  fields = line.split
  date = fields.first
  flags = fields.map{|x| Integer(x) rescue nil}.compact
  valid = flags.select{|n| n > 0}
  num_valid = valid.size
  sum_valid = valid.inject{|a,b| a+b}
  max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
            join("").scan(/i+/).map{|s| s.size}.max
  if num_valid > 12 and max_gap < 7
    print date, sum_valid/num_valid, 1, num_valid, max_gap
  else
    print date, (num_valid ? sum_valid/num_valid : 0), -9,
          num_valid, max_gap
  end
  puts
}
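The flags-to-string trick also translates to awk; a sketch of mine (using match(), RSTART and RLENGTH from POSIX awk) on a made-up row of 7 pairs:

```shell
#!/bin/sh
# Made-up row: flags 1 0 -1 -1 1 0 1 -> string "viiiviv", longest i-run = 3.
row='date 9 1 9 0 9 -1 9 -1 9 1 9 0 9 1'

out=$(printf '%s\n' "$row" | awk '{
    s = ""
    for (i = 3; i <= NF; i += 2)
        s = s (($i > 0) ? "v" : "i")
    max = 0
    while (match(s, /i+/)) {           # peel off one run of "i"s at a time
        if (RLENGTH > max) max = RLENGTH
        s = substr(s, RSTART + RLENGTH)
    }
    print max
}')
echo "$out"    # 3
```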

Hermann Peifer

Dec 15, 2006, 8:18:32 AM
William James wrote:
>
> Perhaps you would prefer to use Ruby:
>

Not really. See below.
Anyway: thanks for the hint,
Hermann
--------------
$ cat mycode


{
    num_valid=0
    sum_valid=0
    last_valid=0
    this_gap=0
    max_gap=0
    date=$1
    for (i=1; i<=24; i++)
        if ($(i*2+1) > 0) {
            num_valid++
            sum_valid += $(i*2)
            last_valid = i
        } else {
            this_gap = (i - last_valid)
            max_gap = (this_gap > max_gap ? this_gap : max_gap)
        }
    if (num_valid>=13 && max_gap<=6) {
        print date, sum_valid/num_valid, 1, num_valid, max_gap
    } else {
        print date, (num_valid ? sum_valid/num_valid : 0), -9, num_valid, max_gap
    }
}

$ tr -d " " <mycode |wc -c
401

$ cat yourcode


$, = " " # Output field-separator.
ARGF.each { |line|
  fields = line.split
  date = fields.first
  flags = fields.map{|x| Integer(x) rescue nil}.compact
  valid = flags.select{|n| n > 0}
  num_valid = valid.size
  sum_valid = valid.inject{|a,b| a+b}
  max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
            join("").scan(/i+/).map{|s| s.size}.max
  if num_valid > 12 and max_gap < 7
    print date, sum_valid/num_valid, 1, num_valid, max_gap
  else
    print date, (num_valid ? sum_valid/num_valid : 0), -9,
          num_valid, max_gap
  end
  puts
}

$ tr -d " " <yourcode |wc -c
461

William James

Dec 15, 2006, 12:18:02 PM

This is stunningly naive. However, if that is what you think
you want ...

$, = " "
while s = gets do
  date, *flags = s.split
  flags.map!{|x| Integer(x) rescue nil}.compact!
  v = flags.select{|n| n > 0}
  sum_valid = v.inject{|a,b| a+b}
  max_gap = flags.map{|n| n > 0 ? "v" : "i" }.
            join.scan(/i+/).map{|s| s.size}.max
  print date, (v.size > 0 ? sum_valid.to_f/v.size : 0),
        (v.size > 12 && max_gap < 7 ? 1 : -9),
        v.size, max_gap, "\n"
end

Paddy

Jan 1, 2007, 3:29:38 PM

Hermann Peifer wrote:

> Hi All,
>
> I have to analyse data files that have the following format:
> Each row corresponds to 1 day and the field logic is: $1 is the date,
> followed by 24 value/flag pairs, representing measurements at 01:00,
> 02:00 ... 24:00 of the respective day. In short:
>
> <date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
>
> Some test data is available at:
> http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
>
> I have to sum up the values (per day and only valid data, i.e. with
> flag>0) in order to calculate the mean. That's not too difficult.
> However, I also need to know what the "maximum data gap" is, i.e. the
> longest period with successive invalid measurements (i.e values with
> flag<=0)

Hi Hermann.
A little late I know, but I did enjoy this.

First, I selected a few lines from your data that allowed me to develop
my program.

egrep '1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)' readings.txt > readingsx.txt

The gawk program, which I called readings.awk, produces the following
output on the reduced input:

$ gawk -f readings.awk readingsx.txt
Line: 1991-03-31 Reject: 0 Accept: 24 Line_tot: 565.000 Line_avg: 23.542
Line: 1991-03-31 Reject: 23 Accept: 1 Line_tot: 40.000 Line_avg: 40.000
Line: 1991-04-01 Reject: 1 Accept: 23 Line_tot: 534.000 Line_avg: 23.217
Line: 1991-04-02 Reject: 0 Accept: 24 Line_tot: 475.000 Line_avg: 19.792
Line: 1991-04-03 Reject: 0 Accept: 24 Line_tot: 335.000 Line_avg: 13.958
Line: 1991-04-04 Reject: 0 Accept: 24 Line_tot: 261.000 Line_avg: 10.875
Line: 1993-03-04 Reject: 24 Accept: 0 Line_tot: 0.000 Line_avg: 0.000
Line: 1993-03-05 Reject: 14 Accept: 10 Line_tot: 454.000 Line_avg: 45.400
Line: 2004-12-26 Reject: 1 Accept: 23 Line_tot: 154.600 Line_avg: 6.722
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readingsx.txt
Total = 2865.900
Readings = 176
Average = 16.284

Maximum run(s) of 38 consecutive false readings ends at line starting with date(s): 1993-03-05


When run on the full dataset I get the following (only summary shown)

$ gawk -f readings.awk readings.txt |tail
Line: 2004-12-29 Reject: 1 Accept: 23 Line_tot: 56.300 Line_avg: 2.448
Line: 2004-12-30 Reject: 1 Accept: 23 Line_tot: 65.300 Line_avg: 2.839
Line: 2004-12-31 Reject: 1 Accept: 23 Line_tot: 47.300 Line_avg: 2.057

File(s) = readings.txt
Total = 1358393.400
Readings = 129403
Average = 10.497

Maximum run(s) of 589 consecutive false readings ends at line starting with date(s): 1993-03-05


Note: the count of consecutive false readings carries over multiple lines.
Note: in gvim, the following search expression highlights all the rejected readings:
/0\.0\+[\t ]\+\(-[0-9]\|\<0\)[^0-9.]


======Start readings.awk=========
# Author Donald 'Paddy' McCarthy Jan 01 2007

BEGIN{
    nodata = 0;          # Current run of consecutive flags<0 in lines of file
    nodata_max=-1;       # Max consecutive flags<0 in lines of file
    nodata_maxline="!";  # ... and line number(s) where it occurs
}
FNR==1 {
    # Accumulate input file names
    if(infiles){
        infiles = infiles "," infiles
    } else {
        infiles = FILENAME
    }
}
{
    tot_line=0; # sum of line data
    num_line=0; # number of line data items with flag>0

    # extract field info, skipping initial date field
    for(field=2; field<=NF; field+=2){
        datum=$field;
        flag=$(field+1);
        if(flag<1){
            nodata++
        }else{
            # check run of data-absent fields
            if(nodata_max==nodata && (nodata>0)){
                nodata_maxline=nodata_maxline ", " $1
            }
            if(nodata_max<nodata && (nodata>0)){
                nodata_max=nodata
                nodata_maxline=$1
            }
            # re-initialise run of nodata counter
            nodata=0;
            # gather values for averaging
            tot_line+=datum
            num_line++;
        }
    }

    # totals for the file so far
    tot_file += tot_line
    num_file += num_line

    printf "Line: %11s Reject: %2i Accept: %2i Line_tot: %10.3f Line_avg: %10.3f\n", \
        $1, ((NF -1)/2) -num_line, num_line, tot_line, (num_line>0)? tot_line/num_line: 0

    # debug prints of original data plus some of the computed values
    #printf "%s %15.3g %4i\n", $0, tot_line, num_line
    #printf "%s\n %15.3f %4i %4i %4i %s\n", $0, tot_line, num_line, nodata, nodata_max, nodata_maxline
}

END{
    printf "\n"
    printf "File(s) = %s\n", infiles
    printf "Total = %10.3f\n", tot_file
    printf "Readings = %6i\n", num_file
    printf "Average = %10.3f\n", tot_file / num_file

    printf "\nMaximum run(s) of %i consecutive false readings ends at line starting with date(s): %s\n", nodata_max, nodata_maxline
}

======Finish readings.awk=========
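Stripped of the reporting, the cross-line run counting reduces to this sketch (a simplified illustration of the idea, not Paddy's actual program; the two data rows are made up):

```shell
#!/bin/sh
# Two made-up rows of 3 value/flag pairs each; an invalid run of 3
# spans the row boundary (2 at the end of d1, 1 at the start of d2).
data='d1 9 1 9 0 9 0
d2 9 0 9 1 9 1'

out=$(printf '%s\n' "$data" | awk '
    {
        for (i = 3; i <= NF; i += 2)
            if ($i > 0) {          # run ends: record it if it is the longest
                if (nodata > max) { max = nodata; where = $1 }
                nodata = 0
            } else nodata++        # run continues, even across lines
    }
    END {
        if (nodata > max) { max = nodata; where = $1 }   # trailing run
        print max, where
    }')
echo "$out"    # 3 d2
```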

Paddy

Jan 1, 2007, 9:36:41 PM
I also wrote versions in Perl and Python and timed their execution,
(although they are written to compare the language styles rather than
timing).
See my blog entry:

http://paddy3118.blogspot.com/2007/01/data-mining-in-three-language05.html

- Paddy.

Ed Morton

Jan 2, 2007, 10:40:53 AM
Paddy wrote:
> Hermann Peifer wrote:
>
>
>>Hi All,
>>
>>I have to analyse data files that have the following format:
>>Each row corresponds to 1 day and the field logic is: $1 is the date,
>>followed by 24 value/flag pairs, representing measurements at 01:00,
>>02:00 ... 24:00 of the respective day. In short:
>>
>><date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
>>
>>Some test data is available at:
>>http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
>>
>>I have to sum up the values (per day and only valid data, i.e. with
>>flag>0) in order to calculate the mean. That's not too difficult.
>>However, I also need to know what the "maximum data gap" is, i.e. the
>>longest period with successive invalid measurements (i.e values with
>>flag<=0)
>
>
> Hi Hermann.
> A little late I know, but I did enjoy this.
>
> First, I selected a few lines from your data that allowed me to develop
> my program.
>
> egrep '1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)'
> readings.txt > readingsx.txt

ITYM:

awk '/1991-03-31|1991-04-0[1-4]|1993-03-0[45]|2004-12-(26|31)/' readings.txt > readingsx.txt

;-).

No need to init a variable to zero.

> nodata_max=-1; # Max consecutive flags<0 in lines of file
> nodata_maxline="!"; # ... and line number(s) where it occurs

The above says you'll store line numbers in the variable
"nodata_maxline", but that doesn't seem to be what you're doing below
when you save a chain of "$1" values in it, and in your final printf you
seem to be saying its value is a date. In either case, why set it to
"!" when its default initial value of zero/null would work just fine?

> }
> FNR==1 {
> # Accumulate input file names
> if(infiles){
> infiles = infiles "," infiles
> } else {
> infiles = FILENAME
> }

ITYM:

if(infiles){
    infiles = infiles "," FILENAME
} else {
    infiles = FILENAME
}

but you could write it more simply as just:

infiles = infiles sep FILENAME
sep = ","
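A tiny demonstration of the trick (the file names here are hypothetical stand-ins for successive FILENAME values):

```shell
#!/bin/sh
out=$(awk 'BEGIN {
    n = split("a.txt b.txt c.txt", names)   # stand-ins for FILENAME
    for (k = 1; k <= n; k++) {
        infiles = infiles sep names[k]      # first pass: sep is still empty
        sep = ","                           # later passes: comma-separate
    }
    print infiles
}')
echo "$out"    # a.txt,b.txt,c.txt
```

The separator variable only becomes "," after the first assignment, so no leading comma ever appears.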

> }
> {
> tot_line=0; # sum of line data
> num_line=0; # number of line data items with flag>0

You don't need to end lines with a semi-colon. I wouldn't, but if you
really feel a burning desire to do it, then do it consistently (some
lines below don't end in semi-colons).

> # extract field info, skipping initial date field
> for(field=2; field<=NF; field+=2){
> datum=$field;
> flag=$(field+1);
> if(flag<1){
> nodata++

The above, for example, doesn't have the useless trailing semi-colon.

> }else{
> # check run of data-absent fields
> if(nodata_max==nodata && (nodata>0)){

Again, be consistent. If you're going to put one term in brackets
"(nodata>0)" then put the other one "nodata_max==nodata" in brackets too.

> nodata_maxline=nodata_maxline ", " $1
> }
> if(nodata_max<nodata && (nodata>0)){
> nodata_max=nodata
> nodata_maxline=$1
> }

ITYM:

if(nodata_max==nodata && (nodata>0)){
    nodata_maxline=nodata_maxline ", " $1
}
else if(nodata_max<nodata && (nodata>0)){
    nodata_max=nodata
    nodata_maxline=$1
}

but of course that could be written more efficiently as:

if (nodata>0){
    if(nodata_max==nodata){
        nodata_maxline=nodata_maxline ", " $1
    }
    else if(nodata_max<nodata){
        nodata_max=nodata
        nodata_maxline=$1
    }
}

which MAY actually be just:

if (nodata>0){
    if(nodata_max==nodata){
        nodata_maxline=nodata_maxline ", " $1
    }
    else {
        nodata_max=nodata
        nodata_maxline=$1
    }
}

but what you're doing here isn't immediately obvious and I couldn't be
bothered to think about it...

Regards,

Ed.

Paddy

Jan 2, 2007, 1:02:41 PM

Ed Morton wrote:

Thanks, Ed, for your critique. I've added some notes to yours below.

> Paddy wrote:
> > Hermann Peifer wrote:
> >
> >
> >>Hi All,
> >>
> >>I have to analyse data files that have the following format:
> >>Each row corresponds to 1 day and the field logic is: $1 is the date,
> >>followed by 24 value/flag pairs, representing measurements at 01:00,
> >>02:00 ... 24:00 of the respective day. In short:
> >>
> >><date> <val1> <flag1> <val2> <flag2> ... <val24> <flag24>
> >>
> >>Some test data is available at:
> >>http://cdrtest.eionet.europa.eu/at/eea/colqleaoq/envrwlbtw/AT0001A0000100100hour.1-1-1990.31-12-2004/manage_document
> >>
> >>I have to sum up the values (per day and only valid data, i.e. with
> >>flag>0) in order to calculate the mean. That's not too difficult.
> >>However, I also need to know what the "maximum data gap" is, i.e. the
> >>longest period with successive invalid measurements (i.e values with
> >>flag<=0)
> >
> >
> > Hi Hermann.
> > A little late I know, but I did enjoy this.
> >

<SNIP>

> > ======Start readings.awk=========
> > # Author Donald 'Paddy' McCarthy Jan 01 2007
> >
> > BEGIN{
> > nodata = 0;          # Current run of consecutive flags<0 in lines of file
>
> No need to init a variable to zero.

... Except when I want to make a point of its (automatic) initial
value and hang a comment on it.

>
> > nodata_max=-1; # Max consecutive flags<0 in lines of file
> > nodata_maxline="!"; # ... and line number(s) where it occurs
>
> The above says you'll store line numbers in the variable
> "nodata_maxline", but that doesn't seem to be what you're doing below
> when you save a chain of "$1" values in it, and in your final printf you
> seem to be saying its value is a date. In either case, why set it to
> "!" when its default initial value of zero/null would work just fine?

You're right, it is the (almost) unique date field for each record.
I wanted a visually distinct initial value for nodata_maxline even
though I don't expect to see it printed, hence "!".

>
> > }
> > FNR==1 {
> > # Accumulate input file names
> > if(infiles){
> > infiles = infiles "," infiles
> > } else {
> > infiles = FILENAME
> > }
>
> ITYM:
>
> if(infiles){
> infiles = infiles "," FILENAME
> } else {
> infiles = FILENAME
> }
>
> but you could write it more simply as just:
>
> infiles = infiles sep FILENAME
> sep = ","

Neat trick - I think I'll use it in the future.


>
> > }
> > {
> > tot_line=0; # sum of line data
> > num_line=0; # number of line data items with flag>0
>
> You don't need to end lines in semi-colon. I wouldn't but if you really
> feel a burning desire to do it, then do it consistently (some lines
> below don't end in semi-colons).

You're right; I've never programmed just in awk and, on a first pass,
I'm always switching statement-ending idioms.


>
> > # extract field info, skipping initial date field
> > for(field=2; field<=NF; field+=2){
> > datum=$field;
> > flag=$(field+1);
> > if(flag<1){
> > nodata++
>
> The above, for example, doesn't have the useless trailing semi-colon.
>
> > }else{
> > # check run of data-absent fields
> > if(nodata_max==nodata && (nodata>0)){
>
> Again, be consistent. If you're going to put one term in brackets
> "(nodata>0)" then put the other one "nodata_max==nodata" in brackets too.
>
> > nodata_maxline=nodata_maxline ", " $1
> > }
> > if(nodata_max<nodata && (nodata>0)){
> > nodata_max=nodata
> > nodata_maxline=$1
> > }
>
> ITYM:

ITYM??


>
> if(nodata_max==nodata && (nodata>0)){
> nodata_maxline=nodata_maxline ", " $1
> }
> else if(nodata_max<nodata && (nodata>0)){
> nodata_max=nodata
> nodata_maxline=$1
> }
>
> but of course that could be written more efficiently as:
>
> if (nodata>0)
> if(nodata_max==nodata){
> nodata_maxline=nodata_maxline ", " $1
> }
> else if(nodata_max<nodata ){
> nodata_max=nodata
> nodata_maxline=$1
> }
> }
>
> which MAY actually be just

That's one step too far, for the cases when nodata_max>nodata:


>
> if (nodata>0)
> if(nodata_max==nodata){
> nodata_maxline=nodata_maxline ", " $1
> }
> else {
> nodata_max=nodata
> nodata_maxline=$1
> }
> }
>
> but what you're doing here isn't immediately obvious and I couldn't be
> bothered to think about it...
>
> Regards,
>
> Ed.

Thanks for your comments so far. I'll keep them for the refactoring
stage(s) that I'll do if I use the prog for a demo or something.

- Paddy.

Ed Morton

Jan 2, 2007, 4:06:34 PM
Paddy wrote:

> Ed Morton wrote:
<snip>
>>ITYM:
>
> ITYM??

http://www.google.com/search?as_q=ITYM

Ed.

Hermann Peifer

Jan 15, 2007, 3:48:04 PM
Paddy wrote:

(...)

> You're right, it is the (almost) unique date field for each record.

There are indeed 5 duplicate dates in the 5471 records of the test data:

1990-03-25
1991-03-31
1992-03-29
1993-03-28
1995-03-26

This looks like a data bug. I will do some checks on the full data set
in order to see if I can find more of these.

Thanks for the hint, Hermann

Paddy

Jan 16, 2007, 12:47:18 AM
And thank you for the problem and the dataset. Just right for an
evening challenge.

- Paddy.

Hermann

Jan 16, 2007, 6:25:16 AM
Paddy wrote:
> And thank you for the problem and the dataset. Just right for an
> evening challenge.

You are welcome.

BTW: I found in total 1567 duplicate dates in the complete data set
which are all bugs that I reported back to the data provider for
fixing. You hence made a valuable contribution to improving the quality
of our data set.

Hermann
