Identifying the start of good data in a list

0 views
Skip to first unread message

tkp...@hotmail.com

unread,
Aug 26, 2008, 5:49:35 PM8/26/08
to
I have a list that starts with zeros, has sporadic data, and then has
good data. I define the point at which the data turns good to be the
first index with a non-zero entry that is followed by at least 4
consecutive non-zero data items (i.e. a week's worth of non-zero
data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
9], I would define the point at which data turns good to be 4 (1
followed by 2, 3, 4, 5).

I have a simple algorithm to identify this changepoint, but it looks
crude: is there a cleaner, more elegant way to do this?

flag = True
i=-1
j=0
while flag and i < len(retHist)-1:
i += 1
if retHist[i] == 0:
j = 0
else:
j += 1
if j == 5:
flag = False

del retHist[:i-4]

Thanks in advance for your help

Thomas Philips

bearoph...@lycos.com

unread,
Aug 26, 2008, 6:44:45 PM8/26/08
to
First solutions I have found, not much tested beside the few doctests:

from itertools import islice

def start_good1(alist, good_ones=4):
"""
Maybe more efficient for Python

>>> start_good = start_good1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
4
>>> start_good([])
-1
>>> start_good([0, 0])
-1
>>> start_good([0, 0, 0])
-1
>>> start_good([0, 0, 0, 0, 1])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 3])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4])
4
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5])
4
>>> start_good([1, 2, 3, 4])
0
>>> start_good([1, 2, 3])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
-1
"""
for i in xrange(len(alist) - good_ones + 1):
if all(islice(alist, i, i+good_ones)):
return i
return -1

def start_good2(alist, good_ones=4):
"""
Maybe more efficient for Psyco

>>> start_good = start_good2
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
4
>>> start_good([])
-1
>>> start_good([0, 0])
-1
>>> start_good([0, 0, 0])
-1
>>> start_good([0, 0, 0, 0, 1])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 3])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4])
4
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5])
4
>>> start_good([1, 2, 3, 4])
0
>>> start_good([1, 2, 3])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
-1
"""
n_good = 0
for i, el in enumerate(alist):
if alist[i]:
if n_good == good_ones:
return i - good_ones
else:
n_good += 1
else:
n_good = 0
if n_good == good_ones:
return len(alist) - good_ones
else:
return -1


if __name__ == "__main__":
import doctest
doctest.testmod()
print "Doctests done\n"

Bye,
bearophile

bearoph...@lycos.com

unread,
Aug 26, 2008, 6:49:10 PM8/26/08
to
Sorry, in the Psyco version replace this line:

for i, el in enumerate(alist):

With:
for i in xrange(len(alist)):

because Psyco doesn't digest enumerate well.

Bye,
bearophile

Matthew Fitzgibbons

unread,
Aug 26, 2008, 7:01:04 PM8/26/08
to pytho...@python.org
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Maybe this will do?

reHist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
count = 0
for i, d in enumerate(reHist):
if d == 0:
count = 0
else:
count += 1
if count == 5:
break
else:
raise Exception("No data found")
reHist = reHist[i-4:]
print reHist


-Matt

Mensanator

unread,
Aug 26, 2008, 7:16:55 PM8/26/08
to

Here's my attempt:

LL = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

i = 0

while (i<len(LL)) and (0 in LL[i:i+5]):
i += 1

print i, LL[i:i+5]

##
## 4 [1, 2, 3, 4, 5]
##

Emile van Sebille

unread,
Aug 26, 2008, 7:23:51 PM8/26/08
to pytho...@python.org
tkp...@hotmail.com wrote:
> I have a list that starts with zeros, has sporadic data, and then has
> good data. I define the point at which the data turns good to be the
> first index with a non-zero entry that is followed by at least 4
> consecutive non-zero data items (i.e. a week's worth of non-zero
> data). For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8,
> 9], I would define the point at which data turns good to be 4 (1
> followed by 2, 3, 4, 5).
>
> I have a simple algorithm to identify this changepoint, but it looks
> crude: is there a cleaner, more elegant way to do this?

>>> for ii,dummy in enumerate(retHist):
... if 0 not in retHist[ii:ii+5]:
... break

>>> del retHist[:ii]

Well, to the extent short and sweet is elegant...

Emile

td...@hotmail.com

unread,
Aug 26, 2008, 8:04:19 PM8/26/08
to

With regular expressions:

import re

hist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
hist_str = ''.join(str(i) for i in hist)
match = re.search(r'[1-9]{5, }', hist_str)
hist = hist[-5:] if match is None else hist[match.start():]

Or slightly more concise:

import re

hist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
match = re.search(r'[1-9]{5, }', ''.join(str(i) for i in hist))
hist = hist[-5:] if match is None else hist[match.start():]

Tommy McDaniel

tkp...@hotmail.com

unread,
Aug 26, 2008, 10:39:13 PM8/26/08
to

This is just what the doctor ordered. Thank you, everyone, for the
help.

Sincerely

Thomas Philips

Terry Reedy

unread,
Aug 26, 2008, 10:52:27 PM8/26/08
to pytho...@python.org

Matthew Fitzgibbons wrote:
> tkp...@hotmail.com wrote:

>
> reHist = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
> count = 0
> for i, d in enumerate(reHist):
> if d == 0:
> count = 0
> else:
> count += 1
> if count == 5:
> break
> else:
> raise Exception("No data found")
> reHist = reHist[i-4:]
> print reHist

This is what I would have suggested, except that the 'if count' test
should be left under the else clause, as in the original, so I consider
it the best of the responses ;-)

I thought of the repeated slicing alternative, but it would be slightly
slower. However, for occasional runs, the difference would be trivial.

Worrying about what Psyco does for this problem is rather premature
optimization.

My quarter's worth....

tjr


Scott David Daniels

unread,
Aug 27, 2008, 12:56:13 AM8/27/08
to


Here is one that can go iterator-to-iterator:

def started(source):
src = iter(source)
lead = []
for x in src:
if x:
lead.append(x)
if len(lead) == 5:
return itertools.chain(lead, src)
else:
lead = []


print list(started([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))


--Scott David Daniels
Scott....@Acm.Org

Steven D'Aprano

unread,
Aug 27, 2008, 11:50:14 AM8/27/08
to
On Tue, 26 Aug 2008 17:04:19 -0700, tdmj wrote:

> On Aug 26, 5:49 pm, tkp...@hotmail.com wrote:
>> I have a list that starts with zeros, has sporadic data, and then has
>> good data. I define the point at which the data turns good to be the
>> first index with a non-zero entry that is followed by at least 4
>> consecutive non-zero data items (i.e. a week's worth of non-zero data).
>> For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], I
>> would define the point at which data turns good to be 4 (1 followed by
>> 2, 3, 4, 5).

...

> With regular expressions:

Good grief. If you're suggesting that as a serious proposal, and not just
to prove it can be done, that's surely an example of "when all you have
is a hammer, everything looks like a nail" thinking.

In this particular case, your regex "solution" gives the wrong result,
indicating that you didn't test your code before posting. Hint:

re.search(r'[1-9]{5, }', "123456")

returns None.

The obvious fix for that specific bug is to use r'[1-9]{5,5}', but even
that will fail. Hint: what happens if an item has more than one digit?

Before posting another regex solution, make sure it does the right thing
with this:

[0, 0, 101, 0, 1002, 203, 3050, 4105, 5110, 623, 777]


--
Steven

Gerard flanagan

unread,
Aug 27, 2008, 3:00:46 PM8/27/08
to pytho...@python.org

data = [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


def itergood(indata):
indata = iter(indata)
buf = []
while len(buf) < 4:
buf.append(indata.next())
if buf[-1] == 0:
buf[:] = []
for x in buf:
yield x
for x in indata:
yield x


for d in itergood(data):
print d

George Sakkis

unread,
Aug 27, 2008, 4:42:36 PM8/27/08
to

Note that the version above (as well as most others posted) fail for
boundary cases; check out bearophile's doctest to see some of them.
Below are two more versions that pass all the doctests: the first
works only for lists and modifies them in place and the second works
for arbitrary iterables:

def clean_inplace(seq, good_ones=4):
start = 0
n = len(seq)
while start < n:
try: end = seq.index(0, start)
except ValueError: end = n
if end-start >= good_ones:
break
start = end+1
del seq[:start]

def clean_iter(iterable, good_ones=4):
from itertools import chain, islice, takewhile, dropwhile
iterator = iter(iterable)
is_zero = float(0).__eq__
while True:
# consume all zeros up to the next non-zero
iterator = dropwhile(is_zero, iterator)
# take up to `good_ones` non-zeros
good = list(islice(takewhile(bool,iterator), good_ones))
if not good: # iterator exhausted
return iterator
if len(good) == good_ones:
# found `good_ones` consecutive non-zeros;
# chain them to the rest items and return them
return chain(good, iterator)

HTH,
George

George Sakkis

unread,
Aug 27, 2008, 5:09:58 PM8/27/08
to

This seems the most efficient so far for arbitrary iterables. With a
few micro-optimizations it becomes:

from itertools import chain

def itergood(indata, good_ones=4):
indata = iter(indata); get_next = indata.next
buf = []; append = buf.append
while len(buf) < good_ones:
next = get_next()
if next: append(next)
else: del buf[:]
return chain(buf, indata)

$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
100 loops, best of 3: 3.09 msec per loop

And with Psyco enabled:
$ python -m timeit -s "x = 1000*[0, 0, 0, 1, 2, 3] + [1,2,3,4]; from
itergood import itergood" "list(itergood(x))"
1000 loops, best of 3: 466 usec per loop

George

bearoph...@lycos.com

unread,
Aug 27, 2008, 5:34:59 PM8/27/08
to
George Sakkis:

> This seems the most efficient so far for arbitrary iterables.

This one probably scores well with Psyco ;-)


def start_good3(seq, good_ones=4):
"""
>>> start_good = start_good3


>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
4
>>> start_good([])
-1
>>> start_good([0, 0])
-1
>>> start_good([0, 0, 0])
-1
>>> start_good([0, 0, 0, 0, 1])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 3])
-1
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4])
4
>>> start_good([0, 0, 1, 0, 1, 2, 3, 4, 5])
4
>>> start_good([1, 2, 3, 4])
0
>>> start_good([1, 2, 3])
-1

>>> start_good([0, 0, 1, 0, 1, 2, 0, 4])
-1
"""
n_good = 0

pos = 0
for el in seq:
if el:
if n_good == good_ones:
return pos - good_ones
else:
n_good += 1
elif n_good:
n_good = 0
pos += 1
if n_good == good_ones:
return pos - good_ones
else:
return -1

Bye,
bearophile

castironpi

unread,
Aug 27, 2008, 5:48:53 PM8/27/08
to
On Aug 27, 4:34 pm, bearophileH...@lycos.com wrote:
> George Sakkis:
>
> > This seems the most efficient so far for arbitrary iterables.
>
> This one probably scores well with Psyco ;-)
>
> def start_good3(seq, good_ones=4):
>     n_good = 0
>     pos = 0
>     for el in seq:
>         if el:
>             if n_good == good_ones:
>                 return pos - good_ones
>             else:
>                 n_good += 1
>         elif n_good:
>                 n_good = 0
>         pos += 1
>     if n_good == good_ones:
>         return pos - good_ones
>     else:
>         return -1
>
> Bye,
> bearophile

There, that's the regular machine for it. Too much thinking in
objects, and you can't even write a linked list anymore, right?

George Sakkis

unread,
Aug 27, 2008, 7:11:19 PM8/27/08
to
On Aug 27, 5:34 pm, bearophileH...@lycos.com wrote:
> George Sakkis:
>
> > This seems the most efficient so far for arbitrary iterables.
>
> This one probably scores well with Psyco ;-)

I think if you update this so that it returns the "good" iterable
instead of the starting index, it is equivalent to Gerard's solution.

George

George Sakkis

unread,
Aug 27, 2008, 7:14:25 PM8/27/08
to

And you're still wondering why do people killfile you or think you're
a failed AI project...

castironpi

unread,
Aug 27, 2008, 7:43:33 PM8/27/08
to

Just jumping on the bandwagon, George. And you see, everyone else's
passed the doctests perfectly. Were all the running times O( n* k )?

Gerard flanagan

unread,
Aug 28, 2008, 11:47:42 AM8/28/08
to pytho...@python.org
> --

I always forget the 'del slice' method for clearing a list, thanks.

I think that returning a `chain` means that the function is not itself a
generator. And so if the indata has length less than or equal
to the threshold (good_ones), an unhandled StopIteration is raised
before the return statement is reached.


G.

castironpi

unread,
Aug 28, 2008, 2:31:48 PM8/28/08
to

You gave me an idea-- maybe an arbitrary 'lookahead' iterable could be
useful. I haven't seen them that much on the newsgroup, but more than
once. IOW a buffered consumer. Something that you could check a
fixed number of next elements of. You might implement it as a
iterator with a __getitem__ method.

Example, unproduced:

>>> import itertools
>>> a= itertools.count( )
>>> a.next()
0
>>> a.next()
1
>>> a[ 3 ]
5
>>> a.next()
2
>>> a[ 3 ]
6

Does this make sense at all?

td...@hotmail.com

unread,
Aug 28, 2008, 6:56:11 PM8/28/08
to
On Aug 27, 11:50 am, Steven D'Aprano <st...@REMOVE-THIS-

Hey, it's clearer than a lot of the other proposals here. Too bad it
doesn't work. This is why you don't post after 8 p.m. after being at
work all day. I was seeing what I now recall as incorrect answers, but
at the time I was in the midst of a brainfart and for some reason took
them to be right. It can be made to work by removing the space in
"{5, }", inserting some kind of marker between the numbers, and using
the right regular expression to recognize nonzero numbers between the
markers, but I think I've already said too much in this thread.

Tommy McDaniel

Jorgen Grahn

unread,
Aug 29, 2008, 10:43:52 AM8/29/08
to
On 27 Aug 2008 15:50:14 GMT, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
> On Tue, 26 Aug 2008 17:04:19 -0700, tdmj wrote:
>
>> On Aug 26, 5:49 pm, tkp...@hotmail.com wrote:
>>> I have a list that starts with zeros, has sporadic data, and then has
>>> good data. I define the point at which the data turns good to be the
>>> first index with a non-zero entry that is followed by at least 4
>>> consecutive non-zero data items (i.e. a week's worth of non-zero data).
>>> For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], I
>>> would define the point at which data turns good to be 4 (1 followed by
>>> 2, 3, 4, 5).
>
> ...
>
>> With regular expressions:
>
> Good grief. If you're suggesting that as a serious proposal, and not just
> to prove it can be done, that's surely an example of "when all you have
> is a hammer, everything looks like a nail" thinking.

Maybe I'm stumbling into a "REs are evil" flamewar here. Anyway:

He has a point though: this *can* be seen as a regex problem. Only a
solution which builds a string first is only good for laughs or
(possibly) quick hacks. What's missing is an RE library for lists of
objects, rather than just strings and Unicode strings.

Not sure such a library would be worth implementing -- problems like
this one are rare, I think.

/Jorgen

--
// Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
\X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!

castironpi

unread,
Aug 29, 2008, 12:49:09 PM8/29/08
to
On Aug 29, 9:43 am, Jorgen Grahn <grahn+n...@snipabacken.se> wrote:
> On 27 Aug 2008 15:50:14 GMT, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
>
>
>
> > On Tue, 26 Aug 2008 17:04:19 -0700, tdmj wrote:
>
> >> On Aug 26, 5:49 pm, tkp...@hotmail.com wrote:
> >>> I have a list that starts with zeros, has sporadic data, and then has
> >>> good data. I define the point at  which the data turns good to be the
> >>> first index with a non-zero entry that is followed by at least 4
> >>> consecutive non-zero data items (i.e. a week's worth of non-zero data).
> >>> For example, if my list is [0, 0, 1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], I
> >>> would define the point at which data turns good to be 4 (1 followed by
> >>> 2, 3, 4, 5).
>
> He has a point though: this *can* be seen as a regex problem. Only a
> solution which builds a string first is only good for laughs or
> (possibly) quick hacks. What's missing is an RE library for lists of
> objects, rather than just strings and Unicode strings.
>
> Not sure such a library would be worth implementing -- problems like
> this one are rare, I think.

Every now and then, you see a proposal or a package for a finite state
machine--- how would you encode comparing of values into a string, if
you're not comparing a string?

Reply all
Reply to author
Forward
0 new messages