I found this in the great Python Cookbook, which allows me to process
every word in a file. But how do I use it to count the items generated?
def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()
for word in words_of_file(thefilepath):
    dosomethingwith(word)
The best I could come up with:
def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()
len(list(words_of_file(thefilepath)))
But that seems clunky.
My preference would be (with the original definition for
words_of_the_file) to code
numwords = sum(1 for w in words_of_the_file(thefilepath))
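For reference, here is a runnable sketch of that pattern (using a with-statement, so the file is closed even if the consumer stops iterating early; the counting line is shown as a comment because it needs a real file path):

```python
def words_of_file(thefilepath, line_to_words=str.split):
    # Yield each word of each line; `with` guarantees the file is closed.
    with open(thefilepath) as the_file:
        for line in the_file:
            for word in line_to_words(line):
                yield word

# Count without materializing a list:
# numwords = sum(1 for w in words_of_file(thefilepath))
```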
Alex
As clunky as it seems, I don't think you can beat it in terms of
brevity; if you care about memory efficiency though, here's what I use:
def length(iterable):
    try: return len(iterable)
    except:
        i = 0
        for x in iterable: i += 1
        return i
You can even shadow the builtin len() if you prefer:
import __builtin__
def len(iterable):
    try: return __builtin__.len(iterable)
    except:
        i = 0
        for x in iterable: i += 1
        return i
HTH,
George
rd
"There is no abstract art. You must always start with something.
Afterward you can remove all traces of reality."--Pablo Picasso
Alex's example amounted to something like that, for the generator
case. Notice that the argument to sum() was a generator expression.
The sum function then iterated through it.
True. Changing the except clause here to
except: return sum(1 for x in iterable)
keeps George's optimization (O(1), not O(N), for containers) and is a
bit faster (while still O(N)) for non-container iterables.
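Putting George's fast path and this tweak together gives something like the following (a sketch; `length` is not a stdlib function):

```python
def length(iterable):
    """Count items: O(1) via len() when possible, O(N) otherwise."""
    try:
        return len(iterable)               # containers: constant time
    except TypeError:                      # iterators/generators: count
        return sum(1 for _ in iterable)
```

Both `length([1, 2, 3])` and `length(x for x in range(3))` return 3, but only the second consumes its argument.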
Alex
numwords = len(list(words_of_the_file(thefilepath)))
will be advantageous.
For that matter, would it be an advantage for len() to operate
on iterables? It could be faster and thriftier on memory than
either of the above, and my first impression is that it's
sufficiently natural not to offend those suspicious of
language bloat.
print len(itertools.count())
Ouch!!
>> except: return sum(1 for x in iterable)
>> keeps George's optimization (O(1), not O(N), for containers) and is a
>> bit faster (while still O(N)) for non-container iterables.
Everything was going just great. Now I have to think again.
Thank you all.
rick
How is this worse than list(itertools.count()) ?
> In article <1hfarom.1lfetjc18leddeN%al...@mac.com>,
> Alex Martelli <al...@mac.com> wrote:
> .
> .
> .
> >My preference would be (with the original definition for
> >words_of_the_file) to code
> >
> > numwords = sum(1 for w in words_of_the_file(thefilepath))
> .
> .
> .
> There are times when
>
> numwords = len(list(words_of_the_file(thefilepath)))
>
> will be advantageous.
Can you please give some examples? None comes readily to mind...
> For that matter, would it be an advantage for len() to operate
> on iterables? It could be faster and thriftier on memory than
> either of the above, and my first impression is that it's
> sufficiently natural not to offend those suspicious of
> language bloat.
I'd be a bit worried about having len(x) change x's state into an
unusable one. Yes, it happens in other cases (if y in x:), but adding
more such problematic cases doesn't seem advisable to me anyway -- I'd
evaluate this proposal as a -0, even taking into account the potential
optimizations to be garnered by having some iterables expose __len__
(e.g., a genexp such as (f(x) for x in foo), without an if-clause, might
be optimized to delegate __len__ to foo -- again, there may be semantic
alterations lurking that make this optimization a bit iffy).
Alex
It's a slightly worse trap because list(x) ALWAYS iterates on x (just
like "for y in x:"), while len(x) MAY OR MAY NOT iterate on x (under
Cameron's proposal; it currently never does).
Yes, there are other subtle traps of this ilk already in Python, such as
"if y in x:" -- this, too, may or may not iterate. But the fact that a
potential problem exists in some corner cases need not be a good reason
to extend the problem to higher frequency;-).
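The existing "if y in x:" trap is easy to demonstrate; a quick illustration:

```python
# A generator is consumed by a membership test just as by iteration.
g = (n for n in range(5))
print(2 in g)     # True -- items 0, 1, 2 were consumed to find the match
print(list(g))    # [3, 4] -- only the unconsumed tail remains
```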
Alex
> Paul Rubin wrote:
>
>> cla...@lairds.us (Cameron Laird) writes:
>>> For that matter, would it be an advantage for len() to operate
>>> on iterables?
>>
>> print len(itertools.count())
>>
>> Ouch!!
>
> How is this worse than list(itertools.count()) ?
list(itertools.count()) will eventually fail with a MemoryError.
Actually len(itertools.count()) would as well - when a couple of long
instances used up everything available - but it would take a *lot*
longer.
Tim Delaney
> Actually len(itertools.count()) would as well - when a couple of long
> instances used up everything available - but it would take a *lot*
> longer.
Actually, this would depend on whether len(iterable) used a C integral
variable to accumulate the length (which would roll over and never end)
or a Python long (which would eventually use up all memory).
Tim Delaney
That's only because itertools.count itself uses a C int instead of a long.
IMO, that's a bug (maybe fixed in 2.5):
Python 2.3.4 (#1, Feb 2 2005, 12:11:53)
[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys,itertools
>>> a=sys.maxint - 3
>>> a
2147483644
>>> b = itertools.count(a)
>>> [b.next() for i in range(8)]
[2147483644, 2147483645, 2147483646, 2147483647, -2147483648,
-2147483647, -2147483646, -2147483645]
>>>
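For the record, this wraparound was indeed fixed: in Python 3, itertools.count promotes to arbitrary-precision integers instead of rolling over to negative values:

```python
import sys
import itertools

# Start just below the largest machine-word int; Python 3 promotes
# seamlessly past it rather than wrapping around.
c = itertools.count(sys.maxsize - 1)
values = [next(c) for _ in range(3)]
print(values)  # the third value exceeds sys.maxsize -- no rollover
```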
> That's only because itertools.count itself uses a C int instead of a
> long.
True. In either case, the effect is the same in terms of whether
len(itertools.count()) will ever terminate.
Tim Delaney
That's more of a theoretical argument on why the latter is worse. How
many real-world programs are prepared for MemoryError every time they
call list(), catch it and handle it gracefully? I'd say that the only
reason an exception would be preferable in such case would be
debugging; it's nice to have an informative traceback instead of a
program that entered an infinite loop.
George
> Delaney, Timothy (Tim) wrote:
>>
>> list(itertools.count()) will eventually fail with a MemoryError.
>
> That's more of a theoretical argument on why the latter is worse. How
> many real-world programs are prepared for MemoryError every time they
> call list(), catch it and handle it gracefully? I'd say that the only
> reason an exception would be preferable in such case would be
> debugging; it's nice to have an informative traceback instead of a
> program that entered an infinite loop.
That's exactly my point. Assuming your test coverage is good, such an
error would be caught by the MemoryError. An infinite loop should also
be caught by timing out the tests, but that's much more dependent on the
test harness.
Tim Delaney
If we could neglect memory impact, and procedural side-effects,
then, sure, I'd argue for my len(list(...)) formulation, on the
expressive grounds that it doesn't require the two "magic tokens"
'1' and 'w'. Does category theory have a term for formulas of
the sort that introduce a free variable only to ignore (discard,
...) it? There certainly are times when that's apt ...
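Python's closest idiom here: `_` is the conventional name for a variable that is bound only to be discarded, which at least removes one of the two magic tokens:

```python
# `_` signals a binding we never use; only the count matters.
words = "There is no abstract art".split()
numwords = sum(1 for _ in iter(words))
print(numwords)  # 5
```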
Quite so. My proposal isn't at all serious; I'm doing this largely
for practice in thinking about functionalism and its complement in
Python. However, maybe I should take this one step farther: while
I think your caution about "attractive nuisance" is perfect, what is
the precise nuisance here? Is there ever a time when a developer
would be tempted to evaluate len() on an iterable even though there's
another approach that does NOT impact the iterable's state? On the
other hand, maybe all we're doing is observing that expanding the
domain of len() means we give up guarantees on its finiteness, and
that's simply not worth doing.
Gulp. OK, you've got me curious: how many people habitually frame
their unit tests with resource constraints? I think I take testing
seriously, and certainly also am involved with resource limits often,
but I confess I've never aimed to write all my tests in terms of
bounds on time (and presumably memory and ...). You've got me
thinking, Tim.
I'm a huge proponent of unittest and believe I take them very
seriously also. I try never to write a line of code unless I have a
test to prove I need it.
I have written tests that take into account resource constraints, but
ONLY when I've written code that, while passing all tests, shows
resource consumption problems.
Creating resource constraint tests out of the gate *may* fall into
the category of premature optimization.
--
Stand Fast,
tjg.
Better to catch the specific exception rather than a bare except clause:
except TypeError:
>     i = 0
>     for x in iterable: i += 1
>     return i
>
(snip)
> Gulp. OK, you've got me curious: how many people habitually frame
> their unit tests with resource constraints? I think I take testing
> seriously, and certainly also am involved with resource limits often,
> but I confess I've never aimed to write all my tests in terms of
> bounds on time (and presumably memory and ...). You've got me
> thinking, Tim.
Generally we only do it when we actually discover a problem i.e. the
test has been running forever ... which is precisely the problem. At
least you don't have to specifically code tests for bounds on most other
resources (such as memory), since in a language with managed memory
(Python, Java, etc.) the test will eventually fail on its own. It
definitely becomes an issue for C/C++ unit tests though - they don't
fail nicely.
I believe JUnit 4 includes the option to time out any test, but that you
can vary the timeout using an annotation (for example, make any test
fail if it takes longer than 30 seconds - but you may have one test that
needs to run longer). We're only using JUnit 3 at my work at the moment
though.
The application I'm currently working on is very timing dependent (and
the timings vary greatly across machines), so we've got loops like:
while (test condition not met) and (timeout not exceeded):
    sleep
assert test condition met
Originally these were just:
while (test condition not met):
    sleep
and as a result turned into infinite loops.
I think it's much better to have every test have a default timeout (at
which point the test fails) and be able to change that default when
needed. This protects against errors in both the code, and test cases.
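The timeout-guarded polling pattern above might be sketched like this in Python (`wait_for` is a hypothetical helper, not part of any test framework mentioned here):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            return False        # timed out: let the caller fail the test
        time.sleep(interval)
    return True

# In a test:  assert wait_for(condition_met), "condition not met in time"
```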
Tim Delaney