I found this in the great Python Cookbook, which allows me to process
every word in a file. But how do I use it to count the items generated?
def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()
for word in words_of_file(thefilepath):
    dosomethingwith(word)
The best I could come up with:
def words_of_file(thefilepath, line_to_words=str.split):
    the_file = open(thefilepath)
    for line in the_file:
        for word in line_to_words(line):
            yield word
    the_file.close()
len(list(words_of_file(thefilepath)))
But that seems clunky.
My preference would be (with the original definition for
words_of_the_file) to code
numwords = sum(1 for w in words_of_the_file(thefilepath))
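For reference, here is a runnable sketch of that pattern (using a with-statement, so the file is closed even if the consumer stops iterating early; the counting line is shown as a comment because it needs a real file path):

```python
def words_of_file(thefilepath, line_to_words=str.split):
    # Yield each word of each line; `with` guarantees the file is closed.
    with open(thefilepath) as the_file:
        for line in the_file:
            for word in line_to_words(line):
                yield word

# Count without materializing a list:
# numwords = sum(1 for w in words_of_file(thefilepath))
```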
Alex
As clunky as it seems, I don't think you can beat it in terms of
brevity; if you care about memory efficiency though, here's what I use:
def length(iterable):
    try: return len(iterable)
    except:
        i = 0
        for x in iterable: i += 1
        return i
You can even shadow the builtin len() if you prefer:
import __builtin__
def len(iterable):
    try: return __builtin__.len(iterable)
    except:
        i = 0
        for x in iterable: i += 1
        return i
HTH,
George
rd
"There is no abstract art. You must always start with something.
Afterward you can remove all traces of reality."--Pablo Picasso
Alex's example amounted to something like that, for the generator
case. Notice that the argument to sum() was a generator expression.
The sum function then iterated through it.
True. Changing the except clause here to
except: return sum(1 for x in iterable)
keeps George's optimization (O(1), not O(N), for containers) and is a
bit faster (while still O(N)) for non-container iterables.
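Putting George's fast path and this tweak together gives something like the following (a sketch; `length` is not a stdlib function):

```python
def length(iterable):
    """Count items: O(1) via len() when possible, O(N) otherwise."""
    try:
        return len(iterable)               # containers: constant time
    except TypeError:                      # iterators/generators: count
        return sum(1 for _ in iterable)
```

Both `length([1, 2, 3])` and `length(x for x in range(3))` return 3, but only the second consumes its argument.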
Alex
numwords = len(list(words_of_the_file(thefilepath)))
will be advantageous.
For that matter, would it be an advantage for len() to operate
on iterables? It could be faster and thriftier on memory than
either of the above, and my first impression is that it's
sufficiently natural not to offend those suspicious of
language bloat.
print len(itertools.count())
Ouch!!
>> except: return sum(1 for x in iterable)
>> keeps George's optimization (O(1), not O(N), for containers) and is a
>> bit faster (while still O(N)) for non-container iterables.
Everything was going just great. Now I have to think again.
Thank you all.
rick
How is this worse than list(itertools.count()) ?
> In article <1hfarom.1lfetjc18leddeN%al...@mac.com>,
> Alex Martelli <al...@mac.com> wrote:
> .
> .
> .
> >My preference would be (with the original definition for
> >words_of_the_file) to code
> >
> > numwords = sum(1 for w in words_of_the_file(thefilepath))
> .
> .
> .
> There are times when
>
> numwords = len(list(words_of_the_file(thefilepath)))
>
> will be advantageous.
Can you please give some examples? None comes readily to mind...
> For that matter, would it be an advantage for len() to operate
> on iterables? It could be faster and thriftier on memory than
> either of the above, and my first impression is that it's
> sufficiently natural not to offend those suspicious of
> language bloat.
I'd be a bit worried about having len(x) change x's state into an
unusable one. Yes, it happens in other cases (if y in x:), but adding
more such problematic cases doesn't seem advisable to me anyway -- I'd
evaluate this proposal as a -0, even taking into account the potential
optimizations to be garnered by having some iterables expose __len__
(e.g., a genexp such as (f(x) for x in foo), without an if-clause, might
be optimized to delegate __len__ to foo -- again, there may be semantic
alterations lurking that make this optimization a bit iffy).
Alex
It's a slightly worse trap because list(x) ALWAYS iterates on x (just
like "for y in x:"), while len(x) MAY OR MAY NOT iterate on x (under
Cameron's proposal; it currently never does).
Yes, there are other subtle traps of this ilk already in Python, such as
"if y in x:" -- this, too, may or may not iterate. But the fact that a
potential problem exists in some corner cases need not be a good reason
to extend the problem to higher frequency;-).
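The existing "if y in x:" trap is easy to demonstrate; a quick illustration:

```python
# A generator is consumed by a membership test just as by iteration.
g = (n for n in range(5))
print(2 in g)     # True -- items 0, 1, 2 were consumed to find the match
print(list(g))    # [3, 4] -- only the unconsumed tail remains
```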
Alex
> Paul Rubin wrote:
>
>> cla...@lairds.us (Cameron Laird) writes:
>>> For that matter, would it be an advantage for len() to operate
>>> on iterables?
>>
>> print len(itertools.count())
>>
>> Ouch!!
>
> How is this worse than list(itertools.count()) ?
list(itertools.count()) will eventually fail with a MemoryError.
Actually len(itertools.count()) would as well - when a couple of long
instances used up everything available - but it would take a *lot*
longer.
Tim Delaney
> Actually len(itertools.count()) would as well - when a couple of long
> instances used up everything available - but it would take a *lot*
> longer.
Actually, this would depend on whether len(iterable) used a C integral
variable to accumulate the length (which would roll over and never end)
or a Python long (which would eventually use up all memory).
Tim Delaney
That's only because itertools.count itself uses a C int instead of a long.
IMO, that's a bug (maybe fixed in 2.5):
Python 2.3.4 (#1, Feb 2 2005, 12:11:53)
[GCC 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys,itertools
>>> a=sys.maxint - 3
>>> a
2147483644
>>> b = itertools.count(a)
>>> [b.next() for i in range(8)]
[2147483644, 2147483645, 2147483646, 2147483647, -2147483648,
-2147483647, -2147483646, -2147483645]
>>>
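For the record, this wraparound was indeed fixed: in Python 3, itertools.count promotes to arbitrary-precision integers instead of rolling over to negative values:

```python
import sys
import itertools

# Start just below the largest machine-word int; Python 3 promotes
# seamlessly past it rather than wrapping around.
c = itertools.count(sys.maxsize - 1)
values = [next(c) for _ in range(3)]
print(values)  # the third value exceeds sys.maxsize -- no rollover
```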
> That's only because itertools.count itself uses a C int instead of a
> long.
True. In either case, the effect is the same in terms of whether
len(itertools.count()) will ever terminate.
Tim Delaney
That's more of a theoretical argument on why the latter is worse. How
many real-world programs are prepared for MemoryError every time they
call list(), catch it and handle it gracefully? I'd say that the only
reason an exception would be preferable in such case would be
debugging; it's nice to have an informative traceback instead of a
program that entered an infinite loop.
George
> Delaney, Timothy (Tim) wrote:
>>
>> list(itertools.count()) will eventually fail with a MemoryError.
>
> That's more of a theoretical argument on why the latter is worse. How
> many real-world programs are prepared for MemoryError every time they
> call list(), catch it and handle it gracefully? I'd say that the only
> reason an exception would be preferable in such case would be
> debugging; it's nice to have an informative traceback instead of a
> program that entered an infinite loop.
That's exactly my point. Assuming your test coverage is good, such an
error would be caught by the MemoryError. An infinite loop should also
be caught by timing out the tests, but that's much more dependent on the
test harness.
Tim Delaney
If we could neglect memory impact, and procedural side-effects,
then, sure, I'd argue for my len(list(...)) formulation, on the
expressive grounds that it doesn't require the two "magic tokens"
'1' and 'w'. Does category theory have a term for formulas of
the sort that introduce a free variable only to ignore (discard,
...) it? There certainly are times when that's apt ...
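Python's closest idiom here: `_` is the conventional name for a variable that is bound only to be discarded, which at least removes one of the two magic tokens:

```python
# `_` signals a binding we never use; only the count matters.
words = "There is no abstract art".split()
numwords = sum(1 for _ in iter(words))
print(numwords)  # 5
```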
Quite so. My proposal isn't at all serious; I'm doing this largely
for practice in thinking about functionalism and its complement in
Python. However, maybe I should take this one step farther: while
I think your caution about "attractive nuisance" is perfect, what is
the precise nuisance here? Is there ever a time when a developer
would be tempted to evaluate len() on an iterable even though there's
another approach that does NOT impact the iterable's state? On the
other hand, maybe all we're doing is observing that expanding the
domain of len() means we give up guarantees on its finiteness, and
that's simply not worth doing.
Gulp. OK, you've got me curious: how many people habitually frame
their unit tests with resource constraints? I think I take testing
seriously, and certainly also am involved with resource limits often,
but I confess I've never aimed to write all my tests in terms of
bounds on time (and presumably memory and ...). You've got me
thinking, Tim.
I'm a huge proponent of unittest and believe I take them very
seriously also. I try never to write a line of code unless I have a
test to prove I need it.
I have written tests that take into account resource constraints, but
ONLY when I've written code that, while passing all tests, shows
resource consumption problems.
Creating resource constraint tests out of the gate *may* fall into
the category of premature optimization.
--
Stand Fast,
tjg.
Better to catch the specific exception rather than a bare except clause:
except TypeError:
>     i = 0
>     for x in iterable: i += 1
>     return i
>
(snip)
> Gulp. OK, you've got me curious: how many people habitually frame
> their unit tests with resource constraints? I think I take testing
> seriously, and certainly also am involved with resource limits often,
> but I confess I've never aimed to write all my tests in terms of
> bounds on time (and presumably memory and ...). You've got me
> thinking, Tim.
Generally we only do it when we actually discover a problem i.e. the
test has been running forever ... which is precisely the problem. At
least you don't have to specifically code tests for bounds on most other
resources (such as memory), since in a language with managed memory
(Python, Java, etc.) the test will eventually fail on its own. It
definitely becomes an issue for C/C++ unit tests though - they don't
fail nicely.
I believe JUnit 4 includes the option to time out any test, but that you
can vary the timeout using an annotation (for example, make any test
fail if it takes longer than 30 seconds - but you may have one test that
needs to run longer). We're only using JUnit 3 at my work at the moment
though.
The application I'm currently working on is very timing dependent (and
the timings vary greatly across machines), so we've got loops like:
while (test condition not met) and (timeout not exceeded):
    sleep
assert test condition met
Originally these were just:
while (test condition not met):
    sleep
and as a result turned into infinite loops.
I think it's much better to have every test have a default timeout (at
which point the test fails) and be able to change that default when
needed. This protects against errors in both the code, and test cases.
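The timeout-guarded polling pattern above might be sketched like this in Python (`wait_for` is a hypothetical helper, not part of any test framework mentioned here):

```python
import time

def wait_for(condition, timeout=5.0, interval=0.05):
    """Poll `condition` until it returns true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not condition():
        if time.monotonic() >= deadline:
            return False        # timed out: let the caller fail the test
        time.sleep(interval)
    return True

# In a test:  assert wait_for(condition_met), "condition not met in time"
```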
Tim Delaney