[Python-ideas] An iterable version of find/index for strings?

Tom Schumm

Apr 4, 2013, 9:21:11 PM
to Python-Ideas
Should Python strings (and byte arrays, and other iterables for that matter)
have an iterator form of find/rfind (or index/rindex)? I've found myself
wanting one on occasion, and having more iterable things seems to be the
direction the language is moving.

Currently, looping over the instances of a substring in a larger string is a
bit awkward. You have to keep track of where you are, and you either have
to watch for the -1 sentinel value or catch the ValueError. A "for idx in ..."
construction would just be cleaner. You could use re.finditer, but a string
method seems more lightweight/efficient/obvious.
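
For illustration, the awkward pattern described above might look like the following sketch. The function names and the sample text are made up for this example, and it assumes a non-empty needle and non-overlapping matches; the re.finditer variant is shown next to it for comparison.

```python
import re

def find_positions_manual(haystack, needle):
    """Collect match positions with a hand-written str.find() loop.

    Assumes needle is non-empty; an empty needle would never advance.
    """
    positions = []
    start = 0
    while True:
        idx = haystack.find(needle, start)
        if idx == -1:                  # sentinel value: no further match
            break
        positions.append(idx)
        start = idx + len(needle)      # resume after this match (no overlap)
    return positions

def find_positions_re(haystack, needle):
    """Same positions via re.finditer; the needle has to be escaped first."""
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]

text = "spam, spam, eggs and spam"
print(find_positions_manual(text, "spam"))
print(find_positions_re(text, "spam"))
```

Both calls report the same positions; the difference is only in how much bookkeeping the caller has to write.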

The best name I can think of would be "finditer()" like re.finditer(). Using
"ifind" (like izip) would be confusing, because it could be mistaken for case-
insensitive find. I thought of "iterfind" like the old dict.iteritems, and
ElementTree.iterfind but "iterrfind" (iterable rfind) is unattractive. I also
think "find" is a more obvious verb than "index".

I've got a simple Python implementation on gist:
https://gist.github.com/fwiffo/5233377

It includes an option to include overlapping instances, which may not be
necessary (it's not present in e.g. re.finditer).
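
A minimal sketch of what such a generator could look like follows. This is not the code from the gist; the name finditer, the overlap parameter, and the choice to reject an empty needle are all assumptions made for this example.

```python
def finditer(haystack, needle, overlap=False):
    """Yield the index of each occurrence of needle in haystack.

    Hypothetical helper: with overlap=False the search resumes after the
    end of each match; with overlap=True it resumes one position later.
    """
    if not needle:
        raise ValueError("empty needle")   # design choice for this sketch
    step = 1 if overlap else len(needle)
    start = 0
    while True:
        idx = haystack.find(needle, start)
        if idx == -1:
            return
        yield idx
        start = idx + step

for idx in finditer("aaaa", "aa"):
    print(idx)
print(list(finditer("aaaa", "aa", overlap=True)))
```

Because it only relies on find(sub, start), the same function should also accept bytes and bytearray objects.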

I could imagine it as a method on str/unicode/bytes/list/tuple objects, or
maybe as a function in itertools.

--
Tom Schumm
http://www.fwiffo.com/
_______________________________________________
Python-ideas mailing list
Python...@python.org
http://mail.python.org/mailman/listinfo/python-ideas

MRAB

Apr 4, 2013, 10:37:05 PM
to python-ideas
On 05/04/2013 02:21, Tom Schumm wrote:
> Should Python strings (and byte arrays, and other iterables for that matter)
> have an iterator form of find/rfind (or index/rindex)? I've found myself
> wanting one on occasion, and having more iterable things seems to be the
> direction the language is moving.
>
> Currently, looping over the instances of a substring in a larger string is a
> bit awkward. You have to keep track of where you are, and you either have
> to watch for the -1 sentinel value or catch the ValueError. A "for idx in ..."
> construction would just be cleaner. You could use re.finditer, but a string
> method seems more lightweight/efficient/obvious.
>
> The best name I can think of would be "finditer()" like re.finditer(). Using
> "ifind" (like izip) would be confusing, because it could be mistaken for case-
> insensitive find. I thought of "iterfind" like the old dict.iteritems, and
> ElementTree.iterfind but "iterrfind" (iterable rfind) is unattractive. I also
> think "find" is a more obvious verb than "index".
>
As you say, there's iteritems in Python 2. The os module has listdir,
which returns a list; it has been suggested that a (non-list) iterable
version should be added, and the obvious name in that case would be
iterdir. The trend appears to be towards iterfind.

> I've got a simple Python implementation on gist:
> https://gist.github.com/fwiffo/5233377
>
> It includes an option to include overlapping instances, which may not be
> necessary (it's not present in e.g. re.finditer).
>
... but it _is_ present in the regex module! :-)
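
As far as I know, the third-party regex module exposes this through an overlapped argument. With only the standard library, a similar effect can be sketched with a zero-width lookahead, as below; the helper name overlapping_starts is made up for this example.

```python
import re

def overlapping_starts(needle, text):
    """Return the start index of every occurrence, overlapping ones included.

    The lookahead (?=...) matches without consuming characters, so the
    scan advances one position at a time.
    """
    pattern = "(?=" + re.escape(needle) + ")"
    return [m.start() for m in re.finditer(pattern, text)]

# Plain finditer skips overlapping occurrences; the lookahead form keeps them.
print([m.start() for m in re.finditer("aa", "aaaa")])
print(overlapping_starts("aa", "aaaa"))
```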

> I could imagine it as a method on str/unicode/bytes/list/tuple objects, or
> maybe as a function in itertools.
>
An alternative would be to write a generator for it.

You say "on occasion", but is that often enough to justify adding it to
the language?

Tom Schumm

Apr 4, 2013, 11:09:30 PM
to python-ideas
On Friday, April 05, 2013 03:37:05 AM MRAB wrote:
> As you say, there's iteritems in Python 2. The os module has listdir,
> which returns a list; it has been suggested that a (non-list) iterable
> version should be added, and the obvious name in that case would be
> iterdir. The trend appears to be towards iterfind.

I agree that consistency would be best, and I'm not precious about the name.
:)

> You say "on occasion", but is that often enough to justify adding it to
> the language?

And that's why I ask; does anybody else want it? I've used it a few times, but
there are some string methods I've never used even once.

--
Tom Schumm
http://www.fwiffo.com/

Raymond Hettinger

Apr 5, 2013, 3:42:43 AM
to Tom Schumm, Python-Ideas

On Apr 4, 2013, at 6:21 PM, Tom Schumm <ph...@phong.org> wrote:

> Should Python strings (and byte arrays, and other iterables for that matter)
> have an iterator form of find/rfind (or index/rindex)? I've found myself
> wanting one on occasion,

+1 from me.

As you say, the current pattern is awkward.  Iterators are much more
natural for this task and would lead to cleaner, faster code.


Raymond

Giampaolo Rodolà

Apr 5, 2013, 6:11:45 AM
to Tom Schumm, Python-Ideas
2013/4/5 Tom Schumm <ph...@phong.org>:
> Should Python strings (and byte arrays, and other iterables for that matter)
> have an iterator form of find/rfind (or index/rindex)? I've found myself
> wanting one on occasion, and having more iterable things seems to be the
> direction the language is moving.
>
> Currently, looping over the instances of a substring in a larger string is a
> bit awkward. You have to keep track of where you are, and you either have
> to watch for the -1 sentinel value or catch the ValueError. A "for idx in ..."
> construction would just be cleaner. You could use re.finditer, but a string
> method seems more lightweight/efficient/obvious.
>
> The best name I can think of would be "finditer()" like re.finditer(). Using
> "ifind" (like izip) would be confusing, because it could be mistaken for case-
> insensitive find. I thought of "iterfind" like the old dict.iteritems, and
> ElementTree.iterfind but "iterrfind" (iterable rfind) is unattractive. I also
> think "find" is a more obvious verb than "index".
>
> I've got a simple Python implementation on gist:
> https://gist.github.com/fwiffo/5233377
>
> It includes an option to include overlapping instances, which may not be
> necessary (it's not present in e.g. re.finditer).

+1.

> I could imagine it as a method on str/unicode/bytes/list/tuple objects, or
> maybe as a function in itertools.

I would personally prefer the former.

--- Giampaolo
https://code.google.com/p/pyftpdlib/
https://code.google.com/p/psutil/
https://code.google.com/p/pysendfile/

Brett Cannon

Apr 5, 2013, 6:24:53 AM
to Tom Schumm, Python-Ideas
FYI there is already a proposal for split: http://bugs.python.org/issue17343. Getting that approved would help move towards getting iterators for other relevant methods such as find and index.

Wolfgang Maier

Apr 5, 2013, 5:11:18 AM
to python...@python.org
Tom Schumm <phong@...> writes:

>
> Should Python strings (and byte arrays, and other iterables for that
> matter) have an iterator form of find/rfind (or index/rindex)?

+1 as well.
As you say, it's a logical thing to have, and there don't seem to be any
disadvantages to it.

Wolfgang

Yuval Greenfield

Apr 8, 2013, 2:43:38 AM
to Wolfgang Maier, python-ideas
On Fri, Apr 5, 2013 at 12:11 PM, Wolfgang Maier <wolfgan...@biologie.uni-freiburg.de> wrote:
> Tom Schumm <phong@...> writes:
>
> > Should Python strings (and byte arrays, and other iterables for that
> > matter) have an iterator form of find/rfind (or index/rindex)?
>
> +1 as well.
> As you say, it's a logical thing to have, and there don't seem to be any
> disadvantages to it.
>
> Wolfgang

I think there is a disadvantage:

* It adds complexity to the str/bytes API.
* These features exist in the `re` module, TSBOOWTDI.
* Strings are usually short and always entirely in memory - the iterator requirement isn't commonplace.

Yuval

Andrew Barnert

Apr 8, 2013, 2:58:09 AM
to Yuval Greenfield, Wolfgang Maier, python-ideas
On Apr 7, 2013, at 23:43, Yuval Greenfield <ubers...@gmail.com> wrote:

> On Fri, Apr 5, 2013 at 12:11 PM, Wolfgang Maier <wolfgan...@biologie.uni-freiburg.de> wrote:
> > Tom Schumm <phong@...> writes:
> >
> > > Should Python strings (and byte arrays, and other iterables for that
> > > matter) have an iterator form of find/rfind (or index/rindex)?
> >
> > +1 as well.
> > As you say, it's a logical thing to have, and there don't seem to be any
> > disadvantages to it.
> >
> > Wolfgang
>
> I think there is a disadvantage:
>
> * It adds complexity to the str/bytes API.
> * These features exist in the `re` module, TSBOOWTDI.

Yes, but regular expressions shouldn't be the one way to do a simple text search!

> * Strings are usually short and always entirely in memory - the iterator requirement isn't commonplace.

This, I think, is a better point. If you need iterfind, there's a good chance you're going to want to replace the string with an mmap, an iterator around read, something that generates the string on the fly, etc. There will be _some_ programs for which str.iterfind is more useful than a generic iterfind function, but maybe not that many...
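
To make the idea of a generic function concrete, here is a rough sketch. It assumes only that the object has a str-like find(sub, start) method returning -1 on failure, which holds for str, bytes, bytearray and mmap objects; the name iterfind and the start parameter are choices made for this example, not an existing API.

```python
import mmap

def iterfind(data, sub, start=0):
    """Yield each non-overlapping index of sub in data.

    Hypothetical helper: data may be any object with a str-like
    find(sub, start) method that returns -1 when nothing is found.
    """
    if len(sub) == 0:
        raise ValueError("empty search target")
    while True:
        idx = data.find(sub, start)
        if idx == -1:
            return
        yield idx
        start = idx + len(sub)

print(list(iterfind("abcabc", "bc")))
print(list(iterfind(b"abcabc", b"bc")))

# An anonymous memory map stands in for a mapped file here.
with mmap.mmap(-1, 16) as mm:
    mm[:6] = b"abcabc"
    print(list(iterfind(mm, b"bc")))
```

An iterator around read() would need a different approach, since matches can straddle chunk boundaries; this sketch does not attempt that.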


> Yuval

Nick Coghlan

Apr 8, 2013, 4:01:10 AM
to Andrew Barnert, Wolfgang Maier, python-ideas
On Mon, Apr 8, 2013 at 4:58 PM, Andrew Barnert <abar...@yahoo.com> wrote:
> This, I think, is a better point. If you need iterfind, there's a good
> chance you're going to want to replace the string with an mmap, an iterator
> around read, something that generates the string on the fly, etc. There will
> be _some_ programs for which str.iterfind is more useful than a generic
> iterfind function, but maybe not that many...

As Tom's original post shows, the existing methods are also designed
to make it relatively straightforward to implement an efficient
iterator if you do need it.

Cheers,
Nick.

--
Nick Coghlan | ncog...@gmail.com | Brisbane, Australia

Tom Schumm

Apr 8, 2013, 10:08:21 AM
to python...@python.org
On Sun, April 07, 2013 11:58:09 PM Andrew Barnert wrote:
> > * Strings are usually short and always entirely in memory - the iterator
> > requirement isn't commonplace.
> This, I think, is a better point. If you need iterfind, there's a good
> chance you're going to want to replace the string with an mmap, an iterator
> around read, something that generates the string on the fly, etc. There
> will be _some_ programs for which str.iterfind is more useful than a
> generic iterfind function, but maybe not that many...

The big advantage in my use is that the iterator makes code shorter and more
readable, and I can use it in list comprehensions.

Maybe it's not useful enough to be a str method; perhaps something for
itertools or more-itertools? If it was rewritten to use index() instead of
find() it could be generalized for lists and such. Or maybe I'll just leave it
in my recipe box with my trees and memoization decorators.
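
A sketch of the index()-based generalization might look like this. The name iterindex is made up, and the sketch relies on index(value, start) raising ValueError when nothing is found, which is how str, list and tuple behave. Advancing by one position means overlapping substring matches are reported.

```python
def iterindex(seq, value):
    """Yield every index at which seq.index() finds value.

    Hypothetical helper: works for lists and tuples (element search)
    and for strings (substring search, overlapping matches included).
    """
    start = 0
    while True:
        try:
            idx = seq.index(value, start)
        except ValueError:         # index() signals "not found" this way
            return
        yield idx
        start = idx + 1

print(list(iterindex([1, 2, 1, 3, 1], 1)))
print(list(iterindex("banana", "an")))

# Usable directly in a comprehension:
even_hits = [i for i in iterindex((7, 0, 7, 7), 7) if i % 2 == 0]
print(even_hits)
```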

Then again, if I'm doing lots and lots of linear searches, I feel like I've
not thought about the problem long enough...

--
Tom Schumm
http://www.fwiffo.com/

Stephen J. Turnbull

Apr 8, 2013, 11:23:58 AM
to Andrew Barnert, Wolfgang Maier, python-ideas
Andrew Barnert writes:

> Yes, but regular expressions shouldn't be the one way to do a
> simple text search!

Why not? I don't see a real loss to "match('^start')" vs
"startswith('start')" in terms of difficulty of learning, and a
potential benefit in encouraging people to avail themselves of the
power of regexp search and matching.

As far as efficiency goes, XEmacs does the simple thing and checks
each alleged regexp for metacharacters, and if there aren't any, falls
back to Boyer-Moore. Whoosh!

It would not be hard to add similar peephole optimizations for
searches or matches that would be most efficiently implemented with
startswith, endswith, find, index, etc.
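
The kind of peephole optimization meant here could be sketched as follows. This is only an illustration, not how XEmacs or Python's re module work internally; the helper name search_start and the metacharacter set are assumptions for this example, and regex flags are ignored.

```python
import re

# Characters that carry special meaning in Python's regex syntax.
_META = set(r".^$*+?{}[]\|()")

def search_start(pattern, text):
    """Return the index of the first match of pattern in text, or -1.

    If the pattern contains no metacharacters it is a plain literal,
    so the search is delegated to str.find instead of the regex engine.
    """
    if not _META.intersection(pattern):
        return text.find(pattern)
    m = re.search(pattern, text)
    return m.start() if m else -1

print(search_start("spam", "eggs spam"))   # literal fast path
print(search_start("sp.m", "eggs spam"))   # regex path
```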

Of course, we would need to check how often people search for
punctuation (or strings including punctuation). But my suspicion is
that people who don't want to grok the most basic features of regexps
probably don't search for "\.\*" or the like very often. They
probably stick to alphanumerics anyway.

Andrew Barnert

Apr 8, 2013, 2:04:40 PM
to Stephen J. Turnbull, Wolfgang Maier, python-ideas
On Apr 8, 2013, at 8:23, "Stephen J. Turnbull" <ste...@xemacs.org> wrote:

> Andrew Barnert writes:
>
>> Yes, but regular expressions shouldn't be the one way to do a
>> simple text search!
>
> Why not? I don't see a real loss to "match('^start')" vs
> "startswith('start')" in terms of difficulty of learning, and a
> potential benefit in encouraging people to avail themselves of the
> power of regexp search and matching.

I don't see how you could think these are equally easy to learn. You could show the latter to someone who's never written a line of code and they'd already understand it.

But that's not the important part. The benefit of startswith is that it takes less effort to read and understand the code. Reading a regex, or s[:5]=='start' for that matter, isn't _hard_, but it still takes a bit of mental effort, which slows you down a little bit, which limits how much code you can understand in one scan. There's also the fact that there's literally nothing to get wrong with startswith, which means when you're debugging your code, you don't have to mentally check a regex or slice to make sure it's right. One of the great things about python is that you can often understand what a function does, and be sure it's correct, in just a glance.

Also, the fact that new programmers can use python for serious work (even text processing work) before they've learned regex is a strength, not a weakness. If you have to tell people "before you can parse that log/csv/whatever you have to learn how to escape parentheses and create matching groups", you might as well teach them perl.

Antoine Pitrou

Apr 8, 2013, 2:14:53 PM
to python...@python.org

Hello,

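On Fri, 5 Apr 2013 00:42:43 -0700
Raymond Hettinger wrote:
> +1 from me.
>
> As you say, the current pattern is awkward.  Iterators are much more
> natural for this task and would lead to cleaner, faster code.
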
I'm mildly positive as well.
If iterfind() / finditer() is awkward, let's call it findall(): other
search methods just return the first match.

Regards

Antoine.

Laurens Van Houtven

Apr 8, 2013, 2:22:23 PM
to Antoine Pitrou, Python-Ideas
On Mon, Apr 8, 2013 at 8:14 PM, Antoine Pitrou <soli...@pitrou.net> wrote:
> let's call it findall()

+1

cheers
lvh

Tom Schumm

Apr 8, 2013, 2:59:15 PM
to python...@python.org
On Mon, April 08, 2013 11:04:40 AM Andrew Barnert wrote:
> But that's not the important part. The benefit of startswith is that it
> takes less effort to read and understand the code. Reading a regex, or
> s[:5]=='start' for that matter, isn't _hard_, but it still takes a bit of

I was a big fan of regular expressions, going way back; I was a huge Perl
fanatic. But over the years I've used them less and less. As Andrew says, if
you have a simple string method that does the job, why endure the cognitive
overhead of a regular expression? Even if you are using a great regex library
that optimized out the computational overhead for simple cases, you still have
to write a (potentially cryptic) regex, escape special characters, etc.

It's a win if you can make code self-documenting by using a descriptive method
like "startswith", "endswith", "if needle in haystack", "find", "strip", etc.
All those have trivial regex solutions, but it's better to just say what you
mean.

--
Tom Schumm
http://www.fwiffo.com/

Nick Coghlan

Apr 8, 2013, 7:09:46 PM
to Antoine Pitrou, python...@python.org


On 9 Apr 2013 04:20, "Antoine Pitrou" <soli...@pitrou.net> wrote:
>
>
> Hello,
>
> On Fri, 5 Apr 2013 00:42:43 -0700
> Raymond Hettinger
> <raymond....@gmail.com> wrote:
> >
> > On Apr 4, 2013, at 6:21 PM, Tom Schumm <ph...@phong.org> wrote:
> >
> > > Should Python strings (and byte arrays, and other iterables for that matter)
> > > have an iterator form of find/rfind (or index/rindex)? I've found myself
> > > wanting one on occasion,
> >
> > +1 from me.
> >
> > As you say, the current pattern is awkward.  Iterators are much more
> > natural for this task and would lead to cleaner, faster code.
>
> I'm mildly positive as well.
> If iterfind() / finditer() is awkward, let's call it findall(): other
> search methods just return the first match.

+0 from me for findall/rfindall. The overlap keyword-only arg seems like a reasonable approach to that part of the problem, too.
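
As a rough sketch of the right-to-left variant with a keyword-only overlap argument (written as a free function rather than a str method; the signature is an assumption, not an agreed design):

```python
def rfindall(haystack, needle, *, overlap=False):
    """Yield match indices from right to left using str.rfind.

    Hypothetical helper: overlap is keyword-only. Without overlap the
    next match must end at or before the start of the previous one;
    with overlap it only has to start earlier.
    """
    if not needle:
        raise ValueError("empty needle")
    end = len(haystack)
    while True:
        idx = haystack.rfind(needle, 0, end)
        if idx == -1:
            return
        yield idx
        end = idx + len(needle) - 1 if overlap else idx

print(list(rfindall("aaaa", "aa")))
print(list(rfindall("aaaa", "aa", overlap=True)))
```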

Cheers,
Nick.

Steven D'Aprano

Apr 8, 2013, 7:52:39 PM
to python...@python.org
On 09/04/13 01:23, Stephen J. Turnbull wrote:
> Andrew Barnert writes:
>
> > Yes, but regular expressions shouldn't be the one way to do a
> > simple text search!
>
> Why not? I don't see a real loss to "match('^start')" vs
> "startswith('start')" in terms of difficulty of learning,

I'm not Dutch, but I cannot imagine that:

import re
prefix = re.escape(prefix)
re.match(prefix, mystring)

should be considered more obvious than

mystring.startswith(prefix)


Oh, and just to demonstrate the non-obviousness of re.match, you don't
need to anchor the regex to the beginning of the string with ^ since
match automatically matches only at the start.
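
A small check of both points, the implicit anchoring and the need for re.escape, might look like this; the sample strings are made up for the illustration.

```python
import re

# re.match only tries at the start of the string, so no ^ is needed;
# re.search is the one that scans the whole string.
assert re.match("start", "start here") is not None
assert re.match("start", "restart") is None
assert re.search("start", "restart") is not None

# A prefix containing metacharacters must be escaped to be taken literally:
# unescaped, "1+1" means "one or more 1s followed by a 1".
prefix = "1+1"
text = "1+1=2"
assert re.match(prefix, text) is None
assert re.match(re.escape(prefix), text) is not None

# startswith has neither pitfall.
assert text.startswith(prefix)
print("all checks passed")
```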


> and a
> potential benefit in encouraging people to avail themselves of the
> power of regexp search and matching.

The difficulty is not encouraging people to use regexes when they need
them. The difficulty is teaching people not to turn to regexes as the
first and only tool for solving every string-based problem.


--
Steven

Stephen J. Turnbull

Apr 8, 2013, 8:14:05 PM
to Andrew Barnert, Wolfgang Maier, python-ideas
Andrew Barnert writes:

> I don't see how you could think these are equally easy to
> learn. You could show the latter to someone who's never written a
> line of code and they'd already understand it.

I didn't say they are equally easy to learn. My point is simply: How
many of these 3-line functions all alike do we need to have as
builtins? There is a cost to having them all, which may
counterbalance the ease of learning each one.

> There's also the fact that there's literally nothing to get wrong
> with startswith,

Of course there is. It may be the wrong function for the purpose.
.startswith also encourages embedding magic literals in the code.
Both of these make maintenance harder.

> you might as well teach them perl.

Now, now, let's not be invoking Godwin's Law here.

The question "how many do we need" is an empirical question. It
should be obvious I'm not seriously suggesting getting rid of
.startswith; that would have to wait for Python4 in any case.

Steven D'Aprano

Apr 8, 2013, 8:20:43 PM
to python...@python.org
On 09/04/13 04:14, Antoine Pitrou wrote:

> If iterfind() / finditer() is awkward, let's call it findall(): other
> search methods just return the first match.

+1 on the name and the method.


--
Steven