Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Yet another unique() function...

119 views
Skip to first unread message

MonkeeSage

unread,
Feb 27, 2007, 9:10:30 PM2/27/07
to
Here's yet another take on a unique() function for sequences. It's
more terse than others I've seen and works for all the common use
cases (please report any errors on the recipe page):

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/502263

Regards,
Jordan

Paul Rubin

unread,
Feb 27, 2007, 9:48:50 PM2/27/07
to

That looks pretty messy, and it's a quadratic time algorithm (or maybe
worse) because of all the list.index and deletion operations.

This version is also quadratic and passes your test suite, but
might differ in some more complicated cases:

def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []
return t(c for c in seq if (c not in seen, seen.append(c))[0])

Paul Rubin

unread,
Feb 27, 2007, 9:55:11 PM2/27/07
to
Paul Rubin <http://phr...@NOSPAM.invalid> writes:
> def unique(seq, keepstr=True):
> t = type(seq)
> if t==str:
> t = (list, ''.join)[bool(keepstr)]
> seen = []
> return t(c for c in seq if (c not in seen, seen.append(c))[0])

Preferable:

def unique(seq, keepstr=True):
t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]
seen = []

return t(c for c in seq if not (c in seen or seen.append(c)))

MonkeeSage

unread,
Feb 27, 2007, 10:03:45 PM2/27/07
to

Wow, nice! Very cool. :)

Regards,
Jordan

MonkeeSage

unread,
Feb 27, 2007, 10:14:08 PM2/27/07
to

I posted this (attributed to you of course) in the comments section
for the recipe.

Paul McGuire

unread,
Feb 28, 2007, 1:14:41 AM2/28/07
to
On Feb 27, 8:55 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:


Any reason not to use a set for the 'seen' variable? Avoids searching
through a linear list. The input order is preserved because the
return value is created in the generator expression, not by using the
seen variable directly.

def unique2(seq, keepstr=True):


t = type(seq)
if t==str:
t = (list, ''.join)[bool(keepstr)]

seen = set()
return t(c for c in seq if not (c in seen or seen.add(c)))

-- Paul

Paul Rubin

unread,
Feb 28, 2007, 2:00:50 AM2/28/07
to
"Paul McGuire" <pt...@austin.rr.com> writes:
> Any reason not to use a set for the 'seen' variable?

Yes, the sequence can contain non-hashable elements. See the test
vectors for examples.

bearoph...@lycos.com

unread,
Feb 28, 2007, 2:25:05 AM2/28/07
to
MonkeeSage:

> Here's yet another take on a unique() function for sequences. It's
> more terse than others I've seen and works for all the common use
> cases (please report any errors on the recipe page):

It's more terse, but my version is built to be faster in the more
common cases of all hashable or/and all sortable items (while working
in other cases too).
Try your unique on an unicode string, that's probably a bug (keepstr
is being ignored).
Version by Paul Rubin is very short, but rather unreadable too.

Bye,
bearophile

Paul Rubin

unread,
Feb 28, 2007, 3:18:16 PM2/28/07
to
bearoph...@lycos.com writes:
> It's more terse, but my version is built to be faster in the more
> common cases of all hashable or/and all sortable items (while working
> in other cases too).
> Try your unique on an unicode string, that's probably a bug (keepstr
> is being ignored).
> Version by Paul Rubin is very short, but rather unreadable too.
>
> Bye,
> bearophile

Unicode fix (untested):

def unique(seq, keepstr=True):
t = type(seq)

if t in (unicode, str):
t = (list, t('').join)[bool(keepstr)]


seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))

Case by case optimization (untested):

def unique(seq, keepstr=True):
t = type(seq)

if t in (unicode, str):
t = (list, t('').join)[bool(keepstr)]
try:
remaining = set(seq)
seen = set()
return t(c for c in seq if (c in remaining and
not remaining.remove(c)))
except TypeError: # hashing didn't work, see if seq is sortable
try:
from itertools import groupby
s = sorted(enumerate(seq),key=lambda (i,v):(v,i))
return t(g.next() for k,g in groupby(s, lambda (i,v): v))
except: # not sortable, use brute force


seen = []
return t(c for c in seq if not (c in seen or seen.append(c)))

I don't have Python 2.4 available right now to try either of the above.

Note that all the schemes fail if seq is some arbitrary iterable,
rather one of the built-in sequence types.

I think these iterator approaches get more readable as one becomes
used to them.

MonkeeSage

unread,
Feb 28, 2007, 4:10:24 PM2/28/07
to
On Feb 28, 2:18 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> Unicode fix (untested):
>
> def unique(seq, keepstr=True):
> t = type(seq)
> if t in (unicode, str):
> t = (list, t('').join)[bool(keepstr)]
> seen = []
> return t(c for c in seq if not (c in seen or seen.append(c)))

This definitely works. I tried to post a message about this a few
hours ago, but I guess it didn't go through. I've already updated the
recipe and comments.

> I think these iterator approaches get more readable as one becomes
> used to them.

I agree. I understood your code in a matter of seconds.

MonkeeSage

unread,
Feb 28, 2007, 4:43:12 PM2/28/07
to
Paul,

In your case optimized version, in the second try clause using
itertools, it should be like this, shouldn't it?

return t(g.next()[1] for k,g in groupby(s, lambda (i,v): v))
^^^

Regards,
Jordan

Paul Rubin

unread,
Feb 28, 2007, 4:46:41 PM2/28/07
to
"MonkeeSage" <Monke...@gmail.com> writes:
> In your case optimized version, in the second try clause using
> itertools, it should be like this, shouldn't it?
>
> return t(g.next()[1] for k,g in groupby(s, lambda (i,v): v))

I didn't think so but I can't conveniently test it for now. Maybe
tonight.

MonkeeSage

unread,
Mar 1, 2007, 2:02:09 PM3/1/07
to
On Feb 28, 3:46 pm, Paul Rubin <http://phr...@NOSPAM.invalid> wrote:
> I didn't think so but I can't conveniently test it for now. Maybe
> tonight.

After playing with it a bit (only have 2.5 on this box), it looks like
you do need to subscript the next() call. For example, the return from
"unique( [[1], [2]] )" is "[(0, [1]), (1, [2])]". I'm not overly
familiar with the functional paradigm or itertools though, so I might
have missed something.

Regards,
Jordan

0 new messages