Re: determining which value is the first to appear five times in a list?

Terry Reedy

unread,

Feb 6, 2010, 1:56:25 PM2/6/10

to pytho...@python.org

On 2/6/2010 1:24 PM, Chris Colbert wrote:
> I'm working on a naive K-nearest-neighbors selection criteria for an
> optical character recognition problem.
>
> After I build my training set, I test each new image against against the
> trained feature vectors and record the scores as follows:
>
> match_vals = [(match_val_1, identifier_a), (match_val_2, identifier_b)
> .... ] and so on..
>
> then I sort the list so the smallest match_val's appear first
> (indictating a strong match, so I may end up with something like this:
>
> [(match_val_291, identifier_b), (match_val_23, identifier_b),
> (match_val_22, identifer_k) .... ]
>
> Now, what I would like to do is step through this list and find the
> identifier which appears first a K number of times.
>
> Naively, I could make a dict and iterate through the list AND the dict
> at the same time and keep a tally, breaking when the criteria is met.
>
> such as:
>
> def getnn(match_vals):
> tallies = defaultdict(lambda: 0)
> for match_val, ident in match_vals:
> tallies[ident] += 1
> for ident, tally in tallies.iteritems():
> if tally == 5:
> return ident
>
> I would think there is a better way to do this. Any ideas?

You only need to check that the incremented tally is 5, which is to say,
that the about-to-be-incremented tally is 4.
t = tallies[ident]
if t < 4: tallies[ident] = t+1
else: return ident

Wolodja Wentland

unread,

Feb 6, 2010, 2:09:01 PM2/6/10

to pytho...@python.org

On Sat, Feb 06, 2010 at 13:24 -0500, Chris Colbert wrote:
> [(match_val_291, identifier_b), (match_val_23, identifier_b), (match_val_22,
> identifer_k) .... ]
>
> Now, what I would like to do is step through this list and find the identifier
> which appears first a K number of times.

I think you can use the itertools.groupby(L, lambda el: el[1]) to group
elements in your *sorted* list L by the value el[1] (i.e. the
identifier) and then iterate through these groups until you find the
desired number of instances grouped by the same identifier.

Let me exemplify this:

>>> from itertools import groupby
>>> instances = [(1, 'b'), (2, 'b'), (3, 'a'), (4, 'c'), (5, 'c'), (6, 'c'), (7, 'd')]
>>> k = 3
>>> grouped_by_identifier = groupby(instances, lambda el: el[1])
>>> grouped_by_identifier = ((identifier, list(group)) for identifier, group in grouped_by_identifier)
>>> k_instances = (group for identifier, group in grouped_by_identifier if len(group) == k)
>>> next(k_instances)
[(4, 'c'), (5, 'c'), (6, 'c')]
>>> next(k_instances)
Traceback (most recent call last):
File "<input>", line 1, in <module>
StopIteration

There are certainly millions of ways to do this and most of them will be
better than my proposal here, but you might like this approach. Another
approach would use itertools.takewhile() or itertools.ifilter() ... Just
have a look :-)

yours sincerely
--
.''`. Wolodja Wentland <went...@cl.uni-heidelberg.de>
: :' :
`. `'` 4096R/CAF14EFC
`- 081C B7CD FF04 2BA9 94EA 36B2 8B7F 7D30 CAF1 4EFC

signature.asc

Terry Reedy

unread,

Feb 6, 2010, 2:42:52 PM2/6/10

to pytho...@python.org

On 2/6/2010 2:09 PM, Wolodja Wentland wrote:
> On Sat, Feb 06, 2010 at 13:24 -0500, Chris Colbert wrote:
>> [(match_val_291, identifier_b), (match_val_23, identifier_b), (match_val_22,
>> identifer_k) .... ]
>>
>> Now, what I would like to do is step through this list and find the identifier
>> which appears first a K number of times.
>
> I think you can use the itertools.groupby(L, lambda el: el[1]) to group
> elements in your *sorted* list L by the value el[1] (i.e. the
> identifier) and then iterate through these groups until you find the
> desired number of instances grouped by the same identifier.

This will generally not return the same result. It depends on whether OP
wants *any* item appearing at least 5 times or whether the order is
significant and the OP literally wants the first. Sorting the entire
list may also take a *lot* longer.

Terry Jan Reedy

Wolodja Wentland

unread,

Feb 6, 2010, 3:25:40 PM2/6/10

to pytho...@python.org

Order is preserved by itertools.groupby - Have a look:

>>> instances = [(1, 'b'), (2, 'b'), (3, 'a'), (4, 'c'), (5, 'c'), (6, 'c'), (7, 'b'), (8, 'b')]

>>> grouped_by_identifier = groupby(instances, lambda el: el[1])
>>> grouped_by_identifier = ((identifier, list(group)) for identifier, group in grouped_by_identifier)

>>> k_instances = (group for identifier, group in grouped_by_identifier if len(group) == 2)
>>> for group in k_instances:
... print group
...
[(1, 'b'), (2, 'b')]
[(7, 'b'), (8, 'b')]

So the first element yielded by the k_instances generator will be the
first group of elements from the original list whose identifier appears
exactly k times in a row.

> Sorting the entire list may also take a *lot* longer.

Than what?

Am I missing something? Is the "*sorted*" the culprit? If yes -> Just
forget it as it is not relevant.

signature.asc

Terry Reedy

unread,

Feb 6, 2010, 5:40:53 PM2/6/10

to pytho...@python.org

On 2/6/2010 3:25 PM, Wolodja Wentland wrote:
> On Sat, Feb 06, 2010 at 14:42 -0500, Terry Reedy wrote:
>> On 2/6/2010 2:09 PM, Wolodja Wentland wrote:
>
>>> I think you can use the itertools.groupby(L, lambda el: el[1]) to group
>>> elements in your *sorted* list L by the value el[1] (i.e. the
>>> identifier) and then iterate through these groups until you find the
>>> desired number of instances grouped by the same identifier.
>
>> This will generally not return the same result. It depends on
>> whether OP wants *any* item appearing at least 5 times or whether
>> the order is significant and the OP literally wants the first.
>
> Order is preserved by itertools.groupby - Have a look:

Sorting does not.

>
>>>> instances = [(1, 'b'), (2, 'b'), (3, 'a'), (4, 'c'), (5, 'c'), (6, 'c'), (7, 'b'), (8, 'b')]
>>>> grouped_by_identifier = groupby(instances, lambda el: el[1])
>>>> grouped_by_identifier = ((identifier, list(group)) for identifier, group in grouped_by_identifier)
>>>> k_instances = (group for identifier, group in grouped_by_identifier if len(group) == 2)
>>>> for group in k_instances:
> ... print group
> ...
> [(1, 'b'), (2, 'b')]
> [(7, 'b'), (8, 'b')]
>
> So the first element yielded by the k_instances generator will be the
> first group of elements from the original list whose identifier appears
> exactly k times in a row.
>
>> Sorting the entire list may also take a *lot* longer.
> Than what?

Than linearly scanning for the first 5x item, as in my corrected version
of the original code.

Terry Jan Reedy