related lists mean value

dimitri pater - serpia

Mar 8, 2010, 5:34:06 PM

to Python Users

Hi,

I have two related lists:
x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c' ]

what I need is a list representing the mean value of 'a', 'b' and 'c'
while maintaining the number of items (len):
w = [1.5, 1.5, 8, 4, 4, 4]

I have looked at iter(tools) and next(), but that did not help me. I'm
a bit stuck here, so your help is appreciated!

thanks!
Dimitri

MRAB

Mar 8, 2010, 6:15:44 PM

to Python Users

Try doing it in 2 passes.

First pass: count the number of times each string occurs in 'y' and the
total for each (zip/izip and defaultdict are useful for these).

Second pass: create the result list containing the mean values.
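
A minimal sketch of the two-pass idea above, assuming Python 3 (where / is true division). The function name group_means is made up for illustration and does not come from the thread.

```python
from collections import defaultdict

def group_means(values, labels):
    """Return, for each position, the mean of all values sharing that label."""
    totals = defaultdict(float)  # label -> sum of its values
    counts = defaultdict(int)    # label -> number of occurrences

    # First pass: accumulate the total and the count per label.
    for value, label in zip(values, labels):
        totals[label] += value
        counts[label] += 1

    # Second pass: emit the label's mean once per original item.
    return [totals[label] / counts[label] for label in labels]

x = [1, 2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c']
print(group_means(x, y))  # [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]
```

Because the second pass iterates over the labels themselves, the result keeps the length and order of the input even when equal labels are not adjacent.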

Chris Rebert

Mar 8, 2010, 6:22:39 PM

to dimitri pater - serpia, Python Users

On Mon, Mar 8, 2010 at 2:34 PM, dimitri pater - serpia
<dimitr...@gmail.com> wrote:
> Hi,
>
> I have two related lists:
> x = [1 ,2, 8, 5, 0, 7]
> y = ['a', 'a', 'b', 'c', 'c', 'c' ]
>
> what I need is a list representing the mean value of 'a', 'b' and 'c'
> while maintaining the number of items (len):
> w = [1.5, 1.5, 8, 4, 4, 4]
>
> I have looked at iter(tools) and next(), but that did not help me. I'm
> a bit stuck here, so your help is appreciated!

from __future__ import division

def group(keys, values):
    #requires None not in keys
    groups = []
    cur_key = None
    cur_vals = None
    for key, val in zip(keys, values):
        if key != cur_key:
            if cur_key is not None:
                groups.append((cur_key, cur_vals))
            cur_vals = [val]
            cur_key = key
        else:
            cur_vals.append(val)
    groups.append((cur_key, cur_vals))
    return groups

def average(lst):
    return sum(lst) / len(lst)

def process(x, y):
    result = []
    for key, vals in group(y, x):
        avg = average(vals)
        for i in xrange(len(vals)):
            result.append(avg)
    return result

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c' ]

print process(x, y)
#=> [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]

It could be tweaked to use itertools.groupby(), but it would probably
be less efficient/clear.
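
For comparison, here is one way the itertools.groupby() variant mentioned above might look. This is only a sketch, assuming Python 3; the name process_groupby is hypothetical. Like the hand-written group(), it merges only adjacent equal keys.

```python
from itertools import groupby
from operator import itemgetter

def process_groupby(x, y):
    """Average each run of consecutive equal keys in y over the matching x values."""
    result = []
    # groupby() yields one (key, iterator) pair per run of adjacent equal keys.
    for _key, pairs in groupby(zip(y, x), key=itemgetter(0)):
        vals = [val for _, val in pairs]
        avg = sum(vals) / len(vals)
        result.extend([avg] * len(vals))
    return result

print(process_groupby([1, 2, 8, 5, 0, 7], ['a', 'a', 'b', 'c', 'c', 'c']))
# [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]
```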

Cheers,
Chris
--
http://blog.rebertia.com

dimitri pater - serpia

Mar 8, 2010, 6:47:12 PM

to Chris Rebert, Python Users

thanks Chris and MRAB!
Looks good, I'll try it out

--
---
You can't have everything. Where would you put it? -- Steven Wright
---
please visit www.serpia.org

John Posner

Mar 8, 2010, 9:39:59 PM

to dimitr...@gmail.com

Nobody expects object-orientation (or the Spanish Inquisition):

#-------------------------
from collections import defaultdict

class Tally:
    def __init__(self, id=None):
        self.id = id
        self.total = 0
        self.count = 0

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c']

# gather data
tally_dict = defaultdict(Tally)
for i in range(len(x)):
    obj = tally_dict[y[i]]
    obj.id = y[i]
    obj.total += x[i]
    obj.count += 1

# process data
result_list = []
for key in sorted(tally_dict):
    obj = tally_dict[key]
    mean = 1.0 * obj.total / obj.count
    result_list.extend([mean] * obj.count)
print result_list
#-------------------------

-John

John Posner

Mar 8, 2010, 9:43:58 PM

to dimitr...@gmail.com

On 3/8/2010 9:39 PM, John Posner wrote:

<snip>

> # gather data
> tally_dict = defaultdict(Tally)
> for i in range(len(x)):
>     obj = tally_dict[y[i]]

>     obj.id = y[i] <--- statement redundant, remove it

>     obj.total += x[i]
>     obj.count += 1

-John

John Posner

Mar 8, 2010, 9:53:41 PM

to dimitr...@gmail.com

On 3/8/2010 9:43 PM, John Posner wrote:
> On 3/8/2010 9:39 PM, John Posner wrote:
>
> <snip>

>> obj.id = y[i] <--- statement redundant, remove it

Sorry for the thrashing! It's more correct to say that the Tally class
doesn't require an "id" attribute at all. So the code becomes:

#---------
from collections import defaultdict

class Tally:
    def __init__(self):
        self.total = 0
        self.count = 0

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c']

# gather data
tally_dict = defaultdict(Tally)
for i in range(len(x)):
    obj = tally_dict[y[i]]
    obj.total += x[i]
    obj.count += 1

# process data
result_list = []
for key in sorted(tally_dict):
    obj = tally_dict[key]
    mean = 1.0 * obj.total / obj.count
    result_list.extend([mean] * obj.count)
print result_list
#---------

-John

Michael Rudolf

Mar 9, 2010, 5:30:26 AM

On 08.03.2010 23:34, dimitri pater - serpia wrote:
> Hi,
>
> I have two related lists:
> x = [1 ,2, 8, 5, 0, 7]
> y = ['a', 'a', 'b', 'c', 'c', 'c' ]
>
> what I need is a list representing the mean value of 'a', 'b' and 'c'
> while maintaining the number of items (len):
> w = [1.5, 1.5, 8, 4, 4, 4]

This kinda looks like you used the wrong data structure.
Maybe you should have used a dict, like:
{'a': [1, 2], 'c': [5, 0, 7], 'b': [8]} ?

> I have looked at iter(tools) and next(), but that did not help me. I'm
> a bit stuck here, so your help is appreciated!

As said, I'd have used a dict in the first place, so let's transform this
straightforwardly into one:

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c' ]

# initialize dict
d={}
for idx in set(y):
    d[idx]=[]

#collect values
for i, idx in enumerate(y):
    d[idx].append(x[i])

print("d is now a dict of lists: %s" % d)

#calculate average
for key, values in d.items():
    d[key]=sum(values)/len(values)

print("d is now a dict of averages: %s" % d)

# build the final list
w = [ d[key] for key in y ]

print("w is now the list of averages, corresponding with y:\n \
\n x: %s \n y: %s \n w: %s \n" % (x, y, w))

Output is:
d is now a dict of lists: {'a': [1, 2], 'c': [5, 0, 7], 'b': [8]}
d is now a dict of averages: {'a': 1.5, 'c': 4.0, 'b': 8.0}
w is now the list of averages, corresponding with y:

x: [1, 2, 8, 5, 0, 7]
y: ['a', 'a', 'b', 'c', 'c', 'c']
w: [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]

Could have used a defaultdict to avoid dict initialisation, though.
Or write a custom class:

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c' ]

class A:
    def __init__(self):
        self.store={}
    def add(self, key, number):
        if key in self.store:
            self.store[key].append(number)
        else:
            self.store[key] = [number]
a=A()

# collect data
for idx, val in zip(y,x):
    a.add(idx, val)

# build the final list:
w = [ sum(a.store[key])/len(a.store[key]) for key in y ]

print("w is now the list of averages, corresponding with y:\n \
\n x: %s \n y: %s \n w: %s \n" % (x, y, w))

Produces same output, of course.
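
The defaultdict variant mentioned above could look roughly like this. It is a sketch assuming Python 3; the extra dict named means is an illustrative addition so that each average is computed only once per key.

```python
from collections import defaultdict

x = [1, 2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c']

# defaultdict(list) creates the empty list on first access to a key,
# so no separate initialisation loop is needed.
d = defaultdict(list)
for key, val in zip(y, x):
    d[key].append(val)

# Compute each mean once, then look it up for every item of y.
means = {key: sum(vals) / len(vals) for key, vals in d.items()}
w = [means[key] for key in y]
print(w)  # [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]
```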

Note that those solutions are both not very efficient, but who cares ;)

> thanks!

No Problem,

Michael

Michael Rudolf

Mar 9, 2010, 6:11:35 AM

OK, I golfed it :D
Go ahead and kill me ;)

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c' ]

def f(a,b,v={}):
    try: v[a].append(b)
    except: v[a]=[b]
    def g(a): return sum(v[a])/len(v[a])
    return g
w = [g(i) for g,i in [(f(i,v),i) for i,v in zip(y,x)]]

print("w is now the list of averages, corresponding with y:\n \
\n x: %s \n y: %s \n w: %s \n" % (x, y, w))

Output:

w is now the list of averages, corresponding with y:

x: [1, 2, 8, 5, 0, 7]
y: ['a', 'a', 'b', 'c', 'c', 'c']
w: [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]

Regards,
Michael

Peter Otten

Mar 9, 2010, 7:02:12 AM

Michael Rudolf wrote:

>>> [sum(a for a,b in zip(x,y) if b==c)/y.count(c)for c in y]

[1.5, 1.5, 8.0, 4.0, 4.0, 4.0]

Peter

Michael Rudolf

Mar 9, 2010, 10:00:00 AM

On 09.03.2010 13:02, Peter Otten wrote:
>>>> [sum(a for a,b in zip(x,y) if b==c)/y.count(c)for c in y]
> [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]
> Peter

... pwned.
Should be the fastest and shortest way to do it.

I tried to do something like this, but my brain hurt while trying to
visualize list comprehension evaluation orders ;)
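
For anyone else struggling with the evaluation order, the one-liner quoted above can be unrolled into explicit loops. This is a sketch assuming Python 3; the variable name total is introduced here for illustration.

```python
x = [1, 2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c']

w = []
for c in y:                       # outer comprehension: one result per label in y
    total = 0
    for a, b in zip(x, y):        # inner generator: rescans both lists each time
        if b == c:
            total += a
    w.append(total / y.count(c))  # y.count(c) is one more full scan of y
print(w)  # [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]
```

Written this way, it is easier to see that every element of y triggers a full pass over the data, so the work grows quadratically with the list length.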

Regards,
Michael

Steve Howell

Mar 9, 2010, 10:21:15 AM

On Mar 8, 6:39 pm, John Posner <jjpos...@optimum.net> wrote:
> On 3/8/2010 5:34 PM, dimitri pater - serpia wrote:
>
> > Hi,
>
> > I have two related lists:
> > x = [1 ,2, 8, 5, 0, 7]
> > y = ['a', 'a', 'b', 'c', 'c', 'c' ]
>
> > what I need is a list representing the mean value of 'a', 'b' and 'c'
> > while maintaining the number of items (len):
> > w = [1.5, 1.5, 8, 4, 4, 4]
>
> > I have looked at iter(tools) and next(), but that did not help me. I'm
> > a bit stuck here, so your help is appreciated!
>
> Nobody expects object-orientation (or the Spanish Inquisition):
>

Heh. Yep, I avoided OO for this. Seems like a functional problem.
My solution is functional on the outside, imperative on the inside.
You could add recursion here, but I don't think it would be as
straightforward.

def num_dups_at_head(lst):
    assert len(lst) > 0
    val = lst[0]
    i = 1
    while i < len(lst) and lst[i] == val:
        i += 1
    return i

def smooth(x, y):
    result = []
    while x:
        cnt = num_dups_at_head(y)
        avg = sum(x[:cnt]) * 1.0 / cnt
        result += [avg] * cnt
        x = x[cnt:]
        y = y[cnt:]
    return result

Peter Otten

Mar 9, 2010, 11:10:26 AM

Michael Rudolf wrote:

> On 09.03.2010 13:02, Peter Otten wrote:
>>>>> [sum(a for a,b in zip(x,y) if b==c)/y.count(c)for c in y]
>> [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]
>> Peter
>
> ... pwned.
> Should be the fastest and shortest way to do it.

It may be short, but it is not particularly efficient. A dict-based approach
is probably the fastest. If y is guaranteed to be sorted, itertools.groupby()
may also be worth a try.

$ cat tmp_average_compare.py
from __future__ import division
from collections import defaultdict
try:
    from itertools import izip as zip
except ImportError:
    pass

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c' ]

def f(x=x, y=y):
    p = defaultdict(int)
    q = defaultdict(int)
    for a, b in zip(x, y):
        p[b] += a
        q[b] += 1
    return [p[b]/q[b] for b in y]

def g(x=x, y=y):
    return [sum(a for a,b in zip(x,y)if b==c)/y.count(c)for c in y]

if __name__ == "__main__":
    print(f())
    print(g())
    assert f() == g()
$ python3 -m timeit -s 'from tmp_average_compare import f, g' 'f()'
100000 loops, best of 3: 11.4 usec per loop
$ python3 -m timeit -s 'from tmp_average_compare import f, g' 'g()'
10000 loops, best of 3: 22.8 usec per loop

Peter

Steve Howell

Mar 9, 2010, 11:29:44 AM

On Mar 8, 2:34 pm, dimitri pater - serpia <dimitri.pa...@gmail.com>
wrote:

> Hi,
>
> I have two related lists:
> x = [1 ,2, 8, 5, 0, 7]
> y = ['a', 'a', 'b', 'c', 'c', 'c' ]
>
> what I need is a list representing the mean value of 'a', 'b' and 'c'
> while maintaining the number of items (len):
> w = [1.5, 1.5, 8, 4, 4, 4]
>

What results are you expecting if you have multiple runs of 'a' in a
longer list?
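
To make the question concrete: the solutions posted so far fall into two camps that agree on the original data but diverge once a key reappears later. A small sketch, assuming Python 3; the input lists and the names per_key and per_run are made up for illustration.

```python
from collections import defaultdict
from itertools import groupby

x = [1, 2, 8, 3]
y = ['a', 'a', 'b', 'a']   # 'a' appears in two separate runs

# Per-key: every 'a' shares one mean, wherever it occurs.
groups = defaultdict(list)
for key, val in zip(y, x):
    groups[key].append(val)
per_key = [sum(groups[k]) / len(groups[k]) for k in y]

# Per-run: each block of adjacent equal keys gets its own mean.
per_run = []
for _key, pairs in groupby(zip(y, x), key=lambda pair: pair[0]):
    vals = [val for _, val in pairs]
    per_run.extend([sum(vals) / len(vals)] * len(vals))

print(per_key)  # [2.0, 2.0, 8.0, 2.0]
print(per_run)  # [1.5, 1.5, 8.0, 3.0]
```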

Steve Howell

Mar 9, 2010, 11:38:35 AM

On Mar 9, 7:21 am, Steve Howell <showel...@yahoo.com> wrote:
>
> def num_dups_at_head(lst):
>     assert len(lst) > 0
>     val = lst[0]
>     i = 1
>     while i < len(lst) and lst[i] == val:
>         i += 1
>     return i
>
> def smooth(x, y):
>     result = []
>     while x:
>         cnt = num_dups_at_head(y)
>         avg = sum(x[:cnt]) * 1.0 / cnt
>         result += [avg] * cnt
>         x = x[cnt:] # expensive?
>         y = y[cnt:] # expensive?
>     return result
>

BTW I recognize that my solution would be inefficient for long lists,
unless the underlying list implementation had copy-on-write. I'm
wondering what the easiest fix would be. I tried a quick shot at
islice(), but the lack of len() thwarted me.
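
One possible fix, sketched here under the assumption of Python 3, is to walk the lists with indices instead of re-slicing the tails. The name smooth_indexed is hypothetical and not part of the original post.

```python
def smooth_indexed(x, y):
    """Average x over each run of equal adjacent items in y, without copying the tails."""
    result = []
    start = 0
    n = len(y)
    while start < n:
        # Advance `end` to one past the last item of the current run.
        end = start + 1
        while end < n and y[end] == y[start]:
            end += 1
        cnt = end - start
        avg = sum(x[start:end]) / cnt   # this slice copies only the current run
        result.extend([avg] * cnt)
        start = end
    return result

print(smooth_indexed([1, 2, 8, 5, 0, 7], ['a', 'a', 'b', 'c', 'c', 'c']))
# [1.5, 1.5, 8.0, 4.0, 4.0, 4.0]
```

Each element is visited a constant number of times, so the total work stays linear in the list length.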

nn

Mar 9, 2010, 1:24:27 PM

I converged to the same solution but had an extra reduction step in
case there were a lot of repeats in the input. I think it is a good
compromise between efficiency, readability and succinctness.

x = [1 ,2, 8, 5, 0, 7]
y = ['a', 'a', 'b', 'c', 'c', 'c' ]

from collections import defaultdict
totdct = defaultdict(int)
cntdct = defaultdict(int)
for name, num in zip(y,x):
    totdct[name] += num
    cntdct[name] += 1
avgdct = {name : totdct[name]/cnts for name, cnts in cntdct.items()}
w = [avgdct[name] for name in y]