Groupby questions


mush...@gmail.com

May 31, 2017, 7:20:01 AM
to PyData
Hello

I am sorry, I have a few groupby() questions whose answers I could not find in the docs.

1) I would like to retrieve the content of the groups.

Maybe I missed something about the split-apply-combine strategy, but I don't see how to simply retrieve the contents of some rows for all, or part, of the groups.
get_group() does this beautifully for one group but not for many (I tried passing sets and lists, to no effect).


I can easily come up with something like:
groups_dict = {k: list(grouped_df.get_group(k).loc[:, idcolumn]) for k in grouped_df.groups.keys()}


But I guess this is not computationally efficient; plus, it has to be repeated for every column you want to aggregate.


I actually came up with another solution (agg_df2 being the original dataframe; you may need to reset the df index in case you used some exploding strategy):

groups_dict2 = {k: list(agg_df2.loc[v, 'n']) for k, v in grouped.indices.items()}


Are there any better methods, or wouldn't it make sense to have a setting that outputs the actual values instead of an index?
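For what it's worth, here is a minimal sketch of one way to collect every group's column values in a single pass, without calling get_group() per key (the frame and column names are illustrative, not from my actual data):

```python
import pandas as pd

# Hypothetical example data; 'id' plays the role of idcolumn.
df = pd.DataFrame({'key': ['x', 'x', 'y'], 'id': [1, 2, 3]})
grouped = df.groupby('key')

# Collect each group's column values as a list in one pass,
# then turn the resulting Series into a plain dict.
groups_dict = grouped['id'].apply(list).to_dict()
# groups_dict == {'x': [1, 2], 'y': [3]}
```

Iterating the GroupBy object directly (`for k, sub in grouped: ...`) is another simple way to get each group's full sub-frame.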


2) I wanted to know about 1) because I actually need to do the following.

Let's say my df is:


index    "A"    "B"
0           A1    "B1,B2,B3"
1           A2    "B2,B4,B3"
2           A3    "B2,B3,B5"

and I want magical_function(df) to produce:
index    "B'"    "A''"
0           B1    "A1"
1           B2    "A1,A2,A3"
2           B3    "A1,A2,A3"
3           B4    "A2"
4           B5    "A3"

So I used an exploding strategy:
I dropna() first to avoid mistakes, then I make a Series from the column to split, explode and stack it, and then the join magic on the shared index duplicates the "A" column values where needed:
idcolumn = "A"
col = "B"
current_wdf = df[[idcolumn, col]].dropna()
current_col = current_wdf.loc[:, col]
# much slower but keeps the index; I could use substitution with enumerate after dropping the level
exploded_df = current_col.str.split(',').apply(pd.Series).stack()
exploded_df.index = exploded_df.index.droplevel(-1)
exploded_df.name = col
agg_df = pd.DataFrame(current_wdf.loc[:, idcolumn]).join(exploded_df)
grouped = agg_df.groupby([col])


After which I have:
index    "B'"    "A''"
0           B1    "A1"
1           B2    "A1"
1           B2    "A2"
1           B2    "A3"
2           B3    "A1"
2           B3    "A2"
2           B3    "A3"
3           B4    "A2"
4           B5    "A3"

Then I do
grouped = agg_df.groupby([col])
groups_dict = {k: list(grouped.get_group(k).loc[:, idcolumn]) for k in grouped.groups}

And I finally have the dataframe.

But that was less than trivial, and I am dubious about the last part.
It works, but it is slow, and could easily break.

Is there no operation for such a matching-reversal process?
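As an aside, newer pandas (0.25+) has Series.explode, which keeps the original index while unnesting, so the splitting step above becomes one call. A minimal sketch of the whole reversal on the example frame above:

```python
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1,B2,B3', 'B2,B4,B3', 'B2,B3,B5']})

# Split each "B" cell into a list, then explode to one row per element.
# explode() preserves the original index, so "A" aligns on join.
exploded = df['B'].str.split(',').explode()
long_df = df[['A']].join(exploded)

# Reverse the mapping: for each "B" value, re-join the matching "A" labels.
result = long_df.groupby('B')['A'].agg(','.join).reset_index()
```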

Best regards

Pietro Battiston

May 31, 2017, 5:42:30 PM
to pyd...@googlegroups.com
On Wed, 31/05/2017 at 04:20 -0700, mush...@gmail.com wrote:
> Hello
>
> I am sorry, I have a few groupby() questions which answers I could
> not found in docs.
>
> 1) I would like to retrieve the content of the groups. 
> Maybe I missed something about the split-apply-combine strategy*, but
> I don't get how to retrieve simply the content of some rows for all
> or part of groups.
> get_group() does this beautifully for one but not for many (i tried
> to pass sets, lists at no effect),
>
> I can come up easily with something like:
> groups_dict= {k: list(grouped_df.get_group(k).loc[:,idcolumn]) for k
> in grouped_df.groups.keys()}
>
> But I guess this is not computationally efficient, plus it has to be
> repeated for all columns you want to aggregate.

Not clear to me how this differs from

grouped_df.apply(lambda group : group.loc[:, idcolumn])

... a reproducible example would help.

> [...]
> Let's say my df is:
>
>
> index    "A"    "B"
> 0           A1    "B1,B2,B3"
> 1           A2    "B2,B4,B3"
> 2           A3    "B2,B3,B5"
>
> and I want to do magical_function(df)
> >
> index    "B'"    "A''"
> 0           B1    "A1"
> 1           B2    "A1,A2,A3"
> 2           B3    "A1,A2,A3"
> 3           B4    "A2"
> 4           B5    "A3"
>

Notice that you are not going to get any particularly nice/efficient
pandas code for this, because the very structure of the data
(strings referring to lists of multiple elements) is not very nice.
Anyway, let's assume you have no choice about the input and output formats:

> So I used an exploding strategy
> i thus dropna  first to avoid mistakes, then I make a Series with the
> column to split, I explode it and stack it, and then the join magic
> with the same index duplicate the "A" columns values where needed
> dcolumn="A"
> col="B"
> current_wdf=df[[idcolumn,col]].dropna()
> current_col=current_wdf.loc[:,col]
> exploded_df=current_col.str.split('\,').apply(pd.Series,1).stack()
> # much slower but keep the index. I could used substitution with
> enumerate after dropping level
> exploded_df.index=exploded_df.index.droplevel(-1)
> exploded_df.name=col
> agg_df=pd.DataFrame(current_wdf.loc[:,idcolumn]).join(exploded_df)
> grouped=agg_df.groupby([col])
>

Looks fine to me; the following might be more efficient if the elements in
"B" vary a bit in their lengths:

expl = pd.concat([pd.DataFrame({'B': d}).assign(A=x)
                  for (x, d) in df.set_index('A').B.str.split(',').iteritems()])
>
> After what I have:
> index    "B'"    "A''"
> 0           B1    "A1"
> 1           B2    "A1"
> 1           B2    "A2"
> 1           B2    "A3"
> 2           B3    "A1"
> 2           B3    "A2"
> 2           B3    "A3"
> 3           B4    "A2"
> 4           B5    "A3"
>
> Then I do 
> grouped=agg_df.groupby([col])
> groups_dict= {k: list(grouped.get_group(v).loc[:,idcolumn]) for k, v
> in grouped.groups.items()}

Does

expl.groupby('B')['A'].aggregate(','.join)

do what you desire?
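For completeness, the two snippets above combined into a self-contained form (using Series.items() rather than iteritems, which was removed in pandas 2.0, and grouping on the named columns):

```python
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1,B2,B3', 'B2,B4,B3', 'B2,B3,B5']})

# One small frame per input row, tagged with its "A" value, then stacked.
expl = pd.concat([pd.DataFrame({'B': d}).assign(A=x)
                  for (x, d) in df.set_index('A')['B'].str.split(',').items()])

# Reverse the mapping: for each exploded "B" value, re-join the "A" labels.
result = expl.groupby('B')['A'].aggregate(','.join)
# result is a Series indexed by the "B" values, e.g. result['B2'] == 'A1,A2,A3'
```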

Pietro